The rule of four: anomalous distributions in the stoichiometries of inorganic compounds


E. Gazzarrini, R. K. Cersonsky, et al.

The 'rule of four' is an unexpected abundance of inorganic compounds whose primitive unit cells contain a number of atoms that is a multiple of four. This research by Elena Gazzarrini, Rose K. Cersonsky, Marnik Bercx, Carl S. Adorf, and Nicola Marzari documents the effect in two large crystal-structure databases and examines how it relates to symmetry, packing, and energetics, using machine-learning classification of local atomic environments.

Introduction
Computational materials discovery is a fast-growing discipline with impact across technology sectors, where efficient identification of structural and energetic characteristics is crucial. The rise of materials informatics and machine learning, alongside first-principles methods such as density-functional theory (DFT), enables data-intensive and systematic approaches for classifying materials’ features and screening candidates, though outcomes are limited by data quality and diversity. Understanding the structure of the data space is essential, and anomalous correlations warrant investigation. The authors introduce the rule of four (RoF): an unusually high relative abundance of inorganic crystal structures whose primitive unit cells contain a number of atoms that is a multiple of four. The phenomenon is examined across two databases: the Materials Project (MP), comprising DFT-relaxed structures sourced from experiments or predictions, and the Materials Cloud 3D crystal structures ‘source’ database (MC3D-source), which aggregates experimental entries from COD, ICSD, and MPDS (with occasional theory-predicted structures). Initial visualizations and statistics show a striking overrepresentation of RoF structures. The study outlines a plan to test whether this effect is an artifact of cell reduction or database processing, and to probe relationships to symmetries, packing, local structure descriptors, and energetics, ultimately employing symmetry-adapted machine learning to understand the origins and implications of the RoF. Although many plausible causes are explored, a full explanation remains open, motivating further research.
Literature Review
The paper situates its work within materials informatics and machine learning for materials discovery, citing advances in high-throughput experiments and calculations, and ML methods for property prediction and structure–property mappings. Prior literature highlights the importance of FAIR data repositories and statistical learning, as well as symmetry and packing concepts in crystallography. However, no prior studies have reported the anomalous overabundance of primitive unit cells with atom counts as multiples of four. The authors therefore address a gap by systematically analyzing this distribution across major databases and by leveraging modern structure representations (SOAP) and hybrid dimensionality-reduction/regression (PCovR) to relate local environments to energetics and symmetry.
Methodology
Datasets and scope: Two inorganic crystal-structure datasets were analyzed. (1) Materials Project (MP): 83,989 entries (mp_all_20181018 via matminer), DFT-relaxed structures. (2) MC3D-source: 79,854 unique structures curated via an AiiDA workflow from MPDS, ICSD, and COD (experimental structures; some theory predictions can appear). Analyses generally consider primitive cells with up to 100 atoms (97.51% of MP and 91.00% of MC3D-source fall within this range).
RoF labeling: Structures whose primitive unit cells contain a number of atoms that is a multiple of four are labeled RoF; all others are labeled non-RoF.
Primitive cell reduction and artifact check: All structures were reduced to primitive cells using spglib’s find_primitive. The symprec tolerance was varied from 1e-8 to 1 Å to test whether primitivization artifacts induce the RoF. Only around 1% of RoF structures switched to non-RoF at typical parameters, rising to about 6% only at unrealistically large symprec (~1 Å), indicating that the RoF is not an artifact of symmetrization.
Energetics: For MP, formation energies per atom (relative to elements at 0 K, 0 atm) were analyzed for RoF vs non-RoF. Kernel density comparisons were made; further, to compare structurally similar compounds fairly, the authors used Principal Covariates Regression (PCovR) trained on species-tagged SOAP vectors with formation energy as the target to build a latent space reflecting both structural similarity and energy trends.
Global symmetry and composition descriptors: Space groups and point groups were obtained (spglib/seekpath for MC3D-source; pymatgen’s SymmetryAnalyzer relying on spglib for MP at symprec 0.3 Å). The inherited symmetry operations and point-group distributions were compared between RoF and non-RoF subsets. Composition complexity was quantified via the number of species (Nspecies).
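The RoF labeling step described above reduces to a divisibility test on the primitive-cell atom count. The following is a minimal sketch of that step, not the authors' code: the function names, the `MAX_ATOMS` constant, and the toy atom counts are all hypothetical and chosen only for illustration. It assumes the atom counts have already been obtained from primitive cells.

```python
from collections import Counter

# The summary restricts most analyses to primitive cells of up to 100 atoms.
MAX_ATOMS = 100


def is_rof(n_atoms: int) -> bool:
    """Return True when a primitive-cell atom count is a multiple of four."""
    if n_atoms <= 0:
        raise ValueError("atom count must be positive")
    return n_atoms % 4 == 0


def label_counts(atom_counts):
    """Count RoF and non-RoF entries among cells with at most MAX_ATOMS atoms."""
    kept = [n for n in atom_counts if n <= MAX_ATOMS]
    return Counter("RoF" if is_rof(n) else "non-RoF" for n in kept)


# Hypothetical atom counts, invented purely for illustration.
toy_counts = [4, 8, 5, 12, 6, 20, 3, 16, 104, 2]
labels = label_counts(toy_counts)
print(labels["RoF"], labels["non-RoF"])
```

In this toy list the 104-atom cell is dropped by the size filter before labeling, mirroring the 100-atom restriction mentioned in the text.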
Additional size- and packing-related descriptors included: x = Ns/(Ns+Nl), the fraction of small atoms among the small plus large atoms; α = Rs/Rl, the ratio of the smallest to the largest atomic radius; and the packing fraction PF = Vtot_atoms/Vcell (with Vtot_atoms computed from atomic radii and Vcell the unit-cell volume). Distributions of these parameters were contrasted for RoF vs non-RoF, focusing especially on MC3D-source because of its broader size polydispersity and structural complexity.
Local structural descriptors and ML pipeline: Local environments were represented with Smooth Overlap of Atomic Positions (SOAP) using librascal. Two SOAP parameterizations were used: (i) species-tagged (separate channels per element) for energetic analyses (high dimensional; ~100k sparse features per compound, reduced via furthest point sampling to 2,000 features); and (ii) species-invariant (ignoring chemical identity to emphasize geometry; ~80 features per compound) for classification of RoF vs non-RoF. Features were averaged per compound and normalized across datasets. PCovR (scikit-matter) was used to embed species-tagged SOAP vs formation energy. For classification of RoF vs non-RoF from local symmetry alone, a RandomForestClassifier (scikit-learn) with 100 trees was trained with a 90/10 train/test split. The SOAP cutoff radius r_cut was varied (e.g., 1–8 Å) to assess locality effects; performance was reported as accuracy and confusion matrix entries across r_cut. A learning curve at r_cut = 4.0 Å assessed model scaling. Additional classifiers (SVM, Logistic Regression) were benchmarked (details in SI).
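The size and packing descriptors defined above (x, α, PF) can be computed from a list of per-atom radii and a cell volume. The sketch below is illustrative only and rests on two assumptions that the summary does not confirm: atoms are treated as hard spheres when computing Vtot_atoms, and "small" and "large" atoms are taken to be those with the minimum and maximum radius in the cell. The function name and the radii are hypothetical.

```python
import math


def size_descriptors(radii, cell_volume):
    """Compute (x, alpha, PF) from per-atom radii (in Å) and cell volume (in Å^3).

    Assumption: "small" atoms are those with the minimum radius in the cell and
    "large" atoms those with the maximum radius. The summary does not say how
    atoms of intermediate size are handled.
    """
    if not radii or cell_volume <= 0:
        raise ValueError("need at least one atom and a positive cell volume")
    r_small, r_large = min(radii), max(radii)
    alpha = r_small / r_large  # ratio of smallest to largest radius
    if r_small == r_large:
        x = None  # a single atomic size: the small/large split is undefined
    else:
        n_small = sum(1 for r in radii if r == r_small)
        n_large = sum(1 for r in radii if r == r_large)
        x = n_small / (n_small + n_large)
    # Assumption: hard spheres, so each atom contributes (4/3) * pi * r^3.
    v_atoms = sum(4.0 / 3.0 * math.pi * r**3 for r in radii)
    return x, alpha, v_atoms / cell_volume


# Hypothetical two-atom cell: radii 1.0 Å and 2.0 Å in a 100 Å^3 cell.
x, alpha, pf = size_descriptors([1.0, 2.0], 100.0)
print(x, alpha, round(pf, 3))
```

Different tabulated radii (covalent, ionic, van der Waals) would give different PF values, which is one reason the Limitations section flags the choice of radii.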
Additional ML regression: Linear ridge regression with 4-fold cross-validation was trained on species-tagged SOAP features to predict formation energy (and other properties, e.g., energy above hull and PBE band gap) for use in PCovR. RMSE and uncertainties are reported (e.g., formation energy per atom RMSE 0.0530 eV; uncertainty 0.4002 eV); this accuracy is sufficient for latent-space analysis but is not intended as state-of-the-art energy prediction.
Data and code availability: Datasets (IDs and SOAP) are available via Materials Cloud Archive; code to reproduce figures and analyses is provided in a public GitHub repository.
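To make the "ridge regression with 4-fold cross-validation" step concrete, here is a deliberately tiny sketch with a single feature and no intercept, using only the standard library. It is not the authors' pipeline, which works on thousands of SOAP features; the synthetic data, the regularization strength `lam`, and the function names are assumptions made for illustration.

```python
import math
import random


def ridge_fit(xs, ys, lam):
    """Closed-form ridge weight for one feature and no intercept.

    Minimizes sum((y - w * x)^2) + lam * w^2, giving
    w = sum(x * y) / (sum(x^2) + lam).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def kfold_rmse(xs, ys, lam, k=4):
    """Average held-out RMSE over k contiguous folds."""
    n = len(xs)
    fold_rmses = []
    for f in range(k):
        test_idx = set(range(f * n // k, (f + 1) * n // k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        w = ridge_fit(train_x, train_y, lam)
        sq_err = [(ys[i] - w * xs[i]) ** 2 for i in test_idx]
        fold_rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_rmses) / k


# Synthetic data (hypothetical): y = 2x plus small Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.05) for x in xs]
cv_rmse = kfold_rmse(xs, ys, lam=1e-3)
print(f"4-fold CV RMSE: {cv_rmse:.4f}")
```

Each fold is held out once while the weight is fitted on the remaining three quarters of the data, so the reported RMSE reflects error on unseen samples rather than training error.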
Key Findings
- Prevalence of RoF: A high proportion of structures have primitive unit-cell atom counts that are multiples of four. Counts: MP n_RoF = 50,401 vs n_non-RoF = 33,588; MC3D-source n_RoF = 46,779 vs n_non-RoF = 33,075. Percentage by multiples (not mutually exclusive): MP: ×3: 32.38%, ×4: 60.01%, ×5: 18.41%, ×6: 26.82%, ×7: 12.43%; MC3D-source: ×3: 36.57%, ×4: 58.58%, ×5: 20.89%, ×6: 30.99%, ×7: 12.51%.
- Not a symmetrization artifact: Varying spglib symprec from 1e-8 to 1e-1 Å converts only about 1% of RoF to non-RoF; only at ~1 Å does conversion rise to ~6%.
- Energetics: No significant correlation between RoF and lower formation energies. PCovR embeddings trained on species-tagged SOAP with formation energy show that structurally similar RoF and non-RoF compounds do not separate along orthogonal covariates; distributions of formation energy are similar aside from a modest positive-energy tail for RoF that disappears when controlling for structure.
- Symmetry: RoF compounds are enriched in low-symmetry point groups (e.g., 2, m, 2/m, mm2, 222, mmm). Non-RoF structures are relatively more abundant in higher-symmetry point groups. Fourfold rotational symmetry is not a determinant of RoF.
- Composition and size polydispersity: RoF materials more often contain 4 or more elements (higher Nspecies). RoF compounds show higher abundance of small atomic radii (higher x) and greater spread in atomic sizes (α peaking ~0.35 in MC3D-source), consistent with interstitial-like occupancy lowering global symmetry.
- Packing: RoF structures exhibit lower packing fractions (PF peaks ~0.1–0.2), indicating loosely packed arrangements and ruling out dense packing as a cause of the anomaly; porous/disordered compounds contribute sharp PF peaks (noted as outliers in MC3D-source).
- Local symmetry classification: Using species-invariant SOAP and Random Forest, RoF vs non-RoF can be predicted from local environments alone with test accuracy saturating at ~87% at r_cut ≈ 4.0 Å (first two neighbor shells). At small r_cut, high false positives indicate insufficient locality information; learning curves show continued improvement with more data but no saturation, suggesting secondary effects beyond strictly local environments.
- Overall: The RoF correlates with local symmetry features, lower global symmetry, higher compositional complexity, and looser packing, but not with stabilized energetics.
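The prevalence figures quoted in the first finding can be checked directly from the reported counts. The short sketch below recomputes the RoF share for each database and compares it with a naive baseline. That baseline is an assumption introduced here, not a claim from the paper: if atom counts were spread evenly over the integers, a fraction 1/k of them would be multiples of k, so 25% for k = 4.

```python
# Counts quoted in the Key Findings above.
counts = {
    "MP": {"RoF": 50_401, "non-RoF": 33_588},
    "MC3D-source": {"RoF": 46_779, "non-RoF": 33_075},
}

# Assumption (not from the paper): evenly spread atom counts would make
# a fraction 1/k of cells a multiple of k.
naive_baseline_x4 = 1 / 4

shares = {}
for name, c in counts.items():
    total = c["RoF"] + c["non-RoF"]
    share = c["RoF"] / total
    shares[name] = (total, share)
    print(f"{name}: total={total}, RoF share={share:.2%}, "
          f"ratio to naive baseline={share / naive_baseline_x4:.2f}")
```

The totals recovered this way match the dataset sizes given in the Methodology (83,989 for MP and 79,854 for MC3D-source), and the shares reproduce the quoted ×4 percentages.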
Discussion
The study establishes that the rule of four is a robust, cross-database phenomenon not attributable to technical artifacts in primitive-cell determination. Contrary to initial expectations that more common stoichiometries might reflect thermodynamic stabilization, energetics (formation energy, energy above hull, or DFT band gaps analyzed via PCovR) do not differentiate RoF from non-RoF when comparing structurally similar compounds. Instead, RoF prevalence aligns with structural characteristics: compounds tend to have lower point-group symmetry, more complex chemistries (higher Nspecies), broader atomic size distributions (suggesting interstitial incorporation), and significantly lower packing fractions that maximize free volume. The finding that species-invariant local environment descriptors can classify RoF membership with high accuracy underscores that local symmetry motifs, rather than global crystallographic symmetry or density, are primary indicators of the phenomenon. Nonetheless, the absence of a clear physical mechanism that necessitates multiples of four in primitive unit cells indicates that RoF likely arises from a confluence of crystallographic building rules, chemical diversity, and packing tendencies common in experimentally realized and DFT-relaxed inorganic structures. The results delineate what the RoF is not (an energy-driven or high-symmetry effect) and point to local structural organization as the salient signature.
Conclusion
This work documents and analyzes an anomalous overabundance of inorganic compounds with primitive unit cells containing atom counts that are multiples of four (the rule of four), across both experimental-leaning and DFT-relaxed databases. The authors show: (i) the effect is not a reduction or symmetrization artifact; (ii) it is not associated with lower formation energies; (iii) RoF compounds are biased toward low global symmetries, higher compositional complexity, and low packing fractions; and (iv) local structural symmetries captured by species-invariant SOAP enable accurate classification (≈87%). The study provides a reference framework and open resources (datasets, code) for further investigations. Future research directions include: probing crystallographic group–subgroup relations for hidden multiplicity constraints; dissecting prototypical structure families to identify local motif counts enforcing fourfold multiplicities; evaluating dataset and synthesis-route biases; and extending analyses to nonstoichiometric, defected, and low-dimensional systems to test the generality and origins of the RoF.
Limitations
- A definitive physical or crystallographic mechanism explaining why multiples of four dominate is not identified; the analysis is correlative.
- Dataset biases may influence statistics: MP avoids very large cells due to computational cost; MC3D-source includes porous/disordered outliers and has licensing constraints restricting full structural release (SOAP-only for MC3D-source).
- The PF and size-distribution parameters rely on tabulated atomic radii choices and averaging, which may oversimplify complex bonding environments.
- SOAP-based regression accuracy on energetics is adequate for PCovR but not state-of-the-art; subtle energy differences might be under-resolved.
- Classification relies on averaged local descriptors; nonlocal or long-range ordering effects may not be fully captured, as hinted by learning curves without saturation.
- Some statistics (e.g., very high Nspecies) have limited sample support and higher uncertainty.