An artificial intelligence-aided virtual screening recipe for two-dimensional materials discovery

M. C. Sorkun, S. Astruc, et al.

This paper presents a data-driven strategy for accelerating the virtual screening of two-dimensional materials, yielding a database of 316,505 likely stable materials with predicted properties relevant to energy applications. The research was conducted by Murat Cihan Sorkun, Séverin Astruc, J. M. Vianney A. Koelman, and Süleyman Er.

Introduction

Material discovery has progressed through three paradigms: experiment (trial and error), theory (interpreting experiments), and computation (simulation of theoretical models). The rapid growth of data from these paradigms, together with advances in algorithms, has enabled a fourth paradigm: data-driven discovery using AI. AI and virtual screening have shown success across chemical and materials science sub-fields, but each domain requires tailored workflows and sufficient high-quality data. Two-dimensional (2D) materials exhibit exceptional and tunable properties for semiconductor, energy, and health applications, yet only a few hundred have been experimentally synthesized since graphene’s discovery. In silico repositories based on high-throughput DFT have recently emerged, generating 2D materials via (i) exfoliation of layered 3D structures and (ii) combinatorial atomic substitution within 2D prototypes. Despite this progress, the cost of physics-based calculations has limited the studied chemical space to a few thousand 2D materials, although the resulting FAIR datasets make machine learning possible. This work introduces a recipe for AI-aided virtual screening of 2D materials: generate a large chemical space of candidates from prototypes and elements, filter for neutrality and stability, and predict key properties. As a proof of concept, the authors built the Virtual 2D Materials Database (V2DB), which contains 316,505 likely stable 2D materials with predicted properties, with the aim of accelerating the discovery of functional 2D materials.

Literature Review

The paper situates AI-driven materials discovery as the fourth paradigm, referencing prior successes of machine learning and virtual screening in materials and chemistry. It outlines two main computational approaches to 2D materials design: (1) top-down exfoliation from 3D layered compounds and (2) bottom-up combinatorial substitution in known 2D crystal prototypes. The limitation of DFT in covering vast chemical spaces is emphasized, but existing FAIR databases (e.g., C2DB, 2DMatPedia) provide high-quality training data. The authors note known shortcomings of the PBE functional for electronic properties and discuss established practices for rescaling to more accurate methods (e.g., GW), motivating their regression-based correction between PBE and G0W0 for band edges and band gaps.

Methodology

Workflow: A three-step pipeline was developed.

(1) Generation: Using 22 distinct 2D crystal prototypes and 52 elements partitioned into cationic (A-type) and anionic (B-type) sets, exhaustive brute-force elemental substitution produced 72,522,240 unverified candidate 2D compounds. Prototypes were grouped by unit-cell compositions (AB, ABB, AABB, AABBBB) and symmetry labels (Z, X, Y, XY, XYY).

(2) Filtering: Three sequential filters were applied: a symmetry filter to remove geometric duplicates (leaving 10,321,920 unique structures); a neutrality filter to enforce charge neutrality based on Greenwood’s tabulated oxidation states and A/B charge constraints (resulting in 9,732,136 neutral candidates); and a stability filter via ML predictions using the criteria is_stable=True, heat of formation ΔH < 0.2 eV/atom, and energy above convex hull ΔH_hull < 0.2 eV/atom (yielding 316,505 likely stable candidates).

(3) Property prediction: Separate artificial neural network (ANN) models were trained to predict stability, heat of formation, energy above convex hull, band gap (value and direct/indirect), VBM, CBM, work function, and magnetic state (NM/FM/AFM) for the filtered materials.

Training data: 2,226 2D materials from the C2DB were selected using three criteria: inclusion in the chosen prototypes and elements, availability of dynamic stability data, and availability of PBE band gap data. All calculations are DFT-PBE based with consistent workflows.

Features: Only non-DFT, composition-based features were used: number of atoms per unit cell; one-hot prototype vector; composition vectors (A and B site elemental ratios); and the geometric mean of the Pauling electronegativity for A and B atoms. No structural relaxation or high-cost features were used.

Model development and evaluation: Separate ANNs were built per target property and implemented with scikit-learn (Python 3.6). Hyperparameters were tuned by grid search, and 20-fold cross-validation was used for evaluation. Final models were trained on the full dataset to reduce bias. Classification tasks included stability, direct/indirect band gap, and magnetic state; regression tasks included ΔH, ΔH_hull, band gap, VBM, CBM, and work function. Due to deterministic labeling and elemental-only features, materials sharing the same prototype and chemical formula receive identical predicted labels.

Virtual screening criteria for applications: For photovoltaics, semiconductors with band gaps in 0.75–1.75 eV (single-junction) were highlighted (others potentially relevant for multi-junction). Photocatalyst candidates for solar-driven conversions were selected via band edge alignment: overall: band gap < 1.75 eV and VBM < −5.67 eV; water splitting: CBM > −4.44 eV; CO2 reduction: −4.47 < CBM < −4.44 eV; N2 reduction: −4.53 < CBM < −4.44 eV.

Band edge correction: Regression between PBE and G0W0 for 188 C2DB materials yielded linear mappings used to estimate G0W0-level properties for V2DB: Egap(G0W0) = 1.63·Egap(PBE) + 0.54; EVBM(G0W0) = 1.14·EVBM(PBE) + 0.55; ECBM(G0W0) = 1.01·ECBM(PBE) + 0.71.

Validation: External validation against 2DMatPedia identified 103 matched materials (27 not in C2DB). The predicted PBE band gaps showed MAE = 0.438 eV between V2DB predictions and 2DMatPedia DFT values. Considering inter-database DFT differences (C2DB vs 2DMatPedia MAE = 0.132 eV) and model cross-validation MAE (0.135 eV), the result supports model applicability, though validation covered only 5 of 22 prototypes.

Data/code availability: V2DB is provided as CSV with compositions, prototypes, and predicted properties at Harvard Dataverse and the AMD lab website; generation and prediction scripts are available on Code Ocean.
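The neutrality filter described above can be illustrated with a minimal Python sketch. It is not the authors' script: the oxidation-state table is a small illustrative subset rather than Greenwood's full tabulation, the function name `is_charge_neutral` is hypothetical, and the sketch assumes each element takes a single oxidation state per compound (positive on A sites, negative on B sites).

```python
from itertools import product

# Illustrative subset of common oxidation states. This table is an assumption
# made for the sketch; the paper relies on Greenwood's full tabulation.
OXIDATION_STATES = {
    "Mo": [4, 6],
    "Ti": [2, 3, 4],
    "S": [-2],
    "Cl": [-1],
}


def is_charge_neutral(a_atoms, b_atoms, table=OXIDATION_STATES):
    """Return True if one oxidation state per element can sum to zero.

    A-site atoms may only take positive states and B-site atoms only
    negative states (a simplified reading of the A/B charge constraints).
    """
    elements = sorted(set(a_atoms) | set(b_atoms))
    options = []
    for el in elements:
        if el in a_atoms:
            options.append([q for q in table[el] if q > 0])
        else:
            options.append([q for q in table[el] if q < 0])
    # Try every combination of allowed states, one state per element.
    for combo in product(*options):
        charge = dict(zip(elements, combo))
        total = sum(charge[el] for el in a_atoms) + sum(charge[el] for el in b_atoms)
        if total == 0:
            return True
    return False


# Brute-force substitution into a hypothetical ABB unit cell, then filtering.
candidates = [
    (a, b1, b2)
    for a in ["Mo", "Ti"]
    for b1, b2 in product(["S", "Cl"], repeat=2)
]
neutral = [c for c in candidates if is_charge_neutral([c[0]], [c[1], c[2]])]
```

Here the substitution step produces 8 candidates for the toy element sets, and the filter keeps those for which a neutral assignment exists, such as Mo with two S atoms.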

Key Findings

- Scale of exploration: Generated 72,522,240 2D candidate compositions via prototype-based brute-force substitution across 52 elements and 22 prototypes.
- Filtering outcomes (funnel): unique after symmetry filter: 10,321,920 (14%); neutral after charge filter: 9,732,136 (13%); likely stable after ML stability filter: 316,505 (0.4%).
- Per-prototype examples (after stability filter): e.g., BiTeI (ABB, Z): 1,249 stable of 3,735 neutral (0.334); MoSSe (ABB, Z): 1,211 (0.324); WTe2 (AABBBB, XY): 260,325 of 1,334,218 neutral (0.195); some prototypes yielded zero stable candidates (e.g., PbSe, RhO, SnS).
- ML performance (20-fold CV):
  • Heat of formation: R² = 0.98, MAE = 0.064 eV/atom.
  • Energy above convex hull: R² = 0.87, MAE = 0.063 eV/atom.
  • Band gap (PBE): R² = 0.90, MAE = 0.135 eV.
  • VBM: R² = 0.82, MAE = 0.299 eV.
  • CBM: R² = 0.76, MAE = 0.290 eV.
  • Work function: R² = 0.87, MAE = 0.207 eV.
  • Classification F-measures: stability 0.714; direct/indirect gap 0.693; magnetic state 0.787.
- Band edge correction to G0W0 established with linear relations: Egap(G0W0) = 1.63·Egap(PBE)+0.54; EVBM(G0W0) = 1.14·EVBM(PBE)+0.55; ECBM(G0W0) = 1.01·ECBM(PBE)+0.71, enabling estimated G0W0-level properties in V2DB.
- Application screening: Identified thousands of candidates meeting photovoltaic and photocatalytic band gap and band edge criteria using ML-predicted properties, demonstrating effective downselection for energy conversion/storage applications.
- External validation: For 103 matched materials with 2DMatPedia, predicted band gap MAE = 0.438 eV; given inter-database discrepancy (0.132 eV) and model CV error (0.135 eV), this supports practical predictivity across datasets.
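The linear PBE-to-G0W0 relations quoted above are simple enough to apply directly. The following Python sketch uses the stated coefficients; the function names are hypothetical, and the summary does not say whether the single-junction window of 0.75 to 1.75 eV is applied to PBE or corrected gaps, so the window check takes whichever gap the caller passes in.

```python
def pbe_to_g0w0(gap_pbe, vbm_pbe, cbm_pbe):
    """Apply the three linear PBE -> G0W0 mappings from the text (values in eV)."""
    gap = 1.63 * gap_pbe + 0.54
    vbm = 1.14 * vbm_pbe + 0.55
    cbm = 1.01 * cbm_pbe + 0.71
    return gap, vbm, cbm


def in_single_junction_window(gap, lo=0.75, hi=1.75):
    """Check a band gap (eV) against the single-junction photovoltaic window."""
    return lo <= gap <= hi


# Placeholder PBE inputs chosen only to exercise the functions.
gap, vbm, cbm = pbe_to_g0w0(1.0, -6.0, -5.0)
```

One consequence of using three independent fits is that the corrected CBM minus the corrected VBM need not equal the corrected gap. For the placeholder inputs above, the corrected gap is about 2.17 eV while the corrected edges differ by about 1.95 eV, so the two estimates are best treated as separate approximations.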

Discussion

The study demonstrates that AI/ML using basic compositional and prototype features can navigate extremely large 2D materials chemical spaces and triage candidates by stability and properties relevant to energy technologies. From over 72 million generated compositions, the pipeline produced 316,505 likely stable 2D materials with predicted thermochemical, electronic, and magnetic properties, and highlighted many promising candidates for photovoltaics and photocatalysis based on band edge criteria. The approach leverages limited, yet high-quality, DFT-PBE training data (C2DB) and accounts for known PBE limitations through a regression-based correction to approximate G0W0 band gaps and edges. Generalizability depends on how well prototypes and element combinations are represented in training data; a chemical space heatmap guides users regarding reliable regions versus sparsely represented ones. External validation with 2DMatPedia supports the model’s applicability, though broader validation across more prototypes and chemistries is needed. Stability predictions facilitate identifying synthesizable candidates, but stability does not guarantee experimental synthesizability; factors such as synthesis routes, element diversity, toxicity, and abundance should inform further downselection. The methodology is tunable—users can restrict prototypes/elements to well-represented sets, limit unit-cell compositional variance, and tighten stability thresholds to adjust precision/recall trade-offs. Overall, the results underscore the feasibility of scalable AI-aided virtual screening to accelerate discovery of functional 2D materials.
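The precision/recall tuning mentioned above amounts to changing the cutoffs of the stability filter. A minimal Python sketch follows; the `Prediction` record, the function name, and all numerical values are made-up placeholders for illustration, not V2DB entries or model outputs.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    is_stable: bool            # predicted stability class
    heat_of_formation: float   # predicted ΔH, eV/atom
    energy_above_hull: float   # predicted ΔH_hull, eV/atom


def passes_stability_filter(p, max_dh=0.2, max_hull=0.2):
    """Default cutoffs follow the criteria in the text; both are adjustable."""
    return (
        p.is_stable
        and p.heat_of_formation < max_dh
        and p.energy_above_hull < max_hull
    )


# Hypothetical predictions (placeholder numbers, not real data).
predictions = [
    Prediction("candidate_1", True, -0.50, 0.02),
    Prediction("candidate_2", True, 0.15, 0.18),
    Prediction("candidate_3", False, -0.30, 0.05),
    Prediction("candidate_4", True, -0.10, 0.25),
]

default_set = [p.label for p in predictions if passes_stability_filter(p)]
tight_set = [
    p.label
    for p in predictions
    if passes_stability_filter(p, max_dh=0.0, max_hull=0.1)
]
```

With the default cutoffs two of the four placeholder candidates pass; tightening both cutoffs keeps only the first, which is the kind of trade of recall for precision the discussion describes.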

Conclusion

The authors introduce a general, reproducible recipe for AI-aided virtual screening of 2D materials: generate candidates from prototypes and element sets, filter by symmetry, neutrality, and ML-predicted stability, then predict key properties with ANNs. Applied to 22 prototypes and 52 elements, the framework generated 72.5 million candidates and identified 316,505 likely stable 2D materials with predicted thermochemical, electronic, and magnetic properties (compiled as V2DB). The approach efficiently highlights candidates for energy conversion and storage applications, and incorporates a PBE-to-G0W0 correction for band edges. Future work should expand training datasets with higher-fidelity and diverse data, broaden prototype and chemistry coverage, and enhance model generalizability. Practical screening for experimental follow-up can be improved by incorporating synthesizability considerations, toxicity/abundance constraints, and by tightening stability thresholds. The recipe is adaptable and expected to benefit from anticipated growth in high-quality 2D materials data.

Limitations

- Training data limitations: Only 2,226 materials from C2DB across 22 prototypes; uneven representation across prototype–element combinations limits generalizability, as reflected in the chemical space heatmap.
- Methodological approximation: Use of DFT-PBE data for training; PBE’s known inaccuracies for electronic properties necessitate linear corrections to approximate G0W0, which introduces additional uncertainty.
- Feature set constraints: Models use only composition- and prototype-based features; materials with identical formula and prototype receive identical predictions, potentially missing structure-sensitive nuances (e.g., polymorphs, subtle distortions).
- Stability vs synthesis: Predicted thermodynamic/dynamic stability does not ensure experimental synthesizability; synthesis complexity increases with chemical diversity of unit cells.
- Validation scope: External validation covered 103 materials and only 5 of the 22 prototypes; broader validation is needed for comprehensive assessment.
- Threshold choices: Stability cutoffs (ΔH, ΔH_hull < 0.2 eV/atom) favor recall and may yield false positives; users may need to tighten thresholds for higher precision.
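The feature set constraint listed above can be made concrete with a small Python sketch of a composition-and-prototype feature vector. This is an assumed encoding, not the authors' code: the prototype and element lists are tiny illustrative subsets, the function name `featurize` is hypothetical, and the elemental ratios are taken as fractions of all atoms in the unit cell, which the summary does not specify. The electronegativities are standard Pauling values.

```python
from collections import Counter
from math import prod

# Small illustrative subsets (assumptions); the paper uses 22 prototypes
# and 52 elements.
PROTOTYPES = ["BiTeI", "MoSSe", "WTe2"]
A_ELEMENTS = ["Mo", "W", "Ti"]
B_ELEMENTS = ["S", "Se", "Te"]
PAULING = {"Mo": 2.16, "W": 2.36, "Ti": 1.54, "S": 2.58, "Se": 2.55, "Te": 2.10}


def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))


def featurize(prototype, a_atoms, b_atoms):
    """Build a feature vector from the prototype label and composition only."""
    n_atoms = len(a_atoms) + len(b_atoms)
    one_hot = [1.0 if p == prototype else 0.0 for p in PROTOTYPES]
    a_counts, b_counts = Counter(a_atoms), Counter(b_atoms)
    # Assumed definition: each ratio is a fraction of all atoms in the cell.
    a_ratio = [a_counts[e] / n_atoms for e in A_ELEMENTS]
    b_ratio = [b_counts[e] / n_atoms for e in B_ELEMENTS]
    en_a = geometric_mean([PAULING[e] for e in a_atoms])
    en_b = geometric_mean([PAULING[e] for e in b_atoms])
    return [float(n_atoms)] + one_hot + a_ratio + b_ratio + [en_a, en_b]


features = featurize("MoSSe", ["Mo"], ["S", "Se"])
swapped = featurize("MoSSe", ["Mo"], ["Se", "S"])
```

Because nothing in this encoding depends on which site an atom occupies, `features` and `swapped` are identical, so any model trained on such vectors returns the same prediction for both inputs.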
