
Chemistry
Accelerating the discovery of acceptor materials for organic solar cells by deep learning
J. Sun, D. Li, et al.
Discover how DeepAcceptor, developed by a team from Central South University led by Jinyu Sun, uses deep learning to accelerate the search for high-performance organic photovoltaic materials. By predicting device efficiency directly from molecular structure, the approach reduces the time and cost of identifying efficient small-molecule acceptor materials; experimentally validated candidates reached power conversion efficiencies of up to 14.61%.
~3 min • Beginner • English
Introduction
Organic solar cells (OSCs) are attractive for lightweight, low-cost, and flexible photovoltaics, with device architectures typically comprising donor–acceptor bulk heterojunctions. Recent A-DA'D-A non-fullerene acceptors (NFAs), such as Y6-series materials, have achieved record efficiencies of 18–19%. Despite this progress, discovering new high-performance acceptors remains slow, expensive, and inefficient. Machine learning offers a route to accelerate discovery by predicting device power conversion efficiency (PCE) directly from molecular structure, but model accuracy depends heavily on data quality and algorithmic advances. Computational datasets (e.g., DFT or Scharber-model derived) enable broad exploration but often deviate from experiment, reducing predictive accuracy. High-quality experimental datasets and advanced deep learning methods are therefore needed. This work addresses the question: can a deep learning model that fuses graph-based molecular representations with transformer architectures, pre-trained on large computational data and fine-tuned on curated experimental data, accurately predict PCE and enable efficient discovery of high-performance acceptor materials?
Literature Review
Prior work has shown machine learning can predict OSC properties and screen candidates, often using computational datasets (DFT or Scharber) to explore large chemical spaces. However, discrepancies between computational and experimental data can limit predictive accuracy. BERT-based models (e.g., Mol-BERT, ChemBERTa) employing masked language modeling on SMILES have shown strong performance for molecular property prediction, but SMILES syntax complexity and auxiliary tokens can hinder training. Graph neural networks (GNNs), leveraging message-passing on molecular graphs, have achieved excellent results by capturing structural and bonding information. Recent studies emphasize the importance of high-quality experimental data, dataset composition and splits, and modern deep learning strategies (pre-training, transfer learning) to improve generalization. These insights motivate combining transformer-based models with graph representations and self-supervised pre-training on large computational datasets followed by fine-tuning on curated experimental PCE data.
Methodology
Data: Two datasets were curated. (1) A computational dataset of 51,256 NFAs (from prior DFT work) served as unlabeled molecules for self-supervised pre-training. (2) An experimental dataset of 1,027 small-molecule NFAs with device PCEs collected from 508 articles served for supervised fine-tuning. Duplicate molecules across publications were resolved by taking the maximum reported PCE. Molecules containing Si or metals were removed. All molecules were represented as SMILES. Splits: computational 8:1:1; experimental 7:2:1 with stratified sampling across PCE ranges.
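A minimal sketch of how the 7:2:1 split stratified across PCE ranges could be implemented. The file name, column names, PCE bin edges, and the use of pandas/scikit-learn are illustrative assumptions, not the authors' code.

```python
# Sketch of a 7:2:1 split stratified over assumed PCE bins.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("experimental_nfas.csv")          # hypothetical file with "smiles" and "pce" columns
df["pce_bin"] = pd.cut(df["pce"], bins=[0, 4, 8, 12, 20], labels=False)  # assumed bin edges

# Carve off 70% for training, stratifying on the PCE bin.
train, rest = train_test_split(df, train_size=0.7, stratify=df["pce_bin"], random_state=42)
# Split the remaining 30% into validation (20% overall) and test (10% overall).
val, test = train_test_split(rest, train_size=2/3, stratify=rest["pce_bin"], random_state=42)

print(len(train), len(val), len(test))
```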
Model (abcBERT): The approach integrates GNN-style graph information into a BERT-like transformer. Molecular graphs are encoded with atom-type tokens (dictionary of common elements plus [UNK]) and a global supernode [Global] connected to all atoms to represent the whole molecule. Two structural encodings augment attention: (i) a bond length matrix encodes edge features and relative distances (supernode bond length set to 0) and (ii) an adjacency-based connection matrix encodes atomic connectivity. The transformer attention uses softmax(A + B + C) where A is the query–key similarity, B is bond-length encoding, and C is connection encoding. The final input embedding is the sum of token, bond-length, and connection-information embeddings.
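A minimal sketch of the structure-augmented attention described above, in which the query–key scores are added to learned encodings of bond lengths and connectivity before the softmax. The PyTorch framing, tensor shapes, discretisation of bond lengths into bins, and all layer names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAugmentedAttention(nn.Module):
    """Single attention head where scores = softmax(A + B + C)."""
    def __init__(self, d_model: int, n_bond_bins: int = 32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learned scalar biases per discretised bond length and per connectivity state (0/1).
        self.bond_bias = nn.Embedding(n_bond_bins, 1)
        self.conn_bias = nn.Embedding(2, 1)
        self.scale = d_model ** 0.5

    def forward(self, x, bond_bins, adjacency):
        # x: (batch, n_atoms, d_model); bond_bins, adjacency: (batch, n_atoms, n_atoms) int tensors
        A = self.q(x) @ self.k(x).transpose(-1, -2) / self.scale   # query–key similarity
        B = self.bond_bias(bond_bins).squeeze(-1)                  # bond-length encoding
        C = self.conn_bias(adjacency).squeeze(-1)                  # connection encoding
        attn = F.softmax(A + B + C, dim=-1)
        return attn @ self.v(x)
```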
Pre-training: A masked-atom task analogous to MLM was used. For each molecule, 15% of atoms were selected; of these, 80% were replaced by [MASK], 10% by a random atom, and 10% left unchanged. The model (embedding + transformer encoder + 3-layer FCN head) was trained to predict masked atom types using sparse categorical cross-entropy.
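The 15% / 80-10-10 scheme mirrors BERT-style masked language modeling. Below is a short, framework-agnostic sketch of how atom tokens might be corrupted for the masked-atom task; the token indices and vocabulary size are assumptions.

```python
import random

MASK_ID = 1          # assumed index of the [MASK] token in the atom vocabulary
VOCAB_SIZE = 30      # assumed number of atom-type tokens

def mask_atoms(atom_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -1 where no prediction is required."""
    corrupted, labels = list(atom_ids), [-1] * len(atom_ids)
    for i, tok in enumerate(atom_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                                  # predict the original atom type here
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_ID                       # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.randrange(VOCAB_SIZE)  # 10%: random atom token
        # remaining 10%: leave the atom unchanged
    return corrupted, labels
```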
Fine-tuning: The pre-trained encoder was retained; the [Global] supernode representation was fed to a 4-layer FCN regression head to predict PCE using MSE loss.
Hyperparameters and training: Six architectures were compared; the selected Model 3 used 8 transformer layers, 8 attention heads, and embedding size 256. Output hidden layer dimensions were 256. Optimizer: Adam; learning rate 1e-4 for both pre-training and fine-tuning; dropout 0.1; early stopping applied; batch size and dropout tuned on the validation set.
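A sketch of the fine-tuning stage: the pre-trained encoder's [Global] supernode embedding feeds a 4-layer FCN regressor trained with MSE. The reported settings (embedding size 256, Adam, learning rate 1e-4, dropout 0.1) are from the text; the hidden-layer widths and the PyTorch framing are assumptions.

```python
import torch
import torch.nn as nn

class PCEHead(nn.Module):
    """4-layer FCN regression head on the [Global] supernode embedding (widths assumed)."""
    def __init__(self, d_model: int = 256, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, global_embedding):      # (batch, d_model) [Global] representation
        return self.net(global_embedding).squeeze(-1)

head = PCEHead()
# encoder below stands for the pre-trained abcBERT encoder (hypothetical object):
# optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
# loss = nn.MSELoss()(head(encoder(batch)), pce_targets)
```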
Candidate generation and screening: A large Gen database was produced via BRICS fragmentation and recombination (T–C–T and T–S–C–S–T schemes) together with a VAE, trained on SELFIES representations derived from the BRICS outputs, to generate diverse valid molecules; the VAE achieved 100% validity, 87.1% uniqueness, and 100% novelty. The initial library comprised ~4.8 million molecules, which were filtered by RDKit-derived properties (molecular weight, LogP, H-bond donors/acceptors, rotatable bonds, rings, N/O counts) with thresholds guided by high-performance acceptors (PCE > 10%). Donor–acceptor energetic matching with PM6 (HOMO −5.45 eV, LUMO −3.65 eV) was performed using a molecularGNN trained on the 51k-NFA DFT HOMO/LUMO data; test performance was HOMO MAE 0.057 eV (R² 0.970) and LUMO MAE 0.064 eV (R² 0.967). Candidates were retained if ΔHOMO > 0, ΔLUMO > 0, and 1 eV < Eg < 3 eV, reducing the pool to 104,295 molecules. Synthetic accessibility filtering (SAscore < 8) yielded 47,653 molecules; additional polarity/charge-related descriptors (e.g., TPSA) further reduced this to 23,029. The fine-tuned abcBERT then predicted PCEs for these, and 74 candidates with predicted PCE > 14% were selected. Manual expert curation considering synthesis difficulty, conjugation, and solubility led to three final candidates for synthesis and device testing with PM6.
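A sketch of the property and energetic screening stages. The ΔHOMO > 0, ΔLUMO > 0, and 1 eV < Eg < 3 eV criteria against PM6 (HOMO −5.45 eV, LUMO −3.65 eV) come from the text; the RDKit property thresholds and the sign convention for ΔHOMO/ΔLUMO are illustrative assumptions, not the paper's exact values.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

PM6_HOMO, PM6_LUMO = -5.45, -3.65   # donor frontier levels from the paper (eV)

def passes_basic_filters(smiles: str) -> bool:
    """Coarse RDKit property screen; threshold values here are placeholders, not the paper's."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (
        500 <= Descriptors.MolWt(mol) <= 2000
        and Descriptors.MolLogP(mol) <= 12
        and Lipinski.NumHDonors(mol) <= 2
        and Descriptors.NumRotatableBonds(mol) <= 20
    )

def energetically_matched(homo_a: float, lumo_a: float) -> bool:
    """ΔHOMO > 0, ΔLUMO > 0, 1 eV < Eg < 3 eV against PM6 (sign convention assumed)."""
    d_homo = PM6_HOMO - homo_a        # donor HOMO above acceptor HOMO
    d_lumo = PM6_LUMO - lumo_a        # donor LUMO above acceptor LUMO
    eg = lumo_a - homo_a              # acceptor frontier gap in eV
    return d_homo > 0 and d_lumo > 0 and 1.0 < eg < 3.0
```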
Key Findings
- abcBERT outperformed baseline and SOTA models on the experimental test set for PCE prediction: MAE = 1.78, MSE = 5.53, R² = 0.67, Pearson r = 0.82.
- Ablation studies showed that pre-training reduced MAE by 0.26 and MSE by 1.14, and increased R² by 0.06; adding hydrogens, bond-length encoding, and connection information each improved performance, underscoring the value of richer chemical representations.
- The molecularGNN for frontier orbital prediction achieved high accuracy on test data: HOMO MAE = 0.057 eV (R² = 0.970) and LUMO MAE = 0.064 eV (R² = 0.967), outperforming the Tartarus GFN2-xTB baselines with substantially faster inference.
- Large-scale generation and screening: starting from ~4.8 million generated molecules (VAE validity 100%, uniqueness 87.1%, novelty 100%), multi-stage filtering based on basic properties, donor–acceptor energetic alignment with PM6, SAscore, and polarity/charge descriptors narrowed to 23,029; abcBERT identified 74 candidates with predicted PCE > 14%.
- Three selected candidates were synthesized and characterized; PM6-based devices achieved experimental PCEs of 14.61%, 12.62%, and 13.60%. The average absolute error between experimental and abcBERT-predicted PCEs for these three devices was approximately 1.96%. For candidate 1, experimental PCE reached 14.61% versus abcBERT prediction of 15.83% (Scharber model predicted 7.21%).
- The DeepAcceptor interface (Hugging Face Spaces) integrates an editable NFA database, molecular editor, and real-time PCE predictors, facilitating rapid design and screening.
Discussion
The study demonstrates that combining graph-informed transformer architectures with self-supervised pre-training on large computational datasets and fine-tuning on curated experimental data yields accurate and generalizable PCE predictions. The inclusion of bond-length and connectivity encodings allows the attention mechanism to capture local geometric and topological chemical information that is critical for electronic properties, thereby improving performance over SMILES-only language models and conventional GNNs. Pre-training provides a beneficial initialization for limited labeled data settings, improving downstream accuracy. The end-to-end discovery pipeline translated predictive gains into practical outcomes: high-throughput candidate generation, physics- and synthesis-aware screening, and successful experimental validation of three new NFAs with strong device performance (best PCE 14.61% with PM6). These results support the framework’s utility for accelerating acceptor discovery and suggest that deep learning with chemically enriched representations can reduce reliance on costly experiments and low-fidelity heuristics.
Conclusion
This work introduces DeepAcceptor, a framework that unifies curated experimental datasets, a graph-informed BERT model (abcBERT) with self-supervised pre-training, and a scalable generation/screening pipeline to discover high-performance NFAs. abcBERT achieved SOTA test accuracy (MAE 1.78, R² 0.67) and guided the selection of three experimentally validated candidates, reaching up to 14.61% PCE with PM6. The interface enables accessible, rapid design and prediction, and the approach demonstrates that richer molecular encodings (bond lengths, connectivity) and pre-training substantially enhance predictive power. Future work should focus on continuously expanding and refining experimental datasets, incorporating additional structural/3D information, improving domain adaptation across donors and processing conditions, and integrating active learning to further reduce experimental burdens.
Limitations
- Prediction–experiment gaps remain due to uncontrolled variables such as device fabrication conditions, morphology, synthesis/purification, and measurement variability.
- The experimental dataset size (1,027 NFAs) limits model coverage and generalizability; continuous expansion and curation are needed.
- The model is fine-tuned without conditioning on donor identity or processing parameters, which can influence device PCE and introduce noise.
- Computational pre-training data may not fully reflect experimental realities; domain shift persists despite fine-tuning.
- Structural representation relies on 2D graphs with bond-length encodings; richer 3D geometry and conformational effects are not explicitly modeled.