Small dataset machine-learning approach for efficient design space exploration: engineering ZnTe-based high-entropy alloys for water splitting

S. V. Oh, S. Yoo, et al.

This study, conducted by Seung-Hyun Victor Oh, Su-Hyun Yoo, and Woosun Jang, introduces a novel machine learning technique that efficiently navigates the design space of ZnTe-based high-entropy alloys for water splitting, leveraging limited data to predict crucial band edge positions for enhanced photocatalytic efficiency.

Introduction
Solar energy is a promising sustainable and green source for clean fuels. Photocatalytic water splitting can generate hydrogen using sunlight, where semiconducting photocatalysts absorb photons to drive redox reactions. Suitable materials must have band edges aligned with water redox potentials (Ered = −4.44 eV, Eox = −5.67 eV vs vacuum) and an optimal visible-light bandgap (≈1.6–2.1 eV). Few known materials satisfy both criteria (e.g., GaN:ZnO, β-Ge3N4, Al-doped SrTiO3, Ta3N5). High-throughput quantum mechanical calculations (e.g., DFT) combined with ML have accelerated discovery, but most ML datasets emphasize simple stoichiometries and neglect continuous alloy compositions, limiting exploration of solid-solution spaces. Cation/anion exchange and high configurational entropy offer avenues to tune band edges and enhance thermal stability. This work addresses efficient exploration of multicomponent ZnTe-based alloy photocatalysts using a small-data ML workflow that combines SISSO with an agreement (α-) method, targeting band engineering for water splitting and extending predictions up to hexanary systems.
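The two screening criteria above (band edges straddling the water redox levels, and a gap in the visible window) can be written as a short check. This is a minimal sketch: the redox potentials and gap window are the values quoted in the text, while the class name `BandEdges`, the function `is_candidate`, and the example band-edge values are hypothetical and chosen only for illustration. The properties `delta1` and `delta2` follow the redox-power definitions used in this summary (Eox − EVBM and ECBM − Ered).

```python
from dataclasses import dataclass

# Water redox potentials versus vacuum, as quoted in the text (eV).
E_RED = -4.44  # reduction level (hydrogen evolution)
E_OX = -5.67   # oxidation level (oxygen evolution)

# Visible-light bandgap window quoted in the text (eV).
EG_MIN, EG_MAX = 1.6, 2.1


@dataclass
class BandEdges:
    e_vbm: float  # valence band maximum versus vacuum (eV)
    e_cbm: float  # conduction band minimum versus vacuum (eV)

    @property
    def gap(self) -> float:
        return self.e_cbm - self.e_vbm

    @property
    def delta1(self) -> float:
        # Oxidation power: positive when the VBM lies below Eox.
        return E_OX - self.e_vbm

    @property
    def delta2(self) -> float:
        # Reduction power: positive when the CBM lies above Ered.
        return self.e_cbm - E_RED


def is_candidate(edges: BandEdges) -> bool:
    """True if the edges straddle both redox levels and the gap is in the window."""
    straddles = edges.delta1 > 0 and edges.delta2 > 0
    in_window = EG_MIN <= edges.gap <= EG_MAX
    return straddles and in_window


# Hypothetical band edges, not taken from the paper.
example = BandEdges(e_vbm=-6.0, e_cbm=-4.0)
print(is_candidate(example), round(example.delta1, 2), round(example.delta2, 2))
```

Because Ered − Eox = 1.23 eV, the two redox powers always sum to Eg − 1.23 eV, so fixing the gap fixes their total and only the split between oxidation and reduction power remains to be tuned.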
Literature Review
Prior studies established the band-edge alignment and visible-range bandgap as key criteria for photocatalytic water splitting, with optimal Eg ≈ 1.6–2.1 eV to exploit visible solar flux. Only a limited set of materials meet these requirements (e.g., GaN:ZnO, β‑Ge3N4, Al‑doped SrTiO3, Ta3N5). ZnTe has attracted interest as a photocathode (Eg ≈ 2.26 eV, high absorption coefficient ~3×10^5 cm−1 below 540 nm) but has a VBM too high for oxygen evolution; ZnS offers favorable alignment but an overly large gap (~3.6 eV). High-throughput DFT and ML frameworks have successfully screened materials, yet most databases focus on binaries/ternaries and omit intermediate alloy compositions, hindering discovery within continuous solid solutions. Cation/anion exchange can produce wide property tunability; large mixing entropy in high-entropy alloys can improve thermal stability. The SISSO approach (symbolic regression with compressed sensing) identifies low-dimensional, interpretable descriptors effective even with sparse data. The recently proposed agreement (α-) method can further reduce training data needs by cross-validating regression and classification models built on SISSO descriptors, improving generalizability to higher-component spaces.
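To make the descriptor-search idea concrete, the sketch below shows a toy screening step in the spirit of SISSO: expand a few primary features with simple operators, rank the candidates by absolute Pearson correlation with the target, and fit a one-dimensional linear model on the best one. It is not the SISSO code and does not implement the agreement method. The feature names `a` and `b`, the data, and all function names are synthetic and hypothetical. Unlike the dimension-grouped features described in this summary, the sketch does not check physical units when combining features.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def build_candidates(primary):
    """Expand {name: values} with a few unary and binary operators."""
    cands = dict(primary)
    for name, vals in primary.items():
        if all(v > 0 for v in vals):
            cands[f"sqrt({name})"] = [math.sqrt(v) for v in vals]
            cands[f"log10({name})"] = [math.log10(v) for v in vals]
    for (a, va), (b, vb) in itertools.combinations(primary.items(), 2):
        cands[f"({a}+{b})"] = [x + y for x, y in zip(va, vb)]
        cands[f"({a}-{b})"] = [x - y for x, y in zip(va, vb)]
        cands[f"({a}*{b})"] = [x * y for x, y in zip(va, vb)]
        if all(y != 0 for y in vb):
            cands[f"({a}/{b})"] = [x / y for x, y in zip(va, vb)]
    return cands


def screen(cands, target, k):
    """Keep the k candidates with the largest |Pearson r| against the target."""
    scored = [
        (abs(pearson(vals, target)), name)
        for name, vals in cands.items()
        if max(vals) > min(vals)  # skip constant features
    ]
    scored.sort(reverse=True)
    return scored[:k]


def fit_line(xs, ys):
    """Least-squares slope, intercept, and RMSE for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    sq_err = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, math.sqrt(sq_err / n)


# Synthetic data only: two made-up primary features and a target built from a / b.
primary = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}
target = [3.0 * x / y + 0.5 for x, y in zip(primary["a"], primary["b"])]

candidates = build_candidates(primary)
top = screen(candidates, target, k=3)
best_name = top[0][1]
slope, intercept, rmse = fit_line(candidates[best_name], target)
print(best_name, slope, intercept, rmse)
```

For a one-dimensional model with an intercept, the candidate with the largest |r| is also the one with the smallest least-squares error, so ranking by correlation and picking the best single descriptor coincide here. Higher-dimensional models need a combinatorial search over the screened subspace.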
Methodology
- Systems and training data: Zn-based chalcogenides were chosen as archetypes. Ternary solid solutions Zn1−xMxTe (M = Mg, Ca) and ZnTe1−yAy (A = Se, S) were modeled using SQS (ICET). 3×3×3 zinc blende supercells (27 cation and 27 anion sites) were generated for each ternary along compositions from 1/27 to 26/27 in 1/27 increments, yielding 104 alloy models (4 systems × 26 compositions). Five binaries (ZnTe, MgTe, CaTe, ZnS, ZnSe) were added, totaling 109 data points.
- DFT calculations: VASP with PAW; plane-wave cutoff 500 eV. PBE-GGA used for structural relaxation; HSE06 hybrid functional on PBE structures to obtain accurate Eg and relative band-edge positions. Γ-centered k-point meshes of 8×8×8 for primitive ZB and equivalent density for supercells. Energy and force convergence thresholds of 1e−5 eV and 1e−2 eV/Å.
- SISSO setup: Primary features: covalent radius (rc), ionic radius (ri), first ionization energy (IE), Pauling electronegativity (χ) for each species (values in Supplementary Table 1). Operators: +, −, *, /, exp, exp−1, sqrt, cbrt, log10, and (1−1/x). Descriptor generation up to 3rd feature space (β), dimension-grouped features; 8,063,535,501 candidate descriptors enumerated; SIS selected a 10,000-descriptor subspace for regression. L0 regularization used for sparsification. Single-dimensional regression models used for CBM/VBM; two-dimensional models for classification.
- Agreement (α-) method: Builds on SISSO by cross-checking accuracy between regression and classification models using the same descriptors to select those generalizing best beyond training compositions. This improves robustness and mitigates overfitting when extrapolating to higher-component spaces.
- Model construction and validation: Data subsets of size N = 13, 25, 45, 85 (from the 109 ternary/binary points) were used to train/test SISSO and SISSO+α models for EVBM and ECBM. Validation employed independent DFT results for quaternary compounds to assess extendibility. Descriptor quality was evaluated via RMSE and Pearson correlation against validation sets.
- High-entropy exploration: Using the converged SISSO+α models (down to N=13), the hexanary cation/anion space (Zn–Mg–Ca | Te–Se–S with a+b+c=1 and x+y+z=1) was enumerated on a 4×4×4 ZB supercell grid (128 atoms) with 0.7% composition increments, corresponding to 3.8 million candidate configurations. Predicted Eg and band edges were evaluated; a target Eg ≈ 2.05 eV was used to identify promising candidates. Redox powers defined as Δ1 = Eox − EVBM and Δ2 = ECBM − Ered guided selection under three alignment scenarios (Δ1-dominant, Δ2-dominant, Δ1 ≈ Δ2).
- First-principles verification and stability: Representative hexanary candidates were recalculated with DFT to benchmark predicted band edges (agreement within ~0.07 eV). Thermodynamic stability was assessed via inverse hull energies using Materials Project data, including configurational entropy contributions to evaluate temperature-dependent stabilization. Optical absorption coefficients were computed and compared to the ASTM G173 solar spectrum to assess solar absorption suitability.
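The dataset and design-space sizes quoted above can be reproduced with simple counting. The training-set arithmetic follows directly from the text. The hexanary count is a sketch under assumptions that are ours, not the source's: 64 sites per sublattice in the 128-atom cell, at least one atom of each species on each sublattice, and counting compositions rather than atomic arrangements. The source's own enumeration rules, including how its 0.7% increment is defined, may differ. The function name `count_three_component` is hypothetical.

```python
from math import comb

# Training set: 3x3x3 zinc blende supercell with 27 sites per sublattice.
SITES_TRAIN = 27
ternary_systems = ["Zn1-xMgxTe", "Zn1-xCaxTe", "ZnTe1-ySey", "ZnTe1-ySy"]
binaries = ["ZnTe", "MgTe", "CaTe", "ZnS", "ZnSe"]

# Substituted fractions 1/27, 2/27, ..., 26/27 (both end members excluded).
fractions = [k / SITES_TRAIN for k in range(1, SITES_TRAIN)]
n_alloys = len(ternary_systems) * len(fractions)
n_total = n_alloys + len(binaries)


def count_three_component(sites: int, min_each: int = 1) -> int:
    """Number of ways to split `sites` among three species, each >= min_each."""
    free = sites - 3 * min_each
    return comb(free + 2, 2) if free >= 0 else 0


# Assumption: 4x4x4 cell with 64 cation and 64 anion sites, every species present.
SITES_HEX = 64
per_sublattice = count_three_component(SITES_HEX)
n_hexanary = per_sublattice ** 2  # independent cation and anion choices

print(n_alloys, n_total, per_sublattice, n_hexanary)
```

Under these assumptions the count is 1,953 compositions per sublattice and 3,814,209 cation/anion combinations, which is consistent with the quoted figure of about 3.8 million.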
Key Findings
- SISSO vs SISSO+α performance: Conventional SISSO showed overfitting when validated on quaternary compositions, with validation RMSE deviating from train/test RMSE and low Pearson correlations (<~0.5 for some descriptors). SISSO+α avoided overfitting, yielded strong descriptor–validation correlations, and maintained consistent RMSE across training, test, and validation.
- Data-efficiency and size convergence: SISSO+α models for EVBM and ECBM converged with as few as N=13 training points, showing similar predictive performance to models trained with N=85. The approach demonstrated robust generalization from sparse ternary data to quaternary and higher-component spaces.
- Descriptor interpretability: Final one-dimensional descriptors for CBM/VBM combined elemental ionization energies, ionic/covalent radii, and electronegativity of cations/anions (full equations and coefficients in Supplementary Information). Sensitivity analyses (Supplementary) clarify how anion/cation exchanges influence band edges through these features.
- Hexanary design-space exploration: Enumerating ~3.8 million configurations at 0.7% compositional resolution produced predicted Eg spanning ~1.48–3.98 eV with varied band-edge positions. A target Eg ≈ 2.05 eV was adopted for optimal visible-light utilization.
- Candidate alloys and band alignment: Four hexanary compositions were identified under three alignment scenarios (Δ1-dominant, Δ2-dominant, and Δ1 ≈ Δ2). Examples include:
  • Structure 1 (Δ2-dominant): Zn0.47Mg0.05Ca0.48 | S0.02Se0.02Te0.96.
  • Structure 2 (Δ1-dominant): Zn0.70Mg0.28Ca0.02 | S0.02Se0.02Te0.96.
  • Structures 3 and 4 (Δ1 ≈ Δ2): Zn0.85Mg0.05Ca0.10 | S0.13Se0.77Te0.10 and Zn0.40Mg0.50Ca0.10 | S0.07Se0.20Te0.73 (and a nearby S/Se-swapped variant Zn0.40Mg0.50Ca0.10 | S0.20Se0.07Te0.73).
- DFT validation: SISSO+α predictions for band edges of selected hexanary structures agreed with HSE06 DFT within ~0.07 eV, confirming model reliability for multicomponent extrapolation.
- Thermodynamic stability: Inverse hull analyses including configurational entropy showed rapid stabilization with temperature; structure 4 reached ΔGf = 0 at ~531.3 K. Considering a metastability tolerance of ~0.2 eV, three of the four candidates are likely stabilizable near room temperature.
- Optical absorption: Predicted absorption coefficients for selected alloys peak in the visible range (~380–600 nm), aligning with high solar flux and supporting effective solar harvesting.
- Benchmarking redox powers: Proposed Zn-based high-entropy chalcogenides exhibit competitive Δ1 and Δ2 relative to well-known photocatalysts (e.g., ZnTe, Cs2PdBr6, TiO2), with design flexibility to tune Δ1 and Δ2 toward balanced ≈2.0 eV.
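The temperature-dependent stabilization noted in the findings comes from the −T·S term of the free energy. As a rough illustration, the sketch below evaluates the ideal (fully random) mixing entropy on the cation and anion sublattices for the Structure 4 composition listed above. The ideal-mixing formula, the per-formula-unit normalization, and the function names are assumptions for illustration. The source does not state which entropy expression or normalization it used, so the numbers are not directly comparable to its ΔGf values.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def sublattice_entropy(fractions):
    """Ideal mixing entropy per site in units of k_B: -sum(x * ln x)."""
    if abs(sum(fractions) - 1.0) > 1e-6:
        raise ValueError("site fractions must sum to 1")
    return -sum(x * math.log(x) for x in fractions if x > 0)


def config_entropy(cations, anions):
    """Ideal configurational entropy per formula unit (one cation + one anion), eV/K."""
    return K_B * (sublattice_entropy(cations) + sublattice_entropy(anions))


def entropy_stabilization(cations, anions, temperature):
    """Entropic free-energy term -T * S_conf in eV per formula unit."""
    return -temperature * config_entropy(cations, anions)


# Structure 4 from the text: Zn0.40 Mg0.50 Ca0.10 | S0.07 Se0.20 Te0.73
structure4_cations = [0.40, 0.50, 0.10]
structure4_anions = [0.07, 0.20, 0.73]

for temperature in (300.0, 531.3):
    term = entropy_stabilization(structure4_cations, structure4_anions, temperature)
    print(f"T = {temperature} K: -T*S_conf = {term:.4f} eV per formula unit")
```

The term is linear in temperature and is largest in magnitude for equimolar mixing (k_B ln 3 per sublattice for three species), which is why compositions far from a single dominant element gain the most stabilization on heating.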
Discussion
The study tackles the challenge of exploring vast multicomponent alloy spaces for photocatalytic water splitting using minimal high-level data. By integrating SISSO with the agreement (α-) method, the models maintain generalizability beyond the training space, mitigating overfitting and enabling accurate prediction of CBM/VBM for complex alloys with as few as 13 training points. This directly addresses the need to identify alloy compositions with suitable band alignment and visible-range bandgaps. The interpretable descriptors link elemental ionization energies, radii, and electronegativity to band-edge shifts, offering actionable design rules via cation/anion exchange. Systematic enumeration of Zn–Mg–Ca | Te–Se–S hexanaries revealed multiple candidates achieving Eg near the optimal for visible absorption and favorable Δ1/Δ2 redox powers; DFT validation confirmed predictive accuracy. Stability assessments incorporating configurational entropy indicate that several candidates can be stabilized around room temperature. Overall, the workflow efficiently maps multicomponent compositional space and provides strategies to fine-tune band edges, accelerating discovery beyond conventional ternary-focused databases.
Conclusion
This work presents a small-data ML framework that combines SISSO with an agreement method to discover and engineer ZnTe-based high-entropy chalcogenide alloys for water splitting. The approach yields interpretable, robust descriptors for CBM/VBM that generalize from sparse ternary data to quaternary and hexanary systems. Exhaustive enumeration at fine compositional resolution identified hexanary alloys with near-optimal visible-range bandgaps, balanced redox powers, and promising thermodynamic stability. DFT benchmarks validated predictions within ~0.07 eV. The methodology offers practical design routes—via targeted cation/anion exchanges—to fine-tune band edges while maintaining desired bandgaps. Future work could extend this framework to broader chemistries, incorporate additional catalytic performance metrics (surface kinetics, defect tolerance), and pursue experimental synthesis/characterization to validate stability and photocatalytic efficiency.
Limitations
- Domain specificity: Training and validation focused on ZnTe-based alloys with Mg, Ca, S, Se, Te; generalization to unrelated chemistries is untested and may require retraining.
- Data sparsity: Although the α-method enables strong performance with N≈13, descriptor fidelity may degrade when extrapolating far beyond the elements and features spanned by the training data.
- Model interpretability granularity: Nonlinear descriptor forms can entangle feature–property relationships, complicating one-to-one attribution of changes in band edges to individual elemental features.
- Validation scope: Experimental verification is not presented; first-principles validation was limited to selected quaternary/hexanary cases. Kinetic factors and synthesis routes were not explored.
- Stability assessment: Thermodynamic analyses include configurational entropy but rely on inverse hull approximations and a metastability tolerance; full finite-temperature phase equilibria and defect chemistry are not addressed.