MLMD: a programming-free AI platform to predict and design materials

Engineering and Technology

J. Ma, B. Cao, et al.

Discover the groundbreaking MLMD, a programming-free AI platform for materials design, developed by authors Jiaxuan Ma, Bin Cao, Shuya Dong, Yuan Tian, Menghuan Wang, Jie Xiong, and Sheng Sun. This innovative platform harnesses the power of machine learning to identify novel materials rapidly and efficiently, transforming the landscape of materials discovery.

Introduction
Novel materials underpin advances in aerospace, biomedicine, and energy, but conventional trial-and-error design is costly and slow, often taking decades to commercialize. The emergence of AI/ML has shifted materials R&D toward data-driven paradigms expected to halve cost and cycle time. Central to this paradigm is learning Composition-Process-Structure-Property (CPSP) relationships using ML trained on existing data to accelerate discovery in domains such as organic compounds, solar cells, alloys, and perovskites. Prior studies show active learning accelerating alloy exploration and leveraging both successful and failed experiments to uncover reaction factors. However, many existing platforms require programming skills and focus on forward models rather than inverse materials design. This motivates MLMD: a programming-free platform enabling end-to-end model building and inverse design that remains effective even with limited data through active learning.
Literature Review
Several AI platforms support materials discovery. Materials Cloud emphasizes ab initio computation and mechanism emulation. The Materials Project provides extensive inorganic materials data and quantum properties, leveraging ML potentials such as M3GNet and MEGNet. AFLOW-ML and JARVIS-ML offer crystal property prediction (e.g., formation and exfoliation energies, bandgaps, magnetic moments) based on DFT or ML surrogates. Toolkits like Matminer and Magpie provide feature generation and ML utilities. General-purpose toolkits with materials applications include command-line and web GUIs for non-data scientists, as well as integrative hubs like AlphaMat that streamline model construction. While these platforms automate accurate model building, they often require programming and emphasize forward prediction, leaving a gap in programming-free, end-to-end inverse-design workflows that MLMD seeks to fill.
Methodology
MLMD provides a web-based, code-free workflow from data to new materials: data collection, preprocessing, feature engineering, model building, hyperparameter tuning, inverse design, and experimental validation. Users upload CSV files with features (composition/process) and targets (properties). Key modules and functions:

1. Database: downloadable datasets (e.g., ceramics, HEAs, ferroelectric perovskites) and cloud outlier detection (DBSCAN, Isolation Forest, Local Outlier Factor, One-Class SVM).
2. Data visualization: distributions of features and targets.
3. Feature engineering: handle missing/duplicate values, assess correlations, rank feature importance, and transform descriptors.
4. Classification and regression: multiple supervised-learning algorithms with automated hyperparameter tuning and ensembling.
5. Surrogate optimization: integrate trained predictors with stochastic optimizers to search composition/process spaces under constraints (single-objective: GA, DE, PSO, SA; multi-objective: NSGA-II, SMS-EMOA).
6. Active learning: Bayesian sampling under data scarcity using Gaussian Process regressors and utility functions balancing exploration and exploitation (EI, EIP, AEI, REI, UCB, POI, PES; multi-objective HV/EHVI).
7. Interpretability: SHAP explanations.

Three workflows are supported: (a) model inference — train a model, generate virtual candidates, infer properties, then experimentally validate; (b) surrogate optimization — train predictors, apply constraints, run heuristic optimization, validate; (c) active learning — build a GPR on initial data, define a virtual search space, select a utility function, recommend the next experiment, iterate. The implementation builds on tools such as Scikit-learn, XGBoost, CatBoost, Pymoo, Scikit-Opt, Tsetlin, and bagolearn. Transfer learning and dimensionality reduction (PCA, t-SNE) are supported, and user-uploaded data are not stored, preserving privacy.
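The active-learning workflow (c) can be illustrated with a minimal sketch: fit a Gaussian Process regressor on a small initial dataset, score a virtual search space with the Expected Improvement (EI) utility function, and recommend the candidate maximizing EI as the next experiment. This is a hedged illustration, not MLMD's own code; the 1-D `hardness` function and all variable names here are synthetic stand-ins for a measured property.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def hardness(x):
    # Synthetic stand-in for an experimentally measured property.
    return np.sin(3 * x) + 0.5 * x

# Small initial dataset (data-scarce regime).
X_train = rng.uniform(0, 2, size=(6, 1))
y_train = hardness(X_train).ravel()

# Virtual search space of candidate compositions.
X_virtual = np.linspace(0, 2, 201).reshape(-1, 1)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-6)
gpr.fit(X_train, y_train)
mu, sigma = gpr.predict(X_virtual, return_std=True)

# Expected Improvement: trades off exploitation (high mean mu)
# against exploration (high predictive uncertainty sigma).
best = y_train.max()
sigma = np.maximum(sigma, 1e-12)
z = (mu - best) / sigma
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Recommend the next experiment: the candidate maximizing EI.
x_next = X_virtual[np.argmax(ei)]
```

In practice the recommended `x_next` would be synthesized and measured, the result appended to the training set, and the loop repeated; the other utility functions MLMD lists (UCB, POI, etc.) differ only in how `mu` and `sigma` are combined.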
Key Findings
- Classification: Across three tasks (C1: ferroelectric perovskite formability; C2: zinc alloy phase class FCC vs MSS; C3: HEA solid-solution structure), default MLMD SVC/RFC/XGBC achieved >80% 10-fold CV accuracy. Tuned models performed best: C1 tuned XGBC ≈ 86.5% CV accuracy, C2 tuned RFC ≈ 87.8%, C3 tuned RFC ≈ 92.6%, matching or exceeding baseline implementations. Confusion matrices provided per-class insight.
- Regression: For R1 (steel fracture strength), R2 (perovskite Curie temperature), and R3 (FGH98 superalloy flow stress), recommended regressors achieved 10-fold CV R² of 0.9427 (XGBR, R1), 0.8480 (SVR, R2), and 0.9288 (CBR, R3), outperforming baselines. Predicted-vs-experimental plots clustered near the diagonal, indicating strong predictive performance.
- Surrogate optimization (RAFM steels): Two regression models (for UTS and total elongation) trained on R4 achieved CV performance of 0.9912 (UTS) and 0.816 (TE). Multi-objective optimization (e.g., NSGA-II) pushed the Pareto front forward at 600 °C and 300 °C relative to the original data. At 600 °C, MLMD designed steels near targets (e.g., UTS ≈ 498 MPa and TE ≈ 21%), comparable to or improving over prior designs, with follow-up experimental validation referenced.
- Active learning (HEA hardness): Using 155 synthesized HEAs plus a virtual search space with composition constraints for Al–Co–Cr–Cu–Fe–Ni, MLMD's Bayesian strategies (EI, EIP, AEI, REI, UCB, POI, PES) identified high-hardness candidates. Example recommended compositions include EI: (Al,Co,Cr,Cu,Fe,Ni) = (43,13–15,20–22,5,12,5); REI: (43,17–18,22–24,0,11,6); UCB: (43,17–21–22–26–27,0,5,5), with experimental comparators reported. Results demonstrate convergence and effectiveness across different acquisition functions.
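The surrogate-optimization findings above rest on identifying non-dominated (Pareto-optimal) designs under a strength–ductility trade-off, e.g. maximizing both UTS and total elongation. The sketch below shows only that filtering step on synthetic (UTS, TE) pairs; MLMD itself runs full multi-objective optimizers (NSGA-II, SMS-EMOA) via Pymoo, and these numbers are illustrative, not from the paper.

```python
import numpy as np

def pareto_mask(points):
    """Boolean mask of non-dominated rows, assuming all objectives are maximized.

    A point is dominated if some other point is >= in every objective
    and strictly > in at least one.
    """
    n = len(points)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(points, i, axis=0)
        dominated = np.any(
            np.all(others >= points[i], axis=1) & np.any(others > points[i], axis=1)
        )
        mask[i] = not dominated
    return mask

# Synthetic (UTS in MPa, TE in %) pairs illustrating the trade-off.
designs = np.array([
    [480.0, 22.0],
    [510.0, 18.0],
    [450.0, 25.0],
    [470.0, 17.0],  # dominated by (480, 22): weaker AND less ductile
    [500.0, 21.0],
])
front = designs[pareto_mask(designs)]
```

"Pushing the Pareto front forward" then means proposing new compositions whose predicted (UTS, TE) pairs are non-dominated even against the existing dataset's front.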
Discussion
MLMD addresses the need for accessible, end-to-end materials design by enabling non-programmers to build predictive models and perform inverse design. The demonstrated classification and regression performance across diverse datasets establishes robust surrogate models. Integrating these models with surrogate optimization pushes property trade-offs (e.g., UTS vs TE in RAFM steels) beyond existing datasets, while active learning efficiently navigates data-scarce regimes to propose experimentally promising HEAs. Providing SHAP-based interpretability supports understanding CPSP relationships, which can guide mechanistic insights and hypothesis generation. Together, these results show that MLMD can accelerate the design cycle from data ingestion to validated materials, complementing experimental efforts and reducing trial-and-error costs.
Conclusion
MLMD is a programming-free, web-based AI platform that unifies data analysis, feature engineering, automated model construction, interpretability, surrogate optimization, and Bayesian active learning for end-to-end materials design. It achieves strong predictive performance, automates inverse design under constraints, and effectively proposes candidates in data-limited settings. Case studies across perovskites, steels, and HEAs validate its reliability and utility. Future development will continue to enhance algorithms, add frontier tools, and improve visualization interfaces. Source code and materials are available at https://github.com/aucna/MLMD, supporting community adoption and advancement of materials informatics.
Limitations
- Model inference and surrogate optimization rely on the robustness of trained predictors; performance is limited by the available data.
- Accurate predictive models are a prerequisite for effective surrogate optimization.
- Active-learning efficiency depends on the choice of utility function, which may vary case by case.
- Reported validations focus on specific datasets and materials classes; broader generalization requires further studies and experiments.
- Although privacy is emphasized (no data storage), the workflow requires structured CSV inputs and appropriate descriptor engineering for best results.