Bias-Free Multiobjective Active Learning for Materials Design and Discovery

K. M. Jablonka, G. M. Jothiappan, et al.

This summary covers a bias-free multiobjective active learning algorithm developed by Kevin Maik Jablonka, Giriprasad Melpatti Jothiappan, Shefang Wang, Berend Smit, and Brian Yoo. The approach streamlines materials discovery, demonstrated here on polymer design, by identifying Pareto-optimal materials with far fewer evaluations, combining molecular simulations with machine learning.

Introduction

The study addresses how to efficiently and confidently identify the set of Pareto-optimal materials in multiobjective materials design, where competing objectives preclude a single best solution. Traditional approaches that combine objectives into a single ranking introduce subjective biases and obscure the true nature of multiobjective optimization, which is inherently governed by a partial order (Pareto dominance). The authors propose an active learning framework that focuses on recovering the Pareto front without imposing a total order, using the Pareto dominance relation directly. They compare this strategy to Bayesian optimization methods that typically rely on acquisition functions and implicit total orderings, potentially biasing the search. The research aims to systematically reduce uncertainty in regions near the Pareto front and confidently classify materials as Pareto-optimal or discard them, under realistic constraints on the number of evaluations. The context is de novo polymer design for dispersant applications, with a prohibitively large sequence space and computational cost, making efficient exploration essential.
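To make the partial-order argument concrete, the following minimal sketch checks Pareto dominance on a handful of made-up two-objective candidates (larger is better) and shows that collapsing the objectives into a weighted sum picks a different "best" candidate depending on the weights. The function names, weights, and values are illustrative assumptions, not data or code from the study.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a Pareto-dominates b: no worse in every objective, better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def weighted_best(points: List[Sequence[float]], weights: Sequence[float]) -> int:
    """Index of the top candidate under a weighted-sum (total order) ranking."""
    scores = [sum(w * x for w, x in zip(weights, p)) for p in points]
    return max(range(len(points)), key=scores.__getitem__)


# Hypothetical objective values, both to be maximized.
candidates = [(1.0, 5.0), (3.0, 4.0), (5.0, 1.0), (2.0, 2.0)]

front = pareto_front(candidates)                 # three mutually incomparable points
best_a = weighted_best(candidates, (0.9, 0.1))   # favors the first objective
best_b = weighted_best(candidates, (0.1, 0.9))   # favors the second objective
print(front, best_a, best_b)
```

Neither of the two weighted-sum winners dominates the other, so which one is reported as "best" depends only on the chosen weights. This is the kind of subjective bias that working with the dominance relation directly avoids.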

Literature Review

The paper situates its contribution within growing efforts to use machine learning for materials design and discovery, including surrogate modeling and Bayesian optimization. Prior works have leveraged machine learning for targeted property prediction and multiobjective Bayesian optimization; however, most optimization techniques impose a total order for ranking candidates, which is misaligned with the partial-order nature of Pareto dominance and can bias the search. Active learning has been proposed as an efficient strategy to select informative experiments iteratively. The authors build on the ε-PAL (Pareto Active Learning) approach of Zuluaga et al., which uses uncertainty-aware dominance checks to classify points relative to the Pareto front. They note practical issues with random or diversity-based training set selection (inefficient coverage of relevant regions) and discuss challenges in handling incomplete data, which motivates their use of coregionalized Gaussian processes to exploit correlations across objectives and manage missing measurements.
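As a rough illustration of how a coregionalized Gaussian process shares information across objectives, the sketch below builds the covariance of an intrinsic coregionalization model in plain Python: a Matérn-5/2 kernel over inputs multiplied by a coregionalization matrix B = w wᵀ + diag(κ) of rank 1. It only illustrates the covariance structure and is not the GPy code used in the study; the weights `w`, the diagonal term `kappa`, and the lengthscale are hypothetical values.

```python
import math
from typing import List, Sequence


def matern52(x: Sequence[float], y: Sequence[float], lengthscale: float = 1.0) -> float:
    """Matern-5/2 kernel with unit variance."""
    s = math.sqrt(5.0) * math.dist(x, y) / lengthscale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def coregionalization_matrix(w: Sequence[float], kappa: Sequence[float]) -> List[List[float]]:
    """Rank-1 coregionalization matrix B = w w^T + diag(kappa)."""
    n = len(w)
    return [
        [w[i] * w[j] + (kappa[i] if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def icm_cov(x, i: int, y, j: int, B, lengthscale: float = 1.0) -> float:
    """Covariance between objective i at input x and objective j at input y."""
    return B[i][j] * matern52(x, y, lengthscale)


# Hypothetical weights for three objectives; a negative entry encodes an anti-correlation.
w = [1.0, 0.8, -0.5]
kappa = [0.1, 0.1, 0.1]
B = coregionalization_matrix(w, kappa)

x = [0.2, 0.7]
cross = icm_cov(x, 0, x, 1, B)   # covariance between objectives 0 and 1 at the same input
print(B[0][0], cross)
```

Because the cross-covariance between two objectives at the same input is nonzero, an observation of one objective updates the prediction of the other, which is what lets such a model cope with a material for which only some objectives have been measured.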

Methodology

Design space and descriptors: The design space consists of linear copolymer sequences represented in a coarse-grained model with four bead (monomer) types: [W] (weakly attractive, good solvent), [R] (repulsive, bad solvent), [Ta] (theta, attractive to surface), and [Tr] (theta, repulsive to surface). Monomer counts per type range from 4 to 12. Properties are computed via coarse-grained molecular simulations: (i) adsorption free energy ΔG_ads onto a model surface (adhesion), (ii) dimer repulsion free energy barrier ΔG_rep (polymer–polymer repulsion), and (iii) radius of gyration R_g (related to viscosity). Simulations employ EMC for setup and LAMMPS for dynamics, with SSAGES for free energy calculations using steered MD and WHAM.

Design of experiments (DoE): A full factorial design on bead counts for the four types (levels: 4, 6, 8, 10, 12) gives 5^4 = 625 design points; with five unique monomer sequences per design point, this yields 3125 polymers. Sequences were generated randomly subject to minimum counts (≥4 per monomer). This DoE maintains interpretability without needing training data for a generative model. The 60 initial training points for the surrogate are selected using greedy farthest point sampling in feature space.

Surrogate modeling and features: Features include degree of polymerization, relative sequence entropy, end-group identity (one-hot), summed interaction parameters, and cluster characteristics derived from sequence. Features are z-score standardized. Multi-output Gaussian process regression uses the Intrinsic Coregionalization Model (ICM, rank 1) with a Matérn-5/2 kernel (GPy), enabling shared structure across objectives and improved predictions with missing data. Hyperparameters are optimized with random restarts and periodically updated as new data are added.

ε-PAL active learning algorithm (modified): For each material, sampled or not, the algorithm constructs objective-wise uncertainty hyperrectangles around the mean predictions or measurements. Unlike the original, uncertainty is quantified by the coefficient of variation (relative uncertainty) rather than the absolute standard deviation; for labeled points, the uncertainties reflect measurement/model uncertainties. This removes the need to know the objective ranges in advance, and ε serves as the tolerance for ε-Pareto dominance classification. Points are classified as ε-Pareto optimal if their pessimistic estimates exceed (by ε) the optimistic estimates of others; points whose optimistic estimates fall below the pessimistic estimates of others (by ε) are discarded. The remaining points are unclassified. The next experiment is chosen to maximally reduce uncertainty near the current front by selecting the point at or near the front with the largest hyperrectangle area (relative error). The process iterates: simulate the selected polymer, update the dataset, retrain the surrogate, and reclassify.

Stopping and guarantees: The algorithm continues until all points are ε-classified (either ε-Pareto optimal or discarded). Theoretical guarantees relate ε to a bound on the maximum Pareto front error that holds with probability at least 1 − δ, given the GP kernel and proper scaling; larger ε accelerates classification but increases approximation error. Missing data are handled natively by the coregionalized GP, leveraging inter-objective correlations to improve predictions when some objective values are unavailable.

Evaluation metrics and baselines: Performance is measured via the hypervolume indicator relative to a reference (nadir) point; hypervolume error tracks convergence to the true front (computed from full DoE simulations). Random sampling serves as a baseline (bootstrapped runs).

Interpretability and inverse design: Model interpretability uses SHAP (KernelExplainer) to quantify feature importance for each objective. For inverse design, gradient-boosted decision tree surrogates (trained on GP predictions with a reduced feature set) are optimized by an elitist genetic algorithm (with novelty penalties) to propose new polymers; three sequences are generated per feature vector. GA outputs are compared to the DoE+ε-PAL results in both property and embedded spaces (UMAP).
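The classify-then-sample loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: every objective is maximized, each material is given directly as a hyperrectangle (pessimistic corner, optimistic corner), ε is a plain additive tolerance rather than the relative one used in the study, and the box size is measured by its diagonal. All names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[List[float], List[float]]  # (pessimistic corner, optimistic corner)


def leq_all(a: List[float], b: List[float]) -> bool:
    """True if a is no larger than b in every objective."""
    return all(x <= y for x, y in zip(a, b))


def classify(boxes: Dict[str, Box], eps: float) -> Dict[str, str]:
    """Label each material as 'pareto', 'discarded', or 'unclassified'."""
    labels = {name: "unclassified" for name in boxes}
    # Discard a point if even its optimistic corner is eps-dominated by the
    # pessimistic corner of another point that has not been discarded itself.
    for name, (_, hi) in boxes.items():
        for other, (other_lo, _) in boxes.items():
            if other == name or labels[other] == "discarded":
                continue
            if leq_all(hi, [v + eps for v in other_lo]):
                labels[name] = "discarded"
                break
    # A survivor is eps-Pareto optimal if no other survivor's optimistic corner
    # could still dominate its pessimistic corner shifted by eps.
    survivors = [n for n in boxes if labels[n] != "discarded"]
    for name in survivors:
        shifted = [v + eps for v in boxes[name][0]]
        if not any(leq_all(shifted, boxes[o][1]) for o in survivors if o != name):
            labels[name] = "pareto"
    return labels


def next_to_sample(boxes: Dict[str, Box], labels: Dict[str, str]) -> str:
    """Pick the non-discarded point with the largest box diagonal."""
    alive = [n for n, label in labels.items() if label != "discarded"]
    return max(alive, key=lambda n: math.dist(*boxes[n]))


# Hypothetical boxes for four materials and two objectives.
boxes = {
    "A": ([0.90, 0.10], [1.00, 0.20]),
    "B": ([0.10, 0.90], [0.20, 1.00]),
    "C": ([0.00, 0.00], [0.05, 0.05]),
    "D": ([0.05, 0.05], [0.95, 0.95]),  # poorly known material
}
first = classify(boxes, eps=0.01)
pick = next_to_sample(boxes, first)

# Pretend the picked material was simulated, which shrinks its box.
boxes[pick] = ([0.40, 0.40], [0.45, 0.45])
second = classify(boxes, eps=0.01)
print(first, pick, second)
```

In this toy run the wide box of D keeps A and B unclassified, because D could still turn out to dominate them. Sampling D resolves all three at once, which illustrates why the selection rule targets large uncertainty near the front rather than sampling at random.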

Key Findings

- Efficiency of ε-PAL: The algorithm rapidly classifies the design space, confidently discarding many polymers and identifying ε-accurate Pareto-optimal candidates after only about 10 iterations. Using ε=0.01 per objective, ε-PAL reaches the target hypervolume error with over 89% fewer iterations than random sampling (153 iterations vs 1421 for random search).
- Effect of ε: Larger ε (e.g., 0.1) accelerates classification but slightly degrades final Pareto front accuracy, reflecting the tunable accuracy–efficiency trade-off.
- Missing data robustness: With one-third of ΔG_rep values intentionally withheld, the coregionalized GP within ε-PAL still converges, though requiring more iterations to classify all materials. Coregionalization improves estimates for missing objectives by exploiting inter-objective correlations.
- True vs predicted front: Across iterations, the predicted ε-Pareto-optimal points approach the true Pareto front (computed from full DoE simulations), as visualized in property space and via hypervolume trends.
- Structure–property insights (SHAP): For R_g, degree of polymerization is the dominant feature, consistent with Flory scaling (R_g ∝ N^ν); increasing [W] content increases R_g, while increasing [R] decreases it. For adsorption, higher [W] reduces surface interaction. For dimer repulsion ΔG_rep, higher relative sequence entropy (more disorder, fewer blocks) increases repulsion, and higher [W] content also increases repulsion.
- Inverse design check: Genetic algorithm inversion of the trained models, with varying novelty penalties and elitism, did not discover polymers that Pareto-dominate those found by the DoE + ε-PAL workflow, suggesting the search effectively identified the best-performing regions within the considered design space.
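The hypervolume error behind these comparisons can be illustrated with a small two-objective sketch. The study works with three objectives, so this is a simplified stand-in, and the points and reference are made up. The hypervolume is the area dominated by a front relative to a reference point, and the error is the relative shortfall of an approximate front against the true one. The last line reproduces the iteration saving quoted in the first bullet.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by the points (both objectives maximized) above ref."""
    # Keep only points that improve on the reference in both objectives,
    # then sweep from the largest first objective downwards.
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:  # dominated points add no new area
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical fronts; the approximate front misses one true Pareto point.
reference = (0.0, 0.0)
true_front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
approx_front = [(3.0, 1.0), (1.0, 3.0)]

hv_true = hypervolume_2d(true_front, reference)
hv_approx = hypervolume_2d(approx_front, reference)
hv_error = (hv_true - hv_approx) / hv_true

# Iteration saving reported above: 153 iterations versus 1421 for random search.
reduction = 1 - 153 / 1421
print(hv_true, hv_approx, round(hv_error, 3), round(reduction, 3))
```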

Discussion

The findings demonstrate that focusing directly on the Pareto front through an ε-PAL active learning strategy yields an efficient, confidence-aware exploration of large materials design spaces without imposing a biased total order. By targeting uncertainty near the Pareto frontier, the method minimizes unnecessary evaluations in dominated regions and provides theoretical control of approximation error via ε. In the polymer dispersant case study, the approach drastically reduces the number of simulations needed to reconstruct the Pareto front with desired confidence and provides interpretable structure–property relationships. Robustness to missing data through coregionalized GPs makes the workflow practical for experimental contexts where some measurements are costly or unavailable. Overall, the strategy addresses the research question by efficiently recovering a high-quality approximation of the Pareto set, aiding decision-making in multiobjective materials discovery and enabling bias-free prioritization of informative experiments.

Conclusion

This work presents a bias-free multiobjective active learning framework based on ε-PAL that directly leverages Pareto dominance to efficiently and confidently approximate Pareto fronts in materials design. Applied to de novo polymer dispersant design, the method achieves large reductions in required evaluations compared to random exploration, maintains performance under missing data via coregionalized Gaussian processes, and yields on-the-fly structure–property insights. Inverse design checks did not reveal superior, Pareto-dominating candidates beyond those identified, supporting the effectiveness of the workflow. Given the generality of multiobjective optimization problems in materials science and beyond, the approach is anticipated to be adaptable to other domains where efficiency and confidence in identifying Pareto-optimal solutions are crucial.

Limitations

- Design space coverage: The DoE uses a factorial design over monomer counts with five sequences per design point and assumes sequence effects are secondary to monomer content; this is not fully representative of the entire sequence space.
- Model assumptions: The coarse-grained polymer model (bead types, solvent/surface interactions) and ideal-chain assumptions may limit direct transferability to specific chemistries.
- Hyperparameter sensitivity: Approximation guarantees depend on the GP kernel and scaling of uncertainty; choice of ε controls accuracy–efficiency trade-offs, with larger ε yielding faster but less accurate fronts.
- Missing data impact: Although coregionalized GPs mitigate missing objectives, convergence requires more iterations when data are absent.
- In silico validation: Results are based on simulations; experimental validation would be necessary to confirm performance in laboratory settings.
