
Clustering micropollutants and estimating rate constants of sorption and biodegradation using machine learning approaches
S. J. Lim, J. Seo, et al.
This study uses machine learning to cluster micropollutants in wastewater and to estimate ranges for their sorption and biodegradation rate constants. Conducted by Seung Ji Lim and colleagues, the approach aims to reduce the monitoring burden for environmental contaminants. In the authors' tests, the estimated ranges were correct for up to 77% of held-out compounds, and the method outperformed both clustering by biotransformation rules alone and direct regression.
Introduction
Micropollutants (MPs) such as pharmaceuticals, personal care products, steroids, estrogens, pesticides, and surfactants are widely used and commonly enter wastewater treatment plants (WWTPs), where many are not fully degraded. Their discharge poses ecological risks, so frequent and accurate monitoring is needed. Monitoring individual MPs is costly and labor-intensive; grouping MPs and using markers to represent group behavior could reduce that burden. Prior marker approaches (e.g., caffeine as an indicator of contamination by untreated wastewater) demonstrate the efficiency gains. Existing clustering methods have drawbacks: dendrograms based on biodegradation trends are hard to interpret and low-dimensional, while clustering by initial biotransformation rules from Eawag-PPS captures too little chemical detail, which limits prediction accuracy. This study proposes combining clustering by self-organizing map (SOM) with a random forest classifier (RFC) to: (1) develop an appropriate clustering method for MPs, (2) determine marker constituents for each cluster, (3) classify MPs using physicochemical properties, functional groups, and biotransformation rules, and (4) estimate ranges of the sorption coefficient (Kd) and biodegradation rate constant (kbio) for unlabeled MPs. The goal is to enable efficient fate prediction and reduce monitoring overhead in WWTPs.
Literature Review
The authors review marker-based and clustering-based approaches for MP monitoring and fate prediction. Marker strategies (e.g., caffeine as an anthropogenic marker) can substitute for tracking many individual MPs when removal behaviors are well understood. Clustering via dendrograms can reveal biodegradation trends along operational gradients (e.g., solids retention time) but offers limited interpretability due to its one-dimensional representation. Clustering by initial biotransformation rules (Eawag-PPS) improves mechanistic explainability but can lack predictive power because it omits important chemical features such as functional groups and physicochemical properties (e.g., log Kow, presence of aromatic rings, nitrogen/sulfur/halogen functionalities). Prior works have also shown grouping of sulfonamide-containing MPs and perfluorinated compounds due to shared structural motifs, supporting the value of structure-informed clustering. However, prediction of removal rates remains challenging and variable across plants, and previous meta-analyses explained a modest fraction of variability in removal efficiencies, underlining the need for data-driven, feature-rich models.
Methodology
Experimental evaluation: The biodegradation and sorption of 42 MPs were studied in agitated batch reactors under aerobic and anoxic conditions using synthetic wastewater (2.2 L SyWW + 0.8 L activated sludge; MLSS 3 g/L, MLVSS 1.8 g/L; pH 7; 22 °C). A cocktail of the 42 MPs (0.1 mg/L each) was dosed, and time-series samples taken over 0–24 h captured the concentration changes. Two controls were run: abiotic controls without sludge assessed non-biological removal, and sterile controls with sodium azide (3 g/L) assessed rapid adsorption (0–1 h). Samples were filtered (0.2 µm), spiked with internal standards, and stored at −20 °C.
Analytical methods: Nitrosamines were quantified by GC-LRMS; 35 other MPs by UHPLC-MS/MS (C18 column, 45 °C, 0.3 mL/min, water with 0.1% HF as A and methanol as B; gradient 40–70% B then to 100% B; 3 µL injection; 10-point calibration 0.1–100 ng/mL). Method optimization and validation details are in supplementary materials.
Kinetic modeling: A pseudo first-order degradation model was used, assuming rapid sorption equilibrium (instant reduction of soluble MP) and excluding volatilization and hydrolysis (confirmed negligible). Model performance was evaluated using Nash–Sutcliffe efficiency (NSE). Sorption coefficients (Kd; L gMLSS−1) and biodegradation rate constants (kbio; L g−1 h−1) were estimated for both redox conditions.
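The kinetic description above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the functional form (an instantaneous soluble fraction of 1 / (1 + Kd·X), followed by exponential decay at rate kbio·X) is an assumed reading of "pseudo first-order with rapid sorption equilibrium", and the function names, parameter values, and synthetic observations are hypothetical.

```python
import math

def soluble_concentration(t_h, c0, kd, kbio, x_ss):
    """Soluble MP concentration after t_h hours.

    Assumed form (not taken from the paper): sorption equilibrates
    instantly, leaving a soluble fraction 1 / (1 + kd * x_ss); the
    soluble pool then decays at the pseudo first-order rate kbio * x_ss.
    Units: kd in L/g MLSS, kbio in L/(g h), x_ss in g MLSS/L.
    """
    soluble_fraction = 1.0 / (1.0 + kd * x_ss)
    return c0 * soluble_fraction * math.exp(-kbio * x_ss * t_h)

def nash_sutcliffe(observed, simulated):
    """NSE = 1 - SSE / SST. A value of 1 is a perfect fit; a value
    below 0 means the model is worse than the mean of the observations."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

# Hypothetical parameter values, for illustration only.
times_h = [0, 1, 2, 4, 8, 24]
simulated = [soluble_concentration(t, c0=0.1, kd=0.1, kbio=0.05, x_ss=3.0)
             for t in times_h]

# Synthetic "observations": the simulation perturbed by a few percent.
perturbation = [1.02, 0.97, 1.01, 0.99, 1.03, 0.98]
observed = [s * f for s, f in zip(simulated, perturbation)]

nse = nash_sutcliffe(observed, simulated)
print(f"NSE = {nse:.3f}")
```

Under this reading, a negative NSE such as those reported for some PFAS, atrazine, and nitrosamines means the fitted curve explains the measurements worse than a flat line at their mean.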
Machine learning pipeline (SOM–Ward–RFC): The dataset combined physicochemical properties, functional groups, initial biotransformation rules (Eawag-PPS), and experimentally derived Kd and kbio for 42 MPs. Data were split into a training/validation set (29 MPs; 5-fold cross-validation) and a test set (13 MPs).
Step 1: An unsupervised SOM projected the high-dimensional features onto a 2D map, and Ward's method then drew the cluster boundaries. Two clustering scenarios were tested: PF (physicochemical properties + functional groups) and BT (biotransformation rules). The number of clusters was chosen by minimizing the Davies–Bouldin index (DBI). The marker of each cluster was defined as the MP whose rate constants lie closest (minimum Euclidean distance) to the cluster mean of the rate constants.
Step 2: An RFC was trained to assign MPs to the SOM-derived cluster labels using the same input features as in clustering. Performance was assessed by accuracy, F1-score, precision, and recall.
Step 3: The trained SOM–Ward–RFC model assigned the unlabeled test MPs to clusters. Their Kd and kbio ranges were then estimated from the cluster marker: Kd,u lies within Kd,m ± N·σKd and kbio,u within kbio,m ± N·σkbio, where the subscript u denotes the unlabeled MP, m the cluster marker, N = 1, 2, or 3, and σ is the standard deviation computed over the MPs in the cluster.
A preliminary Random Forest Regressor (RFR) that predicted the rate constants directly was also evaluated but overfit (train R² 0.78–0.90 vs test R² −0.08 to 0.45), so the combined clustering-classification approach was adopted. Simulations were implemented in Python (MiniSom 2.3.0; scikit-learn 1.0).
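The marker definition and the range rule of Step 3 can be illustrated with a small Python sketch. It is not the authors' implementation: the cluster contents and compound names are hypothetical, the distance is computed on unscaled Kd and kbio values, and the population standard deviation is used, since the summary does not say which estimator or scaling the authors chose.

```python
import math
import statistics

def pick_marker(cluster):
    """Return the MP whose (Kd, kbio) pair is closest, by Euclidean
    distance, to the cluster mean of the rate constants.
    cluster maps MP name -> (kd, kbio). No rescaling is applied
    (an assumption of this sketch)."""
    mean_kd = statistics.mean([kd for kd, _ in cluster.values()])
    mean_kbio = statistics.mean([kbio for _, kbio in cluster.values()])
    return min(
        cluster,
        key=lambda name: math.hypot(cluster[name][0] - mean_kd,
                                    cluster[name][1] - mean_kbio),
    )

def estimate_ranges(cluster, n_sd):
    """Ranges for an unlabeled MP assigned to this cluster:
    marker value +/- n_sd * sigma, with sigma taken over the cluster.
    Population SD is an assumption; the source only says sigma is
    computed from the MPs in the cluster."""
    marker = pick_marker(cluster)
    kd_m, kbio_m = cluster[marker]
    sd_kd = statistics.pstdev([kd for kd, _ in cluster.values()])
    sd_kbio = statistics.pstdev([kbio for _, kbio in cluster.values()])
    return {
        "marker": marker,
        "kd": (kd_m - n_sd * sd_kd, kd_m + n_sd * sd_kd),
        "kbio": (kbio_m - n_sd * sd_kbio, kbio_m + n_sd * sd_kbio),
    }

def in_range(value, bounds):
    """True if a measured constant falls inside the estimated range."""
    low, high = bounds
    return low <= value <= high

# Hypothetical cluster of three MPs: name -> (Kd, kbio).
cluster = {
    "MP_A": (0.30, 0.10),
    "MP_B": (0.40, 0.20),
    "MP_C": (0.44, 0.60),
}

ranges_by_n = {n: estimate_ranges(cluster, n) for n in (1, 2, 3)}
for n, result in ranges_by_n.items():
    print(n, result["marker"], result["kd"], result["kbio"])
```

Widening N from 1 to 3 can only enlarge the interval, which is why the reported estimation accuracy rises with N at the cost of less precise bounds.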
Key Findings
- Experimental removals and kinetics:
  - Sorption dominated within the first hour for certain MPs: parabens, estrogens, diclofenac, and atorvastatin showed 32–57% removal by sorption, while most MPs showed <14%.
  - Under aerobic conditions, ibuprofen, naproxen, caffeine, metformin, gemfibrozil, and acetaminophen were almost completely removed, primarily by biodegradation. Diclofenac was removed by ~74% in total, with sorption (47%) exceeding biodegradation (27%). Some antibiotics, carbamazepine, atrazine, clofibric acid, and DEET were poorly removed (sorption <5%, biodegradation up to 40%). Propranolol exhibited negative apparent removal, likely due to back-transformation/deconjugation.
  - Under anoxic conditions, metformin was almost completely biodegraded; ranitidine, iopromide, and acetaminophen showed 62–85% removal; parabens and estrogens were substantially removed (largely by sorption). β-blockers and trimethoprim had ~45% higher biodegradation than under aerobic conditions. Several MPs (e.g., benzotriazoles, gemfibrozil, diclofenac, ibuprofen, naproxen, caffeine) showed low removal (7.9–28.6%). Perfluoropentanoic acid had negative removal, possibly from transformation of other PFAS.
  - Kinetic ranges: kbio spanned 0–2.3 L g−1 h−1 (aerobic) and 0–1.8 L g−1 h−1 (anoxic). Sorption coefficients were similar across conditions: 0–0.44 L gMLSS−1 (aerobic) and 0–0.5 L gMLSS−1 (anoxic). Pseudo first-order models fit most MPs (positive NSE), but fits were poor for some (certain PFAS, atrazine, and nitrosamines showed negative NSE values).
- Clustering and markers:
  - The PF scenario (physicochemical properties + functional groups) yielded 11 clusters with a DBI of 0.49. Clusters were chemically interpretable: nitrosamines grouped together, as did sulfonamides; parabens and estrogens co-clustered due to high log Kow and shared functional groups; halogenated and perfluorinated compounds formed distinct clusters.
  - The BT scenario (biotransformation rules) yielded 15 clusters with a higher DBI of 0.87. It grouped MPs sharing initial Eawag-PPS rules (e.g., sulfonamide bt0144; aromatic ring dihydroxylation bt0005; H-abstraction bt0002; ether dealkylation bt0023), but the clusters were less well organized overall than in PF.
  - Eleven marker MPs were identified under each redox condition (markers are the MPs closest to the cluster mean of the rate constants).
- Classification and estimation performance (train/validation on 29 MPs; test on 13 MPs):
  - PF scenario classification: accuracy 0.75 and F1-score 0.61 for both aerobic and anoxic conditions.
  - BT scenario classification: accuracy 0.43; F1-score 0.32.
  - Rate constant range estimation accuracy (test set):
    - Aerobic PF: 0.38 (N=1), 0.69 (N=2), 0.77 (N=3).
    - Aerobic BT: 0.10 (N=1), 0.20 (N=2), 0.40 (N=3).
    - Anoxic PF: 0.46 (N=1), 0.70 (N=2), 0.77 (N=3).
    - Anoxic BT: 0.30 (N=1), 0.40 (N=2), 0.40 (N=3).
  - The combined SOM–RFC approach outperformed direct RFR prediction, which overfit and had low test R².
  - External dataset application (biotransformation-dominant system): the BT scenario slightly outperformed PF (classification 0.72 vs 0.62; estimation 0.69 vs 0.62), indicating that the optimal feature set depends on the dataset.
- Overall, using PF features improved cluster quality, classification, and rate constant estimation accuracy (best 0.77) in the study's mixed sorption/biodegradation dataset.
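To make the DBI values quoted above (0.49 for PF vs 0.87 for BT) easier to read, here is a small Python sketch of the standard Davies–Bouldin index, where lower values indicate more compact and better separated clusters. The toy points and the function names are hypothetical. scikit-learn provides `davies_bouldin_score` for the same quantity, although the summary does not state which implementation the authors used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of
    feature vectors). For every cluster, take the worst-case ratio of
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j) over the
    other clusters, then average. Requires at least two clusters with
    distinct centroids."""
    centers = [centroid(c) for c in clusters]
    scatters = [
        sum(euclidean(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatters[i] + scatters[j]) / euclidean(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k

# Two hypothetical partitions of the same four 2D points.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # compact, well separated
loose = [[(0, 0), (0, 1), (10, 0)], [(10, 1)]]   # one stretched cluster

print(f"tight partition DBI = {davies_bouldin(tight):.3f}")
print(f"loose partition DBI = {davies_bouldin(loose):.3f}")
```

Choosing the number of clusters by minimizing the DBI, as in the study, amounts to evaluating a score like this for each candidate partition and keeping the lowest one.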
Discussion
The study demonstrates that combining SOM clustering with RFC classification enables interpretable grouping of MPs and practical estimation of sorption and biodegradation rate constants from easily accessible descriptors (physicochemical properties and functional groups). This addresses the challenge of costly, compound-by-compound monitoring by introducing cluster representatives (markers) to infer the fate of unlabeled MPs. The PF scenario’s superior DBI, classification accuracy, and estimation performance indicate that incorporating functional groups and physicochemical properties captures both biotransformation potential and sorption propensity, critical for WWTP contexts where removal mechanisms differ across compounds. The approach generalized to an external biotransformation-focused dataset where BT rules became more predictive, underscoring that the best feature set depends on the dominant removal mechanism in the system. These findings are relevant for WWTP monitoring, as markers can reduce analytical workload while providing bounds on rate constants that inform treatment performance and risk assessment. The interpretability of clusters (e.g., grouping sulfonamides, nitrosamines, parabens/estrogens, PFAS) facilitates mechanistic understanding and potential process optimization.
Conclusion
This work introduces an integrated SOM–Ward–RFC framework to cluster MPs, identify marker compounds, and estimate ranges of sorption and biodegradation rate constants for unlabeled MPs under aerobic and anoxic conditions. Using physicochemical properties and functional groups yielded well-organized clusters (DBI 0.49), strong classification (accuracy 0.75), and accurate rate constant range estimation (up to 0.77), outperforming both clustering by biotransformation rules alone and direct regression models. The method effectively captures both sorption and biotransformation behaviors and can streamline monitoring by focusing on marker MPs. Future research should expand MP datasets to balance clusters, incorporate additional features (e.g., operational parameters, microbial community metrics), improve coverage for compounds lacking Eawag-PPS rules, and validate across diverse WWTPs to enhance generalizability and robustness.
Limitations
- Data sparsity and uneven cluster sizes may bias clustering and classification; authors note uneven distribution due to limited MP data.
- The pseudo first-order model assumes rapid sorption equilibrium and excludes volatilization and hydrolysis; some compounds showed poor kinetic fits (negative NSE), indicating model mismatch.
- Eawag-PPS did not provide initial biotransformation rules for some compounds (e.g., certain PFAS, some nitrosamines), limiting BT-based clustering.
- Direct regression (RFR) suffered from overfitting; the final approach estimates ranges rather than precise values and often required broader (±3 SD) intervals to achieve highest accuracy.
- Model performance is dataset-dependent; BT rules were more effective in an external, biotransformation-dominant dataset, suggesting sensitivity to system-specific mechanisms.
- Code is proprietary, which may limit reproducibility; substantial training data are required for successful application.