Data-driven approaches linking wastewater and source estimation hazardous waste for environmental management

Environmental Studies and Forestry

W. Xie, Q. Yu, et al.

This research by Wenjun Xie and colleagues presents a data-driven method for predicting firm-level hazardous waste generation from large wastewater monitoring datasets. The combined model reaches a test R² of 0.87 for total hazardous waste, and the approach is intended to support more efficient, better-targeted environmental management across industrial sectors.

Introduction
Waste management is increasingly central to environmental management and linked to many SDGs. Hazardous waste (HW) is prioritized due to its harmful properties and recycling potential. Industrial acceleration is driving HW growth, stressing management systems. Source management—improving collection and reducing generation intensity—faces challenges where regulations are lax and accurate firm-level estimates are lacking. Although some regions mandate reporting (e.g., TRI in the US, E-PRTR in the EU), ensuring regular participation by all generators is difficult, especially with short reporting cycles. Traditional estimation methods rely on industry-level intensity factors (e.g., employees, economic output) that overlook firm heterogeneity and often lack timely, accessible, high-resolution data. Advances in IoT have enabled continuous wastewater monitoring, and because HW and wastewater arise from the same manufacturing processes—partitioning contaminants between solid and liquid phases—there is a logical correlation to exploit. This study proposes a data-driven framework to estimate monthly firm-level HW generation by linking wastewater data and firm characteristics, addressing heterogeneity with feature engineering, data balancing, and causal discovery, and demonstrating feasibility using Jiangsu, China data across 10 sectors.
Literature Review
Prior HW generation studies commonly use economic and demographic indicators (e.g., employee numbers, product output) and industry-average waste intensity factors, which mask firm-level heterogeneity and lack temporal granularity. Regulatory inventories such as the US TRI and the EU E-PRTR mandate reporting but are limited in participation and reporting frequency. Emerging IoT sensor networks provide high-frequency wastewater monitoring, which can serve as a proxy for production activity and contaminant partitioning. The paper also situates its model choice within the broader ML literature: on tabular data with medium-sized samples, tree-based models (e.g., Random Forests) often outperform deep learning, a result attributed to non-smooth target functions and uninformative features. The study integrates causal discovery (DAG/Markov Blanket) and SHAP-based feature importance to select informative predictors, building on the literature on causal inference and explainable AI.
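For readers unfamiliar with the Markov Blanket idea mentioned above, a minimal sketch follows. It assumes the DAG has already been learned and is available as a list of directed edges; the function name and the variable names in the toy graph are invented for illustration and do not reflect the structure learned in the study. The blanket of a target node consists of its parents, its children, and the other parents of its children.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def markov_blanket(edges: Iterable[Edge], target: str) -> Set[str]:
    """Return parents, children, and co-parents of `target` in a DAG."""
    edge_list = list(edges)
    parents = {u for u, v in edge_list if v == target}
    children = {v for u, v in edge_list if u == target}
    # Co-parents ("spouses"): other nodes that point into a child of the target.
    co_parents = {u for u, v in edge_list if v in children and u != target}
    return parents | children | co_parents


# Toy DAG with made-up variable names (not the structure learned in the study).
toy_edges = [
    ("sector", "firm_scale"),
    ("firm_scale", "hw"),
    ("hw", "sludge_flag"),
    ("zn", "sludge_flag"),
    ("ph", "cod"),
]

blanket = markov_blanket(toy_edges, "hw")
print(sorted(blanket))  # ['firm_scale', 'sludge_flag', 'zn']
```

In this toy graph `sector` is excluded because it influences `hw` only through `firm_scale`, which is the sense in which the blanket screens off the rest of the graph.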
Methodology
- Study design and data: A generic machine learning framework was built to predict monthly firm-level HW generation using 43 variables: 19 static characteristics (sector, firm scale via employee-based categories, and 17 binary manufacturing process indicators) and 24 wastewater/emissions indicators (e.g., discharge volume, COD, pH, NH3-N, N, P, metals Fe/Cr/CrVI/Cu/Zn/Ni, etc.). For the Jiangsu case, 1024 enterprises from 10 sectors (covering 49.7% of regional HW in 2020) were analyzed using monthly HW declarations (2020–2022) and 4,260,593 daily wastewater sensor records aggregated monthly, yielding 16,477 observations. A subset of 25 variables was used based on sector characteristics and data availability.
- Data handling: The dataset was randomly split into training and test sets (80/20) before any preprocessing to avoid leakage. Missing values were imputed and outliers (5%) removed using unsupervised ML on the training set; test set imputation used training-derived parameters.
- Feature engineering: Causal discovery via DAG learning (continuous optimization approach) identified Markov Blanket features for the target. Features were then adjusted using SHAP importance and correlation analysis: high-importance features not in MB were considered for addition if weakly correlated (Spearman < 0.6) with MB features, or used to replace correlated MB features (Spearman > 0.6) when more important. Iterative, one-at-a-time adjustments continued until performance stabilized.
- Addressing long-tailed targets: To improve learning in rare high-value regions, the SMOGN algorithm combined under-sampling of common regions with SMOTER interpolation and Gaussian noise generation for rare cases (threshold at 75th percentile of target). This was applied to regression training data.
- Model construction and selection: Eight algorithms were considered for regression (GBDT, SVM, XGBoost, kNN, RF, MLP, MLP ensemble, Tabular Neural Network). Ten-fold cross-validation on the training set tuned hyperparameters, with performance measured by R², RMSE, MAE, MAPE, MAD, MSE, SSE. The best model on the test set was selected.
- Total HW prediction: A combined regression model predicted total HW per firm-month. RF achieved the best trade-off of accuracy and speed and was adopted.
- Category-specific prediction (MHW/HW17): An ensemble model coupled a binary classifier (to predict zero vs >0) with a regression model (to predict positive quantities). The classifier trained on the full pre-balance training set; the regression model trained only on positive cases and used SMOGN balancing. Test-time logic: if classifier predicts >0, apply regression; else output 0. Classification metrics were accuracy, recall, precision, F1.
- Sector-independent models: Given inter-sector heterogeneity, separate RF models were developed for nine sectors for total HW (excluding steelmaking due to insufficient samples) and for four sectors for MHW (MST, SRP, ECM, MWR), using the same pipeline. Sector feature was removed when training sector-specific models.
- Model interpretation: SHAP was used to quantify feature effects and group-wise average importance (sector, firm scale, processes, routine wastewater indicators, metal emissions) across combined and sector-independent models.
- Generalizability studies: Three additional applications (annual data, 2015) assessed transferability: MST sector in Shandong (lower wastewater-to-HW ratio), MST in Zhejiang (higher ratio), and lead/zinc metallurgy in Hunan (different sector). Also compared feature importance for MST models across six provinces to assess regional heterogeneity.
- Sensitivity analyses: Compared feature selection strategies (MB-only vs MB + importance), assessed temporal extrapolation by predicting 2022 months using models trained on 2020–2021, and evaluated uncertainty using Quantile Regression Forests to produce 95% prediction intervals and coverage.
- Cost-effectiveness: Reported compute time on a standard laptop for data integration, preprocessing, feature engineering, training, and inference; contrasted with time-consuming field surveys for intensity factors.
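The category-specific logic described in the list above (classify zero vs. positive, then regress only the positives) can be sketched as follows. This is an illustrative outline rather than the authors' code: the helper names and the two stand-in "models" are hypothetical, whereas the ensemble reported in the study was RF-based.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def split_for_two_stage(
    X: Sequence[Row], y: Sequence[float]
) -> Tuple[List[int], List[Row], List[float]]:
    """Build training targets for both stages.

    The classifier is trained on every row with a 0/1 label; the regressor
    is trained only on rows whose target is positive.
    """
    labels = [1 if value > 0 else 0 for value in y]
    X_pos = [row for row, value in zip(X, y) if value > 0]
    y_pos = [value for value in y if value > 0]
    return labels, X_pos, y_pos


def two_stage_predict(
    is_positive: Callable[[Row], bool],
    quantity: Callable[[Row], float],
    X: Sequence[Row],
) -> List[float]:
    """Apply the regressor only where the classifier predicts a positive amount."""
    return [float(quantity(row)) if is_positive(row) else 0.0 for row in X]


# Hypothetical stand-ins for trained models, used only to exercise the logic.
def toy_classifier(row: Row) -> bool:
    return row[0] > 1.0


def toy_regressor(row: Row) -> float:
    return 2.0 * row[0]


X_demo = [[0.5], [4.0], [1.0], [3.0]]
y_demo = [0.0, 8.5, 0.0, 5.5]

labels, X_pos, y_pos = split_for_two_stage(X_demo, y_demo)
predictions = two_stage_predict(toy_classifier, toy_regressor, X_demo)
print(labels)       # [0, 1, 0, 1]
print(predictions)  # [0.0, 8.0, 0.0, 6.0]
```

In this arrangement, any balancing step such as SMOGN would be applied to `X_pos` and `y_pos` only, matching the statement that the classifier was trained on the pre-balance data.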
Key Findings
- Model performance and balancing: RF outperformed linear regression (R² 0.22, RMSE 592.42) and deep learning on tabular data. On the combined model for total HW, SMOGN balancing improved R² from 0.80 to 0.87 (test RMSE 247.40; N=3294). Training time RF ~12 s vs TNN ~5767 s with comparable accuracy (TNN R² 0.81, RMSE 307.85).
- Category-specific ensemble (MHW/HW17): The RF-based ensemble achieved R² 0.85 and RMSE 47.50 (N=3294). The classifier achieved Acc 0.96, Recall 0.94, Precision 0.97, F1 0.95. The ensemble outperformed direct regression on balanced data (R² 0.69, RMSE 65.80) and was superior on both zero and positive subsets of the test data.
- Sector-independent vs combined models: Sector-specific models generally outperformed the combined model. Examples: total HW MST R² 0.86 vs 0.74 (combined); OCM R² 0.77 vs 0.71. For MHW, MST R² 0.72 vs 0.66.
- Feature importance (SHAP): For total HW (combined model), average relative importance by group: firm scale 16.8% > routine wastewater indicators 9.5% > metal emissions 8.8% > sector 8.7% > manufacturing processes 3.6%. For MHW classification, sector dominated (58.1%), followed by metal emissions (8.2%), firm scale (3.2%), routine indicators (2.6%), processes (2.2%). For MHW regression, firm scale (14.6%) > sector (10.1%) > metal emissions (9.0%) > processes (6.2%) > routine indicators (3.8%). Metals aligned with sectoral processes (e.g., Zn/Fe/Ni in MST; Cu in ECM). A specific process (wastewater treatment, process_10) was highly important for MHW due to sludge generation from metal wastewater treatment.
- Generalizability: MST models using annual data performed well in Shandong (R² 0.69, RMSE 29.38) and Zhejiang (R² 0.72, RMSE 721.15). Lead/zinc metallurgy in Hunan achieved R² 0.82, RMSE 2366.63. Applying Zhejiang-trained model to Shanghai underperformed (R² 0.32), indicating need for localized training.
- Temporal extrapolation: Combined models maintained R² ~0.7 for about 3 months beyond training data, then declined to ~0.4–0.5, supporting retraining every three months.
- Uncertainty: For combined models, 95.60% of test samples fell within 95% prediction intervals, indicating good calibration; some sector-independent and MHW models showed overconfidence due to small/imbalanced data.
- Cost-effectiveness: End-to-end model development took ~20 minutes on a laptop (data integration ~13m 23s; preprocessing ~35.1s; feature engineering ~171.2s; model construction ~125.7s). Field surveys for intensity factors can exceed one day per enterprise.
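The coverage figure in the uncertainty finding is the share of test samples whose observed value lies inside its prediction interval. A minimal sketch of that check follows, using made-up numbers. The function name is illustrative, and the interval bounds are assumed to come from whatever quantile model is in use (Quantile Regression Forests in the study).

```python
from typing import Sequence


def interval_coverage(
    observed: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> float:
    """Fraction of observations that fall inside [lower, upper]."""
    if not (len(observed) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    if len(observed) == 0:
        raise ValueError("at least one observation is required")
    inside = sum(1 for y, lo, hi in zip(observed, lower, upper) if lo <= y <= hi)
    return inside / len(observed)


# Made-up values for illustration only.
observed = [1.0, 5.0, 10.0, 20.0]
lower = [0.0, 4.0, 11.0, 15.0]
upper = [2.0, 6.0, 12.0, 25.0]

print(interval_coverage(observed, lower, upper))  # 0.75
```

Coverage close to the nominal level (95% here) suggests well-calibrated intervals, while coverage well below it means the intervals are too narrow, which is the overconfidence noted for some sector-independent and MHW models.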
Discussion
Linking wastewater monitoring data with firm characteristics enables accurate, scalable estimation of firm-level hazardous waste generation, addressing the lack of granular, timely data for source management and regulatory oversight. The strong performance of RF-based models, strengthened by data balancing and informed feature selection via causal discovery and SHAP, reflects the complex, non-linear relations between HW generation and wastewater/firm attributes. Sector-independent models capture distinct sectoral patterns and outperform a single combined model when sufficient data are available, highlighting industrial heterogeneity. Feature importance analyses show consistent roles for firm scale and routine wastewater indicators, with sector- and region-dependent influences of metal emissions and specific processes. The framework generalizes to other regions and sectors but requires localized training due to differing contaminant partitioning and industrial configurations. Operationally, predictions remain reliable for approximately three months, suggesting a practical retraining cadence for regulatory applications. Calibrated uncertainty estimates further support decision-making by quantifying confidence in predictions.
Conclusion
The study introduces a generic, data-driven framework that integrates wastewater monitoring and firm characteristics to predict monthly firm-level hazardous waste generation with high fidelity. Through causal-informed feature selection, data balancing, and efficient tree-based models, the approach achieves strong performance for total HW and category-specific HW (MHW), and sector-specific models further enhance accuracy. The framework demonstrates adaptability across regions and sectors when models are trained on localized data and relevant variables are screened per context. It offers a cost-effective, rapidly deployable tool for regulators to optimize collection, enforcement, and planning. Future work should expand applications to more regions and sectors, incorporate additional high-resolution variables (e.g., raw materials, product outputs, energy consumption) as data availability improves, and explore continual learning strategies to sustain accuracy amid evolving industrial processes.
Limitations
- The primary case study covers a single region (Jiangsu) and 10 sectors; broader validation is needed.
- Key operational variables (raw material inputs, product outputs, electricity consumption) were excluded due to data availability, potentially limiting accuracy.
- Some sectors had limited sample sizes (e.g., steelmaking), constraining sector-specific modeling.
- Certain models exhibited overconfident uncertainty estimates, likely due to small and long-tailed datasets; larger, higher-quality datasets would improve calibration.
- Temporal drift affects extrapolation; performance declines beyond ~3 months without retraining.