Data-driven approaches linking wastewater and source estimation hazardous waste for environmental management

W. Xie, Q. Yu, et al.

This research, conducted by Wenjun Xie and colleagues, presents a data-driven methodology for predicting firm-level hazardous waste generation from large-scale wastewater monitoring datasets. The best model reaches an R² of 0.87 for total hazardous waste, and the approach is reported to be more efficient than traditional field surveys in both time and cost.

Introduction

Effective hazardous waste (HW) management is crucial for sustainable development and directly affects multiple Sustainable Development Goals (SDGs). The increasing generation of HW, driven by industrial expansion, necessitates efficient source management strategies. Current methods for estimating firm-level HW generation often rely on aggregated industry-level intensity factors based on employee numbers or economic output. These approaches lack the granularity and accuracy required for effective regulation and enforcement, and the heterogeneity of HW generation patterns across enterprises further complicates estimation. Traditional methods suffer from low accuracy and high costs, hindering effective source management. This paper addresses the gap by proposing a data-driven approach that leverages readily available wastewater data collected via Internet of Things (IoT) sensors. The underlying assumption is a strong correlation between wastewater discharge and HW generation, reflecting the partitioning of contaminants during manufacturing processes. Wastewater data offer advantages over traditional economic indicators because of their higher temporal resolution and their accessibility to environmental stakeholders. The study develops a machine learning-based framework that predicts monthly firm-level HW generation from wastewater data combined with enterprise-specific characteristics, and demonstrates its feasibility through a large-scale application in Jiangsu Province, China.

Literature Review

Existing literature highlights the challenges in accurately estimating hazardous waste generation at the firm level. Traditional methods often use aggregated industry-level data based on factors like employee numbers or economic output, failing to account for enterprise heterogeneity. The lack of granular, timely data further limits the effectiveness of these methods. Studies have explored indicators such as economic activity and demographic factors, but these lack the detail and temporal resolution needed for precise and up-to-date assessments. While programs like the Toxics Release Inventory (TRI) in the US and the European Pollutant Release and Transfer Register (E-PRTR) exist, ensuring comprehensive and timely participation from all generators remains a significant hurdle. The use of IoT sensors for continuous wastewater monitoring has emerged as a promising approach, but its integration into predictive modeling for HW generation is still underdeveloped. This study builds upon this nascent area by using a data-driven approach to link wastewater data with HW generation and by providing a comprehensive evaluation of its accuracy and applicability.

Methodology

The study employed a three-stage methodology: data collection and preprocessing, feature engineering, and model construction (Fig. 1). Data were collected from 1024 enterprises across 10 sectors in Jiangsu, China, encompassing 43 variables: 19 related to static enterprise characteristics (sector, firm scale, manufacturing processes) and 24 related to real-time manufacturing activities (wastewater discharge indicators, including metals). Preprocessing involved imputing missing values and rejecting outliers. Feature engineering used causal discovery (DAG learning and Markov Blanket search) to select relevant features, which were further refined based on feature importance (SHAP analysis) and correlation analysis. This feature selection aimed to improve model performance and computational efficiency. The training dataset was balanced using SMOGN (Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise) to address the long-tailed distribution of the HW generation data. Several machine learning algorithms (RF, GBDT, SVM, XGBoost, kNN, MLP, MLP ensemble, TNN) were compared during model development. Total HW generation was predicted with a single Random Forest (RF) model. To predict the quantity of a specific HW category (metal surface treatment hazardous waste, MHW), the study used an ensemble that couples binary classification (to determine whether generation is zero) with regression (to predict the quantity when it is non-zero), again with an RF model for both components. Sector-independent models, that is, models trained separately for each sector, were also developed for sectors with sufficient data and compared with the combined models. Model performance was evaluated using R², RMSE, MAE, MAPE, MAD, MSE, and SSE for regression, and accuracy, recall, precision, and F1-score for classification. SHAP analysis was used to interpret feature importance.
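The coupling of classification and regression described for MHW can be illustrated with a minimal sketch. Everything named here is hypothetical: the class names, the two toy features (a wastewater volume and a metal concentration), and the invented data. A tiny 1-nearest-neighbour model stands in for the Random Forests used in the study so that the example needs only the standard library.

```python
from typing import List, Sequence

Row = Sequence[float]


class NearestNeighbor:
    """Tiny 1-nearest-neighbour model; a stand-in for the study's Random Forests."""

    def fit(self, X: Sequence[Row], y: Sequence[float]) -> "NearestNeighbor":
        self.X = [tuple(row) for row in X]
        self.y = list(y)
        return self

    def predict(self, X: Sequence[Row]) -> List[float]:
        out = []
        for row in X:
            # Squared Euclidean distance to every stored training row.
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X]
            out.append(self.y[dists.index(min(dists))])
        return out


class TwoStagePredictor:
    """Classifier decides zero vs non-zero; regressor estimates non-zero amounts."""

    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X: Sequence[Row], y: Sequence[float]) -> "TwoStagePredictor":
        # Stage 1 sees every row, labelled 1 if any waste was generated.
        self.classifier.fit(X, [1 if v > 0 else 0 for v in y])
        # Stage 2 sees only the rows with non-zero generation.
        X_pos = [row for row, v in zip(X, y) if v > 0]
        y_pos = [v for v in y if v > 0]
        self.regressor.fit(X_pos, y_pos)
        return self

    def predict(self, X: Sequence[Row]) -> List[float]:
        flags = self.classifier.predict(X)
        amounts = self.regressor.predict(X)
        # Keep the regression output only where stage 1 predicts non-zero.
        return [amt if flag == 1 else 0.0 for flag, amt in zip(flags, amounts)]


# Invented toy data: (wastewater volume, metal concentration) -> monthly MHW.
X_train = [(10.0, 0.0), (12.0, 0.1), (50.0, 2.0), (60.0, 2.5)]
y_train = [0.0, 0.0, 40.0, 55.0]

model = TwoStagePredictor(NearestNeighbor(), NearestNeighbor()).fit(X_train, y_train)
print(model.predict([(11.0, 0.0), (58.0, 2.4)]))  # [0.0, 55.0]
```

Training the regressor only on non-zero rows keeps the many zero-generation months from pulling its estimates toward zero, which is one plausible reason for using such a design on zero-heavy data.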
Sensitivity analysis examined the impact of feature selection approaches and temporal extrapolation, while uncertainty analysis (using Quantile Regression Forest) quantified prediction uncertainty. A cost-effectiveness analysis compared the data-driven approach to traditional field surveys.
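One common way to check interval-based uncertainty estimates is empirical coverage: the share of held-out observations that fall inside their predicted intervals. The sketch below is not a Quantile Regression Forest. It assumes that some model has already produced a set of plausible values for each firm, takes interpolated quantiles of that set as a 95% interval, and then measures coverage. The function names and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantile(sorted_vals: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile of pre-sorted values, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def prediction_interval(samples: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Central interval that contains `level` of the sampled values."""
    vals = sorted(samples)
    alpha = (1 - level) / 2
    return quantile(vals, alpha), quantile(vals, 1 - alpha)


def coverage(y_true: Sequence[float], intervals: Sequence[Tuple[float, float]]) -> float:
    """Fraction of observations that lie inside their interval (bounds inclusive)."""
    hits = sum(1 for y, (lo, hi) in zip(y_true, intervals) if lo <= y <= hi)
    return hits / len(y_true)


# Invented example: the same spread of plausible values (0..100) for three firms.
lower, upper = prediction_interval(range(101))
print(f"95% interval: {lower:.1f} to {upper:.1f}")  # 95% interval: 2.5 to 97.5

observed = [50.0, 99.0, 10.0]  # invented observations; the second is outside
intervals: List[Tuple[float, float]] = [(lower, upper)] * 3
print(f"coverage: {coverage(observed, intervals):.2f}")  # coverage: 0.67
```

For well-calibrated 95% intervals, coverage on a large test set should sit close to 0.95; values far below that suggest the intervals are too narrow.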

Key Findings

The Random Forest (RF) model achieved the best performance in predicting total HW generation (R² = 0.87, RMSE = 247.40), significantly outperforming multiple linear regression (R² = 0.22). Data balancing using SMOGN substantially improved model accuracy, increasing R² from 0.80 to 0.87. For predicting MHW generation, an ensemble model combining classification and regression achieved an R² of 0.85 and an RMSE of 47.50, outperforming a direct regression model. Sector-independent models (trained separately for each sector) generally outperformed the combined model, especially for sectors where the combined model had lower accuracy, which underlines the heterogeneity of HW generation patterns across sectors. SHAP analysis revealed that firm scale and wastewater indicators were consistently important predictors across models, whereas the importance of manufacturing processes and metal emissions in wastewater varied significantly across sectors. Three case studies in different provinces (Shandong, Zhejiang) and sectors (lead and zinc metallurgy) demonstrated the framework's adaptability, although performance was affected by the size and locality of the data. Sensitivity analysis showed that feature selection based on a combination of Markov Blanket and feature importance enhanced both model performance and efficiency. Under temporal extrapolation, predictive performance was good within the first three months after model training but declined over longer timescales, indicating a need for periodic model updates. Uncertainty analysis showed mostly low uncertainty, with most test samples falling within the 95% confidence intervals. A cost-effectiveness analysis indicated that the data-driven approach is significantly more efficient than traditional field surveys in both time and cost.
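For readers less familiar with the regression metrics quoted above, the sketch below computes four of them (R², RMSE, MAE, MAPE) from their standard definitions. The function names are illustrative and the toy numbers are invented; they are not data from the study.

```python
import math
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    MAPE is undefined when a true value is zero; this sketch raises
    instead of guessing a convention for such rows.
    """
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined for zero-valued targets")
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy values, each prediction off by 10 units.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(
    f"R2={r2(actual, predicted):.3f} RMSE={rmse(actual, predicted):.2f} "
    f"MAE={mae(actual, predicted):.2f} MAPE={mape(actual, predicted):.2f}%"
)
# R2=0.992 RMSE=10.00 MAE=10.00 MAPE=5.21%
```

R² is unitless and describes the share of variance explained, while RMSE and MAE carry the units of the target, so the two RMSE values reported for total HW and MHW are not directly comparable without knowing the scale of each quantity.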

Discussion

The findings demonstrate the effectiveness of a data-driven approach for predicting firm-level hazardous waste generation using readily available wastewater data. The high accuracy achieved (R² = 0.87 for total HW prediction) suggests that this approach can significantly improve the efficiency and accuracy of HW source management. The superior performance of sector-specific models highlights the importance of considering the heterogeneity across industrial sectors when developing predictive models. The incorporation of causal discovery and data balancing techniques enhanced model performance and robustness. The insights from SHAP analysis regarding feature importance can inform targeted interventions and regulations. The findings of this study have implications for policy making and environmental monitoring, potentially leading to more effective regulation and reduction of hazardous waste generation. The cost-effectiveness analysis underscores the practical benefits of this approach, offering a sustainable and scalable solution for hazardous waste management.

Conclusion

This study presents a novel data-driven framework for predicting firm-level hazardous waste generation using readily available wastewater data. The approach demonstrates high accuracy and efficiency, outperforming traditional methods. The findings emphasize the importance of sector-specific models and highlight key predictive variables. While limitations exist regarding data availability and regional scope, the framework offers a valuable tool for improving hazardous waste management. Future research should focus on expanding the model's application to a wider range of sectors and regions, incorporating additional variables such as raw material inputs and product outputs, and refining uncertainty quantification techniques.

Limitations

The study's scope was limited to 1024 enterprises across 10 sectors in Jiangsu Province, China. While representative, the findings might not be fully generalizable to all regions and industrial sectors. Data availability limitations prevented the inclusion of certain potentially relevant variables (e.g., raw material inputs, product outputs, electricity consumption), potentially impacting model accuracy. The model's performance might be sensitive to changes in manufacturing processes and technologies over time, requiring periodic model retraining.
