Introduction
Access to safely managed drinking water is a critical global health concern, particularly in low- and middle-income countries (LMICs), where a significant portion of the population lacks safe and readily available water. Fecal contamination, commonly indicated by the presence of E. coli, is a major cause of waterborne disease. Integrating water quality testing into nationally representative household surveys, such as the Ethiopia Socioeconomic Survey (ESS), provides valuable data, but resource constraints mean it is not always feasible. This study uses machine learning to address that limitation. Previous research has applied machine learning to household survey data to predict outcomes such as poverty, food security, and crop yields. However, its application to predicting drinking water quality from nationally representative household surveys, particularly for surveys without direct quality measurements, remains limited. This research focuses on Ethiopia, using data from the 2015-2016 ESS wave (ESS3, which included water quality testing) to build predictive models and apply them to the earlier (2013-2014, ESS2) and later (2018-2019, ESS4) waves that lacked such data. The study aims to assess the performance of several machine learning algorithms, analyze the importance of different variable types (household characteristics, water service characteristics, and geospatial variables), and apply the best-performing model to predict water quality in the ESS waves without water quality data.
Literature Review
A substantial body of literature highlights the global burden of disease caused by unsafe drinking water, emphasizing the need for effective monitoring and intervention strategies. Studies have increasingly integrated water quality testing into household surveys in LMICs to generate representative data on water quality and establish baselines for safely managed drinking water services. However, the resource intensity of such testing limits its inclusion in all surveys. Machine learning has shown promise in predicting various socioeconomic factors from household survey data. While some studies have used machine learning to predict water quality at smaller scales or for specific water sources, this study's focus on using a nationally representative household survey dataset to predict water quality in surveys without water testing data is novel. This research builds on existing studies by applying machine learning to a national dataset, bridging data gaps and improving understanding of drinking water safety.
Methodology
The study utilized data from three waves of the Ethiopia Socioeconomic Survey (ESS): 2013-2014 (ESS2), 2015-2016 (ESS3), and 2018-2019 (ESS4). ESS3 included a water quality module with E. coli testing, serving as the basis for model development. The presence or absence of E. coli contamination served as the response variable. The study employed several common machine learning classification algorithms: Extreme Gradient Boosting (XGBoost), Generalized Linear Model (GLM), Generalized Linear Model with elastic net regularization (GLMNET), K-Nearest Neighbors (KNN), Random Forest (RF), and Support Vector Machines (SVM). The data preprocessing involved constructing variables (e.g., wealth index), transforming variables (e.g., combining similar water source types), and using the Boruta algorithm for feature selection, focusing on variables frequently available in household surveys. The dataset was split into training (80%) and testing (20%) sets, stratified by contamination status. Model performance was evaluated using metrics such as accuracy, F1-score, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). The best-performing model from ESS3 was then applied to ESS2 and ESS4 to predict water quality in those datasets. Analyses were conducted using the R programming language.
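The split and evaluation steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the study's code (the analyses were conducted in R): the function names stratified_split, classification_metrics, and auc are hypothetical, the labels are toy values with 1 meaning E. coli detected, and the 80/20 proportion is the only detail taken from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the class balance kept in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auc(y_true, scores):
    """AUC as the chance a positive case scores above a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 60 contaminated and 40 uncontaminated samples (hypothetical).
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

Splitting within each class, rather than over the pooled data, keeps the share of contaminated samples the same in the training and testing sets, which is what "stratified by contamination status" refers to. The pairwise AUC is quadratic in the number of samples, which is fine for a sketch but slow for large test sets.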
Key Findings
The Random Forest (RF) algorithm consistently outperformed the other algorithms, achieving the highest accuracy (88.5%; 95% CI 86.3% to 90.6%) and AUC (0.91; 95% CI 0.89 to 0.94) in predicting E. coli contamination in ESS3. Analysis of variable importance showed that distance to the nearest market, annual average rainfall, annual average temperature, water source type, and the prevalence of open defecation were among the most important predictors. Models using only geospatial data or only household-level socioeconomic variables achieved predictive performance comparable to the full model. In contrast, the water source-only model showed poor accuracy, indicating that source type alone is insufficient for accurate prediction. Performance varied across water source types, with the lowest accuracy observed for tanker/vendor supplies, and rural areas consistently showed higher prediction accuracy than urban areas. Applying the RF model to ESS2 and ESS4 yielded similar results, suggesting that it generalizes across survey waves.
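The accuracy above is reported with a 95% confidence interval. The text does not say how that interval was computed, so the following is only a sketch of one common choice, the Wilson score interval for a proportion. The function name wilson_interval is hypothetical, and the counts in the example (885 correct out of 1,000) are made up for illustration; they are not the study's test-set size.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, chosen only so that the point estimate is 88.5%.
low, high = wilson_interval(885, 1000)
print(f"accuracy 88.5%, 95% CI {low:.1%} to {high:.1%}")
```

The interval narrows as the number of test samples grows, so the width of a reported interval says as much about the size of the test set as about the model.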
Discussion
The study successfully demonstrates the feasibility of using machine learning to predict drinking water quality from readily available household survey data, effectively filling data gaps in monitoring efforts. The high accuracy and discrimination of the RF model suggest that this approach can be a valuable tool for assessing drinking water safety, particularly in settings with limited resources for direct water quality testing. The comparable performance of models using only geospatial or household-level variables suggests that either data type alone can be sufficient for generating reasonable predictions, offering flexibility in model application depending on data availability. The findings underscore the importance of considering various factors beyond water source type when assessing water quality and highlight the potential of using machine learning for creating risk maps.
Conclusion
This study presents a robust and adaptable method for predicting drinking water contamination using readily available data from household surveys. The high accuracy achieved by the Random Forest model, coupled with its application across different survey waves, demonstrates its potential to significantly enhance water quality monitoring in resource-constrained settings. Future research could explore incorporating additional variables (e.g., temporal factors, specific water treatment methods), refining feature selection techniques, and validating the model in other LMICs with similar data availability. This approach offers a cost-effective and scalable way to monitor drinking water safety where direct water quality testing is not feasible.
Limitations
The E. coli measurements represent a snapshot in time and might not fully capture the dynamic nature of water contamination. Some discrepancies exist between water source reporting in the main survey and the water quality module. Not all ESS variables were included in the analysis, and the selection of geospatial datasets introduces some subjectivity. The study focused on contamination at the point of collection and may not accurately reflect contamination at the point of use. The study's findings are specific to Ethiopia's context and further validation is needed across diverse settings.