Introduction
The Environment Agency (EA) and Ofwat are responsible for regulating wastewater pollution in England. Despite this, over 400 sewage pollution incidents were publicly reported in 2018, which points to potential underreporting by operators. This study addresses the issue by using machine learning (ML) to detect untreated sewage spills from wastewater treatment plants (WWTPs). The research draws on readily available data streams: daily effluent flow patterns, rainfall data, river flow data, and WWTP alarm data. Environmental Information Regulation (EIR) requests were used to obtain daily treated effluent flow patterns and event duration monitoring (EDM) data from two WWTPs operated by the same water company, chosen for their contrasting population sizes and data availability.
Storm tanks, which temporarily hold excess sewage during rainfall, are a key focus because their overflows can lead to permitted or illegal spills of untreated sewage. Since 2016, operators have reported EDM-detected spills annually, but reporting is not always complete, especially for spills that occur during sub-exceptional rainfall or result from non-compliance with minimum treatment flow rates. The study's objective was to develop methods that analyze daily flow patterns and EDM data to detect these untreated sewage spills, for the benefit of water companies, regulators, and citizen scientists.
The researchers trained pattern recognition algorithms on flow patterns recorded during known spill events and used them to identify similar patterns in unreported events. This builds on existing applications of AI in regulatory compliance, which rely on symbolic representations of regulations, and extends them to quantitative AI methods for compliance checks. The study first applies shape analysis to daily flow patterns to create a compact flow representation, then uses supervised learning with various ML algorithms to build classifiers that distinguish between spill and non-spill events. The classifiers were verified using data not involved in training and then applied retrospectively to a broader dataset. Publicly accessible rainfall, river flow, and alarm data were also incorporated to contextualize and corroborate the findings.
Literature Review
The introduction cites various sources indicating significant financial penalties and criminal prosecutions for major sewage pollution incidents in England. It references reports from the Environment Agency detailing the number of pollution incidents and the proportion reported by operators versus the public. The literature review highlights the lack of complete data on the frequency and impact of wastewater pollution incidents and the need for better methods to detect and report these events. It also mentions previous work employing AI and machine learning techniques for regulatory compliance checking, particularly in the context of predicting compliance with US environmental regulations using publicly available data. The paper discusses existing regulations regarding storm tank overflows and minimum treatment flow rates, highlighting the gaps in data collection and reporting around these aspects.
Methodology
The study employed a multi-step methodology. First, shape analysis was performed on daily effluent flow patterns from 2016-2020 for both WWTPs (WWTP1 and WWTP2) to create a compact representation of the flow data. Daily flow curves were converted into 3D surfaces (dense surface models), and Principal Component Analysis (PCA) was used to identify the principal components of shape variation, with a focus on the first two (PCA1 and PCA2). The components captured variation in the magnitude and timing of flow peaks, as well as seasonal changes.
Second, supervised learning was used to build classifiers that distinguish between 'spill' and 'normal' flow patterns. Twenty variations of Support Vector Machine (SVM) algorithms, differing in kernel function and parameters, were trained on data from 2018-2020 for which EDM data confirmed spill events. The best-performing algorithms were selected on cross-validation accuracy, estimated with 20-fold cross-validation. The optimal classifiers were then used to classify flow patterns from 2016-2018 (semi-blinded) and 2009-2015 (fully blinded). The key parameters and best-performing models are detailed in the supplementary materials.
Third, additional data sources were used to corroborate the ML classifications: rainfall, river flow and level data, and telemetry alarm data from storm tank overflow (STO) and consented overflow level (COL) alarms. To study contiguous 24-hour spills, the daily flow patterns were ranked by their degree of flattening, measured by the standard deviation of the 15-minute interval flow rates. The study also describes how the data were obtained through EIR requests and how the findings can be corroborated with telemetry alarm data.
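The flattening ranking described above can be illustrated with a short sketch. This is not the authors' code: the function name `rank_by_flatness`, the dictionary layout, the synthetic flow values, and the choice to skip days without a full set of 96 readings (24 hours at 15-minute intervals) are all assumptions made for illustration.

```python
from statistics import pstdev

READINGS_PER_DAY = 96  # 24 hours of 15-minute interval readings


def rank_by_flatness(daily_flows):
    """Rank days from flattest to most variable effluent flow.

    daily_flows maps a date string to that day's list of flow rates.
    Days without a full set of readings are skipped (an assumption,
    not a rule taken from the study).
    """
    scored = [
        (day, pstdev(readings))
        for day, readings in daily_flows.items()
        if len(readings) == READINGS_PER_DAY
    ]
    # A low standard deviation means a flat pattern, so sort ascending.
    return sorted(scored, key=lambda item: item[1])


# Synthetic flow rates (arbitrary units), invented for this example.
example_flows = {
    "day_a": [50.0] * 96,                              # completely flat
    "day_b": [30.0 + 20.0 * (i % 2) for i in range(96)],  # strongly varying
    "day_c": [40.0 + (i % 4) for i in range(96)],      # mildly varying
    "day_d": [45.0] * 10,                              # incomplete, skipped
}

ranking = rank_by_flatness(example_flows)
for day, spread in ranking:
    print(f"{day}: std = {spread:.3f}")
```

The flattest days appear first, so taking the first twenty entries of such a ranking would correspond to the "most flattened" patterns examined for each WWTP.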
Key Findings
The shape analysis revealed that the second principal component (PCA2) was strongly correlated with the shape difference between 'normal' and 'spill' affected flow patterns. PCA alone achieved an area under the ROC curve (AUC) of 0.88-0.91 for distinguishing between 'normal' and 'spill' days. Supervised learning with SVM algorithms achieved an average AUC of 0.97 for WWTP1 and 0.96 for WWTP2 in cross-validation. The optimal ML classifiers showed very high agreement with EDM data for the training period (2018-2020). Retrospective analysis of data from 2009-2018 (7160 days) identified 926 days that the ML model classified as potential spills. Corroboration with telemetry alarm data (STO and COL) showed good agreement at WWTP1 (Cohen's kappa: 0.81-1.00) but weaker agreement at WWTP2, where STO data was unreliable. This highlights the importance of data reliability and the potential need for alternative corroborative data sources.
The analysis also identified numerous isolated and contiguous series of 24-h spills, some occurring on days with minimal rainfall, which suggests non-compliance with permits and the possibility that groundwater ingress influences spills, particularly at WWTP2. Examination of the twenty most flattened daily effluent flow patterns for each WWTP showed contrasting characteristics: WWTP1 exhibited consistently low flows during 24-h spills, well below the permitted storm overflow rate, while WWTP2's spills were generally above this rate, although certain spills occurred without rainfall. A significant near-continuous spill at WWTP2 (60 days in 2014) was linked to reported sewage fungus, illustrating the environmental consequences of prolonged spills. Further analysis of data from 2018-2020 revealed similar patterns of prolonged spills, often coinciding with low rainfall.
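For readers unfamiliar with the agreement statistic quoted above, the following minimal sketch computes Cohen's kappa between two sets of daily labels, such as ML classifications and alarm records. The function name `cohens_kappa` and the ten-day example labels are hypothetical and are not taken from the study's data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of days on which both sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each source's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both sources use a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten days: 1 = spill, 0 = normal.
ml_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
alarm_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

kappa = cohens_kappa(ml_labels, alarm_labels)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because 'normal' days greatly outnumber 'spill' days and two sources would agree on most days even if neither carried any information.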
Discussion
The findings demonstrate the effectiveness of the ML approach in retrospectively detecting untreated sewage spills. This approach can aid water companies in identifying poorly managed assets and help regulatory bodies improve compliance checking. The ML approach also empowers public and professional scrutiny. The study highlights the potential inadequacy of relying solely on operator-reported incidents, emphasizing the significant role of public reporting and the need for more comprehensive monitoring. The lack of standardized protocols for data acquisition through EIRs is also identified as a limitation. While the ML model performed exceptionally well, the study acknowledges the dependence on the availability of accurate and complete data. The inconsistencies observed in the corroboration of ML results with different alarm data also suggest the need to tailor the approach to specific WWTP characteristics or consider improved alarm systems. The study's findings have implications for catchment managers, conservation groups, recreational users, and academics involved in modeling and measuring water quality. The identified non-compliance events could be significant in understanding why many surface water bodies in England have poor water quality status. This improved understanding of sewage spill frequency and duration can directly influence regulatory actions, investment decisions, and potential prosecution of water companies.
Conclusion
This study demonstrates a novel application of machine learning to detect unreported sewage spills from WWTPs. The high accuracy of the ML model and the retrospective identification of numerous potential spills highlight the value of this approach for improving WWTP management and regulatory oversight. The findings underscore the importance of data availability and suggest the need for standardized data access protocols. Future work could refine the ML model, expand the data sources, and add predictive capabilities to enable real-time spill detection and prevention. Further research should also investigate incorporating groundwater level data and explore different ML techniques to improve accuracy and applicability.
Limitations
The study relies on data availability, which might be inconsistent due to recent changes in data collection methods and equipment. The accuracy of the ML model is dependent on the quality of the training data. There is a need for more robust data acquisition protocols and potential improvement of existing alarm systems. This research focuses on two WWTPs and might not be generalizable to all WWTPs without further investigation. The retrospective nature of the study, though valuable, could be enhanced with real-time monitoring and predictive capabilities.