Environmental Studies and Forestry

A data-driven approach to identifying PFAS water sampling priorities in Colorado, United States

K. E. Barton, P. J. Anthamatten, et al.

This study by Kelsey E. Barton and colleagues uses random forest classification to predict groundwater contamination risk from PFOS and PFOA in Colorado. It identifies priority sampling locations, with a focus on vulnerable populations, and highlights critical data gaps.

Introduction

Per- and polyfluoroalkyl substances (PFAS) are ubiquitous and persistent in the environment and in human serum because of their extensive use and long environmental and biological half-lives, and some are associated with adverse health outcomes. Monitoring under the third Unregulated Contaminant Monitoring Rule (UCMR3) detected PFAS in public water systems, including notable exceedances in Colorado, which prompted expanded state testing. However, many small and rural systems and most private wells remain untested. The study’s purpose was to develop a statewide groundwater contamination risk prediction map for perfluorooctane sulfonate (PFOS) and perfluorooctanoic acid (PFOA) in Colorado by integrating known and potential sources, environmental vulnerability factors, and existing sampling data. The map is then used to prioritize future sampling, especially for disproportionately impacted communities, and to identify data gaps.

Literature Review

Background literature notes PFAS persistence, bioaccumulation, and health effects, as well as nationwide monitoring under UCMR3 that found widespread detections, including in Colorado. Prior modeling and spatial prediction studies have applied machine learning to prioritize PFAS testing and predict contamination risk in private wells and groundwater. This study builds on such approaches by incorporating diverse source and environmental predictors at a statewide scale and focusing on vulnerable populations and policy-relevant prioritization.

Methodology

Data sources included 1,232 groundwater sampling results for PFOS and PFOA (from public water systems, private wells, and monitoring wells) collected between January 2016 and April 2022 across 10 sampling efforts. Samples from obvious source locations (e.g., within military sites or corrective action sites) were excluded to reduce bias; for wells with multiple measurements, the maximum concentration during the study period was used. Non-detects were assigned 1.5×LoD (limit of detection); LoDs and analyte lists varied across datasets, but PFOS and PFOA were measured in all. Training labels categorized summed PFOS+PFOA as low (<5 ng/L), moderate (5 to <35 ng/L), or high (≥35 ng/L).

Known and potential PFAS point sources were compiled and, where possible, sub-categorized (e.g., fire stations by possession/use of aqueous film-forming foam (AFFF); airports by Part 139 AFFF certification; multiple NAICS-based industrial categories; landfills; wastewater treatment plants; ski resorts; dry cleaners; oil and gas sites; injection wells; AFFF spills; major roadways). For each sampling location, the Euclidean distance to the nearest source in each category was calculated and used as an explanatory feature. Kernel density rasters were computed for features with diffuse spatial influence (e.g., oil and gas wells; aggregated industrial sites). Additional non-source environmental predictors included major Quaternary alluvial aquifers, annual precipitation, aspect, elevation, general geology, soil hydraulic conductivity, slope, soil permeability class, and water flow direction; some variables with low importance (depth to water table, land use, irrigated lands, merged alluvial aquifers) were excluded from the final models.

A supervised machine learning approach using forest-based classification in ArcGIS Pro 2.6.3 was implemented. Models were trained on 75% of labeled samples with 25% held out for validation, using class balancing via “compensation for sparse categories.” Hyperparameters (tree depth, number of trees) and variable sets were tuned. Models were run in 20-iteration batches to assess stability via variable importance boxplots. Final modeling used the full dataset to train and then predict class at unsampled locations across a one-mile fishnet grid; the categorical outputs (0=low, 1=moderate, 2=high) were interpolated to a continuous risk surface using inverse distance weighting to yield values from 0 to 2. Variable importance was evaluated using Gini coefficients.

Prioritization integrated the predicted risk surface with locations of unsampled public water systems (including very small community systems, transient non-community systems (TNCs), and non-transient non-community systems (NTNCs)), private well densities from the state water resources database, schools with independent systems, mobile home parks, and census block groups designated as disproportionately impacted (DI) via Colorado EnviroScreen. Continuous predicted risk values (0–2) were assigned to these features to create a sampling priority list emphasizing vulnerable populations and systems outside UCMR5 coverage.
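
Two of the geometric steps described above, the nearest-source distance feature and the inverse distance weighting of class codes, can be sketched in a few lines of Python. This is only an illustration, not the ArcGIS Pro workflow the study used: the function names, the planar coordinates, and the distance exponent of 2 are assumptions, since the summary does not state the interpolation settings.

```python
import math


def nearest_distance(point, sources):
    """Euclidean distance from one sampling point to the closest source
    in a single source category (one explanatory distance feature)."""
    return min(math.dist(point, s) for s in sources)


def idw(target, known, power=2.0):
    """Inverse-distance-weighted average of class codes (0, 1, 2) at `target`.

    `known` is a list of ((x, y), code) pairs, e.g. predicted grid points.
    `power` is an assumed exponent; the summary does not give the one used.
    """
    num = 0.0
    den = 0.0
    for xy, code in known:
        d = math.dist(target, xy)
        if d == 0.0:
            return float(code)  # target sits exactly on a known point
        w = 1.0 / d ** power
        num += w * code
        den += w
    return num / den


# Hypothetical coordinates in arbitrary planar units.
print(nearest_distance((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)]))   # 5.0
print(idw((0.5, 0.0), [((0.0, 0.0), 0), ((1.0, 0.0), 2)]))      # 1.0
```

Because the result is a weighted average of codes between 0 and 2, it always stays in that range, which is consistent with the continuous 0 to 2 risk surface described above.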

Key Findings

Model performance (validation): the low-risk class had 90% sensitivity and 90% precision; the high-risk class had 80% sensitivity and 71% precision; the moderate-risk class had 58% sensitivity and 55% precision. The final model, trained on all data, correctly classified 96.5% of those same points, so this figure reflects fit to the training data rather than out-of-sample accuracy. Population density was the top predictor by Gini importance; other influential variables included ski resorts, soil permeability class, elevation, airports (with and without Part 139 certification), annual precipitation, AFFF spills, fire stations reporting possession and use of AFFF, and water flow direction. There was no consistent pattern across most source types beyond population density, indicating data gaps in source characterization. Prioritization identified 15 schools and 19 mobile home parks with unsampled systems at elevated risk (3 schools and 12 parks in DI communities), over 300 public water systems at potentially elevated risk (including very small n=70, TNCs n=152, NTNCs n=32), and 20 priority DI-community census blocks each with at least 5,000 household or domestic-use private wells for targeted outreach. The approach also highlighted geographic areas with sparse sampling and source categories needing further investigation.
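
For readers unfamiliar with the per-class metrics, the sketch below shows how sensitivity and precision are computed from a three-class confusion matrix. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the study's validation data, and the function names are illustrative.

```python
CLASSES = ["low", "moderate", "high"]

# Hypothetical validation counts, indexed as confusion[actual][predicted].
confusion = {
    "low":      {"low": 45, "moderate": 4,  "high": 1},
    "moderate": {"low": 5,  "moderate": 12, "high": 3},
    "high":     {"low": 0,  "moderate": 4,  "high": 16},
}


def sensitivity(matrix, cls):
    """Share of points truly in `cls` that the model also labeled `cls`."""
    row = matrix[cls]
    return row[cls] / sum(row.values())


def precision(matrix, cls):
    """Share of points the model labeled `cls` that truly belong to `cls`."""
    predicted_as_cls = sum(matrix[actual][cls] for actual in CLASSES)
    return matrix[cls][cls] / predicted_as_cls


for cls in CLASSES:
    print(f"{cls}: sensitivity={sensitivity(confusion, cls):.2f}, "
          f"precision={precision(confusion, cls):.2f}")
```

Sensitivity answers "how many truly contaminated wells did the model catch", while precision answers "how many wells flagged in a class really belong there". Both matter when a class is used to trigger follow-up sampling.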

Discussion

The predictive map directly supports the research objective by identifying where PFOS/PFOA groundwater contamination is more likely and who could be most affected, enabling targeted sampling and resource allocation. It emphasizes systems and communities not covered by UCMR5—especially TNCs, NTNCs, very small systems, and private wells—thus addressing gaps that leave vulnerable populations at risk. Variable importance results suggest broad urbanization/population density signals and potential contributions from specific sources (e.g., airports, AFFF use, ski resorts), but also underscore limited knowledge about many source types. Recommendations prioritize sampling at schools and mobile home parks (often in DI communities), very small systems, and high-density private well areas in DI communities. The study outlines regulatory and programmatic avenues for improved source investigation (e.g., RCRA corrective action authority; evolving TRI reporting; CERCLA hazardous substance designation for PFOS/PFOA) and calls for better historical records of AFFF use and targeted source-proximate sampling. Overall, the approach provides a practical, data-driven framework for protecting susceptible populations while guiding future data collection to refine predictions.

Conclusion

This work delivers a first statewide random forest-based PFOS/PFOA groundwater risk map for Colorado that integrates diverse source and environmental data with monitoring results to inform sampling priorities. It identifies specific unsampled systems and communities for near-term action and highlights critical data gaps in both geographic coverage and source characterization. The Colorado Department of Public Health and Environment (CDPHE) intends to rerun and refine the model annually as more comprehensive datasets with lower and more consistent detection limits become available, and as additional information on PFAS releases emerges. Future work includes improved source-specific data, separate modeling for different PFAS, evaluation of spatial autocorrelation as source importance rankings stabilize, localized modeling incorporating well depth or uppermost aquifer depth where feasible, and potential inclusion of predictors such as groundwater age.

Limitations

Key limitations arise from data constraints and heterogeneity: variable and often high limits of detection across datasets (with non-detect substitution at 1.5×LoD), preferential sampling near known or suspected sources (notably along the Front Range), inconsistent PFAS analyte suites, and limited information on the occurrence, magnitude, and timing of releases for many source categories. Some non-detects with high LoDs were classified as moderate risk, potentially biasing predictions near certain sources (e.g., areas near ski resorts). Aggregating PFOS and PFOA limits insight into broader PFAS mixtures and source signatures, though these two are frequently co-detected. Several environmental predictors (e.g., depth to water table, land use, irrigated lands, merged alluvial aquifers) were excluded due to low importance or dataset limitations. Statewide modeling is challenged by complex hydrogeology and lack of reliable statewide well depth data; some potentially informative predictors (e.g., groundwater age) were unavailable. These factors may affect generalizability and introduce uncertainty in specific locales and source attributions.
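
The interaction between non-detect substitution and the class thresholds can be made concrete with a short sketch. It assumes that the 1.5×LoD substitution is applied to each analyte before PFOS and PFOA are summed, and that a well's label comes from its highest summed sample; the summary does not spell out either detail, and all names and values below are hypothetical.

```python
ND_FACTOR = 1.5  # non-detect substitution factor stated in the summary


def substituted(conc, lod):
    """Return the measured value, or 1.5 x LoD when `conc` is None (non-detect)."""
    return ND_FACTOR * lod if conc is None else conc


def risk_class(total_ng_l):
    """Bin summed PFOS+PFOA (ng/L) into the three training classes."""
    if total_ng_l < 5:
        return "low"
    if total_ng_l < 35:
        return "moderate"
    return "high"


def label_well(samples):
    """Label a well from its highest summed PFOS+PFOA sample.

    Each sample is (pfos, pfos_lod, pfoa, pfoa_lod) in ng/L; None marks a
    non-detect. Assumption: substitution happens per analyte before summing.
    """
    totals = [
        substituted(pfos, pfos_lod) + substituted(pfoa, pfoa_lod)
        for pfos, pfos_lod, pfoa, pfoa_lod in samples
    ]
    return risk_class(max(totals))


print(label_well([(None, 1.0, None, 1.0)]))  # low: 1.5 + 1.5 = 3 ng/L
print(label_well([(None, 4.0, None, 4.0)]))  # moderate: 6 + 6 = 12 ng/L
print(label_well([(2.0, 1.0, None, 1.0), (30.0, 1.0, 8.0, 1.0)]))  # high: max(3.5, 38)
```

Under these assumptions, a sample with no detections at all lands in the moderate class once the LoD reaches about 1.7 ng/L per analyte, which illustrates why high-LoD datasets can bias predictions upward.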
