
Regional and global hotspots of arsenic contamination of topsoil identified by deep learning
M. Wu, C. Qi, et al.
This study by Mengting Wu, Chongchong Qi, Sybil Derrible, Yosoon Choi, Andy Fourie, and Yong Sik Ok presents a cost-effective technique for assessing topsoil arsenic contamination worldwide. Using VNIR spectroscopy and an optimized deep learning model, it identifies hotspots such as China, Brazil, and California, and draws attention to lesser-known regions such as Gabon.
Introduction
Soil is a finite resource essential for food security and ecosystem services, yet it is widely contaminated by heavy metals through natural processes and human activities (e.g., mining, smelting, wastewater discharge, and agrochemical misuse). Arsenic (As) is a highly toxic element that accumulates in topsoil (0–20 cm), impairs plant growth, threatens food and water quality, and exposes humans through dietary and non-dietary routes. Detecting and mapping As contamination in the topsoil layer, where plant uptake occurs, is vital for environmental and public health protection. Conventional detection via field sampling and laboratory analysis is accurate but localized, time-consuming, and costly. Hyperspectral methods, particularly laboratory VNIR spectroscopy, enable large-scale, rapid, non-destructive assessment, but prior work seldom generalized predictions to large regions and lacked continental-scale, diverse spectral datasets. The study aims to develop a robust deep learning model using a large laboratory topsoil spectral dataset to accurately and efficiently estimate As content and to identify regional and global contamination hotspots and potentially affected populations.
Literature Review
Previous studies have documented widespread soil contamination and the particular risks of As to agriculture, water quality, and human health. Hyperspectral remote sensing, including VNIR spectroscopy, has been used to estimate soil properties and heavy metals, but efforts to predict As content at large scales from laboratory spectra have been limited. Prior models often suffered from small datasets and limited diversity, impairing generalization. Some studies reported spectral features linked to soil components (e.g., water, hydroxyl-bearing minerals, organics, clays, carbonates) that influence As behavior. However, comprehensive, continental-scale datasets for As modeling and demonstrations of regional/global application had not been reported. This work addresses these gaps using the extensive LUCAS 2009 topsoil spectral dataset and deep learning to enable broader-scale prediction and hotspot identification.
Methodology
Datasets: The primary dataset was the 2009 LUCAS topsoil dataset (~20,000 samples across 25 EU Member States). VNIR spectra were measured with a FOSS XDS Rapid Content Analyzer (nominal range 400–2500 nm, of which 400–2499.5 nm was used). As content per site was extracted from ESDAC heavy metal maps. After excluding 1,325 samples without matching As data, 18,675 samples remained. Each sample provided spectral values at 4,200 wavelengths (0.5-nm steps) as inputs and As content as the response. Additional application datasets included: (1) US Rapid Carbon Assessment (RaCA) topsoil spectra (18,200 samples; LabSpec 2500, 350–2500 nm, 1-nm resolution) and (2) the ISRIC global VNIR soil spectral library (4,239 spectra; 350–2500 nm, 10-nm resolution), from which 675 topsoil spectra with spatial information were selected for global assessment.
Preprocessing: LUCAS absorbance spectra were converted to reflectance. Standard normal variate (SNV) transformation and scaling were applied to reduce particle-size effects and harmonize feature scales. As content was transformed to its reciprocal prior to modeling.
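The preprocessing steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code: the absorbance convention A = log10(1/R) is assumed, the function names and the sample values are made up, and the additional "scaling" step is omitted because the summary does not say which scaler was used.

```python
import statistics

def absorbance_to_reflectance(absorbance):
    """Convert absorbance to reflectance, assuming A = log10(1/R), i.e. R = 10**(-A)."""
    return [10.0 ** (-a) for a in absorbance]

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and std."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation of this spectrum
    return [(value - mean) / std for value in spectrum]

def reciprocal_target(as_mg_per_kg):
    """Transform As content (mg/kg) to its reciprocal, the modeling target."""
    return 1.0 / as_mg_per_kg

# Made-up absorbance values for four wavelengths (illustration only).
absorbance = [0.30, 0.45, 0.60, 0.52]
reflectance = absorbance_to_reflectance(absorbance)
snv_spectrum = snv(reflectance)

# An As content of 2.5 mg/kg maps to a reciprocal of 0.4.
target = reciprocal_target(2.5)
```

SNV is applied per spectrum (row-wise), which is why it needs no statistics from the rest of the dataset; a dataset-wide scaler would be fitted on the training split only.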
Data splitting and evaluation protocol: LUCAS data were split into training/validation/testing sets at 80%/10%/10%. To ensure robustness and mitigate randomness, the model was trained and evaluated over 12 random splits; an initial run of 20 repetitions showed that results had stabilized by the 12th. Evaluation metrics included R², MAE, and MSE; Wilcoxon Signed Rank tests and Pearson correlations compared predicted vs. measured values.
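As a rough sketch of this protocol, the snippet below draws 12 seeded 80/10/10 splits over 18,675 sample indices and defines the three reported metrics. The helper names, the seeding scheme, and the rule that the test set receives the rounding remainder are assumptions made for illustration; the summary does not describe how the authors implemented the splits.

```python
import random

def split_indices(n_samples, seed, train_pct=80, val_pct=10):
    """Shuffle sample indices and cut them into train/validation/test parts.

    Integer arithmetic avoids float rounding; the test set takes the remainder.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

def regression_metrics(y_true, y_pred):
    """Return R², MAE, and MSE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": 1.0 - ss_res / ss_tot, "mae": mae, "mse": ss_res / n}

# Twelve random splits, one per seed (the seeds themselves are hypothetical).
splits = [split_indices(18675, seed) for seed in range(12)]
train, val, test = splits[0]

# Tiny made-up example to show the metric function in use.
toy_metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

In a full run, the model would be retrained on each split and the metrics averaged across the 12 repetitions.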
Model development: A fully connected neural network (FCNN) was implemented in Keras. Starting from a default two-hidden-layer architecture, the number of hidden layers was increased to five to improve performance. Neuron counts followed powers of two; hidden layers used dropout during architecture exploration. Hyperparameters tuned via grid search included batch size (64, 256, 512), epochs (100, 250, 700), and dropout (0.25, 0.5, 0.75), while other settings were fixed (input size 4200, output 1, ReLU activation, Adam optimizer, learning rate 0.001). The best configuration used five hidden layers with dropout 0.5, batch size 512, and 700 epochs.
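To make the size of the search concrete, the sketch below enumerates the hyperparameter combinations listed above. It assumes the grid was a full Cartesian product of the three tuned settings, which the summary does not state explicitly; the dictionary keys and the `enumerate_configs` helper are illustrative names, and no model is trained here.

```python
from itertools import product

# Tuned hyperparameters and their candidate values, as listed in the text.
search_space = {
    "batch_size": [64, 256, 512],
    "epochs": [100, 250, 700],
    "dropout": [0.25, 0.5, 0.75],
}

# Settings that were held fixed during the search.
fixed_settings = {
    "input_size": 4200,
    "output_size": 1,
    "activation": "relu",
    "optimizer": "adam",
    "learning_rate": 0.001,
}

def enumerate_configs(space, fixed):
    """Build one config dict per combination of tuned values (full grid assumed)."""
    keys = list(space)
    combos = product(*(space[key] for key in keys))
    return [{**fixed, **dict(zip(keys, values))} for values in combos]

configs = enumerate_configs(search_space, fixed_settings)

# The configuration reported as best in the text.
best_reported = {"batch_size": 512, "epochs": 700, "dropout": 0.5}
best_in_grid = any(
    all(config[key] == value for key, value in best_reported.items())
    for config in configs
)
```

Under the full-grid assumption this yields 3 × 3 × 3 = 27 candidate configurations, each of which would be trained and scored on the validation set.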
Interpretability: Permutation importance (eli5 in Python) identified influential wavelengths and bands. Sensitive regions were linked to soil components and mechanisms (e.g., Fe oxides, molecular water, hydroxyl-bearing minerals, organic matter, carbonates, clays).
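The idea behind permutation importance can be shown with a from-scratch sketch: shuffle one input column at a time and measure how much the prediction error grows. This is not the eli5 API and not the authors' code; the toy model, its coefficients, and the random data are invented purely to make the mechanism visible.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between paired values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(X)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical model: feature 0 matters most, feature 1 is ignored."""
    return 3.0 * row[0] + 0.0 * row[1] + 0.5 * row[2]

data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
```

In the study's setting each column would be one of the 4,200 wavelengths, so high-scoring columns correspond to the influential bands that are then interpreted in terms of soil components.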
Application and harmonization: To apply the LUCAS-trained model to RaCA and ISRIC data, spectral ranges/resolutions were harmonized by interpolating to 0.5 nm and trimming to 400–2499.5 nm to match LUCAS. Several interpolation methods (linear, nearest, zero, quadratic, cubic) were tested on downsampled LUCAS spectra to select an optimal method (based on similarity to original spectra). Predicted As contents were normalized to relative As content for mapping and comparison. US relative As predictions were spatially interpolated via kriging to create continuous maps. Population-at-risk assessment combined predicted high relative As with population density distributions (assuming uniform exposure across land-cover types) to estimate potentially affected populations in the US and globally near sampling locations.
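The harmonization step can be illustrated by resampling a 1-nm spectrum onto the LUCAS grid (0.5-nm steps from 400 to 2499.5 nm, giving 4,200 points). The sketch uses linear interpolation only as an example; the summary does not say which of the tested methods was selected. The function names and the synthetic spectrum are assumptions.

```python
from bisect import bisect_right

def target_grid(start_nm=400.0, stop_nm=2499.5, step_nm=0.5):
    """Wavelength grid matching the LUCAS inputs (4,200 points by default)."""
    n_points = int(round((stop_nm - start_nm) / step_nm)) + 1
    return [start_nm + i * step_nm for i in range(n_points)]

def linear_resample(src_wavelengths, src_values, new_wavelengths):
    """Linearly interpolate a spectrum onto new wavelengths.

    Assumes src_wavelengths is ascending and covers the new range.
    """
    resampled = []
    last = len(src_wavelengths) - 1
    for w in new_wavelengths:
        hi = min(max(bisect_right(src_wavelengths, w), 1), last)
        lo = hi - 1
        w0, w1 = src_wavelengths[lo], src_wavelengths[hi]
        t = (w - w0) / (w1 - w0)
        resampled.append(src_values[lo] + t * (src_values[hi] - src_values[lo]))
    return resampled

# Synthetic 1-nm spectrum over 350-2500 nm (a RaCA-like grid; values are made up).
source_wl = [float(w) for w in range(350, 2501)]
source_refl = [0.1 + 0.0001 * w for w in source_wl]

grid = target_grid()
harmonized = linear_resample(source_wl, source_refl, grid)
```

Building the target grid first and interpolating directly onto it performs the resampling and the trimming to 400–2499.5 nm in a single pass, so the output always has the 4,200 values the model expects.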
Key Findings
- Model performance: The optimum FCNN with five hidden layers and tuned hyperparameters achieved R² of 0.784 (training), 0.688–0.690 (validation), and 0.692 (testing), with MAE ≈ 0.123 and MSE ≈ 0.027 on testing, indicating strong accuracy and generalization. Loss decreased substantially from epochs 1 to 200, with further gains up to 700 epochs (training time ~616.6 s for 700 epochs vs. ~315.1 s for 200 epochs).
- Prediction consistency: No significant difference between measured and predicted As content (Wilcoxon Signed Rank: validation P≈5.60×10⁻²; testing P≈1.16×10⁻¹). Strong correlations (validation R≈0.81; testing R≈0.82). The model outperformed several standard ML models and a 1D-CNN baseline. A critical threshold around 2.5 mg/kg (As-reciprocal ≈ 0.4) was observed: predictions tended to be lower than true values below this threshold and higher above it, suggesting potential benefits of separate models for low vs. high As ranges.
- Important spectral features: Most influential wavelengths included 675 nm; 1408–1409 nm; 1424–1437 nm (top contributors 1434.5, 1431.5, 1428 nm); 1517, 1570.5–1571, 1589.5, 1595 nm; 2188.5–2201 nm; 2230–2244 nm; and 2430–2470 nm (including 2442.5, 2453, 2465, 2470). These align with features related to Fe oxides (As adsorption), molecular water and hydroxyl-bearing secondary minerals, soil organic matter (C–H, C–O, C–N), carbonates, and clays.
- US hotspots and risk: High relative As content identified in Wyoming, California (e.g., Sierra Nevada downstream/San Francisco Bay), South Carolina, North Carolina, Utah, and Montana. Estimated potentially affected US population ≈ 81.92 million. Top states by share of affected population: California 32.3%, Texas 10.8%, North Carolina 8.6%, Georgia 8.4%, Florida 5.6%. Wyoming shows high As but low population, hence only ~0.49% of potentially affected people.
- Global hotspots: High relative As content at sampling points across continents, notably southern Thailand, southeast China (e.g., Guangxi, Hunan, coastal provinces), western Cameroon, Italy, Brazil, and Cuba. Major hotspots combining high As and population density: China and Brazil. Potentially underreported hotspots with high relative As and notable nearby population: Girona region (Spain) and Gabon.
- Practicality: The FCNN approach enables rapid, low-cost, large-scale assessment compared to conventional sampling/analysis, with predictions achievable in minutes over large areas using spectral inputs.
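The state shares reported above can be turned into approximate head counts with simple arithmetic. The sketch below multiplies the reported US total by each reported percentage; the resulting figures are implied by those two numbers and are not values stated in the summary. Variable and function names are illustrative.

```python
# Reported total of potentially affected people in the US, in millions.
total_affected_millions = 81.92

# Reported share of that total by state, in percent.
state_share_percent = {
    "California": 32.3,
    "Texas": 10.8,
    "North Carolina": 8.6,
    "Georgia": 8.4,
    "Florida": 5.6,
    "Wyoming": 0.49,
}

def implied_population_millions(total_millions, share_percent):
    """Multiply the national total by each state's share to get millions of people."""
    return {state: total_millions * pct / 100.0 for state, pct in share_percent.items()}

implied = implied_population_millions(total_affected_millions, state_share_percent)

# Combined share of the five top-ranked states (Wyoming excluded).
top_five_share = sum(pct for state, pct in state_share_percent.items() if state != "Wyoming")

for state, millions in implied.items():
    print(f"{state}: about {millions:.2f} million")
```

The contrast between California (roughly 26 million implied) and Wyoming (roughly 0.4 million implied) shows how population density, not As level alone, drives the risk ranking.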
Discussion
By leveraging the large, diverse LUCAS topsoil spectral dataset and a robust FCNN, the study demonstrates accurate estimation of As content from VNIR spectra with strong generalization. The approach addresses the limitations of traditional site-by-site chemical analyses by enabling rapid, scalable assessments that can guide monitoring and remediation. Identifying both known and potential hotspots in the US and worldwide, and overlaying these with population data, provides a framework for prioritizing areas where As contamination likely poses the greatest ecological and human health risks. The spectral importance analysis aligns with environmental knowledge that As behavior is governed by soil constituents (e.g., SOM, Fe/Al oxides, clays, carbonates, and water-bearing minerals), reinforcing confidence in the model’s mechanism-consistent predictions. The findings directly address the research goal of mapping and risk-screening topsoil As contamination at regional and global scales using spectroscopy and deep learning, offering actionable insights for soil management and public health planning.
Conclusion
This work develops and validates a fully connected neural network trained on the extensive LUCAS VNIR topsoil spectral dataset to accurately estimate As content and identify contamination hotspots. The model exhibits strong robustness and generalization, enables rapid and cost-effective large-scale assessment, and highlights US and global hotspots (e.g., California, China, Brazil) as well as underreported regions (e.g., Girona, Gabon) for targeted investigation and remediation. The approach provides benchmarks for soil regulation and supports sustainable environmental and economic decisions. Future directions include: expanding and unifying global soil spectral libraries, developing calibration transfer and preprocessing protocols to harmonize different instruments, refining exposure and risk models (e.g., accounting for land-cover-specific exposure rates), and exploring specialized models for low vs. high As ranges to improve accuracy.
Limitations
- Global spectral coverage: The available global spectral library for topsoil included only 675 spectra with spatial information, limiting representativeness of global diversity and restricting analyses to specific sites rather than comprehensive global mapping across environments and land types.
- Database heterogeneity: Absence of a unified, standardized soil spectral database; existing datasets were collected with different sampling designs, laboratory analyses, and purposes, complicating integration.
- Instrument variability: Differences in spectral instruments (wavelength ranges, resolutions) necessitate interpolation and calibration transfer; without robust preprocessing and transfer methods, cross-library application can introduce uncertainty.
- Risk assessment assumptions: Population-at-risk estimates assume uniform exposure across land-cover types and rely on spatial proximity/density rather than detailed exposure pathways, potentially over- or under-estimating risk in certain contexts.