Introduction
Soil contamination, driven largely by human activities such as mining, smelting, and improper waste disposal, poses a substantial threat to global food security and environmental health. Arsenic (As), a ubiquitous and highly toxic metalloid that is commonly grouped with heavy metals, is of particular concern because it accumulates readily in soil. High As levels impair plant growth, reduce crop yields, and contaminate water sources, exposing people through both dietary and non-dietary pathways. The topsoil layer (0-20 cm) is especially vulnerable to As contamination and is the primary source of As uptake by plants, which intensifies the risk. Conventional As detection methods rely on on-site sampling and laboratory analysis, so they are time-consuming, expensive, and limited in spatial coverage. Hyperspectral remote sensing offers a potential alternative that is large-scale, non-destructive, and cost-effective. Laboratory spectral analysis is more accurate, but previous studies have rarely used it to predict As content at large scales. This research addresses these limitations by developing a deep learning model that uses a large, comprehensive spectral dataset to predict topsoil As levels, allowing for regional and global assessments of contamination.
Literature Review
Existing literature highlights the significant problem of arsenic contamination in soils worldwide. Studies have shown the detrimental effects of arsenic on plant growth and human health. Previous research has explored the use of hyperspectral remote sensing and spectral analysis to estimate topsoil arsenic content. However, these studies often lack the scale and diversity of soil types necessary for robust, global-scale predictions. The novelty of this research lies in its utilization of the LUCAS topsoil spectral dataset, the largest and most comprehensive dataset available, to train a deep learning model capable of large-scale predictions.
Methodology
This study used a large, integrated dataset of topsoil visible and near-infrared (VNIR) spectra with corresponding arsenic (As) content. The primary dataset was the 2009 LUCAS topsoil dataset, containing approximately 20,000 topsoil samples from 25 European Union member states. After removing samples lacking As information, the final dataset comprised 18,675 samples. The VNIR spectra were obtained using a FOSS XDS Rapid Content Analyzer (λ = 400-2500 nm), and As content was determined using inductively coupled plasma optical emission spectrometry (ICP-OES). The study also included datasets from the US Rapid Carbon Assessment (RaCA) project (18,200 samples) and the ISRIC Soil VNIR Spectral Library (675 topsoil spectra with spatial information). A fully connected neural network (FCNN) model was developed using Keras, a Python-based deep learning library. The architecture was adjusted by increasing the number of hidden layers from two to five, and the dropout rate, batch size, and number of epochs were tuned by grid search. Model performance was evaluated using the coefficient of determination (R-squared), mean absolute error (MAE), and mean squared error (MSE). The optimized model was then applied to the US and global datasets to predict topsoil As content and identify potential contamination hotspots. Kriging spatial interpolation was used to create prediction maps for the US. The population potentially at risk was assessed by considering population density near sampling points with high predicted As content.
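The evaluation metrics and the grid search described above can be sketched in plain Python. This is a minimal illustration and not the study's code: the function names, the example grid values, and the `evaluate` callable (which would train a model with the given settings and return a validation score) are all assumptions made for the example.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE: average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MSE: average squared difference between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R-squared: 1 minus residual sum of squares over total sum of squares.

    Undefined (division by zero) when all measured values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def grid_search(
    param_grid: Dict[str, Sequence],
    evaluate: Callable[[Dict], float],
) -> Tuple[Dict, float]:
    """Try every combination in param_grid and keep the highest-scoring one.

    `evaluate` is a hypothetical callable supplied by the caller; in practice
    it would train a model with `params` and return, for example, validation
    R-squared.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values actually searched in the study are not given here.
example_grid = {
    "dropout_rate": [0.1, 0.2, 0.3],
    "batch_size": [32, 64],
    "epochs": [100, 200],
}
```

Exhaustive search is simple but its cost grows multiplicatively with each added hyperparameter: the hypothetical grid above already requires 3 × 2 × 2 = 12 training runs.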
Key Findings
The optimized FCNN model demonstrated high predictive accuracy and generalization capability. The model achieved R-squared values of 0.784, 0.688, and 0.692 on the training, validation, and testing sets, respectively. The analysis identified specific wavelengths and spectral bands most strongly associated with As content, providing insights into the underlying mechanisms of As detection. Application of the model to the US dataset revealed California, Wyoming, South Carolina, North Carolina, Utah, and Montana as areas with high relative As content. An estimated 81.92 million people in the US were identified as potentially at risk, with California having the largest at-risk population. Globally, China and Brazil were identified as major hotspots, while areas such as Gabon, which exhibited high As content but lack documentation, were identified as potential hotspots. The model also outperformed other machine learning models and a 1D convolutional neural network.
Discussion
The results demonstrate the effectiveness of the developed FCNN model for predicting topsoil As content and identifying contamination hotspots on both regional and global scales. This approach offers a significant advantage over traditional methods due to its cost-effectiveness, speed, and scalability. The identification of specific wavelengths sensitive to As content provides valuable insights into the underlying biogeochemical processes related to As accumulation in soil. The study highlights the importance of integrating information on As levels and population density to accurately assess the risks associated with As contamination. While the study identifies several key hotspots, further research is needed in understudied regions like Gabon to fully assess the global extent of the problem. The findings are particularly relevant to informing soil remediation strategies and developing sustainable land management practices.
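One way to combine predicted As levels with population data is sketched below. It is an illustration under stated assumptions, not the study's procedure: the threshold, the search radius, the gridded population cells, and all function names are hypothetical, and every person within the radius of a high-As sampling point is counted as equally exposed.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def population_at_risk(
    samples: List[Tuple[float, float, float]],  # (lat, lon, predicted As)
    cells: List[Tuple[float, float, float]],    # (lat, lon, population in cell)
    threshold: float,                           # hypothetical As cut-off
    radius_km: float,                           # hypothetical search radius
) -> float:
    """Sum the population of cells lying within radius_km of any high-As sample.

    Each cell is counted at most once, even if several hotspots are nearby.
    """
    hotspots = [(lat, lon) for lat, lon, as_value in samples if as_value >= threshold]
    total = 0.0
    for c_lat, c_lon, population in cells:
        near_hotspot = any(
            haversine_km(c_lat, c_lon, h_lat, h_lon) <= radius_km
            for h_lat, h_lon in hotspots
        )
        if near_hotspot:
            total += population
    return total
```

The pairwise distance check is quadratic in the number of points and cells; a spatial index would be the usual choice for datasets with thousands of sampling points.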
Conclusion
This study successfully developed and applied a fully connected neural network model to predict topsoil arsenic contamination using visible near-infrared spectroscopy. The model demonstrated high accuracy and generalizability, identifying known and potential hotspots in the US and globally. The research highlights the potential of deep learning for large-scale environmental monitoring and risk assessment. Future research should focus on expanding the global spectral library to improve representation of diverse regions and soil types and on integrating additional environmental factors to refine risk assessments.
Limitations
The global-scale assessment was limited by the size of the available global spectral library, which may not fully represent the diversity of regions worldwide. The risk assessment considered only spatial distribution and assumed uniform As exposure across land cover types, which may not reflect reality. In addition, the accuracy of the model depends on the quality of the spectral data used for training.