Deep learning for detecting and characterizing oil and gas well pads in satellite imagery

Environmental Studies and Forestry


N. Ramachandran, J. Irvin, et al.

This research, conducted by Neel Ramachandran, Jeremy Irvin, Mark Omara, Ritesh Gautam, Kelsey Meisenhelder, Erfan Rostami, Hao Sheng, Andrew Y. Ng, and Robert B. Jackson, presents a groundbreaking deep learning approach to mapping oil and gas infrastructure using high-resolution satellite imagery, revealing previously unmapped well pads and storage tanks in key basins.

Introduction
Methane is a potent greenhouse gas with a short atmospheric lifetime, making rapid mitigation impactful for near-term climate forcing. Fossil fuel activities account for a large share of anthropogenic methane emissions, with a significant portion originating from oil and gas production infrastructure such as well pads and, frequently, storage tanks. Accurate bottom-up inventories are hindered by incomplete or outdated facility data, while top-down satellite and airborne measurements require granular infrastructure maps for source attribution. Existing U.S. datasets (e.g., HIFLD, Enverus) have inconsistencies, outdated records, and lack sub-facility information (e.g., tanks), and global databases have major gaps in well pad and storage tank coverage. Advances in high-resolution satellite imagery and deep learning present an opportunity to build comprehensive, updatable infrastructure maps. This study develops and validates deep learning models to detect well pads and storage tanks from freely available high-resolution imagery, evaluates performance across diverse U.S. basins, deploys models at basin scale in the Permian and Denver basins, and compares detections to existing datasets to quantify coverage gaps and new detections. The goal is to enable improved methane emissions estimation and attribution by enhancing infrastructure completeness and accuracy.
Literature Review
The paper situates the work within the context of discrepancies between bottom-up (inventory-based) and top-down (atmospheric) methane estimates, noting success in limited regions where detailed facility-level datasets exist. It reviews satellite capabilities: global sensors (SCIAMACHY, GOSAT, TROPOMI) identify coarse hotspots, while targeted instruments (GHGSat, AVIRIS-NG) detect high-emission point sources with limited coverage; upcoming missions (MethaneSAT, CarbonMapper) promise improved resolution and coverage. U.S. datasets (HIFLD, Enverus) differ in scope, recency, and accuracy; HIFLD aggregates state data with variable coverage and may include decommissioned sites, and sub-facility (e.g., tanks) inventories are scarce. At the global level, recent oil and gas infrastructure databases using reported data still lack well pad and storage tank detail. Prior deep learning work has mapped energy infrastructure (e.g., buildings, solar PV, wind turbines, oil refineries) and demonstrated feasibility for O&G well pads and tanks in the Denver basin, but previous models were not assessed basin-wide against existing repositories. This study extends prior efforts with larger-scale, cross-basin training, rigorous evaluation, and basin-scale deployments.
Methodology
Well pad detection pipeline: A two-stage approach. Stage 1 is object detection using RetinaNet with a ResNet-50 backbone, trained to produce axis-aligned bounding boxes for well pads; the detector is thresholded to maximize recall. Stage 2 is a verification model using EfficientNet-B3 for binary classification of centered crops around candidate detections, eliminating false positives; it is thresholded for 99% precision to maximize final precision.

Training data for well pads: 88,044 Google Earth satellite basemap images (30–70 cm resolution in the US), including 10,432 positives with 12,490 well pad boxes and 77,612 negatives. Positives came from (a) expert annotations via manual basin panning in QGIS and (b) crowdsourced labeling (Scale AI) of randomly sampled active wells (post-2005) from HIFLD and Enverus; validation and test sets were manually reviewed. Negatives included random basin/city samples and hard negatives resembling pads (roads, wind turbines, exposed soil, agricultural fields, etc.) sourced from OpenStreetMap and a geo-visual similarity search. Images were 512–640 px tiles projected to Web Mercator (zoom level 16; ~197–223 m tiles). Data were split 75/15/10 into train/validation/test with overlap control.

Well pad detector training: RetinaNet initialized with ImageNet weights, trained with Adam (lr = 1e-6), batch size 8, and stochastic augmentations (random crops, flips, scaling, color jitter). Class imbalance was addressed by inverse-frequency sampling. During evaluation, random cropping and scaling simulate the deployment distribution, where pads are off-center. An IoU threshold of 0.3 was used for matching because pad boundaries are ambiguous; AP was computed over confidence thresholds, and precision/recall were also measured at thresholds set for high recall on validation. Predictions within 50 px of image borders were excluded from evaluation.

Verification model: EfficientNet-B3 trained on centered crops around detection candidates plus the same negatives, using Adam (lr = 1e-4), oversampling of positives, and similar augmentations during training; evaluation was deterministic.
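The detect-then-verify filtering can be sketched as follows. This is a minimal illustration: `detect`, `verify`, and the threshold values are hypothetical placeholders standing in for the trained RetinaNet and EfficientNet-B3 models, not the paper's actual interfaces or operating points.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def two_stage_detect(
    tiles: Iterable,
    detect: Callable[[object], List[Tuple[Box, float]]],  # stage 1: box proposals
    verify: Callable[[object, Box], float],               # stage 2: crop classifier
    detect_threshold: float = 0.2,   # permissive: stage 1 favors recall
    verify_threshold: float = 0.9,   # strict: stage 2 favors precision
) -> List[Tuple[object, Box]]:
    """Detect-then-verify: a low-threshold detector proposes candidate boxes,
    and a binary classifier on centered crops removes false positives."""
    kept = []
    for tile in tiles:
        for box, score in detect(tile):
            if score < detect_threshold:
                continue
            if verify(tile, box) >= verify_threshold:
                kept.append((tile, box))
    return kept
```

The design rationale is that the two error modes are decoupled: the detector's threshold is set low so few true pads are missed, while the verifier's threshold is set high (99% precision on validation in the study) so spurious candidates are discarded.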
The verification threshold was chosen for 99% precision on validation, with F1 used for checkpoint selection. The combined pipeline feeds detector candidates to the verifier; candidates that fail verification are removed.

Generalization tests: Additional labeled datasets for Appalachia, TX–LA–MS Salt, Anadarko, and Uinta–Piceance were collected similarly (crowdsourced positives with manual review) to evaluate cross-basin performance of models trained only on Permian + Denver data.

Storage tank detection: Object detection with Faster R-CNN and a Res2Net backbone trained on 10,470 labeled tanks across 1,833 well pad images; negatives are well pads without tanks. The learning rate (5e-5) and anchor scales [4, 6, 8] were tuned for small objects. Evaluation used AP, precision, recall, and mean absolute error (MAE) of per-pad tank counts, stratified by pads with and without tanks. No two-stage verification was used for tanks, given the model's high standalone accuracy and tanks' clustered, adjacent appearance.

Deployment at basin scale: The Permian and Denver basins were tiled into >13.9 million 512×512 images (overlapping by 100 px) covering ~313,340 km². Tiles were processed by the detector; non-maximum suppression was applied with IoU = 0.2; post-processing converted pixel coordinates to lat/lon, merged overlapping detections, and removed low-confidence detections using validation thresholds. Candidate pad crops were fetched at the appropriate zoom and verified either by the verification model (above the 99% precision threshold) or by spatially matching to reported pad data (Enverus, HIFLD) using well clustering, a 50 m buffer, and a spatial join; only candidates verified by either path were retained. Verified pads were then passed to the tank detector, whose predictions were post-processed similarly. Inference used 4× NVIDIA RTX A4000 GPUs with a total batch size of 96 (~14 hours for detection).
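Because tiles overlap by 100 px, the same pad can be detected multiple times; greedy non-maximum suppression with an IoU threshold (0.2 in the deployment) keeps only the highest-scoring box per cluster. A minimal sketch of the standard algorithm, not the authors' implementation:

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.2):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any
    remaining box that overlaps it above the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep
```

A low IoU threshold such as 0.2 merges aggressively, which suits this setting: duplicate detections of one pad from adjacent overlapping tiles rarely align exactly, so even moderate overlap signals the same object.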
Key Findings
- Test-set performance (well pads): Permian AP 0.959, precision 0.975, recall 0.906; Denver AP 0.928, precision 0.935, recall 0.901; overall AP 0.944, precision 0.955, recall 0.904 (means over 10 runs with stochastic evaluation).
- Verification stage impact: the standalone verification model achieved precision 1.0 with recall >0.97 in both basins. Adding verification improved pipeline precision by 0.013 (Permian) and 0.009 (Denver) at matched recall versus using a higher detection threshold alone.
- Sensitivity to pad size: AP for small pads (<41 m²) was 0.853 (Permian) and 0.700 (Denver); for medium pads (41–164 m²), 0.962 (Permian) and 0.976 (Denver); for large pads (>164 m²), 0.944 (Permian) and 0.953 (Denver). Denver contains more small pads (18.3% vs 5.2% in the Permian), contributing to its lower performance.
- Joint vs basin-specific training: the joint Permian + Denver model outperformed basin-specific models in both basins by ~0.004–0.006 AP and generalized better across distribution shifts.
- Cross-basin generalization (well pads): Uinta–Piceance achieved high precision/recall (0.948/0.943); TX–LA–MS Salt >0.80; Anadarko >0.85; Appalachia was lower (precision 0.647, recall 0.552) due to small, obscured pads and environmental differences.
- Basin-scale deployment counts: 194,973 verified well pad detections in the Permian; 36,591 in Denver.
- Recall against reported datasets (active pads): Permian 80.5% vs Enverus, 73.3% vs HIFLD; Denver 68.1% vs Enverus, 46.1% vs HIFLD. Across both basins, 79.5% of active Enverus pads were captured.
- Time and imagery effects: recall tracks pad-size trends, with a sharp drop for pads completed in the late 2010s due to outdated imagery in the Google Earth basemap (Permian imagery is predominantly 2014–2019). In the Permian, recall was 0.97 for pads completed in 2012 but 0.67 for those completed in 2019. Recently constructed, high-producing pads were disproportionately missed due to imagery staleness.
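Recall against a reported dataset can be estimated by matching each reported pad to a detection within a distance buffer (the study used well clustering, a 50 m buffer, and a spatial join). A simplified haversine-based sketch, assuming plain point coordinates for both sources rather than the study's full clustering step:

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def recall_vs_reported(reported, detected, buffer_m=50.0):
    """Fraction of reported pads with at least one detection within buffer_m.
    reported, detected: lists of (lat, lon) tuples."""
    if not reported:
        return 0.0
    hits = sum(
        1
        for rlat, rlon in reported
        if any(haversine_m(rlat, rlon, dlat, dlon) <= buffer_m for dlat, dlon in detected)
    )
    return hits / len(reported)
```

At basin scale this brute-force pairwise scan would be replaced by a spatial index (as in a GIS spatial join), but the matching criterion is the same.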
- Production-weighted recall and imagery filtering (Table 3): Permian, 2010–2021: pad recall 0.889 (0.966 excluding outdated imagery), production recall 0.541 (0.992 excluding). All years: pad recall 0.805 (0.833 excluding), production recall 0.555 (0.981 excluding). Denver, 2010–2021: pad recall 0.701 (0.780 excluding), production recall 0.681 (0.706 excluding). All years: pad recall 0.788 (0.986 excluding), production recall 0.786 (0.979 excluding).
- New well pads: 67,201 (Permian) and 24,525 (Denver) detections did not match Enverus or HIFLD. Sample validation (n = 2,500 per basin) indicates 83.04% (Permian) and 57.9% (Denver) are true pads, implying >55,800 and ~14,200 new pads respectively (~33% increase over existing repositories). Estimated overall deployment precision is 0.909; basin-level estimated precision is 0.944 (Permian) and 0.720 (Denver).
- Storage tank detection (test): Permian AP 0.989, precision 0.965, recall 0.972; MAE 0.072 overall, 0.009 for pads without tanks, 0.220 for pads with tanks. Denver AP 0.981, precision 0.957, recall 0.963; MAE 0.101 overall, 0.011 (no tanks), 0.360 (with tanks). Overall: AP 0.986, precision 0.962, recall 0.968, MAE 0.082.
- Storage tank deployment: 175,996 tank detections across both basins (83.6% in the Permian). Sample precision: 96.9% (Permian), 95.9% (Denver). Estimated totals: >142,000 tanks in the Permian, ~27,000 in Denver. Estimated fraction of pads with tanks: 18.0% (Permian), 23.2% (Denver). Mean tanks per pad: 4.194 (Permian), 3.397 (Denver).
- Production correlations: at the pad level, correlation between tank count and production (kBOE/d) was low (r = 0.20 Permian, 0.22 Denver). Aggregated to 5 km² cells, correlations were moderate (r = 0.53 Permian, 0.68 Denver). Correlation with gas production exceeded that with oil: Permian r = 0.55 (gas) vs 0.50 (oil); Denver r = 0.72 (gas) vs 0.58 (oil).
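The gap between pad recall and production-weighted recall arises because the latter weights each reported pad by its production, so missing a few high-producing pads costs far more than missing many marginal ones. A minimal sketch of the two metrics; the `(detected, production)` tuple format is a hypothetical input representation, not the study's data schema:

```python
def pad_and_production_recall(pads):
    """pads: list of (detected: bool, production: float) per reported pad.
    Returns (pad recall, production-weighted recall)."""
    total = len(pads)
    total_prod = sum(prod for _, prod in pads)
    detected = sum(1 for hit, _ in pads if hit)
    detected_prod = sum(prod for hit, prod in pads if hit)
    return detected / total, detected_prod / total_prod
```

This is why outdated imagery depresses production recall disproportionately: the pads it hides are recent, and recent pads produce the most.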
Discussion
The study demonstrates that deep learning applied to high-resolution satellite imagery can effectively map oil and gas infrastructure at scale, addressing major data gaps that impede methane emissions quantification and source attribution. The two-stage well pad pipeline achieves high precision and recall in controlled test sets and maintains strong performance at basin scale, capturing most active pads in a frequently updated dataset (Enverus). The verification stage specifically reduces false positives beyond simple thresholding, improving overall precision, which is crucial when scaling to tens of millions of tiles. Performance stratification reveals that small well pads and complex urban/suburban contexts (prevalent in the Denver basin) remain challenging, suggesting targeted data augmentation and label strategies are needed to improve recall on smaller features. Cross-basin evaluations show strong generalization to visually similar regions (e.g., Uinta–Piceance) and reduced accuracy in regions with substantial distribution shift (e.g., Appalachia), indicating that region-specific fine-tuning or diversified training data is necessary for robust national/global deployments. Storage tank detection achieves very high accuracy and adds valuable sub-facility information typically absent in public/private datasets, enabling refined emissions source characterization. Deployment analyses highlight the material impact of outdated imagery on detecting recently constructed, high-producing pads, thereby depressing both pad and production recall; filtering out outdated imagery nearly closes this gap for recent years, underscoring the importance of imagery recency for operational monitoring. Overall, the resulting datasets provide a more complete, spatially explicit basis for reconciling bottom-up and top-down methane estimates and for targeting mitigation at high-activity hotspots.
Conclusion
This work introduces and validates a scalable deep learning framework to detect and verify oil and gas well pads and to detect storage tanks from freely available high-resolution satellite imagery. Trained on curated labels from experts and crowdsourcing, the models achieve high accuracy across two major U.S. basins and identify substantial numbers of previously unmapped well pads and storage tanks, increasing infrastructure coverage by roughly one-third over existing repositories. The approach enhances the granularity and completeness of oil and gas infrastructure databases, supporting improved methane emissions estimation and source attribution. Future directions include: expanding training data to better represent small/older pads and diverse geographies; iterative active-learning cycles incorporating deployment errors; adopting higher-cadence, high-resolution commercial imagery (e.g., Airbus SPOT, PlanetScope) to mitigate staleness; scaling globally, especially in high-producing regions with limited transparency; and extending detection to other emitting assets (e.g., pump jacks, flares, compressor stations, terminals) to build comprehensive, public infrastructure inventories.
Limitations
- Imagery recency: outdated basemap imagery led to missed detections of recently constructed, high-producing pads, disproportionately reducing production recall; access to more recent high-resolution imagery would alleviate this.
- Distribution shift: performance drops in regions with different visual characteristics (e.g., Appalachia, with small, forested pads) indicate limited generalization without region-specific data.
- Label bias: training labels were partly curated by experts and partly sampled from post-2005 wells, over-representing larger, more prototypical pads and under-representing older and smaller pads, contributing to lower recall on small pads.
- Small-object challenges: single-wellhead and small pads are harder to detect, especially in visually complex urban and suburban environments, increasing false negatives and some false positives.
- Noisy baselines for deployment evaluation: reported datasets (HIFLD, Enverus) contain inaccuracies (coordinates, outdated statuses), likely causing true deployment recall to be underestimated.
- Limited negative diversity: despite hard-negative mining, the visual diversity encountered at basin scale may exceed that of the training negatives, leaving residual false positives.
- Storage tank verification: the lack of external ground-truth datasets at scale prevented basin-level recall estimation for tanks; verification relied on model confidence and sampling-based precision estimates.