
Physics
Deep learning at the edge enables real-time streaming ptychographic imaging
A. V. Babu, T. Zhou, et al.
Discover how a team of researchers, including Anakha V. Babu and Tao Zhou, has combined artificial intelligence with high-performance computing to perform real-time inversion of X-ray ptychography data at unprecedented speeds. The approach enables low-dose imaging with significantly less data, turning a traditionally offline high-resolution imaging technique into a real-time capability.
~3 min • Beginner • English
Introduction
The paper addresses the challenge of achieving real-time ptychographic imaging given rapidly increasing data rates and computational demands at modern light sources. Ptychography recovers sample structure by inverting far-field diffraction patterns obtained by scanning a coherent beam with spatial overlap, but conventional iterative phase retrieval methods are too slow to provide live feedback and require significant overlap, which limits throughput and increases dose. With faster detectors and brighter sources, a single 1 mm × 1 mm raster scan at a 100 nm step can yield ~200 TB of 16-bit data (with a ~1 Mpixel detector) in under 24 hours and require massive compute for phase retrieval. While deep learning has shown promise as a faster surrogate for iterative methods, especially under low-light conditions, there had been no demonstration of real-time coherent imaging via deep learning, either at the edge or via on-demand HPC. The study's purpose is to create and demonstrate an AI-enabled, edge-HPC workflow that performs real-time inversion of streamed X-ray ptychography data at up to 2 kHz, reduces dose by removing oversampling constraints, and enables real-time experimental steering.
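To make the data-volume estimate concrete, here is a quick back-of-the-envelope check in Python, assuming the ~1 Mpixel detector and 16-bit pixels quoted above; the numbers are illustrative only.
```python
# Rough data volume for a 1 mm x 1 mm raster scan at 100 nm steps with a
# ~1 Mpixel detector recording 16-bit diffraction patterns.
scan_width_m = 1e-3                                   # 1 mm in each direction
step_m = 100e-9                                       # 100 nm scan step
positions_per_axis = round(scan_width_m / step_m)     # 10,000 positions per axis
n_patterns = positions_per_axis ** 2                  # 1e8 diffraction patterns

pixels_per_pattern = 1024 * 1024                      # ~1 Mpixel detector
bytes_per_pixel = 2                                   # 16-bit counts

total_tb = n_patterns * pixels_per_pattern * bytes_per_pixel / 1e12
print(f"{n_patterns:.1e} patterns -> ~{total_tb:.0f} TB of raw data")  # ~210 TB, i.e. on the order of 200 TB
```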
Literature Review
Prior works established ptychography’s wide applicability and high resolution across X-ray, optical, and electron modalities, including non-destructive nanoscale imaging of large objects and record-breaking sub-angstrom resolution in electron ptychography. Iterative solvers (e.g., PIE and DM families) have seen optimization and HPC scaling, but their throughput lags modern detector rates. Deep convolutional networks have been explored for coherent imaging and can outperform iterative algorithms in speed and, increasingly, quality under sparse/low-light conditions. However, before this work, real-time coherent imaging via deep learning had not been demonstrated at the edge or through HPC on-demand resources. The paper situates itself within advances in scanning strategies, detector technologies with rapidly increasing data rates, and emerging AI surrogates for scientific imaging, referencing toolkits (e.g., Tike), edge acceleration (e.g., TensorRT), and strategies like continual learning for adapting models during experiments.
Methodology
Workflow: Three concurrent components enable real-time streaming ptychographic imaging: (1) measurement, (2) online training, and (3) live inference at the edge. Diffraction patterns are captured while a focused X-ray beam is scanned over the sample in a spiral trajectory. After each scan, the data are sent to HPC resources, where iterative phase retrieval (e.g., PIE/DM via Tike) produces labels; cropped phase outputs (128×128 around the beam center) are paired with the corresponding diffraction patterns to expand the training corpus. A neural network is trained online and periodically pushed to an edge device for low-latency inference. Diffraction patterns are streamed to the edge over a structured data protocol; the edge infers a phase image per pattern and returns stitched images to the beamline in real time for feedback. Continual learning updates the model as new data arrive.
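As a rough illustration of how these roles interleave, the toy sketch below uses in-process queues and threads as stand-ins for the real detector stream, edge device, and HPC services; all names, shapes, and the placeholder "inference" step are assumptions, not the authors' implementation.
```python
# Toy sketch of the concurrent acquisition and edge-inference roles, with
# in-process queues standing in for the real streaming protocol. The HPC side
# (iterative retrieval producing new training labels) is only noted in comments.
import queue
import threading
import numpy as np

pattern_q = queue.Queue()    # detector -> edge stream of diffraction patterns
result_q = queue.Queue()     # edge -> beamline: inferred phase tiles for live stitching

def acquisition(n_patterns=32):
    """Stand-in detector: emit random 'diffraction patterns' with scan positions."""
    rng = np.random.default_rng(0)
    for i in range(n_patterns):
        pattern_q.put((rng.random((128, 128)), (i % 8, i // 8)))
    pattern_q.put(None)      # end-of-scan sentinel

def edge_inference():
    """Stand-in edge device: per-pattern 'phase inference' streamed back live."""
    while (item := pattern_q.get()) is not None:
        pattern, position = item
        phase = np.full((128, 128), pattern.mean())   # placeholder for the NN forward pass
        result_q.put((phase, position))               # returned for real-time stitching/feedback

threads = [threading.Thread(target=acquisition), threading.Thread(target=edge_inference)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"inferred {result_q.qsize()} phase tiles; the completed scan would also go to HPC "
      "for iterative retrieval to generate fresh training labels")
```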
Hardware and computation: Phase retrieval and model training run at the Argonne Leadership Computing Facility (ALCF) on ThetaGPU (NVIDIA DGX A100 nodes, 8×A100 per node), with the APS and ALCF connected by a 200 Gbps link that enables near real-time periodic retrieval. During experiments, 7 DGX nodes were used for concurrent phase retrieval and rendering, and 8 GPUs were used for PtychoNN model training. Workflow orchestration used Globus Automate/Flows, funcX for remote function calls and resource management, and Globus for data management.
Neural network: A streamlined variant of PtychoNN (PtychoNN 2.0) is used for live inference. It predicts phase only (no amplitude branch) and uses fewer filters per convolutional layer, giving ~0.7M trainable parameters versus ~4.7M for the original PtychoNN. The smaller model lowers training and inference latency while preserving comparable accuracy. The edge device (a Jetson AGX Xavier) supports both native PyTorch and TensorRT, with TensorRT providing significant speedups.
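For intuition, a minimal PyTorch encoder-decoder in this spirit is sketched below; the layer widths and activations are assumptions chosen to illustrate the phase-only, reduced-filter design, not the published PtychoNN 2.0 configuration.
```python
# Illustrative phase-only encoder-decoder with a deliberately small filter count.
# Layer widths and the Tanh output are assumptions of this sketch.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyPtychoNet(nn.Module):
    def __init__(self, widths=(16, 32, 64)):
        super().__init__()
        w1, w2, w3 = widths
        self.encoder = nn.Sequential(
            conv_block(1, w1), nn.MaxPool2d(2),                # 128 -> 64
            conv_block(w1, w2), nn.MaxPool2d(2),               # 64 -> 32
            conv_block(w2, w3), nn.MaxPool2d(2))               # 32 -> 16
        self.decoder = nn.Sequential(                          # single (phase-only) decoder branch
            conv_block(w3, w2), nn.Upsample(scale_factor=2),   # 16 -> 32
            conv_block(w2, w1), nn.Upsample(scale_factor=2),   # 32 -> 64
            conv_block(w1, w1), nn.Upsample(scale_factor=2),   # 64 -> 128
            nn.Conv2d(w1, 1, 1), nn.Tanh())                    # bounded output (assumed here)

    def forward(self, diffraction):                            # (N, 1, 128, 128) intensities
        return self.decoder(self.encoder(diffraction))

model = TinyPtychoNet()
print(f"{sum(p.numel() for p in model.parameters())/1e6:.2f} M parameters")  # well under ~4.7M
```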
Training protocol: Training occurs online during the experiment. Each time a new batch of scans is available, iterative phase retrieval runs on HPC, the new diffraction–phase pairs are appended to the dataset, and the model is trained for 50 epochs with a cyclical learning rate using the Adam optimizer to minimize mean absolute error (MAE). Validation is performed to avoid overfitting, and the updated model is then deployed to the edge for inference. Continual learning adapts the model to new features; a retraining service is invoked if the AI–iterative mismatch exceeds a tolerance (e.g., >10% SSIM difference). Model-reuse strategies (e.g., fairDMS-like reuse) are noted as a route to faster startup in future work.
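A minimal sketch of one such online-training round in PyTorch (Adam, cyclical learning rate, MAE loss) is given below; the batch size, learning-rate bounds, and validation split are assumptions, not the paper's settings.
```python
# One online-training round: fit for a fixed number of epochs with Adam plus a
# cyclical learning rate, minimizing MAE, while tracking validation loss to
# watch for overfitting. Hyperparameters here are assumed, not the paper's.
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

def train_round(model, diffraction, phase, epochs=50, device="cuda"):
    dataset = TensorDataset(diffraction, phase)               # growing corpus of labeled pairs
    n_val = max(1, int(0.1 * len(dataset)))
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=64)

    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CyclicLR(
        optimizer, base_lr=1e-5, max_lr=1e-3, cycle_momentum=False)   # cyclical LR for Adam
    loss_fn = torch.nn.L1Loss()                                       # mean absolute error

    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
            scheduler.step()                                          # step per batch for CyclicLR
        model.eval()
        with torch.no_grad():
            val_mae = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                          for x, y in val_loader) / len(val_loader)
    return model, val_mae                                             # updated model goes to the edge
```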
Data acquisition and preprocessing: X-ray energy 10 keV; scans performed at the APS hard X-ray nanoprobe. Two detector configurations were used: (i) an ASI Medipix3 (516×516 pixels, 55 µm pixel size) 15.5 cm from the sample, limited to 100 Hz by the chip's network bandwidth; and (ii) a Dectris Eiger X 500K (75 µm pixels) 0.9 m downstream, where a 128×128 crop was used to reach 2 kHz. A Fresnel zone plate (160 µm diameter, 30 nm outermost zone width) was intentionally defocused to an ~800 nm spot. Scanning used spiral trajectories; an example step size of 500 nm corresponds to an overlap ratio of ~0.9. For training pairs, the reconstructed pixel size was 6.2 nm; a 128×128 phase patch centered on the beam (FWHM ~50 nm) was interpolated onto a regular grid because the spiral positions are irregular. In total, 113 scans × 963 patterns each yielded ~108,800 diffraction–phase pairs.
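The re-gridding step can be pictured as sub-pixel sampling of the iterative reconstruction around each (non-integer) spiral position; one plausible way to build such 128×128 labels is sketched below with SciPy and stand-in arrays (the paper's exact preprocessing is not reproduced here).
```python
# Extract a 128x128 phase patch (6.2 nm pixels) centered on an arbitrary,
# non-integer scan position from the full iterative reconstruction, using
# bilinear interpolation. Array sizes and positions are illustrative stand-ins.
import numpy as np
from scipy.ndimage import map_coordinates

def extract_patch(phase_image, center_px, size=128):
    """Sample `phase_image` on a regular size x size grid centered at the
    (row, col) position `center_px`, which need not be integer-valued."""
    offsets = np.arange(size) - size / 2 + 0.5
    rr, cc = np.meshgrid(offsets + center_px[0], offsets + center_px[1], indexing="ij")
    return map_coordinates(phase_image, [rr, cc], order=1, mode="nearest")

pixel_m = 6.2e-9                                      # reconstructed pixel size
phase_image = np.random.rand(2048, 2048)              # stand-in iterative reconstruction
position_m = (3.1e-6, 4.7e-6)                         # stand-in spiral scan position
center_px = (position_m[0] / pixel_m, position_m[1] / pixel_m)
label = extract_patch(phase_image, center_px)         # one 128x128 training label
```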
Iterative reconstruction details: PIE and DM implementations in Tike were used (e.g., 3 probe modes, 2000 iterations on RTX 2080 Ti for comparisons). PIE results were used for labels due to slightly better uniformity for this dataset. Tike provides GPU-accelerated, scalable multi-node reconstructions with optimized communication for halo updates.
Stitching: Edge inference yields 128×128 phase tiles centered at scan positions; tiles are stitched to form cumulative phase images for entire scans (see Methods for specifics).
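One simple way to realize such stitching is to accumulate tiles on a canvas and average wherever they overlap; the sketch below illustrates the idea with an assumed canvas size and integer-pixel placement (the actual stitching in the paper may differ).
```python
# Place each 128x128 inferred phase tile on a running canvas at its scan
# position and average overlapping regions. Canvas size and positions are toys.
import numpy as np

def stitch(tiles, positions_px, canvas_shape=(1024, 1024), tile=128):
    canvas = np.zeros(canvas_shape)
    counts = np.zeros(canvas_shape)
    for phase, (r, c) in zip(tiles, positions_px):
        r0, c0 = int(r) - tile // 2, int(c) - tile // 2    # top-left corner of this tile
        canvas[r0:r0 + tile, c0:c0 + tile] += phase
        counts[r0:r0 + tile, c0:c0 + tile] += 1
    return canvas / np.maximum(counts, 1)                  # mean phase where tiles overlap

# toy usage: four overlapping tiles along a short horizontal track
tiles = [np.full((128, 128), v) for v in (0.1, 0.2, 0.3, 0.4)]
positions = [(512, 200 + 60 * i) for i in range(4)]
stitched = stitch(tiles, positions)
```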
Performance measurements: Edge inference benchmarked on Jetson AGX Xavier using PyTorch/TensorRT; additional timing on RTX A6000 and other GPUs. Networking capped throughput at 1 Gbps on the detector control computer; rate-dependent image sizes (512×512 at 100 Hz; 128×128 at 2 kHz) were used to sustain live streaming.
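The two operating points are consistent with that 1 Gbps cap, as a quick bandwidth estimate shows (16-bit pixels assumed, protocol overhead ignored):
```python
# Incoming data rate for the two streaming configurations on a ~1 Gbps link.
bytes_per_px = 2                                  # 16-bit pixels
for shape, rate_hz in [((512, 512), 100), ((128, 128), 2000)]:
    gbps = shape[0] * shape[1] * bytes_per_px * 8 * rate_hz / 1e9
    print(f"{shape[0]}x{shape[1]} at {rate_hz} Hz -> {gbps:.2f} Gbps")
# 512x512 at 100 Hz  -> 0.42 Gbps
# 128x128 at 2000 Hz -> 0.52 Gbps   (both near the ~0.5 Gbps incoming figure quoted below)
```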
Low-dose strategies: Two approaches were tested: (1) sparse sampling (reducing spatial overlap, down to none), enabled by per-pattern inference that removes oversampling constraints; and (2) low-count mitigation, either by scaling up the intensities of experimental data before inference (no retraining) or by scaling down the training data and retraining, which supports larger scale factors, with or without added Poisson noise.
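The intensity-scaling side of this can be illustrated in a few lines; the count levels, scale factors, and use of NumPy's Poisson sampling below are stand-ins, not the experimental values.
```python
# Two low-count mitigations: (a) scale a low-count measured pattern up before
# inference (no retraining); (b) scale training patterns down, optionally
# re-sampling Poisson counting noise, then retrain. Values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
full_dose = rng.poisson(50.0, size=(128, 128)).astype(float)   # stand-in high-count pattern

# (a) inference-time fix: rescale a low-count pattern to the intensity range
#     the network was trained on
low_count = rng.poisson(full_dose / 10.0).astype(float)
rescaled_for_inference = low_count * 10.0

# (b) training-time fix: emulate low dose in the training data and retrain,
#     with or without re-added Poisson noise
scale = 1000.0
scaled_training = full_dose / scale
scaled_training_noisy = rng.poisson(scaled_training).astype(float)
```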
Key Findings
- Real-time edge inference: Demonstrated live AI phase retrieval from streamed diffraction patterns at detector frame rates up to 2 kHz (128×128 crops) with end-to-end automated workflow and real-time stitching/feedback.
- Accuracy vs. iterative methods: Under high overlap (spiral step 500 nm; overlap ratio ~0.9), AI-inferred stitched phases closely match iterative PIE/DM reconstructions; line profiles are nearly identical.
- Sparse sampling/low dose: AI stitching retains >90% structural similarity (SSIM) even with zero overlap (step ≥460 nm), whereas the accuracy of iterative methods falls below 80% at an overlap ratio of 0.6. For an acceptable accuracy of 80%, the AI-enabled workflow allows step sizes 2.5× larger, enabling 6.25× larger area coverage and an equivalent dose reduction compared with conventional ptychography (a minimal SSIM check of this kind is sketched after this list).
- Low-count robustness: At 0.5 ms exposure (2 kHz), upscaling experimental intensities by 10× yielded 86% accuracy versus ground-truth iterative reconstruction; estimated dose ~81 ph/m². With retraining on scaled-down training data, >80% accuracy achieved for scaling factors up to 10,000, with negligible difference from adding Poisson noise post-scaling.
- Continual learning: Accuracy on unseen features improves progressively with continued training; a marked improvement occurs after ~80,000 training pairs when edge-of-patterned regions are included, underscoring the need for continual learning to adapt to new structures.
- Throughput and latency: Live inference at 100 Hz was achieved for 512×512 images, limited by a 1 Gbps network (~0.5 Gbps incoming); reducing frames to 128×128 enabled 2 kHz streaming at the detector maximum. On a powerful GPU, per-image inference time was ~70 µs (≈14 kHz capability). On the Jetson AGX Xavier (batch size 1), inference took ≈10±1 ms with TensorRT versus 15±1 ms with PyTorch for PtychoNN 1.0, and ≈2.3±0.4 ms versus 8±1 ms for PtychoNN 2.0.
- Model efficiency: PtychoNN 2.0 reduces parameters from ~4.7M to ~0.7M while maintaining reconstruction quality comparable to original PtychoNN, enabling lower latency and faster training.
- HPC integration: Periodic iterative reconstructions over 200 Gbps APS–ALCF link support online labeling; training scales near-linearly with dataset size; ~10 minutes for ~100k pairs using 8×A100 GPUs.
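As referenced in the sparse-sampling finding above, a minimal version of such an SSIM-based accuracy check (and of the Methodology's retraining trigger) could look like the following; treating a ">10% SSIM difference" as a 0.9 threshold is an assumption of this sketch.
```python
# Compare an AI-inferred phase image against the iterative reconstruction with
# SSIM and flag when the mismatch exceeds a tolerance (assumed 10% here).
import numpy as np
from skimage.metrics import structural_similarity

def needs_retraining(ai_phase, iterative_phase, tolerance=0.10):
    data_range = float(iterative_phase.max() - iterative_phase.min())
    ssim = structural_similarity(ai_phase, iterative_phase, data_range=data_range)
    return ssim < 1.0 - tolerance, ssim

# usage: retrain, score = needs_retraining(stitched_ai, stitched_iterative)
```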
Discussion
The workflow directly addresses the bottleneck of conventional iterative ptychography, whose throughput cannot keep pace with modern detector data rates that are doubling annually. By deploying a compact, accurate surrogate model at the edge, the system provides low-latency inference and real-time feedback to enable experimental steering. Continual learning ensures that performance improves throughout the experiment and adapts to previously unseen features. The approach significantly reduces dose and acquisition time by removing oversampling constraints and allowing sparse sampling, which is particularly beneficial for dose-sensitive materials. A resource-aware strategy minimizes HPC usage by retraining only when AI–iterative discrepancies exceed defined thresholds, while retaining the ability to detect distribution shifts via periodic validation. The demonstrated throughput indicates feasibility for next-generation beamlines and advanced electron microscopes, with performance ultimately bounded by network and edge hardware limits; however, scaling to more capable GPUs suggests ample headroom. The method’s accuracy is linked to the fidelity of iterative labels and the similarity between experimental conditions and the training distribution (e.g., refractive index range and probe), guiding practical deployment and retraining policies.
Conclusion
This work demonstrates, for the first time, real-time coherent imaging via deep learning at the edge for X-ray ptychography, achieving live phase retrieval at up to 2 kHz with high fidelity to iterative methods. The AI-enabled workflow eliminates oversampling constraints, enabling sparse sampling for substantial dose reduction and increased area coverage, while continual learning adapts the model during acquisition. Integration with HPC for periodic labeling and training, plus edge acceleration (TensorRT), yields a scalable, automated solution for next-generation light sources. Future directions include improving robustness across broader sample and probe variations, reusing models across experiments (e.g., fairDMS-like strategies), enhancing edge/network infrastructure for higher data rates, extending to other coherent imaging modalities (including electrons), and deeper integration into autonomous experimental frameworks.
Limitations
- Dependence on iterative labels: Final accuracy is bounded by the quality of iterative phase retrieval used for training; biases or errors propagate to the AI inference.
- Domain specificity: The network is trained for samples within a given refractive index range and a fixed illumination probe; significant deviations require retraining to maintain accuracy.
- Count-rate sensitivity: The model is resilient to count-rate variations up to ~16×; at ~40× variation the stitched inference becomes unreliable.
- Need for initial high-quality data: Early in an experiment, high-overlap, higher-count scans and iterative retrieval are required to bootstrap training; removing overlap later limits the ability to continue iterative labeling and thus continual learning.
- System bottlenecks: Inference throughput can be constrained by the detector control computer's networking (e.g., 1 Gbps) and by pre-/post-processing overheads that are not included in the reported inference times.
- Generalizability: While robust within the trained domain, generalization to drastically different samples, probes, or detector configurations may be limited without adaptation/retraining.