Biology
Neural space-time model for dynamic multi-shot imaging
R. Cao, N. S. Divekar, et al.
Discover how Ruiming Cao and colleagues developed a neural space-time model (NSTM) that enhances multi-shot computational imaging by jointly estimating the scene and its motion dynamics, without pretraining or data priors. This approach removes motion artifacts and recovers motion dynamics, particularly in advanced microscopy techniques.
~3 min • Beginner • English
Introduction
Multi-shot computational imaging acquires multiple sequential measurements under varying encodings (for example, illumination or pupil coding) and combines them to enhance capabilities such as super-resolution, phase retrieval, and hyperspectral imaging. When the scene is dynamic, sequential capture causes motion blur and artifacts because the reconstruction typically assumes a static object. Prior approaches mitigate this with hardware multiplexing to accelerate capture, with more data-efficient algorithms, or with deep learning that imposes data priors, but these can be system-specific or impractical, or can fail on out-of-distribution data. The authors instead model scene dynamics directly during reconstruction. Modeling is challenging because raw measurements are differently encoded (precluding simple registration) and motion can be complex and deformable. The paper proposes a neural space-time model (NSTM) that leverages temporal redundancy (smooth evolution across adjacent time points) using two coordinate-based neural networks: a motion network that outputs a per-pixel displacement (motion kernel) for any time, and a scene network that represents the underlying scene and is queried at motion-warped coordinates. Joint optimization against a system’s differentiable forward model enables recovery of both the dynamic scene and its motion without pretraining or external priors. A coarse-to-fine training process stabilizes convergence by synchronizing motion and scene learning, first recovering low-frequency content and global motion, then refining to high-frequency detail and local deformations. The model is demonstrated on differential phase-contrast microscopy (DPC), 3D structured illumination microscopy (3D SIM), and rolling-shutter DiffuserCam.
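The forward query at the heart of NSTM can be stated compactly: the motion network predicts a displacement for (x, t), and the time-independent scene network is evaluated at the warped coordinate, o(x, t) = f_scene(x + f_motion(x, t)). Below is a minimal, illustrative sketch of that query in JAX (the paper's implementation language), using plain MLPs in place of the multiresolution hash-embedded networks the authors describe; all names, sizes, and initializations here are assumptions for illustration, not the authors' code.

import jax
import jax.numpy as jnp


def init_mlp(key, sizes):
    # Random parameters for a small fully connected network.
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (din, dout)) * 0.1, jnp.zeros(dout)))
    return params


def mlp(params, x):
    # ReLU hidden layers, linear output layer.
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return x @ w + b


def nstm_query(motion_params, scene_params, x, t):
    # o(x, t): predict the displacement (motion kernel) at (x, t), then query the
    # time-independent scene network at the motion-warped coordinate x + δx.
    delta = mlp(motion_params, jnp.concatenate([x, jnp.atleast_1d(t)]))
    return mlp(scene_params, x + delta)


key = jax.random.PRNGKey(0)
motion_params = init_mlp(key, [3, 32, 32, 2])    # (x, y, t) -> (δx, δy); 2 hidden layers, width 32
scene_params = init_mlp(key, [2, 128, 128, 1])   # warped (x, y) -> signal; 2 hidden layers, width 128
print(nstm_query(motion_params, scene_params, jnp.array([0.3, 0.7]), 0.5))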
Literature Review
The paper situates NSTM among several strands of prior work: (1) hardware strategies to reduce acquisition time in multi-shot systems and enable live imaging (for example, faster SIM hardware and analog processing) that still assume static scenes within short windows; (2) algorithmic advances that improve data efficiency (for example, SRRF variants and related super-resolution methods) but generally do not explicitly model complex deformations during acquisition; (3) deep-learning approaches that rely on data priors or supervised training (for example, SOFI, Deep-STORM, CARE, virtual refocusing, and temporal super-resolution) that can be hard to train due to limited ground truth and may fail on out-of-distribution samples; and (4) recent uses of neural implicit representations for view synthesis (NeRF), periodic activation networks (SIREN), and deformable registration, as well as dynamic microscopy modeling via deep learning in single-molecule localization microscopy. NSTM differs by jointly estimating scene and motion on a per-dataset basis without pretraining or external priors, leveraging implicit neural representations to flexibly model deformable motion and spatiotemporal signals across arbitrary coordinates.
Methodology
NSTM comprises two coordinate-based neural networks (multi-layer perceptrons with multiresolution hash embeddings): a motion network and a scene network.

Inputs and forward process: For a spatial coordinate x and time t, hash-embedded features of (x, t) are fed to the motion network f_motion to predict a per-coordinate displacement δx. The motion-adjusted coordinate (x + δx) is then hashed and passed to the scene network f_scene to output the signal o(x, t). The scene network does not take time directly; dynamics arise from motion-warped sampling of a time-independent implicit scene. Outputs are modality-specific: single-channel fluorescence density (3D SIM), two-channel amplitude and phase (DPC), or RGB intensities (DiffuserCam).

Coarse-to-fine training: Both networks use multiresolution hash embeddings whose features are reweighted during training by a granularity parameter α (the ratio of the current epoch to 80% of the total epochs), progressively enabling finer features (see the schedule sketch following this section). This mitigates poor convergence in which the scene network overfits before the motion network recovers the motion, especially under complex deformation.

Optimization: The reconstructed dynamic scene is passed through the system’s differentiable forward model to render measurements at the corresponding acquisition times. The mean squared error between rendered and acquired measurements is minimized by gradient descent (Adam). Learning rates: motion network 1e-5 (5e-5 for DiffuserCam), scene network 1e-3, each decayed exponentially to 0.1× by the end of reconstruction. Architectures: motion network, 2 hidden layers of width 32; scene network, 2 hidden layers of width 128. Implemented in Python and JAX.

Forward models per modality:
- DPC: Raw images are normalized by the background; a linear transfer-function forward model uses absorption and phase transfer functions from prior work. Conventional reconstruction uses Tikhonov regularization (weight 1e-4) for amplitude and phase; the same regularization is added to the NSTM loss for a fair comparison.
- 3D SIM: To avoid the static-scene assumption of band separation, a real-space forward model renders each measurement independently at its acquisition time: o is modulated by the illumination pattern for each band, filtered by the corresponding OTFs, and the bands are summed. For efficiency, measurements with identical orientation and phase across z are grouped and rendered jointly, requiring only 15 time points regardless of the number of z planes. Illumination parameters are shared between the conventional and NSTM methods. Conventional reconstructions of time-varying scenes used a moving window across acquisitions to approximate time points.
- Rolling-shutter DiffuserCam: Each raw image row integrates the dynamic scene convolved with the caustic PSF over the shutter interval (an illustrative sketch follows this section). The forward model sums over shutter-on times; to fit GPU memory, subsets of rows (20 consecutive rows per pass) are rendered per step using only the contributing time points.

Data acquisition highlights: DPC data were acquired on a Nikon TE300 with LED-array illumination (4 half-circular patterns, 25 ms exposure). 3D SIM data were acquired on a Zeiss Elyra PS.1 (×100/1.46 NA, ×1.6 TL), 15 images per z-plane (3 orientations × 5 phases), imaging microbeads and live RPE-1 cells (mitochondria, ER, F-actin). DiffuserCam data are from prior work: a color sCMOS sensor in slow-scan rolling-shutter mode (27.52 µs/row, 1,320 µs exposure, 31.0 ms total).
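To make the coarse-to-fine schedule and optimizer settings concrete, here is a short, self-contained JAX sketch of the granularity parameter α, a plausible per-level gating of the hash-embedding features (a cosine ramp in the style of BARF-like coarse-to-fine schemes; the paper's exact reweighting may differ), and per-network Adam optimizers with the reported learning rates decayed exponentially to 0.1×. The choice of optax and all function names are assumptions for illustration.

import jax.numpy as jnp
import optax


def granularity(epoch, total_epochs):
    # α ramps from 0 to 1 over the first 80% of training, then stays at 1.
    return jnp.clip(epoch / (0.8 * total_epochs), 0.0, 1.0)


def level_weights(alpha, num_levels=8):
    # Progressively enable finer embedding levels as α grows (assumed cosine ramp).
    k = jnp.arange(num_levels)
    x = jnp.clip(alpha * num_levels - k, 0.0, 1.0)
    return 0.5 * (1.0 - jnp.cos(jnp.pi * x))


# Reported learning rates: motion 1e-5 (5e-5 for DiffuserCam), scene 1e-3,
# each decayed exponentially to 0.1x of its initial value by the end of training.
total_steps = 10_000
motion_opt = optax.adam(optax.exponential_decay(1e-5, total_steps, 0.1))
scene_opt = optax.adam(optax.exponential_decay(1e-3, total_steps, 0.1))

print(level_weights(granularity(epoch=200, total_epochs=1000)))

The rolling-shutter DiffuserCam forward model lends itself to a similarly compact toy rendering: every sensor row sums, over the time points at which its shutter is open, the dynamic scene at that time convolved with the caustic PSF. The shapes, the moving toy scene, the single-pixel PSF, and the row_open schedule below are illustrative assumptions, not the authors' parameters.

import jax.numpy as jnp
from jax.scipy.signal import convolve2d

H, W, T = 32, 32, 16                                             # sensor rows/columns, time points
psf = jnp.zeros((H, W)).at[H // 2, W // 2].set(1.0)              # toy stand-in for the caustic PSF
scene = jnp.stack([jnp.roll(jnp.eye(H, W), s, axis=1) for s in range(T)])  # toy moving scene


def row_open(row, t, rows_per_time=2, exposure_rows=8):
    # Toy rolling-shutter schedule: row r starts exposing at t = r / rows_per_time
    # and stays open for exposure_rows readout intervals.
    start = row / rows_per_time
    return (t >= start) & (t < start + exposure_rows)


def render_rolling_shutter(scene, psf):
    # Convolve the scene with the PSF at every time point, then integrate each row
    # only over the time points during which its shutter is open.
    blurred = jnp.stack([convolve2d(scene[t], psf, mode="same") for t in range(T)])
    times = jnp.arange(T)
    rows = []
    for r in range(H):
        mask = row_open(r, times).astype(jnp.float32)
        rows.append(jnp.sum(mask[:, None] * blurred[:, r, :], axis=0))
    return jnp.stack(rows)                                        # (H, W) raw measurement


measurement = render_rolling_shutter(scene, psf)
print(measurement.shape)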
Key Findings
- General: NSTM jointly recovers motion dynamics and artifact-free reconstructions from the same multi-shot raw measurements, without pretraining or external priors, across three modalities (DPC, 3D SIM, DiffuserCam).
- DPC (live C. elegans): The conventional reconstruction, which assumes a static scene, exhibits motion artifacts. NSTM with coarse-to-fine training recovers per-time-point motion kernels and produces clean amplitude and phase without motion blur.
- 3D SIM microbeads with induced vibration: Diffraction-limited widefield imaging cannot resolve 190 nm beads; the conventional fairSIM reconstruction is motion-blurred and fails to separate the beads. NSTM resolves individual subresolution beads with quality comparable to a ground-truth dataset acquired without motion, and also recovers motion maps.
- 3D SIM live-cell mitochondria (RPE-1, StayGold mitochondrial matrix): The conventional reconstruction appears to show a branching tubule; NSTM reveals the true dynamics, a single moving tubule, disambiguating the motion-induced artifact. NSTM removes motion blur, recovering high-frequency features lost in conventional reconstructions and preserving more super-resolved content (supported by frequency spectra).
- 3D SIM live-cell ER (RPE-1, StayGold ER): Conventional reconstructions struggle to present clear, temporally consistent ER networks; NSTM yields consistent structures across time and visualizes motion kernels that describe inter-time-point displacements.
- 3D SIM live-cell F-actin (LifeAct Halo-JF585): NSTM handles dense subcellular structures, providing temporally consistent, sharper reconstructions relative to conventional methods.
- Rolling-shutter DiffuserCam: Compared with basic deconvolution and FISTA with anisotropic 3D TV regularization (the original method), NSTM produces cleaner videos without over-smoothing and enables direct querying of motion trajectories for points in the scene.
- Simulations: NSTM performs well across rigid (translation, rotation) and non-rigid (shear, swirl) motions; the coarse-to-fine schedule is critical, as optimization often fails for complex motions without it. Performance degrades with large-magnitude local deformations, with high-frequency periodic deformations (reduced temporal redundancy), and with increasing measurement noise.
- Computational performance: Microbead NSTM reconstruction takes ~3 min on an NVIDIA RTX 3090 GPU versus <10 s on a CPU for conventional fairSIM; live-cell 3D reconstructions (20×512×512 volume, 15 time points) take ~40.5 min on an NVIDIA A6000 GPU.
Discussion
The findings demonstrate that explicitly modeling motion within a self-supervised, coordinate-based framework effectively addresses motion-induced artifacts in multi-shot systems. By aligning the scene representation through learned per-pixel motion kernels and optimizing against the modality’s differentiable forward model, NSTM reconstructs sharper, temporally consistent images and clarifies dynamic biological processes that would otherwise be misinterpreted. The method’s generality is underscored by its application to DPC, 3D SIM, and rolling-shutter DiffuserCam, including cases where conventional reconstructions assume static scenes or require regularization that can over-smooth detail. The motion-aware approach preserves high-frequency content and enables motion trajectory analysis. However, NSTM relies on temporal redundancy (smooth evolution between adjacent time points); when dynamics are less smooth (e.g., high-frequency local deformations) or when noise is high, performance degrades, as shown in simulations. Coordinate-based implicit networks also allow flexible querying at arbitrary spatiotemporal coordinates, facilitating temporal interpolation without grid resampling, provided the learned dynamics are accurate.
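As an illustration of this flexible querying, the toy sketch below evaluates a motion network on a fine time grid to trace a single point's trajectory, including times between the acquired measurements. Here trained_motion is a hypothetical stand-in for a learned motion network (such as the one sketched earlier), not the paper's trained model.

import jax
import jax.numpy as jnp


def trained_motion(x, t):
    # Placeholder for a trained f_motion(x, t) -> displacement; a toy linear drift.
    return jnp.array([0.02, -0.01]) * t


point = jnp.array([0.4, 0.6])                  # scene coordinate to track
times = jnp.linspace(0.0, 1.0, 50)             # fine grid, including inter-acquisition times
trajectory = point + jax.vmap(lambda t: trained_motion(point, t))(times)
print(trajectory.shape)                        # (50, 2): the point's position at each queried time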
Conclusion
NSTM delivers a unified, self-supervised approach to jointly recover scene and motion from standard multi-shot acquisitions, eliminating motion artifacts and revealing dynamics across diverse computational imaging modalities. This reduces the risk of misinterpretation in live biological imaging and effectively enhances temporal resolution by enabling reconstruction at intermediate time points. Future work could extend the architecture to handle non-smooth dynamics and events like appearance/disappearance (by introducing time dependence in the scene network), and improve computational efficiency via better initializations, hyperparameter optimization, lower-precision arithmetic, and possibly hybrid data-driven components. Broader adoption is anticipated across multi-shot systems with differentiable, deterministic forward models.
Limitations
- Reliance on temporal redundancy: performance degrades when motion is not smooth between adjacent frames (e.g., high-frequency vibrations).
- Architectural constraint: the time-independent scene network requires deformation of a static template; thus it cannot capture appearance/disappearance or on/off events (e.g., neuron firing, photoactivation).
- Sensitivity to complex, large-magnitude local deformations and increased measurement noise (demonstrated in simulations).
- Higher computational cost than conventional reconstructions (minutes to tens of minutes on GPUs versus seconds on CPU for some cases).
- Joint optimization can fail without the coarse-to-fine strategy due to scene overfitting before motion convergence.