
Engineering and Technology

Dynamic machine vision with retinomorphic photomemristor-reservoir computing

H. Tan and S. van Dijken

Hongwei Tan and Sebastiaan van Dijken introduce a dynamic machine vision system that performs real-time motion recognition and prediction through in-sensor processing, a step toward applications in robotics and autonomous driving.

Introduction
The study addresses a core challenge in dynamic machine vision: using the present visual frame to both infer past motion and predict future trajectories without relying on long frame sequences or heavy off-sensor computation. Motivated by biological vision, where short-term visual memory aids motion perception and prediction, the authors propose a compact, in-sensor approach that embeds the temporal history of motion into the current frame via photomemristive memory. The research question is whether a retinomorphic photomemristor array operating as a dynamic reservoir, combined with lightweight readout networks, can achieve accurate motion recognition and prediction from a single informative frame, thereby reducing redundant data flow and energy consumption inherent in conventional multi-module imaging pipelines.
Literature Review
Prior work in retinomorphic sensing has demonstrated devices with memory-enabled, adaptive, and all-in-one sensing capabilities, including switchable photovoltaic sensors, non-volatile phototransistors, and memristors, enabling in-sensor computing, visual adaptation, and motion detection. In-sensor reservoir computing systems with spatiotemporal processing have been reported for tasks such as language learning and image classification. Separately, neuromorphic hardware using ionic memristors and phase-change memtransistors has shown temporal data classification and forecasting by emulating synaptic dynamics. Traditional dynamic vision relies on multi-frame analysis with frequent data transfer among sensing, memory, and processing modules, incurring large overheads. Despite this progress, compact systems that perform motion recognition and prediction in-sensor using a single compressed frame had not been realized.
Methodology
Device fabrication: A 5×5 photomemristor array (PMA) based on ITO/ZnO/Nb-doped SrTiO3 (NSTO) Schottky junctions was fabricated using atomic layer deposition (ALD), photolithography, etching, and magnetron sputtering. A 60 nm ZnO layer was sputtered (5.8×10^−3 mbar, Ar 16 sccm, O2 4 sccm, 60 W) onto conductive NSTO serving as the bottom electrode; ITO top electrodes were sputtered (3.4×10^−3 mbar, Ar 10 sccm, 50 W). Working areas were 100 μm × 100 μm. An ALD-grown Al2O3 insulating layer isolated the ZnO from the ITO wiring outside the opened working areas.

Characterization: Electrical and optical responses were measured with a Keithley 4200, Keithley 2400, Agilent B1500A, Tektronix AFG1062, Keysight DSO1024A, and a blue LED with shadow masks. The blue pulse intensity was 0.65 ± 0.06 mW mm^−2, calibrated with a Thorlabs FD11A photodetector and an Ocean Optics USB2000+ spectrometer. Programmed light pulses simulated image and motion inputs. PMA current maps were recorded pixel by pixel under one-by-one optical input, with bias applied to the ITO electrode.

RP-RC operation and datasets: The PMA acts as a dynamic reservoir; hidden memristive states persisting between pulses embed prior frames into the current frame. For word-video recognition (five words ending with E), only the currents measured after the last frame were used. For motion recognition (slow/medium/fast), three frames were played (50 ms illumination per frame) and only the h3 features were used. Noise-augmented datasets were generated from the measured currents.

Readout ANN for word classification: 25 inputs (PMA pixels) and 5 outputs (classes). 1200 datasets (900 training, 300 test) were generated by adding Gaussian noise (σ = 0.15 and 0.30). Training used a batch size of 25 for 200 epochs.

Readout ANN for motion speed: 25 inputs and 3 outputs (slow/medium/fast). 1200 datasets (900 training, 300 test) with noise σ = 0.15 and 0.30. Batch size 25, 100 epochs.

Autoencoder for motion prediction: A fully connected autoencoder with 25-10-25 neurons (softmax encoder, sigmoid decoder) was trained with MSE loss on 96,000 datasets (72,000 training, 24,000 test) with 15% noise; batch size 100, 100 epochs. Training used frame transitions (h1→h2, h2→h3), so a first-frame input drives recursive prediction (h1→h2→h3→...). A shifting operation simulated gaze/field-of-view extension for continuous prediction.

CNN for speed recognition in the traffic simulation: 48×48 present frames (h3) were fed to a CNN (4 Conv2D + 4 MaxPooling2D layers followed by fully connected layers). 16,800 datasets (12,000 training, 2400 validation, 2400 test) with 10% noise; batch size 100, 100 epochs.

Convolutional autoencoder (CAE) for trajectory prediction: 48×48 input and output. Encoder: 4 Conv2D + 4 MaxPooling2D layers; decoder: 4 Conv2DTranspose layers. 64,000 datasets (32,000 training, 32,000 test) from simulated motions with 10% noise; batch size 160, 400 epochs.

Deep neural network (DNN) for crossmodal learning: 52 Mel-feature inputs, hidden layers with 25 and 15 neurons (ReLU), and an output layer with 2304 neurons (the 48×48 first-frame pixels, sigmoid), trained with MSE loss. 97,920 MFCC datasets (81,600 training, 16,320 test) with 10% noise; batch size 200, 150 epochs. TensorFlow was used for all algorithms.
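Since TensorFlow is stated as the framework for all readout algorithms, the following minimal Keras sketch (not the authors' code) illustrates two representative networks described above: the 25→5 word-classification ANN and the 25-10-25 autoencoder with recursive prediction. Layer sizes, activations, losses, batch sizes, and epoch counts follow the paper; the classifier's hidden-layer width, the optimizer choice, and the synthetic stand-in data are illustrative assumptions.

```python
# Minimal sketch of two readout networks under the assumptions stated above.
import numpy as np
from tensorflow.keras import layers, models

# --- Readout ANN: 25 PMA pixel currents -> 5 word classes ---
classifier = models.Sequential([
    layers.Input(shape=(25,)),            # one hidden-state current per PMA pixel
    layers.Dense(32, activation="relu"),  # hidden width is an assumption
    layers.Dense(5, activation="softmax"),
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])

# Stand-in for the noise-augmented training data: in the paper these are
# measured last-frame currents with Gaussian noise (sigma = 0.15 or 0.30).
x_train = np.random.rand(900, 25).astype("float32")
y_train = np.random.randint(0, 5, size=(900,))
classifier.fit(x_train, y_train, batch_size=25, epochs=200, verbose=0)

# --- Autoencoder: 25 -> 10 -> 25, trained on frame transitions h1->h2, h2->h3 ---
autoencoder = models.Sequential([
    layers.Input(shape=(25,)),
    layers.Dense(10, activation="softmax"),  # encoder
    layers.Dense(25, activation="sigmoid"),  # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")

h_current = np.random.rand(72000, 25).astype("float32")  # stand-in for h1, h2 frames
h_next = np.random.rand(72000, 25).astype("float32")     # stand-in for h2, h3 frames
autoencoder.fit(h_current, h_next, batch_size=100, epochs=100, verbose=0)

# Recursive prediction: feed the first frame and iterate h1 -> h2 -> h3 -> ...
frame = np.random.rand(1, 25).astype("float32")
for _ in range(3):
    frame = autoencoder.predict(frame, verbose=0)
```

In this arrangement the photomemristor array performs the temporal compression, so the trainable readout stays small: only the fully connected output layers are optimized, consistent with the reservoir-computing division of labor described in the paper.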
Key Findings
Photomemristor dynamics: Illumination increases the output current by 2–3 orders of magnitude, with gradual decay after light-off, yielding a wide, continuous range of analog hidden states. A high on/off ratio (~10^2) in response to 100 ms pulses and adequate dynamic memory at input frequencies up to 60 Hz were observed. Device responses were highly uniform.

Word-video recognition from a single final frame: Using the last frame (E), which embeds the prior letters via hidden states, a 25→5 ANN achieved test accuracies of 97.3% (σ=0.15) and 91.3% (σ=0.30). Conventional sensing using only peak photoresponses (no hidden states) achieved 36.2% at σ=0.30, underscoring the importance of the inherent memory. Increasing the bias (Vbias 0.8→1.2 V) enhanced the hidden-state memory and improved accuracy from 78% to 100%.

Motion speed recognition and imprint: For three-frame motions at slow (3 s), medium (1.5 s), and fast (0.6 s) speeds, the last-frame features (h3) differed because of accumulated memory imprints of earlier positions; higher speeds produced stronger imprints. A 25→3 ANN achieved 100% training accuracy at σ=0.15 and 97% at σ=0.30, with 100% test accuracy.

Motion prediction with autoencoder: A 25-10-25 autoencoder trained on sequential pairs (h1→h2, h2→h3) predicted future frames from the first frame alone, with recursive prediction and field-of-view extension via shifting. By combining the recognized speed with the predicted trajectories, precise future positions were computed, e.g., 9, 18, and 45 steps at t = 9 s for slow, medium, and fast motion (see the short calculation below).

Intelligent traffic simulation: Within a 48×48 PMA framework, a CNN recognized object speeds with ~90% average test accuracy, and a CAE predicted robot and car trajectories over many steps. Decision maps based on the predicted positions enabled dynamic slow-down/keep-speed decisions for a car and a robot at a crosswalk.

Crossmodal audio-to-visual prediction: A DNN using MFCC features crossmodally recognized the first frame of motion with ~90% test accuracy for phrases such as 'A person/car is moving left/right'. Feeding the recognized first frame into the PMA-trained CAE produced continuous motion predictions; three of the four motions were predicted successfully for 25 frames, while one ('person moving right') required re-stimulation after 9 frames to maintain accuracy.
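As an illustrative check of the quoted step counts (assuming one spatial step per frame, so the per-step interval is the total three-frame duration divided by three), a short calculation reproduces the 9, 18, and 45 steps at t = 9 s:

```python
# Illustrative arithmetic only: per-step interval = three-frame duration / 3.
durations = {"slow": 3.0, "medium": 1.5, "fast": 0.6}  # seconds per 3-frame motion
t = 9.0  # prediction horizon in seconds
for speed, d in durations.items():
    steps = t / (d / 3)  # elapsed time divided by per-step interval
    print(f"{speed}: {steps:.0f} steps")  # slow: 9, medium: 18, fast: 45
```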
Discussion
Embedding temporal history as hidden photomemristive states within a single present frame enables compact in-sensor processing that circumvents the multi-module data movement of traditional dynamic vision systems. The RP-RC system effectively compresses spatiotemporal information at the sensor level, allowing lightweight readout networks to perform high-accuracy recognition and prediction tasks, including inferring past content and forecasting future trajectories from a single informative frame. Tunability of the photomemristor memory via bias voltage provides a hardware-level mechanism to trade off speed, sensitivity, and accuracy, akin to adjustable attention in biological systems. The approach scales to richer tasks and modalities: it extends from small-array demonstrations to 48×48 simulated scenarios for intelligent traffic and integrates audio cues through crossmodal learning to initiate visual prediction from spoken descriptions. Compared with conventional frame-by-frame pipelines, the RP-RC paradigm reduces redundant data flows and enables real-time perception and decision making, aligning with the needs of autonomous systems, robotics, and intelligent transport.
Conclusion
The work demonstrates a retinomorphic photomemristor-reservoir computing platform in which a photomemristor array encodes spatiotemporal sequences as hidden states in the present frame, enabling in-sensor motion recognition and trajectory prediction. Key contributions include: (1) hardware evidence of robust short-term photomemristive memory suitable for spatiotemporal compression; (2) successful recognition of dynamic content and motion speeds using a single final frame with high accuracy; (3) autoencoder-based recurrent prediction of future motion from minimal input; (4) application to intelligent traffic decision-making; and (5) crossmodal audio-to-visual motion prediction. Future directions include expanding spectral sensitivity to the visible range via materials engineering (e.g., Co-doped ZnO or narrower bandgap materials), scaling array size and integration level, improving robustness in complex real-world scenes, and broadening crossmodal capabilities for multimodal perception in autonomous systems.
Limitations
The ZnO-based photomemristor has a 3.1 eV bandgap, limiting photoresponse to the 320–400 nm range (UV–blue); broader visible sensitivity will require material modifications (e.g., doping or alternative semiconductors). Several application-level demonstrations (48×48 PMA in traffic scenarios and crossmodal audio-to-visual prediction) rely on simulated datasets and noise-augmented measurements rather than fully deployed hardware at scale. Reported recognition and prediction accuracies are evaluated under controlled conditions and specified noise factors, which may differ from performance in complex, variable real-world lighting and motion. While device uniformity is reported as high, large-scale manufacturing variability and long-term stability were not exhaustively assessed.