Computer Science
Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks
T. Tadros, G. P. Krishnan, et al.
Timothy Tadros, Giri P. Krishnan, Ramyaa Ramyaa, and Maxim Bazhenov address catastrophic forgetting in artificial neural networks with a 'Sleep Replay Consolidation' (SRC) algorithm. By interleaving supervised training with a sleep-like phase of spontaneous replay and local plasticity, SRC recovers performance on previously learned tasks without revisiting old data.
~3 min • Beginner • English
Introduction
Humans and animals learn continuously, integrating new information without erasing older memories, whereas artificial neural networks (ANNs) typically suffer from catastrophic forgetting when trained sequentially on multiple tasks. This reflects the stability–plasticity dilemma: networks must be plastic to learn new tasks yet stable to preserve old ones. Neuroscience indicates that sleep supports memory consolidation and knowledge generalization via spontaneous replay and local unsupervised synaptic plasticity, enabling formation of more orthogonal, sparse representations that reduce interference between memories. Motivated by hippocampus-independent consolidation processes thought to occur particularly in REM sleep, the authors hypothesize that introducing a sleep-like, offline replay phase into ANN training can protect old memories during new task learning and reduce catastrophic forgetting. They propose and evaluate a Sleep Replay Consolidation (SRC) algorithm interleaving standard supervised training with sleep-like unsupervised replay phases.
Literature Review
The paper situates SRC within two main continual learning approaches: rehearsal (explicit replay of stored or generated past examples) and regularization (constraining updates to preserve important weights). Prior work shows rehearsal methods (including generative replay) can be effective but require storing data or training auxiliary generators and may scale poorly. Regularization methods such as Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI) penalize changes to important weights but can struggle in class-incremental settings with high representational overlap; Orthogonal Weight Modification (OWM) improves performance by projecting updates orthogonally to subspaces representing old tasks. Neuroscience literature supports sleep-dependent memory consolidation involving spontaneous replay and local plasticity, with evidence for both NREM and REM contributions, increased sparsity, and orthogonalization of representations. Prior wake–sleep algorithms and spiking network studies suggest sleep-like phases can improve learning and reduce required training examples. The authors build on this by implementing a biologically inspired offline replay without explicit task data to mitigate catastrophic forgetting.
Methodology
SRC is integrated into standard ANN training by interleaving supervised "awake" training (backpropagation with ReLU activations) with an offline sleep phase featuring spontaneous activity and local Hebbian-type plasticity. Key components:
- Sleep conversion: During sleep, ReLU activations are replaced with Heaviside (spike-like) activations. Layer weights are scaled by a factor derived from the maximum activation observed during the preceding training to maintain activity levels. Layer-specific thresholds and scaling factors are selected via hyperparameter search/genetic algorithm based on maintaining reasonable firing activity.
- Stimulation during sleep: The input layer is driven by noisy binary inputs. On each SRC forward pass, each input unit is activated stochastically, with a Poisson rate set by that pixel’s mean intensity across all previously seen training data. SRC therefore stores only per-pixel average intensities (fixed-size statistics independent of the number of tasks); no task-specific exemplars are presented during sleep.
- Unsupervised Hebbian rule: After a forward pass that propagates spikes, a backward pass updates weights locally: increase a synapse if both pre- and post-synaptic units are active in sequence; decrease if the post-synaptic unit is active but the pre-synaptic unit is inactive. Spiking neurons reset voltages after threshold crossing. After multiple sleep steps, weights are rescaled back and activations return to ReLU for subsequent testing or training.
- Tasks and datasets: Incremental class learning on MNIST, Fashion MNIST, CIFAR-10, and CUB-200. For MNIST/Fashion/CIFAR-10, five sequential tasks each comprising two classes; for CUB-200, two tasks each with 100 classes. A cross-modal task trains MNIST first, then Fashion MNIST (or vice versa) in sequence.
- Architectures and training: Fully connected networks for MNIST, Fashion MNIST, and cross-modal MNIST (two hidden layers of 1200 units; ReLU; no biases; SGD with momentum; 10 epochs per task; minibatch size 100; dropout 0.2); a sketch of this setup appears after this list. CIFAR-10 uses features extracted from a VGG-like CNN backbone pre-trained on Tiny ImageNet; the classifier has two hidden layers (1028 and 256 units; dropout 0.2). CUB-200 uses two hidden layers (350 and 300 units) feeding a 200-way classifier (50 epochs per task). The loss is multiclass cross-entropy evaluated on the current task during incremental supervised training.
- Baselines and comparisons: Sequential training (lower bound), parallel training on all data (upper bound), regularization methods (EWC, SI, OWM), rehearsal with a small fraction of stored old data, and combinations (Rehearsal+SRC). Additionally, iCaRL is combined with SRC across memory capacities K (50–2000) to assess complementarity. Analyses include toy binary patch tasks controlling overlap (interference), correlation and sparseness analyses of hidden-layer representations, and task-specific neuron activity/firing during sleep.
- Pseudocode: Algorithm 1 details the SRC procedure, including ANN-to-sleep conversion, forward spike propagation with thresholds and scaling, Hebbian updates, and reversion to the ANN for further training or testing; a simplified sketch of the sleep phase also follows this list.
- Analysis metrics: Classification accuracy on all tasks after each phase; representational correlations across classes; firing rate comparisons for task-specific vs random neurons; input drive to hidden units before/after SRC; statistical tests with Bonferroni correction where applicable.
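To make the fully connected setup concrete, here is a minimal PyTorch sketch of the MNIST/Fashion MNIST classifier and its per-task "awake" training loop, following the description above (two hidden layers of 1200 ReLU units, no biases, dropout 0.2, SGD with momentum, 10 epochs per task, minibatches of 100). The learning rate and momentum values are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

# Fully connected classifier sketched from the description above:
# two hidden layers of 1200 ReLU units, no biases, dropout 0.2, 10-way output.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 1200, bias=False), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(1200, 1200, bias=False), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(1200, 10, bias=False),
)
# lr and momentum are illustrative placeholders, not the paper's reported values.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_task(task_loader, epochs=10):
    """Supervised 'awake' training on one task (two classes at a time)."""
    model.train()
    for _ in range(epochs):
        for x, y in task_loader:           # minibatches of 100
            optimizer.zero_grad()
            loss = criterion(model(x), y)  # cross-entropy on the current task only
            loss.backward()
            optimizer.step()
```

In the class-incremental setting, each `task_loader` contains only the two classes of the current task, while the 10-way output layer is shared across all tasks.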
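The sleep phase itself (Algorithm 1) can be sketched in a few lines. The NumPy code below is a simplified illustration, not the authors' implementation: the layer scales, firing thresholds, learning rates, and number of sleep iterations are assumed hyperparameters (the paper tunes such values, e.g., with a genetic algorithm), and the Hebbian update is applied during the forward sweep rather than in a separate backward pass.

```python
import numpy as np

def sleep_phase(weights, mean_pixel_intensity, scales, thresholds,
                n_iterations=1000, lr_inc=0.001, lr_dec=0.001):
    """Simplified SRC-style sleep phase (illustrative sketch).

    weights: list of (out, in) weight matrices of a fully connected ReLU network
    mean_pixel_intensity: per-pixel mean intensity in [0, 1] over all data seen so far
    scales, thresholds: per-layer scaling factors and firing thresholds (assumed,
        tuned hyperparameters in the paper)
    """
    # Convert to the "sleep" regime: scale weights so Heaviside units remain active.
    W = [scales[l] * weights[l] for l in range(len(weights))]
    p = np.ravel(mean_pixel_intensity)

    for _ in range(n_iterations):
        # Spontaneous input: each pixel fires stochastically, driven by its mean intensity.
        pre = (np.random.rand(p.size) < p).astype(float)

        for l, w in enumerate(W):
            # Forward pass with Heaviside (spike-like) activations.
            post = (w @ pre > thresholds[l]).astype(float)

            # Local Hebbian-style update:
            #  - potentiate synapses whose pre- and post-synaptic units are both active,
            #  - depress synapses whose post-synaptic unit fired without presynaptic input.
            w += lr_inc * np.outer(post, pre)
            w -= lr_dec * np.outer(post, 1.0 - pre)
            pre = post
        # Voltage resets after spiking are implicit: activity is recomputed each iteration.

    # Revert to the awake regime: undo the scaling before further ReLU training/testing.
    return [W[l] / scales[l] for l in range(len(W))]
```

A typical usage would be to call `sleep_phase` on the network's weight matrices after training each new task, then resume ReLU-based training or testing with the returned weights.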
Key Findings
- Toy binary patches: When two tasks built from overlapping 10×10 binary images are trained sequentially, catastrophic forgetting worsens as the pixel overlap increases. SRC applied after training on the second task recovered performance on the first task for overlap ranges (e.g., 12–16 overlapping pixels) that otherwise caused forgetting. Weight histograms showed that SRC downscaled task-irrelevant connections, rendering T1-unique pixels inhibitory to T2 outputs and thereby correcting misclassification.
- Incremental datasets (Table 1):
- Incremental MNIST: Sequential 19.49% vs SRC 48.47±5.03%; OWM 77.04±2.91%; Parallel upper bound 98.02%.
- Incremental Fashion MNIST: Sequential 19.67% vs SRC 41.68±5.04%; OWM 58.35±2.05%; Parallel 87.86%.
- Cross-modal MNIST (MNIST then Fashion, or vice versa): Sequential 47.18% vs SRC 61.33±0.015; EWC/SI ≈74–75%; OWM 91.29±1.05%; Parallel ≈90.05%.
- Incremental CUB-200 (two tasks reported separately): Sequential (Task1, Task2) = (5.32%, 95.41%) indicating severe forgetting of Task1; SRC = (63.2%, 45.4%), balancing performance across tasks; OWM = (71.4%, 21.5%); Parallel = (85.49%, 79.15%).
- Incremental CIFAR-10: Sequential 19.01% vs SRC 44.55±1.45%; OWM 34.23±1.87%; Parallel 72.43%.
- Complementarity with rehearsal (Table 1, Fig. 3): Adding small fractions of old data during new-task training improves accuracy; adding SRC further boosts it. Examples:
- MNIST: Rehearsal (0.75%) 79.91±5.34% → Rehearsal+SRC 86.47±1.06%.
- Fashion MNIST: 55.19±7.74% → 67.82±3.64%.
- Cross-modal MNIST: 83.13±0.89% → 83.18±1.91% (slight gain).
- CIFAR-10: 39.39±0.64% → 58.24±0.56%.
- CUB-200 (two tasks): 42.32%, 51.49% → 56.55%, 38.05% (Task1 improved substantially; Task2 decreased).
- iCaRL + SRC (Table 2): Across memory capacities K, SRC consistently improved iCaRL performance on MNIST, Fashion MNIST, and CIFAR-10. Examples: MNIST K=100: iCaRL 65.50±4.66% → +SRC 78.09±3.16%; CIFAR-10 K=500: 54.90±1.49% → 57.50±0.61%. SRC also reduced required training epochs per task to reach given accuracies (e.g., training savings of ~3.7 epochs/task on MNIST, ~3.7 on Fashion, ~2.8 on CIFAR-10).
- Representation changes (Figs. 4–6): SRC decreased correlations between different classes while maintaining or increasing within-class correlations, increased representational sparseness, and differentially allocated neurons to distinct classes (a sketch of such analyses appears after this list). Sleep increased activation/firing of task-specific neurons relative to random subsets in both hidden layers (significant p-values with Bonferroni correction). SRC adjusted inputs to hidden neurons to favor old tasks, increasing input to old-task units and reducing input to recently learned task units, contributing to recovery of older tasks.
- Single-task effect: SRC can improve undertrained memories (low initial performance) without new data, but may not improve and can slightly reduce performance for well-trained tasks, consistent with inverse sleep benefit observed in human procedural learning.
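The representational analyses above (decorrelation and sparsification) can be illustrated with a small sketch. The code below computes a between-class correlation and a Treves–Rolls-style population sparseness from hidden-layer activations; the specific metric choices are assumptions for illustration and are not claimed to match the paper's exact definitions.

```python
import numpy as np

def representation_stats(hidden_acts, labels):
    """hidden_acts: (n_samples, n_units) hidden-layer activations
    labels: (n_samples,) class labels
    Returns mean between-class correlation and mean population sparseness.
    (Illustrative metrics, not necessarily those used in the paper.)
    """
    classes = np.unique(labels)
    # Mean activation vector per class.
    class_means = np.stack([hidden_acts[labels == c].mean(axis=0) for c in classes])

    # Pearson correlation between every pair of distinct class representations.
    corr = np.corrcoef(class_means)
    between_class_corr = corr[~np.eye(len(classes), dtype=bool)].mean()

    # Treves-Rolls population sparseness per sample, averaged over samples;
    # lower values indicate sparser codes.
    num = hidden_acts.mean(axis=1) ** 2
    den = (hidden_acts ** 2).mean(axis=1) + 1e-12
    sparseness = (num / den).mean()

    return between_class_corr, sparseness

# Usage: compare these statistics on held-out data before and after a sleep phase;
# SRC is reported to lower between-class correlation and increase sparsity
# (i.e., lower the Treves-Rolls measure).
```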
Discussion
The findings support the hypothesis that a biologically inspired, offline sleep-like replay phase can mitigate catastrophic forgetting by leveraging spontaneous reactivation and local Hebbian plasticity. SRC recovers ostensibly lost performance by re-expressing latent information preserved in synaptic weights after sequential training, reorganizing connectivity to reduce cross-talk and orthogonalize representations across tasks. This leads to increased sparsity, decorrelation between classes, and downscaling of task-irrelevant connections, effectively reallocating resources to create distinct population codes.
Compared to regularization methods, SRC operates without storing task-specific exemplars and exceeds EWC and SI in class-incremental settings, though OWM can outperform SRC on some tasks. Unlike rehearsal methods that require explicit data storage/generation, SRC uses only simple input statistics (mean pixel intensities) and spontaneous replay. Crucially, SRC is an offline procedure and can be combined with rehearsal-based methods, improving accuracy, reducing required memory capacity (e.g., iCaRL + SRC at lower K achieves performance similar to higher K), and shortening training time. These properties align with neuroscience evidence that sleep consolidates memories by spontaneous replay and local synaptic modifications, potentially integrating both REM-like and NREM-like dynamics.
Mechanistically, analyses indicate SRC decreases inter-class correlations, enhances sparsity, increases activity of task-specific neurons during sleep, and rebalances synaptic inputs to favor older tasks while pruning recent interfering pathways, thereby addressing the stability–plasticity dilemma.
Conclusion
This work introduces the Sleep Replay Consolidation (SRC) algorithm, an offline, sleep-inspired replay and local Hebbian plasticity phase interleaved with standard supervised training. Across toy and real datasets (MNIST, Fashion MNIST, CIFAR-10, CUB-200, and cross-modal MNIST), SRC reduces catastrophic forgetting, recovers performance on older tasks, and reshapes internal representations toward sparser, more orthogonal codes. SRC complements rehearsal methods, improving accuracy while lowering memory and training-time requirements, and provides a biological mechanism-based alternative to regularization approaches.
Future directions include incorporating richer sleep dynamics (combining REM- and NREM-like processes, e.g., oscillatory waves), extending SRC to convolutional layers, exploring end-to-end spiking implementations and neuromorphic hardware, optimizing sleep parameters based on biologically plausible firing targets, and scaling to larger, more complex continual learning benchmarks.
Limitations
- Performance gap to state-of-the-art rehearsal/generative replay persists in some settings; OWM outperforms SRC on several class-incremental tasks.
- Cross-modal task performance with SRC is lower than some regularization approaches (EWC/SI), despite gains over sequential training.
- Convolutional feature extractors were kept frozen; the impact of SRC on convolutional layers remains untested.
- Sleep implementation is simplified (Heaviside activation, noise-driven input based on mean pixel intensities) and depends on hyperparameters selected via genetic search.
- Some evaluations rely on pre-extracted features or pre-trained backbones, limiting assessment of fully end-to-end continual learning.
- The computational cost of sleep is comparable to an additional training pass per task, although fewer inputs are presented than in a full epoch; efficiency optimizations (e.g., mini-batching during sleep) are needed.
- Generalization to larger-scale datasets and diverse task domains requires further validation.