Dopamine transients follow a striatal gradient of reward time horizons

A. Mohebi, W. Wei, et al.

Dopamine transients in distinct rat striatal subregions convey prediction errors across different time horizons, systematically accelerating from ventral to dorsomedial to dorsolateral regions. These horizons span behaviors from seconds to hours, enabling efficient learning and adaptive motivation.
Introduction

Animal behavior is driven by expectations of future rewards, and a central question in behavioral neuroscience is how these expectations are formed and updated. Reinforcement learning (RL) theory proposes that agents compute state values and update them via reward prediction errors (RPEs), signals thought to be encoded by brief fluctuations in midbrain dopamine (DA) neuron activity and striatal DA release. A critical aspect of prediction is timescale: future rewards are discounted in time, and the rate of discounting shapes learning and decision-making. Excessive discounting is implicated in psychiatric disorders such as addiction. While DA RPEs are often considered a uniform scalar broadcast, animals must evaluate outcomes across multiple timescales to support both rapid motor sequences and prolonged goal pursuit. Anatomical and functional evidence suggests heterogeneity across striatal subregions (dorsolateral, dorsomedial, ventral), potentially reflecting a hierarchy of timescales for evaluative computations. The present study tests the hypothesis that striatal subregions implement distinct time horizons for reward prediction and learning, manifested in the dynamics of DA transients.

Literature Review

Prior work established DA RPE signaling and its role in updating value (Schultz 1998; Hart et al. 2014), with broad projections to striatum implicated in value-based decision-making. Temporal discounting provides a normative rationale for devaluing distant rewards, with abnormal discounting linked to psychiatric conditions (Amlung et al. 2019; Bickel & Marsch 2001). Empirical studies reveal heterogeneity in DA neuron firing and striatal DA release across subregions (Parker et al. 2016; Tsutsui-Kimura et al. 2020; Brown et al. 2011) consistent with distinct cortico-basal ganglia loops and hierarchical RL (Frank & Badre 2012; Ito & Doya 2011). Theoretical accounts predict gradients of discounting across striatum, but evidence has been mixed (Tanaka et al. 2004; Kobayashi & Schultz 2008; Enomoto et al. 2020). Multiple-timescale learning benefits both animal behavior and artificial agents (Kurth-Nelson & Redish 2009; Iigaya et al. 2019; Fedus et al. 2019).

Methodology

Subjects: Twenty adult Long-Evans rats (15 males) on a reverse light cycle; all procedures were approved by the UCSF IACUC.

Viral expression and photometry: AAV-DJ-CAG-dLight1.3b was injected into dorsolateral striatum (DLS), dorsomedial striatum (DMS), and ventral striatum (VS; NAc core), with optical fibers implanted above the injection sites. Fiber photometry measured DA-dependent fluorescence (470 nm) with an interleaved isosbestic control (405/415 nm). Signals were normalized within subject/region to the mean peak response to unexpected reward delivery in the same session. Histology verified placements and expression (anti-GFP immunofluorescence).

Behavioral tasks: (1) Spontaneous DA dynamics were recorded in unrestrained rats, with an unexpected sugar-pellet 'click' used as a probe. (2) Instrumental task: rats made nose pokes; reward probabilities (left/right) shifted between 10%, 50%, and 90% in blocks of 35–45 trials; this task related DA RPEs at the reward click to reward rate estimated from recent history. (3) Pavlovian task: auditory cues (2, 5, 9 kHz pip trains, 2.6 s) predicted a reward-delivery click after a delay (0.5 s trace) with probabilities of 75%, 25%, or 0%; trials included uncued rewards; 60 trials per type per day for 15 days. (4) Multiple-delay task: three cues indicated delays to potential reward (0.6, 3, 12 s; 75% reward probability for all), plus uncued rewards; 60 trials per type; intertrial interval 15–30 s.

Computational analyses: Reward history was modeled with (i) a leaky integrator of rewards with time constant τ (1–2,500 s), fit per site to maximize the negative correlation between estimated reward rate and the DA response to the reward click; and (ii) a delta-rule learner with learning rate α (0–1), tuned to maximize the DA–RPE correlation.
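The two reward-history models can be sketched in a few lines. This is a minimal illustration, not the authors' code: function names, the illustrative reward sequence, and the α values are assumptions.

```python
import numpy as np

def leaky_integrator_rate(reward_times, t_eval, tau):
    """Leaky-integrator estimate of reward rate at time t_eval: each past
    reward contributes exp(-(t_eval - t_r) / tau), scaled by 1/tau."""
    times = np.asarray(reward_times, dtype=float)
    past = times[times <= t_eval]
    return float(np.sum(np.exp(-(t_eval - past) / tau)) / tau)

def delta_rule(rewards, alpha):
    """Trial-by-trial delta-rule value learner; returns the per-trial
    reward prediction errors (r - V)."""
    v, rpes = 0.0, []
    for r in rewards:
        rpe = r - v
        v += alpha * rpe
        rpes.append(rpe)
    return rpes

# A fast learner (high alpha, 'DLS-like') weights only recent outcomes;
# a slow learner (low alpha, 'VS-like') integrates a longer history.
rpes_fast = delta_rule([1, 1, 0, 1], alpha=0.5)
rpes_slow = delta_rule([1, 1, 0, 1], alpha=0.05)
```

In the fitting procedure described above, τ (or α) would be swept over its range per recording site and the value maximizing the (negative) correlation with the measured DA response retained.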
Within-trial value dynamics were modeled with (i) a complete serial compound (CSC) temporal-difference model with discount factors mapped to exponential time constants (τ = 2, 10, 1,000 s), including overlapping cue features for generalization; and (ii) a composite recurrent neural network (RNN) actor–critic with three LSTM subnetworks ('DLS', 'DMS', 'VS') differing only in discount time constant (τ = 2, 10, 1,000 s), all receiving identical inputs and projecting to a shared actor trained with PPO; TD RPEs from each subnetwork were compared to DA transients.

Statistics: ANOVAs, repeated-measures ANOVAs, and regression analyses as specified; sample sizes were provided per figure; exclusion criteria were applied for weak signals; normalization procedures were standardized across sessions.
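The within-trial CSC value computation described in the Methodology can be sketched as a tapped-delay-line TD(0) model in which the per-step discount factor is derived from an exponential time constant (γ = e^(−Δt/τ)). The trial timing, learning rate, and number of passes below are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def csc_td(reward_vec, tau, dt=0.1, alpha=0.1, n_passes=200):
    """Complete serial compound TD(0): one value weight per within-trial
    time step, with discount factor gamma = exp(-dt / tau)."""
    gamma = np.exp(-dt / tau)
    T = len(reward_vec)
    V = np.zeros(T + 1)  # learned value at each time step (V[T] stays 0)
    for _ in range(n_passes):
        for t in range(T):
            rpe = reward_vec[t] + gamma * V[t + 1] - V[t]
            V[t] += alpha * rpe
    return V[:T]

# Reward delivered 1.0 s after trial start (dt = 0.1 s grid). A short
# time constant ('DLS-like') discounts steeply, so value at the cue
# (t = 0) is much smaller than with a long time constant ('VS-like').
r = np.zeros(20)
r[10] = 1.0
v_fast = csc_td(r, tau=0.5)
v_slow = csc_td(r, tau=100.0)
```

This reproduces the qualitative point made in the findings: faster discounting yields smaller cue-evoked values (and hence smaller cue-evoked RPEs) while leaving the response to the reward itself intact.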

Key Findings
  • DA dynamics vary systematically across striatal subregions. Spontaneous DA fluctuations and evoked transients were fastest in DLS, intermediate in DMS, and slowest in VS. Autocorrelogram decay constants differed by subregion (one-way ANOVA: F(2,23)=22.9, P=3.4×10⁻⁵), as did reward click-evoked transient half-widths (F(2,23)=24.2, P=2.2×10⁻⁵), shortest in DLS and longest in VS.
  • Distinct reward-history timescales shape DA RPEs. In the instrumental task, DA responses to reward delivery decreased with higher recent reward rate, consistent with RPE coding in all regions, but the best-fit history timescale depended on subregion: the τ maximizing the DA–RPE correlation was shortest in DLS, intermediate in DMS, and longest in VS (repeated-measures ANOVA: F(2,39)=23.6, P=2.0×10⁻⁸). A delta-rule learner showed an analogous gradient in learning rate α (highest in DLS, lowest in VS; one-way ANOVA: F(2,39)=23.2, P=2.2×10⁻⁵).
  • Pavlovian cue responses were region specific. After training, DA responses scaled with cue reward probability, but patterns differed by subregion with a significant CUE × AREA interaction (two-way ANOVA: F(4,66)=6.4, P=0.0002). DMS showed strong discrimination with negative responses to the 0% cue; VS showed slower learning and poorer cue discrimination, remaining positive to all cues early and mid training; DLS cue responses were small but reward-evoked responses were reduced when preceded by predictive cues, indicating learned prediction.
  • Modeling links subregional patterns to discounting. CSC models showed that faster discounting yields smaller cue-evoked RPEs despite preserved reward responses, consistent with DLS. A composite RNN with only differing discount time constants (τ: DLS 2 s; DMS 10 s; VS 1,000 s) reproduced key features: diminutive DLS cue responses, negative DMS response to 0% cue, and slow/poor VS cue discrimination early in training.
  • The multiple-delay task confirmed region-specific discounting. DA responses decreased with increasing cue-indicated delay in all regions, but discount rates differed by subregion: the fitted decay rate r differed by region under both exponential fits (one-way ANOVA: F(2,32)=13.6, P=5.2×10⁻⁶) and hyperbolic fits (F(2,32)=12.8, P=7.9×10⁻⁶), with the ordering indicating fastest discounting in DLS and slowest in VS. Cue-dependent anticipatory head entries confirmed that rats learned the cue–delay contingencies, and DA peak responses showed a significant CUE × AREA interaction (two-way ANOVA: F(4,96)=29.3, P=1.3×10⁻²²).
  • Behavioral acquisition showed increasing cue discrimination over 15 days, with a significant CUE × DAY interaction (F(28,336)=12.3, P=0.0001).
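The exponential and hyperbolic discount fits used in the multiple-delay task can be sketched as simple regressions of DA peak response against cue-indicated delay. The data points below are hypothetical values for illustration; the fitting is done via standard linearizations rather than the authors' actual procedure.

```python
import numpy as np

def fit_exponential(delays, responses):
    """Fit response = A * exp(-r * delay) by linear regression on
    log(response): log y = log A - r * d."""
    slope, intercept = np.polyfit(delays, np.log(responses), 1)
    return np.exp(intercept), -slope  # A, r

def fit_hyperbolic(delays, responses):
    """Fit response = A / (1 + r * delay) via the linearization
    1/response = 1/A + (r/A) * delay."""
    slope, intercept = np.polyfit(delays, 1.0 / np.asarray(responses), 1)
    A = 1.0 / intercept
    return A, slope * A  # A, r

# Hypothetical normalized DA peaks at the task's three cue delays
delays = np.array([0.6, 3.0, 12.0])   # seconds
da_peak = np.array([0.9, 0.55, 0.2])  # fraction of unexpected-reward peak

A_e, r_e = fit_exponential(delays, da_peak)
A_h, r_h = fit_hyperbolic(delays, da_peak)
# Comparing the fitted r across DLS/DMS/VS sites would reveal the
# subregional gradient in discount rate reported above.
```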
Discussion

The findings demonstrate a consistent gradient of evaluative timescales across striatum—DLS fastest, DMS intermediate, VS slowest—spanning spontaneous DA tempo, temporal integration of prior rewards, and discounting of future rewards. This suggests that subregions implement parallel valuation systems over distinct horizons, rather than a single global DA RPE signal. The gradient aligns with hierarchical cortico-basal ganglia organization: DLS supports rapid, fine-grained motor control with brief state representations and fast DA reuptake, naturally favoring short time horizons and rapid discounting; VS supports prolonged, abstract representations needed for sustained motivation, favoring long horizons, slower discounting, and extended integration of past rewards. The RNN model shows that long horizons impair rapid cue discrimination because values average over outcomes across multiple trials, explaining slower VS cue discrimination and persistent positive responses to nonrewarded cues early in training. The linkage between past and future horizons may reflect data sampling requirements: estimating further into the future benefits from integrating more extensive past histories. The work refines understanding of DA transients, highlighting that subregional differences in time horizons shape RPE expression and learning dynamics, with implications for impulsivity control and hierarchical RL in the brain.

Conclusion

This study establishes that dopamine transients encode prediction errors over distinct time horizons along a striatal gradient (DLS fast, DMS intermediate, VS slow). Differences in spontaneous DA tempo, integration of past rewards, and discounting of future rewards converge on a unified framework of parallel valuation systems with region-specific timescales. Computational models show that varying only discount timescales reproduces key subregional DA patterns and learning dynamics. These insights help explain how animals balance rapid motor evaluation with sustained motivation for delayed outcomes. Future work should investigate mechanisms linking past and future horizons, the regulation and arbitration among multiple concurrent value estimates, and how motivation-related DA signals (e.g., ramps and tonic changes) interact with RPEs across regions to influence behavior and disorders of impulsivity.

Limitations
  • The CSC model could not capture all observed features (e.g., negative responses to a 0% cue in DMS), motivating use of RNNs; nevertheless, models focused on RPE-related transients and did not incorporate all DA signal components such as session-level tonic shifts or ramping associated with motivation.
  • The study does not address how the brain arbitrates among multiple concurrent value estimates across regions during decision-making, nor how these interact to produce time-inconsistent choices.
  • Photometry measures bulk DA-related fluorescence and cannot resolve cell-type-specific or axon-specific contributions; normalization precludes direct comparison of absolute DA levels across subjects/regions.
  • Data collection and analysis were not blinded; sample sizes were not predetermined; some sessions/fibers with weak signals were excluded.
  • Generalization to other species, tasks with different structures, or pathological states is not directly tested.