Health and Fitness
Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health
H. Kumar, T. Li, et al.
The paper addresses how to improve engagement and effectiveness of digital mental health (DMH) text-messaging interventions by adaptively personalizing content. The research question is whether contextual multi-armed bandit (MAB) algorithms (e.g., Thompson Sampling) can simultaneously optimize user rewards (e.g., perceived helpfulness, engagement) and collect data suitable for rigorous statistical analysis by behavioral scientists. The context is the increased need for scalable mental health support, especially post-COVID-19, and the limitations of uniform one-size-fits-all messaging. The study’s purpose is to build and evaluate a system that instruments modular message components to run adaptive experiments in the wild, balancing exploration–exploitation with the need for unbiased, sufficiently powered inference. The importance lies in enabling continuous improvement and personalization of DMH interventions while ensuring scientific validity and generalizability of results.
The paper reviews evidence that DMH interventions, including smartphone and text-message modalities, can be effective but face low engagement in real-world use. Automated messaging has shown effects on behavior change and mental health outcomes. The authors discuss adaptive experiments via contextual MAB algorithms that allocate more users to effective arms over time. Prior work has identified statistical complications of bandit-based data collection, including biased arm mean estimates, elevated false positive rates, and reduced power, which may hinder accurate identification of effective conditions. This motivates systems that can compare algorithms and quantify trade-offs between reward optimization and inferential rigor.
Intervention and system: The authors developed a software system (open-source testbed) to deliver an 8-week text-message DMH intervention with modular, self-contained dialogues. Messages have three adaptive decision points in a factorial design: Rationale (present vs absent), Link (present vs absent), and Interaction type (four options). The system embeds adaptive experiments at these points.
Contexts and rewards: The system captures multiple user contexts (12 total; examples include Mood & Energy, binary high vs low/medium; K10 distress scale, 1–4; Recent Activity in the past 48h, binary). Rewards include message helpfulness ratings (1–5, scaled) and link clicks; the system supports rapid switching and combination of rewards.
Bandit formulation: Each message component is modeled as a separate contextual bandit problem. The objective over horizon T is to choose actions that maximize expected cumulative reward. The implementation focuses on contextual linear Thompson Sampling (TS): rewards are modeled as linear-Gaussian in the context; at each step, parameters are sampled from the posterior and the arm maximizing predicted reward is chosen. Priors are set to be relatively uninformative to encourage exploration.
Pilot and simulations: After an initial pilot with 50 participants to validate data pipelines and algorithm behavior, the team designed simulation scenarios to evaluate reward, false positive rate (FPR), and statistical power under different conditions.
Simulation setup: contexts are sampled uniformly; raw rewards are drawn from a Normal distribution, then discretized to {0, 0.25, 0.5, 0.75, 1}. Three scenarios:
- Scenario 1: No arm difference: r_i(t) ~ N(0.5, (1/6)^2) for both arms.
- Scenario 2: Substantial arm difference: arm 1 better with effect depending on context x_i: r_i(t) ~ N(0.5 + 1/8 * x_i, (1/6)^2).
- Scenario 3: Context-dependent optimal arm with an interaction between Rationale and Mood: r_i(t) ~ N(0.25 + 0.25·m_e − 0.25·x_i + 0.625·x_i·m_e, (1/6)^2), where x_i = 1 for Rationale (0 otherwise) and m_e = 1 for high mood (0 otherwise). This yields the stated cell means: Rationale is better for high mood (expected 0.875 vs 0.5), and No Rationale is better for low mood (0.25 vs 0).
They compared Contextual TS to Uniform Random (UR) allocation at N=100 and N=1000, estimating FPR and power via posterior sampling and 95% confidence intervals.
Real-world deployment: The system was deployed to 1,100 users (recruited via the MHA website) over multiple batches; users received 2–3 modular message sequences per week for 8 weeks. Analysis focused on Rationale (present/absent) and Link (present/absent). Engagement and reward outcomes were compared between Contextual TS and UR, and contextual heterogeneity was explored for Mood and Recent Activity. Temporal arm-allocation dynamics were visualized over monthly periods.
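The contextual linear TS loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact implementation: the feature coding ([1, arm, mood, arm×mood]), prior scale, noise variance, and the context-dependent true means are all assumptions chosen to mirror the simulation setup (Normal rewards discretized to the 5-point grid).

```python
import numpy as np

rng = np.random.default_rng(0)

def discretize(r):
    # Clip to [0, 1] and snap to the 5-point grid {0, 0.25, 0.5, 0.75, 1}.
    return np.round(np.clip(r, 0.0, 1.0) * 4) / 4

def linear_ts_step(mu, Sigma, contexts_per_arm):
    """One contextual linear Thompson Sampling decision.

    mu, Sigma: posterior mean/covariance over reward-model weights.
    contexts_per_arm: one feature vector per arm.
    Returns the arm maximizing the sampled model's predicted reward.
    """
    theta = rng.multivariate_normal(mu, Sigma)       # posterior draw
    return int(np.argmax([x @ theta for x in contexts_per_arm]))

def bayes_linreg_update(mu, Sigma, x, r, noise_var=1.0):
    # Standard conjugate linear-Gaussian posterior update.
    Sigma_inv = np.linalg.inv(Sigma)
    new_Sigma = np.linalg.inv(Sigma_inv + np.outer(x, x) / noise_var)
    new_mu = new_Sigma @ (Sigma_inv @ mu + x * r / noise_var)
    return new_mu, new_Sigma

# Two-arm component (e.g. Rationale present/absent), one binary context (mood).
# Features: [1, arm, mood, arm*mood].
d = 4
mu, Sigma = np.zeros(d), 10.0 * np.eye(d)   # relatively uninformative prior

def true_mean(arm, mood):
    # Hypothetical context-dependent means: arm 1 better under high mood.
    return 0.5 + 0.125 * arm * mood - 0.125 * arm * (1 - mood)

T = 1000
total = 0.0
for t in range(T):
    mood = int(rng.integers(0, 2))
    feats = [np.array([1.0, a, mood, a * mood]) for a in (0, 1)]
    arm = linear_ts_step(mu, Sigma, feats)
    r = discretize(rng.normal(true_mean(arm, mood), 1 / 6))
    mu, Sigma = bayes_linreg_update(mu, Sigma, feats[arm], r)
    total += r

print(round(total / T, 3))
```

Because the sampled model includes the arm×mood interaction term, the policy can learn to route high-mood users to arm 1 and low-mood users to arm 0, which is the kind of context-dependent allocation the paper's Scenario 3 is designed to test.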
Simulations:
- Reward: Contextual TS achieved higher average rewards than UR when arm differences existed and adapted assignments by context (e.g., assigning high-mood participants to Rationale and low-mood to No Rationale).
- FPR (Scenario 1, no arm difference): At N=100, TS=0.03 vs UR=0.04; at N=1000, TS=0.07 vs UR=0.04. Using an uninformative prior led to more exploration and lower FPR at small N; at larger N, TS’s exploitative allocations increased imbalance, slightly raising FPR.
- Power: Scenario 2 (substantial arm difference): TS power for Rationale = 0.72 (N=100) and 0.97 (N=1000); UR = 0.87 and 1.00. Scenario 3 (context-dependent): TS power (N=100, N=1000)—Rationale = 0.92, 1.00; Mood = 0.41, 0.47; Rationale×Mood = 1.00, 1.00. UR power—Rationale = 1.00, 1.00; Mood = 0.95, 1.00; Interaction = 1.00, 1.00. TS showed reduced power for main effects such as Mood due to adaptive allocation by context, but high power for detecting arm and interaction effects.
Real-world deployment:
- Engagement: 8,521 total arm assignments yielded 813 ratings (9.54% rating engagement) from 230 unique users (20.9% of users). Response rates did not differ substantially between Link and No Link: with a link, 171/781 (22%) responded; without, 160/788 (20%); overall, 331/1,569 (21%) responded in this subset analysis.
- Efficiency (reward): For Link, TS increased mean rewards relative to UR—Link present TS mean=0.790 (SEM 0.018, N=232) vs UR=0.719 (0.021, N=202); Link absent TS=0.716 (0.024, N=175) vs UR=0.640 (0.024, N=204). For Rationale, TS present=0.736 (0.018, N=282) vs UR=0.710 (0.022, N=192); TS absent=0.728 (0.025, N=167) vs UR=0.703 (0.024, N=171).
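From the reported summary statistics alone, one can sanity-check whether a TS-vs-UR mean difference exceeds sampling noise via a two-sample z statistic. A minimal sketch (this is an illustrative check on the reported numbers, not the paper's own analysis):

```python
import math

def z_from_summary(m1, sem1, m2, sem2):
    # Two-sample z statistic from reported means and standard errors,
    # under a normal approximation with independent groups.
    return (m1 - m2) / math.sqrt(sem1 ** 2 + sem2 ** 2)

# Reported Link-present means: TS 0.790 (SEM 0.018) vs UR 0.719 (SEM 0.021).
z = z_from_summary(0.790, 0.018, 0.719, 0.021)
print(round(z, 2))  # → 2.57
```

A z of about 2.57 exceeds the conventional 1.96 threshold, suggesting the Link-present reward gain for TS over UR is unlikely to be pure sampling noise under this approximation; the smaller Rationale differences (e.g. 0.736 vs 0.710) would not clear that bar.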
- Contextual effects: With Mood as a context, TS achieved slightly higher average rewards than UR, consistent with simulations. Effects by Recent Activity were more pronounced for recently active participants; the 48h cutoff may warrant revision.
- Adaptation over time: TS adjusted arm allocations across months—after a marginal early advantage for No Link, TS shifted to allocate more Link as more favorable responses accumulated.
The findings demonstrate that contextual TS can improve user-experienced rewards in DMH messaging by personalizing content, while still enabling meaningful statistical analyses. However, adaptive allocation introduces trade-offs: slight inflation of FPR at large sample sizes and reduced power for certain main effects (e.g., Mood) due to imbalanced assignment across contexts. The system’s flexibility—rapidly changing algorithms, rewards, and contexts; adding/removing arms; and redeploying—supports iterative product improvement and rigorous behavioral science. Real-world deployment corroborated simulation insights: TS improved average rewards without large differences in engagement rates by arm, and it adapted allocations over time as evidence accumulated. The platform serves both to optimize participant experience and as a testbed to study methodological properties (reward vs inference trade-offs), aiding domain scientists in drawing generalizable conclusions about heterogeneous treatment effects. Incorporating fairness-aware objectives and better handling of non-stationarity, missingness, and repeated measures are important next steps.
The paper introduces an adaptive experimentation system for DMH text messaging that operationalizes contextual MAB algorithms (notably contextual Thompson Sampling) at multiple decision points within modular dialogues. Through simulations and a large-scale deployment (1,100 users), the authors show that adaptive methods can increase rewards and personalize experiences while maintaining acceptable inferential properties, though with trade-offs in FPR and power for certain effects. Contributions include: an open-source, production-ready testbed for rapid algorithm integration; a design demonstrating how to balance reward optimization with statistical rigor; and empirical evidence from both simulations and field data. Future work will scale deployments (e.g., to 5,000 users), explore alternative and composite reward definitions, refine context usage and cutoffs, integrate fairness metrics, and extend to scenarios with complex interactions, non-stationarity, missing data mechanisms, and longitudinal dependencies.
Limitations:
- Adaptive allocation can inflate false positive rates at larger sample sizes due to arm imbalance (TS FPR ≈ 0.07 vs UR ≈ 0.04 at N=1000).
- Reduced power for detecting certain main effects (e.g., Mood) under contextual TS because of preferential assignment aligned with context, limiting across-arm comparisons.
- Low engagement typical of DMH interventions (9.54% ratings across assignments; 20.9% of users provided ratings) constrains statistical power, especially for interaction tests in early stages.
- Early-stage contextual effect checks have higher power requirements; detecting interactions reliably may require larger samples or alternative designs.
- Potential real-world complexities (non-stationarity, missing data patterns, repeated measures, unforeseen subgroups) may violate model assumptions and affect inference; these are acknowledged but not fully addressed in current analyses.