A critical evaluation of QIDS-SR-16 using data from a trial of psilocybin therapy versus escitalopram treatment for depression

B. Weiss, D. Erritzoe, et al.

This research, conducted by Brandon Weiss, David Erritzoe, Bruna Giribaldi, David J Nutt, and Robin L Carhart-Harris, examines the limitations of the Quick Inventory of Depressive Symptomatology (QIDS-SR-16) and presents evidence that psilocybin therapy may outperform escitalopram treatment in alleviating symptoms of major depressive disorder.

Introduction

The study investigates why the QIDS-SR-16 failed to show a treatment difference in a randomized trial of psilocybin therapy (PT) versus escitalopram treatment (ET) for major depressive disorder, where 14 of 16 efficacy outcomes favored PT. The authors question whether QIDS-SR-16’s null result reflects measurement insensitivity rather than true equivalence between treatments. They contextualize concerns about relying on unidimensional sum-scores for depression, given heterogeneity of symptoms and evidence for a core depression factor more closely tied to causal centrality and psychosocial impairment. The purpose is to psychometrically evaluate QIDS-SR-16 relative to other scales and re-examine treatment differences at item, facet, and factor levels, potentially clarifying the original trial’s findings and informing broader measurement practice in depression research.

Literature Review

The paper reviews the origins of QIDS-SR-16 as a concise instrument aligned with DSM criteria and derived from the IDS. Literature indicates depression scales often use sum-scores assuming unidimensionality, despite evidence for multidimensionality and heterogeneous symptom networks. QIDS-SR-16 includes a high proportion of compound items (about 90%), compared with lower proportions in HAMD/HRS, MADRS, and BDI, potentially inflating variance and reducing test–retest reliability. Prior work shows modest test–retest reliability for QIDS-SR-16 (ICC ~0.49–0.77) and raises concerns about its compound criteria for the sleep, weight/appetite, and psychomotor domains, where only the single highest-scored item among oppositely valenced items (e.g., insomnia vs hypersomnia) counts toward the total. Large datasets (e.g., STAR*D) suggest weak to moderate intercorrelations among these items, supporting concerns about measurement error. Critiques of DSM-based symptom coverage argue DSM criteria may not capture core, causally central aspects of depression linked to impairment; non-DSM symptoms (e.g., guilt, anhedonia) may be more central. The authors cite network and process-based models advocating symptom- and facet-level measurement and note previous factor-analytic work (e.g., Ballard et al., 2018) identifying distinct depression facets.

Methodology

Data source: Carhart-Harris et al. (2021) randomized clinical trial (NCT03429075) with 59 adults with MDD randomized to PT (n=30) or ET (n=29). PT arm received 25 mg psilocybin (COMP360) at weeks 0 and 3 with daily placebo capsules; ET arm received 1 mg psilocybin at dosing visits plus daily escitalopram (10 mg then 20 mg from week 3). Blinding of investigators/medication staff was maintained. Ethics approvals were obtained from relevant UK bodies.
Measures: Primary measure QIDS-SR-16 administered weekly; analyses use baseline, week 5, and week 6. Internal consistency α ranged from 0.75 (baseline) to 0.89 (weeks 5/6). Two alternative QIDS composites were created to remove compound-criterion scoring: (1) all items except Sleeping too much, Increased appetite, Increased weight; (2) all items except Sleeping too much, Decreased appetite, Decreased weight.
Secondary measures: MADRS (10-item clinician-rated), HRS/HAMD-17 (clinician-rated), BDI-IA (21-item self-report), and SHAPS (anhedonia).
Creation of narrow facets: Items (78 total) from QIDS, BDI-IA, MADRS, HRS, and SHAPS were scaled to 0–1 and mapped to Ballard et al.’s (2018) factor structure via item correspondence/rational allocation. Low-endorsement and low item–total items were excluded; Tension factor was dropped. Resulting facets: Amotivation, Reduced Appetite, Impaired Sleep, Suicidal Thoughts, Negative Cognition, Depressed Mood, and Anhedonia (with reported α at baseline and 6 weeks).
Single depression factor: EFA across items from QIDS, BDI-IA, HRS, MADRS (scaled to 0–1) with item composites to manage dimensionality; two HRS items excluded for low variance. One factor extracted (OLS factoring) accounting for 15% of variance; loadings >0.40 defined the factor score (0–1 scale; α=0.84 baseline, 0.95 at 6 weeks).
Expectancy: Pre-dosing patient ratings of expected improvement after PT and ET were collected on 0–100 scales; a relative expectancy score (PT minus ET) was computed for N=55.
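The 0–1 item rescaling and the internal-consistency (α) figures described above can be illustrated with a minimal Python sketch. It assumes that rescaling means dividing each raw score by the item's maximum possible points and that α is the usual Cronbach's alpha; the function names and the example ratings are hypothetical and are not taken from the trial data.

```python
from statistics import variance


def scale_item(score, max_points):
    """Rescale a raw item score to 0-1 by dividing by its maximum points."""
    return score / max_points


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of participants, each a list of item scores."""
    k = len(rows[0])                                   # number of items
    item_columns = list(zip(*rows))                    # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in rows])   # variance of sum-scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical ratings: four participants, three items scored 0-3.
raw = [[0, 1, 1], [1, 2, 2], [2, 2, 3], [3, 3, 3]]
scaled = [[scale_item(score, 3) for score in row] for row in raw]
alpha = cronbach_alpha(scaled)
print(f"alpha = {alpha:.2f}")
```

Dividing every item by the same constant leaves α unchanged, so the rescaling matters mainly when items with different ranges (for example 0–3 and 0–6) are pooled into one facet or factor score.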
Analytic plan: Linear mixed-effects models (R lme4) regressed each outcome on Time × Condition with a random intercept to estimate between-condition differences in change. The first set of analyses had four steps: (1) identify the most differentially responsive symptoms across scales and assess their coverage by QIDS-SR-16; (2) compare QIDS items to analogous items on other scales (all scores scaled by points possible); (3) assess QIDS compound criteria for inconsistency (whether the highest-scored item changes from baseline to week 6) and for inter-item correlations at baseline and for change; (4) compare standard errors and variance components across scaled scale mean-scores. The second set fitted LME models for the Ballard-based facets and the EFA-derived depression factor; where the interaction was significant, models including Relative Expectancy interactions were also tested. Significance was set at p<0.05, and both standardized (b) and unstandardized (B) coefficients are reported.
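To make the Time × Condition interaction concrete, the sketch below computes a difference in mean change between two arms. This is not the authors' lme4 model: it is a simplified stand-in which assumes only two time points (baseline and week 6) and complete data, in which case the interaction's point estimate should coincide with this difference. It provides no standard errors and does not handle missing observations, which the mixed model does. All names and values are hypothetical.

```python
from statistics import mean


def mean_change(baseline, followup):
    """Average within-person change (follow-up minus baseline)."""
    return mean(f - b for b, f in zip(baseline, followup))


def time_by_condition(pt_base, pt_week6, et_base, et_week6):
    """Difference in mean change, PT minus ET.

    Negative values mean a larger reduction in the PT arm.
    """
    return mean_change(pt_base, pt_week6) - mean_change(et_base, et_week6)


# Hypothetical 0-1 scaled scores for three participants per arm.
pt_base, pt_week6 = [0.6, 0.7, 0.5], [0.2, 0.3, 0.2]
et_base, et_week6 = [0.6, 0.6, 0.6], [0.4, 0.5, 0.3]

interaction = time_by_condition(pt_base, pt_week6, et_base, et_week6)
print(f"PT minus ET difference in change: {interaction:.3f}")
```

Because lower scores mean less severe symptoms, a negative value corresponds to the direction reported as favoring PT in the findings.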

Key Findings

Item-level differentially responsive symptoms favoring PT included: depressed mood (MADRS Reported Sadness B≈−0.20), lassitude (MADRS Lassitude B≈−0.18), somatic energy (HRS Somatic energy B≈−0.21), work/interests (HRS Work and interests B≈−0.18), agitation (HRS Agitation B≈−0.18), libido/sexual interest (HRS Libido B≈−0.38; BDI Reduced sexual interest B≈−0.19), guilt (BDI Guilt B≈−0.23), dissatisfaction with life (BDI Dissatisfaction with life B≈−0.19), and worthlessness (BDI Worthlessness B≈−0.16).
QIDS-SR-16 coverage gaps and insensitivity: Several highly responsive domains (guilt, anhedonia, libido, perceived attractiveness) were absent or underrepresented in QIDS-SR-16. QIDS items are often compound (e.g., View of myself combines worthlessness, guilt, and self-criticism) and may have response options that are poorly ordinal; the Energy level item combines heterogeneous content (fatigue and work behaviors), potentially masking change. The suicidality item's wording includes “thoughts of death,” possibly capturing non-dysphoric mortality salience after psychedelics. The sleep items changed in mixed directions, so combining them into a single criterion likely masked effects (e.g., Falling asleep B≈−0.15 and Sleeping too much B≈−0.11 vs Sleep during the night B≈+0.05).
Compound-criteria instability: Inconsistency in highest-scored item from baseline to week 6 occurred in Sleep (22%), Weight/Appetite (19%), and Psychomotor (7%). Inter-item correlations within criteria were weak or inconsistent, suggesting they may not index a single construct.
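The compound-criterion check described above can be sketched in Python as follows. The sketch assumes that a criterion's score is the highest rating among its items and that a participant counts as inconsistent when the top-rated item(s) at week 6 share nothing with the top-rated item(s) at baseline; the handling of ties is an assumption made here, not something the summary specifies. The item names are a subset of the QIDS sleep items mentioned above, and the ratings are hypothetical.

```python
def criterion_score(items):
    """Compound-criterion score: the highest rating among the criterion's items."""
    return max(items.values())


def top_items(items):
    """Names of the item(s) holding the highest rating (ties return several)."""
    top = max(items.values())
    return {name for name, score in items.items() if score == top}


def inconsistency_rate(baseline, week6):
    """Share of participants whose top item(s) differ entirely between visits."""
    changed = sum(
        1 for b, w in zip(baseline, week6)
        if top_items(b).isdisjoint(top_items(w))
    )
    return changed / len(baseline)


# Hypothetical ratings (0-3) for two participants on three sleep items.
baseline = [
    {"Falling asleep": 3, "Sleep during the night": 1, "Sleeping too much": 0},
    {"Falling asleep": 2, "Sleep during the night": 1, "Sleeping too much": 0},
]
week6 = [
    {"Falling asleep": 0, "Sleep during the night": 1, "Sleeping too much": 2},
    {"Falling asleep": 1, "Sleep during the night": 0, "Sleeping too much": 0},
]

rate = inconsistency_rate(baseline, week6)
print(f"inconsistent participants: {rate:.0%}")
```

In this toy data the first participant moves from an insomnia item to a hypersomnia item while the criterion score only falls from 3 to 2, which illustrates how a single maximum can hide a change in what is being measured.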
Variance and precision: Relative to the other scales, QIDS-SR-16 had a higher baseline SD (47% higher than BDI-IA, 74% higher than MADRS, 135% higher than HRS) and a higher change-score SD (11%, 14%, and 58% higher, respectively). The SE of the Time×Condition interaction was also larger for QIDS than for the other scales (e.g., 76% larger than for HRS), indicating reduced precision and higher measurement noise.
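The percentage comparisons of spread can be read as ratios of standard deviations computed on scores that were first scaled to a common 0–1 range. The following Python sketch shows that arithmetic under this assumption; the function name and both score lists are hypothetical and do not reproduce the trial's values.

```python
from statistics import stdev


def pct_higher_sd(scores_a, scores_b):
    """Percentage by which the SD of scores_a exceeds the SD of scores_b."""
    return (stdev(scores_a) / stdev(scores_b) - 1) * 100


# Hypothetical 0-1 scaled mean-scores on two scales for the same participants.
scale_a = [0.2, 0.4, 0.6, 0.8]
scale_b = [0.3, 0.4, 0.5, 0.6]

excess = pct_higher_sd(scale_a, scale_b)
print(f"scale A's SD is {excess:.0f}% higher than scale B's")
```

As the limitations note, a larger SD on a common scale does not by itself prove more measurement error, because part of the extra spread could be true variance.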
Facet-level outcomes (Ballard et al. mapping): Significant Time×Condition interactions favored PT for Depressed mood (B_int=−0.11, b_int=−0.68, p=0.013) and Anhedonia (B_int=−0.12, b_int=−0.79, p=0.001), reflecting greater reductions of 0.68 and 0.79 SDs, respectively, in PT vs ET from baseline to 6 weeks. No significant differences in Amotivation, Negative Cognition, Reduced Appetite, Impaired Sleep, or Suicidal Thoughts.
Single depression factor (EFA across four scales): Significant moderation by Condition (B_int=−0.09, b_int=−0.55, p=0.035), indicating a 0.55 SD greater reduction in PT vs ET on a core depression factor capturing depressed mood, negative self-appraisal, and amotivation.
Overall, multiple convergent analyses suggest PT is superior to ET in reducing core depressive symptoms, notably depressed mood and anhedonia, and improving sexual function, with QIDS-SR-16 likely under-detecting these differences due to psychometric limitations.

Discussion

Findings indicate that the QIDS-SR-16 likely underestimates true treatment differences between psilocybin therapy and escitalopram due to higher variance, compound item structures, ambiguous or non-ordinal response options, and limited coverage of core depression symptoms (e.g., anhedonia, guilt, libido). Analyses at more granular levels (item, facet, factor) consistently revealed stronger improvements with PT in domains central to depression and functioning—particularly depressed mood and anhedonia—and clinically meaningful areas such as sexual functioning. The results underscore the limitations of unidimensional sum-scores in heterogeneous constructs like depression, as such scores can mask differential symptom responses and core-factor changes. The superiority of PT on core emotional facets persisted even when accounting for expectancy, suggesting genuine treatment effects rather than expectancy bias. The work supports network/process-based perspectives and advocates for measurement strategies that combine clinician and self-reports and target symptom- and facet-level outcomes to detect mechanistically relevant treatment differences.

Conclusion

The study identifies multiple psychometric issues with QIDS-SR-16—elevated variance, compound items, vague/poorly ordinal response options, unidimensional sum-scoring, and limited focus on core depressive symptoms—that plausibly explain its discordant null findings relative to other scales in the psilocybin vs escitalopram trial. Re-analysis at item, facet, and factor levels revealed domains where psilocybin therapy showed superior efficacy, notably depressed mood, anhedonia, and libido/sexual functioning. The authors recommend more granular, multidimensional assessment approaches that emphasize core and facet-level outcomes, potentially integrating items across scales, to improve sensitivity and clinical relevance in depression trials. Future work should replicate these findings, refine depression measurement to better capture core symptomatology, and explore mechanistic biomarkers aligned with core depression factors.

Limitations

Clinician expectancy and rater biases were not measured or controlled.
Facet-level analyses relied on Ballard et al.’s EFA structure (N=119) without confirmatory replication; results are tentative.
Post hoc analyses with a small sample (N=59) increase the risk of type I error; findings are exploratory and need replication.
Psychometric critiques of QIDS-SR-16 are based on a single, specific dataset; generalizability is uncertain.
Variance and standard error comparisons cannot definitively establish measurement error; observed differences could reflect true variance capture rather than imprecision.
