

A comprehensive assessment of current methods for measuring metacognition

D. Rahnev

A comprehensive empirical assessment of 17 metacognition measures shows that all are valid and most are similarly precise, with weak links to response and metacognitive bias but strong dependence on task performance. Split-half reliability is high, while test-retest reliability is generally poor, suggesting that no perfect metric exists and that different measures suit different contexts. Research conducted by Dobromir Rahnev.
Introduction

The paper addresses the central question of how best to measure metacognitive ability—defined as the capacity to evaluate one’s decisions by distinguishing between correct and incorrect answers—given its importance for learning, decision-making, social interaction, and self-knowledge. Metacognitive ability is often treated as a stable trait with meaningful variability, potentially domain-general, and ideally measured independently of task skill or bias. Yet, the empirical properties of commonly used measures have been insufficiently established. The study aims to comprehensively evaluate 17 measures of metacognition across key psychometric criteria: validity, precision, dependence on nuisance variables (task performance, response bias, metacognitive bias), and reliability (split-half and test-retest), using multiple large datasets from the Confidence Database.

Literature Review

Traditional metacognition measures include the Type 2 ROC area (AUC2), Goodman-Kruskal Gamma (the rank correlation of confidence with accuracy), Phi (the Pearson correlation of confidence with accuracy), and ΔConf (the difference between average confidence on correct and error trials). These are intuitively appealing but thought to be influenced by task performance. Maniscalco and Lau introduced meta-d' (confidence sensitivity expressed in d' units) and its normalizations M-Ratio (meta-d'/d') and M-Diff (meta-d' − d') to mitigate performance dependence. The current work extends this normalization logic to eight new measures: ratio and difference versions of AUC2, Gamma, Phi, and ΔConf, obtained by comparing observed values to the values expected under signal detection theory (SDT) given the observed d' and criterion. Model-based approaches include meta-noise (a lognormal meta-noise model in which confidence criteria are sampled from a lognormal distribution that affects confidence but not the initial decision) and meta-uncertainty (the CASANDRE model, which posits second-order uncertainty about sensory uncertainty). Prior empirical and simulation studies are limited: some show dependence on response bias for certain legacy measures; others show that meta-d', AUC2, and Phi depend strongly on task performance, with M-Ratio and meta-noise less so; reliability studies report low test-retest reliability for M-Ratio and modest split-half reliability in mixed-condition datasets; and meta-uncertainty has been reported to have high test-retest reliability with weak dependence on task performance and metacognitive bias. Overall, there has been little empirical work spanning all major measures and properties.
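To make the quantities above concrete, the sketch below computes the non-normalized measures (ΔConf, Phi, Gamma, AUC2) from trial-level confidence and accuracy, together with the generic ratio/difference normalization. This is a minimal sketch, not the authors' code: function names are illustrative, meta-d' is assumed to come from an existing fitting routine (e.g., Maniscalco and Lau's method), and the SDT-expected values used for normalization are treated as given.

```python
# Minimal sketch (not the paper's code) of the non-normalized measures and the
# generic ratio/difference normalization. Inputs: per-trial confidence ratings
# and a boolean accuracy vector.
import numpy as np
from scipy import stats


def delta_conf(conf, correct):
    """ΔConf: mean confidence on correct trials minus mean confidence on errors."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    return conf[correct].mean() - conf[~correct].mean()


def phi(conf, correct):
    """Phi: Pearson correlation between confidence and accuracy."""
    return stats.pearsonr(np.asarray(conf, float), np.asarray(correct, float))[0]


def gamma(conf, correct):
    """Goodman-Kruskal Gamma: (concordant - discordant) / (concordant + discordant)
    pairs in the accuracy-by-confidence contingency table."""
    conf, correct = np.asarray(conf), np.asarray(correct, int)
    levels = np.unique(conf)
    table = np.array([[np.sum((correct == a) & (conf == l)) for l in levels]
                      for a in (0, 1)])
    concordant = discordant = 0
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            concordant += table[i, j] * table[i + 1:, j + 1:].sum()
            discordant += table[i, j] * table[i + 1:, :j].sum()
    return (concordant - discordant) / (concordant + discordant)


def auc2(conf, correct):
    """Type 2 ROC area: probability that a randomly chosen correct trial carries
    higher confidence than a randomly chosen error trial (ties count as 0.5)."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    u = stats.mannwhitneyu(conf[correct], conf[~correct]).statistic
    return u / (correct.sum() * (~correct).sum())


def normalize(observed, expected, kind):
    """Ratio or difference normalization against the value expected under SDT
    for the observed d' and criteria (e.g., AUC2-Ratio, Gamma-Diff)."""
    return observed / expected if kind == "ratio" else observed - expected


def m_ratio(meta_d, d_prime):
    """M-Ratio = meta-d'/d'; M-Diff is analogously meta-d' - d'."""
    return meta_d / d_prime
```

For example, `normalize(auc2(conf, correct), expected_auc2, "ratio")` would yield AUC2-Ratio, where `expected_auc2` denotes the value an ideal SDT observer with the same d' and criteria would produce (its computation is not shown here).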

Methodology

Data came from six large datasets in the Confidence Database: Haddara (n=70 analyzed, 3000 trials/participant, 4-point confidence, single condition, 6 days), Locke (n=10, 4900 trials, 2-point confidence, seven conditions manipulating priors and rewards), Maniscalco (n=22 analyzed, 1000 trials, 4-point confidence, single condition), Rouault1 (n=466, 70 difficulty levels, 210 trials/participant, 11-point confidence recoded to 6-point), Rouault2 (n=484, staircase difficulty, 210 trials/participant, 6-point confidence), and Shekhar (n=20, 2800 trials/participant, 3 difficulty levels, continuous confidence binned to 6 levels). Participants with extreme accuracy or response patterns were excluded per pre-registered criteria.

Seventeen measures were computed: nine established (AUC2, Gamma, Phi, ΔConf, meta-d', M-Ratio, M-Diff, meta-noise, meta-uncertainty) and eight new normalized measures (AUC2-Ratio/Diff, Gamma-Ratio/Diff, Phi-Ratio/Diff, ΔConf-Ratio/Diff). SDT expectations were derived from the observed d' and confidence/decision criteria to compute the expected values used for normalization.

Validity and precision: The authors developed an artificial-confidence corruption method that decreases confidence by 1 on correct trials and increases it by 1 on incorrect trials for 2%, 4%, or 6% of trials, within bins of 50, 100, 200, or 400 trials. Precision is the average normalized drop (in SD units of baseline fluctuations) across bins; a minimal code sketch of this procedure appears at the end of this section.

Dependence on task performance: For datasets with multiple difficulty levels (Shekhar; Rouault1/2 median-split into easy vs. difficult), each measure was computed per difficulty level and the extremes were compared via t-tests, with effect sizes reported as Cohen's d. Measures were also plotted against d'.

Dependence on metacognitive bias: The Xue et al. recoding was applied to induce lower vs. higher average confidence in Haddara, Maniscalco, and Shekhar (analyzing Shekhar difficulty levels separately and averaging), followed by paired t-tests between recodes and Cohen's d.

Dependence on response bias: In Locke (priors and payoffs manipulated), repeated-measures ANOVAs were run across the seven conditions for each measure, and each measure was correlated with the absolute criterion |c|.

Reliability: Split-half correlations (odd vs. even trials) were computed for bins of 50, 100, 200, and 400 trials in Haddara, Maniscalco, and Shekhar (per difficulty level, then averaged). Test-retest reliability was assessed in Haddara (6 days) using ICC (two-way, absolute consistency) and Pearson correlations across all day pairs for bins of 50, 100, 200, and 400 trials. Inter-measure across-subject correlations were calculated for the three large datasets.
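The corruption-based validity and precision procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the pre-registered analysis code: here the corruption fraction is applied separately to correct and error trials, the rating scale is assumed to run from 1 to 4, and the random seed is arbitrary.

```python
# Minimal sketch of the artificial-confidence corruption used to probe validity
# and precision. Assumptions (not from the paper): the fraction is applied
# separately to correct and error trials, and the confidence scale is 1-4.
import numpy as np

rng = np.random.default_rng(0)


def corrupt_confidence(conf, correct, fraction, conf_min=1, conf_max=4):
    """Lower confidence by 1 on a random subset of correct trials and raise it
    by 1 on a random subset of error trials, clipping to the rating scale."""
    conf = np.asarray(conf, float).copy()
    correct = np.asarray(correct, bool)
    for mask, shift in ((correct, -1), (~correct, +1)):
        idx = np.flatnonzero(mask)
        n_pick = int(round(fraction * idx.size))
        picked = rng.choice(idx, size=n_pick, replace=False)
        conf[picked] = np.clip(conf[picked] + shift, conf_min, conf_max)
    return conf


def precision_in_sd_units(measure_fn, conf, correct, fraction, bin_size):
    """Average drop of a measure after corruption, expressed in SD units of its
    baseline across-bin fluctuations (the paper's precision index, roughly)."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    n_bins = len(conf) // bin_size
    baseline, corrupted = [], []
    for b in range(n_bins):
        sl = slice(b * bin_size, (b + 1) * bin_size)
        baseline.append(measure_fn(conf[sl], correct[sl]))
        corrupted.append(
            measure_fn(corrupt_confidence(conf[sl], correct[sl], fraction), correct[sl])
        )
    baseline, corrupted = np.array(baseline), np.array(corrupted)
    return (np.mean(baseline) - np.mean(corrupted)) / np.std(baseline, ddof=1)
```

Because the corruption can only make confidence less diagnostic of accuracy, any valid measure should decrease after it is applied; the size of that decrease relative to the measure's baseline fluctuations indexes its precision.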

Key Findings

Validity and precision: All 17 measures decreased with artificial corruption, confirming validity. Precision increased with bin size; corrupting 2% of trials yielded average decreases of 0.35, 0.50, 0.70, and 1.04 SD units for bins of 50, 100, 200, and 400 trials. Most measures had comparable precision. Meta-uncertainty was notably less precise (Haddara: 0.37 vs. an average of 0.67 for the other measures; Maniscalco: 0.30 vs. 0.53), with pairwise differences significant (uncorrected) in both datasets.

Dependence on task performance: Non-normalized measures strongly increased as the task became easier; average Cohen's d across datasets: meta-d' 2.47, AUC2 2.29, Gamma 2.95, Phi 1.34, ΔConf 1.81. Ratio measures reduced this dependence but slightly overcorrected (negative d): M-Ratio −0.18, AUC2-Ratio −0.39, Gamma-Ratio −0.11, Phi-Ratio −0.17, ΔConf-Ratio −0.23. Difference measures overcorrected more strongly: M-Diff −0.58, AUC2-Diff −0.49, Gamma-Diff −0.39, Phi-Diff −0.30, ΔConf-Diff −0.55. Model-based measures showed small effects: meta-noise −0.29; meta-uncertainty 0.06.

Dependence on metacognitive bias: Non-normalized measures depended on average confidence (Cohen's d across datasets): meta-d' 0.44, AUC2 0.51, Gamma −0.61, Phi 0.81, ΔConf 0.54. Several normalized measures showed smaller dependence: AUC2-Ratio 0.09, Gamma-Ratio 0.001, AUC2-Diff 0.10, Phi-Diff 0.11; meta-noise was −0.21 and meta-uncertainty 0.27. The measures with absolute effect sizes under 0.15 were AUC2-Ratio, Gamma-Ratio, AUC2-Diff, and Phi-Diff.

Dependence on response bias: Despite a strong manipulation of the criterion (F(6,54)=12.18, p<0.001, η²p=0.58), no metacognition measure showed a significant condition effect; correlations with |c| were small (r between −0.04 and 0.21).

Split-half reliability: High for bins of at least 100 trials (all r>0.837; average r=0.861 at 100, 0.946 at 200, and 0.965 at 400 trials) but poor at 50 trials (average r=0.424; no measure exceeded 0.6). For comparison, d' had a split-half r of 0.685 at 50 trials.

Test-retest reliability: Generally low. Example ICCs for bins of 50, 100, 200, and 400 trials: ΔConf 0.39, 0.53, 0.65, and 0.75; M-Ratio 0.16, 0.23, 0.29, and 0.42. Non-normalized measures were low to moderate; mean confidence itself exceeded 0.86 regardless of bin size, and d' reached about 0.84 at 400 trials.

Inter-measure correlations: Average across-subject correlations were medium (r≈0.49–0.56), and three clusters emerged: (1) non-normalized measures correlated strongly with one another (average r≈0.60), (2) ratio and difference measures formed a second cluster (average r≈0.63), and (3) model-based measures correlated weakly with the other measures (meta-noise average r≈0.35; meta-uncertainty average r≈0.44) and with each other (r≈0.03–0.15).

Discussion

The study establishes that all evaluated metacognition measures are valid, but none are universally superior across psychometric criteria. Traditional non-normalized measures are strongly confounded by task performance and metacognitive bias, which can inflate or distort estimates. Ratio normalizations effectively reduce performance dependence, albeit with slight overcorrection, while difference normalizations generally fail, introducing medium negative dependencies. Model-based measures (meta-noise, meta-uncertainty) perform comparably to ratio normalizations regarding nuisance controls and offer transparent assumptions and flexibility, though meta-uncertainty has lower precision. Reliability analyses highlight that within-subject precision can be high with adequate trials (≥100), but between-day stability is weak for most measures, complicating individual differences research and necessitating large samples or multiple measures. Findings inform measure selection tailored to experimental goals: use ratio normalizations or model-based measures to mitigate performance and bias confounds; avoid difference measures for performance correction; and employ sufficient trial counts. The observed opposite-direction dependencies for certain measures (e.g., meta-uncertainty vs. others) suggest complementary use to guard against confounds. Overall, the work guides the field toward principled measurement and cautions against overinterpretation of single metrics.

Conclusion

This paper introduces an empirical framework for assessing validity and precision via controlled corruption of confidence ratings and systematically evaluates 17 metacognition measures, including eight new normalized variants, with respect to dependence on task performance, metacognitive bias, and response bias, as well as reliability. Key contributions include demonstrating that all measures are valid, quantifying their precision (with meta-uncertainty notably less precise), showing the strong performance dependence of traditional measures, documenting the relative success of ratio normalizations and model-based measures and the failure of difference normalizations, and establishing high split-half but poor test-retest reliability. Recommendations emphasize using multiple measures (especially ratio-normalized and model-based ones), designing relatively easy single-condition tasks, and collecting sufficient trial numbers. Future research should expand to additional datasets and domains beyond perception, consider unequal-variance SDT, refine process models, and develop practical implementations of model-based measures that could eventually supplant legacy metrics such as M-Ratio.

Limitations

Several limitations are noted: (1) Not all measures of metacognition were included (e.g., some recent model-based measures, variants of M-Ratio, and legacy measures like Kunimoto's a''). (2) Two analyses relied on a single dataset (response-bias dependence and test-retest reliability), warranting cautious interpretation. (3) SDT computations assumed equal variance; unequal-variance assumptions may change the results. (4) All analyses used perceptual tasks; generalization to memory, learning, problem solving, and other domains remains to be empirically verified. (5) Most measures apply only to 2-choice tasks, limiting generality to n-choice or estimation paradigms. (6) The summarized values (e.g., effect sizes) are dataset-dependent and should be treated as rough estimates pending replication across more datasets.
