A comprehensive assessment of current methods for measuring metacognition

Psychology


D. Rahnev

This study presents a comprehensive empirical assessment of 17 measures of metacognition, introducing a method to jointly evaluate validity and precision and revealing that, while all measures are valid and have similar precision, they differ in their dependence on task performance and in test-retest reliability. Research conducted by Dobromir Rahnev.
Introduction

Metacognition is broadly defined as knowing about knowing, with metacognitive ability more narrowly referring to the capacity to distinguish correct from incorrect decisions via confidence. High metacognitive ability yields high confidence when correct and low confidence when wrong, supporting learning, decision making, social interaction, and self-knowledge. The field often treats metacognitive ability as a stable trait with individual differences, potentially domain-general, and ideally separable from task skill and various biases. Despite widespread use, the desirable psychometric properties of metacognition measures have not been comprehensively established. This paper articulates key desired properties—validity (measuring what it purports to measure), precision (quantitative accuracy and sensitivity to true changes rather than noise), independence from nuisance variables (task performance, response bias, metacognitive bias), and reliability (split-half and test-retest)—and provides an empirical assessment across 17 measures to fill this gap.

Literature Review

Traditional measures include AUC2 (the area under the Type 2 ROC), Goodman–Kruskal Gamma (a rank correlation of confidence with accuracy), Phi (the Pearson correlation of confidence with accuracy), and ΔConf (the difference in mean confidence between correct and error trials); a minimal computational sketch of these four appears below. These measures are intuitively appealing but known to depend on primary task performance. Maniscalco and Lau introduced meta-d' (Type 2 sensitivity expressed in the units of Type 1 d', estimated within signal detection theory, SDT) and the normalization-based metrics M-Ratio (meta-d'/d') and M-Diff (meta-d' − d'), intended to reduce performance dependence. Extending that logic, this paper develops normalized ratio and difference variants of AUC2, Gamma, Phi, and ΔConf using SDT-expected values (AUC2-Ratio/Diff, Gamma-Ratio/Diff, Phi-Ratio/Diff, ΔConf-Ratio/Diff; see the equations below). Process-model measures have also emerged, including meta-noise (the lognormal meta-noise model, in which confidence criteria are sampled from a lognormal distribution with width governed by the metacognitive noise parameter σmeta) and meta-uncertainty (the CASANDRE model, which quantifies second-order uncertainty about one's own sensory noise). Prior investigations have been limited, often simulation-based, and focused on few measures and properties. Empirical work has shown that some measures depend on task performance or metacognitive bias and that reliability can be modest, motivating comprehensive empirical testing across datasets.
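
As a concrete reference, here is a minimal sketch of the four traditional measures as defined above. It assumes trial-level arrays of 0/1 accuracy and discrete confidence ratings, with both correct and error trials present; the function and variable names are ours, not the paper's code.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def traditional_measures(accuracy, confidence):
    """AUC2, Gamma, Phi, and deltaConf from trial-level data.

    accuracy   : array of 0/1 correctness per trial
    confidence : array of discrete confidence ratings per trial
    """
    accuracy = np.asarray(accuracy, dtype=float)
    confidence = np.asarray(confidence, dtype=float)

    # AUC2: area under the Type 2 ROC, i.e., the probability that a random
    # correct trial carries higher confidence than a random error trial.
    auc2 = roc_auc_score(accuracy, confidence)

    # Goodman-Kruskal Gamma: (concordant - discordant) / (concordant +
    # discordant) over all trial pairs, ignoring ties. O(n^2), which is
    # adequate for a sketch.
    conc = disc = 0
    for (a1, c1), (a2, c2) in combinations(zip(accuracy, confidence), 2):
        s = (a1 - a2) * (c1 - c2)
        conc += s > 0
        disc += s < 0
    gamma = (conc - disc) / (conc + disc)

    # Phi: Pearson correlation between confidence and accuracy.
    phi = np.corrcoef(confidence, accuracy)[0, 1]

    # deltaConf: mean confidence on correct trials minus mean on errors.
    dconf = confidence[accuracy == 1].mean() - confidence[accuracy == 0].mean()

    return auc2, gamma, phi, dconf
```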
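
The shared normalization logic can be stated compactly. Writing $X_{\mathrm{obs}}$ for a measure computed from the observed data and $X_{\mathrm{SDT}}$ for the value of the same measure expected under the fitted equal-variance SDT model (our notation, chosen to match the summary above rather than the paper's exact symbols):

$$
\text{M-Ratio} = \frac{\text{meta-}d'}{d'}, \qquad \text{M-Diff} = \text{meta-}d' - d',
$$

$$
X\text{-Ratio} = \frac{X_{\mathrm{obs}}}{X_{\mathrm{SDT}}}, \qquad X\text{-Diff} = X_{\mathrm{obs}} - X_{\mathrm{SDT}}, \qquad X \in \{\mathrm{AUC2}, \text{Gamma}, \text{Phi}, \Delta\mathrm{Conf}\}.
$$

An observer whose confidence is exactly as informative as equal-variance SDT predicts therefore scores 1 on every ratio variant and 0 on every difference variant.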

Methodology

The study evaluates 17 measures: meta-d', AUC2, Gamma, Phi, and ΔConf; ratio variants (M-Ratio, AUC2-Ratio, Gamma-Ratio, Phi-Ratio, ΔConf-Ratio); difference variants (M-Diff, AUC2-Diff, Gamma-Diff, Phi-Diff, ΔConf-Diff); and the model-based meta-noise and meta-uncertainty. Data come from six large 2-choice confidence datasets in the Confidence Database: Haddara (n=70; 6 days; 3,000 trials/participant; single condition), Locke (n=10; 7 conditions; criterion manipulations; 2-point confidence), Maniscalco (n=22; 1,000 trials/participant; single condition), Rouault1 (n=466; 70 difficulty levels; 210 trials/participant), Rouault2 (n=484; multiple difficulty levels; 210 trials/participant), and Shekhar (n=20; 2,800 trials/participant; 3 difficulty levels; continuous confidence binned into six levels).

Validity and precision: A new corruption method systematically reduces the informativeness of confidence by decreasing it by one scale point on correct trials and increasing it by one scale point on error trials, on a small percentage of randomly selected trials (2%, 4%, or 6%). Measures are computed on bins of 50, 100, 200, and 400 trials; precision is quantified as the normalized drop (in SD units) relative to baseline variability without corruption (see the first sketch after this section).

Dependence on task performance: For datasets with multiple stimulus difficulty levels (Shekhar; Rouault1 and Rouault2 median-split), each measure is computed per level and compared between the easiest and hardest conditions via t-tests, with Cohen's d as the effect size.

Dependence on metacognitive bias: A recoding method (Xue et al.) induces lower vs. higher average confidence; measures are compared via paired t-tests in Haddara, Maniscalco, and per-level Shekhar, with Cohen's d reported.

Dependence on response bias: The Locke dataset manipulates the criterion c through priors and rewards; each measure is computed across the seven conditions, condition effects are tested with repeated-measures ANOVAs, and each measure is correlated with |c|.

Reliability: Split-half reliability is estimated via odd-even trial correlations for bins of 50, 100, 200, and 400 trials (Haddara, Maniscalco, and per-level Shekhar), with correlations averaged after Fisher z-transformation. Test-retest reliability is estimated via the intraclass correlation ICC(A,1) (two-way model, absolute agreement, single measures) and the Pearson correlation across day pairs (Haddara; bins of 50–400 trials). A sketch of both reliability analyses appears after this section.

SDT expectations for the normalized measures use the observed d' and confidence criteria to compute the predicted HRi and FARi under equal-variance SDT (see the sketch after this section).
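
A minimal sketch of the corruption procedure and the precision metric, assuming integer confidence ratings on a bounded scale (here 1-4) and a scalar-valued measure function; implementation details such as how ratings already at the scale boundary are handled are our assumptions, not the paper's code.

```python
import numpy as np

def corrupt_confidence(accuracy, confidence, pct, conf_min=1, conf_max=4, seed=0):
    """Reduce the informativeness of confidence on a random `pct` of trials:
    decrease it by one scale point on correct trials, increase it by one on
    errors, clipping at the ends of the scale (an assumed boundary rule)."""
    rng = np.random.default_rng(seed)
    conf = np.array(confidence, dtype=int)
    acc = np.asarray(accuracy)
    idx = rng.choice(len(conf), size=int(round(pct * len(conf))), replace=False)
    conf[idx] += np.where(acc[idx] == 1, -1, +1)
    return np.clip(conf, conf_min, conf_max)

def precision_in_sd_units(measure_fn, acc_bins, conf_bins, pct):
    """Normalized drop of a measure under corruption, in SD units of its
    uncorrupted between-bin variability (larger drop = more precise).
    `acc_bins`/`conf_bins` are lists of per-bin trial arrays, e.g.,
    non-overlapping bins of 50, 100, 200, or 400 trials."""
    clean = np.array([measure_fn(a, c) for a, c in zip(acc_bins, conf_bins)])
    corrupted = np.array([measure_fn(a, corrupt_confidence(a, c, pct))
                          for a, c in zip(acc_bins, conf_bins)])
    return (clean.mean() - corrupted.mean()) / clean.std(ddof=1)
```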
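
For the SDT expectations and the response-bias analysis, a standard equal-variance parameterization (stimulus distributions centered at ±d'/2, a convention that may differ from the paper's) gives the following. The predicted HRi and FARi imply expected trial proportions in each confidence bin for correct and error responses, from which the SDT-expected value of each traditional measure follows.

```python
import numpy as np
from scipy.stats import norm

def sdt_expected_rates(d_prime, criteria):
    """Expected HR_i and FAR_i at each confidence criterion c_i under
    equal-variance SDT with distributions at +d'/2 and -d'/2."""
    criteria = np.asarray(criteria, dtype=float)
    hr = norm.sf(criteria - d_prime / 2)   # P(evidence > c_i | signal)
    far = norm.sf(criteria + d_prime / 2)  # P(evidence > c_i | noise)
    return hr, far

def sdt_params(hit_rate, fa_rate):
    """Observed d' and response criterion c from Type 1 hit and
    false-alarm rates; |c| is what the Locke analysis correlates with."""
    z_h, z_f = norm.ppf(hit_rate), norm.ppf(fa_rate)
    return z_h - z_f, -0.5 * (z_h + z_f)
```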
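
A minimal sketch of the two reliability computations, assuming one score per participant per half or per session. The ICC(A,1) formula follows McGraw and Wong's two-way, absolute-agreement, single-measures definition; treating "Fisher-z averaged r" as averaging arctanh-transformed correlations across splits or datasets is our reading of the summary.

```python
import numpy as np

def split_half_r(measure_fn, acc_list, conf_list):
    """Odd-even split-half reliability: compute the measure separately on
    odd and even trials for every participant, then correlate the two
    halves across participants."""
    odd = [measure_fn(a[0::2], c[0::2]) for a, c in zip(acc_list, conf_list)]
    even = [measure_fn(a[1::2], c[1::2]) for a, c in zip(acc_list, conf_list)]
    return np.corrcoef(odd, even)[0, 1]

def fisher_average(rs):
    """Average several correlations via Fisher z: mean of arctanh(r),
    mapped back through tanh."""
    return np.tanh(np.mean(np.arctanh(np.asarray(rs))))

def icc_a1(scores):
    """ICC(A,1): two-way model, absolute agreement, single measures.
    `scores` is an (n participants x k sessions) array."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ms_rows = k * np.sum((scores.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((scores.mean(axis=0) - grand) ** 2) / (k - 1)
    resid = (scores - scores.mean(axis=1, keepdims=True)
             - scores.mean(axis=0, keepdims=True) + grand)
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```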

Key Findings

Validity and precision: All 17 measures decreased under artificial confidence corruption, confirming that each responds to genuine changes in the informativeness of confidence; the decrease scaled approximately linearly with the percentage of corrupted trials. Precision improves with bin size: corrupting 2% of trials yielded average drops of 0.35, 0.50, 0.70, and 1.04 SD for bins of 50, 100, 200, and 400 trials. Most measures have comparable precision; meta-uncertainty is notably less precise (Haddara: 0.37 vs. an average of 0.67 for the other measures; Maniscalco: 0.30 vs. 0.53).

Task performance dependence: Non-normalized traditional measures increase strongly with task ease (average Cohen's d across datasets: meta-d' 2.47; AUC2 2.29; Gamma 2.95; Phi 1.34; ΔConf 1.81). Ratio measures largely remove this dependence but slightly overcorrect (M-Ratio −0.18; AUC2-Ratio −0.39; Gamma-Ratio −0.11; Phi-Ratio −0.17; ΔConf-Ratio −0.23). Difference measures fail to correct, showing medium negative effect sizes (M-Diff −0.58; AUC2-Diff −0.49; Gamma-Diff −0.39; Phi-Diff −0.30; ΔConf-Diff −0.55). Model-based measures show small effects: meta-noise decreases in easier conditions (d ≈ −0.29) and meta-uncertainty slightly increases (d ≈ 0.06).

Metacognitive bias dependence: Non-normalized measures depend on average confidence (meta-d' d=0.44; AUC2 d=0.51; Gamma d=−0.61; Phi d=0.81; ΔConf d=0.54). Several normalized measures show only small effects (AUC2-Ratio d=0.09; Gamma-Ratio d≈0.001; AUC2-Diff d=0.10; Phi-Diff d=0.11); meta-noise shows a small negative effect (d=−0.21) and meta-uncertainty a small positive one (d=0.27).

Response bias: Across the seven Locke conditions manipulating priors and rewards, the criterion c changed strongly, but no metacognition measure varied significantly by condition; correlations with |c| were small (r between −0.04 and 0.21).

Reliability: Split-half reliability is high with at least 100 trials (all measures r>0.837; average r≈0.861) and increases further at 200 (average r≈0.946) and 400 trials (average r≈0.965); at 50 trials it falls substantially (average r≈0.424). Test-retest reliability (Haddara) is low to moderate: most normalized and model-based measures have ICC<0.5 even at 400 trials. ΔConf was highest (ICC≈0.75 at 400 trials), whereas M-Ratio reached only ICC≈0.42; d' and c had just modestly higher ICC than ΔConf, while mean confidence itself was highly reliable (ICC>0.86).

Inter-measure correlations: Correlations across participants are moderate (average r≈0.49–0.56). Non-normalized measures correlate strongly with one another, normalized measures form a second correlated group, and model-based measures correlate weakly with the rest (meta-noise average r≈0.35; meta-uncertainty average r≈0.44; correlation between the two model-based measures r≈0.03–0.15).

Overall: No single measure is perfect. Ratio measures handle task performance better than difference measures; meta-noise and meta-uncertainty are promising model-based approaches; and the reliability constraints imply that individual-differences research requires large samples and adequate trial counts.

Discussion

The study addresses the central question of which metacognition measures possess desirable psychometric properties. By empirically testing validity, precision, nuisance-variable dependence, and reliability across 17 measures and multiple large datasets, it shows that while all measures are valid, they differ meaningfully. Traditional non-normalized measures are highly sensitive to task performance and metacognitive bias, limiting their interpretability in varied experimental contexts. Ratio normalizations substantially reduce task performance dependence, whereas difference normalizations generally overcorrect and perform poorly. Model-based measures capture distinct aspects of metacognitive inefficiency, offering explicit assumptions and flexibility (e.g., modeling multiple conditions), but can be noisier or exhibit opposite-direction dependencies that warrant careful interpretation. Reliability analyses reveal strong split-half reliability with adequate trial numbers but generally low test-retest reliability, cautioning against simple cross-day individual differences correlations and underscoring the need for sufficient trials and sample sizes. Practically, M-Ratio remains a defensible standard given its overall balance and established status, but combining it with model-based measures like meta-noise (and possibly meta-uncertainty) can help disentangle effects of metacognitive bias and task performance due to their differing dependency patterns. The results guide measure selection based on study goals, data characteristics, and potential confounds.

Conclusion

This work provides a comprehensive empirical evaluation of 17 metacognition measures, introduces a practical method to jointly assess validity and precision, systematically quantifies dependence on task performance, response bias, and metacognitive bias, and evaluates reliability (split-half and test-retest). It demonstrates that all measures are valid but vary in precision, nuisance-variable dependencies, and reliability; ratio normalizations outperform difference methods for controlling task performance; model-based measures capture unique variance and offer principled flexibility but may be noisier. Recommendations include using sufficient trials (≥100 for within-subject precision; ≥400 for individual differences), preferring ratio measures (e.g., M-Ratio) when performance is adequate, confirming results with model-based metrics to probe biases, and being cautious interpreting cross-day correlations. Future research should extend evaluations to additional datasets and domains (beyond perception), incorporate unequal-variance SDT assumptions, refine model-based estimation for improved precision (especially meta-uncertainty), and develop measures applicable beyond 2-choice tasks.

Limitations

Several measures were omitted, including newer model-based approaches, M-Ratio variants, and legacy metrics (e.g., Kunimoto’s α). Some analyses relied on single datasets (response bias; test-retest reliability), limiting generalizability. SDT expectations assumed equal-variance; unequal-variance models may yield different results. All analyses were in perceptual tasks; generalization to learning, memory, problem solving, and other domains requires testing. Most measures apply only to 2-choice tasks, constraining applicability to n-choice or estimation paradigms. Normalized measures can be unstable at low d' (near-zero denominators), and continuous confidence must be binned for SDT-based metrics, potentially losing information.
