Psychology

Challenging the Bayesian confidence hypothesis in perceptual decision-making

Kai Xue, Medha Shekhar, and Dobromir Rahnev

Introduction

The study investigates how humans compute confidence in simple two-choice perceptual decision-making tasks, contrasting the Bayesian Confidence Hypothesis (BCH) with the Confidence in Raw Evidence Space (CRES) hypothesis. BCH posits that confidence reflects the posterior probability that a decision is correct, requiring accurate knowledge of task structure and correct probabilistic computation. CRES proposes that confidence is based directly on the raw sensory evidence by placing criteria on an internal evidence axis, without explicit probability computations. The authors identify a decisive qualitative signature for two-choice tasks: when task difficulty varies, BCH predicts confidence criteria should remain constant in log-likelihood ratio (log odds) space but vary in evidence space, whereas CRES predicts the opposite (criteria stable in evidence space and varying in log odds). The research aims to test these competing models in three experiments using tasks with Gaussian evidence distributions, including conditions where difficulty is uncued, trial-cued, and block-fixed, to determine whether humans use BCH or CRES computations.
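
The opposite predictions of the two hypotheses follow from a short calculation. The sketch below assumes the standard equal-variance Gaussian model (evidence x drawn from Gaussians with means ±μ and σ = 1, equal prior probability for the two stimuli), so that posterior log odds equal the log-likelihood ratio. The symbols c_i and LO_i for the i-th confidence criterion are illustrative notation, not necessarily the authors' own.

```latex
% Assumptions: equal-variance Gaussians with means \pm\mu, \sigma = 1, equal priors.
\begin{align*}
\mathrm{LO}(x) &= \log\frac{p(x \mid +\mu)}{p(x \mid -\mu)}
  = \frac{(x+\mu)^2 - (x-\mu)^2}{2} = 2\mu x = d'\,x,
  \qquad d' = 2\mu, \\
\text{CRES (fixed } c_i\text{):}\quad
  & \mathrm{LO}_i = d'\,c_i \ \text{ changes with } d', \\
\text{BCH (fixed } \mathrm{LO}_i\text{):}\quad
  & c_i = \mathrm{LO}_i / d' \ \text{ changes with } d'.
\end{align*}
```

Under these assumptions, a criterion that stays put on one axis must move on the other whenever d' changes, which is what makes the signature diagnostic.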

Literature Review

Empirical support for BCH has been limited and mixed. Prior work identified signatures of statistical confidence (equivalent to BCH), but none are sufficient and some fail empirically. Some studies found BCH-consistent patterns in subsets of subjects or conditions, and one paper supported BCH in a simple two-choice task via model fits. However, several studies in more complex tasks with additional noise, multiple alternatives, or varying information quality have rejected BCH in favor of alternatives. The paper distinguishes BCH from the broader Bayesian decision-making framework: BCH is a specific, falsifiable model with strong assumptions (knowledge of task structure, estimating probability of being correct, and correct computation), whereas the Bayesian framework is a general, non-falsifiable conceptual approach. Alternatives to BCH include models where confidence reflects raw evidence strength (CRES) and other proposals such as entropy-based confidence or visibility-based mechanisms. Dynamic models that jointly account for choice, confidence, and response time exist but make direct BCH vs. CRES comparisons challenging. The literature also discusses metacognitive noise, lapses, asymmetric criteria, and the effects of confidence scale granularity on calibration, with mixed findings.

Methodology

Three experiments compared BCH and CRES using simple two-choice orientation discrimination tasks with Gaussian sensory evidence distributions and three difficulty levels (easy, medium, hard) manipulated via stimulus contrast. Stimuli were noisy Gabor patches tilted 45° left or right. Subjects reported choice and confidence on a continuous scale that was later rescaled to six points, yielding five confidence criteria per choice.

Experiment 1 reanalyzed published data (20 subjects, 2,800 trials each; interleaved contrasts; difficulty not explicitly cued). Experiment 2 was preregistered (20 subjects; 750 trials each) and explicitly cued trial difficulty, with practice and feedback to build performance awareness. Experiment 3 was preregistered (20 subjects; 720 trials each); difficulty was fixed within runs of 120 trials, with trial-by-trial feedback and extended training, a design intended to facilitate BCH computations.

Confidence criterion locations were quantified in internal evidence space (signal detection measure c) and in log-likelihood ratio space (log odds, LO). Change scores were computed between hard and easy conditions for each criterion: c_change = |c_hard - c_easy| / (c_hard + c_easy) and LO_change = |LO_hard - LO_easy| / (LO_hard + LO_easy). Statistical comparisons used two-tailed t-tests for each of the 10 criteria.

Model comparison included 16 models per experiment: eight CRES and eight BCH variants, systematically varying three auxiliary assumptions: (1) symmetry vs. asymmetry of confidence criteria around the decision criterion, (2) presence vs. absence of a lapse rate, and (3) presence vs. absence of Gaussian metacognitive noise. Corresponding CRES and BCH models had the same number of parameters. Models were fit by maximum likelihood estimation using Bayesian Adaptive Direct Search (BADS) and compared using the Akaike Information Criterion (AIC); lower AIC indicates better fit. Absolute fit was additionally assessed with G^2 statistics, and bootstrapped 95% confidence intervals on summed AIC differences assessed significance. Model and parameter recovery analyses confirmed high recoverability in all experiments.

Modeling details: sensory evidence was modeled as Gaussian with means ±μ and fixed σ = 1, so that μ defines d'. CRES generates the decision and confidence by placing criteria directly on the evidence axis. BCH defines thresholds in posterior probability (log odds) space; because log odds and posterior probability map one-to-one, each log-odds threshold corresponds to a specific evidence criterion that depends on d'. The transformation implies that if evidence-space criteria are fixed across difficulty, the corresponding log odds vary linearly with d'; conversely, BCH with fixed log-odds criteria implies that evidence criteria shift with d'.

As a control, a reaction time (RT) analysis derived RT-based confidence by binning RTs into quartiles (four confidence levels) and repeating the signature analysis.
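
The change-score signature can be illustrated numerically. The following is a minimal sketch, not the authors' analysis code: it assumes the equal-variance Gaussian model described above (for which log odds at evidence x equal d'·x), uses hypothetical d' values and hypothetical criterion locations, and applies the change-score formula as written in the text to same-signed (here positive) criteria.

```python
import numpy as np


def evidence_to_log_odds(x, d_prime):
    """Log odds at evidence x for Gaussians with means +/- d'/2 and sigma = 1."""
    return d_prime * np.asarray(x, dtype=float)


def log_odds_to_evidence(lo, d_prime):
    """Inverse mapping: evidence-space location of a log-odds threshold."""
    return np.asarray(lo, dtype=float) / d_prime


def change_score(hard, easy):
    """|hard - easy| / (hard + easy), as stated in the text.

    Assumes both inputs have the same sign (positive in this sketch).
    """
    hard = np.asarray(hard, dtype=float)
    easy = np.asarray(easy, dtype=float)
    return np.abs(hard - easy) / (hard + easy)


# Hypothetical sensitivities for the hard and easy conditions.
D_HARD, D_EASY = 1.0, 3.0

# Idealized CRES observer: identical evidence criteria in both conditions.
cres_c = np.array([0.5, 1.0, 1.5])  # hypothetical criteria
cres_c_change = change_score(cres_c, cres_c)
cres_lo_change = change_score(
    evidence_to_log_odds(cres_c, D_HARD),
    evidence_to_log_odds(cres_c, D_EASY),
)

# Idealized BCH observer: identical log-odds criteria in both conditions.
bch_lo = np.array([1.0, 2.0, 3.0])  # hypothetical criteria
bch_lo_change = change_score(bch_lo, bch_lo)
bch_c_change = change_score(
    log_odds_to_evidence(bch_lo, D_HARD),
    log_odds_to_evidence(bch_lo, D_EASY),
)

print("CRES: c_change =", cres_c_change, "LO_change =", cres_lo_change)
print("BCH:  c_change =", bch_c_change, "LO_change =", bch_lo_change)
```

In this idealized setting the CRES observer shows zero c_change and nonzero LO_change, while the BCH observer shows the reverse; real data fall between these extremes, which is why the comparison of the two change scores is informative.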

Key Findings

Across all three experiments, empirical signatures and model comparisons favored CRES over BCH.

Experiment 1 (uncued, interleaved difficulty; 20 subjects, 2,800 trials each): the average change scores across the 10 criteria were c_change = 0.09 and LO_change = 0.44; LO_change was significantly larger than c_change for all criteria (all P < 7.9e-5). All eight CRES models outperformed the corresponding BCH models (average summed AIC advantage ≈ 2,122; range 1,747 to 2,524). Even the worst CRES model outperformed the best BCH model by 1,071 summed AIC points. The best CRES model reproduced the empirical signatures (c_change ≈ 0.06, LO_change ≈ 0.45 vs. empirical 0.09 and 0.44); the best BCH model did not (c_change ≈ 0.17, LO_change ≈ 0.03).

Experiment 2 (trial-cued difficulty; 20 subjects, 750 trials each): empirical c_change = 0.17 and LO_change = 0.46; LO_change > c_change for all criteria (all P < 0.04). All eight CRES models had lower summed AIC than the corresponding BCH models (average advantage ≈ 591.18; range 244.95 to 906.46). The best CRES model gave c_change ≈ 0.06 and LO_change ≈ 0.43; the best BCH model gave c_change ≈ 0.14 and LO_change ≈ 0.12.

Experiment 3 (block-fixed difficulty; 20 subjects, 720 trials each): empirical c_change = 0.19 and LO_change = 0.51; LO_change > c_change for all criteria (all P < 0.02). All eight CRES models outperformed the corresponding BCH models (average summed AIC advantage ≈ 361.32; range 68.43 to 651.02). The best CRES model gave c_change ≈ 0.10 and LO_change ≈ 0.60; the best BCH model gave c_change ≈ 0.18 and LO_change ≈ 0.06.

Auxiliary assumptions: models with asymmetric confidence criteria fit better than symmetric ones (average summed AIC differences: Expt 1 ≈ 792; Expt 2 ≈ 165.16; Expt 3 ≈ 229.56). Including metacognitive noise and a lapse rate also improved fits (average AIC advantages, Expt 1: metacognitive noise ≈ 165, lapses ≈ 198; Expt 2: metacognitive noise ≈ 30.95, lapses ≈ 143.59; Expt 3: metacognitive noise ≈ 10.65, lapses ≈ 40.19).

Training effects: in Expt 3, the CRES advantage decreased from the first to the second half of runs (average summed AIC difference ≈ 178.97 vs. 90.46), suggesting a modest shift toward BCH-like computation over prolonged exposure. Qualitative change scores across experiment halves: first half c_change ≈ 0.24, LO_change ≈ 0.77; second half c_change ≈ 0.47, LO_change ≈ 0.52, still more consistent with CRES.

RT-based confidence replicated the CRES signatures across experiments: c_change ≈ 0.10, 0.13, 0.23 and LO_change ≈ 0.55, 0.51, 0.63 for Experiments 1–3, respectively, with LO_change > c_change for all criteria (all P < 0.03).
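
For readers unfamiliar with summed AIC comparisons, the sketch below shows one common way such a statistic and its bootstrapped confidence interval can be computed. It is an illustration under stated assumptions, not the authors' pipeline: the per-subject log-likelihoods are made-up toy numbers (not data from the study), the parameter count is hypothetical, and resampling subjects with replacement is assumed as the bootstrap scheme.

```python
import numpy as np


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * np.asarray(log_likelihood, dtype=float)


def bootstrap_summed_diff_ci(diffs, n_boot=10_000, seed=0):
    """95% percentile CI on the summed per-subject AIC difference.

    Assumption: subjects are resampled with replacement.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    sums = diffs[idx].sum(axis=1)
    lower, upper = np.percentile(sums, [2.5, 97.5])
    return float(lower), float(upper)


# Toy per-subject maximized log-likelihoods (made up for illustration).
ll_cres = np.array([-1000.0, -1100.0, -950.0, -1020.0, -990.0])
ll_bch = np.array([-1030.0, -1125.0, -990.0, -1040.0, -1010.0])
N_PARAMS = 10  # hypothetical; matched CRES/BCH models share a parameter count

# Positive values mean the CRES model fits better for that subject.
aic_diff = aic(ll_bch, N_PARAMS) - aic(ll_cres, N_PARAMS)
summed_diff = float(aic_diff.sum())
ci_lower, ci_upper = bootstrap_summed_diff_ci(aic_diff)

print("summed AIC difference (BCH - CRES):", summed_diff)
print("bootstrapped 95% CI:", (ci_lower, ci_upper))
```

Because matched models have the same number of parameters, the penalty term cancels and the AIC difference reduces to twice the log-likelihood difference. A confidence interval that excludes zero indicates a reliable advantage for one model family.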

Discussion

The findings challenge the Bayesian Confidence Hypothesis by demonstrating that, even in simple two-choice tasks with Gaussian evidence distributions, confidence criteria are more stable in internal evidence space than in log-likelihood ratio space, contrary to BCH predictions. Extensive model comparisons uniformly favored CRES models that place confidence criteria directly on raw sensory evidence. The authors discuss that falsifying BCH does not falsify the broader Bayesian framework; BCH is a specific instantiation with strong assumptions, whereas the Bayesian framework is a flexible, conceptual approach. BCH-adjacent variations relaxing assumptions about knowledge of task structure, computing probability of being correct, or correctness of computations could, in principle, account for data but risk unfalsifiability or require implausible mechanisms (e.g., ignoring difficulty cues). The results are most parsimoniously explained by dropping the assumption that people estimate posterior correctness and instead using raw evidence thresholds. The work also elaborates implications for metacognitive modeling: confidence criteria tend to be asymmetric; metacognitive noise and lapses improve fits; and RT-derived confidence aligns with CRES, suggesting timing is more consistent with raw-evidence criteria than posterior computations. The discussion contemplates the definition of “raw evidence” in neural terms, emphasizes that CRES can be interpreted within a Bayesian resource-rational perspective, and notes that while subjects can slightly adjust criteria with training or fixed difficulty, the magnitude of shifts is far smaller than BCH predicts.

Conclusion

Confidence computations in simple two-choice perceptual tasks deviate from BCH predictions and are better described by placing criteria directly in sensory evidence space (CRES). A qualitative signature based on varying task difficulty, corroborated by comprehensive model comparisons across auxiliary assumptions, decisively supports CRES over BCH in three experiments. Future research should: more precisely define raw evidence in neural terms; extend models beyond current CRES assumptions (e.g., non-Gaussian noise, visibility-based mechanisms, dynamic decision processes); clarify the role and modeling of lapses and metacognitive noise; examine effects of confidence scale design; and investigate training-induced shifts toward BCH-like computations.

Limitations

The experiments focus on simple two-choice tasks with Gaussian evidence; results may not generalize to more complex or non-Gaussian tasks. The modeling assumes equal sensory noise across conditions and does not implement dynamic models that jointly account for RT, choice, and confidence, which limits exploration of temporal strategies. Although CRES models fit better, the fits are not perfect, indicating missing mechanisms. The analysis cannot identify which specific BCH assumption is violated. Lapse modeling assumes random confidence during lapses, which may be unrealistic and may conflate lapse effects with metacognitive noise. Confidence ratings were converted from a continuous to a discrete scale, potentially contributing to asymmetric criteria. Training in Experiment 3 suggests modest shifts toward BCH-like behavior, indicating that confidence computations may be context- or experience-dependent.
