Psychology
Confidence reports in decision-making with multiple alternatives violate the Bayesian confidence hypothesis
H. Li and W. J. Ma
The study examines how humans compute confidence in decisions when faced with more than two alternatives. The leading view—the Bayesian confidence hypothesis—posits that confidence equals the posterior probability that the chosen option is correct. Prior work has largely used two-alternative tasks, where alternative formulations collapse to the same prediction. The authors propose that in multi-alternative settings, confidence may instead depend on the relative strength between the top two options (difference in posterior probabilities), or possibly on overall uncertainty (entropy). To dissociate these accounts, they design a three-alternative visual categorization task where participants choose a category for a target dot and rate confidence. By manipulating category configurations and target positions, they test whether confidence is better explained by (i) the maximum posterior of the chosen option (Max/Bayesian hypothesis), (ii) the difference between the highest and second-highest posteriors (Difference), or (iii) the negative entropy of the posterior (Entropy).
Confidence influences downstream behavior, learning, and group decisions, and impairments in confidence are linked to psychiatric conditions. The Bayesian confidence hypothesis (confidence equals the probability of being correct) has guided both behavioral and neurophysiological research, with neural correlates reported in parietal cortex, orbitofrontal cortex, and the pulvinar. Behavioral evidence is mixed: some studies report alignment with Bayesian predictions, while others show deviations. Factors such as reaction time, post-decisional processing, and emphasis on positive evidence can modulate confidence. Prior model comparison work in two-alternative tasks found departures from Bayesian accounts but often relied on heuristic alternatives without broader theoretical grounding. Moreover, two-alternative designs cannot test dependence on unchosen options, because with only two options the posterior difference is redundant with the maximum posterior. Multi-alternative decision studies and related heuristics (e.g., balance-of-evidence, tallest-minus-next thresholds) suggest that relative comparisons among top options might drive behavior, motivating the Difference model for confidence.
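The redundancy in two-alternative tasks, and its breakdown with three alternatives, can be made concrete with a short worked example. The notation is an assumption of this sketch: p_(1) and p_(2) denote the highest and second-highest posterior probabilities, and the numerical posteriors are illustrative values, not data from the study.

```latex
\begin{align*}
\text{Two alternatives:}\quad & p_{(1)} + p_{(2)} = 1 \;\Rightarrow\; p_{(1)} - p_{(2)} = 2\,p_{(1)} - 1, \\
\text{Three alternatives:}\quad & p = (0.5,\, 0.5,\, 0): && \max_C p = 0.5, \quad p_{(1)} - p_{(2)} = 0, \\
& p' = (0.5,\, 0.25,\, 0.25): && \max_C p' = 0.5, \quad p'_{(1)} - p'_{(2)} = 0.25.
\end{align*}
```

With two alternatives the difference is a monotonic function of the maximum, so any set of confidence criteria on one can be matched by criteria on the other. With three alternatives, the two illustrative posteriors share the same maximum but differ in their top-two gap, which is the kind of dissociation the three-alternative task is designed to exploit.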
Participants: N=13 (Exp. 1), N=11 (Exp. 2), N=11 (Exp. 3), all with normal or corrected-to-normal vision; the study was approved by the NYU IRB. Each experiment consisted of one 1-hour session (84 trials per configuration in Exp. 1; 120 trials per configuration in Exp. 2 and 3).
Setup: Dim room; 3840×2160 monitor at 30 Hz, viewed from 45 cm; calibrated display; stimuli implemented in JavaScript.
Stimuli and task: On each trial, three categories were each displayed as 375 colored exemplar dots (drawn from an uncorrelated isotropic Gaussian, SD=2°), together with a black target dot. Category colors were randomly chosen, equally spaced in CIE L*a*b* color space (L*=70, equidistant from gray). Participants first chose the target's category, then reported confidence on a four-point Likert scale; viewing and response time were unlimited. In Exp. 3, trial-wise feedback revealed the true category by leaving its exemplars visible for 500 ms after the confidence report.
Spatial configurations and sampling:
- Exp.1 and Exp.3 (1D horizontal manipulation): category centers vertically aligned; four horizontal configurations: (-3°,0°,3°), (-4°,0°,4°), (-3°,-2°,3°), (-3°,2°,3°). Target horizontal positions sampled uniformly between outer centers ±0.2°; vertical positions from N(0, 2°).
- Exp. 2 (2D manipulation): four configurations with centers varying along both axes; horizontal centers: (-2°,0°,2°), (-1.59°,0°,1.59°), (-2°,-2°,2°), (-2°,2°,2°); vertical centers: (1.16°,-2.31°,1.16°), (0.94°,-1.84°,0.94°), (1.16°,0°,1.16°), (1.16°,0°,1.16°). Targets were sampled uniformly within a 2.6°-radius circle centered on the screen.
Modeling framework: A Bayesian observer computes the posterior over the three categories given a noisy measurement x of the target position s. Generative assumptions: target positions within category C follow s ~ N(m_C, σ_s^2 I) and measurements follow x ~ N(s, σ_x^2 I), yielding p(x|C) = N(x; m_C, (σ_s^2+σ_x^2) I). Decision noise: instead of the true posterior p(C|x), the observer uses a noisy posterior q drawn from Dirichlet(α p), where the concentration α controls variability (higher α means lower noise). Choice: argmax_C q(C|x).
Confidence models (two-stage): the noisy posterior q is mapped to an internal confidence variable c*, and three criteria (b1, b2, b3) then partition c* into four discrete ratings. Three main mappings:
- Max model (Bayesian confidence hypothesis): c* = max_C q(C|x) = q(chosen|x).
- Difference model: c* = q(best|x) − q(second-best|x).
- Entropy model: c* = Σ_C q(C|x) log q(C|x), the negative entropy of q, so that confidence is higher when the posterior is more concentrated.
Common parameters across models: sensory noise magnitude σ_x, Dirichlet concentration α, criteria b1–b3, and lapse rate λ (random decision and confidence on lapse trials). Reduced models removed either sensory or decision noise; additional Bayesian and heuristic alternatives were also tested (e.g., sampling from the true posterior, uncertain category means, heuristic rules that do not compute posteriors).
Fitting and comparison: Category choice and confidence were fit jointly by maximum likelihood, using Monte Carlo simulations (10,000 samples per target position) to estimate p(C,c|s; θ); optimization used Bayesian Adaptive Direct Search. Models were compared via AIC summed across participants, with bootstrapped 95% CIs; BIC is reported in the Supplementary for confirmation. Model recovery analyses synthesized datasets from best-fit parameters to assess identifiability.
Visualization: sliding windows for 1D psychometric confidence curves (Exp. 1 and 3) and hexagonal-binning heat maps for 2D confidence (Exp. 2).
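The observer model and the three confidence mappings described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation. The function names, the equal category priors, the uniform lapse distribution over all choice-rating pairs, and every parameter value in the example call are assumptions made here. The Entropy mapping is written as negative entropy, Σ q log q, to match the description of that model as the negative entropy of the posterior.

```python
import numpy as np

def confidence_variable(q, mapping):
    """Map noisy posteriors q (rows sum to 1) to an internal confidence variable."""
    q = np.asarray(q, dtype=float)
    if mapping == "max":
        return q.max(axis=-1)
    if mapping == "difference":
        top_two = np.sort(q, axis=-1)[..., -2:]      # (second-best, best)
        return top_two[..., 1] - top_two[..., 0]
    if mapping == "entropy":
        safe_q = np.where(q > 0, q, 1.0)             # log(1) = 0 handles q = 0
        return (q * np.log(safe_q)).sum(axis=-1)     # negative entropy
    raise ValueError(f"unknown mapping: {mapping}")

def simulate_joint(s, centers, sigma_s, sigma_x, alpha, criteria, mapping,
                   lapse=0.0, n_samples=10_000, rng=None):
    """Monte Carlo estimate of p(choice, rating | s) as an (n_cat, n_ratings) table."""
    rng = np.random.default_rng() if rng is None else rng
    centers = np.asarray(centers, dtype=float)       # shape (n_cat, 2)
    # Sensory noise: x ~ N(s, sigma_x^2 I)
    x = np.asarray(s, dtype=float) + sigma_x * rng.standard_normal((n_samples, 2))
    # True posterior p(C|x), assuming equal priors (an assumption of this sketch)
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    log_like = -0.5 * sq_dist / (sigma_s**2 + sigma_x**2)
    p = np.exp(log_like - log_like.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Decision noise: q ~ Dirichlet(alpha * p), sampled as normalized gamma variates
    g = rng.gamma(np.maximum(alpha * p, 1e-12))
    q = g / g.sum(axis=1, keepdims=True)
    choice = q.argmax(axis=1)
    # Three ascending criteria split the confidence variable into ratings 0..3
    rating = np.digitize(confidence_variable(q, mapping), criteria)
    joint = np.zeros((centers.shape[0], len(criteria) + 1))
    np.add.at(joint, (choice, rating), 1.0)
    joint /= n_samples
    # Lapse trials: uniform over all choice-rating pairs (assumed here)
    return (1 - lapse) * joint + lapse / joint.size

# Illustrative call: one 1D configuration, with made-up parameter values
centers = [(-3.0, 0.0), (0.0, 0.0), (3.0, 0.0)]
joint = simulate_joint(s=(-3.0, 0.0), centers=centers, sigma_s=2.0, sigma_x=0.5,
                       alpha=50.0, criteria=[0.2, 0.5, 0.8], mapping="difference",
                       lapse=0.02, rng=np.random.default_rng(0))
print(joint.round(3))
```

The log-likelihood of an observed (choice, rating) pair at target position s would then be the log of the corresponding table entry, summed over trials. The criteria live on a different scale for each mapping (for example, negative entropy is never positive), so they would have to be fit separately per model.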
- Across three experiments, the Difference model best explained human confidence reports in a three-alternative categorization task, outperforming the Max (Bayesian confidence hypothesis) and Entropy models.
- Experiment 1 (1D configurations, no feedback): The Difference model outperformed the Max model by group-summed AIC Δ=391 (95% CI [222, 569]) and the Entropy model by Δ=1937 (95% CI [1363, 2562]). Only the Difference model captured lowest confidence near locations where two categories were closest. Entropy showed erroneous central dips in some conditions.
- Experiment 2 (2D configurations, no feedback): The Difference model outperformed Max by Δ=541 (95% CI [371, 735]) and Entropy by Δ=1631 (95% CI [1179, 2159]). It reproduced extended low-confidence regions where two categories were close; Entropy yielded overly convex confidence contours relative to data.
- Experiment 3 (1D configurations, with trial-wise feedback and true-category sampling): Difference outperformed Max by Δ=100 (95% CI [46, 156]) and Entropy by Δ=1113 (95% CI [817, 1447]). Smaller AIC gaps were attributable to stimulus selection concentrating more trials at extreme positions where all models predict high confidence; model recovery confirmed this.
- Source of variability: Decision/inference noise (Dirichlet) was more influential than sensory noise. Removing sensory noise from the Difference model increased AIC by 121 (95% CI [48, 199]), while removing decision noise increased AIC by 737 (95% CI [590, 914]). Results held under BIC.
- Robustness: Findings persisted when fitting confidence alone, choices alone, and under alternative noise structures (e.g., sampling true posterior, noisy knowledge of category means) and against heuristic decision rules; no heuristic consistently outperformed Difference.
- Behavioral choices were broadly consistent with selecting the category with highest posterior probability; confidence patterns specifically required the Difference mapping to be captured.
- Interpretation: Confidence reflects relative evidence between the top two options (subjective probability of having made the best possible choice), not solely the absolute posterior of the chosen option nor overall posterior entropy.
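The group-summed AIC differences and bootstrapped 95% CIs reported above could be computed along the following lines. This is a hedged sketch: resampling participants with replacement and taking percentile intervals is an assumption about the bootstrap procedure, the function names are hypothetical, and the per-participant AIC values are made up for illustration only.

```python
import numpy as np

def aic(neg_log_likelihood, n_params):
    """Akaike information criterion from a minimized negative log-likelihood."""
    return 2.0 * n_params + 2.0 * neg_log_likelihood

def summed_aic_difference(aic_model, aic_reference, n_boot=10_000, seed=0):
    """Group-summed AIC difference with a percentile bootstrap 95% CI.

    Participants are resampled with replacement (an assumption of this sketch).
    """
    diff = np.asarray(aic_model, dtype=float) - np.asarray(aic_reference, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one bootstrap sample of participant indices
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_sums = diff[idx].sum(axis=1)
    ci_low, ci_high = np.percentile(boot_sums, [2.5, 97.5])
    return diff.sum(), (ci_low, ci_high)

# Made-up per-participant AIC values, for illustration only (not data from the study)
aic_max = np.array([1510.0, 1498.0, 1623.0, 1405.0, 1577.0])
aic_difference = np.array([1480.0, 1490.0, 1570.0, 1399.0, 1541.0])
total, (ci_low, ci_high) = summed_aic_difference(aic_max, aic_difference)
print(f"summed AIC difference = {total:.0f}, 95% CI [{ci_low:.0f}, {ci_high:.0f}]")
```

A positive summed difference whose interval excludes zero would indicate that the reference model (here, Difference) is favored over the compared model at the group level.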
The results challenge the Bayesian confidence hypothesis by demonstrating that in multi-alternative decisions, human confidence depends on the relative posterior strength of the top two options. This supports an account in which observers evaluate whether their chosen option is sufficiently better than the alternative, aligning with the idea that confidence reflects the subjective probability of having made the best possible decision. The Difference model provides a compact computational description, is robust across stimulus configurations and feedback, and generalizes conceptually to more than three alternatives (predicting dependence on the top-two posterior gap when the sum of others is held constant). The dominance of decision/inference noise suggests late-stage computational imprecision is a key driver of variability under high-signal stimuli. The findings connect to broader decision-making literature where relative comparisons (e.g., tallest-minus-next thresholds, balance-of-evidence concepts) guide choices and confidence, and they motivate extensions to value-based multi-alternative decisions where confidence may likewise depend on the difference between the top two options’ probabilities of being most valuable. They also suggest new neural predictions: distinct correlates for probability-correct (Max) vs. difference-based confidence could be dissociated in multi-alternative paradigms, refining interpretations from prior two-alternative neurophysiology. While the present framework is static, it invites integration with temporal accumulation models to determine whether multi-accumulator dynamics can yield a top-two difference confidence signal consistent with observed data.
This work introduces and validates a difference-based computational account of confidence in multi-alternative decisions, showing that confidence tracks the posterior gap between the best and second-best options rather than the posterior of the chosen option alone or posterior entropy. Through three experiments and extensive model comparisons, the Difference model robustly explains both choices and confidence, highlighting the role of decision/inference noise. The study reframes theoretical and neural interpretations of confidence and suggests practical generalization to tasks with more alternatives and to value-based choice. Future research should: (1) test variable numbers of categories to distinguish Difference from extreme “editing” (e.g., Ratio) models; (2) integrate dynamic evidence accumulation to derive real-time confidence consistent with a top-two difference; (3) design neural studies in multi-alternative settings to dissociate representations of probability-correct vs. difference-based confidence; and (4) examine generalization to non-Gaussian, multimodal posteriors and to value-based multi-alternative decisions.
- Temporal dynamics not modeled: single-stimulus design precluded separating inference vs. selection noise within decision processes; confidence modeled statically without reaction time.
- Stimulus regime: high contrast, unlimited duration likely minimized sensory noise, possibly limiting generalizability to noisy sensory conditions.
- Decision noise modeling: Dirichlet perturbation is a specific choice; while alternatives (e.g., uncertain category means) were tested, the exact source of decision noise remains underspecified.
- Model indistinguishability: Difference vs. Ratio (discarding least-likely option from normalization) were largely indistinguishable in Experiments 1–2 and only slightly favored Difference in Experiment 3; designs varying number of categories are needed to conclusively separate them.
- Reduced diagnosticity in Experiment 3 due to sampling more extreme targets led to smaller AIC separations across models, emphasizing sensitivity to stimulus selection.
- Generalization to other domains and neural mechanisms is hypothesized but untested within this study.