Psychology
How do online users respond to crowdsourced fact-checking?
F. Panizza, P. Ronzani, et al.
The study investigates how online users respond to crowdsourced fact-checking and the extent to which peers’ prior ratings influence judgments of scientific content on social media. Motivated by the spread of misinformation and limitations of professional fact-checking in private or rapid online contexts, the authors examine whether showing distributions of prior participants’ validity ratings affects accuracy, response tendencies, decision time, and search behavior. They ask whether users blindly follow majorities, integrate peer ratings with personal evaluation, or ignore them, and which factors (distribution properties and personal relevance) predict reliance on peer information. Understanding these dynamics informs platform design and policy regarding crowdsourced fact-checking interventions.
Prior work shows social influence shapes online decisions (e.g., recommendations, financial decisions, Wikipedia discussions) and that comments or peer refutations can alter content perceptions. The MAIN model suggests interface cues (like peer ratings) can trigger heuristics affecting credibility judgments. Crowdsourced approaches (e.g., Twitter’s Birdwatch/Community Notes) show promise in reducing spread and agreement with misleading content, and congruence with AI assessments can increase trust. However, some studies find that receiving others’ search resources does not outperform self-collected evidence. Users often believe others are more susceptible to misinformation (third-person effect), complicating reliance on others’ judgments. The present work addresses gaps by directly testing behavioral responses to crowdsourced rating distributions, considering distribution properties (majority accuracy, peak deviation, opinion contrast) and personal relevance, and examining effects on response times and search strategies.
Design and participants: A preregistered online experiment (osf.io/egkxy) recruited 1001 UK residents on Prolific (July 22–23, 2022); 1 exclusion per preregistered criteria yielded N=1000 (mean age 36, SD=13; 50% female; 58.5% with Bachelor’s or higher). The sample skewed younger and more educated than UK averages. Participants were evenly randomized across 10 stimuli.
Stimuli and task: Each participant viewed one of ten Facebook posts on science topics (6 climate, 4 health/nutrition), balanced on validity (5 valid, 5 invalid) and other attributes. Sources were largely unfamiliar to minimize source effects. Participants rated scientific validity on a forced-choice 6-point Likert scale (1=definitely invalid to 6=definitely valid). Below each post, participants saw a histogram showing the distribution of validity ratings from an earlier experiment (Ronzani et al., 2023; “inform others” condition). Distributions were real; for two posts (both about cannabis and heart attacks), the most-selected rating misleadingly pointed to the incorrect answer. Evaluation was self-paced, and participants could leave the page to search for information; they then completed questionnaires. Median completion time was 5 minutes; compensation was £0.70.
Distribution properties: For each post, authors coded: (a) majority opinion (most-selected rating) and whether it was accurate or misleading; (b) peak deviation (#most-selected − #second-most-selected), indicating strength of consensus; (c) opinion contrast (whether the second-most peak indicated the opposite validity side), indicating clarity of consensus.
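The three distribution properties can be illustrated with a minimal Python sketch. It is not the authors' code: the function name, the dictionary layout, the tie-breaking rule, and the 3.5 midpoint used to split the 6-point scale into an "invalid" side (1–3) and a "valid" side (4–6) are assumptions made for illustration, and the example histogram is hypothetical.

```python
from typing import Dict


def distribution_properties(counts: Dict[int, int], post_is_valid: bool,
                            midpoint: float = 3.5) -> dict:
    """Summarize a histogram of 1-6 validity ratings (rating -> number of raters)."""
    # Rank ratings by count, most-selected first.
    # Ties are broken by the lower rating, which is an arbitrary assumption.
    ranked = sorted(counts, key=lambda r: (-counts[r], r))
    top, second = ranked[0], ranked[1]
    top_says_valid = top > midpoint
    return {
        "majority_opinion": top,
        # The majority is accurate when its side of the scale matches the truth.
        "majority_accurate": top_says_valid == post_is_valid,
        # Strength of consensus: gap between the two tallest bars.
        "peak_deviation": counts[top] - counts[second],
        # Clarity of consensus: does the runner-up sit on the opposite side?
        "opinion_contrast": top_says_valid != (second > midpoint),
    }


# Hypothetical histogram for a valid post; these are not data from the study.
example = {1: 4, 2: 9, 3: 12, 4: 20, 5: 35, 6: 18}
props = distribution_properties(example, post_is_valid=True)
print(props)
```

In this hypothetical case the two tallest bars (ratings 5 and 4) both sit on the "valid" side, so there is no opinion contrast, and the peak deviation is 35 − 20 = 15.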
Measures and preregistered hypotheses:
- Accuracy outcomes: (1) Accuracy score (1–6) recoded from the 6-point validity response to reflect distance from truth direction (6=most accurate), and (2) Correct guessing (valid vs invalid correctness). H1 predicted that relative to original participants, observers would be more accurate when the majority was accurate and less accurate when it was misleading (condition × informativeness interaction), affecting both accuracy score (H1a) and correct guessing (H1b).
- Opinion-following: Distance between participant rating and the distribution’s majority opinion (0=match). Predictors: peak deviation (H2a), opinion contrast (H2b; closer when no contrast), and personal relevance (H2c; closer when lower relevance).
- Response times: Rank-transformed response times. H3a: observers faster than original; H3b: higher personal relevance slows responses.
- Search behavior: Self-reported lateral reading (use of search engines) and click restraint (checking beyond top results). H4a: observers report less of both than original; H4b: higher personal relevance increases both.
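The outcome codings described in the list above can be sketched as follows. This is one plausible reading and not the authors' implementation: reverse-coding invalid posts as 7 − rating, treating ratings 4–6 as "valid", and measuring opinion-following as an absolute difference are assumptions, and all function names are hypothetical.

```python
def accuracy_score(rating: int, post_is_valid: bool) -> int:
    """Recode a 1-6 validity rating so that 6 is always the most accurate answer."""
    if not 1 <= rating <= 6:
        raise ValueError("rating must be between 1 and 6")
    # Assumption: ratings of invalid posts are reverse-coded (1 -> 6, ..., 6 -> 1).
    return rating if post_is_valid else 7 - rating


def correct_guess(rating: int, post_is_valid: bool) -> bool:
    """True when the rating falls on the correct side of the scale."""
    # Assumption: 4-6 counts as "valid" and 1-3 as "invalid".
    return accuracy_score(rating, post_is_valid) >= 4


def distance_to_majority(rating: int, majority_opinion: int) -> int:
    """Opinion-following measure: 0 means the rating matches the majority opinion."""
    # Assumption: distance is the absolute difference on the 6-point scale.
    return abs(rating - majority_opinion)


# A rating of 2 ("probably invalid") on an invalid post is close to the truth.
print(accuracy_score(2, post_is_valid=False))   # recoded accuracy score
print(correct_guess(2, post_is_valid=False))    # correct side of the scale?
print(distance_to_majority(2, majority_opinion=5))
```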
Controls: Measures included confidence, sharing intention, plausibility, subjective knowledge, personal relevance (0–100 plus yes/no), familiarity and trust in the source, trust in scientists, conspiratorial beliefs, altruism, social comparison, and self-reported evaluation strategy (including whether they followed previous answers and why). Platform metadata (education, SES, social media use, belief in climate change) were collected. Statistical analyses used R, α=0.05, two-tailed tests, Benjamini-Hochberg corrections, and robustness checks; given N=10 stimuli, primary regressions did not cluster by post (mixed-effects in supplements yielded similar results). Some preregistered tests were adapted (e.g., linear regression with rank-transformed times) and some exploratory tests were limited by data constraints.
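For readers unfamiliar with the Benjamini-Hochberg correction mentioned above, the following is a minimal pure-Python sketch of the standard step-up adjustment. The authors worked in R, so this is an illustration of the technique rather than their code; the function name and the raw p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values, not from the study
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

An adjusted p-value below α=0.05 is then treated as significant while controlling the false discovery rate across the family of tests.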
Comparators: Observers were compared primarily to the original participants whose ratings formed the displayed distributions; additional comparisons to a control group are in supplements.
Randomization and baseline performance: Participants were evenly distributed across posts (χ²(9)=0.740, p≈1). Median evaluation time: 36 s (original: 34 s). Mean accuracy score: 4.07 (SD=1.45), with 64.4% correct guessing.
Effect of distributions on accuracy (H1): There was a significant interaction between condition (observer vs original) and informativeness (accurate vs misleading majority) for both outcomes:
• Accuracy score (ordered logistic): β=0.695 [0.314, 1.076], z=2.405, p<0.001.
• Correct guessing (logistic): β=0.581 [0.109, 1.056], z=2.405, p=0.016.
Post-hoc contrasts showed effects in the predicted directions for accuracy scores:
• Misleading majority: β=−0.413 [−0.798, −0.028], z=2.398, p=0.017 (accuracy reduced vs original).
• Accurate majority: β=0.282 [0.081, 0.483], z=3.136, p=0.003 (accuracy increased vs original).
For correct guessing, post-hoc estimates were not statistically significant:
• Misleading majority: β=−0.426 [−0.905, 0.053], z=1.988, p=0.094.
• Accurate majority: β=0.155 [−0.095, 0.404], z=1.388, p=0.165.
Robustness: Results held with random effects for post, when excluding participants who reported using distributions only due to demand, and when using an alternate control (supplementary). Notably, 91.5% reported not following previous answers; restricting to this subgroup, the interaction remained significant for accuracy score (β=0.732 [0.341, 1.123], z=3.670, p<0.001) and was marginal for correct guessing (β=0.471 [−0.010, 0.955], z=1.914, p=0.056).
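Because logistic regression coefficients are on the log-odds scale, exponentiating them can make their size easier to read. The sketch below applies this to the correct-guessing interaction reported above (β=0.581 [0.109, 1.056]); the helper name is hypothetical, and the interpretation assumes the reported β and interval are untransformed log-odds coefficients.

```python
import math
from typing import Tuple


def to_odds_ratio(beta: float, ci_low: float, ci_high: float) -> Tuple[float, float, float]:
    """Exponentiate a log-odds coefficient and its confidence limits."""
    return math.exp(beta), math.exp(ci_low), math.exp(ci_high)


# Interaction term for correct guessing, as reported in the summary.
or_est, or_low, or_high = to_odds_ratio(0.581, 0.109, 1.056)
print(f"ratio of odds ratios: {or_est:.2f} [{or_low:.2f}, {or_high:.2f}]")
```

Since the coefficient is an interaction term, the exponentiated value is a ratio of odds ratios: how much the observer-vs-original odds of a correct guess differ between posts with an accurate majority and posts with a misleading one.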
Opinion-following predictors (H2): Despite visible overlap with majority tendencies, distances to the majority opinion were predicted neither by distribution properties nor by personal relevance:
• Peak deviation (H2a): β≈0.000 [−0.031, 0.031], z=0.001, p=0.999.
• Opinion contrast (H2b): β=−0.248 [−0.612, 0.115], z=1.633, p=0.307.
• Personal relevance (H2c): β=0.002 [−0.005, 0.008], z=0.589, p=0.834.
Variance analyses indicated greater dispersion and extremity among observers:
• Fligner-Killeen test: SD (observer)=1.45 vs SD (original)=1.38; χ²(1)=4.662, p=0.031.
• More extreme ratings: ordered logistic β=0.232 [0.068, 0.395], z=2.780, p=0.006.
Response times (H3): No significant difference between groups and no effect of personal relevance:
• Group difference (linear regression on rank-transformed times): β≈−39 [−90, 11], z=1.525, p=0.128; medians 36.3 s (observer) vs 34.4 s (original).
• Personal relevance: β≈1 [−1, 2], z=1.106, p=0.269.
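A rank transformation replaces each response time with its position in the sorted sample, which blunts the influence of very long times before a linear regression. The following is a minimal sketch of that step; giving tied values their average rank is an assumption (the summary does not say how ties were handled), and the function name and times are hypothetical.

```python
from typing import List


def rank_transform(values: List[float]) -> List[float]:
    """Replace each value with its rank (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the window over a run of tied values.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


# Hypothetical evaluation times in seconds, including one extreme value.
times = [36.3, 12.0, 410.5, 34.4, 34.4]
ranks = rank_transform(times)
print(ranks)
```

The extreme 410.5 s response simply becomes the highest rank, so it no longer dominates the fit.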
Search behavior (H4): No reduction in reported lateral reading or click restraint vs original; personal relevance did not predict search behaviors:
• Lateral reading: β=−0.009 [−0.304, 0.285], z=0.060, p=0.952.
• Click restraint: β=−0.047 [−0.402, 0.308], z=0.259, p=0.795.
• Personal relevance: lateral reading β=0.006 [−0.002, 0.015], z=1.403, p=0.161; click restraint β=0.009 [−0.002, 0.020], z=1.607, p=0.108.
Findings demonstrate that showing crowdsourced distributions subtly but reliably influences users’ validity judgments, improving accuracy when the majority is accurate and harming it when the majority is misleading. However, users did not simply copy the most selected option; rather, they appeared to use the distribution as a directional cue and integrated it with their own evaluation and searches. Self-reports downplayed reliance on others’ opinions, consistent with third-person and self-presentation effects, yet behavioral data showed influence even among those denying it. The lack of changes in response times and search style suggests that the presence of peer ratings did not replace individual effort; users continued to spend similar time and reported similar search strategies, potentially seeking confirmation or counterevidence relative to the distribution trend. These results support the potential of transparent, non-coercive, crowd-based informative nudges on platforms, while underscoring the continued role of individual reasoning and context-dependent social learning in online information evaluation.
Crowdsourced fact-checking signals, presented as distributions of prior ratings, do influence online users’ evaluations but do not lead to blind conformity. Users appear to use peer information as a cue and combine it with personal reasoning and research. This approach can bolster resilience to misinformation without sacrificing user autonomy and may be a viable, transparent nudge for platforms. Future work should test varied domains (political, economic, historical), more ecologically valid and recent content, known sources, different rating formats (e.g., binary), and settings where communities accrue trust and reputation.
- Ecological validity of stimuli: Many posts were older and experimenter-selected, potentially unrepresentative of typical, current newsfeed content.
- Source familiarity: Focus on unfamiliar sources limits generalizability; crowdsourced advice might have lower impact when users have priors about known sources.
- Response format: The 6-point scale may affect how users align with majority ratings and base-rate expectations; effects may differ with binary formats.
- Sample characteristics: Participants were younger and more educated than average UK social media users, limiting generalizability across demographics.
- Domain scope and platform format: Study focused on scientific topics and distribution-style summaries; results may differ for other domains and for formats like Community Notes that include contextual summaries and references.
- Statistical and design constraints: Limited number of stimuli (N=10) and correlations between predictors and random effects constrained some preregistered/exploratory analyses.

