Psychology
Low replicability can support robust and efficient science
S. Lewandowsky and K. Oberauer
The paper asks how best to resolve the replication crisis in psychology by comparing alternative replication regimes. The authors note low replicability rates across subfields and contrast views that attribute failures to unknown moderators with views that emphasize suboptimal research practices (low power, p-hacking, optional stopping, HARKing, publication bias). They model an idealized, transparent scientific community to evaluate when, how, and by whom replications should be conducted. The key research question is whether requiring pre-publication replication (a private regime) or allowing immediate publication followed by selective, interest-driven replication (a public regime) yields more efficient and successful knowledge accumulation.
The article reviews evidence of low replicability in psychology and high-profile journals, and summarizes known contributors: low statistical power, p-hacking (including optional stopping), HARKing, and publication bias. It acknowledges reforms such as preregistration, stricter significance thresholds, stronger theories, and publishing null results. It distinguishes discovery-oriented research (broad exploratory searches with low prior odds and higher risk of non-replicable findings) from theory-testing research (constrained predictions with higher prior odds and greater informativeness of successes and falsifications). Citations are discussed as a proxy for community interest, with distributions highly skewed and not predictive of replicability.
Design: A computational simulation of a scientific community crossing two orthogonal dimensions: (a) knowledge-acquisition strategy (discovery-oriented vs. theory-testing) and (b) replication regime, either private (pre-publication replication of significant findings; only successful replications are published) or public (significant findings are published immediately; the community selectively chooses which to replicate based on interest). Outcomes are success (number of true, interesting phenomena identified) and efficiency (total number of experiments required).

Ground truth: A 10×10 grid represents all possible experiments. For discovery-oriented research, each cell is randomly set to H1 with probability P(H1) = 0.09 and to H0 otherwise, reinitialized for each of 1000 simulation replications. For theory-testing, true effects are clustered within a 4×4 region around a random centroid.

Experimental process: In each simulation replication, 100 first-round experiments are run by sampling data from normal distributions with mean equal to the cell's ground truth (0 or 1) and a specified σ.

Significance criteria: a frequentist two-tailed single-sample t-test at α = 0.05 with power 0.5 (n = 18) or 0.8 (n = 34), or a Bayesian single-sample t-test (JZS prior: Cauchy on effect size) with BF10 > 3 (n = 34, σ = 1.5), achieving roughly 0.8 "power". For the symmetrical Bayesian analyses in theory-testing, evidence of absence requires BF01 > 3 (i.e., BF10 < 1/3). Significant first-round findings enter replication according to the regime.

Replication regimes: Private: every significant first-round result is replicated, and both studies are published only if the replication succeeds. Public: every significant first-round result is published, and the community then decides whether to replicate it based on scientific interest.

Scientific interest modeling: Interest is proxied by citations. Empirical citation counts for 1665 psychology articles (2014) are fit with a generalized Pareto distribution (k = 0.115, σ = 8.71, θ = 0). For each finding, a citation count x is sampled and mapped to an interest probability via a logistic function centered at the 90th-percentile threshold q ≈ 22.98 citations: P(interest|x) = 1/(1 + exp(−(x − q)/t)), with temperature t ∈ {1, 5, 10} controlling decision sharpness. Higher t yields a more graded threshold (more articles are considered interesting).

Theory-testing configuration: The theory predicts a 4×4 region whose centroid's proximity to the true-effects centroid is governed by a parameter p ∈ {0.1, 0.5, 1.0}. p = 1 aligns theory with reality; lower p shifts the predicted cluster away by (1 − p) × 9 rows/columns, within grid bounds.

Perturbations/boundary conditions: Robustness is tested under questionable research practices: an inflated type I error rate (α = 0.2; p-hacking) and optional stopping (adding batches of N ∈ {1, 5, 10} participants up to five times until significance is reached). Two extreme cases are examined in the supplement: (a) fraud (all first-round results are declared significant; only replications under the public regime are honest) and (b) absence of true effects, P(H1) = 0. Analyses are run for both frequentist and Bayesian approaches; discovery-oriented analyses focus on non-null effects, whereas theory-testing includes analyses in which both presence and absence (null effects) are notable and subject to replication decisions.
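To make the machinery concrete, here is a minimal Python sketch of the discovery-oriented first round and the citation-based interest model. It is an illustration, not the authors' code: scipy and numpy are assumed, the function names are invented for this sketch, and σ = 2 is inferred from the stated power (an effect of d = 0.5 gives ~0.8 power at n = 34).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Ground truth (discovery-oriented): a 10x10 grid in which each cell hosts a
# real effect (H1, population mean 1) with probability 0.09, else H0 (mean 0).
P_H1 = 0.09
truth = rng.random((10, 10)) < P_H1

# First round: one experiment per cell, n observations from N(mu, sigma),
# tested with a two-tailed one-sample t-test at alpha = .05. sigma = 2 is an
# assumption inferred from the stated power (d = 0.5 -> ~.8 power at n = 34).
def first_round(truth, n=34, sigma=2.0, alpha=0.05):
    significant = np.zeros(truth.shape, dtype=bool)
    for i, j in np.ndindex(truth.shape):
        sample = rng.normal(float(truth[i, j]), sigma, size=n)
        _, p = stats.ttest_1samp(sample, 0.0)
        significant[i, j] = p < alpha
    return significant

# Community interest: a citation count is drawn per finding from the fitted
# generalized Pareto distribution (k = 0.115, sigma = 8.71, theta = 0) and
# mapped through a logistic function centered on the 90th-percentile count q.
K, SIGMA_GPD, THETA = 0.115, 8.71, 0.0

def interesting(n_findings, temperature=5.0):
    citations = stats.genpareto.rvs(K, loc=THETA, scale=SIGMA_GPD,
                                    size=n_findings, random_state=rng)
    q = stats.genpareto.ppf(0.90, K, loc=THETA, scale=SIGMA_GPD)  # ~22.98
    p_interest = 1.0 / (1.0 + np.exp(-(citations - q) / temperature))
    return rng.random(n_findings) < p_interest

sig = first_round(truth)
print(f"{int(sig.sum())} significant first-round findings, "
      f"{int(interesting(int(sig.sum())).sum())} deemed interesting")
```

Under the private regime every significant cell would be re-run before publication; under the public regime only the cells flagged as interesting are queued for replication, which is where the efficiency gap between the regimes originates.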
- Efficiency advantage of public regime: Publishing single, potentially non-replicable studies and allowing selective post-publication replication consistently minimized total experimental effort while identifying about as many true, interesting effects as the private regime.
- Discovery-oriented (frequentist): Private replication incurred about 10 additional experiments per 100-study round (~10% extra cost) compared with the public regime, largely independent of power (0.5 vs 0.8). Nearly all successful replications were true effects, because the probability of two successive type I errors is low (α² = 0.05² = 0.0025).
- Discovery-oriented (Bayesian): The opportunity cost of the private regime persisted when using BF10>3 instead of p-values.
- Theory-testing (frequentist and Bayesian tests for presence): As theory quality improved (p→1), more true effects were discovered under both regimes, but the private regime required substantially more experiments. With a perfect theory (p=1) and a diffuse interest threshold (high temperature), private replication demanded nearly 40% additional experimental effort relative to public replication.
- Theory-testing (symmetrical Bayesian, presence and absence): When both evidence for effects (BF10>3) and for null effects (BF01>3) were considered notable, the private regime’s cost penalty increased to 50% or more, because null effects also required replication regardless of community interest.
- Success parity across regimes: The number of true, interesting phenomena discovered was similar across regimes; differences lay primarily in cost/efficiency.
- Robustness: The cost advantage of the public regime persisted under p-hacking (α inflated to 0.2) and under optional stopping (adding batches of 1, 5, or 10 participants up to five times; illustrated in the sketch after this list). These practices increased the fraction of replicated-but-false effects but did not alter the relative efficiency pattern.
- Boundary/extreme cases: In a fraud scenario (all first-round results declared significant), public replications led to higher overall experimental cost and more falsely identified interesting effects, underscoring that the case for the public regime does not extend to condoning fabricated results. When P(H1) = 0 (e.g., a parapsychology-like landscape), overall costs decreased, but the public regime's efficiency advantage remained.
- Editorial workload consideration: Under the public regime for discovery-oriented research with P(H1) = 0.09, α = 0.05, and power = 0.8, the expected number of first-round significant findings published is ~11.75 (0.8 × 9 true effects + 0.05 × 91 null effects; see the sketch below). The estimated added editorial/review burden (~188 h) is argued to be modest compared with the time required for the ~10 extra replication studies demanded by the private regime.
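Two of the figures above are easy to verify directly. The sketch below (illustrative only; the function name is invented, and σ = 2 is the same assumption as in the earlier sketch) simulates batch-wise optional stopping under H0 to show the type I inflation the robustness check relies on, and reproduces the ~11.75 expected publication count.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Optional stopping: start with n participants and, while the test is not yet
# significant, add up to `max_batches` batches of `batch` participants,
# re-testing after each addition. Repeated looks inflate the type I error.
def optional_stopping(mu, sigma=2.0, n=34, batch=5, max_batches=5, alpha=0.05):
    sample = list(rng.normal(mu, sigma, size=n))
    while True:
        _, p = stats.ttest_1samp(sample, 0.0)
        if p < alpha:
            return True
        if len(sample) >= n + max_batches * batch:
            return False
        sample.extend(rng.normal(mu, sigma, size=batch))

# Under H0 (mu = 0) the empirical false-positive rate exceeds the nominal 5%.
fp = np.mean([optional_stopping(0.0) for _ in range(2000)])
print(f"type I error with optional stopping: {fp:.3f}")

# Expected first-round publications under the public regime with 100 studies:
# power x 9 true effects + alpha x 91 null effects.
print(0.8 * 9 + 0.05 * 91)  # 11.75
```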
The simulations directly address how replication should be organized to optimize scientific progress. Across knowledge-acquisition strategies (discovery-oriented and theory-testing) and statistical frameworks (frequentist and Bayesian), publishing initial significant results without requiring pre-publication replication, and then allocating replication effort based on community interest, yields similar scientific gains at markedly lower cost. The efficiency stems from avoiding replication of findings that the community deems uninteresting. The approach aligns with a cost–benefit view of research, freeing resources for theory development and for powerful, targeted replications, often by independent labs, a practice experts consider more informative. The authors acknowledge concerns: increased editorial workload and the risk that preliminary, unreplicated findings may be overinterpreted by the media or by researchers. They propose publication reforms to mitigate these risks, including provisional publications with embargoes, explicit replication invitations, coauthorship for successful replicators in the archival record, public withdrawal upon failed replication, and archival status if no replication occurs after the embargo, thereby reducing publication bias and clarifying evidential status. The model remains compatible with preregistration and strong-theory approaches and highlights the need for broader community-level cost–benefit analyses that account for varying payoffs of type I/II errors and equitable workload distribution.
The paper contributes a formal, simulation-based argument that, if managed appropriately, low replicability at the point of initial publication can support a more robust and efficient scientific process. A public, interest-driven replication regime minimizes wasted effort compared to mandatory pre-publication replications without compromising the ultimate identification of true effects (and, under symmetrical Bayesian analysis, true nulls). The authors recommend cultural and procedural changes to publication and replication practices to realize these gains. Future research should extend cost–benefit modeling to include differential payoffs for errors, editorial/reviewer workload, field-specific constraints (e.g., clinical trials), and more realistic models of interest formation and incentive structures, as well as investigate equity implications across researcher demographics and institutions.
- The model uses citations as a proxy for community interest; while interest is central to the argument, citations may be noisy measures of quality. The results rely on the presence of any differentiation in interest, not on citations reflecting merit per se.
- The simulated community is idealized: transparent, adequately powered, and free of questionable practices or fraud; although robustness checks were performed (inflated α, optional stopping, fraud), real-world complexities (e.g., nuanced incentives, heterogeneity of study costs) are not fully captured.
- Preliminary results under a public regime risk misinterpretation by media and researchers; proposed mitigations (embargo, provisional status) require cultural and infrastructural change.
- Editorial and reviewer workload was considered at an aggregate, approximate level; the model does not capture disparities in burden across career stages, genders, or institutions.
- The simulations assume specific statistical thresholds (α=0.05; BF10>3/BF01>3), grid sizes, and base rates (P(H1)=0.09); while sensitivity analyses were reported (e.g., other thresholds, temperatures), other parameterizations might affect magnitudes.
- The theory–reality overlap parameterization and clustering are simplified; real theories and phenomena may involve more complex structures and dependencies.