When combinations of humans and AI are useful: A systematic review and meta-analysis

Computer Science

M. Vaccaro, A. Almaatouq, et al.

Conducted by Michelle Vaccaro, Abdullah Almaatouq and Thomas Malone, this preregistered meta-analysis of 106 experimental studies reveals that human–AI combinations on average perform worse than the best of humans or AI alone, with losses in decision tasks but gains in content-creation tasks, offering clear directions for improving human–AI systems.
Introduction

People increasingly work with AI tools across domains such as medicine, finance, law and everyday tasks. Although humans and AI have complementary strengths, recent studies show mixed results for human–AI collaboration due to issues like communication barriers, trust, ethics and coordination. This raises key questions: when do humans and AI complement each other, and by how much? The study focuses on two outcomes: (1) human–AI synergy (the human–AI group performs better than both human alone and AI alone) and (2) human augmentation (the human–AI group performs better than the human alone). To address these questions, the authors conducted a preregistered systematic review and meta-analysis of 106 experiments (370 effect sizes) published between January 2020 and June 2023 that report performance for human-only, AI-only, and human–AI conditions. They aimed to quantify overall synergy and identify moderators that explain when synergy or augmentation occurs.
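The two preregistered outcomes can be illustrated with a minimal sketch. The function names and the numbers below are hypothetical, not from the study; the point is only the shape of the two comparisons:

```python
def synergy_contrast(human, ai, human_ai):
    """Raw (unstandardized) synergy: human-AI performance minus the BETTER
    of the two solo baselines. Positive values indicate synergy."""
    return human_ai - max(human, ai)

def augmentation_contrast(human, human_ai):
    """Raw augmentation: human-AI performance minus the human-only baseline.
    Positive values indicate the AI augments the human."""
    return human_ai - human

# Hypothetical accuracies for one decision task where the AI alone is strongest:
human, ai, human_ai = 0.70, 0.80, 0.75
loss = synergy_contrast(human, ai, human_ai)   # negative: below the best solo agent
gain = augmentation_contrast(human, human_ai)  # positive: above the human alone
```

As this toy case shows, the same experiment can exhibit augmentation without synergy, which is exactly the overall pattern the meta-analysis reports.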

Literature Review

Eligibility criteria required original human-participant experiments evaluating a human–AI collaborative task and reporting quantitative performance for human alone, AI alone and human–AI together. Pure meta-analyses, reviews, theory, qualitative analyses, commentaries, opinions, simulations and papers lacking necessary statistics were excluded; studies had to include design, sample sizes and outcome variances (or sufficient information to compute them) and be in English. The search targeted ACM Digital Library, Association for Information Systems eLibrary and Web of Science Core Collection, limited to studies published from 1 January 2020 to 30 June 2023. A comprehensive search string combined terms for human participants, AI, collaboration and experiments in abstracts, followed by forward and backward citation searches. The initial search yielded 5,126 papers; 74 met inclusion criteria, reporting 106 unique experiments and 370 effect sizes. Data and preregistration are available on OSF.

Methodology

Data collection extracted means and standard deviations for performance in human-only, AI-only and human–AI conditions, along with sample sizes. When only CIs or SEs were reported, SDs were computed using standard formulas; if necessary, raw data were downloaded to compute statistics in Python or R, or numerical values were digitized from plots using WebPlotDigitizer. Authors were contacted for missing data; studies without enough information were excluded. Potential moderators coded included publication date, preregistration status, experimental design, data type, task type, task output, AI type, AI explanation, AI confidence, participant type and performance metric. Effect sizes measured standardized mean differences using Hedges’ g: for synergy, human–AI vs the better of human or AI alone; for augmentation, human–AI vs human alone. A random-effects three-level meta-analytic model accounted for variance at the effect-size, within-experiment and between-experiment levels; robust variance estimation was used alongside the Knapp–Hartung adjustment with t-distribution degrees of freedom based on experiment clusters. Separate meta-regressions were run for each moderator. Heterogeneity was quantified using I² and multilevel I². Publication bias was assessed with funnel plots, Egger’s regression and rank correlation tests. Sensitivity analyses included clustering at the paper level, outlier and influence diagnostics (residuals and Cook’s distance), re-analysis excluding outliers, leave-one-out analyses at effect size, experiment and publication levels, and re-analysis excluding effect sizes estimated from plots or indirect reports. Analyses were conducted in R (metafor and related packages).
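The effect-size computations described above can be sketched as follows. These are the standard SE/CI-to-SD conversions and the standard Hedges' g formula with its small-sample correction; the function names are illustrative, not taken from the authors' code:

```python
import math

def sd_from_se(se, n):
    # SE = SD / sqrt(n), so SD = SE * sqrt(n)
    return se * math.sqrt(n)

def sd_from_ci95(lower, upper, n):
    # For large n, the 95% CI half-width is approximately 1.96 * SE
    return math.sqrt(n) * (upper - lower) / (2 * 1.96)

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference (group 1 minus group 2) with the
    small-sample correction factor J applied to Cohen's d."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / pooled_sd
    j = 1 - 3 / (4 * df - 1)  # correction for small-sample bias
    return j * d
```

For the synergy outcome, group 2 would be whichever solo condition (human or AI) performed better; for augmentation, group 2 is the human-only condition. The multilevel aggregation itself was done with metafor in R, which this sketch does not reproduce.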

Key Findings

Sample and overall effects: 74 papers, 106 experiments, 370 effect sizes. Overall, human–AI systems performed worse than the better of human or AI alone (synergy: g = −0.23; t(102) = −2.89; P = 0.005; 95% CI, −0.39 to −0.07), indicating no average synergy. However, human–AI systems outperformed humans alone (augmentation: g = 0.64; t(105) = 11.87; P < 0.001; 95% CI, 0.53 to 0.74), a medium-to-large effect. Heterogeneity was substantial: I² = 97.7% (synergy) and 93.8% (augmentation). Moderators: task type significantly moderated synergy (F(1, 104) = 7.84; P = 0.006). Decision tasks showed losses (g = −0.27; t(104) = −3.20; P = 0.002; 95% CI, −0.44 to −0.10), while creation tasks (n = 34) showed positive but non-significant average gains (g = 0.19; t(104) = 1.35; P = 0.180; 95% CI, −0.09 to 0.48); the difference between the two task types was nonetheless statistically significant. Data type significantly moderated synergy (F(4, 101) = 15.24; P < 0.001) and augmentation (F(4, 101) = 6.52; P < 0.001). Relative human versus AI performance strongly moderated both outcomes (synergy: F(1, 104) = 81.79; P < 0.001; augmentation: F(1, 104) = 24.35; P < 0.001). When humans outperformed the AI alone, the combined system showed synergy (g = 0.46; t(104) = 5.06; P < 0.001; 95% CI, 0.28 to 0.66). When the AI outperformed humans alone, the combined system showed losses relative to the AI alone (g = −0.54; t(104) = −6.20; P < 0.001; 95% CI, −0.71 to −0.37). Augmentation was larger when the AI outperformed humans (g = 0.74; t(104) = 13.50; P < 0.001; 95% CI, 0.63 to 0.85). AI type (F(2, 103) = 3.77; P = 0.026) and publication year (F(3, 102) = 3.65; P = 0.015) moderated synergy; experimental design moderated augmentation (F(1, 104) = 4.90; P = 0.029). Explanations, confidence displays, participant type and division of labour (as implemented in most studies) were not significant moderators.
Publication bias diagnostics suggested no bias for synergy (Egger β = −0.67; t(104) = −0.78; P = 0.438; rank r = 0.05; P = 0.121) but potential bias favoring augmentation (Egger β = 1.96; t(104) = 3.24; P = 0.002; rank r = 0.19; P < 0.001). Sensitivity analyses yielded similar results (e.g., synergy excluding outliers: g = −0.25; t(104) = −3.45; P = 0.001; augmentation excluding outliers: g = 0.60; t(104) = 12.60; P < 0.001).
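The Egger diagnostic reported above can be sketched in its classical form: regress the standardized effect (g / SE) on precision (1 / SE); an intercept far from zero indicates small-study effects such as publication bias. The data below are synthetic, not the study's, and the paper's actual test is run within the multilevel metafor model rather than this simple OLS version:

```python
import statistics

def egger_intercept(effects, ses):
    """Intercept of the classical Egger regression: standardized effect
    (g / SE) regressed on precision (1 / SE) by ordinary least squares."""
    y = [g / se for g, se in zip(effects, ses)]  # standardized effects
    x = [1 / se for se in ses]                   # precisions
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
            / sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx                       # OLS intercept

# A roughly symmetric synthetic funnel yields an intercept near zero:
effects = [0.30, 0.28, 0.32, 0.25, 0.35]
ses = [0.05, 0.10, 0.10, 0.20, 0.20]
intercept = egger_intercept(effects, ses)
```

An asymmetric funnel, where small (high-SE) studies systematically report larger effects, would push the intercept away from zero, which is the pattern the authors flag for the augmentation outcome.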

Discussion

The analysis shows that, on average, human–AI collaboration did not yield synergy, with combined systems performing worse than the better solo performer. Nonetheless, human augmentation was present: adding AI improved performance over humans alone. These findings suggest that synergy requires specific interaction designs beyond simple advice-taking, and that many recent experiments may not be configured to elicit synergy. Task type matters: decision tasks tend to incur losses, whereas creation tasks show gains, likely because creative tasks blend human insight with routine generation where AI excels. The relative strengths of humans and AI are pivotal: when humans are better, they can more effectively decide when to rely on themselves versus the AI, leading to gains; when AI is better, human intervention can introduce errors or miscalibrated reliance, reducing performance relative to AI alone. Surprisingly, common design features like explanations or confidence indicators did not systematically improve outcomes, implying research should pivot toward understanding baseline capabilities, task types and effective division of labour.

Conclusion

Across 106 experiments from 2020–2023, human–AI systems often underperform the best solo agent, though they typically augment humans. The study contributes a quantitative synthesis identifying when collaboration helps or hurts and highlights moderators that matter. It outlines a roadmap to find human–AI synergy: focus on creation tasks and generative AI; design innovative processes that allocate subtasks to the more capable partner (including predetermined delegation); develop robust, multi-criteria evaluation metrics (accuracy, cost, time, error costs) and appropriate augmentation measures for high-stakes contexts; and establish commensurability criteria (benchmark tasks, quality thresholds, incentives, process taxonomies, evaluation metrics) plus open repositories with standardized reporting to enable synthesis and predictive design of optimized human–AI systems.

Limitations

The meta-analysis is limited to studies reporting performance for human-only, AI-only and human–AI conditions, excluding tasks solvable only via collaboration. Effect sizes (Hedges’ g) unify diverse metrics (accuracy, error, quality), but differences in samples (e.g., doctors, crowdworkers, students) and measurement error reduce comparability. Publication bias may exist (especially favoring augmentation), though tests did not show bias for synergy. Results reflect tasks, processes and participant pools chosen by researchers and may not generalize to real-world configurations. Study rigor varies (attention checks, incentives), adding noise. Substantial heterogeneity remains, with moderators explaining only part of it; potential interaction effects likely exist, but the current literature lacks enough studies to detect them.
