When combinations of humans and AI are useful: A systematic review and meta-analysis

M. Vaccaro, A. Almaatouq, et al.

In this preregistered systematic review and meta-analysis, researchers Michelle Vaccaro, Abdullah Almaatouq and Thomas Malone examined when human-AI combinations outperform humans or AI working alone. Surprisingly, they found that, on average, these combinations performed worse than the better of the two alone, with losses concentrated in decision-making tasks but potential advantages in content-creation tasks. Their findings reveal the complex dynamics of collaboration between humans and AI.

Introduction
The study investigates when and how combinations of humans and AI outperform either component alone. Motivated by the widespread adoption of AI in domains such as medicine, finance and law, the authors note that empirical findings on human-AI collaboration are mixed, with proposed explanations including communication barriers, miscalibrated trust, ethical concerns and coordination problems. They define two outcomes: human-AI synergy (combined performance exceeds both human-alone and AI-alone performance) and human augmentation (combined performance exceeds human-alone performance). The research aims to quantify overall synergy and augmentation and to identify moderators, such as task type and relative human/AI strength, that explain when collaboration yields gains or losses.
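To make these two outcomes concrete, the brief R sketch below uses purely hypothetical accuracy scores (not values from the review) to show how a combined system can augment humans without achieving synergy.

```r
# Hypothetical accuracy scores, for illustration only
human_alone <- 0.70
ai_alone    <- 0.80
human_ai    <- 0.75

# Augmentation: the combined system beats the human-alone baseline
human_ai > human_alone                   # TRUE

# Synergy: the combined system beats the better of the two baselines
human_ai > max(human_alone, ai_alone)    # FALSE: still below the AI-alone baseline
```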
Literature Review
The authors conducted a preregistered (OSF: https://osf.io/wrq7c/?view_only=b9e1e86079c048b4bfb03bee6966e560) systematic literature review across the ACM Digital Library, Web of Science Core Collection and AIS eLibrary, restricted to studies published between 1 January 2020 and 30 June 2023. The search string combined terms for humans, AI, collaboration and experimentation in abstracts. Inclusion criteria required original human-participant experiments, written in English, that reported quantitative performance for (1) human-alone, (2) AI-alone and (3) human-AI conditions, along with sufficient statistics (means, SDs, Ns or convertible equivalents) to compute effect sizes. Exclusions included meta-analyses, purely theoretical or qualitative work, commentaries, simulations, and studies missing required conditions. Forward and backward citation searches supplemented the database results. The initial search yielded 5,126 papers; screening identified 74 papers meeting the criteria, comprising 106 unique experiments and 370 effect sizes.
Methodology
- Design: Meta-analysis of experimental studies comparing human-alone, AI-alone and human-AI conditions.
- Outcomes: (1) Human-AI synergy: standardized mean difference (Hedges' g) of human-AI vs max(human-alone, AI-alone); (2) human augmentation: Hedges' g of human-AI vs human-alone.
- Data extraction: Collected means, SDs and Ns per condition; when unavailable, derived SDs from CIs/SEs, used publicly available datasets to compute statistics, digitized plotted values with WebPlotDigitizer, or contacted authors for missing data. Multiple experiments, treatments and measures per paper were uniquely indexed.
- Moderators coded: publication date, preregistration status, AI type (deep, shallow, Wizard-of-Oz), experimental design (dependent/independent samples), data type (image, text, numeric, video, multiple), task type (decision vs creation), task output (binary, categorical, numeric, open response), AI explanation, AI confidence, participant type (experts, crowdworkers, students) and performance metric.
- Effect size computation: Hedges' g used as a unitless standardized mean difference with small-sample bias correction. For synergy, baseline = max(human-alone, AI-alone). For augmentation, baseline = human-alone.
- Meta-analytic model: Three-level random-effects model with effect sizes nested within experiments to account for dependence; robust variance estimation; Knapp–Hartung adjustments with a t-distribution using k − p degrees of freedom. Separate meta-regressions for each moderator. Heterogeneity quantified with I^2, including multilevel variants.
- Bias assessment: Contour-enhanced funnel plots, Egger's regression and rank correlation tests evaluated publication bias.
- Sensitivity analyses: Alternative clustering at the paper level; outlier and influence diagnostics via residuals and Cook's distance with re-analyses excluding outliers; leave-one-out analyses at the effect size, experiment and publication levels; re-analysis excluding effect sizes estimated via digitization or author-provided plot information. Analyses conducted in R (metafor package); an illustrative sketch follows this list.
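The R sketch below illustrates the general shape of this pipeline with the metafor package. It is a minimal sketch, not the authors' code: the toy data frame `dat` and column names such as m_hai, m_base and task_type are hypothetical placeholders.

```r
# Sketch of the analysis pipeline described above, using the metafor package.
# `dat` is a small hypothetical toy dataset (NOT the authors' data); the column
# names are placeholders for illustration only.
library(metafor)

dat <- data.frame(
  experiment = c("exp1", "exp1", "exp2", "exp2", "exp3", "exp4"),
  es_id      = 1:6,
  task_type  = c("decision", "decision", "creation", "creation", "decision", "decision"),
  m_hai   = c(0.74, 0.71, 3.9, 4.1, 0.66, 0.81),   # human-AI condition means
  sd_hai  = c(0.10, 0.12, 0.8, 0.7, 0.11, 0.09),
  n_hai   = c(60, 60, 40, 40, 80, 50),
  m_base  = c(0.80, 0.78, 3.5, 3.6, 0.70, 0.79),   # baseline: max(human-alone, AI-alone)
  sd_base = c(0.09, 0.10, 0.9, 0.8, 0.10, 0.10),
  n_base  = c(60, 60, 40, 40, 80, 50)
)

# 1. Hedges' g (SMD with small-sample bias correction) and its sampling variance
dat <- escalc(measure = "SMD",
              m1i = m_hai,  sd1i = sd_hai,  n1i = n_hai,
              m2i = m_base, sd2i = sd_base, n2i = n_base,
              data = dat)

# 2. Three-level random-effects model: effect sizes nested within experiments,
#    with t-distribution (Knapp-Hartung-type) tests
res <- rma.mv(yi, vi,
              random = ~ 1 | experiment/es_id,
              test   = "t",
              data   = dat)
summary(res)

# 3. Cluster-robust variance estimation at the experiment level
robust(res, cluster = dat$experiment)

# 4. Example moderator analysis: meta-regression on task type
rma.mv(yi, vi, mods = ~ task_type,
       random = ~ 1 | experiment/es_id,
       test = "t", data = dat)
```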
Key Findings
- Overall synergy: Human-AI systems performed significantly worse than the better of human-alone or AI-alone (Hedges' g = −0.23; t(102) = −2.89; two-tailed P = 0.005; 95% CI, −0.39 to −0.07). Heterogeneity was substantial (I^2 = 97.7%).
- Human augmentation: Human-AI systems outperformed humans alone (g = 0.64; t(105) = 11.87; P < 0.001; 95% CI, 0.53 to 0.74). Heterogeneity was substantial (I^2 = 93.8%).
- Task type moderation (synergy): Significant (F(1,104) = 7.84; P = 0.006). Decision tasks showed losses (g = −0.27; t(104) = −3.20; P = 0.002; 95% CI, −0.44 to −0.10), whereas creation tasks showed a positive but nonsignificant estimate (g = 0.19; t(104) = 1.35; P = 0.180; 95% CI, −0.09 to 0.48); the difference between task types was significant.
- Data type moderation: Significant for both synergy (F(4,101) = 15.24; P < 0.001) and augmentation (F(4,101) = 6.52; P < 0.001).
- Relative human vs AI performance: A strong moderator for both synergy (F(1,104) = 81.79; P < 0.001) and augmentation (F(1,104) = 24.35; P < 0.001).
  • When human-alone > AI-alone: synergy gains (g = 0.46; t(104) = 5.06; P < 0.001; 95% CI, 0.28 to 0.66).
  • When AI-alone > human-alone: synergy losses (g = −0.54; t(104) = −6.20; P < 0.001; 95% CI, −0.71 to −0.37).
  • Augmentation was larger when AI-alone > human-alone (g = 0.74; t(104) = 13.50; P < 0.001; 95% CI, 0.63 to 0.85).
- Other moderators:
  • AI type and publication year moderated synergy (AI type: F(2,103) = 3.77; P = 0.026; year: F(3,102) = 3.65; P = 0.015).
  • Experimental design moderated augmentation (F(1,104) = 4.90; P = 0.029).
  • Not significant: presence of explanations, confidence displays, participant type and division of labour (few studies used explicit pre-delegation; across n = 4 effect sizes from 3 experiments, g = 0.22; I^2 = 0.69; P = 0.494).
- Publication bias:
  • Synergy: No evidence of bias (Egger β = −0.67; t(104) = −0.78; P = 0.438; rank correlation r = 0.05; P = 0.121).
  • Augmentation: Evidence of bias favoring positive results (Egger β = 1.96; t(104) = 3.24; P = 0.002; rank correlation r = 0.19; P < 0.001).
- Robustness: Results were consistent across alternative clustering, exclusion of outliers, leave-one-out analyses and exclusion of digitized estimates (e.g., synergy g ≈ −0.21 to −0.25; augmentation g ≈ 0.60 to 0.65 across checks).
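For readers who want to reproduce the style of publication-bias checks reported above, the snippet below shows one standard metafor workflow on the hypothetical `dat` from the Methodology sketch. It is a simplified sketch: it refits an ordinary random-effects model because metafor's regtest() and ranktest() are designed for standard (rma.uni) fits rather than the three-level model used in the paper.

```r
# Simplified publication-bias checks on the hypothetical `dat` defined earlier
res_uni <- rma(yi, vi, data = dat)

# Contour-enhanced funnel plot (shaded significance regions at 90/95/99%)
funnel(res_uni, level = c(90, 95, 99),
       shade = c("white", "gray75", "gray85"), refline = 0)

# Egger's regression test for funnel-plot asymmetry
regtest(res_uni)

# Begg and Mazumdar rank correlation test
ranktest(res_uni)
```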
Discussion
The analysis shows that, on average, human-AI combinations do not achieve synergy relative to the better of human-alone or AI-alone, addressing the question of general effectiveness with evidence of small but significant performance losses. Simultaneously, human-AI systems reliably augment humans compared to human-alone baselines, indicating that collaboration can improve human performance even if not surpassing the best single agent. The pattern depends strongly on context: decision tasks tend to suffer from human-AI integration, while creation tasks tend to benefit. A critical determinant is who performs better alone: when humans are stronger, they appear able to calibrate reliance on AI and realize gains; when AI is stronger, human involvement often reduces performance, potentially due to over- or underreliance and miscalibrated trust. Surprisingly, features such as explanations or confidence displays did not systematically improve outcomes across studies, suggesting that commonly studied interface aids may be insufficient to create synergy. These findings imply that achieving synergy likely requires rethinking process design and division of labour so each agent handles subtasks for which it is demonstrably superior, as well as improved evaluation metrics aligned with operational costs and constraints.
Conclusion
This preregistered systematic review and three-level meta-analysis synthesizes 370 effect sizes from 106 experiments, showing that human-AI systems, on average, underperform the better of human or AI alone (no synergy) but do augment humans relative to human-alone baselines. Synergy is more likely in creation tasks and when humans outperform AI, whereas decision tasks and settings where AI outperforms humans tend to produce losses when combined. The work highlights key moderators (task type, data type, relative human/AI strength, AI type, publication year) and identifies widely studied but non-influential factors (explanations, confidence displays). Future research directions include: (1) prioritizing creation tasks and generative AI scenarios with rigorous, quantitative performance comparisons including AI-alone baselines; (2) designing innovative collaboration processes with principled delegation of subtasks to the best-suited agent; (3) adopting robust, multi-criteria evaluation metrics that account for costs, rare high-impact errors, time and expense; and (4) establishing commensurability criteria and an open repository for standardized reporting of human-AI collaboration experiments to enable cumulative progress and predictive design of synergistic systems.
Limitations
- Scope constraints: Only studies reporting human-alone, AI-alone and human-AI performance were included; tasks solvable only via collaboration were excluded.
- Metric harmonization: Effect sizes were standardized across diverse outcome measures and participant samples (experts, crowdworkers, students), which may limit strict comparability; measurement error varied across studies.
- Publication bias: No evidence of bias for synergy, but evidence of bias favoring positive augmentation results may influence pooled augmentation estimates.
- External validity: Configurations studied in labs may not reflect real-world deployments; topic selection bias may exist in what researchers choose to study.
- Study quality variability: Differences in attention checks, incentives and methodological rigor introduce noise.
- High heterogeneity: Substantial unexplained variance remains; likely interaction effects between moderators and unmeasured study-specific factors could not be assessed due to limited data.