Scaffolding cooperation in human groups with deep reinforcement learning

Computer Science


K. R. McKee, A. Tacchetti, et al.

This research by Kevin R. McKee, Andrea Tacchetti, Michiel A. Bakker, Jan Balaguer, Lucy Campbell-Gillingham, Richard Everett, and Matthew Botvinick uses deep reinforcement learning to boost cooperation in human groups, lifting cooperation to 77.7% by the final round of a networked game. Discover how a 'social planner' AI can transform cooperation dynamics in network games!

Introduction
The study addresses how to structure social networks to support cooperation while mitigating the spread of defection. Cooperation and antisocial behaviors are both socially contagious, creating a challenge for social planners in both offline and online networks. Prior work commonly relies on assortative mixing—clustering cooperators with cooperators and defectors with defectors—often framed as decentralized ostracism to protect cooperators and punish defectors. Given that algorithms already shape online social graphs, the authors propose using deep reinforcement learning to learn effective network rewiring policies. The research question is whether a learned, recommendation-based social planner can enhance and sustain cooperation among humans playing a networked cooperation game, and what structural strategies it discovers (for example, whether it reproduces assortative clustering or identifies novel patterns).
Literature Review
Past studies show that dynamic partner updating and homophilic clustering can stabilize or increase cooperation by separating cooperators from defectors (e.g., Rand et al., 2011; Shirado & Christakis, 2020). Assortative mixing is seen as simultaneously protecting cooperators from antisocial contagion and punishing defectors via exclusion, with roots in human evolutionary dynamics (e.g., hunter-gatherer assortment). Algorithms routinely mediate link formation in social media, making machine learning a natural tool for network design. Prior network-engineering approaches often rely on ostracism-like mechanisms to change incentives, aiming to eliminate defectors’ payoff advantages. This paper situates itself against that background, exploring whether a learned planner discovers similar assortative policies or different, potentially more effective strategies.
Methodology
Experimental game: 16-person groups played a 15-round network cooperation game for real monetary stakes. Players occupy graph vertices; edges indicate active links. Each round, players choose to cooperate or defect. Cooperation costs c = 0.05 to the cooperator per linked neighbor and provides b = 0.1 to each neighbor (a minimal payoff sketch appears at the end of this section). A social planner observes the prior-round choices and current graph, then recommends adding or removing specific edges; players can accept or reject recommendations. Graphs can range from isolates to fully connected, with no self-loops.

Planner design and training: The social planner is a reinforcement learning agent with a graph neural network (GraphNet) that computes policy logits (rewiring recommendations) and a value estimate. The agent is trained with advantage actor-critic in simulation, optimizing a combination of group capital and recommendation quality. The GraphNet processes global, node, and edge features with message passing and multilayer perceptrons. Thirty independently initialized agents were trained for up to 5 × 10^7 simulated rounds; a high-performing policy was selected for human evaluation.

Simulated humans for training: Bots sampled an individual cooperative disposition and made cooperation decisions via logistic functions based on round number, neighborhood size, number and rate of cooperating neighbors, and their disposition. Acceptance of planner recommendations depended on recommendation valence (add vs. delete) and the referent neighbor's previous cooperation. Bot parameters were fit to behavioral data from baseline human conditions.

Human experiments and conditions: Between-participants design with participants recruited from Prolific; comprehension checks were enforced. The main evaluation compared four conditions in 16-person groups over 15 rounds: (1) GraphNet social planner (N = 208, 13 groups), (2) static network (no rewiring; N = 176, 11 groups), (3) random recommendations (randomly sample 30% of possible edges each round for addition or removal; N = 208, 13 groups), and (4) cooperative clustering (rule-based: disconnect defectors from cooperators, then connect cooperators; also randomly change 5% of edges; N = 176, 11 groups). Across these conditions, N = 768 participants in 48 groups.

Follow-up studies: An interpretable 'encouragement' planner was derived from the GraphNet's empirical policy, making recommendations as a simple function of player choices and round number (N = 224, 14 groups). Two additional planners tested density as a driver: a 'neutral' planner that reproduced GraphNet-like connectivity dynamics but ignored player choices (N = 192, 12 groups) and a 'max connectivity' planner that pushed density as high as possible (N = 208, 13 groups).

Analysis: Individual cooperation was analyzed via generalized linear mixed models with random effects for participants nested in groups. Group-level outcomes used linear models. Diagnostics suggested normality and equal-variance assumptions were acceptable. Assortativity and network structure metrics (including core–periphery structure) were computed and compared across conditions.
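To make the payoff rule above concrete, here is a minimal Python sketch of how per-round earnings could be computed under the stated costs and benefits (c = 0.05, b = 0.1). The function name and data structures are illustrative assumptions for this summary, not the authors' implementation.

```python
# Minimal sketch of the per-round payoff rule described above (illustrative,
# not the authors' code). Each cooperating player pays C for every active
# link and delivers B to the neighbour on the other end of that link.

B, C = 0.1, 0.05  # benefit to each neighbour, cost per linked neighbour

def round_payoffs(players, edges, cooperates):
    """Return each player's payoff for one round of the network game.

    players    -- iterable of player ids
    edges      -- iterable of undirected links (u, v) currently active
    cooperates -- dict mapping player id -> True if they cooperated this round
    """
    payoff = {p: 0.0 for p in players}
    for u, v in edges:
        if cooperates[u]:        # u pays the link cost and benefits neighbour v
            payoff[u] -= C
            payoff[v] += B
        if cooperates[v]:        # v pays the link cost and benefits neighbour u
            payoff[v] -= C
            payoff[u] += B
    return payoff

# Tiny example: a triangle where player 2 defects.
print(round_payoffs(
    players=[0, 1, 2],
    edges=[(0, 1), (1, 2), (0, 2)],
    cooperates={0: True, 1: True, 2: False},
))
# -> {0: 0.0, 1: 0.0, 2: 0.2}: the defector free-rides on two cooperating
# neighbours, illustrating the short-term payoff advantage of defection.
```

The example shows why cooperation tends to erode without intervention: the defector earns more than its cooperating neighbours in any single round, which is the incentive problem the social planner's rewiring recommendations are meant to counteract.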
Key Findings
- Initial cooperation averaged 69.5% across conditions. In static networks, cooperation declined markedly (GLMM coefficient = -0.24, 95% CI -0.27 to -0.20, P < 0.001), ending at 42.8% in round 15.
- Random recommendations and cooperative clustering mitigated the decline but still trended downward: random recommendations ended at 57.0% (coefficient = -0.13, 95% CI -0.16 to -0.10, P < 0.001) and cooperative clustering at 61.2% (coefficient = -0.07, 95% CI -0.10 to -0.04, P < 0.001).
- Under the GraphNet planner, cooperation increased over time (coefficient = 0.04, 95% CI 0.01 to 0.07, P = 0.007), reaching 77.7% by round 15. Direct comparisons showed significantly higher cooperation than static (z = 13.0, P < 0.001), random (z = 8.3, P < 0.001), and cooperative clustering (z = 5.4, P < 0.001).
- Inequality was minimal under GraphNet (Lorenz/Gini analyses), with higher group capital relative to baselines.
- Policy characterization: The GraphNet planner conditioned recommendations on player choices (likelihood ratio test χ2(2) = 3451.8, P < 0.001). It almost always recommended adding cooperator–cooperator links (P = 0.99, 95% credible interval 0.99–1.00) and rarely recommended deleting them (P = 0.03). It avoided adding defector–defector links (P = 0.00) and almost always recommended deleting existing defector–defector links (P = 1.00). For cooperator–defector pairs it recommended a mix of adds (P ≈ 0.58) and deletes (P ≈ 0.50), with more adds early in the game and more deletes later (the average defector received 1.4 connect-with-cooperator recommendations per round in early rounds vs. 0.9 in later rounds).
- Network structure: GraphNet induced near-zero assortativity (β = -0.06, 95% CI -0.14 to 0.02, P = 0.142), unlike cooperative clustering, which produced positive assortativity (β = 0.10, 95% CI 0.01 to 0.19, P = 0.029). GraphNet substantially increased the relative connectivity of cooperators (mean degree bias B = 6.2, 95% CI 5.3 to 7.2), exceeding the static, random, and clustering conditions. Networks exhibited significant core–periphery structure under GraphNet (β = 0.46, 95% CI 0.35 to 0.58, P < 0.001), with cooperators comprising on average 96.7% of the core and defectors 61.2% of the periphery (see the sketch after this list for how such structural measures can be computed).
- Despite defectors earning higher average payoffs than cooperators under GraphNet, cooperation was maintained, suggesting non-pecuniary drivers of behavior (e.g., norms, fairness, conformity).
- Follow-up: The simple 'encouragement' planner reproduced GraphNet's performance (GLMM coefficient 0.04, 95% CI 0.00 to 0.06, P = 0.005) and did not differ significantly from GraphNet in supporting cooperation (z = -0.3, P = 1.000), while outperforming the static, random, and cooperative clustering conditions (z = 13.4, 8.4, and 5.4; all P < 0.001). It also reproduced the low assortativity and core–periphery structures.
- Density controls: A 'neutral' planner matching GraphNet-like density but ignoring player choices degraded cooperation over time (coefficient -0.17, 95% CI -0.19 to -0.14, P < 0.001). A 'max connectivity' planner further increased density yet caused a steep cooperation decline (coefficient -0.51, 95% CI -0.55 to -0.46, P < 0.001). High density alone does not explain GraphNet's success.
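As a rough illustration of the structural measures referenced above (assortativity by choice and the cooperators' connectivity advantage), the following Python sketch computes simple versions of them on an invented six-player snapshot using networkx. The toy graph, labels, and the plain mean-degree gap are assumptions made for this summary; the paper's bias statistic B and core–periphery test may be defined differently.

```python
# Illustrative only: toy snapshot and simple stand-ins for the structural
# measures discussed above, not the paper's analysis code.
import networkx as nx

# Six players; "C" = cooperated last round, "D" = defected.
choices = {0: "C", 1: "C", 2: "C", 3: "C", 4: "D", 5: "D"}
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),  # dense cooperator core
         (0, 4), (1, 5)]                                   # defectors attached at the fringe

G = nx.Graph()
G.add_nodes_from(choices)
G.add_edges_from(edges)
nx.set_node_attributes(G, choices, "choice")

# Assortativity by last choice: values near zero mean cooperators and
# defectors are not segregated into separate clusters.
assortativity = nx.attribute_assortativity_coefficient(G, "choice")

# Simple connectivity gap: mean degree of cooperators minus mean degree of
# defectors (a stand-in for the paper's degree-bias statistic B, which may
# be defined differently).
coop_deg = [G.degree(n) for n, c in choices.items() if c == "C"]
defe_deg = [G.degree(n) for n, c in choices.items() if c == "D"]
degree_gap = sum(coop_deg) / len(coop_deg) - sum(defe_deg) / len(defe_deg)

print(f"assortativity by choice: {assortativity:.2f}")
print(f"cooperator-minus-defector mean degree: {degree_gap:.2f}")
```

In the experiments themselves, such measures were computed on the evolving 16-person graphs each round and compared across the planner conditions.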
Discussion
The learned social planner effectively sustains and enhances human cooperation by engineering network structures that differ from traditional assortative or ostracism-based approaches. Instead of segregating defectors, the planner employs a conciliatory strategy that places defectors in small, cooperative neighborhoods within a core–periphery topology. This supports pro-social contagion while limiting defection’s spread. The results indicate that short-term payoff incentives alone cannot account for behavior, as defectors often received higher payoffs without undermining cooperation, suggesting the influence of norms and conformity. The findings demonstrate that deep reinforcement learning, combined with graph neural networks, can discover interpretable, effective social interventions. Ethical and practical considerations include preserving human autonomy via recommendation-based systems and exploring participatory governance for deployment. The approach may generalize to other graph-structured social dilemmas.
Conclusion
The paper introduces a deep reinforcement learning-based social planner that significantly improves cooperation in human network games. It learns a non-assortative, encouragement-oriented policy that forms a cooperative core–periphery structure, outperforming static, random, and cooperative-clustering baselines and matching a simplified, interpretable encouragement policy. The work challenges the assumption that assortative ostracism is necessary, revealing an alternative route that fosters cooperation even when defectors earn higher short-term payoffs. Future work should explore scalability to large, real-world networks, enhance interpretability (potentially via language-model explanations), extend to other social dilemmas (e.g., resource sharing, innovation diffusion), and investigate participatory, democratic governance frameworks for such AI systems.
Limitations
- Interpretability and mechanism: Although an interpretable encouragement planner replicated performance, the original GraphNet remains a complex model; residual black-box aspects may obscure the full causal mechanisms.
- Simulation-to-human transfer: Training relied heavily on bot simulations calibrated to baseline human data. While effective here, generalization to other populations, contexts, and longer time horizons remains uncertain.
- Scope and scale: Experiments involved 16-person groups over 15 rounds. Real-world social networks are larger, multi-layered, and persistent; the approach's effectiveness and robustness at scale are untested.
- Network density confound: While follow-ups suggest density alone does not drive success, other unmeasured structural factors correlated with the learned policy could contribute.
- Behavioral drivers: The finding that defectors received higher payoffs yet cooperation persisted implies psychological mechanisms (norms, fairness, conformity) not directly modeled; these require further investigation.
- Ethical and governance concerns: Deployment in real communities raises autonomy, consent, and oversight questions; the study uses recommendation-only interactions, but broader governance frameworks are needed.