Emergent social conventions and collective bias in LLM populations


A. F. Ashery, L. M. Aiello, et al.

Experimental results show that decentralized populations of large language model agents can spontaneously develop universally adopted social conventions, produce strong collective biases even from unbiased individuals, and allow committed minorities of adversarial agents to impose alternative norms. Research conducted by Ariel Flint Ashery, Luca Maria Aiello, and Andrea Baronchelli.

Introduction
Social conventions shape social and economic life, determining how individuals behave and what they expect from one another (1–4). They can be defined as unwritten, arbitrary patterns of behavior that are collectively shared by a group. Examples range from conventional greetings like handshakes or bows to language and moral judgments (5, 6). Recent numerical (7, 8) and experimental (9) results have confirmed the hypothesis that conventions can arise spontaneously, without the intervention of any centralized institution (3, 5, 10, 11). Individual efforts to coordinate locally with one another can generate universally accepted conventions. Do universal conventions also spontaneously emerge in populations of large language models (LLMs), i.e., in groups of N simulated agents instantiated from an LLM? This question is critical for predicting and managing artificial intelligence (AI) behavior in real-world applications, given the proliferation of LLMs using natural language to interact with one another and with humans (12–14). Answering it is also a prerequisite to ensure that AI systems behave in ways aligned with human values and societal goals (15). A second key question concerns how the biases of individual LLMs influence the emergence of universal conventions, where “bias” refers to an initial statistical preference for one option over an equivalent alternative in norm formation (e.g., individuals systematically preferring one name over another in a process leading to the population settling on a single name). Because collective processes can, in general, both suppress and amplify individual traits (16, 17), answering this question is also relevant for practical applications.
While most research has focused on investigating and addressing bias in one-to-one interactions between humans and LLMs (18–20), less attention has been given to how these biases evolve through repeated communications in populations of LLM agents and, ultimately, in mixed human-LLM ecosystems (15), even though the safety of a single LLM does not necessarily imply the safety of a multi-agent system (21). Finally, a third question concerns the robustness of social conventions. Recent theoretical (22) and empirical (23) results have shown how a minority of adversarial agents can exert an outsized influence on the group, provided that they reach a threshold or “critical mass” (24–26). Investigating how conventions change through critical mass dynamics in a population of LLMs will help anticipate and potentially steer the development of beneficial norms in AI systems, while mitigating risks of harmful norms (27). It will also provide valuable models for how AI systems might play a role in shaping new societal norms to address global challenges such as antibiotic resistance (28) and the post-carbon transition (29). Here, we address these three key questions—on the spontaneous emergence of conventions, the role of individual biases, and critical mass dynamics—in populations of LLM agents. Drawing from recent laboratory experiments with human subjects (9, 23, 30), we follow the well-established practice of using coordination on a naming convention as a general model for conventional behavior (5, 7, 30–33). In this setting, agents are endowed with purely local incentives, and conventions may (or may not) emerge as an unintended consequence of individuals attempting to coordinate locally with one another.
This sets our paper apart from the growing body of literature on LLM multi-agent systems, which has made considerable progress in complex problem-solving and world simulation but has primarily focused on goal-oriented simulations where LLMs either accomplish predefined group-level tasks or approximate human behavior in structured settings (15, 34–36). Unlike studies that use LLMs to predict human responses in social science experiments (37) or to simulate human societies (38–40), our work does not treat LLMs as proxies for human participants but rather investigates how conventions emerge organically within a population of communicating AI agents as a result of their interactions (6). The emergence of conventions is a foundational element to any type of LLM multi-agent system (14, 41), including but not limited to “in silico” experiments to emulate human social networks (42). Here, we adopt a complex systems perspective (43), rather than high-fidelity simulations of human interactions (44), thereby minimizing the complexity of the experimental design to enhance the transparency of the result interpretation. Overall, our approach addresses recent calls for AI researchers to investigate how LLM agents may develop shared solutions to poorly defined social problems such as creating language, norms, and institutions—to gain insights into the formation and stability of genuine cooperative AI systems (15).
Methodology
Experimental setting and framework: The study builds on Wittgenstein's language games and on naming-game models of linguistic convention formation, in which purely local, pairwise interactions can lead to population-level consensus (1, 2, 5–9, 45). Predictions are grounded in the naming game: agents try to coordinate in pairwise interactions, accumulate a bounded memory of past plays, and use it to choose a convention in future interactions.

Population and interaction protocol: A simulation trial comprises a population of N LLM-instantiated agents. Unless specified otherwise, the default parameters are N=24 agents, a name pool of size W=10, and memory length H=5. At each time step, two agents are selected at random to interact. Each agent outputs a convention (“name”) chosen from a finite pool of W unique letters sampled from the English alphabet; the list is randomized for each player at every interaction to remove ordering bias. If the two agents' outputs match, each receives a reward; if they mismatch, each receives a penalty. No global incentive to coordinate is given. Each agent's memory contains up to H past interactions (co-player's choice, own choice, success/failure, and cumulative score). Memory is initially empty, so each agent's first output is drawn at random from the available names.

Committed minorities for norm change: For the critical mass experiments, a committed minority of adversarial agents is introduced. These agents deterministically output an alternative convention at every interaction, regardless of history, modeling norm-change interventions (22, 23). Populations are initialized in full consensus on a convention (each agent's H-slot memory contains only that convention and successes), and the adversarial convention is then introduced.

Prompting and decision extraction: Each interaction presents the LLM agent with a system prompt (game rules, payoff structure, objective), a dynamic memory prompt (context from the last H interactions), and an instruction on output format so the decision can be reliably parsed.
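The interaction protocol above can be sketched with a minimal, non-LLM stand-in: agents keep a bounded memory of the last H plays, repeat a name after success, and, as a simplifying assumption, switch after failure to the partner name seen most often in memory (the paper's LLM agents exhibit a similar win-stay pattern; the fallback rule, class names, and parameters here are illustrative, not the paper's implementation).

```python
import random
from collections import Counter, deque

class Agent:
    """Minimal stand-in for an LLM player: bounded memory of the last H plays."""
    def __init__(self, pool, memory_len=5):
        self.pool = list(pool)
        # Each entry: (own_choice, partner_choice, success)
        self.memory = deque(maxlen=memory_len)

    def choose(self):
        if not self.memory:
            return random.choice(self.pool)  # empty memory: uniform random
        own, _, success = self.memory[-1]
        if success:
            return own  # repeat after success
        # After failure, adopt the partner name seen most often (assumption).
        counts = Counter(p for _, p, _ in self.memory)
        top = max(counts.values())
        return random.choice([n for n, c in counts.items() if c == top])

class CommittedAgent(Agent):
    """Adversarial agent: always outputs its convention, regardless of history."""
    def choose(self):
        return self.pool[0]

def play_interaction(agents):
    a, b = random.sample(agents, 2)  # random pairwise matching
    na, nb = a.choose(), b.choose()
    success = na == nb               # match -> reward, mismatch -> penalty
    a.memory.append((na, nb, success))
    b.memory.append((nb, na, success))
    return success

def run_trial(n_agents=24, pool_size=10, memory_len=5, interactions=3000):
    pool = [chr(ord("A") + i) for i in range(pool_size)]
    agents = [Agent(pool, memory_len) for _ in range(n_agents)]
    return agents, [play_interaction(agents) for _ in range(interactions)]
```

Running a trial typically drives the pairwise success rate from chance (~1/W) toward near-certainty as a single name takes over the population, mirroring the symmetry breaking described above.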
Agents are positioned as external observers predicting the next move, but their recommendation dictates play. The stated objective is to maximize the agent's accumulated points, conditional on co-player behavior. Payoffs are fixed at +100 for success and −50 for failure. The prompt encourages step-by-step reasoning and explicit consideration of memory. To ensure comprehension and reduce hallucinated behavior, a meta-prompting strategy checks instruction understanding via text-comprehension queries (good performance reported; fig. S1).

Models and stochastic generation: Four homogeneous-agent populations were tested: Llama-2-70b-Chat (4-bit quantized), Llama-3-70B-Instruct, Llama-3.1-70B-Instruct, and Claude-3.5-Sonnet. The Llama-3-family models were accessed via the Hugging Face Inference API; Llama-2-70b-Chat was run locally on a single A100 GPU. All agents generated responses non-deterministically at a fixed nonzero temperature. Top-K sampling constrained next-token choices to increase the probability mass on valid names (parameters in table S4). Each agent's decision was parsed from a standardized output format enforced by the system prompt.

Measuring individual bias: Individual bias was quantified by the frequency with which each convention was chosen in agents' first interaction (empty memory) across T trials. For W=2, a two-tailed exact binomial test with null p=0.5 assessed bias; for W=10, a chi-square test against uniform expected counts (0.1T per name) assessed neutrality. The significance threshold was P<0.05.

Determining critical mass: A consensus flip is recorded when, after the committed minority is introduced, 95% of the past 3N interactions succeed (indicating convergence to the new convention). For Llama-3-70B-Instruct, the minimal minority needed to flip a weak-majority convention was found first, then the procedure was repeated for a strong-majority convention. For the other models, the critical mass is the minimum committed proportion that flips consensus within 30 population rounds.
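As a concrete illustration of the W=2 bias test, a two-tailed exact binomial test can be computed with the standard library alone (the function name and example counts below are illustrative, not figures from the paper):

```python
from math import comb

def binomial_two_tailed_p(k, n, p=0.5):
    """Two-tailed exact binomial test: sum the probabilities of every
    outcome at most as likely as the observed count k under Binomial(n, p)."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    obs = pmf[k]
    # Small tolerance guards against float round-off when comparing pmf values.
    return min(1.0, sum(q for q in pmf if q <= obs * (1 + 1e-12)))

# A perfectly balanced split is maximally consistent with the null p = 0.5,
# while an extreme split yields a small P-value.
balanced = binomial_two_tailed_p(500, 1000)  # P close to 1
extreme = binomial_two_tailed_p(0, 10)       # P = 2 / 2**10
```

For the symmetric null p=0.5 used in the paper, this "sum all outcomes no likelier than the observed one" definition coincides with doubling the smaller tail.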
Default settings were N=24 and H=5 (except Llama-3-70B-Instruct, for which N=48 and H=3 in Fig. 3).

Robustness checks and parameter sweeps: Results are reported as robust across variations in N (up to N=200), W (up to W=26), and prompt/label choices, and hold for non–fine-tuned LLMs (figs. S2–S7, Supplementary Text).
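The consensus-flip criterion can be expressed as a small predicate over the interaction log (the function name and rolling-window representation are illustrative):

```python
def consensus_flipped(outcomes, n_agents, threshold=0.95):
    """Paper's criterion: at least 95% of the most recent 3N interactions
    succeeded, taken as evidence the population converged on the new norm."""
    window = 3 * n_agents
    if len(outcomes) < window:
        return False  # not enough history to evaluate the window
    recent = outcomes[-window:]
    return sum(recent) / window >= threshold

# With N = 24 the window is 72 interactions; 69/72 ~ 0.958 passes,
# while 68/72 ~ 0.944 does not.
```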
Key Findings
- Spontaneous emergence of global conventions: Across all tested LLM populations, decentralized local coordination led to rapid population-level consensus on a single convention (symmetry breaking). With default parameters (N=24, W=10, H=5), a shared convention typically emerged by population round ~15 in all models except Llama-2-70b-Chat. Larger populations (up to N=200) and larger name pools (W up to 26) still reached consensus (fig. S2).
- Collective bias in convention selection: Although all names are a priori equivalent, the distribution of final consensus names was non-uniform and model-dependent (Fig. 2A). Order effects were ruled out by randomizing the name list order at every interaction. Even when first-round individual choices appeared unbiased (e.g., in W=2 conditions; exact binomial test P-values for the models: 0.068, 0.116, 0.757, 0.849), collective dynamics produced a consistent bias toward a specific “strong” convention (Fig. 2B), with the alternative being “weak.” In W=10 without the letter A, some models (Llama-2-70b-Chat, Claude-3.5-Sonnet) showed no significant first-round bias (chi-square P=0.100 and 0.410), whereas Llama-3/3.1-70B-Instruct exhibited skewed initial selections (fig. S5). When the full alphabet was allowed, populations systematically converged on “A” owing to a strong first-round individual preference (fig. S4).
- Microscopic origin of collective bias: In W=2 experiments (e.g., name pool {Q, M}), agents' strategies were not symmetric under relabeling of conventions once memory accrued. Example (Table 1, Llama-3.1-70B-Instruct): in the first interaction, aggregated P(M)=0.508 (unbiased; P=0.116); in the second, P(M)=0.487 (unbiased; P=0.110); by the third, P(M)=0.563, indicating a bias toward the strong convention M (P<2.2×10^-10). Behaviorally, agents almost always repeated a name after success (~99.4%) and switched after failure (~97.3%), and the probability mappings differed for mirrored memory states (e.g., P(M | {1: M, Q; 2: Q, M})=0.848 vs. P(Q | {1: Q, M; 2: M, Q})=0.451). These asymmetries, amplified through diverse memory trajectories, drove convergence toward the strong convention.
- Tipping points and critical mass: Established conventions were steady states but could be overturned by a committed minority producing an alternative convention. The critical minority size depended on the model and on whether the majority held the strong or the weak convention (Fig. 3). Strong-convention majorities required larger committed minorities to flip than weak-convention majorities. Observed critical masses ranged widely: as small as ~2% (Llama-3-70B-Instruct) and as large as ~67% (Llama-2-70b-Chat); in some Llama-3.1-70B-Instruct settings, the population spontaneously abandoned a weak convention without any committed minority. Success criterion: 95% of the past 3N interactions successful after introduction of the minority; for most models, flips were assessed within 30 population rounds.
- Robustness: Consensus emergence and collective bias effects persisted across parameter changes (N up to 200, W up to 26) and across different prompts/labels and non–fine-tuned LLMs (fig. S2, Supplementary Text).
Discussion
The study demonstrates that populations of LLM agents can autonomously generate universal social conventions through decentralized, local interactions without central coordination. Importantly, the collective coordination process itself can induce strong population-level biases in which specific conventions are favored, even when individual agents initially appear unbiased. These biases are highly model-dependent and not inferable from isolated single-agent tests, underscoring the need to evaluate group-level behavior in multi-agent LLM systems. The work also establishes the existence of tipping points: sufficiently large (model- and convention-dependent) committed minorities can overturn an entrenched convention. Stronger conventions (those favored by the collective dynamics absent prior memory) exhibit deeper and larger basins of attraction and thus require larger minorities to flip, whereas weak conventions are more easily overturned. These findings extend the multi-agent LLM literature by providing a minimal, transparent benchmark for detecting higher-order biases that arise only through interaction, and by characterizing norm-change dynamics. They suggest avenues for integrating cultural evolution and game-theoretic analyses (e.g., asymmetric payoffs, explicit collective goals), designing mechanisms to promote desired conventions and higher-order norms, and studying heterogeneous populations comprising different LLMs. The results also have implications for using LLM agents as proxies for human societies: while some qualitative similarities (emergence of norms, critical mass dynamics) mirror human behavior, LLM-specific collective biases highlight important differences that must be identified and potentially corrected when using synthetic social systems or deploying agents in social environments. 
The work motivates techniques to detect and mitigate discrepancies between LLM collective behavior and expected human norms, and to incorporate human judgment where needed.
Conclusion
This paper provides experimental evidence that decentralized populations of LLM agents (i) spontaneously converge to shared social conventions, (ii) can exhibit emergent collective biases in convention selection even without initial individual bias, and (iii) display tipping-point dynamics whereby committed minorities can overturn established conventions. Together, these results highlight the importance of evaluating and aligning group-level behaviors of LLM agents, not just single-agent performance. Future directions include: extending experiments to larger populations and richer semantic spaces; embedding interactions in realistic social networks and multi-party settings; exploring asymmetric payoff structures and explicit collective objectives; studying heterogeneous and mixed human–LLM populations; and moving from abstract letters to sensitive, real-world norms (e.g., gender and race), with careful alignment and safety considerations. Methodologically, combining multi-agent reinforcement learning and external strategic reasoning modules may help steer convention emergence toward desirable outcomes while measuring and mitigating collective biases.
Limitations
- Experimental scope and parameter dependence: Results depend on the selected LLM models, prompts, and convention labels. Although robustness checks were performed (alternative prompts/labels, non–fine-tuned models), broader generalization to other controlled settings remains to be established.
- Simplified environment: Interactions are pairwise with random matching (unstructured populations). Realistic social networks and group interactions (more than two agents) may substantially alter the dynamics.
- Limited semantic space: The primary experiments use abstract name pools (letters), often with W=2 for the bias and critical-mass analyses; extending to richer, real-world convention spaces is needed.
- Non-deterministic generation and finite horizons: Stochastic decoding (temperature, top-K) and finite observation windows (e.g., 30 population rounds for flips) may influence the measured thresholds; while these design choices mimic real deployments, thresholds could vary with decoding parameters and time windows.
- External validity to human societies: Although some qualitative parallels exist, LLM-specific collective biases caution against directly extrapolating to human norm dynamics without calibration and validation.