Comparing AI and human decision-making mechanisms in daily collaborative experiments

Transportation

L. Wang, Z. Jiang, et al.

Artificial intelligence’s potential to rival human decision-making is evaluated by comparing humans, large language models (LLMs), and reinforcement learning (RL) in a multi-day commute decision-making game with interdependent individual and collective outcomes. The study finds LLMs can learn from historical experience and reach convergence like humans, yet they struggle in multi-person collaboration due to weak perception of others’ choices, poor group decision mechanisms, and limited physical knowledge. This research was conducted by Linghao Wang, Zheyuan Jiang, Chenke Hu, Jun Zhao, Zheng Zhu, Xiqun Chen, Ziyi Wang, Tianming Liu, Guibing He, Yafeng Yin, and Der-Horng Lee.

Introduction
In contemporary society, individuals frequently make decisions within a framework of interdependence with their groups. The complexity of these decisions lies primarily in how individual choices influence group outcomes, which in turn affect future individual decisions. This analytical framework applies broadly to areas such as public goods dilemmas, stock trading, and transportation choice. Artificial intelligence (AI) is advancing swiftly toward human-level capabilities, especially in complex decision-making and inference tasks. With the advent of large language models (LLMs), artificial general intelligence (AGI) has become increasingly feasible, with potential implications for fields requiring interactive decision-making. The prospect of employing LLMs to model human social interactions and decision processes is promising, potentially enabling simulations of cooperative and competitive behaviors across diverse scenarios.
However, existing research often presumes that LLMs mirror human cognitive mechanisms, even though there is insufficient evidence that their decision-making processes align with those of humans, particularly when multiple agents must decide collaboratively. Significant limitations remain in LLM-based simulations of human decision-making. This prompts the core question: can LLMs effectively emulate or even surpass human decision-making in various social contexts, particularly in daily-life scenarios involving repeated, dynamic, and collaborative decisions among people, such as commuting route choice? This gap is most evident in collaborative and repetitive environments, which differ from the controlled settings of classical game theory studies. Prior work has highlighted challenges faced by LLMs in classical game theory (e.g., 2×2 matrix games), where LLMs struggle to consistently select optimal strategies, and broader benchmarks have been proposed to evaluate LLM performance across diverse game scenarios. Other studies extend evaluations to communication-centric strategy games (e.g., deception and negotiation in Werewolf/Avalon), where LLMs can show human-like strategic behaviors, yet these scenarios still lack grounding in real-world contexts.
To systematically explore the capabilities and boundaries of LLM-based decision-making, this study applies experimental economics tools to evaluate and compare the decisions of LLMs (GPT-3.5 and GPT-4), human participants, and reinforcement learning (RL) in an everyday multi-day, dynamic, and repetitive commute route choice game. The paradigm, validated in transportation engineering, simulates a 40-day commuting experiment on a one-way network with two origin-destination (OD) pairs and 15 users (9 for OD1, 6 for OD2), each of whom starts on a route and can switch up to three times per day. Expressways offer shorter baseline travel times but are more sensitive to demand fluctuations (high-risk, high-reward), whereas local roads are more stable with longer baseline travel times (low-risk, low-reward). Decisions alter the network state, and theory suggests evolution toward user equilibrium (UE), where no individual can reduce travel time by switching routes. The UE and system optimum (SO) states of the network are defined and used as benchmarks. By comparing LLM participants, human participants, and RL participants across independent trials, we assess system-level results, convergence, individual route-switch dynamics, and decision-making mechanisms.
The study finds that LLMs can learn from historical experiences and achieve convergence in daily commuting decisions but still have room for improvement in multi-agent collaboration: perceiving others’ choices, optimizing group decision-making mechanisms, and improving physical knowledge. These insights have practical value in urban traffic and establish a general AI decision-assistance framework for dynamic and collaborative systems, with potential extensions to personal assistance and multi-agent social simulations.
Literature Review
The paper synthesizes prior research on LLM behavior in game-theoretic contexts and transportation decision-making. In classical game theory (e.g., 2×2 matrix games), LLMs have difficulty consistently selecting optimal strategies, motivating larger-scale benchmarks to evaluate strategic reasoning across diverse games. In more complex, text-heavy strategy games involving communication, deception, and negotiation (e.g., Werewolf, Avalon), LLMs demonstrate human-like strategic tendencies such as trust-building and leadership. However, these experiments often lack real-world grounding and operate under static, one-shot, or highly controlled conditions. Transportation engineering offers validated experimental paradigms for repeated, day-to-day route choice behavior, showing how information and experience shape decisions, and emphasizing the concepts of user equilibrium (UE) and system optimum (SO). Prior studies demonstrate sequence dependence, risk aversion, and contrarian behaviors in route choice, and the benefits of balanced decision-maker populations for system stability. The current work leverages these insights to evaluate LLMs in a realistic, multi-agent, repeated decision environment, addressing gaps in collaborative dynamics, fairness, and physical knowledge that are not captured in simplified game-theoretic settings.
Methodology
Experimental design: A 40-day commuting route choice experiment simulates urban expressway dynamics on a one-way road network with two OD pairs. There are 15 users (OD1: 9 users choosing between Local1 and the Expressway; OD2: 6 users choosing between Local2 and the Expressway). Each day is divided into four decision intervals, and participants can switch routes up to three times via access ramps without incurring ramp travel time. Expressways have lower free-flow travel time but higher sensitivity to congestion (high-risk, high-reward), while local roads have higher free-flow times and greater stability (low-risk, low-reward).
Travel cost model (BPR function): For each route j, the travel cost is c_j(n_j) = a_j (1 + α (n_j / s_j)^β), where n_j is the number of users on the route, a_j is the free-flow travel time, s_j is the capacity, and α, β are parameters. Parameters: Local1/Local2: a = 15, α = 0.15, s = 5, β = 4; Expressway: a = 5, α = 0.075, s = 3, β = 4. These parameters make the expressway more congestion-sensitive despite its lower free-flow time. The theoretical UE and SO states of the network are defined and reported.
Participants and trial setup: The study involves 75 human participants (five trials of 15 participants each), a multi-agent RL algorithm, and two LLMs (GPT-3.5-turbo and GPT-4o), with five trials per participant type (300 agents in total across conditions). Human participants were recruited from Zhejiang University, gave consent, received a 15-minute tutorial, and then made anonymous, collaborative decisions via an online platform for approximately 45 minutes. After each interval, choices were submitted and processed server-side, with immediate feedback provided for the next interval.
LLM condition: Each LLM-controlled user is a distinct conversational instance with session memory only (no access to other sessions or long-term memory). A neutral, unbiased prompt describes the task context (network design and parameters) and provides real-time feedback after all LLMs submit their choices each interval. Fifteen new dialogues are initiated for each trial (no prior memory). The environment aggregates choices, computes network outcomes, and returns them to each LLM, which then updates its route choice, iterating across 40 days.
RL condition: The RL environment hosts 15 independent agents with state and reward structures identical to those of humans. The state includes l_t (number of users choosing the local road), e_t (number choosing the expressway), lc_t (local travel time), ec_t (expressway travel time), n_t (route chosen in the previous interval), and t (current interval). The per-interval reward r_t is the travel time of the chosen route (lc_t if local, ec_t if expressway), and the cumulative reward is the sum of per-interval travel costs. Actions are discrete choices between the local road and the expressway. Agents follow self-maximizing policies without explicit collaboration, trained with the IA2C algorithm. Agents are pretrained to establish stable strategies; during the experimental trials, network parameters are not updated online.
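To make the cost model and the RL reward structure concrete, a minimal Python sketch is given below. The BPR formula and parameter values are those stated above; the State container, the example volume split, and the reward helper are illustrative assumptions. How expressway volume is aggregated across the two OD pairs is not specified here, so the example treats one OD pair in isolation.

```python
from dataclasses import dataclass

@dataclass
class Route:
    """BPR (Bureau of Public Roads) parameters for one route."""
    name: str
    free_flow: float   # a_j: free-flow travel time
    alpha: float       # congestion coefficient
    capacity: float    # s_j
    beta: float = 4.0  # congestion exponent

    def travel_cost(self, n_users: int) -> float:
        """c_j(n_j) = a_j * (1 + alpha * (n_j / s_j) ** beta)."""
        return self.free_flow * (1.0 + self.alpha * (n_users / self.capacity) ** self.beta)

@dataclass
class State:
    """Observation described for the RL agents (illustrative encoding)."""
    l_t: int     # number of users currently on the local road
    e_t: int     # number of users currently on the expressway
    lc_t: float  # local road travel time
    ec_t: float  # expressway travel time
    n_t: int     # route chosen in the previous interval (0 = local, 1 = expressway)
    t: int       # current interval index

# Parameters reported in the Methodology
local1 = Route("Local1", free_flow=15, alpha=0.15, capacity=5)
expressway = Route("Expressway", free_flow=5, alpha=0.075, capacity=3)

# Illustrative split of OD1's 9 users: 5 on the local road, 4 on the expressway.
local_time = local1.travel_cost(5)        # 15 * (1 + 0.15 * (5/5)**4) = 17.25
express_time = expressway.travel_cost(4)  # 5 * (1 + 0.075 * (4/3)**4) ≈ 6.19

# Per-interval reward as described above: the travel time of the chosen route,
# i.e., a cost the agent learns to minimize (an actual implementation might negate it).
def reward(chose_expressway: bool, lc_t: float, ec_t: float) -> float:
    return ec_t if chose_expressway else lc_t

print(local_time, express_time, reward(True, local_time, express_time))
```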
Decision-maker type classification: To analyze multi-day decision mechanisms, the study models route-switching behaviors using conditional probabilities under different network states, grounded in win-stay/lose-shift and cognitive hierarchy ideas. Four user types are defined: Naive (always move toward shorter-time routes), Strategic (anticipate naive behavior and avoid expected congestion), Exploratory (switch frequently regardless of prior conditions), and Status quo (stick to the chosen route regardless of conditions). Probabilities are computed based on whether the previously chosen route had a shorter or longer travel time than the unchosen route, and classification uses two-dimensional vectors to encode these tendencies (e.g., (1,1) Naive; (0,0) Strategic; (1,0) Exploratory; (0,1) Status quo); a brief illustrative sketch appears after this subsection.
Evaluation metrics and statistical analysis: System-level results include total and individual travel costs, convergence trends, and fairness. Convergence toward UE and SO is assessed using Sørensen-Dice coefficients (UESDC and SOSDC), where values closer to 1 indicate greater similarity to the theoretical benchmarks. Individual route-switch dynamics are measured via average daily switching frequency and correlations between switching and travel cost. Risk preferences are inferred from deviations of road volumes from UE on the expressway versus the local roads. Statistical tests include ANOVA (overall travel cost, SDC indices, switching frequency, user counts on roads), Kruskal-Wallis (individual travel costs), and Mann-Whitney (OD pair travel costs). Significance is set at p < 0.05.
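To illustrate the decision-maker classification, a minimal Python sketch is given below. The vector-to-type mapping follows the description above; interpreting the two dimensions as "switch after a worse outcome" and "stay after a better outcome" tendencies, and the 0.5 threshold, are assumptions made for illustration rather than the authors' exact procedure.

```python
def classify_user(p_switch_when_worse: float, p_stay_when_better: float,
                  threshold: float = 0.5) -> str:
    """Map conditional route-switching tendencies to a decision-maker type.

    p_switch_when_worse: probability of switching after the chosen route
        turned out slower than the alternative ("lose-shift" tendency).
    p_stay_when_better: probability of staying after the chosen route
        turned out faster than the alternative ("win-stay" tendency).
    Vector encoding as stated in the paper: (1,1) Naive, (0,0) Strategic,
    (1,0) Exploratory, (0,1) Status quo. Dimension semantics and the 0.5
    threshold are illustrative assumptions.
    """
    shift = p_switch_when_worse >= threshold
    stay = p_stay_when_better >= threshold
    if shift and stay:
        return "Naive"        # always moves toward the shorter route
    if not shift and not stay:
        return "Strategic"    # anticipates the crowd and avoids expected congestion
    if shift and not stay:
        return "Exploratory"  # switches frequently regardless of outcome
    return "Status quo"       # sticks to the chosen route regardless of outcome

# Example: a user who switches 80% of the time after a worse outcome and stays
# 90% of the time after a better outcome is classified as Naive.
print(classify_user(0.8, 0.9))
```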
Key Findings
System overall results: ANOVA showed a significant main effect of condition (F(3, 3196) = 38.58, p < 0.0001). Total travel time: RL (M = 245.75, SD = 49.94) and Humans (M = 270.43, SD = 107.34) did not differ significantly (p = 0.2410). GPT-3.5 (M = 369.85, SD = 412.14) and GPT-4 (M = 339.15, SD = 307.59) incurred significantly higher travel times than Humans (p < 0.0001), indicating inferior traffic efficiency for LLMs compared with Humans and RL.
Individual travel costs: Kruskal-Wallis tests show Humans (M = 72.11, SD = 5.01) had higher travel costs than RL (M = 65.53, SD = 4.89; p < 0.0001), while GPT-3.5 (M = 98.63, SD = 23.48) and GPT-4 (M = 93.58, SD = 18.82) were significantly higher than Humans (p < 0.001). RL trials exhibited the most concentrated (lowest-variance) cost distributions, followed by Humans; GPT-3.5/4 showed greater variability, implying reduced fairness.
Convergence trends: Across all trials, total travel costs decreased over time, indicating that Humans, RL, and LLMs all learned from historical experience. Humans started with lower initial costs and required a smaller learning space, attributed to stronger prior knowledge.
OD pair differences: For Humans, GPT-3.5, and GPT-4, OD1 costs were significantly higher than OD2 costs (Table S1; Figure S1). RL showed no significant OD1-OD2 difference, suggesting better flow allocation and fairness than the LLMs.
System converging progress: UESDC (last 40 intervals): Humans (M = 0.973, SD = 0.034) and RL (M = 0.971, SD = 0.035) showed no significant difference (p = 0.996); Humans significantly exceeded GPT-3.5 (M = 0.900, SD = 0.065; p < 0.0001) and GPT-4 (M = 0.925, SD = 0.090; p < 0.0001). SOSDC: Humans (M = 0.961, SD = 0.037) and RL (M = 0.962, SD = 0.036) did not differ (p = 0.997); Humans were significantly better than GPT-3.5 (M = 0.945, SD = 0.037; p < 0.0001) and GPT-4 (M = 0.931, SD = 0.070; p < 0.0001). UE was not fully attained by any group, consistent with prior route-choice experiments.
Route-switch dynamics: Average daily switching frequency: RL (M = 0.71, SD = 0.27) and Humans (M = 0.78, SD = 0.29) showed no significant difference (p = 0.156). GPT-4 switched more frequently than Humans (M = 1.24, SD = 0.52; p < 0.0001), while GPT-3.5 switched far less (M = 0.17, SD = 0.15). Correlations between switching and travel cost: Human R^2 = 0.32 (p < 0.0001), RL R^2 = 0.15 (p = 0.0014), GPT-4 R^2 = 0.57 (p < 0.0001); GPT-3.5 showed no significant relationship (R^2 = 0.01, p = 0.38). The slopes indicate that RL is the most stable, Humans are moderately stable, and GPT-4 is unstable; GPT-3.5 exhibits a decision framework divergent from that of Humans.
Risk preferences and allocation: Expressway volumes were consistently below UE, while local-road volumes were above UE (except GPT-3.5 on Local2), indicating risk aversion. LLMs were more risk-averse than Humans, with higher variance in route choice and a preference for reliability under uncertainty.
Decision-making types and transitions: Initial proportions: Humans: Status quo 40.0%, Naive 22.7%, Strategic 17.3%, Exploratory 20.0%; RL: higher Status quo (50.7%); GPT-3.5: Status quo 89.3%, Naive 4.0%, Strategic 5.3%, Exploratory 1.4% (overly static, suboptimal); GPT-4: Status quo 22.6%, Naive 38.7%, Exploratory 14.7% (excess volatility). Over time, Humans shifted toward Status quo (40.0%→64.0%), with decreases in Naive (22.7%→12.0%) and Exploratory (20.0%→5.3%), reflecting experiential learning of equilibrium concepts. GPT-3.5 changed little (Status quo 89.3%→93.3%; Naive 4.0%→1.3%). GPT-4 showed increases in Status quo (about 22.6%→32.0%) and Exploratory (14.7%→26.7%), indicating failure to learn satisfying strategies.
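For reference, the UESDC and SOSDC values above measure how closely observed route volumes match the UE or SO benchmarks. Below is a minimal sketch of one plausible Sørensen-Dice computation over per-route user counts; the multiset (min-overlap) form and the example volumes are assumptions, since the exact formulation is not reproduced here.

```python
def dice_coefficient(observed: dict[str, int], benchmark: dict[str, int]) -> float:
    """Sørensen-Dice similarity between observed and benchmark route volumes.

    Uses the multiset form 2 * sum(min) / (sum(observed) + sum(benchmark));
    a value of 1.0 means the observed flow pattern matches the benchmark exactly.
    """
    overlap = sum(min(observed.get(r, 0), benchmark.get(r, 0))
                  for r in set(observed) | set(benchmark))
    total = sum(observed.values()) + sum(benchmark.values())
    return 2 * overlap / total if total else 1.0

# Hypothetical example: observed volumes versus a user-equilibrium benchmark.
ue = {"Local1": 5, "Expressway_OD1": 4, "Local2": 3, "Expressway_OD2": 3}
observed = {"Local1": 6, "Expressway_OD1": 3, "Local2": 3, "Expressway_OD2": 3}
print(round(dice_coefficient(observed, ue), 3))  # 0.933
```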
Overall, RL aligned closely with human benchmarks across efficiency, convergence, stability, and population composition; LLMs learned but underperformed and exhibited fairness and collaboration deficits.
Discussion
The findings address whether LLMs can emulate or surpass human decision-making in repeated, collaborative daily scenarios. While LLMs demonstrate human-like learning from historical experience and system-level convergence, they underperform Humans and RL across efficiency, fairness, and stability. Three core factors explain these gaps: (1) Weak collaborative perception and adaptation — LLMs insufficiently predict or react to others’ choices, leading to early congestion and persistent suboptimal strategies in multi-agent settings. (2) Imbalanced decision-type structures — GPT-3.5 is overly dominated by Status quo types (stable but suboptimal), whereas GPT-4 over-relies on Naive decisions (volatile and unstable), reducing the beneficial diversity that stabilizes systems. (3) Limited physical world knowledge — Unlike humans with innate and experiential priors, LLMs rely solely on textual inputs and session memory, lacking multimodal grounding; this slows early-stage convergence and increases risk aversion. These results imply that for dynamic, collaborative systems like transportation networks, LLMs need enriched mechanisms for modeling others’ behaviors, balancing decision-type diversity, and integrating physical knowledge to approach human-level collaboration. RL, with states and rewards engineered from domain understanding, achieves human-like outcomes, highlighting the importance of structured perception and tailored objectives. The broader relevance spans daily decision supports (e.g., commute planning) and multi-agent social simulations, where LLMs could assist but require enhancements to handle group dynamics, fairness, and real-world uncertainty.
Conclusion
This study introduces an everyday, multi-day collaborative route-choice experiment to compare Humans, LLMs (GPT-3.5, GPT-4), and RL. LLMs can learn from experience and achieve convergence but lag behind Humans and RL in system efficiency, fairness, and stability, particularly in collaborative multi-agent contexts. RL closely mirrors Human benchmarks across convergence and decision dynamics. The work contributes a general evaluation framework for AI decision-making in dynamic, collaborative environments and highlights actionable directions to strengthen LLMs: improving perception of others’ choices and group dynamics, fostering balanced decision-type diversity, and integrating physical/world knowledge. Practical applications include personal decision assistance (e.g., restaurant reservations, shopping optimization) and multi-agent simulations (e.g., social network information dissemination, financial market dynamics). Future research should explore prompt engineering and conversational memory management to help LLMs model group influences, predict individual behaviors, and handle long-term information, enabling more interactive, stable, and equitable human-AI decision support.
Limitations
The experiment abstracts real-world commuting and omits dynamic factors such as weather, infrastructure constraints, and real-time disruptions, limiting generalizability. The LLM scope is narrow (GPT-3.5 and GPT-4o) relative to rapidly evolving, potentially multimodal or hybrid models that may employ different decision strategies. Human decision-making is shaped by prior knowledge, habits, social norms, incentives, and competitive pressures; these external influences were not explicitly modeled or controlled. Gender effects were not analyzed or controlled, which may affect applicability across gender groups. Overall, these constraints may limit the extent to which the observed LLM behaviors generalize to real-world, heterogeneous populations and conditions.