Political Science
Performance and biases of Large Language Models in public opinion simulation
Y. Qu and J. Wang
Explore how Large Language Models like ChatGPT can reshape public opinion simulation! This research by Yao Qu and Jue Wang reveals critical insights into LLM performance disparities across demographics and countries, calling for improved representativeness and bias mitigation in public policy development.
~3 min • Beginner • English
Introduction
The study investigates whether and how Large Language Models, specifically ChatGPT, can simulate public opinion accurately and equitably across global contexts. It addresses three challenges: C1) the global applicability and reliability of LLMs beyond U.S.-centric and English-language contexts; C2) demographic biases related to gender, race/ethnicity, education, age, and social class that may be embedded in models trained on internet data; and C3) the effects of issue domain and choice complexity on simulation accuracy, particularly comparing environmental versus political topics. The purpose is to evaluate algorithmic fidelity—the alignment of model-simulated distributions with human survey responses—across countries, demographic subgroups, and themes. This work is important for informing the responsible use of LLMs in public opinion research and policy, where representativeness and bias mitigation are crucial.
Literature Review
Prior work highlights growing applications of LLMs in social science, including editing, literature reviews, persona simulation, and modeling economic or political behaviors. Argyle et al. (2023) found strong correlations between LLM-simulated responses and human samples in U.S. political contexts. Lee et al. (2023) showed LLMs can predict opinions on global warming but require richer covariates (including psychological factors). Aher et al. (2023) and Horton (2023) demonstrated persona emulation and replication of human subject experiments. Studies also document biases in LLMs stemming from training data: political lean (e.g., liberal tilt), demographic skew (overrepresenting higher-income, educated groups), and challenges in bias measurement and mitigation (Caliskan et al., 2017; Liu et al., 2022; Delobelle et al., 2021; Dillon et al., 2023; Martin, 2023; Motoki et al., 2024). This literature motivates evaluating cross-national, demographic, and thematic fidelity, as well as the impact of response-option complexity.
Methodology
Tool and data: The study uses ChatGPT (GPT-3.5 Turbo) to generate synthetic “silicon sample” responses, following Argyle et al. (2023). Human ground truth comes from World Values Survey (WVS) Wave 6 (2010–2014), with harmonized questionnaires across nearly 100 countries. Six countries were analyzed for cross-national comparisons: USA, Japan, Singapore, South Africa, Brazil, and Sweden. Timing of fieldwork varied by country (e.g., USA 2011, Japan 2010, Sweden 2011, Singapore 2012, South Africa 2013, Brazil 2014).
Target variables: V81 (environment vs economy prioritization): 1) Protect environment even at economic cost; 2) Emphasize economic growth even at environmental cost; 3) Other/no answer. V228 (political voting intent): “If a national election were tomorrow, for which party would you vote?” Country-specific party options plus uncertainty/non-vote; in U.S. analyses, responses were reduced to a Democrat vs Republican binary for comparability.
Demographics and covariates: Demographics included ethnicity (V254, country-specific), sex (V240), age (V242), education (V248), and self-identified social class (V238). Environmental covariates included: membership in environmental orgs (V30; active/inactive/none), environmental consciousness (V78; 1–6), donations to ecological orgs in past two years (V82; yes/no), participation in environmental demonstrations (V83; yes/no), and confidence in environmental orgs (V122; 1–4). Political covariate: ideology (V95; 1–10 left–right scale).
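For reference, the variable-to-attribute mapping above can be kept as a simple lookup table. The Python sketch below is illustrative only: the dictionary names and descriptions are ours, not the authors' code, though the WVS variable codes match those listed here.

```python
# Illustrative lookup of the WVS Wave 6 variables used in this study (names and wording are ours).
WVS_DEMOGRAPHICS = {
    "V254": "ethnicity (country-specific codes)",
    "V240": "sex",
    "V242": "age in years",
    "V248": "highest education level attained",
    "V238": "self-identified social class",
}

WVS_ENVIRONMENTAL_COVARIATES = {
    "V30":  "membership in environmental organizations (active/inactive/none)",
    "V78":  "environmental consciousness (1-6)",
    "V82":  "donated to an ecological organization in the past two years (yes/no)",
    "V83":  "participated in an environmental demonstration (yes/no)",
    "V122": "confidence in environmental organizations (1-4)",
}

WVS_POLITICAL_COVARIATE = {
    "V95": "political ideology (1 = left ... 10 = right)",
}
```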
Model settings and prompting: GPT-3.5 Turbo was used for efficiency and response consistency, with the API temperature set to 0.2 to reduce variance in discrete-option tasks. Prompts were interview-style, constructing a descriptive profile from WVS-coded attributes (e.g., “You are male,” “You are 47 years old”) and then asking the target question with enumerated options. For non-English countries (Sweden, Brazil, Japan), prompts used the native languages sourced from WVS questionnaires, and responses were requested in the same language. For each respondent sample, 100 simulations were run to account for model variability. Prompts requested a reasoning chain before the model selected a numeric option, and the chosen option number was recorded.
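A minimal sketch of this simulation loop, assuming the OpenAI Python client (v1.x): the persona wording, helper names, and answer parsing are our assumptions rather than the authors' code, while the model (gpt-3.5-turbo), temperature (0.2), reasoning-then-number instruction, and 100 repetitions follow the description above.

```python
# Hypothetical sketch of the prompting/simulation loop; not the authors' implementation.
import re
from openai import OpenAI  # OpenAI Python client v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(persona_sentences: list[str], question: str, options: list[str]) -> str:
    """Interview-style prompt: persona description, the target question, enumerated options."""
    enumerated = "\n".join(f"{i + 1}) {opt}" for i, opt in enumerate(options))
    return (
        " ".join(persona_sentences)
        + f"\n\n{question}\n{enumerated}\n"
        + "Briefly explain your reasoning, then answer with the option number."
    )


def simulate(persona_sentences: list[str], question: str, options: list[str], runs: int = 100) -> list[int]:
    """Run repeated low-temperature completions and record the chosen option number each time."""
    prompt = build_prompt(persona_sentences, question, options)
    choices = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0.2,  # low temperature to reduce variance on discrete-option tasks
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content
        digits = re.findall(r"\b[1-9]\b", text)  # take the last standalone digit as the chosen option
        if digits:
            choices.append(int(digits[-1]))
    return choices


# Hypothetical usage with a V81-style question:
# simulate(["You are male.", "You are 47 years old."],
#          "Which statement comes closer to your own point of view?",
#          ["Protecting the environment should be given priority, even at the cost of economic growth",
#           "Economic growth should be given priority, even if the environment suffers"])
```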
Comparative design and bias framing: Algorithmic fidelity was defined as the degree to which simulated response distributions match WVS distributions. Bias was operationalized as systematic deviations in agreement across countries, demographics, and topics. Cross-national codes categorized culture (Western vs non-Western), economy (Developed vs Developing), and language (English vs non-English). An additional experiment tested country-context sensitivity by altering the country in prompts and compared U.S. Wave 6 vs Wave 7 to examine temporal stability (results in supplementary materials).
Analysis: The primary agreement metric was Cohen’s Kappa (chance-corrected), with Cramer’s V and Proportion Agreement as supporting metrics. Agreement was computed across the 100 simulations per respondent sample and averaged to form an overall agreement score for each prompt/group. Pearson correlations linked agreement metrics to country-level binary indicators (culture, economy, language). Political analyses sometimes reduced the number of response categories (e.g., to a two-party choice) to isolate the effects of choice complexity.
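A minimal sketch of how these three agreement metrics and the country-level correlation could be computed with off-the-shelf libraries; the helper names are ours, and the authors' exact implementation is not specified.

```python
# Hedged sketch of the agreement metrics; function names and structure are ours.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr
from sklearn.metrics import cohen_kappa_score


def agreement_metrics(human: np.ndarray, simulated: np.ndarray) -> dict:
    """Cohen's Kappa, Cramer's V, and Proportion Agreement between matched response vectors."""
    kappa = cohen_kappa_score(human, simulated)               # chance-corrected agreement
    table = pd.crosstab(human, simulated).to_numpy()          # contingency table of responses
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # association strength, 0..1
    proportion_agreement = float(np.mean(human == simulated)) # raw exact-match share
    return {"kappa": kappa, "cramers_v": cramers_v, "proportion_agreement": proportion_agreement}


def country_correlation(mean_kappas: list[float], binary_indicator: list[int]) -> float:
    """Pearson correlation between country-level accuracy and a binary attribute
    (e.g., Western = 1, non-Western = 0)."""
    r, _ = pearsonr(mean_kappas, binary_indicator)
    return r
```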
Key Findings
- Cross-country accuracy: Mean Cohen’s Kappa from 100 simulations indicates substantial variation: USA 0.239 (highest), Sweden 0.185, Singapore 0.053, Brazil 0.036, Japan 0.024, South Africa 0.006 (lowest). Performance is strongest in Western, English-speaking, developed contexts, especially the U.S.
- Socio-structural correlates: Pearson correlations between country attributes and simulation accuracy (Cohen’s Kappa) show culture is the strongest correlate (0.971), economy moderate (0.557), and language weaker (0.101). Similar patterns occur for Cramer’s V (culture 0.942; economy 0.411; language 0.068) and Proportion Agreement (culture 0.789; economy 0.214; language 0.465).
- Demographic representation in the U.S. (environmental question): Higher alignment for males vs females; stronger correspondence for White and “Other” categories relative to Black and multiracial groups; alignment increases with age (older groups higher); upper and middle classes align more than working/lower classes; respondents with university education show higher fidelity. These patterns indicate skew toward perspectives of higher SES, older, male, and majority-ethnicity groups.
- Topic comparisons (U.S.): Political simulations outperform environmental ones. With covariates: political Kappa 0.306 (Cramer’s V 0.324; 65.92% agreement) vs environmental Kappa 0.270 (Cramer’s V 0.274; 66.83% agreement). Without covariates: political Kappa 0.145 (Cramer’s V 0.263; 54.71%) vs environmental Kappa 0.000 (38.07%). This suggests environmental decision-making is harder to simulate from demographics and basic covariates than political choices.
- Ideological tendencies: Relative to survey data, environmental simulations show a 6.10% decrease in liberal choices (a conservative tilt), while political simulations show a 16.33% increase in liberal choices (a liberal tilt), indicating topic-dependent ideological biases.
- Choice complexity: For political simulations, two-option setups yield much higher fidelity than four options: Kappa 0.306 vs 0.109; Cramer’s V 0.324 vs 0.157; Proportion Agreement 65.92% vs 29.21%. Increasing response options reduces alignment with target distributions.
- Additional checks: Altering the country stated in prompts affected outputs (agreement was low when U.S. respondent data were paired with prompts indicating Japan), and comparing U.S. Wave 6 with Wave 7 produced similar accuracy, suggesting some temporal stability (details in supplements).
Discussion
The findings address the research questions by demonstrating that ChatGPT’s public opinion simulations are uneven across geographies, demographics, and topics. Higher accuracy in Western, developed, English-speaking contexts—especially the U.S.—reflects likely training data imbalances, with cultural factors most predictive of accuracy. Within the U.S., simulations better reflect perspectives of males, Whites/Other, older individuals, higher social classes, and university-educated respondents, indicating demographic skew in representativeness. Topic analyses show political behaviors are more predictable from demographics and limited covariates than environmental decisions, and that the model’s ideological lean depends on the domain (liberal in politics, relatively conservative in environmental framing). Choice complexity negatively impacts fidelity, underscoring design sensitivities in discrete-option simulations. Together, these results suggest LLM-based opinion simulations can complement traditional methods but require careful bias assessment, expanded and diversified training data, and thoughtful task design to avoid reinforcing inequities and misrepresentations in policy-relevant applications.
Conclusion
The study demonstrates that LLMs like ChatGPT can simulate aspects of public opinion but exhibit significant disparities in global applicability and demographic representativeness. Accuracy is highest in Western, developed, English-speaking contexts, with clear demographic skews within the U.S. Political simulations are more accurate than environmental ones, ideological bias varies by topic, and greater choice complexity reduces fidelity. To use LLM simulations responsibly in public policy and management, models require more diverse cultural, linguistic, and socio-economic training data, richer covariates (including psychological and social factors), and methods to detect and mitigate bias. Future research should expand country coverage, compare multiple LLMs, incorporate broader covariate sets, explore thematic and question-type effects, and develop methodologies for bias reduction and validation over time.
Limitations
- Temporal scope: Performance beyond the model’s training cutoff was not assessed, because WVS waves occur at roughly five-year intervals and no post-2021 data aligned with GPT-3.5 Turbo’s cutoff were available, limiting evaluation of adaptability to recent shifts in opinion.
- Single-model focus: Analyses center on GPT-3.5 Turbo; results may not generalize across models with different architectures or training corpora.
- Limited covariates: Covariate sets were sparse, especially for political simulations, where only ideology was included, constraining fidelity assessments; covariate selection was restricted to items consistently available across all six countries to maintain comparability.