Introduction
Public opinion significantly influences policy decisions, particularly in democracies. Traditional methods of gathering it, such as surveys and interviews, face challenges like low response rates and potential biases. The emergence of LLMs like ChatGPT offers a potential alternative, enabling rapid responses to large numbers of questions and analysis of extensive text data. While LLMs show promise in various social science applications, including simulating expert responses and replicating human-subject studies, their use in public opinion analysis requires careful consideration. This research identified three key challenges: 1) the global applicability and reliability of LLMs, given the predominance of US-centric data in existing studies; 2) demographic biases within LLMs stemming from their training data; and 3) complexity and choice variability in LLM simulations, i.e., whether the models can replicate complex decision-making across diverse topics. This study addresses these challenges by investigating how cultural, linguistic, and economic differences affect LLM simulation accuracy, analyzing the implications of demographic biases, and assessing LLM performance across thematic areas (environmental versus political) with varying choice complexity.
Literature Review
Existing research highlights the potential of LLMs in public opinion analysis. Argyle et al. (2023) demonstrated ChatGPT's ability to reflect the responses of distinct human subgroups in presidential election contexts, showing a correlation between human and LLM responses. Lee et al. (2023) found that LLMs can predict public opinion on global warming but emphasized the need to incorporate a broader range of variables. Other studies have explored LLMs' capacity to emulate human subjects and simulate consumer behavior. These studies, however, focused primarily on US data and English-language models, raising concerns about global applicability and about demographic biases in LLM outputs, particularly a tendency to favor liberal and privileged viewpoints. This study builds on that literature by explicitly investigating the global applicability, demographic biases, and thematic biases of LLMs in simulating public opinion.
Methodology
This study used ChatGPT (the GPT-3.5 Turbo model) together with socio-demographic data from the World Values Survey (WVS) Wave Six (2010-2014). The WVS provides a large, globally representative sample covering nearly 100 countries and 400,000 respondents. The study focused on two target variables: prioritization of the economy versus the environment (V81) and voting behavior in political elections (V228). Key demographic variables included ethnicity, sex, age, education level, and social class; relevant covariates, such as environmental organization membership and political ideology, were also incorporated. The simulation process converted survey records into interview-style prompts for ChatGPT: each prompt combined a demographic profile with a target question and instructed ChatGPT to respond as a person with those characteristics. The OpenAI API temperature was set to 0.2 to keep outputs consistent, prompts for non-English-speaking countries were translated into the respondents' native language, and each prompt was run 100 times to account for variability in the model's responses (a minimal sketch of this pipeline follows). Data analysis used Cohen's Kappa, Cramer's V, and Proportion Agreement to assess the correspondence between simulated responses and actual survey results. The study employed a comparative design, examining variations in agreement scores across countries, demographic groups, and thematic areas to identify biases. Additional experiments compared political simulations across survey waves and altered the country specified in the prompt to evaluate the model's robustness.
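The paper's exact prompt template and harness are not reproduced here, so the following Python sketch is only a minimal illustration of the pipeline described above: a respondent's demographic profile is rendered into an interview-style prompt, sent to the gpt-3.5-turbo endpoint at temperature 0.2, and repeated 100 times, with the answers tallied. The build_prompt wording, the profile fields, and the two answer options are hypothetical stand-ins; only the model, temperature, and repetition count come from the study's description.

```python
# Minimal sketch of the simulation loop described in the methodology.
# The prompt wording, profile fields, and answer options are illustrative
# assumptions; the model name, temperature (0.2), and 100 repetitions per
# prompt follow the study's description.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(profile: dict) -> str:
    """Render a WVS-style demographic profile into an interview prompt."""
    return (
        f"You are a {profile['age']}-year-old {profile['sex']} from "
        f"{profile['country']} with {profile['education']} education, "
        f"{profile['social_class']} social class.\n"
        "Interviewer: Should protecting the environment be prioritized "
        "even if it slows economic growth, or should economic growth "
        "come first?\n"
        "Answer with exactly one word: 'environment' or 'economy'."
    )


def simulate(profile: dict, n_runs: int = 100) -> Counter:
    """Query the model n_runs times and tally the simulated answers."""
    prompt = build_prompt(profile)
    answers = Counter()
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,  # low temperature for near-deterministic output
        )
        answers[resp.choices[0].message.content.strip().lower()] += 1
    return answers


profile = {"age": 34, "sex": "female", "country": "the United States",
           "education": "university", "social_class": "middle"}
print(simulate(profile).most_common())
```

For non-English-speaking countries, the prompt string would be translated into the respondent's native language before being sent, as the methodology describes.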
Key Findings
The study revealed significant variation in ChatGPT's performance across countries. The United States showed a moderate Cohen's Kappa score (0.239), indicating comparatively good simulation accuracy, while Japan and South Africa showed very low scores (0.024 and 0.006, respectively). Analysis revealed a strong positive correlation (coefficient of 0.971) between ChatGPT's simulation accuracy and cultural background, with economic status and language playing lesser, but still significant, roles. Within the United States, demographic biases were evident: agreement scores were higher for males, white individuals, older age groups, and respondents from upper social classes or with university education, mirroring findings from previous research on biases in LLMs. Comparing environmental and political simulations in the United States showed higher accuracy for predicting political behavior, even without covariates, than for environmental decisions, suggesting greater complexity in the latter. ChatGPT exhibited a conservative bias in environmental simulations and a liberal bias in political simulations. Increasing the number of response options decreased simulation accuracy, highlighting the impact of choice complexity. Further experiments that reused US respondents' prompts but specified Japan as the location produced low agreement scores, while parallel simulations using US data from WVS Wave 6 and Wave 7 showed consistent accuracy.
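For context on the scores above, the three agreement measures named in the methodology can be computed with standard routines. Below is a minimal sketch using invented toy vectors of observed and simulated answers (not the study's data). Cohen's Kappa corrects raw agreement for chance, kappa = (p_o - p_e) / (1 - p_e), which is why it can be low even when raw proportion agreement looks high.

```python
# Computing the three agreement measures named in the methodology.
# The observed/simulated vectors below are invented toy data, not the
# study's results.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

observed = np.array(["economy", "environment", "economy",
                     "economy", "environment", "economy"])
simulated = np.array(["economy", "economy", "economy",
                      "economy", "environment", "environment"])

# Cohen's Kappa: chance-corrected agreement, (p_o - p_e) / (1 - p_e).
kappa = cohen_kappa_score(observed, simulated)

# Proportion Agreement: raw fraction of matching answers.
prop_agree = float(np.mean(observed == simulated))

# Cramer's V: association strength derived from the chi-squared statistic,
# V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
table = pd.crosstab(observed, simulated)
chi2 = chi2_contingency(table)[0]
n = table.to_numpy().sum()
cramers_v = float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

print(f"kappa={kappa:.3f}  agreement={prop_agree:.3f}  V={cramers_v:.3f}")
```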
Discussion
The findings underscore the potential of LLMs in public opinion analysis, particularly in specific contexts (e.g., the United States), but also highlight significant limitations regarding global applicability, demographic representation, and thematic biases. The strong correlation between simulation accuracy and cultural background emphasizes the limitations of LLMs trained predominantly on data from Western, developed, English-speaking countries. Demographic biases mirror those prevalent in human societies, raising concerns about equitable AI development and the risk of amplifying existing inequalities. The differences in accuracy between political and environmental simulations, along with the identified ideological biases, highlight the need for cautious interpretation of LLM-generated outputs. The impact of choice complexity emphasizes the need for careful consideration in LLM-based simulation design.
Conclusion
This study demonstrates the potential but also the limitations of LLMs like ChatGPT in simulating public opinion. While showing promise in specific contexts, the model exhibits significant biases related to geography, demographics, and thematic areas. To improve the equitable and reliable use of LLMs in public policy research, future work must focus on diversifying training datasets, addressing biases, and conducting thorough analyses across various LLMs and a broader range of countries and questions. Ethical considerations regarding data privacy and the responsible use of LLM outputs in public discourse are paramount.
Limitations
The study acknowledges several limitations: 1) The analysis focused on a single LLM (ChatGPT's GPT-3.5 Turbo) and a limited temporal scope, preventing comprehensive evaluation across different models and time periods; 2) The covariate analysis was limited by data availability; 3) The analysis included only six countries, limiting the generalizability of the findings. Future research should address these limitations by incorporating more LLMs, broader temporal analysis, richer covariate sets, and more extensive global representation.