
Gender stereotypes in artificial intelligence within the accounting profession using large language models

K. Leong and A. Sung

Discover the intriguing findings of Kelvin Leong and Anna Sung as they explore how artificial intelligence reinforces gender stereotypes in accounting. Their research reveals significant biases in job title gender associations and highlights important economic implications. Join them in raising awareness to combat these biases!

Introduction
The study examines whether AI, specifically large language models (LLMs), perpetuates gender stereotypes within the accounting profession and, if so, how. It situates the research within longstanding evidence that occupations are gender-stereotyped and that such stereotypes influence job segregation, wage gaps, and career progression. Prior work in accounting shows mixed perceptions of the field as male-dominated versus female-majority, with cultural and educational factors shaping stereotypes. The authors focus on how LLMs label accounting job titles by gender to reveal implicit associations and potential bias. The rationale is that job titles carry identity signals and LLMs, trained on broad web corpora embedding societal biases, form foundational layers for many AI systems; thus, understanding their behavior is important for AI ethics and occupational equity. The study aims to provide domain-specific evidence for accounting, informing gender studies, AI ethics, and workplace inclusivity.
Literature Review
The paper reviews evidence that LLMs exhibit gender bias and stereotypes, diverging from labor statistics and showing discriminatory patterns in rankings, narrative generation, and even simple tasks. Comparative studies show variation across models (e.g., ChatGPT vs. Alpaca; ChatGPT vs. Ernie), with tendencies toward implicit or explicit gender bias and notable gender and racial disparities in AI-generated content. Causes are traced to biased training data, societal stereotypes embedded in language, model architectures, and design processes lacking diversity. Attempts to eliminate bias are challenging because models learn from biased internet-scale text. Impacts include disadvantages in hiring, lending, education, and consumer applications with limited accountability. The authors argue that general LLM bias findings cannot be directly applied to accounting due to domain-specific terminology and evolving professional language, warranting a focused investigation in accounting.
Methodology
- Design: A "Toy Choice"-style experiment in which LLMs select gender labels for accounting job titles.
- Models: Three zero-shot classification LLMs from Hugging Face, selected on 8 Dec 2023 as the most-downloaded for zero-shot classification: facebook/bart-large-mnli (Model F, 2.63M downloads), MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli (Model M, 6.29M downloads), and alexandrainst/scandi-nli-large (Model A, 1.23M downloads). For analysis they were anonymized as Model 1, Model 2, and Model 3.
- Data: 53 accounting job titles from ACCA's official Role Explorer (accessed 8 Dec 2023).
- Task: Each model classified each title into one of four labels: female; male; other (including non-binary/gender non-conforming); unknown (gender not specified). A minimal sketch of this classification step appears after this list.
- Procedure: Experiments were run in Google Colab on two different days (8 and 9 Dec 2023) and on two different computers; the second run also reordered the labels to test for order effects. Results were identical across runs, indicating consistency and no order effect.
- Analysis: Counts per category per model were tabulated into a contingency table, and a Chi-squared test of independence was conducted with Python SciPy (stats.chi2_contingency) to assess distribution differences among models. The "unknown" category was excluded because it had zero counts across all models (avoiding invalid cells and unnecessary inflation of degrees of freedom). The test reported the chi-squared statistic, p-value, degrees of freedom, expected frequencies, and Cramer's V for effect size.
- Cross-model consistency: The authors identified job titles labeled with the same gender by all three models, forming Female Group 1 and Male Group 2.
- Salary analysis: For these consistent groups, salary ranges (minimum, maximum, average) were summarized, and an independent-samples t-test (SciPy) compared average salaries between Female Group 1 (n=6) and Male Group 2 (n=4).
- Architectural choices: No intermediate layers or additional adjustments were introduced, given zero-shot models' capability and the risk of overfitting with extra layers; future work may explore alternative architectures.
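A minimal sketch of the classification step, assuming the Hugging Face transformers zero-shot pipeline with one of the checkpoints named above; the label phrasing and the handful of job titles shown are illustrative placeholders, not the authors' exact strings or full ACCA list.

```python
# Sketch of the labelling step: zero-shot classification of accounting job titles
# into gender labels. Assumes the transformers zero-shot pipeline; label wording
# and titles are illustrative, not necessarily the authors' exact inputs.
from transformers import pipeline

# One of the three checkpoints named in the methodology.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = [
    "female",
    "male",
    "other (including non-binary/gender non-conforming)",
    "unknown (gender not specified)",
]

# Placeholder subset of the 53 ACCA job titles from the Role Explorer.
job_titles = ["Chief Financial Officer", "Assistant Accountant", "Audit Manager"]

for title in job_titles:
    result = classifier(title, candidate_labels)
    # 'labels' is returned sorted by score; the first entry is the chosen label.
    print(title, "->", result["labels"][0])
```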
Key Findings
- Label distributions (53 titles):
  - Model 1: Female 16, Male 36, Other 1, Unknown 0 (male-leaning).
  - Model 2: Female 43, Male 10, Other 0, Unknown 0 (female-leaning).
  - Model 3: Female 22, Male 22, Other 9, Unknown 0 (balanced, with some non-binary/other assignments).
- Statistical test: Chi-squared = 44.4301, df = 4, p = 5.22e-09; Cramer's V = 0.3055, indicating a significant and moderate-to-strong association between model and assigned labels (a sketch reproducing this test from the counts above follows this list).
- Cross-model consistency: 10 of 53 titles were labeled consistently across all three models:
  - Female Group 1 (6 titles): Financial Analyst, Finance Analyst, Internal Audit Manager, Audit Manager, Assistant Management Accountant, Assistant Accountant.
  - Male Group 2 (4 titles): Financial Accountant, Head of Finance, Chief Financial Officer, Senior Internal Auditor.
- Salary analysis: Titles in Male Group 2 had higher salaries; the Group 2 average was 1.74× that of Group 1.
  - Group 1 average: £46,458.33 (n=6).
  - Group 2 average: £80,812.50 (n=4).
  - Independent-samples t-test: t ≈ −1.056×10^16, p ≈ 1.45×10^−79 (significant difference at α=0.05).
- Interpretation: The choice of model materially affects gender-labeling outcomes; titles labeled male tended to align with higher-seniority, higher-paid roles, whereas female-labeled titles skewed toward entry- or mid-level and operational roles. Model 3's use of "other" suggests some sensitivity to gender-neutral or ambiguous labeling.
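The reported chi-squared result can be reproduced from the published counts with SciPy, as the methodology describes. A minimal sketch follows; note that Cramer's V is computed here with the common min(rows, cols) − 1 convention, which yields a somewhat larger value than the reported 0.3055 (the paper may use a different convention), and the salary t-test is omitted because the per-title salary figures are not listed here.

```python
# Sketch of the chi-squared test of independence on the reported label counts,
# with the "unknown" column excluded as in the paper.
import numpy as np
from scipy import stats

# Rows: Model 1, Model 2, Model 3; columns: female, male, other.
observed = np.array([
    [16, 36, 1],
    [43, 10, 0],
    [22, 22, 9],
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Cramer's V with the min(rows, cols) - 1 convention; other conventions give
# slightly different values (the paper reports 0.3055).
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(f"chi2 = {chi2:.4f}, dof = {dof}, p = {p:.2e}")  # approx. 44.43, 4, 5.2e-09
print(f"Cramer's V = {cramers_v:.4f}")
```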
Discussion
Findings corroborate that LLMs reflect and may perpetuate societal gender stereotypes embedded in training data and linguistic norms. The significant differences across models underscore that bias profiles vary by model, potentially due to differences in data sources and training. Cross-model consistency patterns mapped onto role seniority and compensation, echoing documented gender stratification and pay gaps in accounting. The implications are substantial in applications such as hiring and recruitment, where LLM behaviors could lead to biased screening or recommendations; prior work also reports model preferences for female candidates in some contexts, highlighting unpredictability. Illustrative conversations with ChatGPT showed initial neutral stances followed by stereotype-aligned guesses, indicating how stereotypes can surface in use. Broader concerns include amplification of bias through data augmentation/synthetic data, potentially creating a bandwagon effect as AI-generated outputs retrain future systems. The authors advocate focusing on public awareness to improve scrutiny, accountability, and diversity in data contributions while cautioning that human-in-the-loop bias mitigation can introduce subjective biases. Cultural and regional workforce differences likely influence how models internalize occupational gender associations, suggesting sensitivity to context.
Conclusion
The study provides domain-specific evidence that LLMs assign gendered labels to accounting job titles in biased and model-dependent ways, validated statistically, and that titles consistently labeled male are associated with higher salaries, linking AI-driven stereotypes to economic disparities. Contributions include: (1) an empirical, accounting-focused assessment of LLM gender labeling; (2) statistical validation (Chi-squared, Cramer’s V) of cross-model differences; and (3) demonstration of economic implications via salary analysis. The work informs educators (curricular attention to AI bias), policymakers (regulation for unbiased deployment), and industry leaders (transparent, proactive bias mitigation). Future research directions include exploring alternative model architectures or intermediate layers, extending to other professions and languages, and developing governance and awareness strategies that reduce bias propagation while enhancing diversity in data and participation.
Limitations
The analysis is limited to 53 accounting job titles from ACCA and three zero-shot classification LLMs, focusing on label assignment without broader task contexts. Models were anonymized and evaluated at two time points only, and the "unknown" category had zero counts and was excluded from inferential statistics. The cross-model consistency and salary analyses rely on a subset of titles; while differences were statistically significant, the study does not quantify downstream impacts in deployed AI systems and does not generalize beyond the accounting domain. The authors also did not incorporate intermediate layers or additional debiasing steps, leaving architectural and mitigation variations to future work.