Modelling dataset bias in machine-learned theories of economic decision-making

Economics

T. Thomas, D. Straub, et al.

Tobias Thomas, Dominik Straub, Fabian Tatai, Megan Shene, Tümer Tosik, Kristian Kersting, and Constantin A. Rothkopf investigate dataset bias in machine-learned theories of economic decision-making. They show that online data appear to contain greater decision noise than laboratory studies, and that a new probabilistic generative model accounting for this noise improves cross-dataset predictions.

Introduction
The paper revisits the problem of predicting human choices under risk, where human behavior often departs from expected value maximization. Prior work contrasts normative models (prescribing how people should decide) with descriptive models (capturing how people actually decide). Recent advances have applied neural networks (NNs) to large datasets of human risky choices (notably choices13k) with claims of discovering superior theories. The authors argue that theory, models, and data are intricately linked and that dataset representativeness, model complexity, and model–data interactions can confound such claims. They pose the central question: do NN models trained on large online datasets generalize across datasets collected under different conditions, and what accounts for any transfer failures? They hypothesize that differences between online (AMT) and laboratory datasets may reflect systematic decision noise, especially evident in choices involving stochastic dominance.
Literature Review
The study situates itself within decades of research on risky decision-making, including normative (e.g., expected utility) and descriptive theories (e.g., prospect theory) documenting systematic violations of rational axioms. It reviews recent machine-learning efforts that use NNs to model human decisions and aspirations to automate theory discovery (Bourgin et al., Peterson et al.). It highlights known ML phenomena—overfitting, double descent, adversarial/idiosyncratic generalization, and spurious correlations—and emphasizes pervasive dataset bias. It contrasts large-scale online datasets (choices13k via AMT) with laboratory datasets (CPC15, CPC18), and prior use of cognitive model priors (BEAST-generated synth15) to mitigate overfitting on small lab datasets. It also draws on behavioral economics constructs such as stochastic dominance (first, second, third order) and psychological features shown to predict human risky choices (Plonsky et al.).
Methodology
Datasets: The authors analyze three decision datasets with binary choices between gambles: CPC15 (lab; Technion/HUJI; 30 gambles with repeated trials), CPC18 (lab; an expanded set including CPC15; reduced here to a CPC15-compatible format), and choices13k (online; AMT; ≈13,000 gambles; large participant pool). A synthetic pretraining set (synth15) was generated by sampling gamble problems from the CPC15 problem space and labeling them with predictions from the BEAST model, in order to mitigate overfitting.
Models: Five models were evaluated: (1) BEAST (a psychological model based on agent sampling); (2) a random forest (500 trees; naive and psychological features); (3) an SVM (RBF kernel; C=1; standardized inputs; naive and psychological features); (4) NN_Bourgin (a multilayer perceptron with SReLU activations and dropout via sparse evolutionary training; pretrained on synth15 and fine-tuned on CPC15 or choices13k as applicable); and (5) NN_Peterson (the most unconstrained, context-dependent model, operating on outcome–probability pairs; with or without synth15 pretraining). NNs were trained under several regimens: pretrain on synth15 then fine-tune on CPC15; pretrain on synth15 then fine-tune on choices13k; or train directly on choices13k. Some configurations (e.g., NN_Peterson pretrained on synth15 and fine-tuned on CPC15) were excluded due to overfitting.
Transfer testing: For each trained model, performance was evaluated via mean squared error (MSE × 100) on train/test splits of CPC15, choices13k, and compatible CPC18 subsets, to assess generalization and detect dataset bias.
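The transfer-testing protocol amounts to a train-on-one, test-on-all loop over datasets. A minimal sketch follows, using a random forest as a stand-in for the paper's model zoo and toy synthetic data in place of the real gamble features (the dataset contents and the `toy` generator here are illustrative assumptions, not the paper's data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mse_x100(y_true, y_pred):
    """The paper reports mean squared error scaled by 100."""
    return 100.0 * np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def transfer_matrix(datasets, make_model):
    """Train a fresh model on each dataset's train split, then evaluate
    it on every dataset's test split (cross-dataset transfer)."""
    results = {}
    for train_name, (X_tr, y_tr, _, _) in datasets.items():
        model = make_model()
        model.fit(X_tr, y_tr)
        for test_name, (_, _, X_te, y_te) in datasets.items():
            results[(train_name, test_name)] = mse_x100(y_te, model.predict(X_te))
    return results

# Toy stand-in data; real inputs would be gamble descriptors
# (outcomes, probabilities, ambiguity, feedback, ...).
rng = np.random.default_rng(0)
def toy(n):
    X = rng.normal(size=(n, 6))
    y = 1.0 / (1.0 + np.exp(-X[:, 0]))  # choice proportion in (0, 1)
    return X[: n // 2], y[: n // 2], X[n // 2 :], y[n // 2 :]

datasets = {"CPC15": toy(60), "choices13k": toy(400)}
results = transfer_matrix(
    datasets, lambda: RandomForestRegressor(n_estimators=500, random_state=0)
)
```

Dataset bias shows up in this matrix as off-diagonal entries (train on one dataset, test on the other) being markedly worse than diagonal ones.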
Feature-based analysis: To explain divergences between NNs trained on CPC15 vs choices13k, the authors computed NN_difference = predictions(NN_CPC15) − predictions(NN_choices13k) on choices13k gambles and regressed this difference on three feature sets: (a) basic gamble descriptors (e.g., Ha, La, Hb, Lb, pHa, pHb, ambiguity, correlation, feedback); (b) naive features (e.g., diffEV, diffSDs, diffMins, diffMaxs, diffBEVO, diffBEVfb); and (c) psychological features (e.g., pBbet_Unbiased1, pBbet_UnbiasedFB, diffUV, pBbet_Uniform, diffSignEV, pBbet_Sign1, pBbet_SignFB, SignMax, RatioMin, Dom). They also computed higher-order stochastic dominance features (SOSD, TOSD) and assessed single-feature correlations and multicollinearity.
XAI analysis: SHAP (additive feature attribution) values were computed for NN_difference using basic input features, to test whether the naive/psychological features could be linearly recovered from SHAP explanations.
Theory-driven hybrid noise model: Hypothesizing increased decision noise in the online dataset, the authors built a generative mixture model in which a proportion p_guess of participants guess randomly (choice probability 0.5), while the remaining 1 − p_guess follow NN_CPC15 predictions whose log-odds are scaled by a factor f (0 < f < 1), contracting probabilities toward 0.5. The predicted choice proportion is the corresponding weighted combination of the noisy NN prediction and random guessing. Using probabilistic programming (Turing.jl) and NUTS sampling (10,000 samples), the posterior over p_guess and f was inferred on choices13k training data; the posterior means were then used to evaluate transfer performance.
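The forward pass of the hybrid noise model is simple enough to write in a few lines (the paper performs the posterior inference in Turing.jl; this NumPy sketch covers only the generative mapping described above, with the input predictions assumed to lie strictly in (0, 1)):

```python
import numpy as np

def logit(p):
    """Log-odds of a probability in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_noise(p_nn, p_guess, f):
    """Mixture model: a fraction p_guess of participants guess at random
    (choice probability 0.5); the rest follow the NN prediction with its
    log-odds scaled by f (0 < f < 1), which contracts it toward 0.5."""
    p_noisy = sigmoid(f * logit(np.asarray(p_nn, dtype=float)))
    return p_guess * 0.5 + (1.0 - p_guess) * p_noisy

# Posterior means reported in the paper:
p = hybrid_noise([0.05, 0.50, 0.95], p_guess=0.2757, f=0.6236)
```

Note the qualitative effect: extreme predictions (near 0 or 1) are pulled toward 0.5, while a prediction of exactly 0.5 is left unchanged, matching the observation that choices13k avoids extreme choice proportions.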
Key Findings
- Transfer testing reveals dataset bias: models perform best on their own dataset and transfer poorly across datasets, despite choices13k being much larger. For example, NN_Bourgin trained on CPC15 achieves an MSE×100 of 0.53 (test) on CPC15 but 2.77 (test) on choices13k; conversely, NNs trained on choices13k have low error on choices13k (e.g., 1.00 test) but higher error on CPC15 (e.g., NN_Bourgin^Prior: 1.38 test).
- Pretraining on synth15 slightly improves transfer from choices13k to CPC15/CPC18 but does not eliminate dataset bias.
- Feature regressions show that basic features alone explain little of NN_difference (MSE = 0.0220, R^2 = 0.1186), whereas adding naive and psychological features halves the error (Base + Naive + Psych.: MSE = 0.0106, R^2 = 0.5760). Including SOSD/TOSD adds a small further improvement (MSE = 0.0103, R^2 = 0.5876).
- Single-feature analyses identify the strongest predictors of NN_difference among features capturing how much one gamble is better than the other: diffEV; pBbet_Unbiased (with/without feedback); stochastic dominance (Dom); diffBEVO/diffBEVfb; SOSD (R^2 ≈ 0.34) and TOSD (R^2 ≈ 0.33). Basic features are largely uninformative individually (R^2 ≤ ~0.025).
- Behavior differs on dominated gambles: in CPC15, 95% of human response rates for choosing the dominating option fall between 0.84 and 1.00; in choices13k, this interval is 0.62–0.97, indicating more dominance violations and greater variability online. NNs trained on each dataset mirror these patterns, with NN_choices13k avoiding extreme probabilities.
- SHAP analysis: SHAP values over basic inputs do not straightforwardly recover the naive/psychological features (regressions from SHAP values to these features achieve moderate R^2 at best), suggesting limits of automatic explanation extraction for the most predictive theoretical features.
- Hybrid noise model inference: posterior means are p_guess = 0.2757 (s.d. 0.0015) and f = 0.6236 (s.d. 0.0038). Applying this model to NN_CPC15 substantially improves transfer to choices13k: MSE×100 decreases from 2.69 to 1.49 on the choices13k training split and to 1.53 on the test split, closing more than half the gap to models trained on choices13k. Similarity to NN_choices13k increases (MSE×100 = 0.87; R^2 = 0.77). Performance on CPC15 deteriorates as expected (by roughly +1 MSE×100), consistent with the noise being dataset-specific.
- Overall, the best-transferring CPC15-trained model on choices13k is the hybrid NN_CPC15 + decision-noise model, supporting the decision-noise hypothesis as a key contributor to dataset differences.
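Since the dataset differences concentrate in dominated gambles, it helps to make first-order stochastic dominance concrete: gamble A dominates B when A's CDF lies at or below B's everywhere, and strictly below somewhere. A minimal sketch for discrete gambles follows (the `(outcome, probability)` encoding is illustrative, not the datasets' exact format, and this checks only first-order dominance, not the SOSD/TOSD features):

```python
def fosd(gamble_a, gamble_b):
    """Return True if gamble A first-order stochastically dominates B.
    Gambles are lists of (outcome, probability) pairs."""
    def cdf(gamble, x):
        # P(outcome <= x)
        return sum(p for o, p in gamble if o <= x)
    # It suffices to compare the CDFs at the outcomes of both gambles.
    points = sorted({o for o, _ in gamble_a} | {o for o, _ in gamble_b})
    diffs = [cdf(gamble_a, x) - cdf(gamble_b, x) for x in points]
    # A's CDF at or below B's everywhere, strictly below somewhere.
    return all(d <= 1e-12 for d in diffs) and any(d < -1e-12 for d in diffs)

# A pays at least as much as B with the same probabilities, so A dominates B:
a = [(10, 0.5), (0, 0.5)]
b = [(5, 0.5), (0, 0.5)]
```

In a dominated problem like this, a fully noise-free decision-maker should always pick A; choice proportions below 1.0 for A, as observed more often in choices13k, are what the decision-noise account explains.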
Discussion
The findings address the central question of whether large data and flexible NNs yield generalizable theories of human risky choice. Transfer testing shows that models trained on choices13k do not generalize well to lab datasets and vice versa, indicating dataset bias rather than universal theory discovery. The strongest predictors of inter-dataset prediction differences relate to dominance, expected value differences, and likelihood of one gamble outperforming the other, consistent with theoretical constructs from behavioral economics. The choices13k dataset exhibits systematically less extreme choice proportions—especially in dominated cases—suggesting increased decision noise in online settings. A theory-driven hybrid noise model that mixes random guessing with precision-limited log-odds scaling of NN_CPC15 predictions explains much of the gap and transfers best, highlighting the importance of modeling context-dependent noise structures. These results temper claims that unconstrained NNs trained on large datasets alone can discover general theories, and they underscore the need to integrate theoretical insights, careful data characterization, and cross-dataset validation.
Conclusion
The study demonstrates that dataset context crucially shapes machine-learned models of human risky choice. There is clear evidence of dataset bias between online (choices13k) and laboratory (CPC15/CPC18) datasets. Differences concentrate in gambles where one option is superior (dominance, EV differences), with choices13k showing structured decision noise (less extreme choice proportions). A simple generative hybrid model—combining a guessing fraction and a log-odds precision factor—substantially improves transfer from lab-trained models to online data. The work argues that combining theory, data analysis, and model comparisons is necessary for progress and that automated theory discovery via deep NNs remains limited. Future directions include collecting richer individual-level and sequential decision data, systematically varying and measuring experimental contexts, extending beyond binary small-stakes gambles to more naturalistic and higher-stakes decisions, and developing theoretically constrained models with better generalization.
Limitations
- The datasets contain only aggregate choice proportions per gamble (averaged over participants and trials), limiting analysis of individual differences, learning, and sequential effects.
- The domain is restricted to binary risky choices with small monetary outcomes; generalization to multi-option, multi-outcome, or higher-stakes settings is uncertain.
- Differences in experimental protocols (e.g., feedback structure, online vs. lab environment, attention/understanding) may introduce unobserved contextual confounds.
- Some NN configurations overfit small lab datasets (e.g., CPC15) and were excluded, constraining architectural comparisons.
- XAI (SHAP) explanations over basic features did not recover the most predictive naive/psychological constructs in a simple way, reflecting limitations of post hoc interpretability.
- The hybrid noise model captures a major component of dataset differences with only two parameters but may still be a simplification of heterogeneous participant behaviors.