Artificial intelligence exceeds humans in epidemiological job coding

M. A. Langezaal, E. L. V. D. Broek, et al.

OPERAS is a decision support system for epidemiological job coding that improves the accuracy and efficiency of occupational exposure assessments. In evaluation, it outperformed expert coders and existing coding methods, improving exposure assessment accuracy while substantially reducing manual workload. The research was conducted by Mathijs A Langezaal, Egon L van den Broek, Susan Peters, Marcel Goldberg, Grégoire Rey, Melissa C Friesen, Sarah J Locke, Nathaniel Rothman, Qing Lan, and Roel C H Vermeulen.
Introduction

Occupation is a major determinant of adult health, with exposures such as stress, chemicals (e.g., diesel exhaust, asbestos), and physical agents contributing substantially to morbidity and mortality. Epidemiological studies depend on accurate exposure assessment, which in turn relies on translating free-text job descriptions into standardized occupational and industry codes to enable application of job-exposure matrices (JEMs). Manual coding by expert coders is time-consuming, prone to inconsistency, and exhibits variable inter-coder reliability (often 42–71%). Existing automatic and semi-automatic tools show limited accuracy and often require human intervention for reliable exposure assessment. The study aims to develop and validate a machine learning-based decision support system, OPERAS, to improve accuracy and efficiency of occupational coding and downstream exposure assessment, and to test whether it can match or exceed expert coders’ reliability while reducing workload.
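As a rough illustration of how a job-exposure matrix links standardized occupation codes to exposure estimates, consider the sketch below. The codes and exposure values are invented for illustration and are not drawn from any real JEM such as ALOHA-JEM or DOM-JEM:

```python
# A JEM maps a standardized occupation code to an exposure profile.
# All codes and values here are hypothetical.
silica_jem = {
    "7122": {"exposed": True, "level": "high"},   # e.g., a construction trade
    "4110": {"exposed": False, "level": "none"},  # e.g., an office occupation
}

def assess_exposure(occupation_code: str) -> dict:
    """Look up the exposure profile for a coded job; default to non-exposed."""
    return silica_jem.get(occupation_code, {"exposed": False, "level": "none"})

print(assess_exposure("7122"))  # {'exposed': True, 'level': 'high'}
```

Because the lookup is only as good as the code it receives, miscoded jobs propagate directly into exposure misclassification, which is why coding accuracy matters downstream.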

Literature Review

Prior work has produced several coding tools (e.g., ACA-NOC, CASCOT, NIOCCS, Procode, SOCcer, SOCEye) using methods ranging from string similarity to ensemble classifiers. Reported accuracies span roughly 15–85% depending on coding level and taxonomy, with out-of-distribution performance often substantially lower (17–26%). Inter-coder reliability for automated tools and manual coders typically ranges between ~0.40–0.87 for exposure assessment and 48–89% agreement for titles/levels. Studies indicate that automatic coding without human oversight can yield unreliable exposure assessments, and many tools lack modern machine learning techniques or broad taxonomy support, limiting generalizability across international classification systems. This motivates the development of an ML-driven, multi-taxonomy system supporting decision assistance and automatic coding with confidence estimates.

Methodology

Overview: OPERAS is a decision support system built and evaluated in six phases: data acquisition and barrier identification; data preparation; model training; evaluation of classification performance; comparison with expert inter-coder reliability; and exposure assessment evaluation via multiple job-exposure matrices (JEMs).

Datasets: Three main datasets were used: (1) Constances (France), with free-text occupation and sector manually coded to NAF2008 and PCS2003; (2) AsiaLymph (a hospital-based case-control study in Eastern Asia), with free-text items translated to English and coded to ISCO-88; and (3) Lifework (Netherlands), with job names, descriptions, and company types coded to ISCO-68. Ethical approvals were obtained in the original studies, and data reuse was permitted where stated.

Data preparation: Entries with incomplete codes were removed to enable complete code suggestions. After cleaning, approximately 281,418 entries were available for NAF2008, 483,090 for PCS2003, 36,007 for ISCO-88, and 12,007 for ISCO-68; the raw datasets were substantially larger before cleaning. Text normalization included tokenization, punctuation removal, lowercasing, and stemming. Descriptions were embedded into sentence embeddings to create fixed-length numerical vectors capturing semantic similarity; when multiple input fields existed (e.g., job description and sector), their vectors were combined by summation. Data were split into train/validation/test sets (60/10/30), and training data were supplemented with descriptions from coding indexes to improve generalizability. Class imbalance was severe; standard resampling techniques (e.g., SMOTE) were deemed inadequate because minority classes were extremely small.

Model training: XGBoost was selected based on prior comparative studies showing strong performance for occupation coding with index augmentation. Models were trained for four taxonomies: NAF2008, PCS2003, ISCO-88, and ISCO-68.
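The normalization and embedding-combination steps described above can be sketched as follows. The hashed bag-of-words encoder here is only a stand-in for a real sentence-embedding model (which the summary does not name), and stemming is omitted for brevity:

```python
import re

def normalize(text: str) -> list[str]:
    """Tokenize, strip punctuation, and lowercase (stemming omitted here)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a sentence encoder: a hashed bag-of-words vector.
    OPERAS uses real sentence embeddings; this is purely illustrative."""
    vec = [0.0] * dim
    for tok in normalize(text):
        vec[hash(tok) % dim] += 1.0
    return vec

# Multiple input fields (job description and sector) are embedded
# separately and combined by element-wise summation into one
# fixed-length feature vector.
features = [a + b for a, b in zip(embed("Welder in a shipyard"),
                                  embed("Metal industry"))]
print(len(features))  # 64
```

The fixed-length vector is what allows a downstream classifier such as XGBoost to consume free-text fields of arbitrary length.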
Training used early stopping on validation data and default-like hyperparameters informed by prior work (e.g., learning rate η ≈ 0.6, max_depth ≈ 5, regularization terms, subsampling). ISCO-88 models were trained on English translations; the others on text in the original languages.

Evaluation metrics: Accuracy (proportion correct) and Cohen's kappa (κ) were computed on test sets per code level and major group, treating expert-coded labels as the gold standard. OPERAS outputs a confidence score per prediction; performance was analyzed across 5% confidence bins, and workload reduction was estimated as the proportion of predictions auto-accepted above a threshold multiplied by their accuracy.

Human-model cross-validation: OPERAS' κ per coding level was compared to published human inter-coder reliability studies (e.g., Maaz et al.; ALBUS/PIAAC studies) using independent-sample t-tests (two-sided, p < 0.05), accounting for unequal variances where needed, and reporting Hedges' g.

Exposure assessment evaluation: Four JEMs were used: Formaldehyde-JEM (NAF/PCS), Silica-JEM (NAF/PCS), ALOHA-JEM (ISCO-88), and DOM-JEM (ISCO-68). Exposure was evaluated for (1) all individuals and (2) exposed individuals (per the gold standard), computing accuracy and κ for dichotomous exposure (exposed vs. not exposed) and, where applicable, comparing exposure levels. For continuous exposure scores (e.g., Formaldehyde-JEM), Kendall's τ rank correlation between gold-standard and predicted exposures was calculated. Confidence-threshold analyses analogous to those for coding performance were applied to exposure assessments to estimate potential automation gains.
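Cohen's kappa, the agreement statistic used throughout the evaluation, corrects observed agreement for the agreement expected by chance. A minimal sketch, with hypothetical example codes:

```python
from collections import Counter

def cohens_kappa(gold: list, pred: list) -> float:
    """Cohen's kappa: agreement between two codings, corrected for chance."""
    n = len(gold)
    p_observed = sum(g == p for g, p in zip(gold, pred)) / n
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    # Chance agreement: probability both coders pick the same label
    # if each assigns labels according to their marginal frequencies.
    p_expected = sum(gold_freq[c] * pred_freq.get(c, 0)
                     for c in gold_freq) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

gold = ["7122", "7122", "4110", "4110"]  # expert-coded labels (hypothetical)
pred = ["7122", "4110", "4110", "4110"]  # model predictions (hypothetical)
print(round(cohens_kappa(gold, pred), 2))  # 0.5
```

Here observed agreement is 0.75 but chance agreement is 0.5, so κ = 0.5, illustrating why κ is a stricter yardstick than raw accuracy when a few codes dominate.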

Key Findings
  • Classification performance vs. experts: Using expert-coded descriptions as the gold standard, OPERAS achieved inter-coder reliability (κ) ranges by coding level of 0.66–0.84 (Level 1), 0.62–0.81 (Level 2), 0.60–0.79 (Level 3), and 0.57–0.78 (Level 4). These exceed the expert coder ranges reported in comparison studies: 0.59–0.76, 0.56–0.71, 0.46–0.63, and 0.40–0.56, respectively. T-tests showed significantly higher κ for OPERAS at each level (Hedges' g ~0.36–0.64 as reported).
  • Accuracy by taxonomy and level (examples from reported tables): NAF accuracies and κ were high across levels (e.g., Level 1 accuracy ~88.3%, κ ~0.85; Level 5 accuracy ~78.9%, κ ~0.78). PCS showed Level 1 accuracy ~84.4% and κ ~0.79, decreasing with granularity; ISCO-88 and ISCO-68 showed lower but competitive performance at finer levels, consistent with fewer training entries.
  • Confidence-threshold utility and workload reduction: High-confidence predictions (95–100%) were very accurate. With a 95% threshold, estimated minimum workload reductions were NAF: 55.5% (57.4% of predictions above threshold at 97.0% accuracy), PCS: 40.7%, ISCO-88: 20.4%, and ISCO-68: 19.7%.
  • Exposure assessment accuracy: Across JEMs and taxonomies, exposure assessment accuracy ranged between 75.0% and 98.4% overall, with very high correctness among predictions in the 95–100% confidence bin (~99%+). For exposed-only subsets, agreement was lower (e.g., an illustrative κ of ~0.46), reflecting reduced chance agreement and the rarity and specificity of certain exposures.
  • Overall: OPERAS outperformed existing automatic coding tools (per literature benchmarks) and exceeded expert inter-coder reliability while offering substantial potential workload reduction.
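The workload-reduction estimate reported above (share of predictions above the confidence threshold multiplied by their accuracy) can be sketched as follows; the confidence values and correctness flags are hypothetical:

```python
def workload_reduction(confidences, correct, threshold=0.95):
    """Estimate workload reduction from auto-accepting high-confidence codes:
    (share of predictions above threshold) x (their accuracy)."""
    above = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    if not above:
        return 0.0
    share = len(above) / len(confidences)
    accuracy = sum(above) / len(above)
    return share * accuracy

# Hypothetical example: 6 of 10 predictions clear the 95% threshold,
# and 5 of those 6 are correct -> 0.6 x (5/6) = 0.5 reduction.
confs   = [0.99, 0.97, 0.96, 0.98, 0.95, 0.96, 0.80, 0.70, 0.60, 0.50]
correct = [1,    1,    1,    1,    1,    0,    1,    0,    1,    0]
print(round(workload_reduction(confs, correct), 3))  # 0.5
```

Multiplying by accuracy makes the figure a conservative ("minimum") estimate: auto-accepted but wrong codes still generate downstream correction work.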
Discussion

The study demonstrates that an ML-based decision support system for occupational coding can surpass expert inter-coder reliability and provide accurate exposure assessments when applied to free-text job descriptions. Confidence scores enable selective automation, balancing efficiency gains with quality control. These results support the central goal of improving both the accuracy and scalability of occupational exposure assessment. Differences across taxonomies are attributable to training set sizes, language specifics, and input description lengths; longer, noisier descriptions can reduce reliability if irrelevant details conflict with code definitions. Exposure assessment performance is influenced by the prevalence and specificity of exposures in JEMs; rarer exposures increase the likelihood of discordant codes leading to a non-exposed classification. Despite these challenges, strong rank correlations and high-accuracy high-confidence strata indicate practical applicability for epidemiological studies. Incorporating OPERAS as decision support may also standardize coding suggestions and improve downstream human inter-coder reliability.

Conclusion

OPERAS, a customizable ML-based decision support system, achieves human-level or better performance in occupational coding across multiple classification systems (NAF2008, PCS2003, ISCO-88, ISCO-68), provides accurate exposure assessments via established JEMs, and can substantially reduce manual workload through confidence-based automation. These advances enable large-scale, efficient occupational exposure assessment, potentially improving the validity and throughput of epidemiological studies. Future work should expand training datasets across languages and regions, address class imbalance and minority class performance, refine exposure mapping across diverse JEMs, and explore more advanced text encoders or multimodal inputs to further enhance generalizability and robustness.

Limitations
  • Class imbalance: Severe skew toward majority classes limited the effectiveness of standard balancing techniques (e.g., SMOTE), potentially reducing performance on rare occupations that may be epidemiologically important.
  • Dataset heterogeneity and size: Performance varied with taxonomy-specific training sizes and languages; ISCO-88/68 models had fewer entries, affecting fine-level accuracy.
  • Description length and noise: Longer or overly detailed free-text entries may include irrelevant information conflicting with code definitions, decreasing reliability.
  • Removal of incomplete codes: Excluding entries with incomplete codes improves the precision of suggestions but may limit coverage and introduce selection bias relative to real-world data.
  • Exposure prevalence effects: JEMs with rare, specific exposures can yield lower agreement for exposed-only subsets; the dominance of non-exposure inflates chance agreement in all-individual analyses.
  • Generalizability: Models were trained on specific cohorts and languages; external validity to other populations, taxonomies, or out-of-distribution phrasing requires further testing.
  • Implementation choices: Default-like hyperparameters and reliance on sentence embeddings rather than more recent large language models may cap performance; however, this was intentional for robustness and generalization.