Larger and more instructable language models become less reliable
L. Zhou, W. Schellaert, et al.
General-purpose AI systems based on LLMs are widely used in education, medicine, science and administration, yet they frequently make mistakes requiring user supervision. As models have been scaled (more parameters, data, compute) and shaped (instruction tuning, RLHF, moderation), it is often assumed their reliability for human users also improves. The paper questions this assumption by defining user-driven reliability as predictable error patterns aligned with human difficulty expectations and manageable via supervision. It motivates the study with examples (e.g., arithmetic) where modern models perform well overall yet fail unpredictably on seemingly simple prompts. The research asks how reliability has evolved with scaling and shaping by examining three intertwined aspects: (1) difficulty concordance (whether errors concentrate on instances humans find hard), (2) task avoidance versus incorrect answering, and (3) prompting stability under natural phrasing variations.
The study builds on scaling laws showing performance gains with more compute, data, and parameters; on instruction tuning and RLHF improving instructability and alignment; and on extensive work analyzing prompt sensitivity and robustness. Prior literature has explored model avoidance/evasiveness, hedging, and refusals, as well as human oversight and calibration issues in LLMs. Despite these advances, it remains unclear how scaling and shaping affect the alignment between model errors and human-perceived difficulty, the balance between avoidance and incorrectness, and the stability to natural prompt variations, especially from the perspective of end-user reliability.
The authors analyze three LLM families (GPT, LLaMA, BLOOM), including raw/base models and shaped-up instruct/chat variants, using paired versions where possible to disentangle scaling from shaping. Responses are categorized into three outcomes: correct (c), incorrect (i), and avoidant (a). Five benchmarks span a range of domains and degrees of openness: Addition (arithmetic, 1–100 digits; difficulty proxy: number of carry operations), Anagram (unscrambling 3–20 letters; proxy: anagram length), Locality (geographical proximity/population; proxy: inverse city popularity), Science (basic to graduate-level multiple-choice; proxy: human-rated difficulty), and Transforms (information-centric transformations; proxy: combined input/output length and Levenshtein distance). Human study S1 calibrates the intrinsic difficulty proxies to human difficulty expectations (0–100 scale) and validates proxy quality; S2 measures human supervision ability by asking participants to judge model outputs as correct, incorrect, avoidant, or unsure. For each model–benchmark pair, 15 natural prompt templates are used to assess prompt sensitivity. Performance is summarized with four indicators: correctness proportion, c/(c+i+a); prudence, (c+a)/(c+i+a); difficulty concordance (alignment between correctness and human difficulty); and prompting stability (sensitivity to prompt variation). Difficulty trends are analyzed by binning calibrated difficulty into equal-sized bins and examining the c/a/i distribution in each bin. A scaling analysis uses training FLOPs to model trends for raw versus shaped-up models, focusing on avoidance (a), incorrectness (i), and ultracrepidarianism, i/(a+i), the share of incorrect answers among non-correct responses. The study reports family-wide trends (GPT, LLaMA, BLOOM) and detailed plots in the main text, extended data, and supplementary materials.
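The count-based indicators above can be illustrated with a minimal Python sketch. The function names, the example records, and the reading of "equal-sized bins" as equal-count bins are assumptions made for illustration; this is not the authors' code.

```python
from collections import Counter


def indicators(outcomes):
    """Summarize outcome labels: 'c' (correct), 'i' (incorrect), 'a' (avoidant)."""
    counts = Counter(outcomes)
    c, i, a = counts["c"], counts["i"], counts["a"]
    total = c + i + a
    if total == 0:
        raise ValueError("no c/i/a outcomes to summarize")
    return {
        "correctness": c / total,                  # c / (c + i + a)
        "prudence": (c + a) / total,               # (c + a) / (c + i + a)
        # i / (a + i); undefined when every response is correct
        "ultracrepidarianism": i / (a + i) if (a + i) else None,
    }


def equal_count_bins(records, n_bins):
    """Sort (difficulty, outcome) pairs by difficulty and split them into
    n_bins bins holding (nearly) the same number of instances.
    Assumption: 'equal-sized' is interpreted as equal-count, not equal-width."""
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    return [ordered[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]


# Hypothetical (calibrated difficulty, outcome) records for one model and benchmark.
records = [(5, "c"), (10, "c"), (20, "c"), (35, "i"),
           (50, "c"), (60, "a"), (75, "i"), (90, "i")]

overall = indicators([outcome for _, outcome in records])
per_bin = [indicators([outcome for _, outcome in b])
           for b in equal_count_bins(records, 2)]
```

In this made-up example, correctness falls from the easier bin to the harder one, and the harder bin mixes avoidance with incorrect answers, which is the kind of per-bin c/a/i breakdown the methodology describes.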
- Human difficulty proxies predict model correctness (negative correlation between difficulty and correctness across benchmarks; high correlations except BLOOM-addition), confirming that instances perceived as harder by humans are generally harder for models.
- Despite this, models offer no secure easy operating region that is free of errors. Errors persist even at very low difficulty levels, especially in Anagram (GPT), Science (LLaMA), and Locality/Transforms (GPT and LLaMA). GPT-4 improves mainly on medium/hard instances versus GPT-3.5-turbo, with limited gains on easy items. No LLaMA model reaches 60% correctness at the simplest difficulty levels. The one exception to the absence of reliable easy regions is a low-difficulty region in Science for GPT-4.
- Shaped-up models show higher correctness and greater prompting stability than raw models but exhibit lower difficulty concordance and markedly reduced avoidance, often replacing avoidance with incorrect answers. GPT-4 shows near-vanishing avoidance.
- Avoidance does not reliably increase with difficulty; correlations are weak, with only pockets of positive trends for some models and tasks. Incorrectness rises with difficulty more than avoidance, eliminating a safe buffer for users.
- Prompt sensitivity decreases with shaping and scaling, yet substantial variability remains across difficulty levels and models. Raw GPT and non-chat LLaMA are highly prompt-sensitive even for unambiguous tasks like addition; shaped-up models are more stable but still show pockets of instability.
- Scaling analysis using FLOPs reveals: shaped-up models have much lower avoidance and higher incorrectness compared to raw models; correctness increases with scale, but incorrectness does not decrease and even increases for raw models. The proportion of errors among non-correct responses i/(a+i) increases with scale, indicating rising ultracrepidarianism.
- Human supervision (S2) fails to fully compensate: dangerous supervision errors (incorrect outputs judged correct) persist across difficulties, with only limited safe regions (e.g., easy additions and broader range for anagrams) where verification is straightforward.
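The correlations between difficulty and correctness mentioned in the findings above can be sketched as a rank correlation. The sketch below uses Spearman's coefficient with average ranks for ties; this choice of coefficient, the function names, and the toy data are assumptions for illustration, since the summary does not state which correlation measure the authors used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical instances: calibrated difficulty and whether the answer was
# correct (1) or avoidant (1), encoded as 0/1 indicators.
difficulty = [10, 20, 30, 40, 50, 60]
correct = [1, 1, 1, 0, 0, 0]
avoidant = [0, 1, 0, 0, 1, 0]

rho_correct = spearman(difficulty, correct)    # strongly negative in this toy data
rho_avoidant = spearman(difficulty, avoidant)  # weak in this toy data
```

A strongly negative value for correctness alongside a weak value for avoidance mirrors the pattern reported above: difficulty predicts where models fail, but not where they refrain from answering.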
The findings indicate that while scaling and shaping improve headline accuracy and prompt stability, they degrade user-driven reliability by eroding predictable error distributions aligned with human difficulty, suppressing avoidance in favor of plausible but wrong answers, and leaving pockets of prompt sensitivity. This undermines the existence of safe operating areas where users can trust easy instances or rely on avoidance for hard ones. The results suggest that current optimization objectives prioritize benchmark correctness and instructability at the expense of prudence and calibration. The authors argue for incorporating difficulty concordance and supervision-aware losses during training/shaping, collecting datasets of human difficulty expectations and supervision judgments, and designing models or external supervisors that learn when to refrain. Especially in high-stakes domains (e.g., medicine), explicit reject options and calibrated avoidance could improve reliability beyond what human oversight alone can ensure.
The study reinterprets the reliability trajectory of LLMs under scaling and shaping. Key contributions include: (1) a three-outcome evaluation framework (correct/incorrect/avoidant) tied to human-calibrated difficulty across five benchmarks and natural prompt variants; (2) evidence that difficulty proxies predict correctness but do not yield secure easy regions; (3) a demonstrated trade-off where shaping reduces avoidance and increases ultracrepidarian errors; (4) a scaling analysis showing that incorrect answers make up a growing share of non-correct responses, i/(a+i), as scale increases; and (5) evidence of the limited efficacy of human supervision in mitigating these issues. The authors advocate reorienting training and alignment toward difficulty-aware reliability, calibrated avoidance, and supervision-informed objectives, with future work building larger datasets of human difficulty expectations and supervision behaviors, exploring reject-option designs and external supervisors, and extending analyses to tool-augmented or reasoning-enhanced LLMs.
- Participant pool skewed toward non-experts, potentially inflating calibrated difficulty for some benchmarks; chosen to keep a consistent population across datasets.
- Natural prompts were collected from diverse sources without frequency data reflective of real-world usage.
- Analysis covers selected model families and does not include LLMs with tool use or advanced reasoning pipelines, which may exhibit different dynamics.
- Some indicators and correlations are aggregated; although multiple prompts and bins are used, fine-grained prompt distributions in the wild may differ.