Larger and more instructable language models become less reliable

Computer Science

L. Zhou, W. Schellaert, et al.

Research by Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, and José Hernández-Orallo highlights a paradox: while larger language models perform better on difficult tasks, they still fail on simpler ones, producing plausible but incorrect answers. This raises critical questions about the reliability of AI models, especially in high-stakes situations.
Abstract
Large language models (LLMs) have been scaled up and shaped up (via instruction tuning, RLHF, moderation) to increase capability and instructability. Studying GPT, LLaMA and BLOOM families across five benchmarks and human studies, the authors find that while scaled and shaped models improve correctness and prompt stability, they become less reliable for users: errors persist on instances that humans consider easy (difficulty discordance), avoidance is reduced and replaced by seemingly plausible but wrong answers (ultracrepidarianism), and stability improvements still leave pockets of variability across difficulty levels. Human supervisors often fail to detect these errors, especially on harder items. The work argues for rethinking design toward difficulty-aware reliability and calibrated avoidance, particularly for high-stakes applications.
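To make the abstract's terms concrete, the following minimal sketch shows one way responses graded as correct, avoidant, or incorrect could be tallied per difficulty level. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `rates_by_difficulty`, the bin names "easy" and "hard", and the toy records are all hypothetical, and the numbers are made up rather than taken from the paper.

```python
from collections import Counter, defaultdict

# Assumed grading scheme for illustration: every response gets one of these labels.
LABELS = ("correct", "avoidant", "incorrect")


def rates_by_difficulty(records):
    """Return {difficulty_bin: {label: share}} from (difficulty_bin, label) pairs."""
    counts = defaultdict(Counter)
    for difficulty_bin, label in records:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        counts[difficulty_bin][label] += 1

    result = {}
    for difficulty_bin, counter in counts.items():
        total = sum(counter.values())
        # Share of each label within this difficulty bin (0.0 if the label never occurs).
        result[difficulty_bin] = {label: counter[label] / total for label in LABELS}
    return result


# Made-up toy data, NOT results from the paper.
toy = [
    ("easy", "correct"), ("easy", "correct"), ("easy", "incorrect"), ("easy", "correct"),
    ("hard", "incorrect"), ("hard", "avoidant"), ("hard", "incorrect"), ("hard", "correct"),
]
rates = rates_by_difficulty(toy)
for difficulty_bin, shares in rates.items():
    print(difficulty_bin, shares)
```

Read this way, difficulty discordance would appear as a nonzero incorrect share in the easy bin, and reduced avoidance as an avoidant share that shrinks while the incorrect share grows.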
Publisher
Nature
Published On
Sep 25, 2024
Authors
Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, José Hernández-Orallo
Tags
large language models
task avoidance
prompting stability
AI design
difficulty concordance
errors
AI reliability