
Computer Science

Active Prompting with Chain-of-Thought for Large Language Models

S. Diao, P. Wang, et al.

Large language models improve at complex reasoning when guided by example-based chain-of-thought prompts. This paper introduces Active-Prompt, an uncertainty-driven method for selecting the most informative questions for human CoT annotation, yielding superior performance on eight complex reasoning tasks. Research conducted by Shizhe Diao, Pengcheng Wang, Yong Lin, Rui Pan, Xiang Liu, and Tong Zhang.

Introduction
Large language models excel at in-context learning but struggle on complex reasoning unless guided by chain-of-thought (CoT) exemplars. Existing CoT prompting typically uses a small, fixed, human-curated set of examples, which may not be optimal for diverse tasks of varying difficulty and domain. The research question is how to identify and annotate the most informative task-specific questions to serve as exemplars that maximize reasoning performance under small annotation budgets. The paper proposes Active-Prompt, an uncertainty-driven selection approach that chooses the most uncertain questions for human CoT annotation, aiming to reduce model uncertainty and improve reasoning accuracy. This is important because selecting better exemplars can substantially boost LLM reasoning across arithmetic, commonsense, and symbolic tasks while keeping human effort low.
Literature Review
The work builds on chain-of-thought prompting (Wei et al., 2022b), including improvements via self-consistency (Wang et al., 2022), least-to-most prompting (Zhou et al., 2022), dynamic least-to-most, bootstrapping, self-training, verifiers, and prompt augmentation/selection. Prior methods typically rely on fixed or heuristically chosen exemplars. Auto-CoT (Zhang et al., 2022b) selects examples via clustering/diversity and zero-shot prompting but requires accessing the test set; the authors argue diversity and uncertainty are complementary. The study also relates to active learning, especially uncertainty-based strategies like max-entropy and least confidence, which have shown benefits for efficient labeling and fine-tuning. The paper adapts these uncertainty principles to in-context CoT selection rather than model training, demonstrating effectiveness for complex reasoning tasks.
Methodology
Active-Prompt selects a small number n of task-specific questions from a training pool for human annotation with chain-of-thought (CoT) rationales and answers, then uses those exemplars for few-shot prompting on test inputs. The pipeline (illustrative code sketches follow the implementation notes below):

- Uncertainty Estimation: For each candidate question q in a pool (up to 1,000 questions sampled from the training set if it is larger), query the LLM k times (k = 10 in the main experiments) to generate answers with intermediate steps, optionally stabilized with a few initial CoT exemplars (or zero-shot CoT). Compute an uncertainty score u per question using one of four metrics:
  - Disagreement: u = h/k, where h is the number of unique answers among the k generations.
  - Entropy: u = −∑_j P(a_j|q) ln P(a_j|q), where P(a_j|q) is the empirical frequency of answer a_j over the k generations.
  - Variance: for numerical answers, the sample variance of the k predictions, optionally normalized by the magnitudes appearing in the question so that large numbers do not dominate.
  - Self-confidence: ask the model to rate its own confidence (very confident / confident / not confident / wrong) and select the least confident questions; this proved ineffective because of overconfidence.
- Selection and Annotation: Rank questions by u and select the top-n most uncertain (breaking ties at random). A human annotator writes rationale chains and correct answers for these n questions, forming the exemplar set E.
- Inference: Prepend E to each test question and decode. Optionally apply self-consistency by sampling m = 40 responses at temperature T = 0.7 and choosing the most frequent answer.

Implementation notes: The number of exemplars per task follows prior CoT settings (e.g., 8 for GSM8K). For datasets lacking training splits (ASDiv, SVAMP, SingleEq), exemplars are transferred from the GSM8K selection and annotation. Main models include the code-davinci-002, text-davinci-002/003, and gpt-3.5-turbo APIs; some experiments also use Llama2-70b-chat. Pool size and k were analyzed; accuracy increases with k and converges near k = 10.
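The uncertainty metrics are simple functions of the k answers sampled for each candidate question. Below is a minimal Python sketch, assuming the answers have already been parsed out of the model's generations; the function names and the `scale` normalizer are illustrative, not taken from the authors' code.

```python
from collections import Counter
import math


def disagreement(answers):
    """u = h / k, where h is the number of unique answers among the k samples."""
    return len(set(answers)) / len(answers)


def entropy(answers):
    """u = -sum_j P(a_j|q) ln P(a_j|q), with P the empirical answer frequency."""
    k = len(answers)
    return -sum((c / k) * math.log(c / k) for c in Counter(answers).values())


def normalized_variance(answers, scale=1.0):
    """Sample variance of numerical answers; `scale` is a hypothetical stand-in
    for the magnitude-based normalization that keeps large numbers from dominating."""
    values = [float(a) / scale for a in answers]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Example: k = 10 sampled answers for one question.
answers = ["12", "12", "15", "12", "9", "12", "12", "15", "12", "12"]
print(disagreement(answers))  # 0.3  (3 unique answers / 10 samples)
print(entropy(answers))       # ~0.80
```

A question on which the model keeps producing the same answer scores near zero on both metrics, while a question whose answers scatter widely scores high and becomes a candidate for human annotation.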
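Selection and self-consistency decoding are likewise straightforward. The sketch below assumes a caller-supplied `sample_fn` that queries an LLM once and returns a single parsed answer string; it illustrates the ranking and majority-vote steps described above rather than the authors' actual implementation.

```python
import random
from collections import Counter


def select_most_uncertain(questions, scores, n):
    """Rank candidate questions by uncertainty score and keep the top n.
    Shuffling before the (stable) sort breaks ties at random."""
    paired = list(zip(questions, scores))
    random.shuffle(paired)
    paired.sort(key=lambda qs: qs[1], reverse=True)
    return [q for q, _ in paired[:n]]


def self_consistency_answer(sample_fn, prompt, m=40, temperature=0.7):
    """Sample m completions at the given temperature and return the most
    frequent parsed answer (the self-consistency vote)."""
    answers = [sample_fn(prompt, temperature) for _ in range(m)]
    return Counter(answers).most_common(1)[0][0]
```

The selected questions are then annotated by a human and prepended, as the exemplar set E, to every test prompt.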
Key Findings
- Overall gains across eight benchmarks: With text-davinci-002, Active-Prompt (D) improves average accuracy over self-consistency by about +7.0 points (67.9 → 74.9). With code-davinci-002, Active-Prompt (D) improves by +1.8 (79.1 → 80.9) and Active-Prompt (E) reaches 81.6 (+2.5 vs. SC). With gpt-3.5-turbo (no SC), Active-Prompt raises the average from 78.5 (CoT) to 80.0 (D) and 81.0 (E).
- Per-task highlights (code-davinci-002, Active-Prompt E): GSM8K 83.4, AQUA 57.0, SingleEq 95.5, Letter (4) 76.7; consistently strong across arithmetic, commonsense, and symbolic tasks.
- Active selection vs. random: On GSM8K, Random-CoT (78.6) only slightly beats SC (78.0) but lags Active-Prompt (82.2) by 3.6 points, indicating that selection, not annotation effort, drives the gains. Similar trends hold across tasks.
- Uncertainty metrics: Disagreement and entropy perform similarly well; variance is competitive but weaker on GSM8K; self-confidence underperforms due to LLM overconfidence. For binary-label tasks (StrategyQA), entropy is preferable because disagreement saturates.
- Zero-shot variant: Zero-Shot-Active-Prompt (using "Let's think step by step" instead of initial exemplars) is competitive with the few-shot-based Active-Prompt.
- Pool size and k: Increasing k (1, 5, 10, 15) improves selection quality and accuracy, converging near k = 10.
- Transferability: Exemplars selected with one model (e.g., code-davinci-002) transfer effectively to others (text-davinci-002/003), suggesting the uncertainty is task-inherent. Cross-family transfer between Llama2-70b-chat and gpt-3.5-turbo is feasible; selecting with a larger model and applying the exemplars to a smaller one improves results.
- Weaker models: Active-Prompt improves over CoT for Llama2-70b-chat across multiple arithmetic datasets.
- Logits-based uncertainty: When logits are available, they also work as an uncertainty signal and can be comparable to disagreement/entropy; overconfidence was noted for some models (e.g., Llama).
Discussion
The study addresses the question of how to choose the most informative questions for CoT exemplar annotation to improve LLM reasoning. By quantifying prediction uncertainty over multiple sampled solutions and annotating only the most uncertain items, the method reduces model uncertainty and yields better in-context reasoning. The approach is effective across arithmetic, commonsense, and symbolic tasks, with consistent gains over standard CoT and self-consistency baselines. Analyses reveal a strong negative correlation between uncertainty and accuracy, supporting the core hypothesis that lowering uncertainty via targeted exemplars enhances performance. The method’s transferability across models indicates that selected uncertainties are task-driven rather than model-specific, enabling reuse of exemplars and potential cost savings. The technique is complementary to other CoT advances (self-consistency, verifiers, complexity/diversity-based selection) and can be integrated to further improve results. The findings are significant for practical deployment: with small annotation budgets and minimal prompt engineering, users can substantially boost LLM reasoning quality.
Conclusion
Active-Prompt introduces an uncertainty-based active selection strategy to identify and annotate the most informative task-specific questions for chain-of-thought prompting. Using disagreement and entropy (with variance and self-confidence explored), the method consistently improves performance across eight reasoning benchmarks and multiple models, including stronger GPT-series and weaker Llama models. Extensive analyses demonstrate robustness to pool size and k, competitiveness in zero-shot settings, transferability of exemplars across models, and a clear uncertainty–accuracy relationship. Future directions include combining uncertainty with diversity/complexity criteria, enhancing cross-task prompt transfer, improving calibration (especially for logits-based measures), lowering annotation costs (e.g., via zero-shot CoT or verifiers), and evaluating on more powerful models like GPT-4.
Limitations
- Model coverage and cost: Many experiments rely on code-davinci-002 (chosen for cost reasons); comprehensive results for text-davinci-003 and especially GPT-4 are limited by API costs. Self-consistency with gpt-3.5-turbo was not fully explored for the same reason.
- Reproducibility: OpenAI discontinued access to code-davinci-002 after the experiments were run, hindering reproduction (partial access may be available via special programs).
- Transfer settings: For datasets without training splits (ASDiv, SVAMP, SingleEq), exemplars were transferred from GSM8K; cross-task transfer can limit peak gains and warrants further study.
- Self-confidence metric: Direct model self-assessment proved unreliable due to overconfidence; additional calibration or external discriminators may be needed.
- Annotation: A single annotator with minimal prompt engineering was used; although selection drove the gains, annotation quality and consistency across annotators may affect outcomes.