Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

Computer Science

X. Zhang, C. Du, et al.

The recent chain-of-thought (CoT) method generates explicit reasoning paths but can yield suboptimal ones, while tree-of-thought (ToT) finds better paths at a high inference cost. This work shows that fine-tuning LLMs on ToT search trees via Chain of Preference Optimization (CPO) lets standard CoT decoding match or surpass ToT performance without the heavy inference overhead.
Introduction

Large language models (LLMs) benefit from generating multi-step reasoning chains (CoT), but single-path decoding can overlook better reasoning paths and commit to answers without deliberation. Tree-of-thought (ToT) introduces deliberate multi-branch exploration with evaluation to select higher-quality reasoning paths, but its inference-time search incurs substantial latency and compute cost. The central research question is whether ToT's deliberation can be folded into CoT, preserving efficiency while improving reasoning quality. The paper proposes Chain of Preference Optimization (CPO), which leverages the per-step preference signals inherent in ToT's search to fine-tune LLMs so that standard CoT inference reproduces ToT-preferred paths, aiming to achieve better performance at low latency.

Literature Review

The paper situates its contribution within three strands: (1) Reasoning with LLMs: CoT and related methods (self-consistency, least-to-most, ReAct) improve multi-step reasoning; graph/tree-based approaches (Graph-of-Thoughts, ToT) add structure and evaluation but require inference-time search. (2) LLM self-improvement: RLHF, self-rewarding, and reinforced self-training show LLMs can leverage feedback (human or model-generated) to improve without extensive labels, though many methods rely on external reward models or seed labeled data. The proposed CPO differs by using self-generated, per-step preferences without ground-truth labels. (3) Monte Carlo Tree Search (MCTS) for LLMs: MCTS enhances decoding and training but significantly increases inference latency and often needs separate value/policy models and labeled data. CPO avoids high-latency search at inference and does not require extra models, instead mining ToT’s training-time search tree for stepwise preferences.

Methodology

CPO uses ToT at training time to synthesize per-step preference pairs and trains with a DPO-style objective to align the model with ToT-preferred thoughts. The process comprises three stages. (1) Thought generation: at each reasoning step i, given state s_{i-1} = [x, z_1, ..., z_{i-1}], the LLM samples k candidate next thoughts z_i^{(j)} ~ π_θ(z_i | x, z_1, ..., z_{i-1}), using a prompt formatted with 'Step i' and a stop criterion keyed to 'Step i+1'; this yields k new states s_i^{(j)}. (2) State evaluation: the LLM itself (no external reward model) evaluates each state via prompted guidelines and task-specific demonstrations, producing a justification and a classification (likely/impossible) mapped to scores (likely = 10, impossible = 1); to reduce ordering bias, demonstration orders are shuffled and scores from multiple samples are averaged. (3) Search and pair collection: a breadth-first search with pruning retains the n highest-scoring thoughts at each step, continuing until a thought produces 'so the final answer is:', then returns the selected paths. Preferred thoughts z_i^w are those on the final selected path; dispreferred thoughts z_i^l are sibling thoughts sampled from the same parent state s_{i-1}^w but not included in the selected path. This yields a dataset D of per-step preference pairs (z_i^w, z_i^l) across the reasoning chain.

Training objective: for each per-step pair, CPO applies the DPO objective at the thought level, L_i(θ; π_ref) = -log σ( β log[π_θ(z_i^w | x, s_{i-1}^w) / π_ref(z_i^w | x, s_{i-1}^w)] - β log[π_θ(z_i^l | x, s_{i-1}^w) / π_ref(z_i^l | x, s_{i-1}^w)] ), and minimizes the expectation E_{(s_{i-1}^w, z_i^w, z_i^l) ~ D}[L_i].

Implementation details: base models LLaMA2-7B/13B and Mistral-7B; LoRA adapters (rank = 8, α = 16); β = 0.1; k = 10 thoughts per state, pruned to n = 5; sampling temperature 0.9 for SVAMP and 0.4 for the other datasets; learning rates 5e-6 (DPO) and 1e-5 (SFT); effective batch size 32 via gradient accumulation; AdamW optimizer; 4 training epochs with early stopping; results averaged over 3 random seeds; hardware: NVIDIA A100 40GB with the Accelerate backend. Data construction budgets: at most 300 training instances per dataset; typically ~200 instances yield ~6,531 preference pairs on average.
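The pair-mining step and the thought-level loss above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the tree encoding (a dict from parent state to sampled child thoughts) and the function names are our assumptions, and the loss takes precomputed log-probabilities rather than calling a model.

```python
import math

def cpo_pairs(tree, selected_path):
    """Mine per-step preference pairs from a ToT search tree.

    `tree` maps a parent state (a tuple of accepted thoughts) to the
    candidate thoughts sampled from it; `selected_path` is the chain of
    thoughts on the final selected path. At each step the selected
    thought is preferred (z_w) and its unselected siblings from the same
    parent are dispreferred (z_l). This data layout is illustrative.
    """
    pairs = []
    state = ()  # s_0: the input x alone, represented implicitly
    for z_w in selected_path:
        for z_l in tree.get(state, []):
            if z_l != z_w:
                pairs.append((state, z_w, z_l))
        state = state + (z_w,)  # advance to the preferred state s_i^w
    return pairs

def dpo_step_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Thought-level DPO loss for one pair:
    -log sigma(beta * [(log pi - log pi_ref)(z_w) - (log pi - log pi_ref)(z_l)]).
    Arguments are log-probabilities of the thought given its state.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

For example, `cpo_pairs({(): ["a", "b"], ("a",): ["c", "d"]}, ["a", "c"])` produces one pair per unselected sibling at each step, and at zero margin the loss equals log 2, the standard DPO starting point.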

Key Findings

Across seven datasets and three base LLMs, CPO improves over CoT by an average of 4.3% and up to 9.7% accuracy without additional human annotations, and matches or exceeds ToT while retaining CoT-level inference latency, averaging 57.5× faster than ToT. Compared to TS-SFT (tree-search supervised fine-tuning), CPO improves by an average of 2.7% and up to 10.3%, attributed to leveraging both selected and unselected thoughts at each step rather than only final paths.

Representative results (accuracy; latency in s/instance): LLaMA2-7B: Bamboogle CPO 32.0% vs CoT 29.6% and ToT 33.6%; FEVER CPO 53.2% vs CoT 45.8% and ToT 47.3%; SVAMP CPO 46.0% vs CoT 37.7% and ToT 42.7%; average CPO 40.9% vs CoT 36.0% and ToT 39.1%, with latency CPO ~37.9 vs ToT ~1749.1. LLaMA2-13B: average CPO 45.2% vs CoT 41.4% and ToT 45.8%; latency CPO ~56.2 vs ToT ~2186.9. Mistral-7B: average CPO 49.1% vs CoT 45.0% and ToT 47.8%; latency CPO ~47.9 vs ToT ~4229.4.

Component analyses: (a) Dispreferred-selection strategies (lowest-scored, all lower-than-preferred, all non-selected) yield similar performance; using all non-selected thoughts maximizes pair counts. (b) Per-step preference supervision (CPO) outperforms full-path preference optimization (FPO), which suffers gradient cancellation on the longest common prefix and showed a 4.6% relative performance decrease versus SFT; CPO optimizes every step, including shared prefixes. (c) Data-size ablation: with fewer than 80 instances, training can underperform the base model due to overfitting; performance improves and stabilizes beyond ~120 instances. (d) Increasing the proportion of dispreferred data (i.e., the CPO share) consistently improves performance, indicating the utility of dispreferred signals. Data mixture: training on uniform QA or mixed-type data yields +3.2% over single-task training for CPO on Bamboogle.
Iterative learning: with CoT used for inference, iterating CPO improved accuracy by roughly 4% after two iterations; running ToT inference on the fine-tuned models did not consistently improve results, possibly because reduced output diversity limits the benefit of search.
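The gradient-cancellation claim about full-path preference optimization can be made concrete with a short derivation. This illustration is ours, using the paper's π_θ/π_ref notation.

```latex
Let the preferred and dispreferred full paths share the longest common prefix $p$,
so $z^w = [p,\, r^w]$ and $z^l = [p,\, r^l]$. By the chain rule of probability,
\[
\log\frac{\pi_\theta(z^w \mid x)}{\pi_{\mathrm{ref}}(z^w \mid x)}
- \log\frac{\pi_\theta(z^l \mid x)}{\pi_{\mathrm{ref}}(z^l \mid x)}
= \underbrace{\log\frac{\pi_\theta(p \mid x)}{\pi_{\mathrm{ref}}(p \mid x)}
- \log\frac{\pi_\theta(p \mid x)}{\pi_{\mathrm{ref}}(p \mid x)}}_{=\,0}
+ \log\frac{\pi_\theta(r^w \mid x, p)}{\pi_{\mathrm{ref}}(r^w \mid x, p)}
- \log\frac{\pi_\theta(r^l \mid x, p)}{\pi_{\mathrm{ref}}(r^l \mid x, p)}.
\]
```

The prefix terms cancel inside the full-path margin, so tokens in $p$ receive no preference gradient; CPO's per-step pairs instead place a margin at every step, supervising the shared-prefix thoughts directly.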

Discussion

The findings support the hypothesis that ToT’s deliberative reasoning signals can be distilled into CoT-style inference by aligning the model to per-step preferences derived from ToT’s search process. By training with DPO at the thought level, CPO teaches the model to select ToT-preferred thoughts at each step, enabling comparable performance to ToT without the inference-time search overhead. This stepwise supervision addresses limitations of SFT on final paths and avoids gradient cancellation issues in full-path preference optimization, thereby improving reasoning quality, stability, and efficiency across QA, fact verification, and arithmetic tasks. The reduced latency and removal of dependence on external reward models or labels make CPO practical for resource-constrained deployments while preserving or enhancing accuracy.

Conclusion

Chain of Preference Optimization (CPO) leverages per-step preferences inherent in ToT’s search to fine-tune LLMs so that standard CoT decoding reproduces higher-quality reasoning paths. Experiments with LLaMA2 and Mistral models across seven datasets show consistent accuracy gains over CoT (average +4.3%) with CoT-level latency and performance on par with or better than ToT. CPO surpasses TS-SFT by exploiting both selected and unselected thoughts. Future work includes integrating CPO with other structured reasoning paradigms (Graph-of-Thoughts, AlphaZero-like tree search), exploring weak-to-strong alignment by using weaker evaluators for stronger generators, accelerating training-time data generation via non-autoregressive decoding, speculative decoding, and KV-cache pruning, and expanding to broader tasks and modalities.

Limitations

CPO requires ToT-style search to generate training data, which is time-consuming; accelerating data generation is necessary (e.g., non-autoregressive generation, speculative decoding, KV cache pruning). The study focuses on text-only language models and a limited set of downstream tasks; generalization to other modalities (vision-language) and diverse applications remains to be demonstrated. Ethical considerations are critical: while the method can improve safety compliance (e.g., via constitutions), it could be misused to align models for malicious applications. Performance with ToT inference on fine-tuned models may decline due to reduced output diversity affecting search effectiveness.
