To Think or Not to Think: A Router for Hybrid LLMs
This article details the development of a reasoning-effort router for Large Language Models (LLMs), aiming to intelligently decide when to engage “thinking” (reasoning) mode to improve output quality versus when to skip it for faster responses.
Introduction
The concept of “test-time compute” has revolutionized how LLMs work. Models can now allocate more computational resources (tokens) to complex questions, leading to better and more accurate results. This has been seen in models like OpenAI’s o1 and DeepSeek-R1. More recently, hybrid models like Qwen3 allow users to toggle a “thinking” flag. However, choosing between a fast, non-reasoning model and a slower, reasoning-heavy one for each task can be inconvenient. This project introduces a router that predicts whether a given task requires thinking, acting as an “Auto” option.
Problem Setup
To build this router, the core requirement was paired data: for each query, one response generated with thinking mode enabled and one with it disabled, plus a label indicating which is better and whether thinking is actually needed. The goal is to train a classifier that predicts one of two labels:
- think: used when reasoning significantly improves output quality.
- no_think: used when the quality gain from reasoning doesn't justify the extra tokens.
The data was primarily synthetic, focusing on Qwen3 models. The approach involved using the same base model with different test-time computes (thinking on/off) and then scoring these outputs to create a supervision signal for the router.
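To make the supervision signal concrete, the sketch below shows what a single paired record might look like. The field names and example query are illustrative, not the actual schema of the released datasets.

```python
# Illustrative structure of one paired sample used to supervise the router.
# Field names are hypothetical; the released datasets may use a different schema.
paired_sample = {
    "query": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "response_think": "<think>120 / 1.5 = 80 ...</think>The average speed is 80 km/h.",
    "response_no_think": "The average speed is 80 km/h.",
    "label": "no_think",  # both answers are correct, so thinking isn't worth the extra tokens
}
```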
Data Collection
The datasets needed to capture both diverse real-world queries and clearly reasoning-heavy tasks, and were split into open-ended and closed-ended groups.
Open-Ended Datasets
- WildChat-filtered-Qwen3-8B-Scored: Contains real user chats.
- Nectar-Qwen3-8B: Another dataset of user conversations.
Due to budget constraints, filtering was applied: English only, single-turn interactions, samples scored highly by HuggingFaceFW/fineweb-edu-classifier, and queries originally made to GPT-4 (as a heuristic for harder questions). For each query, think-mode and no-think-mode outputs were generated and later scored by a reward model.
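As a rough illustration of the quality filter, scoring queries with the fineweb-edu classifier might look like the sketch below. The 2.5 cutoff and the example queries are assumptions (the post does not state the exact threshold), and the English-only and single-turn filters are not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Educational-quality classifier used as one of the filters (scores roughly range 0-5).
MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def edu_score(text: str) -> float:
    """Return the classifier's educational-quality score for a query."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="longest")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()

# Hypothetical cutoff; the actual threshold is not specified in the post.
SCORE_THRESHOLD = 2.5
queries = [
    "Explain why the sky appears blue at noon but red at sunset.",
    "lol what's up",
]
kept = [q for q in queries if edu_score(q) >= SCORE_THRESHOLD]
```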
Closed-Ended Datasets
These datasets feature tasks where correctness is easily verifiable and non-thinking models perform noticeably worse.
- AIME-1983-2024-Qwen3-8B: Competition math problems.
- Big-Math-RL-Qwen3-8B: Another math-focused dataset.
For these, the ground-truth answer allowed for accuracy flags:
- acc_think(x): 1 if the think-mode answer matches the ground truth, else 0.
- acc_no_think(x): 1 if the no-think answer matches the ground truth, else 0.
Data Labeling
Closed-Ended Data
Labeling is straightforward (a code sketch follows the rules below):
- If think mode is correct and no-think mode is wrong: label think.
- If no-think mode is correct and think mode is wrong: label no_think.
Edge cases:
- If both are correct: label no_think (no need for extra tokens).
- If both are wrong: label think (assuming larger future models might solve it with more reasoning).
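Put together, the closed-ended rules reduce to a small function. This is a minimal sketch of the logic, not the project's actual code:

```python
def label_closed_ended(acc_think: int, acc_no_think: int) -> str:
    """Derive the router label from the two correctness flags of a closed-ended sample."""
    if acc_think == 1 and acc_no_think == 0:
        return "think"
    if acc_no_think == 1 and acc_think == 0:
        return "no_think"
    if acc_think == 1 and acc_no_think == 1:
        return "no_think"  # both correct: the extra reasoning tokens are wasted
    return "think"         # both wrong: bet that more reasoning (or a larger model) would help

assert label_closed_ended(1, 0) == "think"
assert label_closed_ended(1, 1) == "no_think"
```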
Open-Ended Data
These were scored using the Skywork-Reward-V2-Llama-3.1-8B reward model. The think mode was chosen if its reward was higher than the no_think mode, unless the absolute difference was below a threshold ε, in which case no_think was chosen. The <think>...</think> chain-of-thought was removed before scoring.
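A minimal sketch of this scoring rule is shown below. The regex used to strip the chain-of-thought and the default value of ε are assumptions, since the post does not give the exact implementation.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_cot(response: str) -> str:
    """Remove the <think>...</think> chain-of-thought before reward scoring."""
    return THINK_BLOCK.sub("", response)

def label_open_ended(reward_think: float, reward_no_think: float, eps: float = 0.5) -> str:
    """Prefer 'think' only when its reward beats 'no_think' by at least eps.

    eps = 0.5 is a placeholder; the actual threshold is not stated in the post.
    """
    if reward_think - reward_no_think >= eps:
        return "think"
    return "no_think"

# Example: rewards within eps of each other fall back to the cheaper no_think label.
print(label_open_ended(reward_think=9.8, reward_no_think=9.6))  # -> "no_think"
```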
Training
After labeling, approximately 70k samples were available. The task was a classification problem predicting y ∈ {think, no_think}. Both BERT variants (encoder-only) and Qwen3-0.6B (decoder-only) were tested, with Qwen3-0.6B and mmBERT-small being selected.
The training curves (loss, F1, and accuracy) showed strong classification performance; detailed training logs are available on the W&B dashboard.
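A fine-tuning setup along these lines, using the Hugging Face Trainer with a small sequence classifier, might look like the sketch below. The data file, hyperparameters, and the Qwen3-0.6B backbone choice are placeholders rather than the exact configuration used; the same recipe applies to an encoder such as mmBERT-small.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

LABEL2ID = {"no_think": 0, "think": 1}

# Placeholder data file with {"query": ..., "label": ...} records.
dataset = load_dataset("json", data_files="router_train.jsonl")["train"]

backbone = "Qwen/Qwen3-0.6B"  # or an encoder such as mmBERT-small
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(
    backbone, num_labels=2, id2label={v: k for k, v in LABEL2ID.items()}
)
model.config.pad_token_id = tokenizer.pad_token_id  # decoder-only classifiers need a pad token

def preprocess(example):
    enc = tokenizer(example["query"], truncation=True, max_length=512)
    enc["labels"] = LABEL2ID[example["label"]]
    return enc

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="router-ckpt",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    tokenizer=tokenizer,
)
trainer.train()
```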
Results
The router was evaluated on test sets from the open-ended dataset and 2025 math benchmarks.
WildChat (Test)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 22.6868 | - | - | 0.0 |
| Think (Baseline) | 24.7362 | - | - | 100.0 |
| reasoning-router-0.6b | 24.0576 | +1.3708 | -0.6787 | 42.9% |
| reasoning-router-mmbert-small | 24.1236 | +1.4368 | -0.6126 | 46.3% |
| routellm-bert | 23.3694 | +0.6826 | -1.3668 | 33.5% |
| routellm-mf | 22.6865 | -0.0004 | -2.0498 | 0.2% |
Both router models clearly outperformed the no-think baseline and the RouteLLM baselines in average reward, while invoking thinking mode on fewer than half of the queries.
Nectar (Test)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 8.6090 | - | - | 0.0 |
| Think (Baseline) | 9.9649 | - | - | 100.0 |
| reasoning-router-0.6b | 9.6866 | +1.0776 | -0.2784 | 28.5% |
| reasoning-router-mmbert-small | 8.9183 | +0.3093 | -1.0466 | 9.9% |
| routellm-bert | 9.9603 | +1.3513 | -0.0046 | 99.7% |
| routellm-mf | 8.5713 | -0.0377 | -1.3936 | 1.9% |
AIME 2025 (Qwen3-8B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1267 | - | - | 3411.5 | 0.0 |
| Think (Baseline) | 0.2400 | - | - | 12835.2 | 100.0 |
| reasoning-router-0.6b | 0.1933 | +0.0667 | -0.0467 | 11815.0 | 90.0% |
| reasoning-router-mmbert-small | 0.2267 | +0.1000 | -0.0133 | 12322.1 | 93.3% |
| routellm-bert | 0.2333 | +0.1067 | -0.0067 | 12553.6 | 96.7% |
| routellm-mf | 0.1267 | +0.0000 | -0.1133 | 3411.5 | 0.0% |
The routers improve accuracy over the no-think baseline on AIME 2025 and, as the next table shows, the same routing decisions generalize reasonably to Qwen3-30B.
AIME 2025 (Qwen3-30B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1933 | - | - | 1178.2 | 0.0 |
| Think (Baseline) | 0.6400 | - | - | 11773.5 | 100.0 |
| reasoning-router-0.6b | 0.5667 | +0.3733 | -0.0733 | 11112.8 | 90.0% |
| reasoning-router-mmbert-small | 0.6067 | +0.4133 | -0.0333 | 11477.9 | 93.3% |
| routellm-bert | 0.6133 | +0.4200 | -0.0267 | 11593.9 | 96.7% |
| routellm-mf | 0.1933 | +0.0000 | -0.4467 | 1178.2 | 0.0% |
HMMT 2025 (Qwen3-8B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0467 | - | - | 5210.1 | 0.0 |
| Think (Baseline) | 0.0800 | - | - | 13628.4 | 100.0 |
| reasoning-router-0.6b | 0.0533 | +0.0067 | -0.0267 | 10425.3 | 60.0% |
| reasoning-router-mmbert-small | 0.0867 | +0.0400 | +0.0067 | 12360.6 | 83.3% |
| routellm-bert | 0.0800 | +0.0333 | +0.0000 | 13262.2 | 96.7% |
| routellm-mf | 0.0467 | +0.0000 | -0.0333 | 5210.1 | 0.0% |
HMMT 2025 (Qwen3-30B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0867 | - | - | 1038.2 | 0.0 |
| Think (Baseline) | 0.4600 | - | - | 11718.4 | 100.0 |
| reasoning-router-0.6b | 0.3133 | +0.2267 | -0.1467 | 7597.4 | 60.0% |
| reasoning-router-mmbert-small | 0.3933 | +0.3067 | -0.0667 | 9801.3 | 83.3% |
| routellm-bert | 0.4267 | +0.3400 | -0.0333 | 11281.4 | 96.7% |
| routellm-mf | 0.0867 | +0.0000 | -0.3733 | 1038.2 | 0.0% |
Limitations
- Model Architecture: The router was trained only as a transformer-based classifier; other architectures were not explored.
- Reward Model: Skywork-Reward-V2-Llama-3.1-8B is used, which might have biases.
- Data Diversity: Excludes multilingual queries, coding tasks, multi-turn conversations, and multimodal inputs due to constraints.
- Model Size: Evaluations primarily on Qwen3-8B, with limited testing on Qwen3-30B. Larger models and newer hybrid models (like OpenAI’s o3/o4-mini) are not fully explored.
- Multi-level Reasoning: Future work could extend the router beyond binary “think/no-think” to select from multiple reasoning tiers.
Conclusions
This work demonstrates that a reasoning-effort router can be effectively trained using primarily synthetic data, even without human preference data. The router reliably predicts when thinking mode is beneficial, validating the motivation behind this project. The release of models like OpenAI’s GPT-5 with built-in routers further supports this approach. This project offers a promising, lightweight method for test-time compute allocation in hybrid LLMs.
References
[1] RouteLLM: Learning to Route LLMs with Preference Data. https://arxiv.org/abs/2406.18665
[2] OpenAI o1: Learning to Reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/
[3] DeepSeek-R1: Reinforcement Learning for Reasoning. https://arxiv.org/abs/2501.12948
[4] Qwen3 Technical Report (hybrid reason/no-reason models). https://arxiv.org/abs/2505.09388
[5] OpenAI: Introducing GPT-5. https://openai.com/index/introducing-gpt-5/
[6] DeepSeek V3.1: Hybrid Reasoning Announcement. https://api-docs.deepseek.com/news/news250821
[7] Skywork Reward Model: Skywork-Reward-V2-Llama-3.1-8B. https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B
Appendix
Hugging Face Collections
The full collection of datasets, models, and router artifacts is available at: Reasoning Router Collection: https://huggingface.co/collections/AmirMohseni/reasoning-router
Datasets
- AIME competition math: AIME-1983-2024-Qwen3-8B
- Nectar open-ended chats: Nectar-Qwen3-8B
- Large-scale math RL-style data: Big-Math-RL-Qwen3-8B
- Filtered real-world WildChat data: WildChat-filtered-Qwen3-8B-Scored
Code Repository
The code is available on GitHub: LLM-Router (GitHub): https://github.com/Amir-Mohseni/LLM-Router
Qwen3-8B Sampling Parameters
```python
# Sampling presets
THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": True},
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
}

NON_THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": False},
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
}
```
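At inference time, the router's prediction simply selects between these two presets. The snippet below is an illustrative sketch: the checkpoint name is a placeholder for whichever trained router is loaded, and the presets refer to the dictionaries above.

```python
from transformers import pipeline

# Placeholder checkpoint name for a trained router from the collection above.
router = pipeline("text-classification", model="AmirMohseni/reasoning-router-0.6b")

def sampling_params_for(query: str) -> dict:
    """Pick the Qwen3 sampling preset based on the router's think/no_think prediction."""
    label = router(query)[0]["label"]
    return THINKING_PARAMS if label == "think" else NON_THINKING_PARAMS

# A short factual query would typically be routed to the cheaper no-think preset.
print(sampling_params_for("What is the capital of France?"))
```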
Citation
```bibtex
@misc{mohseni2025reasoningrouter,
  author       = {Mohseni, Amir},
  title        = {To Think or Not to Think: A Router for Hybrid LLMs},
  howpublished = {Hugging Face Blog Post},
  month        = {November},
  year         = {2025},
  url          = {https://huggingface.co/blog/AmirMohseni/reasoning-router}
}
```