To Think or Not to Think: A Router for Hybrid LLMs
This article details the development of a reasoning-effort router for Large Language Models (LLMs), aiming to intelligently decide when to engage “thinking” (reasoning) mode to improve output quality versus when to skip it for faster responses.
Introduction
The concept of “test-time compute” has revolutionized how LLMs work. Models can now allocate more computational resources (tokens) to complex questions, leading to better and more accurate results. This has been seen in models like OpenAI’s o1 and DeepSeek-R1. More recently, hybrid models like Qwen3 allow users to toggle a “thinking” flag. However, choosing between a fast, non-reasoning model and a slower, reasoning-heavy one for each task can be inconvenient. This project introduces a router that predicts whether a given task requires thinking, acting as an “Auto” option.
Problem Setup
To build this router, the core requirement was paired data: for each query, one response generated with thinking mode enabled and one with it disabled, plus a label indicating which is better and whether thinking is actually needed. The goal is to train a classifier that predicts one of two labels:
- think: used when reasoning significantly improves output quality.
- no_think: used when the quality gain from reasoning doesn't justify the extra tokens.
The data was primarily synthetic, focusing on Qwen3 models. The approach involved using the same base model with different test-time computes (thinking on/off) and then scoring these outputs to create a supervision signal for the router.
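To make the supervision signal concrete, the sketch below shows what a single paired record might look like. The field names and example query are illustrative, not the actual schema of the released datasets.

```python
# Illustrative structure of one paired sample used to supervise the router.
# Field names are hypothetical; the released datasets may use a different schema.
paired_sample = {
    "query": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "response_think": "<think>120 / 1.5 = 80 ...</think>The average speed is 80 km/h.",
    "response_no_think": "The average speed is 80 km/h.",
    "label": "no_think",  # both answers are correct, so thinking isn't worth the extra tokens
}
```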
Data Collection
The datasets needed to capture both diverse real-world queries and clearly reasoning-heavy tasks, and were split into open-ended and closed-ended groups.
Open-Ended Datasets
- WildChat-filtered-Qwen3-8B-Scored: Contains real user chats.
- Nectar-Qwen3-8B: Another dataset of user conversations.
Due to budget constraints, filtering was applied: English only, single-turn interactions, samples scored highly by HuggingFaceFW/fineweb-edu-classifier, and queries originally made to GPT-4 (as a heuristic for harder questions). For each query, think-mode and no-think-mode outputs were generated and later scored by a reward model.
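As a rough illustration of the quality filter, scoring queries with the fineweb-edu classifier might look like the sketch below. The 2.5 cutoff and the example queries are assumptions (the post does not state the exact threshold), and the English-only and single-turn filters are not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Educational-quality classifier used as one of the filters (scores roughly range 0-5).
MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def edu_score(text: str) -> float:
    """Return the classifier's educational-quality score for a query."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="longest")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()

# Hypothetical cutoff; the actual threshold is not specified in the post.
SCORE_THRESHOLD = 2.5
queries = [
    "Explain why the sky appears blue at noon but red at sunset.",
    "lol what's up",
]
kept = [q for q in queries if edu_score(q) >= SCORE_THRESHOLD]
```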
Closed-Ended Datasets
These datasets feature tasks where correctness is easily verifiable and non-thinking models perform noticeably worse.
- AIME-1983-2024-Qwen3-8B: Competition math problems.
- Big-Math-RL-Qwen3-8B: Another math-focused dataset.
For these, the ground-truth answer allowed for accuracy flags:
- acc_think(x): 1 if the think-mode answer matches the ground truth, else 0.
- acc_no_think(x): 1 if the no-think answer matches the ground truth, else 0.
Data Labeling
Closed-Ended Data
Labeling is straightforward (a code sketch follows the rules below):
- If think mode is correct and no-think mode is wrong: label think.
- If no-think mode is correct and think mode is wrong: label no_think.
Edge cases:
- If both are correct: label no_think (no need for extra tokens).
- If both are wrong: label think (assuming larger future models might solve it with more reasoning).
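Put together, the closed-ended rules reduce to a small function. This is a minimal sketch of the logic, not the project's actual code:

```python
def label_closed_ended(acc_think: int, acc_no_think: int) -> str:
    """Derive the router label from the two correctness flags of a closed-ended sample."""
    if acc_think == 1 and acc_no_think == 0:
        return "think"
    if acc_no_think == 1 and acc_think == 0:
        return "no_think"
    if acc_think == 1 and acc_no_think == 1:
        return "no_think"  # both correct: the extra reasoning tokens are wasted
    return "think"         # both wrong: bet that more reasoning (or a larger model) would help

assert label_closed_ended(1, 0) == "think"
assert label_closed_ended(1, 1) == "no_think"
```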
Open-Ended Data
These were scored using the Skywork-Reward-V2-Llama-3.1-8B reward model. The think mode was chosen if its reward was higher than the no_think mode, unless the absolute difference was below a threshold ε, in which case no_think was chosen. The <think>...</think> chain-of-thought was removed before scoring.
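A minimal sketch of this scoring rule is shown below. The regex used to strip the chain-of-thought and the default value of ε are assumptions, since the post does not give the exact implementation.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_cot(response: str) -> str:
    """Remove the <think>...</think> chain-of-thought before reward scoring."""
    return THINK_BLOCK.sub("", response)

def label_open_ended(reward_think: float, reward_no_think: float, eps: float = 0.5) -> str:
    """Prefer 'think' only when its reward beats 'no_think' by at least eps.

    eps = 0.5 is a placeholder; the actual threshold is not stated in the post.
    """
    if reward_think - reward_no_think >= eps:
        return "think"
    return "no_think"

# Example: rewards within eps of each other fall back to the cheaper no_think label.
print(label_open_ended(reward_think=9.8, reward_no_think=9.6))  # -> "no_think"
```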
Training
After labeling, approximately 70k samples were available. The task was a classification problem predicting y ∈ {think, no_think}. Both BERT variants (encoder-only) and Qwen3-0.6B (decoder-only) were tested, with Qwen3-0.6B and mmBERT-small being selected.
The training curves (loss, F1, and accuracy) showed strong classification performance; detailed training logs are available on the W&B dashboard.
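A fine-tuning setup along these lines, using the Hugging Face Trainer with a small sequence classifier, might look like the sketch below. The data file, hyperparameters, and the Qwen3-0.6B backbone choice are placeholders rather than the exact configuration used; the same recipe applies to an encoder such as mmBERT-small.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

LABEL2ID = {"no_think": 0, "think": 1}

# Placeholder data file with {"query": ..., "label": ...} records.
dataset = load_dataset("json", data_files="router_train.jsonl")["train"]

backbone = "Qwen/Qwen3-0.6B"  # or an encoder such as mmBERT-small
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(
    backbone, num_labels=2, id2label={v: k for k, v in LABEL2ID.items()}
)
model.config.pad_token_id = tokenizer.pad_token_id  # decoder-only classifiers need a pad token

def preprocess(example):
    enc = tokenizer(example["query"], truncation=True, max_length=512)
    enc["labels"] = LABEL2ID[example["label"]]
    return enc

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="router-ckpt",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    tokenizer=tokenizer,
)
trainer.train()
```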
Results
The router was evaluated on test sets from the open-ended dataset and 2025 math benchmarks.
WildChat (Test)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 22.6868 | - | - | 0.0 |
| Think (Baseline) | 24.7362 | - | - | 100.0 |
| reasoning-router-0.6b | 24.0576 | +1.3708 | -0.6787 | 42.9% |
| reasoning-router-mmbert-small | 24.1236 | +1.4368 | -0.6126 | 46.3% |
| routellm-bert | 23.3694 | +0.6826 | -1.3668 | 33.5% |
| routellm-mf | 22.6865 | -0.0004 | -2.0498 | 0.2% |
Both router models clearly outperformed the no-think baseline and the RouteLLM baselines in average reward, while invoking thinking mode on fewer than half of the queries.
Nectar (Test)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 8.6090 | - | - | 0.0 |
| Think (Baseline) | 9.9649 | - | - | 100.0 |
| reasoning-router-0.6b | 9.6866 | +1.0776 | -0.2784 | 28.5% |
| reasoning-router-mmbert-small | 8.9183 | +0.3093 | -1.0466 | 9.9% |
| routellm-bert | 9.9603 | +1.3513 | -0.0046 | 99.7% |
| routellm-mf | 8.5713 | -0.0377 | -1.3936 | 1.9% |
AIME 2025 (Qwen3-8B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1267 | - | - | 3411.5 | 0.0 |
| Think (Baseline) | 0.2400 | - | - | 12835.2 | 100.0 |
| reasoning-router-0.6b | 0.1933 | +0.0667 | -0.0467 | 11815.0 | 90.0% |
| reasoning-router-mmbert-small | 0.2267 | +0.1000 | -0.0133 | 12322.1 | 93.3% |
| routellm-bert | 0.2333 | +0.1067 | -0.0067 | 12553.6 | 96.7% |
| routellm-mf | 0.1267 | +0.0000 | -0.1133 | 3411.5 | 0.0% |
The routers improve accuracy over the no-think baseline on AIME 2025 and, as the next table shows, the same routing decisions generalize reasonably to Qwen3-30B.
AIME 2025 (Qwen3-30B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1933 | - | - | 1178.2 | 0.0 |
| Think (Baseline) | 0.6400 | - | - | 11773.5 | 100.0 |
| reasoning-router-0.6b | 0.5667 | +0.3733 | -0.0733 | 11112.8 | 90.0% |
| reasoning-router-mmbert-small | 0.6067 | +0.4133 | -0.0333 | 11477.9 | 93.3% |
| routellm-bert | 0.6133 | +0.4200 | -0.0267 | 11593.9 | 96.7% |
| routellm-mf | 0.1933 | +0.0000 | -0.4467 | 1178.2 | 0.0% |
HMMT 2025 (Qwen3-8B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0467 | - | - | 5210.1 | 0.0 |
| Think (Baseline) | 0.0800 | - | - | 13628.4 | 100.0 |
| reasoning-router-0.6b | 0.0533 | +0.0067 | -0.0267 | 10425.3 | 60.0% |
| reasoning-router-mmbert-small | 0.0867 | +0.0400 | +0.0067 | 12360.6 | 83.3% |
| routellm-bert | 0.0800 | +0.0333 | +0.0000 | 13262.2 | 96.7% |
| routellm-mf | 0.0467 | +0.0000 | -0.0333 | 5210.1 | 0.0% |
HMMT 2025 (Qwen3-30B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0867 | - | - | 1038.2 | 0.0 |
| Think (Baseline) | 0.4600 | - | - | 11718.4 | 100.0 |
| reasoning-router-0.6b | 0.3133 | +0.2267 | -0.1467 | 7597.4 | 60.0% |
| reasoning-router-mmbert-small | 0.3933 | +0.3067 | -0.0667 | 9801.3 | 83.3% |
| routellm-bert | 0.4267 | +0.3400 | -0.0333 | 11281.4 | 96.7% |
| routellm-mf | 0.0867 | +0.0000 | -0.3733 | 1038.2 | 0.0% |
Limitations
- Model Architecture: The router was trained only as a transformer-based classifier; other architectures were not explored.
- Reward Model: Skywork-Reward-V2-Llama-3.1-8B is used, which might have biases.
- Data Diversity: Excludes multilingual queries, coding tasks, multi-turn conversations, and multimodal inputs due to constraints.
- Model Size: Evaluations primarily on Qwen3-8B, with limited testing on Qwen3-30B. Larger models and newer hybrid models (like OpenAI’s o3/o4-mini) are not fully explored.
- Multi-level Reasoning: Future work could extend the router beyond binary “think/no-think” to select from multiple reasoning tiers.
Conclusions
This work demonstrates that a reasoning-effort router can be effectively trained using primarily synthetic data, even without human preference data. The router reliably predicts when thinking mode is beneficial, validating the motivation behind this project. The release of models like OpenAI’s GPT-5 with built-in routers further supports this approach. This project offers a promising, lightweight method for test-time compute allocation in hybrid LLMs.
References
[1] RouteLLM: Learning to Route LLMs with Preference Data. https://arxiv.org/abs/2406.18665
[2] OpenAI o1: Learning to Reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/
[3] DeepSeek-R1: Reinforcement Learning for Reasoning. https://arxiv.org/abs/2501.12948
[4] Qwen3 Technical Report (hybrid reason/no-reason models). https://arxiv.org/abs/2505.09388
[5] OpenAI: Introducing GPT-5. https://openai.com/index/introducing-gpt-5/
[6] DeepSeek V3.1: Hybrid Reasoning Announcement. https://api-docs.deepseek.com/news/news250821
[7] Skywork Reward Model: Skywork-Reward-V2-Llama-3.1-8B. https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B
Appendix
Hugging Face Collections
The full collection of datasets, models, and router artifacts is available at: Reasoning Router Collection: https://huggingface.co/collections/AmirMohseni/reasoning-router
Datasets
- AIME competition math: AIME-1983-2024-Qwen3-8B
- Nectar open-ended chats: Nectar-Qwen3-8B
- Large-scale math RL-style data: Big-Math-RL-Qwen3-8B
- Filtered real-world WildChat data: WildChat-filtered-Qwen3-8B-Scored
Code Repository
The code is available on GitHub: LLM-Router (GitHub): https://github.com/Amir-Mohseni/LLM-Router
Qwen3-8B Sampling Parameters
```python
# Sampling presets
THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": True},
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
}

NON_THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": False},
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
}
```
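At inference time, the router's prediction simply selects between these two presets. The snippet below is an illustrative sketch: the checkpoint name is a placeholder for whichever trained router is loaded, and the presets refer to the dictionaries above.

```python
from transformers import pipeline

# Placeholder checkpoint name for a trained router from the collection above.
router = pipeline("text-classification", model="AmirMohseni/reasoning-router-0.6b")

def sampling_params_for(query: str) -> dict:
    """Pick the Qwen3 sampling preset based on the router's think/no_think prediction."""
    label = router(query)[0]["label"]
    return THINKING_PARAMS if label == "think" else NON_THINKING_PARAMS

# A short factual query would typically be routed to the cheaper no-think preset.
print(sampling_params_for("What is the capital of France?"))
```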
Citation
```bibtex
@misc{mohseni2025reasoningrouter,
  author       = {Mohseni, Amir},
  title        = {To Think or Not to Think: A Router for Hybrid LLMs},
  howpublished = {Hugging Face Blog Post},
  month        = {November},
  year         = {2025},
  url          = {https://huggingface.co/blog/AmirMohseni/reasoning-router}
}
```