To Think or Not to Think: A Router for Hybrid LLMs

This article details the development of a reasoning-effort router for Large Language Models (LLMs), aiming to intelligently decide when to engage “thinking” (reasoning) mode to improve output quality versus when to skip it for faster responses.

Introduction

The concept of “test-time compute” has revolutionized how LLMs work. Models can now allocate more computational resources (tokens) to complex questions, leading to better and more accurate results. This has been seen in models like OpenAI’s o1 and DeepSeek-R1. More recently, hybrid models like Qwen3 allow users to toggle a “thinking” flag. However, choosing between a fast, non-reasoning model and a slower, reasoning-heavy one for each task can be inconvenient. This project introduces a router that predicts whether a given task requires thinking, acting as an “Auto” option.

Problem Setup

To build this router, the core need was paired data: for each query, we need a response generated with thinking mode enabled and another generated with it disabled, along with a label indicating which is better or if thinking is unnecessary. The goal is to train a classifier to predict a label: think or no_think.

  • think: Used when reasoning significantly improves quality.
  • no_think: Used when reasoning doesn’t justify the extra tokens.

The data was primarily synthetic and focused on Qwen3 models: each query was answered by the same base model under two test-time compute settings (thinking enabled and disabled), and the two outputs were then scored to create a supervision signal for the router.
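To make this concrete, here is a minimal sketch of how one paired sample could be generated with transformers and Qwen3's enable_thinking switch. The helper below and its generation settings are illustrative rather than the project's exact pipeline; the actual sampling presets used are listed in the Appendix.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def generate(query: str, enable_thinking: bool, **sampling) -> str:
    # Qwen3's chat template exposes an `enable_thinking` flag that toggles reasoning mode.
    messages = [{"role": "user", "content": query}]
    prompt_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(prompt_ids, max_new_tokens=2048, do_sample=True, **sampling)
    return tokenizer.decode(out[0][prompt_ids.shape[-1]:], skip_special_tokens=True)

# One paired sample for the router's training data.
query = "A train leaves at 9:00 and travels 120 km at 80 km/h. When does it arrive?"
response_think = generate(query, enable_thinking=True, temperature=0.6, top_p=0.95, top_k=20)
response_no_think = generate(query, enable_thinking=False, temperature=0.7, top_p=0.8, top_k=20)
```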

Data Collection

The datasets needed to capture diverse real-world queries and clearly reasoning-heavy tasks. They were categorized into open-ended and closed-ended datasets.

Open-Ended Datasets

  • WildChat-filtered-Qwen3-8B-Scored: Contains real user chats.
  • Nectar-Qwen3-8B: Another dataset of user conversations.

Due to budget constraints, filtering was applied: English only, single-turn interactions, samples scored highly by HuggingFaceFW/fineweb-edu-classifier, and queries originally made to GPT-4 (as a heuristic for harder questions). For each query, think-mode and no-think-mode outputs were generated and later scored by a reward model.
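As a rough sketch of what this filtering could look like over a WildChat-style dump: the dataset id and column names (language, turn, model, conversation) follow the public WildChat schema but should be treated as assumptions here, and the edu-score threshold is purely illustrative.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Educational-quality scorer used as a proxy for queries worth routing.
clf_name = "HuggingFaceFW/fineweb-edu-classifier"
clf_tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name)

def edu_score(text: str) -> float:
    inputs = clf_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return clf(**inputs).logits.squeeze(-1).item()

ds = load_dataset("allenai/WildChat-1M", split="train")  # gated dataset, license acceptance required
ds = ds.filter(
    lambda ex: ex["language"] == "English"       # English only
    and ex["turn"] == 1                          # single-turn interactions
    and ex["model"].startswith("gpt-4")          # originally asked to GPT-4 (harder-query heuristic)
)
# Keep only queries the edu classifier rates highly (the 2.5 cutoff is illustrative).
ds = ds.filter(lambda ex: edu_score(ex["conversation"][0]["content"]) >= 2.5)
```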

Closed-Ended Datasets

These datasets feature tasks where correctness is easily verifiable and non-thinking models perform noticeably worse.

  • AIME-1983-2024-Qwen3-8B: Competition math problems.
  • Big-Math-RL-Qwen3-8B: Another math-focused dataset.

For these, the ground-truth answer allowed for accuracy flags:

  • acc_think(x): 1 if think-mode answer matches ground truth, else 0.
  • acc_no_think(x): 1 if no-think answer matches ground truth, else 0.

Data Labeling

Closed-Ended Data

Labeling is straightforward (a code sketch follows the edge cases below):

  • If think-mode is correct and no-think-mode is wrong: label think.
  • If no-think is correct and think-mode is wrong: label no_think.

Edge cases:

  • If both are correct: label no_think (no need for extra tokens).
  • If both are wrong: label think (assuming larger future models might solve it with more reasoning).
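Putting the two rules and the edge cases together, the closed-ended labeling reduces to a few lines. This is a minimal sketch assuming the acc_think / acc_no_think flags defined above.

```python
def label_closed_ended(acc_think: int, acc_no_think: int) -> str:
    """Assign a router label from the two correctness flags."""
    if acc_think == 1 and acc_no_think == 0:
        return "think"       # reasoning was needed to get it right
    if acc_no_think == 1 and acc_think == 0:
        return "no_think"    # reasoning only burned extra tokens
    if acc_think == 1 and acc_no_think == 1:
        return "no_think"    # both correct: skip the extra tokens
    return "think"           # both wrong: bet that more reasoning (or a larger model) helps
```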

Open-Ended Data

These were scored using the Skywork-Reward-V2-Llama-3.1-8B reward model. The think mode was chosen if its reward was higher than the no_think mode, unless the absolute difference was below a threshold ε, in which case no_think was chosen. The <think>...</think> chain-of-thought was removed before scoring.
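A minimal sketch of this rule, assuming the two responses have already been scored by the reward model; the value of ε and the helper names are placeholders, not the project's exact settings.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_cot(response: str) -> str:
    """Remove the <think>...</think> chain-of-thought before reward scoring."""
    return THINK_BLOCK.sub("", response).strip()

def label_open_ended(reward_think: float, reward_no_think: float, eps: float = 0.5) -> str:
    """Prefer think only when its reward beats no_think by more than the margin eps."""
    if reward_think - reward_no_think > eps:
        return "think"
    return "no_think"
```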

Training

After labeling, approximately 70k samples were available. The task was a classification problem predicting y ∈ {think, no_think}. Both BERT variants (encoder-only) and Qwen3-0.6B (decoder-only) were tested, with Qwen3-0.6B and mmBERT-small being selected.
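A hedged sketch of the fine-tuning setup, using the Hugging Face Trainer for binary sequence classification. The checkpoint name, hyperparameters, and the toy dataset below are illustrative; in practice the training set is the ~70k labeled examples described above.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

BASE = "jhu-clsp/mmBERT-small"   # or "Qwen/Qwen3-0.6B" for the decoder-only variant
LABELS = {"no_think": 0, "think": 1}

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)

# Tiny stand-in for the real labeled data (query text + think/no_think label).
raw = Dataset.from_list([
    {"query": "Prove that the square root of 2 is irrational.", "label": "think"},
    {"query": "What is the capital of France?", "label": "no_think"},
])

def preprocess(example):
    enc = tokenizer(example["query"], truncation=True, max_length=512)
    enc["labels"] = LABELS[example["label"]]
    return enc

train_ds = raw.map(preprocess, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("reasoning-router", learning_rate=2e-5,
                           per_device_train_batch_size=32, num_train_epochs=3),
    train_dataset=train_ds,
    processing_class=tokenizer,
)
trainer.train()
```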

The training curves (loss, F1, accuracy) showed good performance. The W&B dashboard provides detailed training logs.
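Once trained, the router is just a small text classifier sitting in front of the hybrid model. The sketch below shows that usage; the checkpoint id is an assumed, illustrative repo name (see the Hugging Face collection in the Appendix for the released routers), and it assumes the model's label names match the think / no_think classes.

```python
from transformers import pipeline

# Illustrative checkpoint id; the released routers live in the project's HF collection.
router = pipeline("text-classification", model="AmirMohseni/reasoning-router-mmbert-small")

def should_think(query: str) -> bool:
    """Return True if the router predicts that thinking mode is worth the extra tokens."""
    return router(query)[0]["label"] == "think"

query = "Find all integer solutions of x^2 + y^2 = 2024."
enable_thinking = should_think(query)
# The flag is then forwarded to the hybrid model, e.g. via Qwen3's
# chat_template_kwargs={"enable_thinking": enable_thinking}.
```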

Results

The router was evaluated on test sets from the open-ended dataset and 2025 math benchmarks.

WildChat (Test)

| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 22.6868 | - | - | 0.0 |
| Think (Baseline) | 24.7362 | - | - | 100.0 |
| reasoning-router-0.6b | 24.0576 | +1.3708 | -0.6787 | 42.9 |
| reasoning-router-mmbert-small | 24.1236 | +1.4368 | -0.6126 | 46.3 |
| routellm-bert | 23.3694 | +0.6826 | -1.3668 | 33.5 |
| routellm-mf | 22.6865 | -0.0004 | -2.0498 | 0.2 |

On WildChat, both router models clearly outperform the no-think baseline and the RouteLLM baselines while routing fewer than half of the queries to thinking mode, so they spend far fewer tokens than always thinking.
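For reference, the routed numbers in these tables follow from a simple per-query selection rule. The sketch below assumes each evaluation example already carries both rewards and the router's prediction; it is not the project's exact evaluation script.

```python
def evaluate_router(examples):
    """examples: dicts with 'reward_think', 'reward_no_think', and 'route' in {'think', 'no_think'}."""
    rewards, think_count = [], 0
    for ex in examples:
        if ex["route"] == "think":
            rewards.append(ex["reward_think"])
            think_count += 1
        else:
            rewards.append(ex["reward_no_think"])
    avg_reward = sum(rewards) / len(rewards)
    think_pct = 100.0 * think_count / len(rewards)
    return avg_reward, think_pct
```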

Nectar (Test)

| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 8.6090 | - | - | 0.0 |
| Think (Baseline) | 9.9649 | - | - | 100.0 |
| reasoning-router-0.6b | 9.6866 | +1.0776 | -0.2784 | 28.5 |
| reasoning-router-mmbert-small | 8.9183 | +0.3093 | -1.0466 | 9.9 |
| routellm-bert | 9.9603 | +1.3513 | -0.0046 | 99.7 |
| routellm-mf | 8.5713 | -0.0377 | -1.3936 | 1.9 |

AIME 2025 (Qwen3-8B)

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1267 | - | - | 3411.5 | 0.0 |
| Think (Baseline) | 0.2400 | - | - | 12835.2 | 100.0 |
| reasoning-router-0.6b | 0.1933 | +0.0667 | -0.0467 | 11815.0 | 90.0 |
| reasoning-router-mmbert-small | 0.2267 | +0.1000 | -0.0133 | 12322.1 | 93.3 |
| routellm-bert | 0.2333 | +0.1067 | -0.0067 | 12553.6 | 96.7 |
| routellm-mf | 0.1267 | +0.0000 | -0.1133 | 3411.5 | 0.0 |

The router shows improved accuracy over no-think baselines and generalizes reasonably to Qwen3-30B.

AIME 2025 (Qwen3-30B)

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1933 | - | - | 1178.2 | 0.0 |
| Think (Baseline) | 0.6400 | - | - | 11773.5 | 100.0 |
| reasoning-router-0.6b | 0.5667 | +0.3733 | -0.0733 | 11112.8 | 90.0 |
| reasoning-router-mmbert-small | 0.6067 | +0.4133 | -0.0333 | 11477.9 | 93.3 |
| routellm-bert | 0.6133 | +0.4200 | -0.0267 | 11593.9 | 96.7 |
| routellm-mf | 0.1933 | +0.0000 | -0.4467 | 1178.2 | 0.0 |

HMMT 2025 (Qwen3-8B)

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0467 | - | - | 5210.1 | 0.0 |
| Think (Baseline) | 0.0800 | - | - | 13628.4 | 100.0 |
| reasoning-router-0.6b | 0.0533 | +0.0067 | -0.0267 | 10425.3 | 60.0 |
| reasoning-router-mmbert-small | 0.0867 | +0.0400 | +0.0067 | 12360.6 | 83.3 |
| routellm-bert | 0.0800 | +0.0333 | +0.0000 | 13262.2 | 96.7 |
| routellm-mf | 0.0467 | +0.0000 | -0.0333 | 5210.1 | 0.0 |

HMMT 2025 (Qwen3-30B)

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0867 | - | - | 1038.2 | 0.0 |
| Think (Baseline) | 0.4600 | - | - | 11718.4 | 100.0 |
| reasoning-router-0.6b | 0.3133 | +0.2267 | -0.1467 | 7597.4 | 60.0 |
| reasoning-router-mmbert-small | 0.3933 | +0.3067 | -0.0667 | 9801.3 | 83.3 |
| routellm-bert | 0.4267 | +0.3400 | -0.0333 | 11281.4 | 96.7 |
| routellm-mf | 0.0867 | +0.0000 | -0.3733 | 1038.2 | 0.0 |

Limitations

  • Model Architecture: The router was only trained with Transformer backbones; other architectures were not explored.
  • Reward Model: Open-ended labels depend on Skywork-Reward-V2-Llama-3.1-8B, so any biases of that reward model propagate into the training signal.
  • Data Diversity: Multilingual queries, coding tasks, multi-turn conversations, and multimodal inputs were excluded due to budget constraints.
  • Model Size: Evaluations were run primarily on Qwen3-8B, with limited testing on Qwen3-30B. Larger models and newer hybrid models (such as OpenAI’s o3/o4-mini) were not fully explored.
  • Multi-level Reasoning: Future work could extend the router beyond binary “think/no-think” to select from multiple reasoning tiers.

Conclusions

This work demonstrates that a reasoning-effort router can be effectively trained using primarily synthetic data, even without human preference data. The router reliably predicts when thinking mode is beneficial, validating the motivation behind this project. The release of models like OpenAI’s GPT-5 with built-in routers further supports this approach. This project offers a promising, lightweight method for test-time compute allocation in hybrid LLMs.

References

[1] RouteLLM: Learning to Route LLMs with Preference Data. https://arxiv.org/abs/2406.18665
[2] OpenAI o1: Learning to Reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/
[3] DeepSeek-R1: Reinforcement Learning for Reasoning. https://arxiv.org/abs/2501.12948
[4] Qwen3 Technical Report (hybrid reason/no-reason models). https://arxiv.org/abs/2505.09388
[5] OpenAI: Introducing GPT-5. https://openai.com/index/introducing-gpt-5/
[6] DeepSeek V3.1: Hybrid Reasoning Announcement. https://api-docs.deepseek.com/news/news250821
[7] Skywork-Reward-V2-Llama-3.1-8B reward model. https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B

Appendix

Hugging Face Collections

The full collection of datasets, models, and router artifacts is available at: Reasoning Router Collection: https://huggingface.co/collections/AmirMohseni/reasoning-router

Datasets

The paired-generation datasets described above are part of the collection:

  • WildChat-filtered-Qwen3-8B-Scored
  • Nectar-Qwen3-8B
  • AIME-1983-2024-Qwen3-8B
  • Big-Math-RL-Qwen3-8B

Code Repository

The code is available on GitHub: LLM-Router (GitHub): https://github.com/Amir-Mohseni/LLM-Router

Qwen3-8B Sampling Parameters

```python
# Sampling presets
THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": True},
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
}

NON_THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": False},
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
}
```
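As a usage note, these presets map naturally onto an OpenAI-compatible endpoint (for example, a vLLM server hosting Qwen/Qwen3-8B), where chat_template_kwargs is passed through extra_body. This is a hedged sketch under that assumption, not the project's exact serving setup.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM server

def ask(query: str, params: dict) -> str:
    p = dict(params)
    chat_template_kwargs = p.pop("chat_template_kwargs")
    p.pop("do_sample", None)  # sampling is implied by temperature/top_p on an OpenAI-style API
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": query}],
        temperature=p["temperature"],
        top_p=p["top_p"],
        # vLLM accepts non-standard sampling fields and template kwargs via extra_body.
        extra_body={
            "chat_template_kwargs": chat_template_kwargs,
            "top_k": p["top_k"],
            "min_p": p["min_p"],
        },
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 24?", NON_THINKING_PARAMS))
```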

Citation

```bibtex
@misc{mohseni2025reasoningrouter,
  author       = {Mohseni, Amir},
  title        = {To Think or Not to Think: A Router for Hybrid LLMs},
  howpublished = {Hugging Face Blog Post},
  month        = {November},
  year         = {2025},
  url          = {https://huggingface.co/blog/AmirMohseni/reasoning-router}
}
```
