Budget Alignment: Making Models Reason in the User’s Language
Authors: Shan Chen*, Jirui Qi*, Zidi Xiong, Timothy Miller, Arianna Bisazza, Raquel Fernández, and Danielle Bitterman
Institutions: Mass General Brigham | Harvard University | University of Groningen | University of Amsterdam
Please read this as a late-stage work in progress: we are sharing it the way colleagues would in a lab meeting, to help motivate potential parallel research.
Introduction
When you ask a large language model (LLM) a math question in Japanese, it responds politely in Japanese—but internally, it’s reasoning in English/Chinese. Variables, steps, and mathematical lemmas often silently switch languages during the reasoning process. This behavior, where models default to English for chain-of-thought (CoT) reasoning, is more than just a quirk. It interferes with instruction following, confuses human reviewers, and undermines the goal of multilingual evaluation.
The objective is straightforward: we want models to reason about a question in the same language it was asked, not just answer in that language. However, this proves to be more challenging than it seems. Forcing models to reason in languages other than English typically leads to a decrease in accuracy. Previous research indicates that instructing models to reason only in the prompt language through prompting or steering improves coherence and alignment with grading 1, but this often comes at a significant cost to accuracy. Even a small amount of multilingual fine-tuning helps, but doesn’t eliminate this trade-off 2. Furthermore, models not only prefer to reason in English, but they also reason more effectively in English. When researchers enforce strict reasoning within the prompt language (e.g., in Swahili or Thai), models often lose accuracy compared to when they are allowed to reason in English. For higher-resource languages like French or German, this trade-off is less pronounced—models can reason in-language almost as well as in English. For low-resource languages, enforcing strict in-language reasoning harms performance more significantly.
Why do models switch to English in the first place? A large part of the reason lies in their training. Most reasoning data is in English. Fine-tuning even strong multilingual models on English CoT data often causes them to adopt English as their “internal language of logic.” Yong et al. (2025) observed a “quote-and-think” behavior 3, where models copy phrases from the prompt language but then explain everything in English 4. The model understands the question in the non-English language—it just prefers to reason in English.
Our technical goal is simple: stop the switching without sacrificing accuracy—ideally, push the Pareto frontier of (Accuracy, Language-consistency). And we want to share this as a practical guide with lessons we learned along the way.
Our code and data are available at:
🔗 https://github.com/Betswish/mCoT-XReasoning/tree/main/training
🤔 https://github.com/Betswish/mCoT-pass-K
🤗 https://huggingface.co/collections/shanchen/xreasoning
What we try (Method in two steps)
🔧 Base model. deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, a large reasoning model distilled from R1 through supervised fine-tuning on its reasoning traces, exhibiting an English/Chinese-dominant prior.
Step 1 — Small SFT to teach in-language reasoning. We fine-tune on 817 curated multilingual reasoning chains (from LiMO 5). The supervision set consists of high-quality reasoning chains matching R1's long-form reasoning style. No Reinforcement Learning (RL) here; we just teach the policy to keep reasoning in the user's query language.
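To make Step 1 concrete, a single training record could look like the sketch below. This is an illustrative example we wrote for this post (not an item from the actual 817-chain set); the property that matters is that the whole chain of thought, not just the final answer, stays in the query language.

```python
# Illustrative SFT record (written for this post, not from the actual 817-chain set):
# the reasoning trace stays in the query language and ends with a boxed answer,
# mirroring the R1-style long-form format.
record = {
    "prompt": "2x + 3 = 11 のとき、x の値を求めてください。",
    "response": (
        "<think>\n"
        "両辺から 3 を引くと 2x = 8 になります。\n"
        "両辺を 2 で割ると x = 4 です。\n"
        "</think>\n"
        "答えは \\boxed{4} です。"
    ),
}
```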
Step 2 — Math-only GRPO to push accuracy while retaining the reasoning language. We run RLVR-style GRPO with no KL penalty and an asymmetric clip range whose upper bound is raised to 0.28 from the standard 0.2 (DAPO-like 6), 24 rollouts per prompt, LoRA r=8, and LR = 1e-5, trained only on a Math-500 set translated into each language. Intuition: let RL optimize hard cases and verification behaviors, while the raised clip reduces catastrophic collapse of the reasoning style back to English.
We set the verifiable rewards as 1.0 for accuracy, 0.2 for language consistency of reasoning traces, and 0.2 for answer format 7.
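As a rough illustration of how such a composite verifiable reward could be computed, here is a minimal sketch. The helper logic (toy exact-match answer check, off-the-shelf language identification via langdetect) is ours and simpler than the verifiers in the released training code.

```python
import re
from langdetect import detect  # off-the-shelf language ID; the repo's verifier may differ

def composite_reward(completion: str, gold_answer: str, target_lang: str) -> float:
    """Verifiable reward with the weights above: 1.0 accuracy + 0.2 in-language
    reasoning trace + 0.2 answer format."""
    reward = 0.0

    # 1) Accuracy: toy exact match on the boxed answer; a math-aware
    #    equivalence checker is the more realistic choice.
    boxed = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if boxed and boxed.group(1).strip() == gold_answer.strip():
        reward += 1.0

    # 2) Language consistency of the reasoning trace (everything before the box).
    cot = completion.split("\\boxed")[0]
    try:
        if detect(cot) == target_lang:  # e.g. "ja", "fr", "es"
            reward += 0.2
    except Exception:
        pass  # language ID can fail on very short or symbol-only strings

    # 3) Answer format: a boxed final answer is present at all.
    if boxed:
        reward += 0.2

    return reward
```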
📊 Evaluation.
We tried our approach on three different languages: Japanese (JA) / French (FR) / Spanish (ES)
And tested on multiple datasets: MMLU College Math (MMLU Math), AIME25, GPQA, MMLU Pro Medicine (MMLU Med)
The first two are in-domain: MMLU-Math is similar to the training data in difficulty, while AIME25 is harder. The other two are out-of-domain: GPQA covers hard science questions, and MMLU Pro Medicine consists of hard questions from the medical domain.
Regimes tested:
- Base: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- SFT (on top of Base)
- GRPO-from-Base
- GRPO-from-SFT
Metrics:
- pass@k (k = 1, 5, 10) with n = 32 samples, for accuracy 8
- Language consistency % (both reasoning traces and final answers must be in the requested language; script-aware checks)
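For reference, the standard unbiased pass@k estimator over n samples with c correct ones can be computed as below; this is a generic sketch, and the exact evaluation code lives in the mCoT-pass-K repo linked above.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k = 1 - C(n-c, k) / C(n, k), estimated from n samples
    of which c are correct (here n = 32 and k in {1, 5, 10})."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```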
How we score language consistency:
We check the entire CoT span and the final boxed answer.
A sample counts as Following = 1 only if both passages are in the requested language (script tokens, numerals, and markers allowed); otherwise 0.
We report the % across the set.
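A minimal sketch of such a script-aware check might look like the following. The regex ranges and the langdetect fallback are our simplifications; the released checker may differ in its details.

```python
import re
from langdetect import detect  # stands in for a proper language-ID model

JA_SCRIPT = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # Hiragana, Katakana, CJK

def span_in_language(text: str, lang: str) -> bool:
    """Per-span heuristic. Numerals, math symbols, and LaTeX markers are allowed,
    so only the remaining natural-language content is checked."""
    cleaned = re.sub(r"[\d\s\\{}()\[\]+\-*/=_^.,:;$]", "", text)
    if not cleaned:
        return True  # nothing but numbers/markers counts as allowed
    if lang == "ja":
        return bool(JA_SCRIPT.search(cleaned))  # script-based check for Japanese
    try:
        return detect(cleaned) == lang  # language ID for Latin-script languages (fr, es)
    except Exception:
        return False

def following(cot: str, final_answer: str, lang: str) -> int:
    """1 only if both the CoT span and the final boxed answer pass, else 0."""
    return int(span_in_language(cot, lang) and span_in_language(final_answer, lang))
```

The reported number is then simply the mean of `following` over all samples for a given dataset and language.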
🔑 Key contributions
- Small SFT reprograms the inner monologue. With only 817 chains, language consistency rises near the ceiling in French/Spanish across datasets and substantially in Japanese (Fig. RQ0).
- The two-step recipe Pareto-improves. SFT secures language consistency; GRPO-from-SFT recovers/boosts accuracy on tough sets (AIME/GPQA) without reverting to English (Figs. RQ1–RQ4).
- Diagnosed regressions and actionable fixes. Regressions stem from:
  - Japanese tokenization/numeric friction,
  - Spanish cue misalignment,
  - medicine reward/style mismatch.
  Tokenizer-aware normalization, small Japanese/Spanish SFT top-ups, and multi-objective GRPO (with optional model merging) could recover accuracy without sacrificing in-language reasoning.
- TL;DR. You can see our main results in the two figures below: starting from an EN/ZH-dominant reasoning prior, a small multilingual SFT is the most cost-effective way to “steer” chains of reasoning in-language. Adding math-only GRPO then recovers or improves accuracy on hard sets like AIME and GPQA while mostly preserving SFT’s language-consistency discipline, pushing the Accuracy × Following frontier in many language-dataset pairs. The two pain points, Japanese (tokenization/numeric friction) and medicine (reward/style mismatch), are expected given the base prior and training signal, and both have potentially straightforward fixes with light domain augmentation. Surprisingly, model merging also turns out to be very useful and effective.
Figure 1.a) Performance comparison overall across methods

Figure 1.b) Overall language consistency rate comparison across methods

RQ0 — Can small SFT reprogram a reasoning model’s “reasoning tone”?
Models often output the final answer in the same language as the user query. We want the reasoning process to match the prompt (user) language, too.
Results. SFT drives the language consistency rate close to the ceiling (~99–100%) in French/Spanish and raises Japanese substantially (high-80s/90s). The language consistency rates averaged across all datasets are shown in Fig. RQ0: bars labeled Japanese/French/Spanish.
Interpretation. A few hundred high-quality chains are enough to overwrite the English/Chinese inner-monologue prior in favor of other languages. Japanese remains stubborn (see RQ5).

RQ1 — Does SFT help accuracy, or only language reasoning style?
We have shown that SFT significantly improves language consistency rates, but what about accuracy?
Design.
Compare the accuracy of Base vs. SFT on pass@k for each dataset-language pair (Fig. RQ1: Δ pass@10 = SFT − Base).
Findings.
- MMLU-Math: substantial improvements when train and test are in the same domain
  - French: ~76 → 98
  - Spanish: ~80 → 99
  - Japanese: ~68 → 88
- AIME: mixed. Although AIME contains math problems, it is far more difficult than LiMO, making it less clearly in-domain. As a result, SFT trades accuracy for strict language consistency when reasoning in ES.
- GPQA / MMLU Pro Medicine: accuracy drops in most cases, but language consistency rises after SFT, indicating that the ability to produce the correct answer does not trivially generalize from the training domain to others.
Takeaway. SFT reliably improves language consistency and often increases accuracy on in-domain tasks (Math). On OOD sets, SFT can over-narrate or shift the previously most probable token paths, since the models are undertrained for reasoning in lower-resource languages, so accuracy may dip unless further action is taken (e.g., an RL top-up, as shown in RQ2 and RQ3).
Practical guidance. If your target is language consistency/reasoning style + some accuracy, SFT alone is cost-effective in-domain. If you also need robustness on hard and/or OOD sets, doing an RL top-up could be helpful.

RQ2 — When RL comes, how does GRPO help with accuracy?
Design. Train GRPO only on Math-500; evaluate deltas (GRPO-SFT − SFT) across MMLU-Math / AIME / GPQA / MMLU-Med (Fig. RQ2).
In-domain. SFT helps accuracy, but not always; GRPO brings a boost on top of the base SFT while maintaining language consistency of reasoning traces.
- MMLU-Math-FR pass@10: 76.0 → 97.8 → 98.0 (Base → SFT → GRPO-SFT)
- MMLU-Math-ES pass@10: 80.5 → 98.6 → 99.1 (Base → SFT → GRPO-SFT)
- MMLU-Math-JA pass@10: 68.1 → 88.0 → 91.5 (Base → SFT → GRPO-SFT)
The improvement in accuracy is consistent but slight because MMLU-Math is relatively easy: the model already reaches roughly 90–100% accuracy after SFT, leaving little room for GRPO. Thus, the OOD sets are more informative.
Out-of-domain.
Positive transfers on AIME JA/FR/ES and GPQA JA/FR. For instance:
- GPQA-ES pass@10: 68.7 → 85.2 → 85.7 (Base → SFT → GRPO-SFT)
- AIME-JA pass@10: 22.6 → 28.5 → 34.4 (Base → SFT → GRPO-SFT; GRPO adds a large JA gain)
More results are shown in the figure below. Although the improvements on AIME-FR/ES and GPQA-ES are marginal, they still indicate successful knowledge transfer to the OOD setup after GRPO.
Negative transfers on Pro-Medicine.
- Accuracy improves on Pro-Medicine-JA but decreases on French and Spanish.
Interpretation. GRPO learns verification/search habits that generalize: language consistency, math reasoning styles, re-checking numeric steps, and tighter answer boxing. Those help GPQA and AIME. But medicine needs domain lexicon, evidence phrasing, and calibrated claims—absent in math RL. Previous works have shown reasoning-only post-training harms performance on downstream instruction-following and knowledge recall tasks 9.

RQ3 — Where should RL start from: Base or SFT?
Design. Compare GRPO-from-Base vs GRPO-from-SFT (Fig. RQ3).
Patterns.
- GRPO-from-SFT is a steadier path. On MMLU-Math FR, for example, GRPO-SFT sits around ~98 pass@10 while GRPO-Base is closer to ~70, i.e., starting from SFT provides language consistency and still improves accuracy.
- SFT → RL keeps the multilingual policy. Because SFT already forced the model to reason in Japanese/French/Spanish, RL on top of that mostly optimizes correctness without switching back to EN/ZH reasoning (Fig. 1b).
Interpretation. SFT establishes the multilingual “reasoning policy.” Starting RL from the SFT model lets GRPO optimize correctness while preserving language consistency. RL from Base sometimes pushes the model back toward its original reasoning style while still producing answers in the target language. That can make a few out-of-domain slices look better, but it also increases variance and style regression compared to starting from SFT.
Practical rule. If you care about following (see figure 1b) and better in-domain accuracy, do GRPO after SFT.

RQ4 — Can we push the Pareto frontier instead of trading accuracy for language consistency?
Design. Plot Accuracy (x-axis) vs Following (y-axis) for each regime (4-panel Pareto figure). Then, inspect bar/line panels per dataset and language.
What we see.
- SFT shifts points up (Following↑). On some hard sets, accuracy dips slightly.
- GRPO-SFT shifts points rightward (Accuracy↑), with at most a small drop in Following compared with SFT-only, creating new frontier points on:
- MMLU-Math (JA/FR/ES): both metrics are high.
- GPQA-ES: strong frontier point.
- Non-frontier holdouts: Pro-Med FR/JA and AIME-ES, where domain/reward mismatch persists.

Bottom line. Read each plot within the same language marker (Japanese ▲, French ■, Spanish ●) and compare colors:
- yellow vs. blue = GRPO-from-SFT vs. Base
- green vs. blue = SFT vs. Base
Under this pairing:
SFT (green) vs. Base (blue) generally shifts points up and to the right, and GRPO-from-SFT most often traces the upper-right envelope where strict dominance occurs.
RQ5 — Does model merging help?
Motivation. GRPO+SFT often peaks on math but can regress on knowledge-heavy sets (e.g., Pro Medicine), and SFT alone doesn’t consistently stabilize accuracy across Japanese/French/Spanish.
Ideally, we want a solution that smooths these trade-offs while keeping language-consistency strong. Previous studies have shown that model merging is a promising approach to combine models’ abilities, albeit with some performance degradation 11.
Here, we merged the base model with the three SFT models using mergekit (https://github.com/arcee-ai/mergekit) with an equal-weight linear merge.
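For intuition, an equal-weight linear merge is just a parameter-wise average of checkpoints that share an architecture. The sketch below (with placeholder checkpoint names) shows the idea in plain PyTorch; the actual merges were produced with mergekit, which handles sharding and memory far more gracefully.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint names; the released merges were built with mergekit
# (linear method, equal weights), which streams shards instead of holding
# every model in memory like this illustrative version does.
paths = ["BASE_CKPT", "SFT_JA_CKPT", "SFT_FR_CKPT", "SFT_ES_CKPT"]
models = [AutoModelForCausalLM.from_pretrained(p, torch_dtype=torch.float32) for p in paths]

merged = models[0]
state = merged.state_dict()
with torch.no_grad():
    for name, tensor in state.items():
        if tensor.is_floating_point():  # skip integer buffers, if any
            state[name] = sum(m.state_dict()[name] for m in models) / len(models)
merged.load_state_dict(state)
merged.save_pretrained("merged-equal-linear")
```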

Result (avg pattern across datasets)

MERGE consistently shrinks worst-case losses and raises floor performance, especially where SFT/GRPO dip. On Pro Medicine, MERGE recovers large chunks of accuracy for Japanese/French (e.g., JA pass@10 climbs from SFT/GRPO’s ~47–58% to ~70%; FR from ~47–70% to ~76%), while staying competitive on AIME/GPQA and within a few points of GRPO+SFT on MMLU-Math.
In Spanish, where SFT already leads on Medicine, MERGE lands between Base and SFT/GRPO+SFT rather than falling back to Base-level performance.
Overall, it trades a small slice of peak scores for lower variance across languages and tasks.
Interpretation
Parameter-space interpolation acts like an ensemble/regularizer:
- MERGE blends GRPO’s strong multi-step heuristics with SFT’s alignment priors
- Dampens overfitting to any single regime
- Stabilizes cross-lingual behavior
Practically, it offers a steering knob:
- If you want the highest leaderboard peak, pick GRPO+SFT.
- If you want reliable, in-language reasoning across JA/FR/ES, especially on domain-heavy sets, pick MERGE.
Discussion: Where performance regresses, and potential solutions
Empirical signal. After SFT followed by GRPO, Japanese language consistency improves markedly, but accuracy lags French (e.g., AIME-JA pass@1 4.4 → 17.9, pass@10 22.6 → 34.4; AIME-FR pass@1 22.2 → 27.3, pass@10 46.3 → 48.2), indicating Japanese-specific friction despite the large gains.
Spanish on AIME shows the opposite tension: the Base model scores well because it always reasons in English despite Spanish prompts, while SFT+GRPO enforces Spanish chains and accuracy drops.
In Pro-Medicine, math-only GRPO from SFT causes regression (e.g., FR pass@10 70.1 → 46.6, ES 86.6 → 76.6, JA 75.9 → 58.3), whereas GRPO started from Base hurts less.
Mechanisms
- Language-prior competition. The model’s strongest reasoning prior is in EN/ZH. Under difficulty, chains drift toward those priors. SFT+GRPO strengthens language consistency, which reduces access to English-anchored reasoning traces that previously helped (e.g., AIME-ES). → evidenced by the huge language consistency bump.
- Tokenizer & formatting tax (Japanese > French / Spanish). Mixed scripts, half/full-width digits, unit variants, and thousand separators inflate perplexity on numeric steps—precisely where accuracy is most sensitive.
- Cue misalignment in Spanish math. AIME leans on algebra/number-theory “recipes” the model learned primarily in English (phrases like “let x be,” “gcd,” “mod”). Spanish equivalents (“sea x,” “mcd,” “módulo”) are rarer, longer, more accented → model drifts into slower or incorrect approaches mid-solution.
- Reward misspecification in medicine. Math-only RL optimizes numeric correctness, not biomedical recall, calibration, or evidence style. The policy over-indexes math heuristics and becomes over-assertive on clinical QA.
- Starting-point effect. RL from SFT pushes the policy toward SFT’s language/style anchors and away from neutral reasoning. On medicine, this causes bigger drops. RL from Base is more neutral; regressions are smaller.
Lightweight fixes that may work across cases
- Prompt-level normalization (before more training).
  - Japanese: unify to half-width digits, decimals, and exponent notation; no thousand separators; use an explicit math chain template in Japanese (e.g., 数字は半角… SI を使用し…). See the sketch after this list.
  - Spanish: prefer gcd / lcm / mod, exponent notation, half-width digits; terse step headers (Definimos / Sustituimos / Comprobación / Respuesta).
- Tokenizer-aware formatting. Consistent spacing around numerals/operators; avoid formatting that fragments tokens.
- Targeted SFT top-ups. Small, math-dense Japanese/Spanish datasets using normalized templates to reinforce per-language priors.
- Reward shaping for GRPO.
- For AIME-ES: up-weight correctness and make “Spanish-only chain” a secondary objective. → nudges reasoning into Spanish without punishing English-anchored correct answers.
- For Medicine: add a tiny medical reward head (terminology fidelity, claim calibration, evidence cues), plus a KL / behavior-cloning regularizer toward medical SFT to preserve discourse style.
- Use mixed-objective batches (math + clinical QA), and replay OOD medical exemplars during RL to avoid domain forgetting.
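For the Japanese prompt-level normalization suggested above, much of the work is standard Unicode normalization plus stripping thousand separators. A minimal sketch (ours, not from the repo):

```python
import re
import unicodedata

def normalize_ja_math(text: str) -> str:
    """Normalize Japanese math text before SFT/RL or prompting:
    full-width digits and operators become half-width (NFKC),
    and thousand separators inside numbers are removed."""
    text = unicodedata.normalize("NFKC", text)          # '１２＋３' -> '12+3'
    text = re.sub(r"(?<=\d)[,，](?=\d{3})", "", text)    # '1,234.5' -> '1234.5'
    return text

# Example: normalize_ja_math("面積は １，２３４．５６ です") -> "面積は 1234.56 です"
```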
Takeaway
The regressions likely stem from a single underlying cause: a mismatch between what the training signal rewards (EN/ZH-anchored priors and math-only correctness) and what the target language or domain actually needs; all of the fixes above work by realigning that signal.
Blog Summary — Practical takeaways
- If you can only afford one step, do SFT (a few hundred high-quality examples). You’ll almost certainly fix language consistency without compromising accuracy; you might also get accuracy improvements on in-domain tasks.
- If you can afford two steps, do SFT → GRPO-SFT. Use high clip / no KL; keep rollouts moderate; verify you haven’t regressed following.
- Another really practical yet computationally efficient approach is model merging among your SFT models.
- For medicine or other narrative-dense domains, add a tiny domain reward with in-domain data or a dozens-scale domain SFT.
- For Japanese (or any non-Latin script), include numeric/style templates and optionally patch tokenization issues via formatting.
- Track Pareto, not single metrics. Always plot (Accuracy, Following) together; real wins move you up-and-right.
Limitations & threats to validity
- Dataset scope. We use four well-known benchmarks; real-world prompts are noisier.
- Reward misspecification. Math-only RL can hurt non-math; our suggested fixes mitigate but don’t prove generality across all medical subspecialities.
- Model prior. EN/ZH dominance shapes outcomes. A different base prior (e.g., EU-centric) could change which languages are hardest.
- Language consistency metric. Strong, script-aware, but still an automatic proxy; human raters may be stricter.
Our work and resources related to this blog:
https://huggingface.co/collections/shanchen/xreasoning
@inproceedings{qi-etal-2025-models,
  title     = "When Models Reason in Your Language: Controlling Thinking Language Comes at the Cost of Accuracy",
  author    = "Qi, Jirui and Chen, Shan and Xiong, Zidi and Fern{\'a}ndez, Raquel and Bitterman, Danielle and Bisazza, Arianna",
  editor    = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
  month     = nov,
  year      = "2025",
  address   = "Suzhou, China",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2025.findings-emnlp.1103/",
  pages     = "20279--20296",
  ISBN      = "979-8-89176-335-7",
}

@misc{chen2025modelsreasonlanguagecontrollingtraining,
  title         = {Budget Alignment: Making Models Reason in the User's Language},
  author        = {Shan Chen and Jirui Qi and Zidi Xiong and Timothy Miller and Arianna Bisazza and Raquel Fernández and Danielle S. Bitterman},
  year          = {2025},
  archivePrefix = {github},
  url           = {https://github.com/Betswish/mCoT-XReasoning/tree/main/training},
}