3.8 Reasoning models：test-time compute paradigm

2026-05-12

Reasoning model 把「LLM 該想多久」從固定的 forward pass 數變成可訓練、可在推論時動態擴展的維度。OpenAI o1（2024 年底）跟 DeepSeek-R1（2025 年初）是這條路線的兩個里程碑、後續 Qwen-QwQ、Claude thinking、Gemini thinking 等都跟上。本章把 reasoning model 的訓練原理、推論行為、本地可跑選項、適用 / 不適用任務拆成可操作的判讀。

本章不重複 chain-of-thought 跟 test-time compute 卡片的定義、聚焦「reasoning model 怎麼運作、怎麼跟本地工作流結合」。

本章目標

讀完本章後、你應該能：

解釋「reasoning model」相對 instruct model 的訓練差異。
看到 <think>...</think> 標記或「extended thinking」field 時、知道是 reasoning trace、怎麼解讀。
判斷一個任務該用 reasoning model 還是 instruct model。
對自己的硬體預算估算「能不能本地跑 reasoning model」、選哪個。

Paradigm shift：從 scaling pretrain 到 scaling test-time

LLM 能力提升的兩條歷史路徑：

12020-2023 時期：scale pretrain compute
2  GPT-3 → GPT-4：模型大 5-10×、訓練 compute 大 50-100×
3  策略：更多參數 + 更多訓練 token = 更好的 base model
4
52024-2026 時期：scale test-time compute
6  GPT-4 → o1：模型大小接近、但推論時花 5-50× 算力
7  策略：base model 不變、訓練「推理能力」+ 推論時動態擴展 reasoning trace

兩條路線不對立、是疊加：reasoning model 本身仍跑在大 base model 上、reasoning RL 是再加一層後訓練。Cost trade-off 對比的 framing 跟對使用者錢包的影響、見 test-time compute 卡片。本章接下來聚焦「reasoning model 的訓練流程」跟「本地選型」、不重複 paradigm 層的對比。

關鍵理解：reasoning model 不是「更聰明的 GPT-4」、是「同等聰明 base model + 學會把算力花在 reasoning 上」。底層 base model 依然是 Transformer、所有前面章節（attention、FFN、sampling）原理不變。

Reasoning model 的訓練流程

DeepSeek-R1 是第一個公開細節的開源 reasoning model、其 paper 揭示的訓練流程具有代表性：

 1Stage 1: Cold-start SFT
 2  用幾千份「高品質 long reasoning trace」資料 fine-tune base model
 3  目標：讓模型學會「該怎麼想」的 format
 4
 5Stage 2: Reasoning-focused RL
 6  Reward：最終答案正確（math / code / logic 等可機械驗證的任務）
 7  Policy：把 reasoning trace 越拉越長、越能正確、reward 越高
 8  約束：保留語言流暢度（不能 reasoning trace 變成亂碼）
 9  → 模型自發學會「困難問題想更久」
10
11Stage 3: SFT on reasoning + non-reasoning data
12  把 reasoning RL 學到的能力跟一般 instruct 能力 mix
13  避免「只會 reasoning、不會聊天」
14
15Stage 4: Final RLHF / DPO（可選）
16  跟 instruct model 同樣的 alignment 階段、refine helpfulness

關鍵特性：

Stage 2 的 reward 機械可驗證：math 答案、code unit test、logic 答案 — 不需要 human preference、所以可大量擴展訓練資料
Reasoning trace 是「emerge」出來的：訓練不直接告訴模型「該怎麼想」、只給「答案對不對」、模型自己摸索出最佳 reasoning strategy
跨任務 transfer 有限：reasoning model 在訓練分佈內任務（math、coding）強、跨到開放域對話、提升較小

DeepSeek-R1 distill 系列是另一條路：用 R1 full 模型產生 reasoning trace、再 SFT 一個小 base model（如 Qwen2.5-32B）— 讓較小模型也有 reasoning 能力、但跳過昂貴的 RL 階段。

Reasoning trace 的格式

主流 reasoning model 在推論時輸出 reasoning trace 的格式：

 1DeepSeek-R1 / Qwen-QwQ：用特殊 token 標記
 2  <think>
 3  讓我先列出已知條件...先試 case 1...結果矛盾、改試 case 2...
 4  </think>
 5  最終答案：X
 6
 7OpenAI o1：對使用者隱藏
 8  API 只回最終答案、但計費 reasoning token
 9  使用者看不到 reasoning trace 內容
10
11Claude 3.7 thinking：extended thinking field
12  API response 含 `extended_thinking` 跟 `text` 兩個 field
13  IDE / chat 介面通常折疊顯示 thinking 內容

實作層的關鍵考量：

Tokenizer 對 reasoning token 的處理：<think> 等特殊 token 在 vocab 中被保留、tokenizer 識別後不切碎
Context budget 分配：reasoning trace 通常 1000-10000 token、要預留 context window 容量
Streaming 行為：reasoning trace streaming 時、使用者看到「模型在想」、TTFT 變短但「first useful output」變長
Stop sequence：sampling 階段 </think> 或對應結束 token 是 reasoning trace 的 terminator

本地可跑的 reasoning model

2026/5 時、本地寫 code 工作流可考慮的 reasoning model：

模型	大小	Q4 量化後記憶體	適合硬體	reasoning trace 平均 token
DeepSeek-R1-Distill-Qwen-7B	7B	~4 GB	16GB+ Mac / 16GB+ VRAM	500-2000
DeepSeek-R1-Distill-Qwen-14B	14B	~8 GB	24GB+ Mac / 16GB+ VRAM	1000-3000
DeepSeek-R1-Distill-Qwen-32B	32B	~18 GB	32GB+ Mac / 24GB+ VRAM	1500-5000
QwQ-32B	32B	~18 GB	32GB+ Mac / 24GB+ VRAM	2000-8000
DeepSeek-R1（full）	671B（MoE）	~140 GB	不實際本地跑	5000-30000

事實查核註：模型大小、量化體積、reasoning trace 長度是 2026/5 主流版本的常見數量級；具體數字隨量化等級、context 配置、任務類型而變、引用前以對應 model card 跟自己 llama-bench 跑為準。

選型判讀（個人 dev 場景）：

24GB Mac（M4 Pro）：可跑 14B distill、或 32B distill Q4 緊張、context 開小
32GB Mac（M4 Pro 升級）：跑 32B distill 舒服、context 32K+ 可開
48GB+ Mac（M4 Max）：跑 32B distill 寬鬆、可考慮 QwQ-32B 配 64K context
16GB+ VRAM PC：跑 14B distill；32B distill 屬 dense 架構（不是 MoE）、要用 dense CPU offload（部分層放 RAM、靠 PCIe 走、tok/s 受 PCIe 頻寬限制）、跟 MoE CPU offload 是不同的戰術
24GB+ VRAM PC（5090）：跑 32B distill 寬鬆

適合 reasoning model 的任務

Reasoning model 的優勢任務有明確 pattern：

任務類型	為什麼適合	案例
複雜 algorithm design	需要多步推理 + 探索多個解法	Leetcode hard、設計 sliding window 解法
棘手 debug	需要排除多種可能、追蹤跨檔案邏輯	「為什麼這個 race condition 偶爾出現」
Math / 量化分析	機械可驗證、模型訓練分佈內	估算系統 capacity、複雜利率計算
Multi-step refactor 規劃	需要看到整體影響、分階段	「把這個 service 拆成 3 個 microservice 的步驟」
系統設計取捨	多 dimension 比較、需要展開論證	「DB 該選 Postgres 還是 Cassandra」
解 obscure error	需要 reason about 多個可能根因	「kernel panic 訊息 X 可能來源」

不適合用 reasoning model 的任務（用 instruct model 即可）：

任務類型	為什麼不適合	改用
Autocomplete	reasoning trace 拉長 TTFT、體感變慢	Instruct 小模型（如 Qwen3-Coder-7B）
簡單 docstring / comment	過度推理、浪費 token	Instruct model
純翻譯 / 風格改寫	不需要 reasoning	Instruct model
高頻短查詢	每次 reasoning overhead 累積	Instruct model + KV cache
已知答案的查表	reasoning 反而引入錯誤	Instruct model
探索性 brainstorming	不需要「正確答案」、reasoning 反而限制創意	Instruct model + 高 temperature

判讀反射：先問「這任務有沒有客觀正確答案 + 是否需要多步推理」、兩者都 yes 才用 reasoning model。

Reasoning model + tool use

Reasoning model 跟 tool use 結合是 2026 新趨勢、典型形態：

1模型在 reasoning trace 中發現「需要驗證一個事實」
2  ↓
3呼叫 tool（calculator / web search / code interpreter）
4  ↓
5拿到結果、繼續 reasoning
6  ↓
7最終答案

代表場景：

Coding agent + reasoning：reasoning 階段規劃 refactor 步驟、tool use 階段執行 file edit、reasoning 階段檢查結果
Math / data analysis：reasoning 階段拆問題、code interpreter 跑 calculation、reasoning 階段解讀
Web 研究：reasoning 階段列出該查的事實、web search、reasoning 階段彙整

挑戰：

Reasoning trace + tool result 都進 context：context 用量爆炸快、需要 long context 模型（見 4.11 Long context engineering）
Tool use 訓練跟 reasoning 訓練是兩件事：本地 distill 模型 tool use 能力 = 對應 base model 的 tool use 能力、不一定強
Error recovery：reasoning 階段假設錯了、tool 回 error、模型要會 backtrack（agent loop 失敗模式）

實務上、本地 reasoning + agent 是「值得試、但仍處早期」階段；雲端 R1 / o3 / Claude thinking + Claude Code / Cursor 是現階段更穩的組合。

跟 instruct model 共存的混用策略

寫 code 場景的合理混用配置：

 1Default model（Continue.dev primary）：instruct model
 2  Qwen3-Coder-30B-Instruct / Gemma 4 31B Instruct
 3  日常 autocomplete、解釋、簡單 refactor
 4
 5Reasoning model（Continue.dev secondary、手動切）：local reasoning
 6  DeepSeek-R1-Distill-Qwen-32B / QwQ-32B
 7  困難 bug、algorithm、複雜 refactor 規劃
 8
 9Cloud fallback（手動切）：雲端旗艦
10  Claude 3.7 Sonnet thinking / GPT-5 / o3
11  本地 reasoning 卡住、或極困難任務

Continue.dev 的 multi-model config 可同時設多個、UI 下拉切換、不用重啟 server。安全 / 隱私面：reasoning trace 可能含敏感推理過程、跨雲端 / 本地邊界判讀同 6.4。

何時過時 / 何時不過時

不會過時的部分：

Test-time compute 作為一個獨立 scaling 維度的概念
Reasoning trace 結構（pre-answer reasoning + answer）
「適合 reasoning vs instruct」的判讀框架
「機械可驗證的 reward + RL」是 reasoning training 的核心
Reasoning model + tool use 的設計取捨

會變的部分：

具體 reasoning model（R1 → R2 → …、o1 → o3 → …、會持續迭代）
Reasoning trace 的具體格式（<think>、extended thinking field、未來可能標準化）
本地可跑的模型選項（distill 系列會持續更新）
Reasoning 跟 agent 結合的最佳實踐（仍在演化）
是否會出現 reasoning paradigm 的下一個替代（如 neurosymbolic、multi-agent reasoning）

新 reasoning model 出來時、回到本章的 framing：訓練流程是否同 R1 pattern、reasoning trace 怎麼產出、本地能否跑、適用任務是否同樣 pattern — 多數新模型仍會 fit 進這個框架。

下一章：3.9 Speculative decoding 內部、看另一個推論時加速的技術細節。

#llm #theory #reasoning #chain-of-thought #test-time-compute