<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Constrained-Decoding on Tarragon</title><link>https://tarrragon.github.io/blog/tags/constrained-decoding/</link><description>Recent content in Constrained-Decoding on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 12 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/constrained-decoding/index.xml" rel="self" type="application/rss+xml"/><item><title>3.10 Constrained decoding internals: grammar masks and performance trade-offs</title><link>https://tarrragon.github.io/blog/llm/03-theoretical-foundations/constrained-decoding-internals/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/03-theoretical-foundations/constrained-decoding-internals/</guid><description>&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/03-theoretical-foundations/sampling-and-decoding/" data-link-title="3.5 Sampling 與 Decoding 策略" data-link-desc="Greedy、beam search、top-k、top-p、temperature、min-p：模型輸出後怎麼挑下一個 token">3.5 sampling-and-decoding&lt;/a> covered greedy / beam / top-p / top-k sampling, the basic mechanism for picking the next token from the legal outputs. &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 application-protocols&lt;/a> covered the application layer of function calling / structured output, but neither chapter explained why an LLM can be guaranteed to emit valid JSON. This chapter fills in the internals of &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/constrained-decoding/" data-link-title="Constrained Decoding" data-link-desc="推論時用 grammar 強制 LLM 輸出符合特定格式（JSON / regex / CFG）的 sampling 機制、把不合法 token 的機率歸零">constrained decoding&lt;/a>: how the token mask is computed, the three grammar types (JSON schema / regex / 
&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/grammar/" data-link-title="Grammar" data-link-desc="描述合法字串形狀的形式規則，在 structured output 中用來限制 LLM 每一步可輸出的 token">CFG&lt;/a>), and why implementations such as XGrammar can actually speed up generation.&lt;/p>
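&lt;h2 id="本章目標">Chapter goals&lt;/h2>
&lt;p>After reading this chapter, you should be able to:&lt;/p>
&lt;ol>
&lt;li>Explain at which step of sampling the grammar is enforced.&lt;/li>
&lt;li>Tell apart the use cases for the three grammar types: JSON schema, regex, and CFG.&lt;/li>
&lt;li>Map implementations such as XGrammar, outlines, and llama.cpp grammar onto this chapter's framing.&lt;/li>
&lt;li>Judge in which concrete scenarios constrained decoding speeds generation up or slows it down.&lt;/li>
&lt;/ol>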
&lt;h2 id="sampling-階段的位置">Where it sits in the sampling stage&lt;/h2>
&lt;p>Recall the LLM output pipeline (see &lt;a href="https://tarrragon.github.io/blog/llm/03-theoretical-foundations/sampling-and-decoding/" data-link-title="3.5 Sampling 與 Decoding 策略" data-link-desc="Greedy、beam search、top-k、top-p、temperature、min-p：模型輸出後怎麼挑下一個 token">3.5&lt;/a>):&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">[forward pass] → logits (vocab_size dimensions, one real number per token)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓ apply temperature (logits / T)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> ↓ apply constrained decoding (focus of this chapter) ← grammar mask
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> ↓ softmax → probability distribution
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> ↓ top-p / top-k / sampling
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl"> ↓ next token&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Constrained decoding inserts a &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/constrained-decoding/" data-link-title="Constrained Decoding" data-link-desc="推論時用 grammar 強制 LLM 輸出符合特定格式（JSON / regex / CFG）的 sampling 機制、把不合法 token 的機率歸零">grammar mask&lt;/a> &lt;strong>before&lt;/strong> softmax:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">For each position:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> 1. The grammar computes the set of legal tokens at the current position (a subset of the vocab)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> 2. Every token outside the legal set gets its logit set to -∞
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> 3. After softmax, illegal tokens have probability 0
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> 4. Sampling can only pick a legal token&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Key takeaway: the grammar does not change the model and does not change the logit values (other than the masked entries); it only &lt;strong>restricts the sampling space&lt;/strong>.&lt;/p></description><content:encoded><![CDATA[<p><a href="/blog/llm/03-theoretical-foundations/sampling-and-decoding/" data-link-title="3.5 Sampling 與 Decoding 策略" data-link-desc="Greedy、beam search、top-k、top-p、temperature、min-p：模型輸出後怎麼挑下一個 token">3.5 sampling-and-decoding</a> covered greedy / beam / top-p / top-k sampling, the basic mechanism for picking the next token from the legal outputs. <a href="/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 application-protocols</a> covered the application layer of function calling / structured output, but neither chapter explained why an LLM can be guaranteed to emit valid JSON. This chapter fills in the internals of <a href="/blog/llm/knowledge-cards/constrained-decoding/" data-link-title="Constrained Decoding" data-link-desc="推論時用 grammar 強制 LLM 輸出符合特定格式（JSON / regex / CFG）的 sampling 機制、把不合法 token 的機率歸零">constrained decoding</a>: how the token mask is computed, the three grammar types (JSON schema / regex / <a href="/blog/llm/knowledge-cards/grammar/" data-link-title="Grammar" data-link-desc="描述合法字串形狀的形式規則，在 structured output 中用來限制 LLM 每一步可輸出的 token">CFG</a>), and why implementations such as XGrammar can actually speed up generation.</p>
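<h2 id="本章目標">Chapter goals</h2>
<p>After reading this chapter, you should be able to:</p>
<ol>
<li>Explain at which step of sampling the grammar is enforced.</li>
<li>Tell apart the use cases for the three grammar types: JSON schema, regex, and CFG.</li>
<li>Map implementations such as XGrammar, outlines, and llama.cpp grammar onto this chapter's framing.</li>
<li>Judge in which concrete scenarios constrained decoding speeds generation up or slows it down.</li>
</ol>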
<h2 id="sampling-階段的位置">Where it sits in the sampling stage</h2>
<p>Recall the LLM output pipeline (see <a href="/blog/llm/03-theoretical-foundations/sampling-and-decoding/" data-link-title="3.5 Sampling 與 Decoding 策略" data-link-desc="Greedy、beam search、top-k、top-p、temperature、min-p：模型輸出後怎麼挑下一個 token">3.5</a>):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[forward pass] → logits (vocab_size dimensions, one real number per token)
</span></span><span class="line"><span class="ln">2</span><span class="cl">       ↓ apply temperature (logits / T)
</span></span><span class="line"><span class="ln">3</span><span class="cl">       ↓ apply constrained decoding (focus of this chapter)  ← grammar mask
</span></span><span class="line"><span class="ln">4</span><span class="cl">       ↓ softmax → probability distribution
</span></span><span class="line"><span class="ln">5</span><span class="cl">       ↓ top-p / top-k / sampling
</span></span><span class="line"><span class="ln">6</span><span class="cl">       ↓ next token</span></span></code></pre></div><p>Constrained decoding inserts a <a href="/blog/llm/knowledge-cards/constrained-decoding/" data-link-title="Constrained Decoding" data-link-desc="推論時用 grammar 強制 LLM 輸出符合特定格式（JSON / regex / CFG）的 sampling 機制、把不合法 token 的機率歸零">grammar mask</a> <strong>before</strong> softmax:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">For each position:
</span></span><span class="line"><span class="ln">2</span><span class="cl">  1. The grammar computes the set of legal tokens at the current position (a subset of the vocab)
</span></span><span class="line"><span class="ln">3</span><span class="cl">  2. Every token outside the legal set gets its logit set to -∞
</span></span><span class="line"><span class="ln">4</span><span class="cl">  3. After softmax, illegal tokens have probability 0
</span></span><span class="line"><span class="ln">5</span><span class="cl">  4. Sampling can only pick a legal token</span></span></code></pre></div><p>Key takeaway: the grammar does not change the model and does not change the logit values (other than the masked entries); it only <strong>restricts the sampling space</strong>.</p>
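<p>The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any engine's actual code: the function name <code>masked_softmax</code>, the 5-token vocabulary, and the choice of legal ids are all assumptions made for the example.</p>

```python
import math


def masked_softmax(logits, allowed_ids, temperature=1.0):
    """Return next-token probabilities; tokens outside allowed_ids get exactly 0."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("the grammar produced an empty set of legal tokens")
    # Temperature scaling for legal tokens, -inf for everything else.
    scaled = [
        x / temperature if i in allowed else -math.inf
        for i, x in enumerate(logits)
    ]
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 5-token vocabulary; suppose the grammar allows only ids 1 and 3.
logits = [2.0, 1.0, 3.0, 1.0, 0.5]
probs = masked_softmax(logits, allowed_ids=[1, 3])
print(probs)  # [0.0, 0.5, 0.0, 0.5, 0.0]
```

<p>Token 2 has the highest raw logit but is masked out, so all probability mass is shared between the two legal tokens according to their original logits.</p>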
<h2 id="三種主流-grammar">Three mainstream grammar types</h2>
<h3 id="json-schema">JSON Schema</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nt">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">    <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="p">{</span><span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="nt">&#34;age&#34;</span><span class="p">:</span> <span class="p">{</span><span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;integer&#34;</span><span class="p">,</span> <span class="nt">&#34;minimum&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">  <span class="nt">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;name&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>The LLM output must be valid JSON and conform to the schema. In practice:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Generated so far: &#39;{&#34;name&#34;: &#34;alice&#34;&#39;
</span></span><span class="line"><span class="ln">2</span><span class="cl">  ↓ compute the legal next tokens:
</span></span><span class="line"><span class="ln">3</span><span class="cl">  - the output must remain a valid JSON prefix
</span></span><span class="line"><span class="ln">4</span><span class="cl">  - name is filled and age is optional, so } is legal; so is , followed by &#34;age&#34;
</span></span><span class="line"><span class="ln">5</span><span class="cl">  - illegal: &#39;{&#39; / &#39;]&#39; / any other key (assuming the engine treats the schema as closed, i.e. additionalProperties: false)
</span></span><span class="line"><span class="ln">6</span><span class="cl">  ↓ apply the token mask
</span></span><span class="line"><span class="ln">7</span><span class="cl">  → the model can only pick } or , (then &#34;age&#34;)</span></span></code></pre></div><h3 id="regex">Regex</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">09\d{2}-\d{3}-\d{3}  # Taiwan mobile number format, e.g. 0912-345-678</span></span></code></pre></div><p>The LLM output must match the regex. In practice:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Generated so far: &#39;09&#39;
</span></span><span class="line"><span class="ln">2</span><span class="cl">  ↓ compute the legal next tokens:
</span></span><span class="line"><span class="ln">3</span><span class="cl">  - the regex expects \d next
</span></span><span class="line"><span class="ln">4</span><span class="cl">  - legal: tokens that begin with a digit and whose remaining characters still fit the pattern
</span></span><span class="line"><span class="ln">5</span><span class="cl">  - illegal: letters, symbols
</span></span><span class="line"><span class="ln">6</span><span class="cl">  ↓ token mask</span></span></code></pre></div><h3 id="cfgcontext-free-grammar">CFG (Context-Free Grammar)</h3>
<p>Legal syntax is described in <a href="/blog/llm/knowledge-cards/bnf/" data-link-title="BNF（Backus-Naur Form）" data-link-desc="用遞迴產生式描述語法的經典記法，是 CFG、parser 與 grammar-constrained sampling 常見的基礎表示">BNF</a> / EBNF:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">expr   ::= term (&#34;+&#34; term)*
</span></span><span class="line"><span class="ln">2</span><span class="cl">term   ::= number | &#34;(&#34; expr &#34;)&#34;
</span></span><span class="line"><span class="ln">3</span><span class="cl">number ::= [0-9]+</span></span></code></pre></div><p>The LLM output must conform to this grammar. In practice:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Generated so far: &#39;(1+2&#39;
</span></span><span class="line"><span class="ln">2</span><span class="cl">  ↓ the CFG computes the legal next tokens:
</span></span><span class="line"><span class="ln">3</span><span class="cl">  - matched so far: &#34;(&#34; then term &#34;+&#34; term, with one parenthesis still open
</span></span><span class="line"><span class="ln">4</span><span class="cl">  - legal: &#34;)&#34; to close, &#34;+&#34; to start a new term, or another digit to extend the number
</span></span><span class="line"><span class="ln">5</span><span class="cl">  - illegal: letters, other symbols
</span></span><span class="line"><span class="ln">6</span><span class="cl">  ↓ token mask</span></span></code></pre></div><p>CFG is the most expressive of the three but also the most complex to implement. SQL and code generation mostly use CFG-based grammars.</p>
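<p>To make the CFG case concrete, here is a small hand-written recognizer for the arithmetic grammar above. It is a sketch under stated assumptions, not how XGrammar or llama.cpp implement it: the state is just a counter of open parentheses (standing in for the push-down stack) plus what the parser expects next, and <code>advance</code>, <code>legal_tokens</code>, and the toy vocabulary are hypothetical names invented for the example.</p>

```python
DIGITS = "0123456789"
START = (0, "need_term")  # (open parentheses, what the parser expects next)


def advance(state, text):
    """Feed text into the recognizer; return the new state, or None if rejected."""
    depth, mode = state
    for ch in text:
        if ch in DIGITS and mode in ("need_term", "in_number"):
            mode = "in_number"
        elif ch == "(" and mode == "need_term":
            depth += 1
        elif ch == "+" and mode in ("in_number", "after_close"):
            mode = "need_term"
        elif ch == ")" and mode in ("in_number", "after_close") and depth > 0:
            depth, mode = depth - 1, "after_close"
        else:
            return None
    return depth, mode


def legal_tokens(prefix, vocab):
    """List the vocabulary tokens that keep prefix a valid prefix of the grammar."""
    state = advance(START, prefix)
    if state is None:
        raise ValueError("prefix is already invalid under the grammar")
    return [tok for tok in vocab if advance(state, tok) is not None]


# Hypothetical toy vocabulary with single- and multi-character tokens.
VOCAB = ["0", "7", "12", "+", "(", ")", "+(", "))", "a", "*"]
print(legal_tokens("(1+2", VOCAB))  # ['0', '7', '12', '+', ')', '+(']
```

<p>Note that a token is checked character by character: <code>"+("</code> is legal because both characters fit, while <code>"))"</code> is rejected because only one parenthesis is open. Real tokenizers produce many such multi-character tokens, which is a large part of why production engines are more involved than this sketch.</p>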
<h2 id="xgrammar-的-pre-compile-機制">XGrammar's pre-compile mechanism</h2>
<p>XGrammar (Dong et al., 2024) is a mainstream high-efficiency implementation of 2024-2025. Its core optimizations:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Naive implementation (e.g. early versions of outlines):
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  recompute the grammar state at every sampling step
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  run a grammar parse for every token
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  → high overhead, may slow generation down
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">XGrammar optimizations:
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  1. Pre-compile the grammar → deterministic DFA / push-down automaton
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  2. Cache a &#34;legal token mask bitmap&#34; for each grammar state
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  3. At sampling time, look the mask up in O(1)
</span></span><span class="line"><span class="ln">10</span><span class="cl">  4. Apply the mask to the logits with bitwise ops</span></span></code></pre></div><p>Effect: the overhead of applying the grammar approaches zero, and generation can even get <strong>faster, because boilerplate tokens are skipped</strong>. Where the grammar allows exactly one continuation, the engine can append it without running a forward pass (often called jump-forward decoding):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Generating JSON without a grammar:
</span></span><span class="line"><span class="ln">2</span><span class="cl">  {     &#34; n a m e &#34;     : &#34; a l i c e &#34; ...
</span></span><span class="line"><span class="ln">3</span><span class="cl">  ←     every token needs a forward pass    →
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl">Generating JSON with a grammar:
</span></span><span class="line"><span class="ln">6</span><span class="cl">  skip the fixed tokens ({ &#34; : etc.) and generate only the free-form strings
</span></span><span class="line"><span class="ln">7</span><span class="cl">  fewer forward passes
</span></span><span class="line"><span class="ln">8</span><span class="cl">  → measured speed-ups of roughly 1.5-3×</span></span></code></pre></div><p>Mainstream inference servers (vLLM, SGLang, TensorRT-LLM) have used XGrammar by default since 2025.</p>
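<p>The compile-once, look-up-per-step idea can be sketched on a small regex-style grammar. The code below is an illustration under assumptions, not XGrammar's data structures: it hard-codes a hypothetical fixed-length pattern <code>09\d{2}-\d{3}-\d{3}</code> as a DFA whose state is simply the position in the pattern, stores one Python integer per state as the bitmask, and uses a toy vocabulary. The helper names are invented for the example.</p>

```python
# Hypothetical fixed-length pattern 09\d{2}-\d{3}-\d{3}; "d" stands for any digit.
PATTERN = ["0", "9", "d", "d", "-", "d", "d", "d", "-", "d", "d", "d"]
DIGITS = "0123456789"


def run(state, text):
    """Advance the DFA over text; state is an index into PATTERN, None means rejected."""
    for ch in text:
        if state >= len(PATTERN):
            return None
        want = PATTERN[state]
        if not (ch in DIGITS if want == "d" else ch == want):
            return None
        state += 1
    return state


def compile_masks(vocab):
    """Compile step, done once per grammar: one bitmask of legal token ids per state."""
    masks = []
    for state in range(len(PATTERN) + 1):
        bits = 0
        for token_id, tok in enumerate(vocab):
            if run(state, tok) is not None:
                bits |= 1 << token_id
        masks.append(bits)
    return masks


def legal_ids(masks, state, vocab_size):
    """Decode step: a table lookup plus bit tests, with no grammar parsing."""
    return [i for i in range(vocab_size) if (masks[state] >> i) & 1]


def forced_text(state):
    """Characters the pattern fixes from this state; they can be appended without sampling."""
    out = ""
    while state < len(PATTERN) and PATTERN[state] != "d":
        out += PATTERN[state]
        state += 1
    return out


VOCAB = ["0", "9", "09", "12", "-", "5-", "123", "a"]  # toy vocabulary
masks = compile_masks(VOCAB)
state = run(0, "09")
print([VOCAB[i] for i in legal_ids(masks, state, len(VOCAB))])  # ['0', '9', '09', '12']
print(repr(forced_text(0)), repr(forced_text(4)))  # '09' '-'
```

<p><code>forced_text</code> shows where the speed-up comes from: at states where the pattern leaves no choice, the text can be emitted directly and the model is only consulted for the free digits. A CFG needs more than a per-state table because the legal set can also depend on the stack, so treat this as the simplest possible instance of the idea.</p>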
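<h2 id="性能取捨加速還是拖慢">Performance trade-off: faster or slower</h2>
<p>A common misconception is that constrained decoding slows generation down. In practice it depends on the implementation:</p>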
<table>
  <thead>
      <tr>
          <th>Implementation</th>
          <th>Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XGrammar (default in vLLM and others)</td>
          <td><strong>1.5-3× faster</strong> (fixed tokens skipped, fewer forward passes)</td>
      </tr>
      <tr>
          <td>outlines (pre-compiled)</td>
          <td>Slightly faster to neutral</td>
      </tr>
      <tr>
          <td>outlines (lazy compile)</td>
          <td>Slightly slower</td>
      </tr>
      <tr>
          <td>guidance (high-level API)</td>
          <td>Neutral to slightly slower</td>
      </tr>
      <tr>
          <td>llama.cpp grammar</td>
          <td>Neutral</td>
      </tr>
      <tr>
          <td>Lazy / naive implementation</td>
          <td>Slower</td>
      </tr>
  </tbody>
</table>
<p>Reading the table: on a mainstream inference server (vLLM / SGLang) with XGrammar, constrained decoding usually speeds generation up; a hand-written naive implementation may slow it down.</p>
<h2 id="跟-function-calling-的關係">Relationship to <a href="/blog/llm/knowledge-cards/function-calling/" data-link-title="Function Calling" data-link-desc="模型訓練階段建立的「呼叫工具」能力：知道何時該呼叫、傳什麼參數">function calling</a></h2>
<p>The two concepts are independent and can also be combined:</p>
<table>
  <thead>
      <tr>
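          <th>Approach</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pure function calling (no constrained decoding)</td>
          <td>Relies on model training; validity is not enforced, so parsing can fail</td>
      </tr>
      <tr>
          <td>Pure constrained decoding (no function-calling training)</td>
          <td>Validity is enforced at inference time, but the model may not know when to call a tool</td>
      </tr>
      <tr>
          <td>Function calling + constrained decoding</td>
          <td>Training teaches the model when to call; the grammar forces the call format to be valid</td>
      </tr>
  </tbody>
</table>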
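<p>Function calling in mainstream commercial APIs (Anthropic / OpenAI / Gemini) often <strong>uses constrained decoding internally</strong>, typically when a strict or structured-output mode is enabled (for example OpenAI's <code>strict: true</code>), and this is transparent to developers. For local inference, vLLM / SGLang + XGrammar is likewise the default combination.</p>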
<h2 id="失敗模式">Failure modes</h2>
<h3 id="1-grammar-太嚴讓模型該說的話說不出來">1. The grammar is so strict that the model cannot say what it should</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">The schema forces type to be an enum [&#34;A&#34;, &#34;B&#34;, &#34;C&#34;]
</span></span><span class="line"><span class="ln">2</span><span class="cl">but the true answer is &#34;none of the above&#34;
</span></span><span class="line"><span class="ln">3</span><span class="cl">→ the model is forced to pick A/B/C and the output is semantically wrong</span></span></code></pre></div><p><strong>Mitigation</strong>: add a fallback option to the enum (&ldquo;unknown&rdquo; / &ldquo;none&rdquo;) and avoid over-constraining the schema.</p>
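<p>One way to apply the fallback mitigation is to patch the schema before handing it to the grammar engine. The helper below is a hypothetical sketch; the name <code>with_fallback</code> and the default value <code>"unknown"</code> are assumptions, and it only handles a top-level property that already has an <code>enum</code>.</p>

```python
import copy


def with_fallback(schema, field, fallback="unknown"):
    """Return a copy of schema whose enum for field also admits a fallback value."""
    patched = copy.deepcopy(schema)
    options = patched["properties"][field]["enum"]
    if fallback not in options:
        options.append(fallback)
    return patched


strict_schema = {
    "type": "object",
    "properties": {"type": {"enum": ["A", "B", "C"]}},
    "required": ["type"],
}
relaxed_schema = with_fallback(strict_schema, "type")
print(relaxed_schema["properties"]["type"]["enum"])  # ['A', 'B', 'C', 'unknown']
```

<p>The original schema is left untouched, so downstream validation can still distinguish a real category from the escape hatch and route <code>"unknown"</code> answers to a separate code path.</p>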
<h3 id="2-cfg-太複雜編譯失敗--慢">2. The CFG is too complex: compilation fails or is slow</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">A complex CFG (e.g. a full SQL grammar) can take several seconds to pre-compile
</span></span><span class="line"><span class="ln">2</span><span class="cl">production cold start pays those extra seconds</span></span></code></pre></div><p><strong>Mitigation</strong>: cache the compiled grammar, or use a simpler grammar (e.g. &ldquo;INSERT only&rdquo; instead of full SQL).</p>
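<p>Caching the compiled grammar can be as simple as memoizing the compile call on the grammar text. The sketch below uses the standard library's <code>functools.lru_cache</code>; <code>compile_grammar</code> is a hypothetical stand-in for whatever compile function the chosen engine exposes, and the one-rule grammar is invented for the example.</p>

```python
import functools

compile_count = 0  # counts how often the expensive step actually runs


@functools.lru_cache(maxsize=32)
def compile_grammar(grammar_text):
    """Hypothetical stand-in for a slow compile; returns the non-empty rule lines."""
    global compile_count
    compile_count += 1
    return tuple(line.strip() for line in grammar_text.splitlines() if line.strip())


# Hypothetical reduced grammar covering INSERT statements only.
INSERT_ONLY = 'stmt ::= "INSERT INTO " ident " VALUES " row'

compile_grammar(INSERT_ONLY)  # cold: compiles
compile_grammar(INSERT_ONLY)  # warm: served from the cache
print(compile_count)  # 1
```

<p>Calling the compile function once at server start-up for the grammars known in advance moves the cost out of the first user request.</p>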
<h3 id="3-grammar-跟-model-訓練分佈不符">3. The grammar does not match the model's training distribution</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">The schema requires a very rare JSON structure
</span></span><span class="line"><span class="ln">2</span><span class="cl">the model never saw this structure during training
</span></span><span class="line"><span class="ln">3</span><span class="cl">even though the grammar forces validity, the content may be semantically empty</span></span></code></pre></div><p><strong>Mitigation</strong>: use shapes the model was trained on (function call specs, common JSON), and add few-shot examples for custom schemas.</p>
<h3 id="4-streaming-跟-grammar-衝突">4. Streaming conflicts with the grammar</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Streaming emits output while it is being generated
</span></span><span class="line"><span class="ln">2</span><span class="cl">a token in the middle of the grammar may need to be backtracked and corrected
</span></span><span class="line"><span class="ln">3</span><span class="cl">the streamed text visibly jumps</span></span></code></pre></div><p><strong>Mitigation</strong>: use a grammar engine with incremental parsing (XGrammar supports this) and avoid scenarios that require backtracking.</p>
<h3 id="5-constrained-decoding-蓋過-function-calling-訓練">5. Constrained decoding overrides function calling training</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">The model was trained on the OpenAI function spec; the application forces a grammar for Anthropic tools
</span></span><span class="line"><span class="ln">2</span><span class="cl">the model output is &#34;valid but semantically empty&#34; (the schema is right, the fields are filled arbitrarily)</span></span></code></pre></div><p><strong>Mitigation</strong>: keep the grammar spec consistent with the spec the model was trained on; do not hand-maintain two different schemas.</p>
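<h2 id="何時不該用-constrained-decoding">When not to use constrained decoding</h2>
<ol>
<li><strong>Free-form / creative output</strong>: writing and brainstorming, where the grammar restricts what the model can express</li>
<li><strong>A reliable model + a simple format</strong>: the model already emits JSON reliably, so the grammar overhead is unnecessary</li>
<li><strong>An overly strict grammar causes semantic errors</strong>: see failure mode 1</li>
<li><strong>Streaming + a complex grammar</strong>: the streaming UX suffers</li>
</ol>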
<h2 id="主流實作詳細">Mainstream implementations in detail</h2>
<table>
  <thead>
      <tr>
          <th>Implementation</th>
          <th>Best suited for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>XGrammar</strong></td>
          <td>Production, high throughput (default in vLLM / SGLang / TensorRT-LLM)</td>
      </tr>
      <tr>
          <td><strong>outlines</strong></td>
          <td>Python scripts, development / experiments, use with HF Transformers</td>
      </tr>
      <tr>
          <td><strong>lm-format-enforcer</strong></td>
          <td>Dynamic grammars, switching schemas at runtime</td>
      </tr>
      <tr>
          <td><strong>guidance</strong></td>
          <td>Microsoft ecosystem, when you want a high-level API</td>
      </tr>
      <tr>
          <td><strong>llama.cpp grammar</strong></td>
          <td>Local GGUF models, GBNF syntax</td>
      </tr>
      <tr>
          <td><strong>OpenAI Structured Outputs</strong></td>
          <td>OpenAI API, JSON schema, transparent to developers</td>
      </tr>
      <tr>
          <td><strong>Anthropic JSON mode</strong></td>
          <td>Anthropic API, simplified variant</td>
      </tr>
  </tbody>
</table>
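<h2 id="何時過時--何時不過時">What will and will not become outdated</h2>
<p><strong>What will not become outdated</strong>:</p>
<ul>
<li>The framing of where constrained decoding is inserted in sampling (before softmax)</li>
<li>The classification into three grammar types (JSON schema / regex / CFG)</li>
<li>The token mask mechanism (illegal tokens get a logit of -∞)</li>
<li>The counter-intuitive conclusion that a correct implementation speeds generation up rather than slowing it down</li>
<li>The classification into 5 failure modes</li>
</ul>
<p><strong>What will change</strong>:</p>
<ul>
<li>The concrete performance and features of XGrammar, outlines, and other implementations</li>
<li>The default grammar engine of mainstream inference servers</li>
<li>Standardization of the JSON schema spec (new versions will appear)</li>
<li>Whether function calling + constrained decoding will be replaced by native multimodal approaches</li>
</ul>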
<h2 id="下一章">Next chapter</h2>
<p>Next chapter: <a href="/blog/llm/03-theoretical-foundations/going-deeper-theory/" data-link-title="3.11 想學更深：推薦公開課程" data-link-desc="Karpathy、Stanford CS224N / CS25 / CS336、DeepLearning.AI、Hugging Face：LLM 理論深入學習的完整路線">3.11 Going deeper</a>, which completes Module 3 (theoretical foundations).</p>
]]></content:encoded></item></channel></rss>