Structured-Output on Tarragon

DSL (Domain-Specific Language)

Thu, 14 May 2026 00:00:00 +0000

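The core idea of a DSL (Domain-Specific Language) is "a small language designed for one specific domain". Unlike a general-purpose programming language, it does not try to solve every problem. It narrows a domain's available operations, data shapes, and constraints into a small, parseable syntax, so that humans, LLMs, and programs can communicate through the same intermediate representation.

Where the concept sits

In LLM applications, a DSL usually sits between natural language and program execution. The model translates user intent into the DSL, and the application then parses, validates, authorizes, and executes it. This is easier to control than letting the model emit arbitrary code, and easier to automate than plain natural language.

User: "Find high-priority billing tickets that have not been handled yet"
  ↓
LLM emits DSL: ticket.where(category="billing", priority="high", status!="done")
  ↓
parser / validator / executor

Observable signals and examples

Phrases like "a specific query language", "workflow mini-language", "policy expression", "filter expression", or "tool command language" all point to DSL candidates. Examples include search filter syntax, monitoring alert rules, data transformation pipelines, support ticket queries, and CI workflow conditions.

The risk of a DSL is that the syntax looks controllable while semantics and permissions remain dangerous. Model-generated DSL must pass through a parser (syntax), a validator (fields and types), authorization (operable scope), and a dry run or preview (side effects). Output that is not general-purpose code still cannot be executed directly.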
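The parse, validate, authorize sequence described above can be sketched in a few lines of Python. This is a minimal sketch, not a production parser: the field table ALLOWED_FIELDS, the helper names, and the rule that a caller may only query categories it has been granted are all assumptions made for illustration, and the regex-based parser only handles the flat ticket.where(...) form from the example.

```python
import re

# Hypothetical contract: field name -> values the validator accepts.
ALLOWED_FIELDS = {
    "category": {"billing", "technical", "account"},
    "priority": {"low", "medium", "high"},
    "status": {"open", "pending", "done"},
}

CALL_RE = re.compile(r'^ticket\.where\((.*)\)$')
COND_RE = re.compile(r'^(\w+)\s*(!=|=)\s*"([^"]*)"$')


def parse(dsl: str) -> list[tuple[str, str, str]]:
    """Syntax only: turn the DSL string into (field, operator, value) triples."""
    call = CALL_RE.match(dsl.strip())
    if call is None:
        raise ValueError("syntax error: expected ticket.where(...)")
    conditions = []
    # Splitting on "," means values containing commas are rejected as syntax errors.
    for part in call.group(1).split(","):
        cond = COND_RE.match(part.strip())
        if cond is None:
            raise ValueError(f"syntax error in condition: {part.strip()!r}")
        conditions.append(cond.groups())
    return conditions


def validate(conditions) -> None:
    """Fields and values: reject anything outside the declared contract."""
    for field, _op, value in conditions:
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {field}")
        if value not in ALLOWED_FIELDS[field]:
            raise ValueError(f"invalid value for {field}: {value}")


def authorize(conditions, granted_categories) -> None:
    """Scope: the query must name a category the caller has been granted."""
    requested = [v for f, op, v in conditions if f == "category" and op == "="]
    if not requested or any(v not in granted_categories for v in requested):
        raise PermissionError("query is outside the caller's granted categories")


conds = parse('ticket.where(category="billing", priority="high", status!="done")')
validate(conds)
authorize(conds, granted_categories={"billing"})
```

Each stage fails with its own exception type, which keeps syntax errors, contract violations, and permission problems distinguishable in an audit log.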

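Design responsibilities

A DSL fits scenarios where the set of operations is fixed, high controllability is required, and an audit trail is needed between natural language and execution. When designing one, first define the minimal syntax, the failure routing, and the states that must be unrepresentable. When the LLM needs to produce the DSL reliably, constrain its output with a grammar or JSON Schema. Next stops: Structured Output and Sampling Constraint.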

Grammar

Thu, 14 May 2026 00:00:00 +0000

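The core idea of a grammar is "formal rules that describe which strings are legal output". In LLM structured output, a grammar is a rule set that a parser or decoder can execute to decide whether JSON, SQL, a DSL, an expression, or a custom format has the expected shape. Here "grammar" means a formal grammar, not the grammar of a natural language.

Where the concept sits

Grammars live in the format-definition layer and are often compiled into token masks by constrained decoding. The difference from a schema is the mode of expression: a schema usually describes data structure and field constraints, while a grammar describes how strings are generated from symbol rules. JSON Schema suits object fields; a grammar suits custom languages, query syntax, bracket structures, and specific text formats.

grammar rules → compiled by the parser / decoder
  ↓
legal tokens computed at each generation position
  ↓
illegal tokens masked out

Observable signals and examples

Rules such as expr: term ("+" term)*, start: object, or <json> ::= ... are grammars. An example is restricting the model to a simplified query language: fields can only be status / owner, operators can only be = / in, and strings must be quoted. The grammar blocks illegal queries at generation time.

The boundary of a grammar is semantics and external state. It can enforce legal syntax, but it cannot know whether owner = "alice" refers to a real user, nor whether the query is permitted. Those checks still belong to the validator, authorization, and business rules.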
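To make the syntax versus semantics boundary concrete, here is a small recognizer for the simplified query language from the example. It is a sketch under stated assumptions: the exact rule set in the comment is a guess at what such a language might look like, and because this tiny language happens to be regular, a regular expression stands in for a full grammar engine.

```python
import re

# Assumed grammar for illustration:
#   query  ::= field ( "=" string | "in" "(" string ("," string)* ")" )
#   field  ::= "status" | "owner"
#   string ::= '"' [^"]* '"'
STRING = r'"[^"]*"'
QUERY_RE = re.compile(
    rf'(status|owner)\s*(=\s*{STRING}|in\s*\(\s*{STRING}(\s*,\s*{STRING})*\s*\))'
)


def is_legal_query(text: str) -> bool:
    """True if the whole string is derivable from the grammar above."""
    return QUERY_RE.fullmatch(text.strip()) is not None


print(is_legal_query('owner = "alice"'))                  # syntactically legal
print(is_legal_query('owner = "no-such-user"'))           # also legal: existence is not a syntax question
print(is_legal_query('priority = "high"'))                # rejected: field not in the grammar
print(is_legal_query('status in ("open", "pending")'))    # legal list form
```

The second call is the point of the example: the recognizer accepts a user name that may not exist, so the existence check has to happen in a later validation step.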

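Design responsibilities

When a custom output format is needed, first decide whether the format is a data structure or a small language: use JSON Schema first for object fields, and a grammar only for small languages or query syntax. Next stops: for grammar notation, read BNF or Lark Grammar; for application-level custom languages, read DSL.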

Structured Output

Thu, 14 May 2026 00:00:00 +0000

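The core idea of structured output is "making the LLM's output conform to a fixed, machine-parseable shape". It addresses whether the application-layer parser can reliably consume model output: the output must be handled deterministically by a JSON parser, schema validator, dispatcher, or workflow engine, rather than by a human reading natural language and guessing the intent.

Where the concept sits

Structured output sits at the boundary between inference and application. Common implementations include JSON mode, JSON Schema, grammar constraints, constrained decoding, and logit masks. It differs from function calling in which layer is responsible: function calling is a tool-calling ability the model acquires through training, while structured output is an inference-time constraint that keeps the output shape stable.

Model capability: knows whether to call a tool and which arguments to fill in
Inference constraint: output must conform to JSON / schema / grammar
Application consumption: parser parses, validator checks, dispatcher executes

Observable signals and examples

"Always output JSON", "classify the result into an enum", "return an object that matches the schema", and "stop making the parser deal with free text" are all structured output scenarios. An example is support ticket classification: the model outputs {"category":"billing","priority":"high"}, and the backend routes directly on those fields instead of extracting keywords from a paragraph of natural language.

The success signals for structured output are the JSON validity rate, the schema match rate, and the downstream parse failure rate. The JSON validity rate only says the text can be read by a parser; the schema match rate says that fields, types, enums, and required properties all satisfy the application contract. Tracking them separately makes it possible to tell a syntax error from a schema error from a wrong semantic judgment by the model.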
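The two rates can be measured separately with very little code. The sketch below assumes a hypothetical contract for the ticket classifier (the ALLOWED table and the sample outputs are invented for illustration) and checks the schema by hand rather than with a schema library.

```python
import json

# Hypothetical application contract for the ticket classifier.
ALLOWED = {
    "category": {"billing", "technical", "account"},
    "priority": {"low", "medium", "high"},
}


def classify(raw: str) -> str:
    """Label one model output as 'invalid_json', 'schema_mismatch', or 'ok'."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    # Required keys must all be present, with no extras.
    if not isinstance(obj, dict) or set(obj) != set(ALLOWED):
        return "schema_mismatch"
    for key, allowed in ALLOWED.items():
        if not isinstance(obj[key], str) or obj[key] not in allowed:
            return "schema_mismatch"
    return "ok"


def rates(outputs: list[str]) -> dict[str, float]:
    labels = [classify(o) for o in outputs]
    n = len(labels)
    return {
        "json_valid_rate": sum(label != "invalid_json" for label in labels) / n,
        "schema_match_rate": sum(label == "ok" for label in labels) / n,
    }


SAMPLE_OUTPUTS = [
    '{"category":"billing","priority":"high"}',    # ok
    '{"category":"billing","priority":"urgent"}',  # legal JSON, enum violated
    '{"category":"billing"}',                      # legal JSON, required field missing
    'category: billing, priority: high',           # not JSON at all
]
print(rates(SAMPLE_OUTPUTS))
```

A gap between the two numbers points at the schema layer; when both are high and routing is still wrong, the remaining problem is the model's semantic judgment.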

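Design responsibilities

Structured output fits output that "downstream will execute automatically": tool arguments, classification, extraction, workflow state, query conditions. Its boundary is semantic quality: a grammar can guarantee a legal format but cannot guarantee that the values the model fills in are correct. Next stops: to understand the token mask mechanism, read Constrained Decoding; to sort out the division of labor with tool calls, read Function Calling; for the full application-layer combination, read 4.6 Application-Layer Protocols.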

Constrained Decoding

Tue, 12 May 2026 00:00:00 +0000

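The core idea of constrained decoding is "at inference time, use a grammar to compute the mask of legal tokens for each position, set the logits of illegal tokens to -∞, and let softmax turn their probability into 0". It is the sampling mechanism behind structured output (the validity guarantee of JSON mode / function calling). Representative implementations: XGrammar, outlines, lm-format-enforcer, guidance, SGLang.

Where the concept sits

How it layers with existing sampling concepts:

model forward pass → logits (one score per vocab token)
  ↓ apply temperature
  ↓ apply grammar mask (constrained decoding)  ← focus of this card
      - compute the set of legal tokens at the current position
      - set the logits of illegal tokens to -∞
  ↓ softmax → probability distribution
  ↓ sampling (greedy / top-p / top-k)
  ↓ next token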

Main grammar types:

Grammar type	Description	Use cases
JSON Schema	A standard JSON schema defines the legal JSON structure	Function calling, structured output
Regex	Regular expression	Constrained text formats (such as phone numbers, email)
CFG (Context-Free Grammar)	A grammar in BNF or similar notation describes the legal syntax	Code generation, DSL, SQL
Choice list	A fixed set of string options	Classification, enum output

Comparison of mainstream implementations:

Implementation	Mechanism	Inference server integration
XGrammar	Pre-compiles the grammar into a token mask cache; very fast	Default in vLLM / SGLang / TensorRT-LLM
outlines	Python library; JSON schema / regex / CFG	Used with Transformers / vLLM
lm-format-enforcer	Lazy compilation; suits dynamic grammars	Hugging Face Transformers
guidance	From Microsoft; higher-level API	Its own server
llama.cpp grammar	Built-in GBNF (GGML BNF)	Built into llama.cpp

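Design responsibilities

When advanced documentation on sampling, structured output, or function calling mentions "constrained decoding", "grammar mask", or "JSON schema enforcement", this is the framing. How to read it when writing code:

When it is worth using: you need 100% legal JSON or a specific format, the function calling spec is strict, or structured output must never fail to parse.
When not to use it: free-form or creative output (it limits the model's expression), or a grammar so strict that the model "cannot say what it should say" (for example, an enum without "unknown" forces the model to pick a wrong option).
Relationship to function calling: function calling is "model training + structured output"; constrained decoding is the engineering implementation at the sampling layer. The two can be combined independently.
Speedup vs slowdown: a common misconception is that a grammar slows generation down. In measurements, pre-compiled implementations such as XGrammar actually speed it up (boilerplate tokens are skipped and only the key tokens are generated, which saves forward passes).
Relationship to chapter 3.10 on constrained decoding: this card is the definition; the chapter covers the internals (token mask computation, CFG compilation, performance trade-offs).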

4.6 Application-Layer Protocols: function calling / structured output / MCP

Mon, 11 May 2026 00:00:00 +0000

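Function calling, structured output, and MCP are the three terms most often conflated when LLM applications go into production. They address problems at entirely different levels: function calling is a model capability (built during training), structured output is a sampling constraint (controlled at inference time), and MCP is a server protocol (standardization at the architecture level). Put each back at its proper level and application design becomes clear. Conflating them produces fundamental misunderstandings such as "I enabled function calling, so why do I still need structured output?" or "Does MCP conflict with function calling?".

This chapter separates the three levels, explains why MCP emerged, and shows how they combine in real applications. Specific spec details (the OpenAI function calling JSON format, the Anthropic tools API, MCP server implementation) are out of scope. Those change every six months; this chapter covers the conceptual structure that still holds after a spec changes.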

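Chapter goals

After reading this chapter you can:

State in one sentence each what problem the three solve.
Tell which layer is meant when you see phrases like "enable function calling", "configure structured output", or "install an MCP server".
Decide which combination an LLM application needs, and which scenarios need only part of it.
Explain why MCP emerged and which successful pattern it reuses.

Level differences among the three concepts

Concept	Problem solved	Layer	Relationship to model training
Function calling	How the model "knows" to call a tool	Model capability	Built during training, written into the weights
Structured output	How model output is consumed deterministically by a parser	Sampling constraint	Controlled at inference time, independent of training
MCP	How an LLM application connects to external tools	Server protocol	Does not involve the model; purely an architectural standard

The three are orthogonal and can be used independently or together:

Function calling without structured output: a model trained for tool use calls tools directly and relies on its own discipline to emit legal JSON.
Structured output without function calling: the model has no tool-use training, so prompt + grammar force a legal output format.
MCP without function calling: MCP standardizes how tools are exposed; the mechanism the model uses to call them does not matter.
All three: function calling makes the model reliable, structured output constrains the format, and MCP supplies the tool ecosystem.

Once this table is memorized, discussions about LLM applications read differently: statements like "this tool supports function calling" and "my application needs MCP" turn out to be about different layers.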

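Function Calling is a model capability

Function calling is an ability the model acquires during training: the SFT stage includes many examples of "user query + which tool to call + which arguments to pass", so the model learns when and how to make a call when it sees a query.

Signals for judging how strong a model's function calling is:

Accuracy of calling when it should and not calling when it should not.
Rate of well-formed calls (no malformed JSON).
Argument accuracy (correct types, reasonable values).
Accuracy of choosing the right tool when several are available.

These four signals vary widely across models, and the root cause is the training data distribution:

OpenAI / Anthropic flagship models see a large volume of function calling examples during SFT and perform reliably.
Open flagship models such as Llama 3 / Gemma 4 / Qwen3 also include function calling in SFT, but the number of examples varies and so does performance.
Small open models (< 14B) are severely undertrained on function calling. Failure rates are high with complex tool schemas, multi-tool selection, and nested arguments; a single tool with a flat schema is still usable.

Why this matters: when you see the claim "this model supports function calling", ask how broad the training example coverage is. Support is not binary; it is a spectrum of training depth.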

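Structured Output is a sampling constraint

Structured output is an inference-time technique, independent of model training. During sampling (the step that picks the next token from the probability distribution), every candidate token is checked against a grammar or schema. The logits (the raw per-token scores before softmax) of illegal tokens are set to -∞, so their probability becomes zero and illegal output can never be sampled.

Main implementation mechanisms (applicability and limitations listed with each):

JSON mode: filters at every sampling step and only allows tokens that keep the JSON legal. Applies to: most OpenAI-compatible APIs support it. Limitation: guarantees legal JSON only, not schema conformance.
Grammar-constrained sampling: a grammar (formal rules that describe legal syntax, usually written in BNF or Lark grammar in practice) describes the complete output shape, and tokens are filtered one by one at inference time. Applies to: strict custom formats (DSLs, specific query languages). Limitation: requires server-side support (llama.cpp and vLLM have it; some cloud APIs do not).
Schema-guided: uses a JSON Schema to decide dynamically which tokens are allowed at each step, enforcing enum / type / required and similar constraints. Applies to: complex structured data. Limitation: high implementation complexity and poor consistency across servers.
Logit bias: adds a bias to specific tokens to steer sampling indirectly; the weakest but most flexible option. Applies to: simple token blacklists / whitelists. Limitation: cannot guarantee structural validity.

Advantages over function calling:

Portable across models: it does not depend on model training, so any model that can be sampled from can use it.
Arbitrary custom formats: not limited to the function spec of OpenAI or any other provider; any schema can be defined.
100% legal output guaranteed: under a grammar constraint, invalid JSON cannot be produced (as long as generation is not cut off by a token limit).

Costs:

Overly strict constraints can conflict with the model's "natural" output: the model wants to say A, the grammar only permits B, and quality drops.
Implementation cost: the inference server has to support grammar parsing and dynamic logit masks, and not every server is mature here.
Decoupled from model training: the model does not "know" it is being constrained, and a model without function calling training may still be "guessing" its way through generation.

In practice structured output and function calling are often combined: function calling training makes the model "naturally" lean toward legal output, and the structured output constraint acts as a backstop that guarantees it "really is legal".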

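MCP is a server protocol

MCP (Model Context Protocol, introduced by Anthropic in 2024) is "a standardized protocol between LLM applications and external tool servers". It belongs to neither the model capability layer nor the sampling layer; it is a higher-level architectural specification.

To understand where MCP fits, recall a historical problem in the LLM ecosystem:

Every LLM application (Cursor, Continue.dev, Claude Desktop, aider, and so on) had to write an adapter for every tool it wanted to connect (file system, database, search, custom APIs). With N applications and M tools the integration cost is N×M, which explodes as the ecosystem grows.

MCP splits that cost into two parts:

LLM application side: implement an MCP client (once), then support any MCP server.
Tool side: implement an MCP server (once), then be reachable by any MCP client.

Integration cost drops from N×M to N+M. This is the same ecosystem effect as the OpenAI-compatible API in Module 0: a standardized intermediary turns ecosystem integration complexity from multiplication into addition.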
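The arithmetic is easy to check with the four applications and four tool types named above. The function names below are purely illustrative, and "one adapter" is treated as one unit of integration work, which is a simplification.

```python
def adapters_without_standard(n_apps: int, m_tools: int) -> int:
    """Every application writes its own adapter for every tool."""
    return n_apps * m_tools


def adapters_with_standard(n_apps: int, m_tools: int) -> int:
    """Each application implements one client, each tool one server."""
    return n_apps + m_tools


apps = ["Cursor", "Continue.dev", "Claude Desktop", "aider"]
tools = ["file system", "database", "search", "custom API"]

print(adapters_without_standard(len(apps), len(tools)))  # 16
print(adapters_with_standard(len(apps), len(tools)))     # 8
```

At this size the saving is only a factor of two; the gap widens as either side grows, since adding one more tool costs N adapters in the first model and a single server in the second.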

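What MCP specifies that "a server should provide":

Tool registration (which tools this server offers).
Tool schema (the parameter definition of each tool).
Tool invocation protocol (how to call + response format).
Resource exposure (readable resources such as files and documents).
Prompt template sharing (reusable system prompts).

All of these live at the protocol layer. How the model uses a tool (function calling or structured output) is outside the MCP specification. MCP does not care how capable your model is; it only governs "how tools are exposed".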

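Why MCP emerged

MCP is the inevitable product of the LLM application ecosystem growing past a certain size. Looking at how the ecosystem evolved:

Early 2023: every LLM app wrote its own tool integrations. Cursor connected to the file system, Continue.dev to the codebase, aider to git, and none of the adapter logic was shared.
By mid-2024: function calling specs had standardized (OpenAI and Anthropic each defined their own). That solved "how the model calls a tool", but "how tools are exposed to the application" was still handled by each vendor separately.
Late 2024: Anthropic proposed MCP, standardizing "tool exposure" as well and completing the ecosystem puzzle.

Reusing the successful pattern of the OpenAI-compatible API:

OpenAI-compatible API: standardizes "interface layer ↔ inference server"; every IDE plugin connects to it.
MCP: standardizes "LLM application ↔ tool server"; every application connects to it.

Both follow the same strategy: define a minimal usable standard, let the ecosystem grow around it, and every player benefits.

Signals for judging MCP maturity (not pinned to one point in time; re-evaluate with these signals):

Application adoption: whether the major LLM applications (Claude Desktop, Cursor, Continue.dev, other mainstream IDE / chat interfaces) support it natively.
Size of the tool server catalog: the number and coverage of community-maintained MCP servers (whether ready-made servers exist for the file system, git, Slack, cloud APIs, and so on).
Integration with the local inference ecosystem: whether local servers such as Ollama and LM Studio support MCP natively (or still rely mainly on the OpenAI-compatible API).
Cross-platform consistency: whether MCP servers behave the same on Windows / macOS / Linux and whether the SDKs are stable.

Until all four signals mature, MCP remains in an expansion phase where "major applications support it and the local ecosystem is just starting to connect". As the signals are met one by one, it is expected to become the default application-layer standard, like the OpenAI-compatible API.

Its relationship to function calling: MCP provides the mechanism for exposing tools, while the model still calls those tools through function calling (if the model supports it) or structured output (if constraints are used). The three stack rather than exclude one another.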

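How the three combine in a real workflow

A typical stack for a complete LLM application:

User prompt
  ↓
LLM application (Claude Desktop / Cursor / in-house app)
  ↓ (MCP client lists all available tools)
MCP server pool (file system server, git server, in-house API server...)
  ↑
LLM application puts the tool descriptions into the prompt
  ↓
Inference server (OpenAI API / Ollama / Anthropic API)
  ↓ (function calling training + structured output constraint)
Model output: "I want to call tool X with arguments Y"
  ↓
LLM application sends the call to the matching server over MCP
  ↓
Server executes and responds
  ↓
LLM application puts the result into the context and returns to the inference server to continue

Each of the three has its own job:

Function calling lets the model emit tool calls reliably (backed by training).
Structured output is the backstop that guarantees the call format is legal (sampling constraint).
MCP supplies the tool ecosystem, so the application does not need a dedicated adapter per tool (architectural standard).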
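The control flow of that stack can be sketched as a loop. Everything here is a stand-in chosen for illustration: TOOLS is a plain dictionary playing the role of the MCP server pool, fake_model is a hard-coded stub in place of an inference server, and the JSON shapes are invented. None of it uses the real MCP SDK or any provider API.

```python
import json
from typing import Callable

# Stand-in for an MCP server pool: tool name -> (description, callable).
TOOLS: dict[str, tuple[str, Callable[..., str]]] = {
    "read_file": ("Read a file by path", lambda path: f"<contents of {path}>"),
    "git_log": ("Show recent commits", lambda limit=3: f"<last {limit} commits>"),
}


def fake_model(context: list[str]) -> str:
    """Hypothetical model stub: requests one tool call, then answers."""
    if not any(line.startswith("tool_result:") for line in context):
        return json.dumps({"tool": "read_file", "arguments": {"path": "README.md"}})
    return json.dumps({"answer": "Summarized README.md"})


def run(prompt: str, max_turns: int = 5) -> str:
    # List the tools and put their descriptions into the context.
    tool_list = "; ".join(f"{name}: {desc}" for name, (desc, _) in TOOLS.items())
    context = [f"tools: {tool_list}", f"user: {prompt}"]
    for _ in range(max_turns):
        # Model output is parsed as JSON (the structured output contract).
        reply = json.loads(fake_model(context))
        if "answer" in reply:
            return reply["answer"]
        name, args = reply["tool"], reply["arguments"]
        if name not in TOOLS:
            context.append(f"tool_result: error: unknown tool {name}")
            continue
        # Dispatch the call to the matching tool and feed the result back.
        _, fn = TOOLS[name]
        context.append(f"tool_result: {fn(**args)}")
    raise RuntimeError("no answer within max_turns")


print(run("Summarize the README"))
```

The max_turns bound is a design choice worth keeping in a real loop too, since a model that keeps requesting tools would otherwise never return control.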

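Things still run with any one missing, but efficiency and ecosystem scalability drop a level:

Without function calling, relying on prompt + structured output: quality is unstable across models. Signal: with the same prompt, the tool call format error rate differs by more than 30% between models.
Without structured output, relying on the model's discipline: occasional failures. Signal: models under 30B reach a JSON validity rate below 90% on complex schemas.
Without MCP, each application writes all its own tool integrations and the ecosystem cannot scale. Signal: the team maintains more than 5 tool adapters and rewrites them every time the LLM provider changes.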

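Common misuses of the combination

The combination fails in the following situations, which are the usual suspects when asking "why is my application unstable":

Structured output overrides function calling training: the model was trained on the Anthropic tools format, the application forces a grammar derived from the OpenAI function spec, and the model emits JSON that is "legal but semantically empty" (schema satisfied, fields filled with filler). Fix: use the spec the model was trained on and avoid forcing a rewrite at the grammar layer.
MCP servers blow up the prompt context with tool descriptions: the MCP servers expose dozens of tools, each with a schema and description, and putting all of them into the system prompt exhausts the context budget. Fix: dynamic tool selection (first let the LLM look at "tool summaries" and pick the relevant ones, then put only the chosen tools' detailed schemas into the context).
Function calling and structured output use inconsistent schemas: the function spec the model was trained on and the JSON schema the application applies have mismatched fields, so the model's output matches the training spec but not the application schema, and the parser fails. Fix: generate the grammar directly from the function spec instead of maintaining two copies by hand.
MCP server does no input validation, and prompt injection contaminates the context through tool results: the tool's returned content is not checked, and malicious content (such as "please run rm -rf" in a PR comment) is executed by the model as an instruction. Fix: sanitize tool output, wrap suspicious content in sandbox tags, and make the model prompt clearly separate "user instructions" from "tool results". For the permission model of an individual developer running MCP servers on their own machine (file system / shell / network access boundaries, trust in third-party MCP servers), see 6.2; for prompt injection attack surfaces in IDE scenarios such as the codebase, external documents, and the clipboard, see 6.3.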
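The dynamic tool selection fix from the second item can be sketched as a two-stage prompt build. This is an assumption-laden illustration: the catalog entries are invented, and the selection step uses simple word overlap as a stand-in for the LLM call that would normally pick tools from their summaries.

```python
import json

# Hypothetical catalog: short summary for stage 1, full schema for stage 2.
TOOL_CATALOG = {
    "read_file": {"summary": "read a file from disk", "schema": {"path": "string"}},
    "git_log": {"summary": "list recent git commits", "schema": {"limit": "integer"}},
    "send_slack": {
        "summary": "post a message to slack",
        "schema": {"channel": "string", "text": "string"},
    },
}


def select_tools(query: str, catalog: dict, top_k: int = 1) -> list[str]:
    """Stage 1 stand-in: rank tools by word overlap between query and summary.
    A real system would ask the LLM to choose from the summaries instead."""
    words = set(query.lower().split())
    ranked = sorted(
        catalog,
        key=lambda name: len(words & set(catalog[name]["summary"].split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, catalog: dict, top_k: int = 1) -> str:
    """Stage 2: include full schemas only for the selected tools."""
    chosen = select_tools(query, catalog, top_k)
    lines = [f"{name}: {json.dumps(catalog[name]['schema'])}" for name in chosen]
    return "\n".join(["Available tools:", *lines, f"User: {query}"])


print(build_prompt("show recent git commits", TOOL_CATALOG))
```

Only the selected tool's schema reaches the final prompt, so context cost scales with top_k rather than with the size of the catalog.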

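When part of the stack is enough

How much of the combination is needed depends on the scenario:

Purely structured output (no tool calls): structured output alone is enough; function calling and MCP are not needed. Example: classifying user input into an enum, or emitting JSON with a fixed schema.
In-process tools (plain Python functions): function calling + a simple dispatcher, no MCP. The most direct option while the application is small.
Tools shared across applications: this is when MCP is needed. If you are only writing an app for yourself, in-process is simpler than MCP.
Weaker models: possibly structured output only, skipping function calling.

The "minimal usable combination" depends on application complexity. Early applications usually start with function calling, add MCP when they scale, and add structured output as a backstop when quality requirements rise. The evolution path does not have to be completed in one step.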

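What becomes outdated and what does not

Parts that will not become outdated:

The boundaries between the three levels (model capability / sampling constraint / server protocol).
The N×M → N+M benefit of standardization, and its parallel with the OpenAI-compatible API.
The design trade-off that the three stack rather than exclude one another.
The framework for judging the "minimal usable combination".

Parts that will change:

MCP was only standardized in 2024-2025 and may evolve or be supplemented by new protocols over the next 5 years (protocol layers update slowly, but they do update).
The concrete format of each vendor's function calling spec (OpenAI / Anthropic / open standards will keep being refined).
The concrete implementation of structured output (grammar engines / JSON mode will keep being optimized).
Which tools have an MCP server available (the ecosystem catalog will grow).

When a new protocol or spec appears, go back to this chapter's three-layer framing and ask: which layer does it address? Can it combine with the other two existing layers? The answers quickly locate the new thing in the stack.

Next chapter: 4.7 Workflow Orchestration Patterns, which catalogs the design patterns for combining multiple LLM calls.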

3.10 Constrained decoding internals: grammar masks and performance trade-offs

Tue, 12 May 2026 00:00:00 +0000

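3.5 sampling-and-decoding covered greedy / beam / top-p / top-k sampling, the basic mechanism for "choosing the next token among legal outputs". 4.6 application-protocols covered function calling / structured output at the application layer. Neither chapter explained the principle behind "why an LLM can guarantee legal JSON output". This chapter fills in the internals of constrained decoding: how the token mask is computed, the three grammar types (JSON schema / regex / CFG), and why implementations such as XGrammar actually speed generation up.

Chapter goals

After reading this chapter you should be able to:

Explain at which step of sampling "grammar enforcement" happens.
Distinguish the use cases of the three grammar types: JSON schema / regex / CFG.
Map implementations such as XGrammar / outlines / llama.cpp grammar onto this chapter's framing.
Judge in which concrete scenarios constrained decoding speeds generation up or slows it down.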

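Where it sits in the sampling stage

Recall the LLM output pipeline (see 3.5):

[forward pass] → logits (vocab_size dimensions, one real number per token)
       ↓ apply temperature (logits / T)
       ↓ apply constrained decoding (focus of this chapter)  ← grammar mask
       ↓ softmax → probability distribution
       ↓ top-p / top-k / sampling
       ↓ next token

Constrained decoding inserts the grammar mask before softmax:

For each position:
  1. The grammar computes the "set of legal tokens" at the current position (a subset of the vocab)
  2. For tokens outside the legal set, the logit is set to -∞
  3. After softmax, illegal tokens have probability 0
  4. Sampling can only pick a legal token

Key point: the grammar does not change the model itself and does not change the logit values (apart from the masked entries); it only restricts the sampling space.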
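The four steps can be written out directly. The sketch below uses only the standard library and a toy four-token vocabulary; the function name masked_softmax and the example numbers are illustrative, and a real server would do the same thing on tensors.

```python
import math


def masked_softmax(logits: list[float], allowed: set[int], temperature: float = 1.0) -> list[float]:
    """Temperature-scale the logits, mask illegal tokens to -inf, then softmax."""
    if not allowed:
        raise ValueError("the grammar must allow at least one token")
    scaled = [x / temperature for x in logits]
    # Step 2: tokens outside the legal set get a logit of -inf.
    masked = [x if i in allowed else -math.inf for i, x in enumerate(scaled)]
    # Step 3: softmax (shifted by the max for numerical stability); exp(-inf) is 0.0.
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 3.0]   # token 3 is the model's favorite
probs = masked_softmax(logits, allowed={1, 2})
print(probs)                    # tokens 0 and 3 have probability exactly 0.0
```

Among the surviving tokens the relative preferences are unchanged: the ratio between the probabilities of tokens 1 and 2 is still exp(1.0 - 0.5), which is what "only restricts the sampling space" means in practice.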

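Three mainstream grammar types

JSON Schema

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer", "minimum": 0}
  },
  "required": ["name"]
}

The LLM output must be legal JSON and conform to the schema. In practice:

Generated so far: '{"name": "alice"'
  ↓ compute the next legal tokens:
  - the output must remain legal JSON
  - name is filled and age is optional, so } is legal, and so is , followed by "age"
  - illegal: '{' / ']' / any other key (assuming the engine rejects properties not listed in the schema)
  ↓ apply the token mask
  → the model can only choose } or , "age"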

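Regex

09\d{2}-\d{3}-\d{3}  # Taiwan mobile phone number format

The LLM output must match the regex. In practice:

Generated so far: '09'
  ↓ compute the next legal tokens:
  - the regex expects \d next
  - legal tokens: tokens whose characters all fit the pattern from this position (for example '1', '12', '12-')
  - illegal: tokens containing letters, or symbols the pattern does not allow at that position
  ↓ token mask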
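Because vocabulary tokens can span several characters, the legality check has to run over the whole token, not just its first character. The sketch below assumes a fixed-length pattern for a Taiwan mobile number written as 09dd-ddd-ddd, represented as one allowed-character set per position; the toy vocabulary is invented, and a real engine would compile an arbitrary regex into an automaton rather than use a position list.

```python
DIGITS = set("0123456789")

# One allowed-character set per position for the pattern 09\d{2}-\d{3}-\d{3}.
PATTERN = [
    {"0"}, {"9"}, DIGITS, DIGITS, {"-"},
    DIGITS, DIGITS, DIGITS, {"-"},
    DIGITS, DIGITS, DIGITS,
]


def token_fits(prefix_len: int, token: str) -> bool:
    """True if every character of the token is allowed at its position."""
    if not token or prefix_len + len(token) > len(PATTERN):
        return False
    return all(ch in PATTERN[prefix_len + i] for i, ch in enumerate(token))


def legal_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Tokens that may follow a prefix already known to fit the pattern."""
    return [tok for tok in vocab if token_fits(len(prefix), tok)]


VOCAB = ["0", "1", "12", "9", "-", "12-", "a", "-3", "123-4"]
print(legal_tokens("09", VOCAB))
```

With the prefix '09', the token '12-' is legal because it fills two digit positions and then the dash, while '123-4' is rejected because its third character lands where the pattern requires a dash.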

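CFG (Context-Free Grammar)

The legal syntax is described in BNF / EBNF:

expr   ::= term ("+" term)*
term   ::= number | "(" expr ")"
number ::= [0-9]+

The LLM output must conform to this grammar. In practice:

Generated so far: '(1+2'
  ↓ the CFG computes the legal next tokens:
  - matched so far: "(" followed by term "+" term
  - legal: another digit (extends the number), ")", or "+" to start a new term
  - illegal: letters, other symbols
  ↓ token mask

A CFG has the most expressive power of the three but is the most complex to implement. SQL and code generation mostly use CFG-based grammars.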
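For the small expression grammar above, the legal next characters can be computed by tracking bracket depth and the last character seen. This is a hand-written sketch for this one grammar only: the depth counter plays the role of the pushdown stack (enough here because there is a single bracket type), the "<eos>" marker is an invented name for end of output, and the function assumes the prefix is already legal.

```python
DIGITS = "0123456789"


def legal_next(prefix: str) -> set[str]:
    """Legal next characters for: expr ::= term ("+" term)*,
    term ::= number | "(" expr ")", number ::= [0-9]+."""
    depth = 0   # number of currently open parentheses
    last = ""   # "" means nothing generated yet
    for ch in prefix:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        last = ch
    # A term must start here: at the beginning, after "(" or after "+".
    if last in ("", "(", "+"):
        return set(DIGITS) | {"("}
    # Otherwise a term has just ended (or a number is in progress).
    legal = {"+"}
    if last in DIGITS:
        legal |= set(DIGITS)      # keep extending the number
    if depth > 0:
        legal.add(")")            # close an open parenthesis
    else:
        legal.add("<eos>")        # a complete expression may stop
    return legal


print(sorted(legal_next("(1+2")))
```

For the prefix '(1+2' the result contains the ten digits, "+" and ")", and excludes "<eos>" because one parenthesis is still open.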

XGrammar's pre-compile mechanism

XGrammar (Dong et al., 2024) is the mainstream high-efficiency implementation of 2024-2025. Core optimizations:

Naive implementation (such as early versions of outlines):
  recomputes the grammar state on every sampling step
  runs a grammar parse for every token
  → high overhead, may slow generation down

XGrammar optimizations:
  1. Pre-compile the grammar → DFA / pushdown automaton
  2. Cache the "legal token mask bitmap" for each grammar state
  3. At sampling time, look up the mask in O(1)
  4. Apply the mask to the logits with bitwise ops
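The precompile-then-look-up idea can be illustrated with a toy automaton. This sketch is not XGrammar's API or data structures: it assumes a fixed-length phone pattern whose state is simply the number of characters generated so far, stores each state's mask as a Python integer used as a bitmap, and uses an invented six-token vocabulary.

```python
import math

DIGITS = set("0123456789")
# Toy automaton: state = characters generated so far, for 09\d{2}-\d{3}-\d{3}.
PATTERN = [
    {"0"}, {"9"}, DIGITS, DIGITS, {"-"},
    DIGITS, DIGITS, DIGITS, {"-"},
    DIGITS, DIGITS, DIGITS,
]
VOCAB = ["0", "9", "12", "-", "12-", "a"]


def fits(state: int, token: str) -> bool:
    end = state + len(token)
    return end <= len(PATTERN) and all(
        ch in PATTERN[state + i] for i, ch in enumerate(token)
    )


def compile_masks(vocab: list[str]) -> dict[int, int]:
    """One-time cost: for every state, a bitmap with bit i set if vocab[i] is legal."""
    table = {}
    for state in range(len(PATTERN)):
        mask = 0
        for i, tok in enumerate(vocab):
            if fits(state, tok):
                mask |= 1 << i
        table[state] = mask
    return table


MASKS = compile_masks(VOCAB)  # built once, before any generation starts


def apply_mask(logits: list[float], state: int) -> list[float]:
    """Decode-time work: one dictionary lookup plus a bit test per token."""
    mask = MASKS[state]
    return [x if (mask >> i) & 1 else -math.inf for i, x in enumerate(logits)]


print(bin(MASKS[2]))                 # which tokens are legal after '09'
print(apply_mask([1.0] * 6, state=4))  # only '-' survives at the dash position
```

All grammar reasoning happens inside compile_masks; the per-token path never inspects the pattern again, which is the property that keeps decode-time overhead small.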

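Effect: the overhead of applying the grammar approaches 0, and generation can even get faster because boilerplate tokens are skipped:

JSON generation without a grammar:
  {     " n a m e "     : " a l i c e " ...
  ←     every token requires a forward pass    →

JSON generation with a grammar:
  fixed tokens ({ " : and so on) that the grammar fully determines are emitted without sampling, and the model generates only the key strings
  fewer forward passes
  → measured speedup of 1.5-3×

Since 2025, mainstream inference servers (vLLM, SGLang, TensorRT-LLM) use XGrammar by default.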

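Performance trade-offs: speedup or slowdown

Common misconception: "constrained decoding slows generation down". In practice it depends on the implementation:

Implementation	Performance
XGrammar (default in vLLM and others)	1.5-3× speedup (fixed tokens skipped, fewer forward passes)
outlines (pre-compiled)	Slight speedup to neutral
outlines (lazy compile)	Slight slowdown
guidance (high-level API)	Neutral to slight slowdown
llama.cpp grammar	Neutral
Lazy / naive implementations	Slowdown

Reading: with a mainstream inference server (vLLM / SGLang) on the XGrammar path, constrained decoding usually speeds generation up; a hand-written naive implementation may slow it down.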

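Relationship to function calling

The two concepts can be used independently or stacked:

Approach	Mechanism
Pure function calling (no constrained decoding)	Relies on model training; validity is not enforced, so parse failures are possible
Pure constrained decoding (no function calling training)	Validity is enforced at inference time, but the model does not necessarily know "when to call a tool"
Function calling + constrained decoding	Training teaches the model when to call; the grammar forces the call format to be legal

Function calling in mainstream commercial APIs (Anthropic / OpenAI / Gemini) usually already uses constrained decoding internally, invisible to developers. Local inference with vLLM / SGLang + XGrammar is also the default combination.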

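Failure modes

1. The grammar is so strict that the model "cannot say what it should say"

The schema forces type to be an enum ["A", "B", "C"]
but the true answer is "none of the above"
→ the model is forced to pick A/B/C and the output is semantically wrong

Mitigation: add a fallback option to the enum ("unknown" / "none"), and do not over-constrain the schema.

2. The CFG is too complex: compilation fails or is slow

A complex CFG (such as a full SQL grammar) takes several seconds to pre-compile
production cold start pays those extra seconds

Mitigation: cache the compiled grammar, or use a simpler grammar (such as "INSERT only" instead of full SQL).

3. The grammar does not match the model's training distribution

The schema requires a very rare JSON structure
the model never saw this structure in training
even though the grammar forces validity, the semantics may be empty

Mitigation: use shapes the model was trained on (function call spec, common JSON), and add few-shot examples for custom schemas.

4. Streaming conflicts with the grammar

Streaming emits output while generating
tokens in the middle of the grammar may need backtracking to be corrected
the streaming UX shows text jumping

Mitigation: use an incremental-parsing grammar (supported by XGrammar) and avoid scenarios that require backtracking.

5. Constrained decoding overrides function calling training

The model was trained on the OpenAI function spec, and the application forces a grammar for Anthropic tools
the model emits output that is "legal but semantically empty" (schema satisfied, fields filled at random)

Mitigation: keep the grammar spec consistent with the spec the model was trained on, and do not maintain two different schemas by hand.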

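When not to use constrained decoding

Free-form / creative output: writing and brainstorming, where a grammar limits the model's expression.
Reliable model + simple format: the model already emits JSON reliably, so the grammar overhead is unnecessary.
A grammar so strict that it causes semantic errors: see failure mode 1.
Streaming + complex grammar: the streaming UX suffers.

Mainstream implementations in detail

Implementation	Best suited for
XGrammar	High-throughput production (default in vLLM / SGLang / TensorRT-LLM)
outlines	Python scripts, development / experiments, use with HF Transformers
lm-format-enforcer	Dynamic grammars, switching schemas at runtime
guidance	From Microsoft; for those who want a high-level API
llama.cpp grammar	Local GGUF models, GBNF syntax
OpenAI Structured Outputs	OpenAI API, JSON schema, invisible to developers
Anthropic JSON mode	Anthropic API, simplified version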

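What becomes outdated and what does not

Parts that will not become outdated:

The framing of where constrained decoding is inserted in sampling (before softmax).
The classification into three grammar types (JSON schema / regex / CFG).
The token mask mechanism (logits of illegal tokens set to -∞).
The counterintuitive conclusion that "a correct implementation speeds generation up rather than slowing it down".
The classification into 5 major failure modes.

Parts that will change:

The concrete performance and features of implementations such as XGrammar / outlines.
The default grammar engine of mainstream inference servers.
Standardization of the JSON schema spec (new versions will appear).
Whether function calling + constrained decoding will be replaced by native multimodal approaches.

Next chapter: 3.11 Going Deeper, which completes the theoretical foundations of Module 3.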