Coding-Agent on Tarragon

Context Budget

Tue, 12 May 2026 00:00:00 +0000

Context budget 的核心概念是「把 context window 視為有限資源、明確規劃 system prompt / tool schema / history / file content / reasoning trace / tool result 各佔多少」。coding agent 的最大失敗模式是「context 用爆 → 模型開始遺忘關鍵指令 → 行為飄」、預算化是 harness 設計的核心責任。

概念位置

典型 coding agent 的 context 構成（以 200K 模型為例）：

 1[1. System prompt + tool schema]：     固定 ~10K-30K
 2   - agent 角色、輸出規則、tool 列表 + spec、subagent 路由
 3   - 經常用 prompt cache 加速、見 [prompt cache 卡]
 4
 5[2. 工作歷史 / conversation history]：  動態 0-60K
 6   - 過去回合的 user query + assistant answer + tool calls
 7   - 越長越貴、harness 要決定何時 summarize / trim
 8
 9[3. 當前任務 file context]：           動態 0-100K
10   - 開啟的檔案、grep 結果、@-mention 帶入的內容
11
12[4. Reasoning trace（若 reasoning model）]：  動態 1K-10K / step
13   - ... 段、每次推論都會佔 context
14
15[5. Tool result]：                    動態 0-50K
16   - file read 結果、bash output、test result
17
18[6. Margin / safety buffer]：         保留 20-30K
19   - 防止 generation 階段碰到 context limit

主流 coding agent 的 25% 規則（context engineering 慣例）：

規則	直覺
Scaffold 部分（1+2） ≤ 25%	留 75% 給「當下任務」、避免 lost-in-the-middle 把指令吃掉
File content ≤ 50%	不全載入大檔、用 grep / chunked read 替代
Margin ≥ 10%	Generation 階段才不會被 context limit 截斷
Reasoning trace 配長 context	Reasoning model 至少配 64K context、見 reasoning-model 卡

設計責任

讀 coding agent 設計 / harness paper 看到「context budget」「context engineering」「token budgeting」就是這 framing。寫 code 場景的判讀：

超出 budget 的訊號：模型開始忽略 system prompt、回答跟前文重複、tool call 重複過去步驟、reasoning trace 截斷
節省 budget 的策略：用 prompt cache 把 system + tool schema 攤平、grep 取代全檔讀、tool result 限長度（如 head -100）、定期 summarize history
跟 lost-in-the-middle 的關係：context 用越多、中段內容 recall 越差、所以「能用 20K 解就別用 100K」、不是「能塞 200K 就塞滿」
不同 task 不同 budget：autocomplete 任務 budget 小（系統 prompt + 最近 50 行 code 就夠）；refactor 任務 budget 大（多檔案）；agent loop 任務 budget 動態（每步可能 grow）

Scaffold vs Harness

Tue, 12 May 2026 00:00:00 +0000

Scaffold 跟 harness 的核心概念是「把 coding agent 拆成『建構時靜態結構』跟『runtime 動態邏輯』兩層」。Scaffold 是建構時就決定的：system prompt 模板、tool schema 註冊、subagent 拓樸；harness 是 runtime 動態運作：tool dispatch、context budget 管理、safety / 中斷、handoff。Claude Code、Cursor、Aider、Codex 這類 coding agent 的內部設計都遵循這個分層。

概念位置

兩層的職責劃分：

 1Scaffold（建構時、static）：
 2  ├── System prompt 模板（角色、約束、輸出格式）
 3  ├── Tool schema 註冊（read_file / write_file / run_bash 等的 spec）
 4  ├── Subagent 拓樸（main agent + 子 agent 的調用關係）
 5  ├── Skill / playbook 註冊
 6  └── 安全 policy（什麼可寫、什麼要 confirm）
 7
 8   ↓ 編譯 / 載入
 9
10Harness（runtime、dynamic）：
11  ├── Tool dispatch（接 LLM tool call、執行、回 result）
12  ├── Context budget 管理（剪裁歷史、塞新內容、不超 25% 規則）
13  ├── Safety / 中斷（confirm UI、permission boundary、可逆性檢查）
14  ├── Error recovery（tool failed → retry / fallback / escalate）
15  └── Telemetry（trace / metrics / cost）

跟既有概念的關係：

概念	跟 scaffold / harness 的關係
System prompt	Scaffold 的核心元件、定義 agent 角色
Tool use	Scaffold 註冊 tool spec、Harness 在 runtime dispatch
Agent loop	Harness 的核心 loop（perceive / reason / act / observe / terminate）
Function calling	Tool spec 的具體 protocol

設計責任

讀 coding agent paper / blog 看到「scaffold」「harness」「context engineering」就是這 framing。寫 code 場景的判讀：

看新 coding agent 時、分兩層拆解：scaffold（system prompt、tool list、subagent 結構）是「設計做了什麼」、harness（context 怎麼裁、tool 怎麼 dispatch、安全怎麼擋）是「runtime 怎麼跑」
修改 / 客製 agent 時、看你動的是哪層：改 system prompt = 動 scaffold；改 tool 執行邏輯 = 動 harness
跟 4.17 coding-agent harness 的關係：本卡是定義、4.12 是 coding 場景的工程實務（context budget、scaffold 模式、harness pattern）

Subagent

Tue, 12 May 2026 00:00:00 +0000

Subagent 的核心概念是「把 coding agent 切成多個專責子 agent、每個有獨立 context window 跟 system prompt、由 main agent 透過 handoff 機制調度」。代表設計：Claude Code 的 Task agent、OpenAI Agents SDK 的 handoff、Anthropic multi-agent research。是「context budget 不夠 + 任務跨多個 specialty」場景的工程選擇。

概念位置

Single agent vs subagent 架構的對比：

 1Single agent（無 subagent）：
 2 Main agent context：
 3 [system prompt + tool schema + 跨所有 specialty 的 history + 所有 file content]
 4 ↓ 容易爆 context、specialty 互相干擾
 5
 6Subagent 架構：
 7 Main agent context（路由 + 高階決策）：
 8 [main system prompt + handoff tool spec + 高階任務歷史]
 9 ↓ 路由到 subagent
10
11 Subagent A context（如「跑測試」專家）：
12 [test-runner system prompt + 測試 tool + 測試相關 file]
13
14 Subagent B context（如「寫 docs」專家）：
15 [docs system prompt + 寫 docs tool + 相關 docs 檔案]

主要好處：

Context budget 隔離：每個 subagent 只看自己 specialty 相關 context、不被別的 specialty 污染
System prompt 專門化：寫 docs 的 system prompt 跟跑測試的 system prompt 不同、各自最佳化
Specialty 路由：main agent 只決定「這個任務該交給哪個 subagent」、不直接做 specialty 工作

主要挑戰：

Handoff 設計：main agent 要怎麼選 subagent、怎麼傳 context、怎麼接 result
跨 subagent 共享狀態：codebase 知識、history、要避免重複 work
失敗模式：subagent 之間互相 deadlock、main agent 失去 high-level view、subagent 邊界劃錯

設計責任

讀 multi-agent / subagent paper / coding agent docs 看到「subagent」「handoff」「Task tool」「specialist agent」就是這 framing。寫 code 場景的判讀：

何時用 subagent：單一 agent context 不夠用、specialty 邊界清楚（如 search / coding / testing / documentation）、main agent 的 system prompt 已太長
何時不用：任務簡單、specialty 邊界模糊（強行拆會增加 handoff overhead）、本地小模型（handoff 機制對小模型不穩）
跟 agent loop 的關係：每個 subagent 內部仍是 agent loop（perceive / reason / act / observe / terminate）、只是 loop 範圍縮窄
跟 scaffold vs harness 的關係：subagent 註冊在 scaffold（建構時）、handoff 在 harness（runtime）執行

4.17 Coding agent harness：scaffold / context engineering / subagent

Tue, 12 May 2026 00:00:00 +0000

教材整體 framing 是「LLM 寫 code 工程實務」、模組四前面 11 章寫的是通用 LLM 應用層原理（RAG / tool use / agent / VLM 等）。本章補上「coding agent 怎麼設計」這層 — 為什麼 Claude Code / Cursor / Aider / Codex 這類工具長那樣、scaffold 跟 harness 怎麼分、context budget 怎麼配。本章把這些設計取捨從特定產品抽出來、寫成跨工具世代不變的工程原理。

本章目標

讀完本章後、你應該能：

用 scaffold vs harness 分層拆解任何 coding agent。
對自己工作流計算 context budget、看到 budget 超標訊號時知道怎麼修。
判斷何時值得拆 subagent、何時用 single agent。
看 Claude Code / Cursor / Aider 等 coding agent 的設計差異、能對應到本章 framing。

Scaffold vs Harness 分層

Coding agent 的內部結構分兩層：

 1Scaffold（建構時靜態結構、編譯 / 載入時就決定）：
 2  - System prompt 模板（agent 角色、輸出約束、錯誤處理 policy）
 3  - Tool schema 註冊（read_file / write_file / run_bash / web_fetch 等 spec）
 4  - Subagent 拓樸（main agent + 子 agent 關係）
 5  - Skill / playbook 註冊（特定任務的 known recipe）
 6  - 安全 policy（permission boundary、要 confirm 的動作清單）
 7
 8Harness（runtime 動態運作、每個 query / loop iteration 跑）：
 9  - Tool dispatch（接 LLM tool call、call function、回 result）
10  - Context budget 管理（剪裁 history、塞新內容、避免超 budget）
11  - Safety / 中斷（confirm UI、permission check、可逆性判斷）
12  - Error recovery（tool failed → retry / fallback / escalate）
13  - Telemetry（trace / metrics / cost、見 [4.20 OTel tracing](/llm/04-applications/llm-tracing-and-observability/)）

不同 coding agent 的 scaffold / harness 比較：

工具	Scaffold 特點	Harness 特點
Claude Code	Skill registry、subagent system、structured permission	強 context budget 管理、explicit handoff、trace
Cursor	Composer + chat + tab、tool list 較簡	IDE-integrated、tool dispatch 在 client + server 切
Aider	跟 git 緊密、edit-format spec	Repl-style、自動 commit、線性 loop
Codex CLI	跟 OpenAI assistants API 對齊	Stream-based、tool call 即時執行
Continue.dev	Plugin-style、provider 抽象	較輕量、tool dispatch 在 plugin host

關鍵理解：所有 coding agent 都遵循這個 framing、差異在「scaffold 多複雜」「harness 多強」、不是有沒有這兩層。

Context Budget 工程實務

Context budget 是 coding agent harness 的核心責任。實務拆分（以 200K context 模型為例）：

元件	預算 %	內容
System prompt + tool schema	5-15%	Agent 角色、輸出約束、tool spec
Conversation history	10-30%	過去回合的 user query + assistant + tool call
Current task file context	30-50%	開啟檔案、grep 結果、@-mention
Tool result（current step）	0-20%	file read / bash output / test result
Reasoning trace（若 reasoning model）	0-15%	`...` 段
Margin / safety buffer	10-20%	Generation 階段不被 context limit 截斷

關鍵 25% 規則：Scaffold 部分（system prompt + tool schema + conversation history）合計不超過 25% context。剩 75% 給「當下任務」、避免 lost-in-the-middle 把指令吃掉。

超標訊號跟對應策略：

超標訊號	緩解策略
模型開始忽略 system prompt 指令	用 prompt cache 把 system prompt 攤平
Tool call 重複過去步驟	History 過長、需要 summarize 舊回合
回答跟前文重複 / 矛盾	中段 lost-in-the-middle、reorder 重要內容到末尾
Generation 被截斷	Margin 不夠、降低 file content 或 history
Reasoning trace 截斷	換更長 context 模型、或拆任務

實作概要：

1每個 turn 開始時、harness 算：
2  available_input = context_window - reserve_margin
3  used = len(system + tool_schema + history + new_content)
4
5  if used > available_input × 0.75：
6    觸發 summarize：把舊 history 壓縮成 1 段摘要
7    或觸發 dispatch：交給 subagent 處理特定子任務、回主 agent 時只帶 summary

Subagent 設計

Subagent 把單一大 agent 拆成多個專責子 agent、各自有獨立 context。何時用：

情境	用 subagent？
Single agent context 撐不住任務複雜度	是
Specialty 邊界清楚（test / docs / refactor 各自有專家）	是
任務簡單（autocomplete、單行修改）	否
Specialty 邊界模糊（強行拆增加 handoff overhead）	否
本地小模型（< 14B）	否（handoff 對小模型不穩）

主流 subagent 模式：

1. Search subagent

Specialty：在大 codebase 找相關片段、不污染 main agent context Tool：grep / find / semantic search Output：top-K 相關段落 + 摘要、main agent 不需要看完整 grep 結果

2. Test runner subagent

Specialty：跑測試、解讀失敗、提出 fix 建議 Tool：run_bash（pytest / jest 等）+ read failed test Output：「測試結果 + 失敗根因 + 建議 fix」、不是完整 test log

3. Docs writer subagent

Specialty：寫 docstring / README / commit message System prompt：強化「寫作風格、語言、長度」、跟 main coding agent 完全不同的 system prompt Output：寫好的 docs 文字

4. Code review subagent

Specialty：對 PR diff 做 review、檢查 style / bug / security Tool：git diff / grep Output：comments 列表

5. Long-running task subagent

Specialty：跑可能持續數分鐘的任務（如 large-scale refactor）、main agent 不阻塞 Tool：背景 process management Output：階段性進度回報 + 最終結果

主 agent 對 subagent 的 handoff 設計：

1main agent 收到任務
2   ↓ 判斷 specialty
3   ↓ 用 dispatch_subagent tool 呼叫
4   tool spec：{name, task_brief, expected_output_format}
5   ↓
6Subagent 在自己 context 內跑完
7   ↓ 回 summary（不是完整 trace）
8   ↓
9main agent 拿到 summary、繼續推進

跟既有概念的關係

既有章節	跟本章的關係
4.3 Tool use	Tool spec 是 scaffold 的核心、tool dispatch 在 harness
4.4 Agent 架構	Agent loop 是 harness 的內部執行迴圈
4.6 應用層協議	Function calling / MCP 是 tool 跟 subagent 之間的協議
4.11 Long context	Context budget 是 long context 的工程實務面
4.18 Prompt caching	是 scaffold 部分（system + tool schema）的 cost / latency 優化
4.19 Agent memory	History 跟 long-term memory 是 harness 跟 storage 的界面

跟具體 coding agent 的 mapping

讀者實際用 / 想客製某個 coding agent 時、用本章的 framing 拆解：

Claude Code

Scaffold：CLAUDE.md（system prompt 入口）、Skills registry、SubagentTypes、tool schema Harness：context budget management、Task tool（dispatch subagent）、permission system、trace 特色：完整 scaffold-harness 分層、強 subagent system、explicit context budget

Cursor

Scaffold：System prompt 較固定、tool list 較簡、Composer mode 是 scaffold variant Harness：IDE 整合度高、tool dispatch 跨 client / server、streaming response 特色：產品優化重於可客製、scaffold 半開放

Aider

Scaffold：edit-format（diff / udiff / whole）+ git integration、tool 較少（read / edit / run） Harness：repl-style loop、自動 commit、線性對話特色：CLI-first、scaffold 簡單、harness 圍繞 git 設計

Continue.dev（搭本地 LLM）

Scaffold：Provider-agnostic、tool list 由 plugin 註冊 Harness：較輕量、tool dispatch 在 VS Code extension host 特色：適合本地 LLM、scaffold / harness 都相對開放

失敗模式跟緩解

Coding agent 常見失敗：

失敗	根因	緩解
Context 用爆、模型失憶	Budget 設計不當	25% 規則、prompt cache、subagent 分擔
Tool call infinite loop	Harness 沒設 step 上限或 cost cap	加 max_steps / max_cost、定期讓 user check
Subagent 答錯仍被 main 採用	Main agent 沒 verify subagent output	加 verification step、let main 看 subagent trace
修改檔案後 test 沒跑	Scaffold 沒強制「先 test 後 commit」	System prompt 加 explicit checklist、harness 加 hook
Reasoning model 配短 context	Reasoning trace 擠壓任務 context	配 64K+ context、或拆任務
Permission boundary 不夠細	Scaffold 安全 policy 太寬	副作用類 tool 拆細、加 confirm UI（見 hands-on permission-boundary）

本地小模型跑 coding agent 的限制

本地 < 14B 模型跑完整 coding agent 通常不穩、根因（跟 3.8 reasoning-models / 4.4 agent-architecture 已述）：

Tool use 不穩：小模型 function calling 訓練不足、tool call 格式錯誤率高
Long context 退化：< 14B 模型 effective context 通常 < 16K、coding agent 場景容易撞 budget
Reasoning 弱：multi-step planning、failure recovery 都需要 reasoning 能力
Subagent handoff 失敗：小模型對「該 handoff 給誰」的判斷不穩

實務組合：

Autocomplete + 簡單 chat：本地 7B-14B coder（Qwen3-Coder / Gemma 4 coder）可勝任
完整 coding agent：30B+ 本地模型或雲端旗艦
混用：本地小模型當 autocomplete + 雲端旗艦當 agent

何時過時 / 何時不過時

不會過時的部分：

Scaffold vs harness 分層 framing
Context budget 配額概念跟 25% 規則
Subagent 設計原則跟 handoff 機制
失敗模式分類（context 爆、infinite loop、permission 邊界）
本地小模型限制

會變的部分：

具體 coding agent（Claude Code / Cursor / Aider 等持續演化）
Subagent registry 標準化（目前各家不同）
Tool schema 標準化（MCP 是其中一條路）
本地小模型的 agent 能力（會逐步追上）

下一章：4.18 Prompt caching 工程實務、看 scaffold 部分的 cost / latency 優化。

4.18 Prompt caching 工程實務：cost / latency 最大槓桿

Tue, 12 May 2026 00:00:00 +0000

Prompt cache 把重複 prefix 的計算結果在 LLM 服務端跨 request 持久化、後續 query 跳過 prefill 階段。Anthropic / OpenAI / Bedrock / Gemini 都列為 cost 跟 TTFT 的最大單一槓桿 — 90% cost 折扣 + 顯著 latency 改善。本章把 prompt caching 的運作機制、設計原則、coding agent / long-context 場景的 pattern、常見 anti-pattern 拆成可操作的工程實務。

注意三層 cache 概念的層次差異（prompt cache 卡片有完整對比表）：KV cache 是單次推論內、過去 token 的 K/V 暫存（autoregressive 才省重算）；prefix cache 是同一推論伺服器內跨 request 共用 KV cache；prompt cache（本章聚焦） 是雲端 LLM API 商業 feature、跨 request 跨時間、有 TTL。三者不同層、要區分。

本章目標

讀完本章後、你應該能：

解釋 prompt cache 跟 KV cache / prefix cache 的層次差異。
對 coding agent / RAG / long-conversation 場景設計 cache breakpoint。
估算自己應用開 prompt cache 的 cost / latency 收益。
看到「cache 不命中」訊號時、能定位 anti-pattern 並修。

Prompt cache 怎麼運作

LLM 推論的 prefill 階段對整個 prompt 算 KV cache、是長 prompt 的主要 latency 跟 compute 成本：

1無 cache：
2  Request 1：[10K system prompt] + [tool schema 5K] + [user query 500] = 15.5K prefill
3  Request 2：[10K system prompt] + [tool schema 5K] + [user query 700] = 15.7K prefill
4  → 兩次都付 15K prefill 成本

開 prompt cache 後：

1Request 1：[10K system + 5K tool schema] | cache_control | + [user query 500]
2  → 算出 prefix 的 KV cache、寫進服務端 cache（付 1.25× cost）
3  → 後段 prefill 500 token
4
5Request 2（5 分鐘內）：[10K system + 5K tool schema] | + [user query 700]
6  → 服務端命中 cache、跳過 prefix 的 prefill（付 0.1× cost = 90% 折扣）
7  → 只 prefill 700 token
8  → TTFT 大幅降低

關鍵運作細節：

Cache key = prefix 的 token sequence：完全相同的 token sequence 才命中、差一個 token 就 miss
TTL（time-to-live）：cache 過一段時間（多數 5 min）自動失效、要 ext 1h 通常付額外 cost
Write 比原價略貴、Read 大幅打折：Anthropic 模型 write 1.25×、read 0.1×；OpenAI 模型 read 0.5×
Minimum cacheable size：通常 1K-4K token 起跳、短 prompt 不適合
Cache 範圍：跨 request、跨 conversation、跨 session、但同一 model + 同一 region

Cache breakpoint 設計

Anthropic 用 cache_control 標記顯式 breakpoint、OpenAI 用自動偵測。但設計原則一致：把不變的內容放 prefix、變動的放後面。

典型 coding agent 的 prompt 結構：

1[1. System prompt]：agent 角色、規則、輸出格式             ← 不變
2[2. Tool schema]：所有 tool 的 spec                       ← 不變（除非加新 tool）
3[3. Skill registry / playbook]：known recipes              ← 半變（偶爾更新）
4[4. Codebase context]：固定載入的核心檔案                  ← 半變
5       ↓ cache_control breakpoint ↑
6[5. Conversation history]：過去回合                       ← 變動
7[6. Current user query]：當前 query                       ← 變動
8[7. Current tool result]：剛跑完的 tool output             ← 變動

Breakpoint 放在「不變 vs 變動」交界處、讓 [1-4] 永遠 cache hit。

Anthropic 最多 4 個 breakpoint、可分層：

1breakpoint 1（最早）：[system prompt] → 永久 cache
2breakpoint 2：       [+ tool schema] → 永久 cache
3breakpoint 3：       [+ skill registry] → 半永久 cache
4breakpoint 4（最晚）：[+ recent stable context] → 短期 cache
5[後段]：             variable content（不 cache）

每個 breakpoint 各自命中 / miss、layered cache 讓「加新 skill」只 invalidate breakpoint 3 之後、不影響 breakpoint 1-2。

場景 1：Coding agent

Coding agent 是 prompt cache 命中區 — system prompt + tool schema 動輒 10K-30K token、每個 user turn 都重用。

收益估算（200K context 模型、10K scaffold、5K user query、3K answer）：

 1無 cache：
 2  每 turn input cost = (10K + 5K) × $3/M = $0.045
 3  每 turn TTFT = 10K-15K prefill time（200-400ms）
 4
 5開 cache：
 6  Turn 1（write）：(10K × 1.25 + 5K) × $3/M = $0.0525
 7  Turn 2-N（read）：(10K × 0.1 + 5K) × $3/M = $0.018
 8  TTFT：read 階段省掉 10K prefill、只剩 5K
 9
1010 turns 的累計 cost：
11  無 cache：10 × $0.045 = $0.45
12  開 cache：$0.0525 + 9 × $0.018 = $0.215
13  → 節省 52%

長對話越長、cache 收益越大（cache write 是一次性成本）。

場景 2：RAG / long-context

RAG 場景把 retrieved chunks 放 prefix、user query 放後面、可以 cache retrieved chunks：

1[system prompt]
2       ↓ breakpoint 1（system 永久 cache）
3[retrieved chunks 來自 RAG]
4       ↓ breakpoint 2（同 chunks 在 5min 內 cache）
5[user query]

注意：每次 retrieval 不同 chunks 就 cache miss、所以 cache 適合「同個對話多輪、retrieval 結果穩定」、不適合「每 query 都 fresh retrieve」；後者要回到 retrieval cost 評估。

場景 3：Long document Q&A

讀者上傳 PDF / 文件、多輪問問題：

1[system prompt]
2       ↓ breakpoint 1
3[完整文件內容（可能 100K token）]
4       ↓ breakpoint 2（文件永久 cache）
5[user query]

第一次 query 付 1.25× 文件成本、後續 query 都 0.1×。100K 文件 + 10 個問題的場景下、節省極顯著（> 80% cost）。

常見 anti-pattern

在 prefix 插入 timestamp / request-id

1反例：System prompt: "你是 coding assistant、當前時間 2026-05-12 16:30:42、..."
2   → 每秒不同 cache key、永遠 cache miss、付 1.25× write 不回本
3正解：把 timestamp 放後段、或省略（多數場景模型不需要精確時間）

在 prefix 動態插入 user metadata

1反例：System prompt: "User: alice@example.com, plan: premium、..."
2   → 每個 user 不同 cache、命中率低
3正解：User metadata 放後段、prefix 保持 user-agnostic

Tool schema 順序不固定

1反例：每次 request 把 tool list 隨機 shuffle
2   → 同樣 tool 但 token sequence 不同、cache miss
3正解：Tool list 順序固定、新加 tool 都 append 到末尾

太短的 prompt 也想 cache

1反例：500 token system prompt 開 cache
2   → 多數服務商 minimum 1K-4K、不到門檻不 cache、且 write cost 不回本
3正解：Cache 留給 > 1K 的 prefix、短 prompt 不必開

混用 stream + cache 卻不檢查命中

1反例：開 cache 後不檢查 response 的 cache_read_input_tokens 欄位
2   → 不知道實際命中率、可能 anti-pattern 已在燒 cost 沒察覺
3正解：監控 cache_read / cache_creation token 比例、低於 80% 命中率時 debug

Cache miss 訊號跟診斷

訊號：

Cost 比預期高：應該命中的場景仍付 full price
TTFT 沒改善：cache hit 應該大幅降 TTFT、沒改善 = miss
Response 的 usage 顯示 cache_read = 0：直接訊號

診斷流程：

11. 印出 raw request 的 prefix（cache_control 之前）
22. 比對連續兩次 request 的 prefix token sequence
33. 找出差異位置（diff）
44. 移除 / 重構讓兩次 prefix 完全相同
55. 跑 2-3 次 request、看 cache_read_input_tokens 是否上升

常見差異：timestamp、request id、user id、tool list 順序、retrieved chunks 順序、conversation summary 變動。

跟其他 cost 優化技巧的關係

技巧	攻擊的 cost / latency 來源	跟 prompt cache 的關係
Speculative decoding	Generation 階段 token cost	正交、可疊加
Batching	Throughput per GPU	Production 才用、跟 prompt cache 都用
Prefix cache	同 server 跨 request 共用 KV cache	本地推論伺服器特性、prompt cache 是雲端 API 商業 feature
模型量化	Generation tok/s	正交、可疊加
RAG 而非 long context	Input token 量	RAG + cache 可同時用

本地推論伺服器有沒有類似機制

Ollama / LM Studio / llama.cpp 自身的 prompt cache：

工具	機制	範圍
llama.cpp	`--prompt-cache` flag、persistent file	重複跑同樣 prompt 時跳過 prefill
Ollama	內建 prefix cache、跨 request 共用	同 server 跨 request
LM Studio	同 Ollama 級別、視版本	同上
vLLM	強 prefix cache（PagedAttention 設計支援）	高併發 production

本地推論的 cache 主要靠 prefix cache 機制、跟雲端 API 的 prompt cache 商業 feature 同源、但定價 / TTL / 顯式 control 是雲端 API 才有的 product layer。

何時不適合用 prompt cache

每 request prefix 必變：streaming 任務、每 query 都帶 fresh 上下文
Single-shot 對話：用完就丟、沒有重複使用、write cost 不回本
Prefix < 1K token：不到 minimum、cache 不生效
Cost 不敏感場景：個人小流量、cache 設計 overhead 大於收益
本地推論為主：本地多用 prefix cache、prompt cache 是雲端 API 概念

何時過時 / 何時不過時

不會過時的部分：

「不變放 prefix、變動放後段」的設計原則
Cache breakpoint 分層（system / tool schema / skill / context）
Anti-pattern 分類（timestamp、user metadata、tool 順序）
Cache miss 診斷流程

會變的部分：

各 vendor 的具體定價（write × / read × 折扣）
TTL（5min vs 1h）的可選性跟價格
Automatic vs explicit cache（OpenAI vs Anthropic 路線）
Breakpoint 上限數量
本地推論伺服器的 cache 功能（持續演化）

下一章：4.19 Agent memory 分層、看 agent 如何在 context window 之外管理長期狀態。