LLM-as-Judge on Tarragon

Hands-on: Running a judge harness with a local LLM (minimum viable version)

Tue, 12 May 2026 00:00:00 +0000

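Chapter 4.21 (LLM-as-Judge) covers the principles. This article uses Ollama / LM Studio to run a minimum viable judge harness locally and to run a systematic eval over real cases from your own workflow. It is a particularly good fit for privacy-sensitive scenarios: the eval data (user queries, agent outputs, possibly PII) never has to be sent to the cloud.

The framing here is "something that actually runs, not just a demo", so it covers hardware budget estimates, judge model selection, bias mitigation, a calibration workflow, and an extension that hooks into production traces. Terminology follows the LLM-as-Judge and LLM Tracing entries.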

Verified on: 2026-05-12
Environment: M4 Max 64GB, or a PC with 24GB+ VRAM, plus Ollama
Judge model: DeepSeek-R1-Distill-Qwen-32B or QwQ-32B (reasoning models are more stable as judges)

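Why use a local LLM as the judge

Compared with a cloud judge (GPT-5 / Claude 4):

Dimension	Local judge	Cloud judge
Cost	Effectively 0 (electricity only)	$0.001-0.01 per item
Privacy	Fully local; eval data never leaves the machine	Sent to the cloud; subject to provider policy
Latency	Hardware-dependent; roughly 30-60s for a 30B reasoning model	5-30s per API call
Quality ceiling	A local 30B reasoning model approaches mid-tier 2024 cloud models	Cloud flagships have a higher ceiling
Large batches	Slow but zero cost	Fast, but cost accumulates

How to read this:

Large-volume production trace eval (a thousand items or more) plus privacy-sensitive data → local judge
A small number of high-stakes evals (< 50 items) → cloud flagship judge
Fast A/B test iteration → cloud (latency matters)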

Hardware budget

The choice of judge model depends on the hardware:

Hardware	Suitable judge model	Expected latency / item
M4 Pro 24GB / 4090 16GB	Qwen2.5-32B Q4 or DeepSeek-R1-Distill-14B	30-60s
M4 Pro 36GB	DeepSeek-R1-Distill-Qwen-32B Q4	60-120s
M4 Max 48-64GB / 5090 24GB	QwQ-32B or DeepSeek-R1-Distill-Qwen-32B Q6	60-180s (including the reasoning trace)
M4 Max 128GB / multi-GPU PC	Llama 3.3 70B or Qwen2.5-72B	120-300s

Note: a reasoning model's thinking trace stretches latency, so large batches need time planning (100 items × 60s = 100 min).

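When a local judge is not a good fit:

Hardware below M4 Pro 24GB / 4090 16GB (e.g. M1/M2 16GB, or a PC without a discrete GPU): a 32B reasoning model is too tight a fit; forcing it causes swapping and latency blows up 5-10×. Use a 14B instruct model (e.g. Qwen2.5-14B Q4) as the judge instead, or go straight to a cloud judge
Batch size × latency exceeds the wait you can accept: 100 items × 60s/item = 100 min; 500 items × 120s ≈ 17 hr. If the estimate exceeds 4 hr, switch to a cloud batch API
The eval task is too nuanced: for fine-grained ethical / legal / high-stakes judgments a local 32B distill is not capable enough; use a cloud flagship judge or human review
Calibration phase: on the first runs you need to iterate on the rubric quickly, and a cloud judge's short latency (5-30s) suits that better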
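The batch arithmetic above is easy to script before committing to a run. The following is a minimal sketch, not part of the harness itself: the function names and the `workers` parameter are illustrative assumptions, and the 4-hour cutoff is simply the threshold quoted in the list.

```python
def estimate_batch_hours(n_items: int, seconds_per_item: float, workers: int = 1) -> float:
    """Wall-clock hours for a batch, assuming items are spread evenly across workers."""
    if n_items < 0 or seconds_per_item < 0 or workers < 1:
        raise ValueError("n_items and seconds_per_item must be >= 0, workers >= 1")
    return n_items * seconds_per_item / workers / 3600


def pick_judge_backend(n_items: int, seconds_per_item: float, max_hours: float = 4.0) -> str:
    """Return "local" if the estimated run fits within max_hours, else "cloud-batch"."""
    hours = estimate_batch_hours(n_items, seconds_per_item)
    return "local" if hours <= max_hours else "cloud-batch"


if __name__ == "__main__":
    for n, secs in [(100, 60), (500, 120)]:
        hours = estimate_batch_hours(n, secs)
        print(f"{n} items x {secs}s = {hours:.1f} hr -> {pick_judge_backend(n, secs)}")
```

Measure `seconds_per_item` on a handful of real items first; reasoning traces vary a lot in length, so a single sample can mislead.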

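Overall flow

1. Collect the eval dataset        → JSONL: one (input, output) pair to grade per line
2. Design the rubric               → scoring dimensions, scale, explicit anti-patterns
3. Write the judge prompt          → four sections (task / input-output / rubric / format)
4. Run the harness                 → call the judge for each input, parse the JSON output
5. Aggregate the results           → compute the average score, find outliers, read the reasoning
6. Calibration (optional)          → compare against human eval, tune the rubric
7. Hook into production traces     → run on production samples periodically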

Step 1: Collect the eval dataset

JSONL format (one record per line):

{"id": "001", "input": "Write a fibonacci function in Python", "output": "def fib(n):\n    if n <= 1:\n        return n\n    return fib(n-1) + fib(n-2)"}
{"id": "002", "input": "Explain what this code does: [code]", "output": "This code implements ..."}
{"id": "003", "input": "[bug description]", "output": "[suggested fix]"}

Sources:

Past conversation logs between you and an LLM in Continue.dev / Cursor
Production agent traces (manual export, or a LangSmith / Phoenix dump)
30-100 typical cases that you hand-craft yourself

Save it as data/eval.jsonl.
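Because a long judge run is expensive to restart, it can help to check the dataset before Step 4. The sketch below is an optional pre-flight check, not part of the original harness; `validate_eval_lines` and `REQUIRED_KEYS` are illustrative names, and the only schema assumed is the id / input / output shape shown above.

```python
import json

REQUIRED_KEYS = ("id", "input", "output")


def validate_eval_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means the data is usable."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, matching the harness loader
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((lineno, f"missing keys: {missing}"))
            continue
        record_id = str(record["id"])
        if record_id in seen_ids:
            problems.append((lineno, f"duplicate id: {record_id}"))
        seen_ids.add(record_id)
    return problems
```

Typical use is `validate_eval_lines(open("data/eval.jsonl", encoding="utf-8"))`, fixing anything it reports before starting the judge.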

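Step 2: Design the rubric

Design it per task type. An example rubric for coding tasks:

Scoring dimensions:
1. Correctness (does the code run, is the logic right): 1-5
2. Style (does it follow codebase conventions and customary naming): 1-5
3. Completeness (does it fully solve the user request): 1-5

Scoring rules:
- 5: flawless, can be merged as-is
- 4: usable with minor fixes, correct overall
- 3: right direction, needs substantial changes
- 2: partly right, main logic is wrong
- 1: completely wrong, misleads the user

Explicitly not rewarded (mitigates verbosity bias):
- Length / verbosity (an equally correct short answer = a long answer)
- Apologies / preambles
- Pleasantries such as "I hope this helps"
- Excessive markdown decoration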

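Step 3: Judge prompt template

Save it as the file prompts/judge.txt:

You are an LLM output quality evaluator. Evaluate the quality of a coding assistant's answer to a user request.
Important: stay impartial, ignore stylistic preferences, and focus on substantive quality.

User request:
{input}

Assistant response:
{output}

Scoring dimensions (1-5 each; summarize them in overall):

1. Correctness: does the code run, is the logic right
   5: flawless
   4: usable with minor fixes
   3: right direction, needs substantial changes
   2: partly right, main logic is wrong
   1: completely wrong

2. Style: follows codebase conventions
   1-5, same scale

3. Completeness: fully solves the user request
   1-5, same scale

Explicitly not rewarded:
- Length / verbosity (an equally correct short answer = a long answer)
- Apologies / preambles
- Pleasantries such as "I hope this helps"
- Excessive markdown decoration

Respond with the following JSON (no extra text, no markdown code fence):
{
  "correctness": <1-5>,
  "style": <1-5>,
  "completeness": <1-5>,
  "reasoning": "<brief explanation, under 100 words>",
  "overall": <1-5>
}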

Step 4: Run the harness

A minimum viable version in Python:

# judge_harness.py
import json
import re
from pathlib import Path

import requests

JUDGE_MODEL = "deepseek-r1:32b"  # or qwq:32b
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint


def load_dataset(path):
    """Load the JSONL eval dataset, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_prompt_template(path):
    return Path(path).read_text(encoding="utf-8")


def render_prompt(template, item):
    """Fill {input} / {output} in a single pass.

    str.format() cannot be used here: the template's output-format section
    contains literal JSON braces, which format() would try to interpret.
    """
    return re.sub(r"\{(input|output)\}", lambda m: str(item[m.group(1)]), template)


def call_judge(prompt):
    """Call the Ollama judge model and return the raw response text."""
    resp = requests.post(OLLAMA_URL, json={
        "model": JUDGE_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,  # a low temperature keeps the judge stable
        "stream": False,
    }, timeout=600)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of a confusing KeyError
    return resp.json()["choices"][0]["message"]["content"]


def parse_judge_output(text):
    """Parse the judge's JSON, tolerating a leading <think>...</think> trace."""
    # Skip the reasoning trace
    if "</think>" in text:
        text = text.split("</think>")[-1]

    # Locate the JSON block
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        return None
    try:
        parsed = json.loads(text[start:end])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def run_harness(dataset_path, prompt_template_path, output_path):
    dataset = load_dataset(dataset_path)
    template = load_prompt_template(prompt_template_path)

    results = []
    for i, item in enumerate(dataset):
        raw = call_judge(render_prompt(template, item))
        parsed = parse_judge_output(raw)
        results.append({
            "id": item["id"],
            "scores": parsed,
            "raw_judge_output": raw[:500],  # keep the first 500 chars for debugging
        })
        overall = parsed.get("overall") if parsed else "FAIL"
        print(f"[{i+1}/{len(dataset)}] id={item['id']} overall={overall}")

    # Write the results as JSONL (create results/ if it does not exist yet)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

    # Aggregate over items whose overall score parsed as a number
    valid = [r for r in results
             if r["scores"] and isinstance(r["scores"].get("overall"), (int, float))]
    if valid:
        avg = sum(r["scores"]["overall"] for r in valid) / len(valid)
        print(f"\nAggregate: {len(valid)}/{len(results)} valid, avg overall = {avg:.2f}")


if __name__ == "__main__":
    run_harness("data/eval.jsonl", "prompts/judge.txt", "results/eval.jsonl")

Run it:

# Make sure the judge model has been pulled first
ollama pull deepseek-r1:32b

# Run the harness
python judge_harness.py

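Step 5: Aggregate and inspect outliers

After the run, results/eval.jsonl holds the scores and reasoning for every item. To see which ones are outliers:

# Find cases with overall < 3 (low scores worth reviewing); items whose judge output failed to parse are excluded
jq 'select(.scores != null and .scores.overall < 3)' results/eval.jsonl

# Read the reasoning to spot systematic problems
jq '.scores.reasoning' results/eval.jsonl | sort -u

How to read this:

Mostly 4-5 with a few 1-2: overall quality is good; focus on the low-scoring cases and fix them
Mostly 2-3: a systematic problem; change the prompt / model / agent design
Bimodal distribution (many 5s and many 1s): the tasks may cluster by difficulty; do a stratified analysis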

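Step 6: Calibration (optional but recommended)

Compare against human eval to confirm the judge is aligned:

1. Sample 30 items from the dataset (covering the difficulty / score distribution)
2. Grade them yourself (human eval, using the same rubric)
3. Compare the judge's and the human's overall scores
4. Compute the Spearman correlation
   - > 0.7: the judge is aligned well enough to trust
   - 0.5-0.7: partial problems; revise the rubric
   - < 0.5: the judge is not trustworthy; switch models or rewrite the rubric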
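The correlation in step 4 can be computed without extra dependencies. This is a minimal sketch assuming two equal-length lists of overall scores; the function names and the verdict labels are illustrative. Because 1-5 scores tie constantly, it uses average ranks and then Pearson correlation on the ranks, which is the tie-aware form of Spearman.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side has no variance")
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman correlation as Pearson correlation of tie-aware ranks."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))


def verdict(rho):
    """Map the correlation onto the thresholds listed above."""
    if rho > 0.7:
        return "trustworthy"
    if rho >= 0.5:
        return "revise rubric"
    return "untrustworthy"
```

If SciPy is already installed, `scipy.stats.spearmanr` gives the same statistic; the hand-rolled version just keeps the harness dependency-free. With only 30 items the estimate is noisy, so treat values near a threshold as borderline.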

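Common causes of low correlation:

The rubric is too vague, so the judge improvises
The judge model is not capable enough (switch to a stronger judge)
Verbosity / position bias has not been mitigated
The eval task is far from the judge's training distribution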

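Step 7: Hook into production traces (extension)

Export the production traces collected in 4.20 (LLM tracing) as JSONL and run the judge on them periodically:

# Assuming self-hosted Langfuse; this export command is illustrative, so substitute whatever export path your deployment provides
langfuse export --filter "user_feedback=negative" --output traces.jsonl

# Convert to the eval format
python convert_trace_to_eval.py traces.jsonl > data/eval-from-prod.jsonl

# Run the judge (point run_harness at data/eval-from-prod.jsonl first)
python judge_harness.py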
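The conversion script is not shown in this article. As a rough sketch of what convert_trace_to_eval.py could look like: it assumes each exported line is a JSON object with top-level `id`, `input`, and `output` fields, which is a hypothetical shape that you should adapt to whatever your tracing platform actually exports.

```python
import json
import sys


def to_text(value):
    """Flatten a trace field to text: strings pass through, other JSON is re-serialized."""
    if isinstance(value, str):
        return value
    return json.dumps(value, ensure_ascii=False)


def trace_to_eval(trace):
    """Map one trace dict to an eval record, or None if it lacks an input or output."""
    if trace.get("input") is None or trace.get("output") is None:
        return None
    return {
        "id": str(trace.get("id", "")),
        "input": to_text(trace["input"]),
        "output": to_text(trace["output"]),
    }


def convert(lines):
    """Yield eval-format JSONL lines for every usable trace line."""
    for line in lines:
        if not line.strip():
            continue
        record = trace_to_eval(json.loads(line))
        if record is not None:
            yield json.dumps(record, ensure_ascii=False)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        for out_line in convert(f):
            print(out_line)
```

Traces without both an input and an output are dropped silently here; in a real pipeline you would probably want to count them so that a schema change does not quietly empty the dataset.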

This is the local version of the production quality engineering loop: a cost-free alternative for privacy-sensitive scenarios.

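Failure modes

The judge does not emit valid JSON: a reasoning model may still add markdown or explanations after the <think>...</think> block

Mitigation: skip the trace when parsing and parse tolerantly, or enable constrained decoding (llama.cpp grammar)

Latency is too long and the batch never finishes: a 32B reasoning model takes 60-120s per item, so 100 items take roughly 2 to 3 hours

Mitigation: use a lighter judge model (e.g. Qwen2.5-32B instruct, non-reasoning), or split the batch and run it in parallel

Judge bias is not mitigated: local and cloud judges both exhibit verbosity / position bias

Mitigation: spell it out in the rubric; for pairwise comparisons, run twice with the positions swapped

Capability ceiling of a local judge: a 30B distill reads nuanced cases less well than a cloud flagship

Mitigation: add spot human review for critical cases, or mix local (for volume) with cloud (for a selected sample)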

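Relationship to other chapters

For the principles behind LLM-as-judge design, see 4.21
For hooking into production traces, see 4.20 (tracing)
For choosing a reasoning model, see 3.8
For privacy and judgments about crossing the cloud boundary, see 6.4
For the layering of benchmarks versus in-house eval, see 4.14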

4.21 LLM-as-Judge evaluation methods

Tue, 12 May 2026 00:00:00 +0000

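4.14 (benchmarking-and-evaluation) covered capability benchmarks (MMLU, SWE-bench, and so on) and the concept of an in-house benchmark. But the operational question of how to systematically evaluate real cases from your own workflow was only touched on there. This chapter fills that gap with LLM-as-Judge: the de facto standard eval method for production AI apps, 500-5000× cheaper than human eval, with 80%+ agreement with humans, but with biases that must be handled.

Where the judge sits in an eval system: 4.13 (Eval design coordinate system) splits eval along three axes into eight octants and decides which tool belongs in which octant. The judge belongs on the subjective axis (behavior with no ground truth), not the objective axis (where ground truth exists and a deterministic check is cheaper and more accurate). Before reading this chapter, read the part of 4.13 on choosing the wrong axis, to avoid the common anti-pattern of turning every eval into a judge.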

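Chapter goals

After reading this chapter, you should be able to:

Distinguish the three eval paths: LLM-as-Judge, standard benchmarks, and human eval.
Design a reproducible judge rubric (four sections: input / output / rubric / reasoning).
Use pairwise versus direct scoring, and know when to use which.
Mitigate the three major biases (position / verbosity / self-preference).
Feed production traces back to the judge to form an automated eval loop.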

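Why LLM-as-Judge is needed

4.14 argues that "the in-house benchmark is the final test", but leaves a gap at the operational level:

Eval pain point	LLM-as-Judge answer
Standard benchmarks do not match your use case	The judge runs on your own cases with a custom rubric
Human eval is too expensive / too slow	The judge runs automatically at $0.001-0.01 per item
Production traces are too numerous to read by hand	Running the judge on 100% of production traces is feasible
Rule-based eval misses semantic problems	The judge can decide whether an answer matches the intent even when the wording differs
Iteration needs fast feedback	The judge finishes 100 items in minutes, so a prompt change can be retested right away

Main use cases (repeating the LLM-as-Judge card): in-house benchmark, production trace eval, A/B test, synthetic data quality.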

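Judge prompt structure

A reproducible judge needs a four-section prompt:

[Section 1: Task description]
You are an LLM output quality evaluator. Evaluate the quality of a coding assistant's answer to a user request.

[Section 2: Input + Output to evaluate]
User request: {input}
Assistant response: {output}

[Section 3: Rubric (scoring criteria)]
Scoring dimensions:
1. Correctness (does the code run, is the logic right): 1-5
2. Style (does it follow codebase conventions): 1-5
3. Completeness (does it fully solve the user request): 1-5

Scoring rules:
- 5: flawless, can be merged as-is
- 4: usable with minor fixes, correct overall
- 3: right direction, needs substantial changes
- 2: partly right, main logic is wrong
- 1: completely wrong, misleads the user

Explicitly not rewarded:
- Length / verbosity (an equally correct short answer = a long answer)
- Apologies / preambles
- Pleasantries such as "I hope this helps"

[Section 4: Output format]
Respond with the following JSON:
{
  "correctness": <1-5>,
  "style": <1-5>,
  "completeness": <1-5>,
  "reasoning": "<brief explanation>",
  "overall": <1-5>
}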

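Key design principles:

Explicit, reproducible rubric: use a 1-5 scale with a clear definition for each score, so the judge does not improvise
List the "not rewarded" items explicitly: a vague rubric makes it easy for the judge to reward long answers, apologies, and pleasantries (verbosity bias)
Require reasoning: forcing the judge to write down its rationale improves calibration and makes later debugging possible
Structured output: enforce the format with JSON / structured output so results can be processed programmatically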

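Pairwise vs direct scoring

The two mainstream scoring approaches:

Direct scoring

Given one (input, output) pair, the judge assigns an absolute score (1-5 or 1-10).

Pros: simple; lets you watch "absolute quality" change over time
Cons: score calibration is unstable (the judge's baseline may drift between batches)

Pairwise comparison

Given one input and two outputs (A and B), the judge picks the better one.

Pros: relative comparison is more stable than absolute scoring; well suited to A/B testing
Cons: requires two candidates; the result is "A > B", not "how good A is"

Practical combinations:

Scenario	Suitable approach
Production quality monitoring	Direct scoring (one score per trace)
Prompt / model A/B test	Pairwise (A versus B)
Before/after fine-tune comparison	Pairwise
Regression detection	Direct (compared against a baseline)
Synthetic data filtering	Direct (keep items scoring ≥ 4)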

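The three major biases and their mitigations

1. Position bias

In pairwise comparison, the judge favors the candidate that appears first (usually A).

Mitigation:

Run twice with the positions swapped (A-B and B-A)
Count "prefer A" only when both runs favor A; treat disagreement as a "tie"
Standard LLM-as-Judge frameworks (such as MT-Bench) build this in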
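The swap-and-compare rule can be sketched in a few lines. This is an illustrative sketch, not the API of any framework: `judge` stands for any callable you supply that takes `(prompt, first, second)` and returns the string "first" or "second".

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Run the judge in both orders; return "A", "B", or "tie" if the two runs disagree."""
    round1 = judge(prompt, answer_a, answer_b)  # A is shown first
    round2 = judge(prompt, answer_b, answer_a)  # B is shown first
    for verdict in (round1, round2):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    a_wins_round1 = round1 == "first"
    a_wins_round2 = round2 == "second"
    if a_wins_round1 and a_wins_round2:
        return "A"
    if not a_wins_round1 and not a_wins_round2:
        return "B"
    return "tie"


if __name__ == "__main__":
    # A judge with pure position bias always picks whatever it sees first.
    def always_first(prompt, first, second):
        return "first"

    print(debiased_preference(always_first, "q", "answer A", "answer B"))  # tie
```

A purely position-biased judge collapses to "tie" on every item, so a high tie rate is itself a useful signal that the judge is not really reading the answers.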

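2. Verbosity bias

The judge tends to score long answers higher even when their content is no better than a short answer's.

Mitigation:

State in the rubric that length is not rewarded and that an equally correct short answer = a long answer
Length normalization: score = raw_score / log(length)
Use a length-controlled benchmark (such as length-controlled AlpacaEval)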
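The normalization formula leaves the unit of length open. The sketch below makes two assumptions of its own: length is counted in whitespace-separated words (a poor proxy for text without spaces, such as Chinese, where a tokenizer count would be a better fit), and lengths are floored at 3 so that the logarithm stays above 1.

```python
import math


def length_normalized_score(raw_score: float, response: str, min_length: int = 3) -> float:
    """Divide the raw judge score by log(length), with length floored at min_length.

    Assumption: length = number of whitespace-separated words.
    """
    if min_length < 3:
        raise ValueError("min_length must be >= 3 so that log(length) > 1")
    length = max(len(response.split()), min_length)
    return raw_score / math.log(length)
```

The result is no longer on the 1-5 scale, so it is only meaningful for ranking answers against each other. It also penalizes answers that legitimately need to be long, so it seems safer as a secondary signal next to the raw score than as a replacement for it.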

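3. Self-preference bias

The judge prefers answers in its own family's style (a GPT judge prefers GPT-style output; a Claude judge prefers Claude-style output).

Mitigation:

Use three judge models from different families (e.g. Claude + GPT + Gemini) and take the majority
Avoid using the same model as both judge and test subject
Use reasoning models as judges (consensus across several vendors' reasoning models is more stable)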
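Taking the majority across judge families is a small aggregation step. A minimal sketch, assuming each judge has already produced one verdict string per item; the judge names and the "no-majority" label are illustrative.

```python
from collections import Counter


def majority_verdict(votes: dict) -> str:
    """votes maps judge name -> verdict; return the strict-majority verdict or "no-majority"."""
    if not votes:
        raise ValueError("need at least one vote")
    winner, count = Counter(votes.values()).most_common(1)[0]
    return winner if count > len(votes) / 2 else "no-majority"


if __name__ == "__main__":
    print(majority_verdict({"claude": "A", "gpt": "A", "gemini": "B"}))  # A
```

Items that end in "no-majority" are natural candidates for human spot review, since the judges themselves could not agree.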

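An additional bias: format bias

The judge prefers answers that have markdown, code blocks, or visible structure, even when the content is no better than plain text.

Mitigation: state in the rubric that formatting is not rewarded and that only content counts.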

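Calibration

A judge should not be trusted blindly; calibrate it:

1. Collect 100 (input, output) pairs
2. Have a human (yourself or someone you trust) assign ground-truth scores
3. Run the judge on the same 100 pairs
4. Compute the agreement rate:
   - Pairwise: fraction of items where judge and human agree (target > 75%)
   - Direct scoring: Spearman correlation (target > 0.7)
5. If agreement is low:
   - Revise the rubric (make it more explicit)
   - Switch the judge model (to a stronger one)
   - Revise the prompt (add few-shot examples)
6. Only a calibrated judge should run on production

Calibration is the step that aligns "what the judge grades" with "what humans grade"; skipping it makes production eval inaccurate.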
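For the pairwise case, the agreement rate in step 4 is a simple fraction. A minimal sketch, assuming judge and human labels are stored as parallel lists of strings such as "A", "B", or "tie"; the function names are illustrative.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items on which the judge's label equals the human's label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def meets_pairwise_target(judge_labels, human_labels, target=0.75):
    """True only when agreement is strictly above the target, as in the checklist above."""
    return agreement_rate(judge_labels, human_labels) > target
```

Raw agreement does not correct for chance: if humans pick "A" on most items, a judge that always answers "A" scores well too, so it is worth looking at the label distribution alongside this number.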

Closing the loop with 4.20 LLM tracing

Production traces plus LLM-as-Judge form an automated eval pipeline:

Production users
   ↓ generate traces
[LLM tracing platform] (LangSmith / Phoenix / Langfuse / Braintrust)
   ↓ filter: traces with user thumbs-down, errors, long latency, etc.
   ↓ sample 100 per day
[LLM-as-Judge batch run]
   ↓ rubric scoring
[Dashboard]
   - which query types are degrading
   - which deployment version has worse quality
   - which user segment has a worse experience
   ↓
Trigger an alert / change the prompt / change the model / roll back
   ↓ A/B test
   ↓ pairwise judge eval, new vs old
   ↓ deploy the winner

This is the standard quality engineering loop for production LLM applications.

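Judge model selection

Judge model candidate	Strengths	Weaknesses
Claude Sonnet / Opus	Strong reasoning; follows the rubric closely	Moderate cost
GPT-5 / GPT-4o	Widely available; strong tool-calling	Self-preference toward GPT output
Gemini 2.5 Pro	Strong long context; multi-modal	Follows the rubric more loosely
o1 / o3 / R1 (reasoning models)	Strong reasoning; stable on nuanced cases	High cost, long latency
Local 30B+ models (QwQ, DeepSeek-R1 distill)	Strong privacy, zero cost	Capability ceiling below cloud flagships

How to read this:

High-stakes / final QA: a cloud flagship reasoning model
Large-volume production trace eval: a mid-tier model (GPT-4o / Sonnet), balancing cost and speed
Privacy-sensitive (user traces cannot go to the cloud): a local reasoning model (QwQ-32B / R1 distill)
A/B testing prompt improvements: use the same judge before and after so the baseline stays fixed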

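Failure modes

The rubric is too vague: the judge improvises and scores are not repeatable

Mitigation: write the rubric like a unit test, with concrete criteria for each score

No calibration: judge-human agreement was never verified, so the judge may be systematically off

Mitigation: recalibrate after every major rubric change or judge model switch

The sample does not represent production: only easy cases are evaluated, and the genuinely hard production cases are not covered

Mitigation: use stratified sampling (by difficulty / user segment / feature)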
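Stratified sampling here just means drawing a fixed number of items from each group instead of sampling the whole pool at once. A minimal sketch, assuming each item is a dict carrying a label you can group on; the `difficulty` field in the usage example is hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to per_stratum items from each group defined by key(item)."""
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for name in sorted(groups):  # sorted so the output order is deterministic
        members = groups[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


if __name__ == "__main__":
    pool = [{"id": i, "difficulty": "easy"} for i in range(90)]
    pool += [{"id": 90 + i, "difficulty": "hard"} for i in range(10)]
    picked = stratified_sample(pool, key=lambda it: it["difficulty"], per_stratum=5)
    print(len(picked))  # 10: five easy, five hard
```

With plain random sampling, a pool that is 90% easy would yield about one hard case in ten; sampling per stratum keeps the rare hard cases in the eval set.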

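Bias is not mitigated: position / verbosity / self-preference bias gets baked straight in

Mitigation: standard frameworks (DeepEval / Inspect / Braintrust) have bias mitigation built in; using an existing framework is more robust than DIY

Judge cost is higher than expected: running the judge on every production trace blows up the cost

Mitigation: keep the sample rate below 10% and coordinate it with the LLM tracing sampling

Over-reliance on the judge: forgetting that the judge also makes mistakes and treating it as absolute truth

Mitigation: high-stakes tasks still need spot human review; the judge is an 80% solution, not a 100% one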

Mainstream frameworks

Framework	Notes
DeepEval	OSS, Python, integrates with pytest
Inspect (UK AI Safety Institute)	Strong eval framework, reasoning-model friendly
Braintrust	SaaS, eval and tracing in one
Langfuse evals	OSS, integrated with tracing
OpenAI evals	OSS; Anthropic models also supported
Patronus	Production eval SaaS

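When not to use LLM-as-Judge

Mechanically verifiable: unit tests, exact match, output schema validation. A deterministic rule is more reliable than a judge
Very small datasets (< 20 items): do human eval directly; a judge is unnecessary
Judgments that need domain expertise: for high-stakes medical / legal / security judgments, a judge should not replace an expert
Judge capability < test subject: if GPT-4o judges o3 output, the judge cannot follow the reasoning trace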

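What will and will not go out of date

What will not go out of date:

LLM-as-Judge's standing as the mainstream production eval method
The four-section judge prompt structure (task / input-output / rubric / format)
The trade-off between pairwise and direct scoring
The three-bias taxonomy and its mitigations
The production trace → judge → action loop

What will change:

The mainstream frameworks (DeepEval / Inspect / Braintrust, etc.)
The specific capabilities of each judge model (every generation gets stronger)
The specific bias numbers (human agreement figures shift over time and across tasks)
Emerging biases and their mitigations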

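Next steps

At this point Module 4 forms a complete map of the application layer: the basics (4.0 prompt technique spectrum / 4.1-4.2 RAG / 4.3 tools / 4.4 agents / 4.5 HITL), protocols and orchestration (4.6 protocols / 4.7 workflow / 4.8 multi-agent), production details (4.9-4.12 resource / artifact / long-context / embedding), and the eval and production observability loop (4.13 eval framework / 4.14 benchmarking / 4.17-4.21 harness / caching / memory / tracing / judge). For end-to-end hands-on cases, see the hands-on subcategory. From here you can move on to Module 5 for local inference hardware, to Module 6 for security topics (especially 6.6, the OWASP LLM Top 10 mapping, which translates production eval security issues into enterprise compliance vocabulary), or back to 4.13 (Eval design coordinate system) to see where the judge sits in the meta eval framework.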