<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Llm-as-Judge on Tarragon</title><link>https://tarrragon.github.io/blog/tags/llm-as-judge/</link><description>Recent content in Llm-as-Judge on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 12 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/llm-as-judge/index.xml" rel="self" type="application/rss+xml"/><item><title>Hands-on: Running a judge harness with a local LLM (minimum viable version)</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/local-llm-judge-harness/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/local-llm-judge-harness/</guid><description>&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-judge&lt;/a> covers the principles. This post uses Ollama / LM Studio to run a minimum viable judge harness locally and to systematically evaluate real cases from your own workflow. It is a particularly good fit for privacy-sensitive scenarios: the eval data (user queries, agent outputs, possibly PII) never has to be sent to the cloud.&lt;/p>
&lt;p>The framing here is "&lt;strong>it actually runs, beyond a demo&lt;/strong>", so the post covers hardware budget estimation, judge model selection, bias mitigation, the calibration workflow, and an extension that hooks into production traces. For terminology, see &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge&lt;/a> and &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-tracing/" data-link-title="LLM Tracing" data-link-desc="把 LLM 應用的每次 LLM call / tool call / memory op 編成結構化 span、用 OpenTelemetry GenAI semantic conventions 標準化">LLM Tracing&lt;/a>.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Verified on&lt;/strong>: 2026-05-12
&lt;strong>Environment&lt;/strong>: M4 Max 64GB, or a PC with 24GB+ VRAM, plus Ollama
&lt;strong>Judge model&lt;/strong>: DeepSeek-R1-Distill-Qwen-32B or QwQ-32B (reasoning models tend to be more stable as judges)&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼用本地-llm-當-judge">Why use a local LLM as the judge&lt;/h2>
&lt;p>Compared with cloud judges (GPT-5 / Claude 4):&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>Local judge&lt;/th>
 &lt;th>Cloud judge&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Cost&lt;/td>
 &lt;td>$0 per item (electricity only)&lt;/td>
 &lt;td>$0.001-0.01 per item&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Privacy&lt;/td>
 &lt;td>Fully local; eval data never leaves the machine&lt;/td>
 &lt;td>Sent to the cloud; subject to the provider's policy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Latency&lt;/td>
 &lt;td>Depends on hardware; about 30-60s for a ~30B reasoning model&lt;/td>
 &lt;td>5-30s per API call&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Quality ceiling&lt;/td>
 &lt;td>A local 30B reasoning model approaches mid-tier 2024 cloud models&lt;/td>
 &lt;td>Cloud flagship models have a higher ceiling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Large batches&lt;/td>
 &lt;td>Slow but zero cost&lt;/td>
 &lt;td>Fast but cost accumulates&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>How to read this:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Large-volume production trace eval (1,000+ items) that is also privacy-sensitive&lt;/strong> → local judge&lt;/li>
&lt;li>&lt;strong>Small high-stakes eval (&amp;lt; 50 items)&lt;/strong> → cloud flagship judge&lt;/li>
&lt;li>&lt;strong>Fast A/B test iteration&lt;/strong> → cloud (latency matters)&lt;/li>
&lt;/ul>
&lt;h2 id="硬體預算">Hardware budget&lt;/h2>
&lt;p>The choice of judge model depends on your hardware:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Hardware&lt;/th>
 &lt;th>Suitable judge model&lt;/th>
 &lt;th>Expected latency / item&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>M4 Pro 24GB / 4090 16GB&lt;/td>
 &lt;td>Qwen2.5-32B Q4 or DeepSeek-R1-Distill-Qwen-14B&lt;/td>
 &lt;td>30-60s&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>M4 Pro 36GB&lt;/td>
 &lt;td>DeepSeek-R1-Distill-Qwen-32B Q4&lt;/td>
 &lt;td>60-120s&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>M4 Max 48-64GB / 5090 24GB&lt;/td>
 &lt;td>QwQ-32B or DeepSeek-R1-Distill-Qwen-32B Q6&lt;/td>
 &lt;td>60-180s (including the reasoning trace)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>M4 Max 128GB / multi-GPU PC&lt;/td>
 &lt;td>Llama 3.3 70B or Qwen2.5-72B&lt;/td>
 &lt;td>120-300s&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Note: a reasoning model's thinking trace lengthens latency, so budget time for large batches (100 items × 60s = 100 min).&lt;/p></description><content:encoded><![CDATA[<p><a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-judge</a> covers the principles. This post uses Ollama / LM Studio to run a minimum viable judge harness locally and to systematically evaluate real cases from your own workflow. It is a particularly good fit for privacy-sensitive scenarios: the eval data (user queries, agent outputs, possibly PII) never has to be sent to the cloud.</p>
<p>The framing here is "<strong>it actually runs, beyond a demo</strong>", so the post covers hardware budget estimation, judge model selection, bias mitigation, the calibration workflow, and an extension that hooks into production traces. For terminology, see <a href="/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge</a> and <a href="/blog/llm/knowledge-cards/llm-tracing/" data-link-title="LLM Tracing" data-link-desc="把 LLM 應用的每次 LLM call / tool call / memory op 編成結構化 span、用 OpenTelemetry GenAI semantic conventions 標準化">LLM Tracing</a>.</p>
<blockquote>
<p><strong>Verified on</strong>: 2026-05-12
<strong>Environment</strong>: M4 Max 64GB, or a PC with 24GB+ VRAM, plus Ollama
<strong>Judge model</strong>: DeepSeek-R1-Distill-Qwen-32B or QwQ-32B (reasoning models tend to be more stable as judges)</p></blockquote>
<h2 id="為什麼用本地-llm-當-judge">Why use a local LLM as the judge</h2>
<p>Compared with cloud judges (GPT-5 / Claude 4):</p>
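<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Local judge</th>
          <th>Cloud judge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cost</td>
          <td>$0 per item (electricity only)</td>
          <td>$0.001-0.01 per item</td>
      </tr>
      <tr>
          <td>Privacy</td>
          <td>Fully local; eval data never leaves the machine</td>
          <td>Sent to the cloud; subject to the provider's policy</td>
      </tr>
      <tr>
          <td>Latency</td>
          <td>Depends on hardware; about 30-60s for a ~30B reasoning model</td>
          <td>5-30s per API call</td>
      </tr>
      <tr>
          <td>Quality ceiling</td>
          <td>A local 30B reasoning model approaches mid-tier 2024 cloud models</td>
          <td>Cloud flagship models have a higher ceiling</td>
      </tr>
      <tr>
          <td>Large batches</td>
          <td>Slow but zero cost</td>
          <td>Fast but cost accumulates</td>
      </tr>
  </tbody>
</table>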
<p>How to read this:</p>
<ul>
<li><strong>Large-volume production trace eval (1,000+ items) that is also privacy-sensitive</strong> → local judge</li>
<li><strong>Small high-stakes eval (&lt; 50 items)</strong> → cloud flagship judge</li>
<li><strong>Fast A/B test iteration</strong> → cloud (latency matters)</li>
</ul>
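<p>To put the cost row of the comparison table in concrete terms, the sketch below multiplies a batch size by the per-item price range quoted there. The function name <code>cloud_judge_cost</code> and its defaults are illustrative assumptions. Real pricing depends on the provider and on token counts per item.</p>

```python
def cloud_judge_cost(n_items, low_per_item=0.001, high_per_item=0.01):
    """Return the (low, high) USD cost range for judging n_items with a cloud judge.

    The default prices are the per-item range from the comparison table,
    used here only as an assumption for back-of-the-envelope planning.
    """
    return (round(n_items * low_per_item, 2), round(n_items * high_per_item, 2))


for n in (50, 1000, 10000):
    low, high = cloud_judge_cost(n)
    print(f"{n} items: ${low:.2f} to ${high:.2f}")
```

<p>At 50 items the cloud bill is negligible, which fits the advice to send small high-stakes evals to a flagship model. The gap only starts to matter at production-trace volumes.</p>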
<h2 id="硬體預算">Hardware budget</h2>
<p>The choice of judge model depends on your hardware:</p>
<table>
  <thead>
      <tr>
          <th>Hardware</th>
          <th>Suitable judge model</th>
          <th>Expected latency / item</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>M4 Pro 24GB / 4090 16GB</td>
          <td>Qwen2.5-32B Q4 or DeepSeek-R1-Distill-Qwen-14B</td>
          <td>30-60s</td>
      </tr>
      <tr>
          <td>M4 Pro 36GB</td>
          <td>DeepSeek-R1-Distill-Qwen-32B Q4</td>
          <td>60-120s</td>
      </tr>
      <tr>
          <td>M4 Max 48-64GB / 5090 24GB</td>
          <td>QwQ-32B or DeepSeek-R1-Distill-Qwen-32B Q6</td>
          <td>60-180s (including the reasoning trace)</td>
      </tr>
      <tr>
          <td>M4 Max 128GB / multi-GPU PC</td>
          <td>Llama 3.3 70B or Qwen2.5-72B</td>
          <td>120-300s</td>
      </tr>
  </tbody>
</table>
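<p>One rough way to sanity-check the pairings in the table is to estimate weight memory as parameters × bits per weight / 8. The sketch below is only a lower bound under stated assumptions: it ignores the KV cache and runtime overhead, and real quantization formats use somewhat more than their nominal bit width. The helper name <code>weight_memory_gb</code> is hypothetical.</p>

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Lower-bound estimate of model weight memory in GB (1 GB = 1e9 bytes).

    Assumes every parameter is stored at exactly bits_per_weight bits;
    KV cache, activations, and runtime overhead are not included.
    """
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("14B at 4-bit", 14, 4),
    ("32B at 4-bit", 32, 4),
    ("32B at 6-bit", 32, 6),
    ("70B at 4-bit", 70, 4),
]:
    print(f"{name}: at least {weight_memory_gb(params, bits):.0f} GB for weights")
```

<p>If the estimate already sits close to the machine's total memory, expect the tight-fit behaviour described later in this section, and consider dropping to a smaller model or a lower-bit quantization.</p>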
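<p>Note: a reasoning model's thinking trace lengthens latency, so budget time for large batches (100 items × 60s = 100 min).</p>
<p><strong>When a local judge is not a good fit</strong>:</p>
<ol>
<li><strong>Hardware below M4 Pro 24GB / 4090 16GB</strong> (e.g. M1/M2 16GB, or a PC without a discrete GPU): a 32B reasoning model is too tight a fit; forcing it causes swapping and latency grows 5-10×. Use a 14B instruct model (e.g. Qwen2.5-14B Q4) as the judge, or go straight to a cloud judge.</li>
<li><strong>Batch × latency &gt; the wait you can accept</strong>: 100 items × 60s/item = 100 min; 500 items × 120s ≈ 17 hr. If the estimate exceeds 4 hr, switch to a cloud batch API.</li>
<li><strong>The eval task is too nuanced</strong>: for fine-grained ethical, legal, or high-stakes judgments a local 32B distill is not capable enough; use a cloud flagship judge or human review.</li>
<li><strong>Calibration phase</strong>: on the first runs you need to iterate on the rubric quickly, and a cloud judge's shorter latency (5-30s) suits that better.</li>
</ol>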
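<p>The batch-time rule in item 2 can be sketched as a small planning helper. It assumes items are judged sequentially, which is how the minimal harness in this post works. The function names and the <code>max_hours</code> parameter are illustrative; the 4-hour default is the threshold suggested above.</p>

```python
def batch_hours(n_items, seconds_per_item):
    """Wall-clock hours for judging n_items one after another."""
    return n_items * seconds_per_item / 3600


def suggest_judge(n_items, seconds_per_item, max_hours=4.0):
    """Return 'local' if the sequential batch fits the time budget, else 'cloud-batch'."""
    if batch_hours(n_items, seconds_per_item) <= max_hours:
        return "local"
    return "cloud-batch"


for n_items, latency in [(100, 60), (500, 120)]:
    hours = batch_hours(n_items, latency)
    print(f"{n_items} items x {latency}s = {hours:.1f} hr -> {suggest_judge(n_items, latency)}")
```

<p>Measure <code>seconds_per_item</code> on a handful of real items first. Reasoning traces vary a lot in length, so the table's ranges are only a starting point.</p>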
<h2 id="整體流程">Overall workflow</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. Collect the eval dataset     → JSONL: one (input, output) pair per line to be judged
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. Design the rubric            → scoring dimensions, scale, explicit anti-patterns
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Write the judge prompt       → four parts (task / input-output / rubric / format)
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. Run the harness              → call the judge for each item, parse the JSON output
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. Aggregate the results        → average scores, find outliers, read the reasoning
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. Calibrate (optional)         → compare against human eval, adjust the rubric
</span></span><span class="line"><span class="ln">7</span><span class="cl">7. Hook into production traces  → run on production samples periodically</span></span></code></pre></div><h2 id="step-1蒐集-eval-dataset">Step 1: Collect the eval dataset</h2>
<p>JSONL format (one record per line):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">{</span><span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;001&#34;</span><span class="p">,</span> <span class="nt">&#34;input&#34;</span><span class="p">:</span> <span class="s2">&#34;Write a fibonacci function in Python&#34;</span><span class="p">,</span> <span class="nt">&#34;output&#34;</span><span class="p">:</span> <span class="s2">&#34;def fib(n):\n    if n &lt;= 1:\n        return n\n    return fib(n-1) + fib(n-2)&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="p">{</span><span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;002&#34;</span><span class="p">,</span> <span class="nt">&#34;input&#34;</span><span class="p">:</span> <span class="s2">&#34;Explain what this code does: [code]&#34;</span><span class="p">,</span> <span class="nt">&#34;output&#34;</span><span class="p">:</span> <span class="s2">&#34;This code implements ...&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="p">{</span><span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;003&#34;</span><span class="p">,</span> <span class="nt">&#34;input&#34;</span><span class="p">:</span> <span class="s2">&#34;[bug description]&#34;</span><span class="p">,</span> <span class="nt">&#34;output&#34;</span><span class="p">:</span> <span class="s2">&#34;[suggested fix]&#34;</span><span class="p">}</span></span></span></code></pre></div><p>Sources:</p>
<ul>
<li>Past conversation logs between you and an LLM in Continue.dev / Cursor</li>
<li>Traces from production agents (manual export, or a LangSmith / Phoenix dump)</li>
<li>30-100 typical cases that you hand-craft yourself</li>
</ul>
<p>Save the dataset as <code>data/eval.jsonl</code>.</p>
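<p>Because the dataset usually comes from mixed sources, it can help to validate it before spending judge time on it. The following is a minimal sketch, not part of the original harness: it checks that every non-blank line is a JSON object with the <code>id</code>, <code>input</code> and <code>output</code> keys shown above and that ids are unique. The function name <code>validate_jsonl_lines</code> is hypothetical.</p>

```python
import json

REQUIRED_KEYS = ("id", "input", "output")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((lineno, f"missing keys: {missing}"))
            continue
        record_id = str(record["id"])
        if record_id in seen_ids:
            problems.append((lineno, f"duplicate id: {record_id}"))
        seen_ids.add(record_id)
    return problems


sample = [
    '{"id": "001", "input": "q1", "output": "a1"}',
    '{"id": "001", "input": "q2", "output": "a2"}',
    '{"id": "003", "input": "q3"}',
    "not json",
]
for lineno, problem in validate_jsonl_lines(sample):
    print(f"line {lineno}: {problem}")
```

<p>To check the real file, pass an open file handle, for example <code>validate_jsonl_lines(open("data/eval.jsonl", encoding="utf-8"))</code>, since iterating a text file yields its lines.</p>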
<h2 id="step-2設計-rubric">Step 2: Design the rubric</h2>
<p>Design the rubric per task type. An example rubric for coding tasks:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Scoring dimensions:
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">1. Correctness (does the code run, is the logic correct): 1-5
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">2. Style (does it follow codebase conventions and idiomatic naming): 1-5
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">3. Completeness (does it fully address the user request): 1-5
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Scoring rules:
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">- 5: flawless, can be merged as-is
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">- 4: usable with minor fixes, correct overall
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">- 3: right direction, needs substantial changes
</span></span><span class="line"><span class="ln">10</span><span class="cl">- 2: partly right, the main logic is wrong
</span></span><span class="line"><span class="ln">11</span><span class="cl">- 1: completely wrong, misleads the user
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl">Explicitly no extra credit for (mitigates verbosity bias):
</span></span><span class="line"><span class="ln">14</span><span class="cl">- Length / verbosity (an equally correct short answer = a long answer)
</span></span><span class="line"><span class="ln">15</span><span class="cl">- Apologies / preambles
</span></span><span class="line"><span class="ln">16</span><span class="cl">- Polite filler such as &#34;I hope this helps&#34;
</span></span><span class="line"><span class="ln">17</span><span class="cl">- Heavy markdown decoration</span></span></code></pre></div><h2 id="step-3judge-prompt-模板">Step 3: Judge prompt template</h2>
<p>Save it as the file <code>prompts/judge.txt</code>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">You are an LLM output quality evaluator. Your job is to assess the quality of a coding assistant&#39;s answer to a user request.
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">Important: stay impartial, ignore stylistic preferences, and focus on substantive quality.
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">User request:
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">{input}
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">Assistant response:
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">{output}
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">Scoring dimensions (1-5 each; overall is a single holistic 1-5 score, not a sum):
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">1. Correctness: the code runs and the logic is correct
</span></span><span class="line"><span class="ln">13</span><span class="cl">   5: flawless
</span></span><span class="line"><span class="ln">14</span><span class="cl">   4: usable with minor fixes
</span></span><span class="line"><span class="ln">15</span><span class="cl">   3: right direction, needs substantial changes
</span></span><span class="line"><span class="ln">16</span><span class="cl">   2: partly right, the main logic is wrong
</span></span><span class="line"><span class="ln">17</span><span class="cl">   1: completely wrong
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl">2. Style: follows codebase conventions
</span></span><span class="line"><span class="ln">20</span><span class="cl">   1-5, same scale
</span></span><span class="line"><span class="ln">21</span><span class="cl">
</span></span><span class="line"><span class="ln">22</span><span class="cl">3. Completeness: fully addresses the user request
</span></span><span class="line"><span class="ln">23</span><span class="cl">   1-5, same scale
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl">Explicitly no extra credit for:
</span></span><span class="line"><span class="ln">26</span><span class="cl">- Length / verbosity (an equally correct short answer = a long answer)
</span></span><span class="line"><span class="ln">27</span><span class="cl">- Apologies / preambles
</span></span><span class="line"><span class="ln">28</span><span class="cl">- Polite filler such as &#34;I hope this helps&#34;
</span></span><span class="line"><span class="ln">29</span><span class="cl">- Heavy markdown decoration
</span></span><span class="line"><span class="ln">30</span><span class="cl">
</span></span><span class="line"><span class="ln">31</span><span class="cl">Output the following JSON (no extra text, no markdown code fence):
</span></span><span class="line"><span class="ln">32</span><span class="cl">{
</span></span><span class="line"><span class="ln">33</span><span class="cl">  &#34;correctness&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">34</span><span class="cl">  &#34;style&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">35</span><span class="cl">  &#34;completeness&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">36</span><span class="cl">  &#34;reasoning&#34;: &#34;&lt;brief explanation, under 100 words&gt;&#34;,
</span></span><span class="line"><span class="ln">37</span><span class="cl">  &#34;overall&#34;: &lt;1-5&gt;
</span></span><span class="line"><span class="ln">38</span><span class="cl">}</span></span></code></pre></div><h2 id="step-4跑-harness">Step 4: Run the harness</h2>
<p>A minimum viable version in Python:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># judge_harness.py</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">JUDGE_MODEL</span> <span class="o">=</span> <span class="s2">&#34;deepseek-r1:32b&#34;</span>  <span class="c1"># or qwq:32b</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">OLLAMA_URL</span> <span class="o">=</span> <span class="s2">&#34;http://localhost:11434/v1/chat/completions&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">def</span> <span class="nf">load_dataset</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="s2">&#34;&#34;&#34;Load JSONL eval dataset.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">&#34;utf-8&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="k">return</span> <span class="p">[</span><span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()]</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="k">def</span> <span class="nf">load_prompt_template</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="k">return</span> <span class="n">Path</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="o">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&#34;utf-8&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="k">def</span> <span class="nf">call_judge</span><span class="p">(</span><span class="n">prompt</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="s2">&#34;&#34;&#34;Call the Ollama judge model and return the raw response text.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">    <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">OLLAMA_URL</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">        <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="n">JUDGE_MODEL</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">        <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}],</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">        <span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>  <span class="c1"># a low temperature keeps judge scores stable</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">        <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">    <span class="p">},</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">600</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">    <span class="k">return</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">
</span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="k">def</span> <span class="nf">parse_judge_output</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl">    <span class="s2">&#34;&#34;&#34;Parse the JSON returned by the judge, tolerating a leading &lt;think&gt; block from reasoning models.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">    <span class="c1"># Skip the reasoning trace: keep only the text after the last &lt;/think&gt;</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl">    <span class="k">if</span> <span class="s2">&#34;&lt;/think&gt;&#34;</span> <span class="ow">in</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl">        <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;&lt;/think&gt;&#34;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">
</span></span><span class="line"><span class="ln">33</span><span class="cl">    <span class="c1"># Locate the JSON block: first &#34;{&#34; through last &#34;}&#34;</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl">    <span class="n">start</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">&#34;{&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl">    <span class="n">end</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">rfind</span><span class="p">(</span><span class="s2">&#34;}&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">    <span class="k">if</span> <span class="n">start</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="ow">or</span> <span class="n">end</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">37</span><span class="cl">        <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="ln">38</span><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl">        <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl">    <span class="k">except</span> <span class="n">json</span><span class="o">.</span><span class="n">JSONDecodeError</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">41</span><span class="cl">        <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="ln">42</span><span class="cl">
</span></span><span class="line"><span class="ln">43</span><span class="cl"><span class="k">def</span> <span class="nf">run_harness</span><span class="p">(</span><span class="n">dataset_path</span><span class="p">,</span> <span class="n">prompt_template_path</span><span class="p">,</span> <span class="n">output_path</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">44</span><span class="cl">    <span class="n">dataset</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="n">dataset_path</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">45</span><span class="cl">    <span class="n">template</span> <span class="o">=</span> <span class="n">load_prompt_template</span><span class="p">(</span><span class="n">prompt_template_path</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">46</span><span class="cl">
</span></span><span class="line"><span class="ln">47</span><span class="cl">    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln">48</span><span class="cl">    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">49</span><span class="cl">        <span class="n">prompt</span> <span class="o">=</span> <span class="n">template</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&#34;{input}&#34;</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="s2">&#34;input&#34;</span><span class="p">])</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&#34;{output}&#34;</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="s2">&#34;output&#34;</span><span class="p">])</span>  <span class="c1"># str.format() would fail on the literal JSON braces in the template</span>
</span></span><span class="line"><span class="ln">50</span><span class="cl">        <span class="n">raw</span> <span class="o">=</span> <span class="n">call_judge</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">51</span><span class="cl">        <span class="n">parsed</span> <span class="o">=</span> <span class="n">parse_judge_output</span><span class="p">(</span><span class="n">raw</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">52</span><span class="cl">
</span></span><span class="line"><span class="ln">53</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">54</span><span class="cl">            <span class="s2">&#34;id&#34;</span><span class="p">:</span> <span class="n">item</span><span class="p">[</span><span class="s2">&#34;id&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">55</span><span class="cl">            <span class="s2">&#34;scores&#34;</span><span class="p">:</span> <span class="n">parsed</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">56</span><span class="cl">            <span class="s2">&#34;raw_judge_output&#34;</span><span class="p">:</span> <span class="n">raw</span><span class="p">[:</span><span class="mi">500</span><span class="p">],</span>  <span class="c1"># keep the first 500 characters for debugging</span>
</span></span><span class="line"><span class="ln">57</span><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="ln">58</span><span class="cl">        <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">59</span><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;[</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span><span class="si">}</span><span class="s2">] id=</span><span class="si">{</span><span class="n">item</span><span class="p">[</span><span class="s1">&#39;id&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> overall=</span><span class="si">{</span><span class="n">parsed</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;overall&#39;</span><span class="p">)</span> <span class="k">if</span> <span class="n">parsed</span> <span class="k">else</span> <span class="s1">&#39;FAIL&#39;</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">60</span><span class="cl">
</span></span><span class="line"><span class="ln">61</span><span class="cl">    <span class="c1"># Write the results out as JSONL (UTF-8, non-ASCII text kept readable)</span>
</span></span><span class="line"><span class="ln">62</span><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_path</span><span class="p">,</span> <span class="s2">&#34;w&#34;</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">&#34;utf-8&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">63</span><span class="cl">        <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">64</span><span class="cl">            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">ensure_ascii</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">65</span><span class="cl">
</span></span><span class="line"><span class="ln">66</span><span class="cl">    <span class="c1"># Aggregate only items whose judge output parsed and carries a numeric overall score</span>
</span></span><span class="line"><span class="ln">67</span><span class="cl">    <span class="n">valid</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;scores&#34;</span><span class="p">]</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="s2">&#34;scores&#34;</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;overall&#34;</span><span class="p">),</span> <span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="nb">float</span><span class="p">))]</span>
</span></span><span class="line"><span class="ln">68</span><span class="cl">    <span class="k">if</span> <span class="n">valid</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">69</span><span class="cl">        <span class="n">avg</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="s2">&#34;scores&#34;</span><span class="p">][</span><span class="s2">&#34;overall&#34;</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">valid</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">valid</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">70</span><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">Aggregate: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">valid</span><span class="p">)</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="si">}</span><span class="s2"> valid, avg overall = </span><span class="si">{</span><span class="n">avg</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">71</span><span class="cl">
</span></span><span class="line"><span class="ln">72</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;__main__&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">73</span><span class="cl">    <span class="n">run_harness</span><span class="p">(</span><span class="s2">&#34;data/eval.jsonl&#34;</span><span class="p">,</span> <span class="s2">&#34;prompts/judge.txt&#34;</span><span class="p">,</span> <span class="s2">&#34;results/eval.jsonl&#34;</span><span class="p">)</span></span></span></code></pre></div><p>Run it:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Make sure the judge model has been pulled first</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama pull deepseek-r1:32b
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Run the harness</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">python judge_harness.py</span></span></code></pre></div><h2 id="step-5aggregate-跟看-outlier">Step 5: Aggregate and inspect outliers</h2>
<p>After the run, results/eval.jsonl holds the score and reasoning for every item. Find the outliers:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Find cases with overall &lt; 3 (low scores worth reviewing); the null check keeps failed parses out</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">jq <span class="s1">&#39;select(.scores.overall != null and .scores.overall &lt; 3)&#39;</span> results/eval.jsonl
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Read the reasoning to spot systemic problems</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">jq <span class="s1">&#39;.scores.reasoning&#39;</span> results/eval.jsonl <span class="p">|</span> sort -u</span></span></code></pre></div><p>How to read the results:</p>
<ul>
<li><strong>Mostly 4-5, a few 1-2</strong>: overall quality is good; focus on the low-scoring cases and fix them</li>
<li><strong>Mostly 2-3</strong>: a systemic problem; revise the prompt, model, or agent design</li>
<li><strong>Bimodal distribution (many 5s and many 1s)</strong>: likely clusters of task difficulty; run a stratified analysis</li>
</ul>
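<p>The three patterns above are easier to tell apart with a histogram of the <code>overall</code> scores. The following is a minimal sketch, assuming each result row has the <code>scores.overall</code> shape that the jq queries read; the helper name <code>score_histogram</code> is hypothetical and not part of the harness.</p>

```python
import json
from collections import Counter


def score_histogram(lines):
    """Count integer overall scores from JSONL rows; rows without one go under 'FAIL'."""
    hist = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        scores = json.loads(line).get("scores") or {}
        overall = scores.get("overall")
        hist[overall if isinstance(overall, int) else "FAIL"] += 1
    return hist


if __name__ == "__main__":
    # Inline sample rows; in practice pass open("results/eval.jsonl", encoding="utf-8").
    sample = [
        '{"id": "a", "scores": {"overall": 5}}',
        '{"id": "b", "scores": {"overall": 1}}',
        '{"id": "c", "scores": null}',
    ]
    hist = score_histogram(sample)
    for key in [5, 4, 3, 2, 1, "FAIL"]:
        print(f"{key}: {'#' * hist[key]}")
```

<p>Two tall bars at 1 and 5 suggest the bimodal case; a large FAIL bar means the parsing problem should be fixed before the scores are interpreted.</p>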
<h2 id="step-6calibration可選但推薦">Step 6: Calibration (optional but recommended)</h2>
<p>Compare against human eval to confirm the judge is aligned:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. Sample 30 items from the dataset (covering the range of difficulty and scores)
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. Score them yourself as the human evaluator (using the same rubric)
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Compare the judge&#39;s overall score with the human one
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. Compute the Spearman correlation
</span></span><span class="line"><span class="ln">5</span><span class="cl">   - &gt; 0.7: the judge is well aligned and can be trusted
</span></span><span class="line"><span class="ln">6</span><span class="cl">   - 0.5-0.7: partial misalignment; revise the rubric
</span></span><span class="line"><span class="ln">7</span><span class="cl">   - &lt; 0.5: the judge is not trustworthy; switch models or rewrite the rubric</span></span></code></pre></div><p>Common causes of low correlation:</p>
<ul>
<li>The rubric is too vague, so the judge improvises</li>
<li>The judge model is not capable enough (switch to a stronger judge)</li>
<li>Verbosity / position bias has not been mitigated</li>
<li>The eval task is far from the judge&#39;s training distribution</li>
</ul>
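<p>Step 4 of the calibration procedure can be done without extra dependencies. The following is a minimal sketch, assuming two equal-length lists of overall scores (judge and human); the function names are illustrative. Spearman correlation is computed as the Pearson correlation of the ranks, with tied scores sharing their average rank, which matters on a 1-5 scale where ties are common.</p>

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos != len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) in (0, 1):
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    judge = [5, 4, 4, 2, 1, 3]  # made-up scores for illustration
    human = [5, 5, 3, 2, 1, 3]
    print(f"Spearman = {spearman(judge, human):.2f}")
```

<p>If SciPy is installed, <code>scipy.stats.spearmanr</code> computes the same statistic.</p>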
<h2 id="step-7跟-production-trace-串接延伸">Step 7: Connect to production traces (extension)</h2>
<p>Export the production traces collected by <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a> to JSONL and run the judge on them periodically:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Assuming self-hosted Langfuse; this export command is illustrative, so substitute your platform&#39;s actual export tool or API</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">langfuse <span class="nb">export</span> --filter <span class="s2">&#34;user_feedback=negative&#34;</span> --output traces.jsonl
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Convert to the eval format</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">python convert_trace_to_eval.py traces.jsonl &gt; data/eval-from-prod.jsonl
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># Run the judge (first point run_harness at data/eval-from-prod.jsonl; the script hardcodes data/eval.jsonl)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">python judge_harness.py</span></span></code></pre></div><p>This is the local version of the production quality-engineering loop: an alternative with no per-call API cost for privacy-sensitive scenarios.</p>
<h2 id="失敗模式">Failure modes</h2>
<ol>
<li><strong>The judge does not output valid JSON</strong>: a reasoning model may still add markdown or explanation after the <code>&lt;think&gt;...&lt;/think&gt;</code> block</li>
</ol>
<p><strong>Mitigation</strong>: skip the <code>&lt;think&gt;</code> block when parsing, parse tolerantly, or enable <a href="/blog/llm/knowledge-cards/constrained-decoding/" data-link-title="Constrained Decoding" data-link-desc="推論時用 grammar 強制 LLM 輸出符合特定格式（JSON / regex / CFG）的 sampling 機制、把不合法 token 的機率歸零">constrained decoding</a> (llama.cpp grammar)</p>
<ol start="2">
<li><strong>Latency is too long and the batch never finishes</strong>: a 32B reasoning model takes 60-120 s per item, so 100 items run sequentially take roughly 1.7 to 3.3 hours</li>
</ol>
<p><strong>Mitigation</strong>: use a lighter judge model (e.g. the non-reasoning Qwen2.5-32B instruct), or split the batch and run it in parallel</p>
<ol start="3">
<li><strong>Judge bias is not mitigated</strong>: local and cloud judges alike show verbosity / position bias</li>
</ol>
<p><strong>Mitigation</strong>: spell it out in the rubric; for pairwise, run twice with the positions swapped</p>
<ol start="4">
<li><strong>Capability ceiling of a local judge</strong>: a 30B-class distill reads nuanced cases less well than a cloud flagship</li>
</ol>
<p><strong>Mitigation</strong>: add spot human review for critical cases, or mix local (high volume) with cloud (a curated sample)</p>
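<p>The mitigation for the first failure mode (skip the <code>&lt;think&gt;</code> block, then parse tolerantly) can be sketched as below. This is a minimal sketch under stated assumptions: the judge wraps its reasoning in a think block that ends with a closing tag, and the JSON object may be surrounded by markdown or commentary. The function name <code>extract_judge_json</code> is hypothetical and is not the parser used by the harness.</p>

```python
import json


def extract_judge_json(raw):
    """Return the first JSON object found after any reasoning block, or None."""
    # Keep only the text after the closing think tag. rpartition returns the
    # whole string as the tail when the marker is absent.
    tail = raw.rpartition("/think>")[2]
    decoder = json.JSONDecoder()
    start = tail.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it (closing fences, commentary).
            obj, _end = decoder.raw_decode(tail, start)
            return obj
        except json.JSONDecodeError:
            start = tail.find("{", start + 1)  # try the next candidate brace
    return None
```

<p>Returning <code>None</code> instead of raising lets the harness record the item as a parse failure and continue with the batch.</p>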
<h2 id="跟其他章節的關係">Relation to other chapters</h2>
<ul>
<li>For the principles behind LLM-as-judge design, see <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21</a></li>
<li>For connecting production traces, see <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 tracing</a></li>
<li>For choosing a reasoning model, see <a href="/blog/llm/03-theoretical-foundations/reasoning-models/" data-link-title="3.8 Reasoning models：test-time compute paradigm" data-link-desc="Chain-of-thought 從 prompting 技巧演化成訓練 paradigm、reasoning model 的內部運作、本地可跑的選項與適用任務">3.8</a></li>
<li>For privacy and the cloud / local boundary, see <a href="/blog/llm/06-security/cross-cloud-local-data-boundary/" data-link-title="6.4 跨雲端 / 本地的資料邊界" data-link-desc="個人 dev 場景下混用雲端 LLM 跟本地 LLM 時的 prompt 洩漏點：Continue.dev 多 provider 設定、隱私資料流、按敏感度分流的判讀">6.4</a></li>
<li>For how benchmarks and in-house eval layer together, see <a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14</a></li>
</ul>
]]></content:encoded></item><item><title>4.21 LLM-as-Judge Evaluation Methods</title><link>https://tarrragon.github.io/blog/llm/04-applications/llm-as-judge/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/04-applications/llm-as-judge/</guid><description>&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 benchmarking-and-evaluation&lt;/a> covers capability benchmarks (MMLU, SWE-bench, and so on) and the concept of an in-house benchmark. The operational question of how to systematically evaluate real cases from your own workflow is only touched on there. This chapter fills that gap with &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge&lt;/a>, the de facto standard eval method for production AI apps: 500-5000× cheaper than human eval, with 80%+ agreement with humans, but with biases that must be handled.&lt;/p>
&lt;p>Where the judge sits in an eval system: &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13 Eval design coordinate system&lt;/a> splits eval into three axes and eight quadrants and maps each quadrant to a tool. The judge belongs on the subjective axis (behavior with no ground truth), not the objective axis (where ground truth exists and a deterministic check is cheaper and more accurate). Before reading this chapter, read the section on axis mis-selection in 4.13 to avoid the common anti-pattern of turning every eval into a judge.&lt;/p>
&lt;h2 id="本章目標">本章目標&lt;/h2>
&lt;p>讀完本章後、你應該能：&lt;/p>
&lt;ol>
&lt;li>區分 LLM-as-Judge、standard benchmark、human eval 三條 eval 路徑。&lt;/li>
&lt;li>設計可重現的 judge rubric（input / output / rubric / reasoning 四段）。&lt;/li>
&lt;li>用 pairwise vs direct scoring、知道何時用哪種。&lt;/li>
&lt;li>緩解三大 bias（position / verbosity / self-preference）。&lt;/li>
&lt;li>把 production &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">trace&lt;/a> 餵回 judge、形成自動 eval 閉環。&lt;/li>
&lt;/ol>
&lt;h2 id="為什麼需要-llm-as-judge">為什麼需要 LLM-as-Judge&lt;/h2>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14&lt;/a> 推「in-house benchmark 是 final test」、但操作層是個 gap：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Eval pain point&lt;/th>
 &lt;th>LLM-as-Judge answer&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Standard benchmarks do not match your use case&lt;/td>
 &lt;td>The judge runs on your own cases with a custom rubric&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Human eval is too expensive / too slow&lt;/td>
 &lt;td>The judge runs automatically at $0.001-0.01 per item&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Production traces are too many to review by hand&lt;/td>
 &lt;td>Judging up to 100% of production traces is feasible (sample when cost matters)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Rule-based eval misses semantic problems&lt;/td>
 &lt;td>The judge can tell whether an answer meets the intent even when the wording differs&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Iteration needs fast feedback&lt;/td>
 &lt;td>The judge finishes 100 items in minutes, so a prompt change can be retested right away&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Main use cases (repeating the &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge card&lt;/a>): in-house benchmarks, production trace eval, A/B tests, and synthetic data quality.&lt;/p></description><content:encoded><![CDATA[<p><a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 benchmarking-and-evaluation</a> covers capability benchmarks (MMLU, SWE-bench, and so on) and the concept of an in-house benchmark. The operational question of how to systematically evaluate real cases from your own workflow is only touched on there. This chapter fills that gap with <a href="/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge</a>, the de facto standard eval method for production AI apps: 500-5000× cheaper than human eval, with 80%+ agreement with humans, but with biases that must be handled.</p>
<p>Where the judge sits in an eval system: <a href="/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13 Eval design coordinate system</a> splits eval into three axes and eight quadrants and maps each quadrant to a tool. The judge belongs on the subjective axis (behavior with no ground truth), not the objective axis (where ground truth exists and a deterministic check is cheaper and more accurate). Before reading this chapter, read the section on axis mis-selection in 4.13 to avoid the common anti-pattern of turning every eval into a judge.</p>
<h2 id="本章目標">Chapter goals</h2>
<p>After reading this chapter, you should be able to:</p>
<ol>
<li>Distinguish the three eval paths: LLM-as-Judge, standard benchmarks, and human eval.</li>
<li>Design a reproducible judge prompt in four sections (task / input-output / rubric / output format).</li>
<li>Choose between pairwise and direct scoring, and know when each applies.</li>
<li>Mitigate the three major biases (position / verbosity / self-preference).</li>
<li>Feed production <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">traces</a> back into the judge to form an automated eval loop.</li>
</ol>
<h2 id="為什麼需要-llm-as-judge">Why LLM-as-Judge is needed</h2>
<p><a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14</a> argues that the in-house benchmark is the final test, but leaves a gap at the operational level:</p>
<table>
  <thead>
      <tr>
          <th>Eval pain point</th>
          <th>LLM-as-Judge answer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard benchmarks do not match your use case</td>
          <td>The judge runs on your own cases with a custom rubric</td>
      </tr>
      <tr>
          <td>Human eval is too expensive / too slow</td>
          <td>The judge runs automatically at $0.001-0.01 per item</td>
      </tr>
      <tr>
          <td>Production traces are too many to review by hand</td>
          <td>Judging up to 100% of production traces is feasible (sample when cost matters)</td>
      </tr>
      <tr>
          <td>Rule-based eval misses semantic problems</td>
          <td>The judge can tell whether an answer meets the intent even when the wording differs</td>
      </tr>
      <tr>
          <td>Iteration needs fast feedback</td>
          <td>The judge finishes 100 items in minutes, so a prompt change can be retested right away</td>
      </tr>
  </tbody>
</table>
<p>Main use cases (repeating the <a href="/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge card</a>): in-house benchmarks, production trace eval, A/B tests, and synthetic data quality.</p>
<h2 id="judge-prompt-結構">Judge prompt structure</h2>
<p>A reproducible judge prompt needs four sections:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">[Section 1: Task description]
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">You are an evaluator of LLM output quality. Assess how well a coding assistant answered the user&#39;s request.
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">[Section 2: Input + Output to evaluate]
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">User request: {input}
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Assistant response: {output}
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">[Section 3: Rubric]
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">Scoring dimensions:
</span></span><span class="line"><span class="ln">10</span><span class="cl">1. Correctness (does the code run, is the logic right): 1-5
</span></span><span class="line"><span class="ln">11</span><span class="cl">2. Style (does it follow the codebase conventions): 1-5
</span></span><span class="line"><span class="ln">12</span><span class="cl">3. Completeness (does it fully solve the user request): 1-5
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl">Scoring rules:
</span></span><span class="line"><span class="ln">15</span><span class="cl">- 5: flawless, can be merged as-is
</span></span><span class="line"><span class="ln">16</span><span class="cl">- 4: usable after minor fixes, correct overall
</span></span><span class="line"><span class="ln">17</span><span class="cl">- 3: right direction, needs substantial changes
</span></span><span class="line"><span class="ln">18</span><span class="cl">- 2: partly right, the main logic is wrong
</span></span><span class="line"><span class="ln">19</span><span class="cl">- 1: completely wrong, misleads the user
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl">Explicitly earns no extra credit:
</span></span><span class="line"><span class="ln">22</span><span class="cl">- Length / verbosity (an equally correct short answer = a long answer)
</span></span><span class="line"><span class="ln">23</span><span class="cl">- Apologies / preamble
</span></span><span class="line"><span class="ln">24</span><span class="cl">- Pleasantries such as &#34;I hope this helps&#34;
</span></span><span class="line"><span class="ln">25</span><span class="cl">
</span></span><span class="line"><span class="ln">26</span><span class="cl">[Section 4: Output format]
</span></span><span class="line"><span class="ln">27</span><span class="cl">Respond with the following JSON:
</span></span><span class="line"><span class="ln">28</span><span class="cl">{
</span></span><span class="line"><span class="ln">29</span><span class="cl">  &#34;correctness&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">30</span><span class="cl">  &#34;style&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">31</span><span class="cl">  &#34;completeness&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">32</span><span class="cl">  &#34;reasoning&#34;: &#34;&lt;brief explanation&gt;&#34;,
</span></span><span class="line"><span class="ln">33</span><span class="cl">  &#34;overall&#34;: &lt;1-5&gt;
</span></span><span class="line"><span class="ln">34</span><span class="cl">}</span></span></code></pre></div><p>Key design principles:</p>
<ol>
<li><strong>An explicit, reproducible rubric</strong>: use a 1-5 scale with a clear definition for each score so the judge does not improvise</li>
<li><strong>List what earns no extra credit</strong>: a vague rubric makes it easy for the judge to reward long answers, apologies, and pleasantries (verbosity bias)</li>
<li><strong>Require reasoning</strong>: forcing the judge to write down its rationale improves calibration and makes later debugging possible</li>
<li><strong>Structured output</strong>: use JSON / <a href="/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">structured output</a> to enforce the format so results can be processed programmatically</li>
</ol>
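<p>To make the four-section structure concrete, here is a minimal sketch of filling the template and checking the judge reply against the output format. The condensed template text, <code>build_prompt</code>, and <code>validate_scores</code> are illustrative assumptions, not the harness code; the JSON keys follow Section 4 of the prompt.</p>

```python
import json

# Condensed version of the four-section prompt; {input} and {output} are the slots.
JUDGE_TEMPLATE = """[Task]
You are an evaluator of LLM output quality for a coding assistant.

[Input and output]
User request: {input}
Assistant response: {output}

[Rubric]
Score correctness, style, completeness, and overall from 1 (wrong) to 5 (mergeable as-is).
Do not reward length, apologies, or pleasantries.

[Output format]
Return one JSON object with keys: correctness, style, completeness, reasoning, overall.
"""

SCORE_KEYS = ("correctness", "style", "completeness", "overall")


def build_prompt(user_input, assistant_output):
    """Fill the two slots; braces inside the arguments are left untouched by format()."""
    return JUDGE_TEMPLATE.format(input=user_input, output=assistant_output)


def validate_scores(raw_json):
    """Parse the judge reply; return the dict only if every score is an int in 1..5."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("reasoning"), str):
        return None
    for key in SCORE_KEYS:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value not in range(1, 6):
            return None
    return data
```

<p>Rejecting out-of-range or missing scores at parse time keeps a malformed reply from silently skewing the aggregate.</p>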
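<h2 id="pairwise-vs-direct-scoring">Pairwise vs Direct scoring</h2>
<p>There are two mainstream scoring modes:</p>
<h3 id="direct-scoring直接打分">Direct scoring</h3>
<p>Given one (input, output) pair, the judge assigns an absolute score (for example 1-5 or 1-10).</p>
<p>Pros: simple; shows how absolute quality changes over time<br>
Cons: score calibration is unstable (the judge&#39;s baseline may drift between batches)</p>
<h3 id="pairwise-comparison兩兩比較">Pairwise comparison</h3>
<p>Given one input and two outputs (A and B), the judge picks the better one.</p>
<p>Pros: relative comparison is more stable than absolute scoring; well suited to A/B testing<br>
Cons: needs two candidates; the result says &quot;A &gt; B&quot;, not how good A is</p>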
<p>How they combine in practice:</p>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Suitable mode</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Production quality monitoring</td>
          <td>Direct scoring (one score per trace)</td>
      </tr>
      <tr>
          <td>Prompt / model A/B test</td>
          <td>Pairwise (A against B)</td>
      </tr>
      <tr>
          <td>Before / after fine-tuning</td>
          <td>Pairwise</td>
      </tr>
      <tr>
          <td>Regression detection</td>
          <td>Direct (compare against a baseline)</td>
      </tr>
      <tr>
          <td>Synthetic data filtering</td>
          <td>Direct (keep scores ≥ 4)</td>
      </tr>
  </tbody>
</table>
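<h2 id="三大-bias-跟緩解">The three major biases and their mitigation</h2>
<h3 id="1-position-bias位置偏見">1. Position bias</h3>
<p>In pairwise comparison, the judge favors the candidate that appears first (usually A).</p>
<p><strong>Mitigation</strong>:</p>
<ul>
<li>Run twice with the positions swapped (A-B and B-A)</li>
<li>Count &quot;prefer A&quot; only when both runs favor A (and likewise for B); treat inconsistent verdicts as a tie</li>
<li>Standard LLM-as-Judge frameworks (such as MT-Bench) build this in</li>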
</ul>
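<p>The swap-and-compare rule above can be sketched as follows. This is a minimal sketch: <code>judge</code> is a hypothetical callable standing in for a real model call, assumed to return <code>"first"</code> or <code>"second"</code> for whichever displayed answer it prefers.</p>

```python
def debiased_preference(judge, user_input, answer_a, answer_b):
    """Run the pairwise judge in both orders; return 'A', 'B', or 'tie'.

    `judge(user_input, first, second)` is a hypothetical callable returning
    'first' or 'second'. Any other reply, or a verdict that flips when the
    order is swapped, counts as a tie.
    """
    forward = judge(user_input, answer_a, answer_b)   # A is shown first
    backward = judge(user_input, answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"  # A preferred in both orders
    if forward == "second" and backward == "first":
        return "B"  # B preferred in both orders
    return "tie"


if __name__ == "__main__":
    # A maximally position-biased stub judge: it always picks the first slot.
    always_first = lambda question, first, second: "first"
    print(debiased_preference(always_first, "q", "answer A", "answer B"))
```

<p>With the stub that always prefers the first slot, the two runs disagree and the verdict collapses to a tie, which is the intended behavior of the rule.</p>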
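<h3 id="2-verbosity-bias冗長偏見">2. Verbosity bias</h3>
<p>The judge tends to score long answers higher even when their content is no better than a short answer&#39;s.</p>
<p><strong>Mitigation</strong>:</p>
<ul>
<li>State in the rubric that length earns no credit and that an equally correct short answer = a long answer</li>
<li>Normalize by length: score = raw_score / log(length)</li>
<li>Use a length-controlled benchmark (such as length-controlled AlpacaEval)</li>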
</ul>
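<p>The length-normalization heuristic in the list, score = raw_score / log(length), can be sketched as below. Two details are assumptions added for the sketch and are not specified by the text: length is measured in tokens, and it is floored at 3 so that the logarithm stays above 1 for very short answers.</p>

```python
import math

MIN_LENGTH = 3  # assumption: floor the length so log(length) stays above 1


def length_normalized(raw_score, length_tokens):
    """Apply score = raw_score / log(length), with the length floored at MIN_LENGTH."""
    return raw_score / math.log(max(length_tokens, MIN_LENGTH))


if __name__ == "__main__":
    # Same raw score, different lengths: the longer answer ends up lower.
    short_answer = length_normalized(4, 50)
    long_answer = length_normalized(4, 500)
    print(f"short: {short_answer:.3f}  long: {long_answer:.3f}")
```

<p>The normalized value is no longer on the 1-5 scale, so it is only meaningful for comparing answers with each other, not against the rubric definitions.</p>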
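<h3 id="3-self-preference-bias自家偏好">3. Self-preference bias</h3>
<p>The judge favors answers in its own family&#39;s style (GPT as judge prefers GPT-style output; Claude as judge prefers Claude-style output).</p>
<p><strong>Mitigation</strong>:</p>
<ul>
<li>Use three judge models from different families (such as Claude + GPT + Gemini) and take the majority</li>
<li>Avoid using the same model as both judge and test subject</li>
<li>Use reasoning models as judges (consensus across several vendors&#39; reasoning models is more stable)</li>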
</ul>
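<p>Taking the majority across judges from different families can be sketched as follows. This is a minimal sketch that assumes each judge has already returned a verdict label; how the verdicts are obtained and what the labels are is left open.</p>

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the verdict backed by a strict majority of judges, else 'no-consensus'."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    top, count = Counter(verdicts).most_common(1)[0]
    # A strict majority means more than half of all votes.
    return top if count * 2 > len(verdicts) else "no-consensus"


if __name__ == "__main__":
    # Hypothetical verdicts from three judges of different model families.
    print(majority_verdict(["A", "A", "B"]))
    print(majority_verdict(["A", "B", "tie"]))
```

<p>Items that end in <code>no-consensus</code> are natural candidates for human review.</p>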
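<h3 id="補充-biasformat-bias">Additional bias: Format bias</h3>
<p>The judge favors answers that have markdown, code blocks, or visible structure even when their content is no better than plain text.</p>
<p><strong>Mitigation</strong>: state in the rubric that formatting earns no credit and only content counts.</p>
<h2 id="calibration校準">Calibration</h2>
<p>A judge should not be trusted blindly; calibrate it first:</p>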





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">1. Collect 100 (input, output) pairs
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">2. Have a human (you or a trusted reviewer) assign ground-truth scores
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">3. Run the judge on the same 100 pairs
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">4. Compute the agreement rate:
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">   - Pairwise: share of items where judge and human agree (target &gt; 75%)
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">   - Direct scoring: Spearman correlation (target &gt; 0.7)
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">5. If agreement is low:
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">   - Revise the rubric (make it more explicit)
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">   - Switch to a stronger judge model
</span></span><span class="line"><span class="ln">10</span><span class="cl">   - Revise the prompt (add few-shot examples)
</span></span><span class="line"><span class="ln">11</span><span class="cl">6. Only a calibrated judge should run on production</span></span></code></pre></div><p>Calibration is the step that aligns what the judge scores with what humans score; skipping it leaves production eval miscalibrated.</p>
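<p>For the pairwise case in step 4, the agreement rate is simply the share of items where the judge and the human picked the same label. A minimal sketch, assuming both sides use the same label set (for example <code>"A"</code>, <code>"B"</code>, <code>"tie"</code>); the 0.75 threshold is the target named in the procedure:</p>

```python
AGREEMENT_TARGET = 0.75  # pairwise target from the calibration procedure


def agreement_rate(judge_labels, human_labels):
    """Fraction of items on which the judge and the human chose the same label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    judge = ["A", "B", "A", "tie", "B"]  # made-up labels for illustration
    human = ["A", "B", "B", "tie", "B"]
    rate = agreement_rate(judge, human)
    status = "ok" if rate > AGREEMENT_TARGET else "needs work"
    print(f"agreement = {rate:.0%} ({status})")
```

<p>Raw agreement does not correct for chance agreement, so with heavily skewed labels it can look better than the judge really is.</p>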
<h2 id="跟-420-llm-tracing-的閉環">The closed loop with <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a></h2>
<p>Production traces plus LLM-as-Judge form an automated eval pipeline:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Production users
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   ↓ generate traces
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">[LLM tracing platform] (LangSmith / Phoenix / Langfuse / Braintrust)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">   ↓ filter: traces with user thumbs-down, errors, long latency, etc.
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">   ↓ sample 100 per day
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">[LLM-as-Judge batch run]
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   ↓ rubric scoring
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">[Dashboard]
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">   - which query types are degrading
</span></span><span class="line"><span class="ln">10</span><span class="cl">   - which deployment version performs poorly
</span></span><span class="line"><span class="ln">11</span><span class="cl">   - which user segment has a poor experience
</span></span><span class="line"><span class="ln">12</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">13</span><span class="cl">Trigger an alert / change the prompt / change the model / roll back
</span></span><span class="line"><span class="ln">14</span><span class="cl">   ↓ A/B test
</span></span><span class="line"><span class="ln">15</span><span class="cl">   ↓ Pairwise judge eval new vs old
</span></span><span class="line"><span class="ln">16</span><span class="cl">   ↓ Deploy the winner</span></span></code></pre></div><p>This is the standard quality-engineering loop for production LLM applications.</p>
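<p>The filter and sample stages of the pipeline can be sketched as below. This is a minimal sketch: the trace field names (<code>feedback</code>, <code>error</code>, <code>latency_s</code>) and the 30-second latency threshold are hypothetical and will differ between tracing platforms; the budget of 100 per day comes from the diagram.</p>

```python
import random


def select_for_judging(traces, daily_budget=100, latency_threshold_s=30.0, seed=None):
    """Keep traces flagged by feedback, error, or latency, then sample up to the budget.

    Field names are hypothetical; map them to whatever your tracing export provides.
    """
    flagged = [
        t for t in traces
        if t.get("feedback") == "thumbs_down"
        or t.get("error")
        or t.get("latency_s", 0.0) > latency_threshold_s
    ]
    if len(flagged) > daily_budget:
        # Uniform random sample without replacement; pass a seed for repeatable runs.
        return random.Random(seed).sample(flagged, daily_budget)
    return flagged


if __name__ == "__main__":
    traces = [
        {"id": 1, "feedback": "thumbs_down"},
        {"id": 2, "latency_s": 45.0},
        {"id": 3, "latency_s": 2.0},
        {"id": 4, "error": "tool timeout"},
    ]
    print([t["id"] for t in select_for_judging(traces, daily_budget=2, seed=7)])
```

<p>Because this selection is biased toward problem traces by design, scores from it describe where the system fails, not its average quality.</p>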
<h2 id="judge-model-選型">Choosing a judge model</h2>
<table>
  <thead>
      <tr>
          <th>Judge model candidate</th>
          <th>Strengths</th>
          <th>Weaknesses</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet / Opus</td>
          <td>Strong reasoning; follows the rubric closely</td>
          <td>Moderate cost</td>
      </tr>
      <tr>
          <td>GPT-5 / GPT-4o</td>
          <td>Widely used; strong tool calling</td>
          <td>Self-preference toward its own GPT output</td>
      </tr>
      <tr>
          <td>Gemini Pro 2.5</td>
          <td>Strong long context; multi-modal</td>
          <td>Follows the rubric more loosely</td>
      </tr>
      <tr>
          <td>o1 / o3 / R1 (reasoning models)</td>
          <td>Strong reasoning; stable on nuanced cases</td>
          <td>High cost; long latency</td>
      </tr>
      <tr>
          <td>Local 30B+ models (QwQ, DeepSeek-R1 distill)</td>
          <td>Strong privacy; no API cost</td>
          <td>Capability ceiling below cloud flagships</td>
      </tr>
  </tbody>
</table>
<p>How to choose:</p>
<ol>
<li><strong>High stakes / final QA</strong>: a cloud flagship reasoning model</li>
<li><strong>High-volume production trace eval</strong>: a mid-tier model (GPT-4o / Sonnet) that balances cost and speed</li>
<li><strong>Privacy-sensitive (user traces cannot leave for the cloud)</strong>: a local reasoning model (QwQ-32B / R1 distill)</li>
<li><strong>A/B testing a prompt change</strong>: run before and after with the same judge to keep the baseline fixed</li>
</ol>
<h2 id="失敗模式">Failure modes</h2>
<ol>
<li><strong>The rubric is too vague</strong>: the judge improvises and the scores are not repeatable</li>
</ol>
<p><strong>Mitigation</strong>: write the rubric like a unit test, with concrete criteria for every score</p>
<ol start="2">
<li><strong>No calibration</strong>: judge-human agreement was never verified, so the judge may be systematically off</li>
</ol>
<p><strong>Mitigation</strong>: recalibrate after every major rubric change or judge model switch</p>
<ol start="3">
<li><strong>The sample does not represent production</strong>: only easy cases are evaluated, and the genuinely hard production cases are not covered</li>
</ol>
<p><strong>Mitigation</strong>: use stratified sampling (by difficulty / user segment / feature)</p>
<ol start="4">
<li><strong>Bias is not mitigated</strong>: position / verbosity / self-preference bias is baked straight into the scores</li>
</ol>
<p><strong>Mitigation</strong>: standard frameworks (DeepEval / Inspect / Braintrust) build in bias mitigation; an existing framework is more reliable than DIY</p>
<ol start="5">
<li><strong>Judge cost is higher than expected</strong>: running the judge on every production trace blows up the bill</li>
</ol>
<p><strong>Mitigation</strong>: keep the sample rate &lt; 10%, combined with the sampling in <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">LLM tracing</a></p>
<ol start="6">
<li><strong>Over-reliance on the judge</strong>: forgetting that the judge also makes mistakes and treating it as absolute truth</li>
</ol>
<p><strong>Mitigation</strong>: high-stakes tasks still need spot human review; the judge is an 80% solution, not a 100% one</p>
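<p>The stratified sampling named as the mitigation for the third failure mode can be sketched as follows. This is a minimal sketch, assuming each item carries a field to stratify on; the <code>difficulty</code> field in the example is hypothetical.</p>

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Sample up to `per_stratum` items from each group defined by `key(item)`."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    sample = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


if __name__ == "__main__":
    # Ten easy cases and two hard ones; a plain random sample would mostly pick easy.
    items = [{"id": i, "difficulty": "easy"} for i in range(10)]
    items += [{"id": 100 + i, "difficulty": "hard"} for i in range(2)]
    picked = stratified_sample(items, key=lambda it: it["difficulty"], per_stratum=3)
    print([(it["id"], it["difficulty"]) for it in picked])
```

<p>Small strata are taken whole, so rare but hard cases stay in the eval set instead of being diluted by the easy majority.</p>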
<h2 id="主流-framework">Mainstream frameworks</h2>
<table>
  <thead>
      <tr>
          <th>Framework</th>
          <th>Highlights</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DeepEval</td>
          <td>OSS, Python, integrates with pytest</td>
      </tr>
      <tr>
          <td>Inspect (UK AI Security Institute, formerly the AI Safety Institute)</td>
          <td>Powerful eval framework, friendly to reasoning models</td>
      </tr>
      <tr>
          <td>Braintrust</td>
          <td>SaaS, eval and tracing in one</td>
      </tr>
      <tr>
          <td>Langfuse evals</td>
          <td>OSS, integrates with tracing</td>
      </tr>
      <tr>
          <td>OpenAI evals</td>
          <td>OSS, Anthropic models also supported</td>
      </tr>
      <tr>
          <td>Patronus</td>
          <td>Production eval SaaS</td>
      </tr>
  </tbody>
</table>
<h2 id="何時不該用-llm-as-judge">When not to use LLM-as-Judge</h2>
<ol>
<li><strong>Mechanically verifiable</strong>: unit tests, exact match, output schema validation; a deterministic rule is more reliable than a judge</li>
<li><strong>Tiny datasets (&lt; 20 items)</strong>: do human eval directly; a judge is not needed</li>
<li><strong>Judgments that need domain expertise</strong>: for high-stakes calls in medicine, law, or security, a judge should not replace an expert</li>
<li><strong>Judge capability &lt; test subject</strong>: for example, a GPT-4o judge scoring o3 output may not be able to follow the reasoning trace</li>
</ol>
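<p>For the first case, mechanically verifiable outputs, a deterministic check needs only a few lines. A minimal sketch, with <code>exact_match</code> and <code>has_schema</code> as illustrative helper names; the schema check here only tests that required keys exist with the expected types.</p>

```python
import json


def exact_match(expected, actual):
    """Deterministic check: equal after trimming surrounding whitespace."""
    return expected.strip() == actual.strip()


def has_schema(raw_json, required):
    """True if raw_json parses to an object whose required keys have the given types."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), expected) for key, expected in required.items())


if __name__ == "__main__":
    print(exact_match("42\n", "42"))
    print(has_schema('{"name": "x", "count": 3}', {"name": str, "count": int}))
    print(has_schema('{"name": "x"}', {"name": str, "count": int}))
```

<p>Checks like these cost nothing per item and return the same answer on every run, which is why they come before a judge wherever ground truth exists.</p>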
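<h2 id="何時過時--何時不過時">What will and will not go out of date</h2>
<p><strong>What will not go out of date</strong>:</p>
<ul>
<li>LLM-as-Judge as the mainstream method for production eval</li>
<li>The four-section judge prompt structure (task / input-output / rubric / format)</li>
<li>The trade-off between pairwise and direct scoring</li>
<li>The three-way bias classification and its mitigations</li>
<li>The production trace → judge → action loop</li>
</ul>
<p><strong>What will change</strong>:</p>
<ul>
<li>The mainstream frameworks (DeepEval / Inspect / Braintrust and others)</li>
<li>The specific capabilities of each judge model (every generation brings stronger models)</li>
<li>The exact size of each bias (human agreement figures shift with time and task)</li>
<li>Newly discovered biases and their mitigations</li>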
</ul>
<h2 id="下一步">Next steps</h2>
<p>At this point Module 4 covers the full application-layer map: from the basics (4.0 prompt technique spectrum / 4.1-4.2 RAG / 4.3 tool / 4.4 agent / 4.5 HITL), through protocols and orchestration (4.6 protocols / 4.7 workflow / 4.8 multi-agent) and production details (4.9-4.12 resource / artifact / long-context / embedding), to the eval and production observability loop (4.13 eval framework / 4.14 benchmarking / 4.17-4.21 harness / caching / memory / tracing / judge). For end-to-end hands-on cases, see the <a href="/blog/llm/04-applications/hands-on/" data-link-title="4.x Hands-on：端到端案例" data-link-desc="把模組四的所有原理串成具體 case study：從 task decomposition、workflow 設計、eval 設計到 iteration loop">hands-on subcategory</a>. From here, go to <a href="/blog/llm/05-discrete-gpu/" data-link-title="模組五：Windows / Linux &#43; 獨立 GPU" data-link-desc="消費級 PC（Windows / Linux &#43; NVIDIA / AMD 獨立 GPU）跑本地 LLM 的硬體判讀、MoE CPU 卸載、KV cache 量化與 llama.cpp 調參">Module 5</a> for local inference hardware, to <a href="/blog/llm/06-security/" data-link-title="模組六：本地 LLM 的安全與權限" data-link-desc="個人 dev 在自己機器上跑本地 LLM 的安全議題：模型供應鏈、推論伺服器綁定、tool use 副作用、prompt injection 在 IDE、跨雲端 / 本地資料邊界">Module 6</a> for security topics (especially <a href="/blog/llm/06-security/owasp-llm-top10-mapping/" data-link-title="6.6 OWASP LLM Top 10 對照圖" data-link-desc="把模組六的本地 dev 視角安全章節對照到 OWASP LLM Top 10 2025、補出個人 dev 場景跟企業合規溝通的共同詞彙">6.6 OWASP LLM Top 10 mapping</a>, which maps the security issues of production eval onto enterprise compliance vocabulary), or back to <a href="/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13 Eval design coordinate system</a> for where the judge sits in the meta eval framework.</p>
]]></content:encoded></item></channel></rss>