<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Evals on Tarragon</title><link>https://tarrragon.github.io/blog/tags/evals/</link><description>Recent content in Evals on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Thu, 14 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/evals/index.xml" rel="self" type="application/rss+xml"/><item><title>Beyond LLM: Enhancing LLM Applications (Stanford CS230)</title><link>https://tarrragon.github.io/blog/llm/lectures/stanford-cs230-beyond-llm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/lectures/stanford-cs230-beyond-llm/</guid><description>&lt;blockquote>
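&lt;p>&lt;strong>Source&lt;/strong>: Stanford CS230 Deep Learning, lecture &amp;ldquo;Beyond LLM: Enhancing Large Language Model Applications&amp;rdquo;.&lt;/p>
&lt;p>&lt;strong>Editing principles&lt;/strong>: keep the speaker&amp;rsquo;s original English to avoid translation drift, remove spoken filler, and reorganize the material as an article. Headings and introductory notes were originally written in zh-Hant.&lt;/p>&lt;/blockquote>
&lt;h2 id="講座定位">Where this lecture fits&lt;/h2>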
&lt;p>We started with neurons, then layers, then deep networks, then how to structure projects in C3. This lecture goes one level beyond: what would it look like if you were building agentic AI systems at work, in a startup, in a company?&lt;/p>
&lt;p>The goal is not to build an end-to-end product in the next hour, but to give you the breadth of techniques that AI engineers have figured out, and are still exploring, so that after class you have the background to dive deeper and learn faster.&lt;/p>
&lt;p>Agenda:&lt;/p>
&lt;ol>
&lt;li>Challenges and opportunities for augmenting LLMs&lt;/li>
&lt;li>Prompt engineering&lt;/li>
&lt;li>Fine-tuning (and why to mostly avoid it)&lt;/li>
&lt;li>Retrieval-Augmented Generation (RAG)&lt;/li>
&lt;li>Agentic AI workflows&lt;/li>
&lt;li>Case study with evals&lt;/li>
&lt;li>Multi-agent workflows&lt;/li>
&lt;li>What&amp;rsquo;s next in AI&lt;/li>
&lt;/ol>
&lt;h2 id="1-why-augment-llms">1. Why augment LLMs?&lt;/h2>
&lt;p>Limitations that show up when you use a vanilla pre-trained model:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Lacks domain knowledge&lt;/strong> — e.g. a student project building an autonomous farming device with a camera that classifies sick crops. That data set isn&amp;rsquo;t out there; a pre-trained vision model lacks that knowledge.&lt;/li>
&lt;li>&lt;strong>Real-world distribution shift&lt;/strong> — the model was trained on high-quality data, but data in the wild is much messier.&lt;/li>
&lt;li>&lt;strong>Lacks current information&lt;/strong> — retraining from scratch every few months is impractical. Example: during Trump&amp;rsquo;s first presidency he tweeted &amp;ldquo;Covfefe.&amp;rdquo; The word didn&amp;rsquo;t exist; Twitter&amp;rsquo;s LLMs couldn&amp;rsquo;t recognize it, recommender systems went wild. New trends and slang (rizz, mid, etc.) appear constantly and you can&amp;rsquo;t keep retraining.&lt;/li>
&lt;li>&lt;strong>Trained for breadth, not depth&lt;/strong> — fine on a wide range of tasks, but may not be precise enough for narrow, well-defined enterprise applications with high precision / low latency requirements.&lt;/li>
&lt;li>&lt;strong>Carries unnecessary weight&lt;/strong> — a massive model where you only use 2% of capability is slow and expensive. Pruning, quantization, and modification are options.&lt;/li>
&lt;/ul>
&lt;h3 id="llms-are-hard-to-control">LLMs are hard to control&lt;/h3>
&lt;p>In 2016 Microsoft launched a Twitter bot that learned from users and quickly became a racist jerk. They removed it 16 hours after launch. Even better-funded teams struggle: there&amp;rsquo;s an ongoing debate (Elon Musk vs Sam Altman) on whose LLM is the &amp;ldquo;propaganda machine.&amp;rdquo; If you hang out on X you&amp;rsquo;ll see screenshots of LLMs saying controversial things. Even the best-funded labs don&amp;rsquo;t do a great job of controlling their LLMs.&lt;/p></description><content:encoded><![CDATA[<blockquote>
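<p><strong>Source</strong>: Stanford CS230 Deep Learning, lecture &ldquo;Beyond LLM: Enhancing Large Language Model Applications&rdquo;.</p>
<p><strong>Editing principles</strong>: keep the speaker&rsquo;s original English to avoid translation drift, remove spoken filler, and reorganize the material as an article. Headings and introductory notes were originally written in zh-Hant.</p></blockquote>
<h2 id="講座定位">Where this lecture fits</h2>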
<p>We started with neurons, then layers, then deep networks, then how to structure projects in C3. This lecture goes one level beyond: what would it look like if you were building agentic AI systems at work, in a startup, in a company?</p>
<p>The goal is not to build an end-to-end product in the next hour, but to give you the breadth of techniques that AI engineers have figured out, and are still exploring, so that after class you have the background to dive deeper and learn faster.</p>
<p>Agenda:</p>
<ol>
<li>Challenges and opportunities for augmenting LLMs</li>
<li>Prompt engineering</li>
<li>Fine-tuning (and why to mostly avoid it)</li>
<li>Retrieval-Augmented Generation (RAG)</li>
<li>Agentic AI workflows</li>
<li>Case study with evals</li>
<li>Multi-agent workflows</li>
<li>What&rsquo;s next in AI</li>
</ol>
<h2 id="1-why-augment-llms">1. Why augment LLMs?</h2>
<p>Limitations that show up when you use a vanilla pre-trained model:</p>
<ul>
<li><strong>Lacks domain knowledge</strong> — e.g. a student project building an autonomous farming device with a camera that classifies sick crops. That data set isn&rsquo;t out there; a pre-trained vision model lacks that knowledge.</li>
<li><strong>Real-world distribution shift</strong> — the model was trained on high-quality data, but data in the wild is much messier.</li>
<li><strong>Lacks current information</strong> — retraining from scratch every few months is impractical. Example: during Trump&rsquo;s first presidency he tweeted &ldquo;Covfefe.&rdquo; The word didn&rsquo;t exist; Twitter&rsquo;s LLMs couldn&rsquo;t recognize it, recommender systems went wild. New trends and slang (rizz, mid, etc.) appear constantly and you can&rsquo;t keep retraining.</li>
<li><strong>Trained for breadth, not depth</strong> — fine on a wide range of tasks, but may not be precise enough for narrow, well-defined enterprise applications with high precision / low latency requirements.</li>
<li><strong>Carries unnecessary weight</strong> — a massive model where you only use 2% of capability is slow and expensive. Pruning, quantization, and modification are options.</li>
</ul>
<h3 id="llms-are-hard-to-control">LLMs are hard to control</h3>
<p>In 2016 Microsoft launched a Twitter bot that learned from users and quickly became a racist jerk. They removed it 16 hours after launch. Even better-funded teams struggle: there&rsquo;s an ongoing debate (Elon Musk vs Sam Altman) on whose LLM is the &ldquo;propaganda machine.&rdquo; If you hang out on X you&rsquo;ll see screenshots of LLMs saying controversial things. Even the best-funded labs don&rsquo;t do a great job of controlling their LLMs.</p>
<h3 id="llms-may-underperform-on-your-task">LLMs may underperform on your task</h3>
<ul>
<li>Specific knowledge gaps (e.g. medical diagnosis)</li>
<li>Missing sources — research, education, legal all require sourcing</li>
<li>Inconsistencies in style / format (e.g. legal contracts where every word counts)</li>
<li>Task-specific understanding — example: a biotech company categorizing reviews as positive / neutral / negative. What counts as &ldquo;negative&rdquo; in that industry may differ from a generic LLM&rsquo;s notion. You need to align the LLM to your task.</li>
</ul>
<h3 id="limited-context-handling">Limited context handling</h3>
<p>A lot of enterprise applications need large context. Example: an LLM running on top of your entire drive that can answer &ldquo;what was our Q4 sales performance?&rdquo; in one shot. In practice the context window is limited (best models today max out around hundreds of thousands of tokens; 200K ≈ two books). For video or large data, you have to chunk and embed.</p>
<p>The <strong>attention mechanism</strong> doesn&rsquo;t attend well over very large contexts. The <strong>needle-in-a-haystack</strong> benchmark tests this: insert a single sentence (&ldquo;Arun and Max are having coffee at Blue Bottle&rdquo;) in the middle of a very long text like the Bible, then ask &ldquo;what were Arun and Max having?&rdquo; It&rsquo;s complex not because the question is hard but because the model must find a fact within a huge corpus.</p>
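<p>A needle-in-a-haystack test case can be assembled in a few lines. The sketch below is only an illustration: the filler text, the <code>depth</code> parameter, and the keyword check are assumptions, not the benchmark&rsquo;s official harness, and no model is called.</p>

```python
def build_haystack_prompt(filler_sentences, needle, question, depth=0.5):
    """Insert `needle` at a relative depth inside the filler text, then append the question."""
    position = int(len(filler_sentences) * depth)
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer using only the text above."


def contains_answer(model_output, expected_keywords):
    """Crude pass/fail check: every expected keyword appears in the model output."""
    lowered = model_output.lower()
    return all(keyword.lower() in lowered for keyword in expected_keywords)


# Hypothetical filler; a real test would use a long natural text.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "Arun and Max are having coffee at Blue Bottle."
prompt = build_haystack_prompt(filler, needle, "What were Arun and Max having?")
```

<p>Sweeping <code>depth</code> from 0 to 1 and growing the filler is one way to see where retrieval inside the context starts to degrade.</p>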
<h3 id="the-rag-debate">The RAG debate</h3>
<p>In theory, with infinite compute, RAG is useless — you could just read a massive corpus immediately and answer. But even then, latency matters; imagine the LLM reading your entire drive on every question. RAG also has other advantages: accuracy, sourcing.</p>
<p>Analogy to search: when you search, you still find sources. There&rsquo;s detailed traversal that ranks and finds specific links. Without that, you&rsquo;d be reading the entire web every query — not reasonable. So RAG-like approaches likely stay relevant.</p>
<h2 id="2-two-dimensions-of-optimization">2. Two dimensions of optimization</h2>
<p>Two axes when improving LLM-based products:</p>
<ol>
<li><strong>Foundation model axis</strong> — move from GPT-3.5 Turbo → GPT-4 → GPT-4o → GPT-5. Each step (in theory) improves base performance.</li>
<li><strong>Engineering axis</strong> — keep the same base model, but engineer how you leverage it: better prompts, <a href="/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG</a>, agentic workflow, multi-agent system.</li>
</ol>
<p>This lecture is about the vertical axis, the engineering one: given the LLM you are using, how do you maximize its performance?</p>
<h2 id="3-prompt-engineering">3. Prompt engineering</h2>
<h3 id="the-bcg--hbs--upenn--wharton-study">The BCG / HBS / UPenn / Wharton study</h3>
<p>Three groups of BCG consultants:</p>
<ol>
<li>No AI access</li>
<li>GPT-4 access</li>
<li>GPT-4 + training on how to prompt</li>
</ol>
<p>Two interesting findings:</p>
<p><strong>The jagged frontier</strong>: some tasks fall within the frontier where AI clearly helps; others fall outside, where AI actually makes performance worse. Many tasks fell within, many fell outside. Researchers also observed &ldquo;falling asleep at the wheel&rdquo; — relying on AI for a task beyond the frontier, and not reviewing outputs carefully.</p>
<p><strong>Centaurs vs cyborgs</strong>: two working modes.</p>
<ul>
<li><strong>Centaurs</strong> divide and delegate — give a big task to the AI, let it work, come back later. (Half human / half horse: clear delegation.)</li>
<li><strong>Cyborgs</strong> fully blend with AI — fast back-and-forth, augmented. Students often work like cyborgs; in the enterprise, when you automate a workflow, you&rsquo;re thinking like a centaur.</li>
</ul>
<p>The trained group did best. Prompt engineering is a skill everyone should have — not a job title to build a career on, but a powerful skill in your career.</p>
<h3 id="basic-prompt-design-principles">Basic prompt design principles</h3>
<p>A weak prompt:</p>
<blockquote>
<p>Summarize this document. {document}</p></blockquote>
<p>The model has no context on length, audience, focus. Better:</p>
<blockquote>
<p>Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers.</p></blockquote>
<p>Common techniques to make it even better:</p>
<ul>
<li><strong>Give an example</strong> of a great summary</li>
<li><strong>Role prompting</strong>: &ldquo;Act as a renewable energy expert giving a conference at Davos&rdquo;</li>
<li><strong>Praise</strong>: &ldquo;You are the best in the world at this&rdquo;</li>
<li><strong>Reflection / self-critique</strong>: ask the model to critique its own output and revise</li>
<li><strong><a href="/blog/llm/knowledge-cards/chain-of-thought/" data-link-title="Chain-of-Thought（CoT）" data-link-desc="讓 LLM 先輸出推理步驟再給最終答案的 prompting / 訓練方式、reasoning model 的基礎機制">Chain of thought</a></strong>: break the task into explicit steps, &ldquo;think step by step, do not skip any step.&rdquo; Step 1 identify the three most important findings; Step 2 explain impact; Step 3 write the five-bullet summary.</li>
</ul>
<p>Andrew Ng recommends looking at other people&rsquo;s prompts. Repos like &ldquo;awesome <a href="/blog/llm/knowledge-cards/scaffold-vs-harness/" data-link-title="Scaffold vs Harness" data-link-desc="Coding agent 的兩個工程層次：scaffold 是建構時靜態結構、harness 是 runtime 的 tool dispatch &#43; context management &#43; safety">prompt template</a>&rdquo; on GitHub have many examples engineers have built. Many start with &ldquo;Act as a Linux terminal&rdquo;, &ldquo;Act as an English translator&rdquo;, &ldquo;Act as a position interviewer&rdquo;, etc.</p>
<h3 id="prompt-templates">Prompt templates</h3>
<p>The advantage of a template is you can put it in your code and scale across many user requests. Example from Workera: the HR system has &ldquo;Jane is a Product Manager Level 3, US, preferred language English.&rdquo; That metadata gets inserted into a prompt template that personalizes for Jane. Same template, different metadata for Joe (preferred language Spanish).</p>
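<p>As a rough sketch of that idea, the template below is filled from a metadata record. The field names, the template wording, and every value for Joe other than his preferred language are illustrative assumptions, not Workera&rsquo;s actual schema.</p>

```python
from string import Template

# Hypothetical template; the wording is an assumption for illustration.
PROMPT_TEMPLATE = Template(
    "You are a learning assistant.\n"
    "Learner: $name, $role (level $level), based in $country.\n"
    "Respond in $language.\n\n"
    "Request: $request"
)


def render_prompt(profile, request):
    """Fill the shared template with one user's metadata and their request."""
    return PROMPT_TEMPLATE.substitute(profile, request=request)


jane = {"name": "Jane", "role": "Product Manager", "level": 3,
        "country": "US", "language": "English"}
# Same template, different metadata (only the language comes from the lecture).
joe = dict(jane, name="Joe", language="Spanish")

jane_prompt = render_prompt(jane, "Suggest a course.")
joe_prompt = render_prompt(joe, "Suggest a course.")
```

<p><code>Template.substitute</code> raises <code>KeyError</code> when a field is missing, which is usually preferable to silently sending a half-filled prompt.</p>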
<p>Foundation models likely use <a href="/blog/llm/knowledge-cards/system-prompt/" data-link-title="System Prompt" data-link-desc="LLM application 中由開發者預設、不直接顯示給使用者的指令層、定義模型的角色、行為規範、輸出格式">system prompts</a> you don&rsquo;t see — e.g. ChatGPT may inject &ldquo;Act like a helpful assistant&rdquo; plus user memories from a database before your prompt. That doesn&rsquo;t stop you from adding your own template on top.</p>
<h3 id="zero-shot-vs-few-shot-prompting">Zero-shot vs <a href="/blog/llm/knowledge-cards/few-shot-prompting/" data-link-title="Few-shot prompting" data-link-desc="在 prompt 內塞 input-output 範例對齊任務、不動模型權重的 in-context learning 技術">few-shot prompting</a></h3>
<p>Zero-shot:</p>
<blockquote>
<p>Classify the tone as positive, negative, or neutral.
&ldquo;The product is fine, but I was expecting more.&rdquo;</p></blockquote>
<p>Different humans would label this differently — partially positive, partially negative. Alignment to your task can come from few-shot:</p>
<blockquote>
<p>Here are examples of tone classifications:
&ldquo;These exceeded my expectations completely.&rdquo; → positive
&ldquo;It&rsquo;s OK, but I wish it had more features.&rdquo; → negative
&ldquo;The service was adequate. Neither good nor bad.&rdquo; → neutral
Now classify: &ldquo;The product is fine, but I was expecting more.&rdquo;</p></blockquote>
<p>The model now likely says negative, aligned to the second example.</p>
<p>Sophisticated AI startups keep their few-shot examples up to date — whenever a user says something interesting, a human labels it and it gets appended to the relevant prompt. Like building a dataset, but inserted directly in the prompt. Faster to iterate because you don&rsquo;t touch model weights.</p>
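<p>A minimal sketch of keeping few-shot examples as data and rebuilding the prompt from them. The helper names and the label validation are assumptions for illustration; the three seed examples come from the prompt above.</p>

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}

FEW_SHOT_EXAMPLES = [
    ("These exceeded my expectations completely.", "positive"),
    ("It's OK, but I wish it had more features.", "negative"),
    ("The service was adequate. Neither good nor bad.", "neutral"),
]


def add_example(examples, text, label):
    """Append a human-labeled example; reject labels outside the task's label set."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label}")
    examples.append((text, label))


def build_few_shot_prompt(examples, text):
    """Render the examples followed by the new input to classify."""
    lines = ["Here are examples of tone classifications:"]
    for example_text, label in examples:
        lines.append(f'"{example_text}" -> {label}')
    lines.append(f'Now classify: "{text}"')
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    FEW_SHOT_EXAMPLES, "The product is fine, but I was expecting more."
)
```

<p>Because the examples live in a list rather than in model weights, adding a newly labeled user message only changes the next prompt that is built.</p>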
<blockquote>
<p><strong>Q</strong>: How long can the prompt be before the model loses itself?</p>
<p>There is research, but it dates fast. Practical example from Workera: a voice conversation eval breaks down after ~8 turns. Mitigation: chapter the conversation, summarize the first part, start over from a new prompt with the summary inserted.</p></blockquote>
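<p>The chaptering mitigation could look roughly like the sketch below. The class, the <code>max_turns</code> default of 8, and the <code>summarize</code> callable (which would normally be an LLM call) are all assumptions for illustration.</p>

```python
class ChapteredConversation:
    """Keep a short window of turns; fold older turns into a running summary."""

    def __init__(self, summarize, max_turns=8):
        self.summarize = summarize  # callable(previous_summary, turns) returning a string
        self.max_turns = max_turns
        self.summary = ""
        self.turns = []

    def add_turn(self, text):
        if len(self.turns) >= self.max_turns:
            # Close the chapter: summarize it and start over from a fresh prompt.
            self.summary = self.summarize(self.summary, self.turns)
            self.turns = []
        self.turns.append(text)

    def prompt(self):
        header = f"Summary of earlier conversation: {self.summary}\n" if self.summary else ""
        return header + "\n".join(self.turns)


def fake_summarize(previous_summary, turns):
    """Stand-in for an LLM summarizer so the sketch runs offline."""
    return f"{previous_summary} [{len(turns)} turns]".strip()


conversation = ChapteredConversation(fake_summarize)
for i in range(9):
    conversation.add_turn(f"turn {i}")
```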
<h3 id="chaining-complex-prompts">Chaining complex prompts</h3>
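<p>Prompt chaining is the most popular technique. It is <strong>not</strong> the same as chain of thought: a chain is several separate prompts, each consuming the previous prompt&rsquo;s output.</p>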
<p>Single prompt for a customer review response:</p>
<blockquote>
<p>Read this review and write a professional response that acknowledges concerns, explains the issue, offers a resolution. {review}</p></blockquote>
<p>You get one output. Hard to debug — everything is mixed together.</p>
<p>Chained version, three prompts:</p>
<ol>
<li>Extract the key issues from this review.</li>
<li>Using these issues, draft an outline.</li>
<li>Using the outline, write the full response.</li>
</ol>
<p>Advantages:</p>
<ul>
<li>Each prompt can be tested and optimized independently</li>
<li>You can identify which step is weakest (if the outline is good but the final response is rude, prompt 3 is the bottleneck)</li>
<li>Easier to debug than one mega-prompt</li>
</ul>
<p>Tradeoff: latency. Chains add latency, so for certain applications you don&rsquo;t want long chains.</p>
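<p>The three-prompt chain can be sketched as plain function composition. The <code>llm</code> parameter is an assumed callable that takes a prompt string and returns text; <code>fake_llm</code> is a stand-in so the example runs without any model.</p>

```python
def run_chain(review, llm):
    """Run the three-step chain and keep every intermediate output for debugging."""
    issues = llm(f"Extract the key issues from this review.\n\n{review}")
    outline = llm(f"Using these issues, draft an outline.\n\n{issues}")
    response = llm(f"Using the outline, write the full response.\n\n{outline}")
    return {"issues": issues, "outline": outline, "response": response}


def fake_llm(prompt):
    """Stand-in model: echoes the instruction line so the chain can run offline."""
    instruction = prompt.split("\n", 1)[0]
    return f"[output for: {instruction}]"


result = run_chain("The package arrived two weeks late.", fake_llm)
```

<p>Returning the intermediate outputs is the point of the design: each step can be inspected and tested on its own. The three sequential calls are also where the added latency comes from.</p>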
<h3 id="testing-prompts">Testing prompts</h3>
<p>Start with manual error analysis — a baseline prompt, a refined prompt, a chained workflow; humans rate outputs. Manual is slow but builds intuition.</p>
<p>To scale, use platforms (e.g. <strong>Promptfoo</strong>) that let you:</p>
<ul>
<li>Run the same prompt across multiple LLMs side by side in a table</li>
<li>Define <strong>LLM judges</strong></li>
</ul>
<p>Flavors of <a href="/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM judges</a>:</p>
<ul>
<li><strong>Pairwise comparison</strong>: &ldquo;Which summary is better?&rdquo;</li>
<li><strong>Single-answer grading</strong>: &ldquo;Grade this summary 1–5&rdquo;</li>
<li><strong>Reference-guided pairwise</strong> or <strong>rubric-based</strong>: e.g. &ldquo;A 5 is a summary below 100 chars, with three distinct key points, starting with an overview sentence; a 0 fails to summarize.&rdquo;</li>
</ul>
<p>You can stack techniques: few-shot the rubric with examples of 5/5, 4/5, 3/5, etc.</p>
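<p>A rubric-based judge boils down to a grading prompt plus a parser for the judge&rsquo;s reply. This is a minimal sketch, not Promptfoo&rsquo;s API: the 0 to 5 scale follows the rubric example above, and the <code>Score: N</code> reply format is an assumption.</p>

```python
import re

RUBRIC = (
    "Grade the summary from 0 to 5.\n"
    "5: below 100 characters, three distinct key points, starts with an overview sentence.\n"
    "0: fails to summarize.\n"
    "Reply with a line of the form 'Score: N'."
)


def build_judge_prompt(summary):
    """Combine the rubric with the summary to be graded."""
    return f"{RUBRIC}\n\nSummary:\n{summary}"


def parse_score(judge_output):
    """Extract the grade; return None when the judge did not follow the format."""
    match = re.search(r"Score:\s*([0-5])\b", judge_output)
    if match is None:
        return None
    return int(match.group(1))
```

<p>Treating an unparseable reply as <code>None</code> rather than as a low score keeps judge formatting failures separate from genuinely bad summaries.</p>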
<h2 id="4-fine-tuning-and-why-i-steer-away">4. Fine-tuning (and why I steer away)</h2>
<p>Reasons to avoid fine-tuning:</p>
<ul>
<li>Requires substantial labeled data</li>
<li>May overfit to specific data, losing general-purpose utility</li>
<li>Time- and cost-intensive — by the time you&rsquo;re done, the next base model is out and beating your fine-tuned version</li>
</ul>
<p>The advantage of prompt engineering is you can drop in the next best pre-trained model directly. Fine-tuning doesn&rsquo;t work like that.</p>
<p>When fine-tuning still makes sense:</p>
<ul>
<li>Task requires repeated high-precision outputs (legal, scientific)</li>
<li>The general-purpose LLM struggles with domain-specific language</li>
</ul>
<h3 id="the-slack-fine-tuning-cautionary-tale">The Slack fine-tuning cautionary tale</h3>
<p>Ross Lazerowitz (Sep 2023) fine-tuned a model on his company&rsquo;s Slack messages, hoping it would &ldquo;speak like us.&rdquo; Then he asked:</p>
<blockquote>
<p>Write a 500-word blog post on prompt engineering.</p></blockquote>
<p>The model: &ldquo;I shall work on that in the morning.&rdquo;</p>
<p>He pushes back: &ldquo;It&rsquo;s morning now.&rdquo;</p>
<p>Model: &ldquo;I&rsquo;m writing right now.&rdquo;</p>
<p>&ldquo;It&rsquo;s 6:30 AM here. Write it now.&rdquo;</p>
<p>&ldquo;OK, I shall write it now. I actually don&rsquo;t know what you would like me to say about prompt engineering. I can only describe the process&hellip;&rdquo;</p>
<p>It learned how people talk on Slack — not how they write blog posts. Fine-tuning went wrong because the training distribution wasn&rsquo;t the task distribution.</p>
<h2 id="5-retrieval-augmented-generation-rag">5. Retrieval-Augmented Generation (RAG)</h2>
<h3 id="why-standalone-llms-fall-short">Why standalone LLMs fall short</h3>
<ul>
<li>Small / hard-to-attend-to context windows</li>
<li>Knowledge gaps and training cutoff dates</li>
<li>Hallucinations — costly in medical, education</li>
<li>Lack of sources — research, education, legal love sources. Vanilla LLMs hallucinate fake research papers.</li>
</ul>
<h3 id="how-a-vanilla-rag-works">How a vanilla RAG works</h3>
<p>Question-answering in the medical field: &ldquo;What are the side effects of drug X?&rdquo;</p>
<ol>
<li><strong>Knowledge base</strong> of documents</li>
<li><strong>Embed</strong> documents into lower-dimensional vectors (trade-off: too small → lose info; too big → latency)</li>
<li>Store embeddings in a <strong>vector database</strong> with efficient retrieval and a distance metric</li>
<li><strong>Embed the user query</strong> with the same algorithm</li>
<li><strong>Retrieve</strong> the most relevant documents by distance</li>
<li>Pull those documents, paste into a <strong>prompt template</strong> like:</li>
</ol>
<blockquote>
<p>Answer the user query based on the list of documents. If the answer is not in the documents, say &ldquo;I don&rsquo;t know.&rdquo; Cite exact page, chapter, and line.</p></blockquote>
<p>You can extend the template to require links to the specific page.</p>
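<p>The steps above can be sketched end to end with toy parts. Everything here is a simplifying assumption: the word-count <code>embed</code> stands in for a real embedding model, a Python list stands in for the vector database, and the three documents are made up for illustration.</p>

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Embed the query the same way as the documents and return the k closest."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, documents):
    """Paste the retrieved documents into the answer template."""
    joined = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the user query based on the list of documents. "
        'If the answer is not in the documents, say "I don\'t know." '
        "Cite exact page, chapter, and line.\n\n"
        f"Documents:\n{joined}\n\nQuery: {query}"
    )


# Made-up knowledge base for illustration only.
KNOWLEDGE_BASE = [
    "Drug X side effects include nausea and headache.",
    "Drug Y is taken twice daily with food.",
    "Clinic opening hours are 9 to 5 on weekdays.",
]
INDEX = [(doc, embed(doc)) for doc in KNOWLEDGE_BASE]

query = "What are the side effects of drug X?"
prompt = build_prompt(query, retrieve(query, INDEX, k=1))
```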
<h3 id="improving-rags">Improving RAGs</h3>
<blockquote>
<p><strong>Q</strong>: Do document embeddings retain location info within large documents?</p>
<p>Vanilla RAGs may not. Example: the giant white paper inside a medication box would not be served well by a vanilla RAG.</p></blockquote>
<p>Two popular improvements:</p>
<p><strong>Chunking</strong> — store both the full document embedding and chapter-level embeddings; retrieve both, sourcing becomes more precise.</p>
<p><strong>HyDE (Hypothetical Document Embeddings)</strong> — the user query usually doesn&rsquo;t look like the documents. Example: &ldquo;What are the side effects of drug X?&rdquo; vs a multi-page document. To bridge the gap:</p>
<ol>
<li>Take the user query</li>
<li>Use a prompt to generate a fake hallucinated document answering it (&ldquo;write a 5-page report answering this query&rdquo;)</li>
<li>Embed that fake document</li>
<li>Compare its embedding to the vector DB</li>
</ol>
<p>The fake document is closer in structure to real documents, so retrieval is more accurate.</p>
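<p>A minimal HyDE sketch under toy assumptions: <code>generate</code> stands in for the LLM that writes the hypothetical document, and <code>toy_embed</code> with <code>jaccard</code> stands in for a real embedding model and distance metric. The documents are made up.</p>

```python
import re


def hyde_retrieve(query, index, embed, similarity, generate, k=3):
    """HyDE: embed a generated hypothetical document instead of the raw query."""
    hypothetical_doc = generate(f"Write a short report answering this query: {query}")
    probe = embed(hypothetical_doc)
    ranked = sorted(index, key=lambda item: similarity(probe, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def toy_embed(text):
    """Toy embedding: the set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Set overlap as a stand-in similarity metric."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def fake_generate(prompt):
    """Stand-in for the LLM call that hallucinates an answer document."""
    return "Patients taking drug X commonly report nausea, dizziness and headache."


DOCS = [
    "Reported adverse reactions: nausea, dizziness, headache.",
    "Shipping policy: orders leave the warehouse within two days.",
]
INDEX = [(doc, toy_embed(doc)) for doc in DOCS]

top = hyde_retrieve("What are the side effects of drug X?", INDEX,
                    toy_embed, jaccard, fake_generate, k=1)
```

<p>With these toy documents the raw query shares no words with the adverse-reactions document, while the hypothetical answer does, which is the gap HyDE is meant to bridge.</p>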
<p>These are just two of many RAG variants; research from 2020–2025 has many branches. (See the linked survey paper in the slides.)</p>
<h2 id="6-agentic-ai-workflows">6. Agentic AI workflows</h2>
<p>Andrew Ng coined &ldquo;agentic AI workflows&rdquo; because everyone uses &ldquo;agent&rdquo; to mean very different things — sometimes a single prompt, sometimes a complex multi-agent system. Calling everything an &ldquo;agent&rdquo; doesn&rsquo;t do it justice. Better term: <strong><a href="/blog/llm/knowledge-cards/agent/" data-link-title="LLM Agent" data-link-desc="把控制流交給 LLM 的應用模式：自主決策、跨多步呼叫工具、人類角色從主導變監督">agentic workflow</a></strong> — a multi-step process to complete a task, built from prompts, tools, additional resources, and API calls. This also avoids confusion with the RL definition of &ldquo;agent&rdquo; (interacts with environment, state transitions, reward, observation).</p>
<h3 id="one-shot-vs-agentic-example">One-shot vs agentic example</h3>
<p>User on a chatbot: &ldquo;What is your refund policy?&rdquo;</p>
<ul>
<li><strong>One-shot + RAG</strong>: &ldquo;Refunds are available within 30 days of purchase.&rdquo; [link to policy]</li>
<li><strong>Agentic</strong>:
<ol>
<li>Agent retrieves refund policy via RAG</li>
<li>Agent asks user for order number</li>
<li>Agent queries an API to check order details</li>
<li>Agent confirms: &ldquo;Your order qualifies. The amount will be processed in 3–5 business days.&rdquo;</li>
</ol>
</li>
</ul>
<p>Much more thoughtful than the vanilla one.</p>
<h3 id="specialized-agents-in-the-wild">Specialized agents in the wild</h3>
<p>In SF you&rsquo;ll see billboards: AI software engineer, AI skill mentor, AI SDR, AI lawyer, AI specialized cloud engineer. It would be a stretch to say everything works, but work is being done. (Personal opinion: putting a human face behind these is gimmicky and more scary than engaging. In a few years, very few products will use a human face — it&rsquo;s a marketing tactic.)</p>
<h3 id="paradigm-shift-traditional-software-vs-agentic-ai-software">Paradigm shift: traditional software vs agentic AI software</h3>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Traditional software</th>
          <th>Agentic AI software</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>Structured: JSON, databases, forms</td>
          <td>Free-form text, images, video; dynamic interpretation</td>
      </tr>
      <tr>
          <td>Logic</td>
          <td>Deterministic</td>
          <td>Fuzzy</td>
      </tr>
      <tr>
          <td>Decomposition</td>
          <td>Monolith / microservices</td>
          <td>Think as a manager: delegate to roles (graphic designer → marketing manager → performance marketing → data scientist)</td>
      </tr>
      <tr>
          <td>Cost of experimentation</td>
          <td>High; you rarely throw away code</td>
          <td>Low; AI companies are more comfortable throwing away code</td>
      </tr>
  </tbody>
</table>
<p>Fuzzy engineering is truly hard. If you let users ask anything, the chance of breakage and attack is high. Companies have been bitten because a user did something authorized that broke the database.</p>
<p>Example from Workera:</p>
<ul>
<li><strong>Deterministic item types</strong>: multiple choice, multi-select, drag-and-drop, ordering, matching — one correct answer.</li>
<li><strong>Fuzzy item types</strong>: voice questions, voice + coding role-plays — the scoring algorithm can make mistakes, and mistakes are costly.</li>
</ul>
<p>Mitigation: a <strong><a href="/blog/llm/knowledge-cards/human-in-the-loop/" data-link-title="Human-in-the-loop（HITL）" data-link-desc="人類介入 LLM 工作流的設計：三種觸發時機（pre-act / mid-stream / post-hoc）、避免橡皮圖章化的四條件">human in the loop</a></strong> — e.g. the appeal feature at the end of an assessment that lets users challenge the agent, bringing a human in to fix and align it.</p>
<p>Advice for building a company: get as much done deterministically as possible. Then for the fuzzy parts (back-and-forth interaction), design guardrails up front.</p>
<h3 id="enterprise-workflows-the-mckinsey-credit-memo-example">Enterprise workflows: the McKinsey credit memo example</h3>
<p>A financial institution takes 1–4 weeks to produce a credit risk memo:</p>
<ol>
<li>Relationship manager gathers data from 15+ sources</li>
<li>RM and credit analyst collaboratively analyze</li>
<li>Credit analyst spends 20+ hours writing the memo</li>
<li>RM and analyst loop on feedback</li>
</ol>
<p>With Gen AI agents (McKinsey study), time drops 20–60%:</p>
<ol>
<li>RM works with Gen AI agent, provides materials</li>
<li>Agent decomposes into tasks for specialist sub-agents</li>
<li>Agents gather data, draft memo</li>
<li>RM and analyst review and give feedback</li>
</ol>
<p>The hardest part is changing people. In theory, this is great. In practice — 100,000-employee enterprises will take 10–20 years to rewire job descriptions, business workflows, incentives, and training to make this real at scale.</p>
<h3 id="core-components-of-an-agent">Core components of an agent</h3>
<p>Take a travel booking agent:</p>
<ul>
<li><strong>Prompts</strong> — the prompts we&rsquo;ve learned to optimize</li>
<li><strong>Context management / memory</strong>:
<ul>
<li><strong>Core / working memory</strong>: fast access. Things needed every interaction (e.g. user&rsquo;s name).</li>
<li><strong>Archival / long-term memory</strong>: slower. Things used occasionally (e.g. birthday).</li>
<li>Why split: imagine ChatGPT had to re-read all memories on every call. If memory lookup takes 3 seconds, every interaction takes 3 seconds. Working memory must be highly optimized.</li>
</ul>
</li>
<li><strong>Tools</strong>: flight search API, hotel API, car rental API, weather API, payment processing API. You typically pass API documentation to the LLM; LLMs are good at reading JSON specs and learning the GET request format.</li>
<li><strong>Resources</strong> (Anthropic&rsquo;s term): data sitting somewhere (e.g. your CRM) that you let the agent read. Provide a lookup tool and access to the resource.</li>
</ul>
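<p>The core versus archival split can be sketched as a two-tier lookup. The class and method names are illustrative assumptions; <code>archival_lookup</code> stands in for a slower store such as a database or vector index.</p>

```python
class AgentMemory:
    """Two tiers: core facts go into every prompt, archival facts are fetched on demand."""

    def __init__(self, archival_lookup):
        self.core = {}
        self._archival_lookup = archival_lookup  # slow path, e.g. a database query
        self.archival_calls = 0

    def remember_core(self, key, value):
        self.core[key] = value

    def recall(self, key):
        """Serve from core memory when possible; otherwise pay for an archival lookup."""
        if key in self.core:
            return self.core[key]
        self.archival_calls += 1
        return self._archival_lookup(key)

    def prompt_header(self):
        """Only core memory is injected on every call."""
        facts = "; ".join(f"{k}: {v}" for k, v in sorted(self.core.items()))
        return f"Known about the user: {facts}"


# Made-up archival store for illustration.
archive = {"birthday": "1990-04-12"}
memory = AgentMemory(archive.get)
memory.remember_core("name", "Jane")
```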
<h3 id="degrees-of-autonomy">Degrees of autonomy</h3>
<p>From least to most autonomous:</p>
<ul>
<li><strong>Least</strong>: hard-code the steps. &ldquo;First identify intent, then look up history, then call the flight API, &hellip;&rdquo;</li>
<li><strong>Semi</strong>: hard-code the tools only. &ldquo;You&rsquo;re a travel agent, help the user book travel. Here are your tools.&rdquo;</li>
<li><strong>Most</strong>: agent decides both steps and tools. Give it a code editor; it can ping any web API, perform calculations, generate code to display data.</li>
</ul>
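<p>The first two levels can be contrasted in code. This is a sketch with made-up tools; <code>choose_tool</code> is an assumed callable standing in for an LLM that returns a tool name. The most autonomous level, where the agent writes and runs its own code, is deliberately left out.</p>

```python
# Hypothetical tools; real ones would call external APIs.
TOOLS = {
    "search_flights": lambda request: f"flights for {request}",
    "search_hotels": lambda request: f"hotels for {request}",
}


def hard_coded_agent(request):
    """Least autonomous: the developer fixes both the steps and their order."""
    return [TOOLS["search_flights"](request), TOOLS["search_hotels"](request)]


def tool_choosing_agent(request, choose_tool):
    """Semi-autonomous: the tools are fixed, but the model decides which one to call."""
    tool_name = choose_tool(request, sorted(TOOLS))
    if tool_name not in TOOLS:
        raise ValueError(f"model asked for unknown tool: {tool_name}")
    return TOOLS[tool_name](request)


def fake_choose_tool(request, tool_names):
    """Stand-in for an LLM call that picks a tool name from the list."""
    return "search_hotels" if "hotel" in request.lower() else "search_flights"
```

<p>Validating the returned tool name against the registry is a small guardrail: the model may choose, but only among tools the developer exposed.</p>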
<h3 id="apis-vs-mcp-model-context-protocol">APIs vs MCP (Model Context Protocol)</h3>
<p>With <strong>APIs</strong>, you teach the LLM to ping a specific API: give it documentation, define how to call it, what it returns. You do this one-off per API. Doesn&rsquo;t scale well.</p>
<p>With <strong><a href="/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP</a></strong> (Anthropic-coined), there&rsquo;s a system in the middle. Agents communicate with an MCP server:</p>
<blockquote>
<p>&ldquo;What do you need to give me flight info?&rdquo;
&ldquo;I need origin, destination, and what you&rsquo;re looking for.&rdquo;
&ldquo;Here are my requirements.&rdquo;
&ldquo;You forgot to tell me your budget.&rdquo;</p></blockquote>
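<p>In effect the agent and the server go back and forth: the MCP server describes what it offers and which inputs it needs, and your agent figures out how to get the data it needs. Companies publish their MCP servers for agents to connect to.</p>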
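<p>The back-and-forth above can be imitated with a toy in-process server. This is <strong>not</strong> the real MCP wire protocol or SDK: the class, method names, and response shapes below are assumptions that only illustrate the idea of a server advertising its tools and required inputs.</p>

```python
class ToyToolServer:
    """In-process stand-in for a tool server; not the real MCP protocol."""

    def __init__(self):
        # Hypothetical tool catalog with the inputs each tool requires.
        self._tools = {
            "flight_info": {
                "description": "Look up flights",
                "required": ["origin", "destination", "budget"],
            }
        }

    def list_tools(self):
        """Discovery step: tell the agent which tools exist and what they need."""
        return self._tools

    def call_tool(self, name, arguments):
        """Report missing inputs instead of failing, so the agent can retry."""
        missing = [field for field in self._tools[name]["required"] if field not in arguments]
        if missing:
            return {"ok": False, "missing": missing}
        return {"ok": True,
                "result": f"flights {arguments['origin']} to {arguments['destination']}"}


server = ToyToolServer()
first_try = server.call_tool("flight_info", {"origin": "SFO", "destination": "CDG"})
second_try = server.call_tool(
    "flight_info", {"origin": "SFO", "destination": "CDG", "budget": 900}
)
```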
<blockquote>
<p><strong>Q</strong>: Isn&rsquo;t MCP just a shifted maintenance burden — APIs change, MCPs change?</p>
<p>Yes. But at least the agent can go back and forth and discover requirements. Ideally a startup has documentation, an LLM workflow reads docs and updates code accordingly.</p></blockquote>
<blockquote>
<p><strong>Q</strong>: Are there security concerns with MCP?</p>
<p>Likely, depending on the data exposed. Most MCPs have authentication, like APIs. The exact security surface depends on the implementation.</p></blockquote>
<blockquote>
<p><strong>Q</strong>: Is MCP about efficiency or accessing more data?</p>
<p>Efficiency. You still control what data is exposed. Compared to one-off API integration, MCP lets a coding agent communicate efficiently with many MCP servers and find what it needs.</p></blockquote>
<h3 id="step-by-step-workflow-example-travel-agent">Step-by-step workflow example: travel agent</h3>
<ol>
<li>User: &ldquo;Plan a trip to Paris Dec 15–20 with flights, hotels near the Eiffel Tower, and an itinerary.&rdquo;</li>
<li>Agent plans steps: find flights, search hotels, generate recommendations, validate preferences/budget, book.</li>
<li>Execute: use tools, combine results.</li>
<li>Proactive interaction: propose to user, validate, iterate.</li>
<li>Update memory: &ldquo;User only likes direct flights.&rdquo; &ldquo;User is fine with 3-star hotels.&rdquo;</li>
</ol>
<h2 id="7-case-study-building-a-customer-support-agent--evals">7. Case study: building a customer support agent + evals</h2>
<p>PM asks you to build a customer support agent. Example: &ldquo;I need to change my shipping address for order X — I moved.&rdquo;</p>
<h3 id="where-to-start">Where to start</h3>
<ul>
<li><strong>Research existing models / benchmarks</strong> for customer support</li>
<li><strong>Decompose the task</strong>: what would a human support agent do?</li>
<li><strong>Guess what&rsquo;s fuzzy vs deterministic</strong> in advance</li>
</ul>
<blockquote>
<p>Recommended start: sit with a customer support agent for a day or two. Watch their workflow. Ask where they struggle and how much time each step takes. That gives you the task decomposition.</p></blockquote>
<h3 id="decomposed-task">Decomposed task</h3>
<p>A human support agent typically:</p>
<ol>
<li>Extracts key info</li>
<li>Looks up the customer record in the database</li>
<li>Checks policy (allowed to update address?)</li>
<li>Drafts a response email</li>
<li>Sends the email</li>
</ol>
<h3 id="designing-the-agentic-workflow">Designing the agentic workflow</h3>
<p>For each step, pick the right primitive:</p>
<ul>
<li><strong>Step 1 extract info</strong>: vanilla LLM call — extract intent, order number, new address</li>
<li><strong>Step 2 lookup + update</strong>: tool — connect to database (custom tool or MCP)</li>
<li><strong>Step 3 check policy</strong>: RAG or rule lookup</li>
<li><strong>Step 4 draft email</strong>: LLM call, with the confirmation pasted in</li>
<li><strong>Step 5 send email</strong>: tool — post to email API</li>
</ul>
<h3 id="evals-how-do-you-know-it-works">Evals: how do you know it works?</h3>
<p>Assume you have <strong>LLM traces</strong> (a must in any AI startup — if a startup doesn&rsquo;t have traces, debugging is brutal). Several dimensions for evaluation:</p>
<p><strong>End-to-end vs component-based</strong>:</p>
<ul>
<li>End-to-end: user satisfaction rating at the end. If user rates 1, follow up: &ldquo;What was the issue?&rdquo; → &ldquo;Prices were too high&rdquo; → fix the relevant tool/prompt.</li>
<li>Component-based: error-analyze each tool / prompt independently. &ldquo;The tool keeps forgetting to update the email field.&rdquo; &ldquo;The email-send call uses the wrong format.&rdquo;</li>
</ul>
<p><strong>Objective vs subjective</strong>:</p>
<ul>
<li>Objective: &ldquo;LLM extracted the wrong order ID.&rdquo; You can write Python to check alignment between user input and DB lookup. Catch automatically.</li>
<li>Subjective: &ldquo;Should we recommend a direct flight or cheaper indirect?&rdquo; Captured via:
<ul>
<li>Curated eval dataset — write 10 prompts where users say &ldquo;I prefer direct flights, I care about time.&rdquo; Define what a good output looks like.</li>
<li>LLM judges grading on a rubric.</li>
</ul>
</li>
</ul>
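<p>The objective check above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the trace field names (<code>user_message</code>, <code>looked_up_order_id</code>) and the <code>#12345</code>-style ID format are assumptions made for the example.</p>

```python
import re

# Assumed order-ID format: "#" followed by digits, e.g. "#12345".
ORDER_ID_PATTERN = re.compile(r"#(\d+)")


def order_id_matches(user_message: str, looked_up_order_id: str) -> bool:
    """True if the ID used for the DB lookup was actually mentioned by the user."""
    mentioned = ORDER_ID_PATTERN.findall(user_message)
    return looked_up_order_id in mentioned


def failing_traces(traces: list[dict]) -> list[dict]:
    """Keep only traces where the looked-up ID never appears in the user message."""
    return [
        t for t in traces
        if not order_id_matches(t["user_message"], t["looked_up_order_id"])
    ]


# Hypothetical traces; the second one has two digits swapped by the extractor.
traces = [
    {"user_message": "Please change the address for order #12345.",
     "looked_up_order_id": "12345"},
    {"user_message": "I moved, order #67890 needs a new address.",
     "looked_up_order_id": "67809"},
]
bad = failing_traces(traces)
print(len(bad), "trace(s) with a mismatched order ID")
```

<p>Because the check is plain string matching, it can run on every trace automatically, which is what makes this kind of error cheap to catch.</p>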
<p><strong>Quantitative vs qualitative</strong>:</p>
<ul>
<li>Quantitative: % successful address updates; latency per component (e.g. send-email takes 5s — too long).</li>
<li>Qualitative: error analysis on hallucinations, tone mismatch, user confusion. Typically white-glove.</li>
</ul>
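<p>The quantitative metrics above could be computed directly from traces. The sketch below assumes one trace row per component call with hypothetical fields <code>conversation_id</code>, <code>component</code>, <code>latency_s</code>, and <code>ok</code>; the numbers are made up for illustration.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace rows: one per component call.
trace_rows = [
    {"conversation_id": "c1", "component": "extract", "latency_s": 0.8, "ok": True},
    {"conversation_id": "c1", "component": "send_email", "latency_s": 5.0, "ok": True},
    {"conversation_id": "c2", "component": "extract", "latency_s": 1.2, "ok": True},
    {"conversation_id": "c2", "component": "send_email", "latency_s": 4.0, "ok": False},
]


def success_rate(rows):
    """Share of conversations in which every component call succeeded."""
    by_conversation = defaultdict(list)
    for row in rows:
        by_conversation[row["conversation_id"]].append(row["ok"])
    succeeded = sum(all(flags) for flags in by_conversation.values())
    return succeeded / len(by_conversation)


def mean_latency_by_component(rows):
    """Average latency in seconds for each component."""
    by_component = defaultdict(list)
    for row in rows:
        by_component[row["component"]].append(row["latency_s"])
    return {name: mean(values) for name, values in by_component.items()}


print(success_rate(trace_rows))
print(mean_latency_by_component(trace_rows))
```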
<p>Example of subjective tone eval: error-analyze 20 user interactions, notice the LLM seems rude / overly short. Then build LLM judges with a politeness rubric. Then swap the underlying LLM (GPT-4 → Grok → Llama), run side by side, see which is most polite on average. Or fix the LLM and tweak the prompt (&ldquo;Act like a travel agent&rdquo; → &ldquo;Act like a helpful travel agent&rdquo;) to measure the word&rsquo;s influence.</p>
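<p>Once an LLM judge has scored each candidate on the same prompts, the side-by-side comparison is a simple aggregation. The sketch below assumes the judge scores already exist; the model names and the 1-5 scores are placeholders, not measured results.</p>

```python
from statistics import mean

# Placeholder judge scores (1-5 politeness rubric) on the same five prompts.
politeness_scores = {
    "model_a": [4, 5, 3, 4, 4],
    "model_b": [3, 3, 4, 2, 3],
    "model_c": [5, 4, 4, 5, 4],
}


def rank_by_mean_score(scores_by_model):
    """Return (model, mean score) pairs, highest mean first."""
    means = {model: mean(scores) for model, scores in scores_by_model.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_mean_score(politeness_scores)
for model, score in ranking:
    print(model, score)
```

<p>The same function works for the prompt-tweak experiment: hold the model fixed and key the dictionary by prompt variant instead of by model.</p>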
<h2 id="8-multi-agent-workflows">8. Multi-agent workflows</h2>
<p>Why multi-agent when a single workflow already has multiple steps?</p>
<ul>
<li><strong>Parallelism</strong> — independent things can run in parallel</li>
<li><strong>Reuse</strong> — a design agent built once can serve marketing, product, etc. Many stakeholders benefit from one optimized agent.</li>
</ul>
<h3 id="smart-home-example">Smart home example</h3>
<p>Brainstormed by the class:</p>
<ul>
<li><strong>Biometric / location agent</strong>: tracks where you are and how you&rsquo;re moving</li>
<li><strong>Climate agent</strong>: monitors and adjusts room temperature</li>
<li><strong>Energy efficiency agent</strong>: tracks usage, gives feedback, may control utilities</li>
<li><strong>Security agent</strong>: identifies who&rsquo;s entering, applies role-based permissions (parent vs kid)</li>
<li><strong>Weather / external API agent</strong>: integrates outdoor conditions to control temperature, blinds, etc.</li>
<li><strong>Fridge / grocery agent</strong>: knows what&rsquo;s inside via camera, knows preferences, has e-commerce API access for restocking</li>
<li><strong>Notification / alerts agent</strong>: system updates, energy savings</li>
<li><strong>Orchestrator agent</strong>: the user-facing entry point that delegates to specialists</li>
</ul>
<h3 id="interaction-patterns">Interaction patterns</h3>
<ul>
<li><strong>Flat / all-to-all</strong>: every agent can talk to every agent</li>
<li><strong>Hierarchical</strong>: orchestrator routes to specialists</li>
</ul>
<p>Smart home likely wants <strong>hierarchical</strong> for UX — users want one interface, not one app per agent. Some flat links may still help (climate + energy efficiency probably need to talk directly).</p>
<p>When you allow agents to speak to each other, it&rsquo;s basically an MCP-style protocol: treat the other agent like a tool. &ldquo;Here&rsquo;s how you interact, here&rsquo;s what it tells you, here&rsquo;s what it needs from you.&rdquo;</p>
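<p>One way to picture &ldquo;treat the other agent like a tool&rdquo; is a small interface that states how to call the agent, what it returns, and what it needs. The sketch below is illustrative only: <code>AgentTool</code>, <code>orchestrate</code>, and the climate handler are hypothetical names, and this is not the MCP wire format.</p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentTool:
    """A specialist agent described the way a tool would be."""
    name: str
    description: str            # what it tells you
    required_inputs: list[str]  # what it needs from you
    handler: Callable[[dict], str]

    def call(self, request: dict) -> str:
        missing = [key for key in self.required_inputs if key not in request]
        if missing:
            raise ValueError(f"{self.name} is missing inputs: {missing}")
        return self.handler(request)


def climate_handler(request: dict) -> str:
    # Stand-in for a real climate agent; it only echoes the target.
    return f"Setting {request['room']} to {request['target_celsius']} C"


climate_agent = AgentTool(
    name="climate",
    description="Monitors and adjusts room temperature.",
    required_inputs=["room", "target_celsius"],
    handler=climate_handler,
)

# Hierarchical pattern: the orchestrator keeps a registry and delegates by name.
registry = {climate_agent.name: climate_agent}


def orchestrate(agent_name: str, request: dict) -> str:
    return registry[agent_name].call(request)


print(orchestrate("climate", {"room": "bedroom", "target_celsius": 21}))
```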
<h3 id="advantages">Advantages</h3>
<ul>
<li>Easier to debug specialized agents than a monolithic system</li>
<li>Parallelization, time savings</li>
</ul>
<h2 id="9-whats-next-in-ai">9. What&rsquo;s next in AI</h2>
<h3 id="are-we-plateauing-ilya-sutskevers-question">Are we plateauing? (Ilya Sutskever&rsquo;s question)</h3>
<p>The community feeling around the latest GPT release was that the performance jump wasn&rsquo;t what people expected, though unifying the models under one hood (no model selector) made the consumer UX better.</p>
<p>LLM <strong>scaling laws</strong> say more compute + energy → better performance, but that eventually plateaus. What takes us to the next step is probably <strong>architecture search</strong>. The human brain operates very differently — much more efficient, much faster, with far less data. Big labs are hiring thousands of engineers precisely to hunt the next architectural breakthrough. Whoever discovered Transformers had tremendous impact on AI&rsquo;s direction; the next analogous discovery could unlock a 10x reduction in compute and energy needs. (Foundation series analogy: individuals can disproportionately shape the future via their decisions.)</p>
<h3 id="multi-modality">Multi-modality</h3>
<p>LLMs started as text-only, added images. Models good at images are also better at text — being good at cat images makes you better at text about cats. Add audio and video, and the whole system improves. Pinnacle: robotics, where all modalities converge — the robot is better at avoiding a cat because it knows what a cat looks like, sounds like, smells like.</p>
<h3 id="methods-working-in-harmony">Methods working in harmony</h3>
<p>Humans probably use a mix of methods:</p>
<ul>
<li><strong>Meta-learning</strong> — survival instinct encoded in DNA (the baby&rsquo;s &ldquo;pre-training&rdquo;)</li>
<li><strong>Supervised</strong> — parents pointing and saying &ldquo;good / bad&rdquo;</li>
<li><strong>Reinforcement</strong> — falling and getting hurt</li>
<li><strong>Unsupervised</strong> — observing others</li>
</ul>
<p>Future AI systems likely combine the methods you saw in CS230, optimizing for speed, latency, cost, and energy.</p>
<h3 id="human-centric-vs-non-human-centric-research">Human-centric vs non-human-centric research</h3>
<p>The human body is limiting. Pure brain-modeled research may miss compute/energy optimizations. Still, the brain has lots to teach — e.g. one research direction asks: does the brain do backpropagation? Probably not — likely only forward propagation. Worth reading if you&rsquo;re curious about AI&rsquo;s direction.</p>
<h3 id="velocity">Velocity</h3>
<p>Things move so fast that we deliberately teach <strong>breadth</strong>, not depth, because today&rsquo;s specific RAG technique #17 will be irrelevant in two years. Get the breadth, and develop the ability to sprint into depth when needed. The half-life of skills is short.</p>
<h2 id="後話">Afterword</h2>
<p>This post is a write-up of the public Stanford CS230 lecture; the speaker&rsquo;s original English is kept to avoid translation drift. For this blog&rsquo;s corresponding principle-level material, continue with:</p>
<ul>
<li><a href="/blog/llm/04-applications/" data-link-title="模組四：LLM 應用層原理" data-link-desc="Prompt 技術光譜、RAG、tool use、agent、應用層協議、人機協作、multi-agent、workflow 編排、eval 設計：跨工具不變的概念地圖">Module 4: LLM application-layer principles</a>: the tool-independent principles behind RAG, tool use, agents, and workflow patterns</li>
<li><a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG principles</a></li>
<li><a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent architecture principles</a></li>
<li><a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 Benchmarking and evaluation methodology</a></li>
<li><a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge evaluation methods</a></li>
</ul>
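]]></content:encoded></item><item><title>Case Study: customer support agent, from task decomposition to eval</title><link>https://tarrragon.github.io/blog/llm/04-applications/hands-on/customer-support-case-study/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/04-applications/hands-on/customer-support-case-study/</guid><description>&lt;p>This case study ties the principle chapters earlier in Module 4 into one end-to-end design process and demonstrates &lt;strong>the order of design reflexes to follow when you face a real LLM application task&lt;/strong>. Each section names the principle chapter it draws on, so readers can see how the principles translate into concrete work.&lt;/p>
&lt;p>The walkthrough task: a PM assigns you to "build a customer support agent that handles user queries and, when needed, completes the operation automatically (for example, changing a shipping address)." This case study follows one high-frequency query type, the address change, through the full process.&lt;/p>
&lt;h2 id="本案例的設計反射">Design reflexes in this case study&lt;/h2>
&lt;p>The process has seven stages:&lt;/p>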
&lt;ol>
&lt;li>&lt;strong>Observe the human workflow&lt;/strong>: interview, then settle the task decomposition&lt;/li>
&lt;li>&lt;strong>Paradigm placement&lt;/strong>: which parts should be deterministic and which fuzzy&lt;/li>
&lt;li>&lt;strong>Workflow design&lt;/strong>: pick the matching form (LLM / tool / RAG / HITL) for each step&lt;/li>
&lt;li>&lt;strong>Protocol and autonomy decisions&lt;/strong>: single agent, multi-call, or multi-agent&lt;/li>
&lt;li>&lt;strong>Trace instrumentation&lt;/strong>: what information to record&lt;/li>
&lt;li>&lt;strong>Eval design&lt;/strong>: choose the coordinates first, then the tools&lt;/li>
&lt;li>&lt;strong>Iteration loop&lt;/strong>: error analysis → decide which layer to fix → watch the metrics converge&lt;/li>
&lt;/ol>
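&lt;p>First-time designers of LLM applications most often skip stages 1, 2, 5, and 6 and jump straight to stage 3 to start writing prompts. That path leads to iteration that never converges: the prompt has been revised 20 times and nobody can tell whether it got better. This case study emphasizes the order of the design reflexes, not prompt-writing technique.&lt;/p>
&lt;h2 id="階段-1觀察人類工作流">Stage 1: Observe the human workflow&lt;/h2>
&lt;p>The PM's task description is "handle user queries", but "queries" can cover a very wide range. The first reflex is to &lt;strong>sit next to the support agents and observe for two days&lt;/strong>, not to open the IDE.&lt;/p>
&lt;p>What you actually do:&lt;/p>
&lt;ul>
&lt;li>Tally the distribution of incoming query types (what share are refunds, address changes, order-status lookups, complaints, and open-ended questions).&lt;/li>
&lt;li>For each query type, watch the human resolution process (which steps, which systems to check, which policies to follow).&lt;/li>
&lt;li>Identify which query types are high volume + low complexity (most worth automating) and which are low volume + high complexity (poor automation ROI).&lt;/li>
&lt;li>Note the steps where humans get stuck and the steps that repeatedly require looking up the same data.&lt;/li>
&lt;/ul>
&lt;p>When the interviews end, you have a task decomposition map. This case study assumes a focus on one high-frequency query type, a user asking to change their address:&lt;/p>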





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">User: "I moved. Order number #12345, new address is ___"
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">1. Parse intent + extract info (order number, new address)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">2. Look up order status (shipped? not shipped? delivered?)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">3. Check policy (can this order status + user tier change the address?)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">4. If allowed: execute the address change (call the logistics / inventory API)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">5. If not allowed: explain why and offer alternatives
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">6. Draft the reply email and send it&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Principle referenced: this decomposition itself follows &lt;a href="https://tarrragon.github.io/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8 fuzzy engineering&lt;/a> (the &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/deterministic-vs-fuzzy/" data-link-title="Deterministic vs Fuzzy engineering" data-link-desc="LLM 軟體 vs 傳統軟體在資料 / 邏輯 / 行為一致性 / 實驗成本四維度的典範差異、決定哪段該包 guardrail">deterministic-vs-fuzzy&lt;/a> card): "decompose the task first, then judge whether each part should be deterministic or fuzzy".&lt;/p>
&lt;h2 id="階段-2典範定位">Stage 2: Paradigm placement&lt;/h2>
&lt;p>Place each step in a paradigm (deterministic / fuzzy):&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Step&lt;/th>
 &lt;th>Paradigm&lt;/th>
 &lt;th>Why&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>1. Parse intent + extract info&lt;/td>
 &lt;td>Fuzzy&lt;/td>
 &lt;td>Free-text input; needs an LLM to understand it&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>2. Look up order status&lt;/td>
 &lt;td>Deterministic&lt;/td>
 &lt;td>Structured query (give order_id, get status back)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>3. Check policy&lt;/td>
 &lt;td>Deterministic&lt;/td>
 &lt;td>Rules can be enumerated; policy as code&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>4. Execute the address change&lt;/td>
 &lt;td>Deterministic&lt;/td>
 &lt;td>API call with a schema and error codes&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>5. Explain / offer alternatives&lt;/td>
 &lt;td>Fuzzy&lt;/td>
 &lt;td>Must read naturally and be tailored to the situation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>6. Draft email + send&lt;/td>
 &lt;td>Fuzzy (drafting) + Deterministic (sending)&lt;/td>
 &lt;td>Writing the email is fuzzy; making the API call is deterministic&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
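&lt;p>The key judgment is &lt;strong>putting each side of the boundary where it belongs&lt;/strong>: rules and policies go in code; natural language and intent parsing go to the LLM.&lt;/p></description><content:encoded><![CDATA[<p>This case study ties the principle chapters earlier in Module 4 into one end-to-end design process and demonstrates <strong>the order of design reflexes to follow when you face a real LLM application task</strong>. Each section names the principle chapter it draws on, so readers can see how the principles translate into concrete work.</p>
<p>The walkthrough task: a PM assigns you to "build a customer support agent that handles user queries and, when needed, completes the operation automatically (for example, changing a shipping address)." This case study follows one high-frequency query type, the address change, through the full process.</p>
<h2 id="本案例的設計反射">Design reflexes in this case study</h2>
<p>The process has seven stages:</p>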
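<ol>
<li><strong>Observe the human workflow</strong>: interview, then settle the task decomposition</li>
<li><strong>Paradigm placement</strong>: which parts should be deterministic and which fuzzy</li>
<li><strong>Workflow design</strong>: pick the matching form (LLM / tool / RAG / HITL) for each step</li>
<li><strong>Protocol and autonomy decisions</strong>: single agent, multi-call, or multi-agent</li>
<li><strong>Trace instrumentation</strong>: what information to record</li>
<li><strong>Eval design</strong>: choose the coordinates first, then the tools</li>
<li><strong>Iteration loop</strong>: error analysis → decide which layer to fix → watch the metrics converge</li>
</ol>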
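<p>First-time designers of LLM applications most often skip stages 1, 2, 5, and 6 and jump straight to stage 3 to start writing prompts. That path leads to iteration that never converges: the prompt has been revised 20 times and nobody can tell whether it got better. This case study emphasizes the order of the design reflexes, not prompt-writing technique.</p>
<h2 id="階段-1觀察人類工作流">Stage 1: Observe the human workflow</h2>
<p>The PM's task description is "handle user queries", but "queries" can cover a very wide range. The first reflex is to <strong>sit next to the support agents and observe for two days</strong>, not to open the IDE.</p>
<p>What you actually do:</p>
<ul>
<li>Tally the distribution of incoming query types (what share are refunds, address changes, order-status lookups, complaints, and open-ended questions).</li>
<li>For each query type, watch the human resolution process (which steps, which systems to check, which policies to follow).</li>
<li>Identify which query types are high volume + low complexity (most worth automating) and which are low volume + high complexity (poor automation ROI).</li>
<li>Note the steps where humans get stuck and the steps that repeatedly require looking up the same data.</li>
</ul>
<p>When the interviews end, you have a task decomposition map. This case study assumes a focus on one high-frequency query type, a user asking to change their address:</p>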





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">User: "I moved. Order number #12345, new address is ___"
</span></span><span class="line"><span class="ln">2</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">1. Parse intent + extract info (order number, new address)
</span></span><span class="line"><span class="ln">4</span><span class="cl">2. Look up order status (shipped? not shipped? delivered?)
</span></span><span class="line"><span class="ln">5</span><span class="cl">3. Check policy (can this order status + user tier change the address?)
</span></span><span class="line"><span class="ln">6</span><span class="cl">4. If allowed: execute the address change (call the logistics / inventory API)
</span></span><span class="line"><span class="ln">7</span><span class="cl">5. If not allowed: explain why and offer alternatives
</span></span><span class="line"><span class="ln">8</span><span class="cl">6. Draft the reply email and send it</span></span></code></pre></div><p>Principle referenced: this decomposition itself follows <a href="/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8 fuzzy engineering</a> (the <a href="/blog/llm/knowledge-cards/deterministic-vs-fuzzy/" data-link-title="Deterministic vs Fuzzy engineering" data-link-desc="LLM 軟體 vs 傳統軟體在資料 / 邏輯 / 行為一致性 / 實驗成本四維度的典範差異、決定哪段該包 guardrail">deterministic-vs-fuzzy</a> card): "decompose the task first, then judge whether each part should be deterministic or fuzzy".</p>
<h2 id="階段-2典範定位">Stage 2: Paradigm placement</h2>
<p>Place each step in a paradigm (deterministic / fuzzy):</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Paradigm</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1. Parse intent + extract info</td>
          <td>Fuzzy</td>
          <td>Free-text input; needs an LLM to understand it</td>
      </tr>
      <tr>
          <td>2. Look up order status</td>
          <td>Deterministic</td>
          <td>Structured query (give order_id, get status back)</td>
      </tr>
      <tr>
          <td>3. Check policy</td>
          <td>Deterministic</td>
          <td>Rules can be enumerated; policy as code</td>
      </tr>
      <tr>
          <td>4. Execute the address change</td>
          <td>Deterministic</td>
          <td>API call with a schema and error codes</td>
      </tr>
      <tr>
          <td>5. Explain / offer alternatives</td>
          <td>Fuzzy</td>
          <td>Must read naturally and be tailored to the situation</td>
      </tr>
      <tr>
          <td>6. Draft email + send</td>
          <td>Fuzzy (drafting) + Deterministic (sending)</td>
          <td>Writing the email is fuzzy; making the API call is deterministic</td>
      </tr>
  </tbody>
</table>
<p>The key judgment is <strong>putting each side of the boundary where it belongs</strong>: rules and policies go in code; natural language and intent parsing go to the LLM.</p>
<ul>
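<li>Write the policy check as code (for example, "user tier + order status → can the address be changed" is a deterministic rule). Counterexample: stuffing the rules into a prompt and letting the LLM decide, which occasionally skips a rule or misjudges the tier.</li>
<li>Yes/no questions of the "is this allowed" kind go through rules. Counterexample: having the LLM make the call, which is hard to debug and nondeterministic.</li>
<li>A helpful reply is written by the LLM. Counterexample: hard-coding templates in code, which produces a stiff support-bot voice.</li>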
</ul>
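<p>A policy check written as code can be as small as a lookup table. The sketch below is illustrative: the status and tier names and the specific allow/deny values are assumptions, and real rules would come from the business policy.</p>

```python
# Hypothetical rule table: (order_status, user_tier) -> may the address change?
ADDRESS_CHANGE_RULES = {
    ("not_shipped", "standard"): True,
    ("not_shipped", "vip"): True,
    ("shipped", "standard"): False,
    ("shipped", "vip"): True,
    ("delivered", "standard"): False,
    ("delivered", "vip"): False,
}


def can_change_address(order_status: str, user_tier: str) -> bool:
    """Deterministic policy check; unknown combinations are denied by default."""
    return ADDRESS_CHANGE_RULES.get((order_status, user_tier), False)


print(can_change_address("shipped", "vip"))
print(can_change_address("shipped", "standard"))
```

<p>Because the same input always yields the same answer, a wrong decision points at a specific table entry rather than at a prompt.</p>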
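<p>The boundary most easily blurred is in step 6: drafting the email is fuzzy (it must read naturally and be tailored to the situation), while sending the email is deterministic (call the API, handle error codes). With the two separated, drafting can be retried or re-prompted without touching the sending logic, and sending returns structured errors that LLM hallucination cannot mask. Step 4, executing the address change, is similar: the tool call itself is deterministic, but the decision about whether to make the call belongs to the policy check in step 3.</p>
<p>Principle referenced: the "which parts deterministic, which parts fuzzy" decision framework in <a href="/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8 fuzzy engineering</a>, especially the section on the "misplaced boundary" anti-pattern.</p>
<h2 id="階段-3工作流設計">Stage 3: Workflow design</h2>
<p>Pick the matching tool for each step:</p>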
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Design choice</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1. Parse intent + extract info</td>
          <td>Vanilla LLM call + structured output (output constrained to a JSON schema: intent / order_id / new_address)</td>
      </tr>
      <tr>
          <td>2. Look up order status</td>
          <td>Tool call → internal order API</td>
      </tr>
      <tr>
          <td>3. Check policy</td>
          <td>Tool call → policy engine (purely deterministic; bypasses the LLM)</td>
      </tr>
      <tr>
          <td>4. Execute the address change</td>
          <td>Tool call → logistics API; pre-act HITL before the write (high risk + irreversible)</td>
      </tr>
      <tr>
          <td>5. Explain / offer alternatives</td>
          <td>LLM call + few-shot (retrieve "how similar situations were explained" from the case library, paired with RAG)</td>
      </tr>
      <tr>
          <td>6. Draft email + send</td>
          <td>LLM call writes the email as structured output with subject/body; sending goes through the email API</td>
      </tr>
  </tbody>
</table>
<p>Two steps where the choice is easy to get wrong, in more detail:</p>
<p><strong>Why step 1 needs structured output rather than prompt-only parsing</strong>: the extraction result feeds the deterministic tools in steps 2-4, and a wrong order_id breaks the whole flow. A prompt that merely says "please output JSON" is a weak guarantee; structured output / constrained decoding is a strong one (see <a href="/blog/llm/03-theoretical-foundations/constrained-decoding-internals/" data-link-title="3.10 Constrained decoding 內部：grammar mask 跟性能取捨" data-link-desc="Constrained decoding 的內部運作：token mask 計算、JSON schema / regex / CFG 三種 grammar、XGrammar pre-compile 機制、性能反而加速">3.10 constrained decoding internals</a>). Trade-off: a strict format can cost expressive flexibility, but this step needs reliability, not flexibility.</p>
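<p>Even with structured output enabled, a defensive parse before handing fields to the deterministic tools is cheap. The sketch below is a post-hoc validation step, not constrained decoding itself; the <code>AddressChangeRequest</code> class is a hypothetical name, and the three fields follow the schema listed in the table above.</p>

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressChangeRequest:
    intent: str
    order_id: str
    new_address: str


def parse_extraction(raw_json: str) -> AddressChangeRequest:
    """Parse step 1 output and reject anything that does not match the schema."""
    data = json.loads(raw_json)  # raises ValueError on malformed JSON
    expected = {"intent", "order_id", "new_address"}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"expected exactly the fields {sorted(expected)}")
    if not all(isinstance(data[key], str) and data[key] for key in expected):
        raise ValueError("every field must be a non-empty string")
    return AddressChangeRequest(**data)


request = parse_extraction(
    '{"intent": "change_address", "order_id": "12345", "new_address": "1 Example St"}'
)
print(request.order_id)
```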
<p><strong>Why step 5 pairs with RAG rather than few-shot alone</strong>: support cases span many situations (order shipped / delivered / VIP / regular user / per-country policies), and a fixed set of few-shot examples cannot cover them all. RAG retrieves the most similar explanation examples from the historical case library on the fly; this is retrieval-augmented prompting on the context axis of the <a href="/blog/llm/04-applications/prompt-techniques-landscape/" data-link-title="4.0 Prompt 技術光譜：手法分類、取捨、組合模式" data-link-desc="Zero-shot / few-shot、chain-of-thought、role / template、reflection 等 prompt 技術的分類與取捨、何時 stack 何時不要 stack、跟 fine-tune / RAG / chaining 的邊界">4.0 prompt technique spectrum</a>.</p>
<p>Principles referenced:</p>
<ul>
<li>Structured output in step 1 → <a href="/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 application-layer protocols</a></li>
<li>Tool design in steps 2-4 → <a href="/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">4.3 tool use</a></li>
<li>Pre-act HITL in step 4 → the pre-act section of <a href="/blog/llm/04-applications/human-ai-collaboration/" data-link-title="4.5 人機協作拓樸：何時人介入、怎麼介入" data-link-desc="Centaur vs Cyborg 工作模式、jagged frontier、HITL 三種觸發時機（pre-act / mid-stream / post-hoc）、確認流程的設計避免橡皮圖章化">4.5 human-AI collaboration topology</a>. By contrast, the Workera appeal flow in the lecture is post-hoc; this case study chooses pre-act because an address change is irreversible and has a large logistics impact, so it must be reviewed before execution</li>
<li>RAG in step 5 → <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG principles</a> + the context axis of the <a href="/blog/llm/04-applications/prompt-techniques-landscape/" data-link-title="4.0 Prompt 技術光譜：手法分類、取捨、組合模式" data-link-desc="Zero-shot / few-shot、chain-of-thought、role / template、reflection 等 prompt 技術的分類與取捨、何時 stack 何時不要 stack、跟 fine-tune / RAG / chaining 的邊界">4.0 prompt technique spectrum</a></li>
</ul>
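<h2 id="階段-4協議跟自主度決定">Stage 4: Protocol and autonomy decisions</h2>
<p>The control flow of this workflow is linear (1→2→3→4→5→6) apart from one conditional branch (the step 3 result decides between step 4 and step 5), and the order of steps is fixed. The judgment:</p>
<p><strong>Which structure to use</strong>:</p>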
<ul>
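<li><strong>Multi-agent does not fit</strong>: the step order is fixed and the roles barely differ, so orchestration overhead is pure cost.</li>
<li><strong>A single agent loop (the model decides the next step) does not fit</strong>: this case study assumes single-turn or short multi-turn interactions with a clear step order, so the agent does not need to decide for itself. If user interactions run many turns with no fixed turn count (the user adds information midway, changes their mind, or asks follow-ups), an agent loop is worth considering.</li>
<li><strong>Adopt a multi-call pipeline + router</strong>: write it as a deterministic pipeline with a router that branches after step 3.</li>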
</ul>
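<p>The pipeline + router structure can be sketched as ordinary control flow. In the sketch below every step is a stub: <code>extract</code> uses a regex in place of the LLM call, and <code>ORDER_STATUS</code> stands in for the order API, so the names and data are assumptions for illustration only.</p>

```python
import re

# Stand-in for the order API (step 2); the data is made up.
ORDER_STATUS = {"12345": "not_shipped", "67890": "shipped"}


def extract(message: str) -> dict:
    # Stub for the step 1 LLM call: pull "#digits" out of the message.
    match = re.search(r"#(\d+)", message)
    return {"order_id": match.group(1) if match else ""}


def lookup_status(order_id: str) -> str:
    return ORDER_STATUS.get(order_id, "unknown")


def policy_allows(status: str) -> bool:
    # Stub for the step 3 policy engine.
    return status == "not_shipped"


def handle(message: str) -> list[str]:
    """Run the fixed pipeline and return the names of the steps taken."""
    steps = ["extract"]
    fields = extract(message)
    steps.append("lookup_status")
    status = lookup_status(fields["order_id"])
    steps.append("check_policy")
    # Router: the policy result picks step 4 or step 5.
    if policy_allows(status):
        steps.append("update_address")
    else:
        steps.append("explain_alternatives")
    steps += ["draft_email", "send_email"]
    return steps


print(handle("I moved, please update order #12345"))
```

<p>The order of steps lives in code rather than in a model's decision, which is the point of preferring a pipeline over an agent loop for this task.</p>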
<p>Principles referenced:</p>
<ul>
<li>The "multi-call first, multi-agent only if that falls short" reflex in <a href="/blog/llm/04-applications/multi-agent-topology/" data-link-title="4.8 Multi-Agent 拓樸：flat / hierarchical / agent-as-tool" data-link-desc="從 multi-call workflow 走到 multi-agent system 的判讀、flat vs hierarchical 拓樸、agent-as-tool 的 MCP 視角、specialization 跟 orchestration overhead 的取捨">4.8 multi-agent topology</a></li>
<li>The pipeline + router pattern in <a href="/blog/llm/04-applications/workflow-patterns/" data-link-title="4.7 Workflow 編排模式" data-link-desc="Pipeline / router / parallel / reflection：多 LLM call 組合的四種基本模式與退化條件">4.7 workflow patterns</a></li>
<li>The "single-call first, agent only if that falls short" reflex in <a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 agent architecture</a></li>
</ul>
<p><strong>Autonomy</strong>:</p>
<ul>
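<li>Steps 1 (parse), 5 (explain), and 6 (draft email): full auto.</li>
<li>Steps 2 and 3 (look up order, check policy): full auto (read-only).</li>
<li>Step 4 (execute the address change): pre-act HITL (high risk + irreversible), with a diff shown so the user can reject.</li>
<li>Step 6 (send email): optional pre-act HITL (depends on company style; the conservative version reviews the email, the aggressive version sends automatically).</li>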
</ul>
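<p>A pre-act gate for step 4 can be sketched as a function that shows a diff and performs the write only on approval. The function names and the callback-style <code>approve</code> / <code>apply_change</code> parameters are assumptions for illustration; a real system would put a review UI behind <code>approve</code>.</p>

```python
def address_diff(old: str, new: str) -> str:
    """Human-readable diff shown to the reviewer before the write happens."""
    return f"- {old}\n+ {new}"


def change_address_with_approval(old, new, approve, apply_change):
    """Call apply_change(new) only if approve(diff) returns True."""
    if not approve(address_diff(old, new)):
        return "rejected"
    apply_change(new)
    return "applied"


# Usage with stand-in callbacks: auto-approve and record the write in a list.
written = []
result = change_address_with_approval(
    "Old St 1", "New Ave 2", approve=lambda diff: True, apply_change=written.append
)
print(result, written)
```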
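<h2 id="階段-5trace-instrumentation">Stage 5: Trace Instrumentation</h2>
<p>Before the workflow goes live, design what information to record. <strong>Both eval and debugging depend on traces; without traces, nothing later is possible</strong>.</p>
<p>Record for each step:</p>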
<table>
  <thead>
      <tr>
          <th>Field</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input (full)</td>
          <td>Needed to reproduce the case when debugging</td>
      </tr>
      <tr>
          <td>Output (full)</td>
          <td>Compare against expectations; build the regression set</td>
      </tr>
      <tr>
          <td>Latency</td>
          <td>Find bottlenecks</td>
      </tr>
      <tr>
          <td>Token cost</td>
          <td>Compute cost</td>
      </tr>
      <tr>
          <td>Step name + version</td>
          <td>Track which version of the prompt / tool ran</td>
      </tr>
      <tr>
          <td>Decision branch</td>
          <td>Which way the step 3 router went</td>
      </tr>
      <tr>
          <td>Error (if any)</td>
          <td>A structured error, not a string</td>
      </tr>
  </tbody>
</table>
<p>The whole trace should be tied to a single conversation_id so it can be joined later to view the full flow.</p>
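<p>The fields in the table map naturally onto one record per step. The sketch below is one possible shape, not a standard: <code>StepTrace</code> and <code>traced</code> are hypothetical names, and <code>token_cost</code> is hard-coded to 0 because the wrapped function here is not an LLM call.</p>

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StepTrace:
    conversation_id: str
    step_name: str
    step_version: str
    input_text: str
    output_text: str
    latency_s: float
    token_cost: int
    decision_branch: Optional[str] = None
    error: Optional[dict] = None  # structured, e.g. {"code": "TIMEOUT"}


def traced(conversation_id, step_name, step_version, func, payload):
    """Run func(payload) and return (result, StepTrace) for that call."""
    start = time.perf_counter()
    result = func(payload)
    elapsed = time.perf_counter() - start
    trace = StepTrace(
        conversation_id, step_name, step_version,
        repr(payload), repr(result), elapsed, token_cost=0,
    )
    return result, trace


result, trace = traced("c1", "lookup_status", "v1", str.upper, "abc")
print(json.dumps(asdict(trace)))
```

<p>Serializing each record as one JSON line keeps the later join on <code>conversation_id</code> straightforward.</p>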
<p>Principle referenced: <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a>.</p>
<h2 id="階段-6eval-設計">Stage 6: Eval design</h2>
<p>Choose the coordinates first, then the tools. Each eval need in this case study is placed using the <a href="/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13 three-axis coordinates</a>. The threshold numbers listed below (95%, 80%, ≥4, and so on) are illustrative; real numbers depend on the product baseline, user tolerance, and business cost, and are not universal standards.</p>
<h3 id="eval-1step-1-抽取準不準">Eval 1: Is step 1 extraction accurate?</h3>
<ul>
<li><strong>Axes</strong>: Objective (ground truth exists) + Component (tests a single step) + Quantitative (accuracy).</li>
<li><strong>Tool</strong>: write 100 labeled queries, run step 1, and measure extraction accuracy (the share where both order_id and new_address are correct).</li>
<li><strong>Threshold</strong>: do not ship below 95%.</li>
</ul>
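<p>The accuracy measure in Eval 1 counts a case as correct only when both fields match the label. A minimal sketch, assuming predictions and labels are lists of dicts with <code>order_id</code> and <code>new_address</code> keys; the 95% gate is the illustrative threshold from the list above.</p>

```python
def extraction_accuracy(predictions, labels):
    """Share of cases where both order_id and new_address match the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(
        p["order_id"] == g["order_id"] and p["new_address"] == g["new_address"]
        for p, g in zip(predictions, labels)
    )
    return correct / len(labels)


def passes_gate(accuracy, threshold=0.95):
    """Illustrative ship gate: at or above the threshold passes."""
    return accuracy >= threshold


# Made-up data: the last prediction has the wrong address.
labels = [
    {"order_id": "1", "new_address": "A St"},
    {"order_id": "2", "new_address": "B St"},
    {"order_id": "3", "new_address": "C St"},
    {"order_id": "4", "new_address": "D St"},
]
predictions = [dict(row) for row in labels]
predictions[3]["new_address"] = "X St"

accuracy = extraction_accuracy(predictions, labels)
print(accuracy, passes_gate(accuracy))
```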
<h3 id="eval-2step-2-4-tool-call-行為正確">Eval 2: Are step 2-4 tool calls correct?</h3>
<ul>
<li><strong>Axes</strong>: Objective + Component + Quantitative.</li>
<li><strong>Tool</strong>: mock the APIs, give steps 2-4 50 cases each, and check that tool-call arguments and return-value handling are correct.</li>
<li><strong>Threshold</strong>: 100% (this is deterministic behavior and should have no errors).</li>
</ul>
<h3 id="eval-3step-5-解釋品質">Eval 3: Step 5 explanation quality</h3>
<ul>
<li><strong>Axes</strong>: Subjective (no single right answer) + Component + Quantitative.</li>
<li><strong>Tool</strong>: LLM-as-judge with a rubric (clarity / helpfulness / tone), on a 1-5 scale, aggregated as an average.</li>
<li><strong>Threshold</strong>: average ≥ 4, with scores of 1-2 making up &lt; 5% of the total.</li>
</ul>
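<p>The two-part threshold in Eval 3 matters because a good average can hide a tail of bad answers. The sketch below applies both conditions to a list of judge scores; the function name and the sample scores are assumptions, and the thresholds are the illustrative ones from the list above.</p>

```python
def judge_gate(scores, min_mean=4.0, max_low_share=0.05):
    """Check mean score and the share of low (1-2) scores against thresholds."""
    mean_score = sum(scores) / len(scores)
    low_share = sum(1 for s in scores if s <= 2) / len(scores)
    return {
        "mean": mean_score,
        "low_share": low_share,
        "passed": mean_score >= min_mean and low_share < max_low_share,
    }


# Made-up scores: the mean is high, but one answer in twenty scored a 2.
scores = [5] * 15 + [4] * 4 + [2]
report = judge_gate(scores)
print(report)
```

<p>Here the mean is 4.65, yet the gate fails because the low-score share is exactly 5%, which is not below the limit.</p>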
<h3 id="eval-4step-6-email-品質">Eval 4: Step 6 email quality</h3>
<ul>
<li><strong>Axes</strong>: Subjective + Component + Quantitative, plus Qualitative human review.</li>
<li><strong>Tool</strong>: LLM judge scoring plus a weekly human review of 20 sampled emails, checking for hallucinated promises and fit with the company tone.</li>
<li><strong>Threshold</strong>: judge average ≥ 4, and no critical issue in human review.</li>
</ul>
<h3 id="eval-5e2e-success-rate">Eval 5: E2E success rate</h3>
<ul>
<li><strong>Axes</strong>: Objective + End-to-end + Quantitative.</li>
<li><strong>Tool</strong>: run 200 representative cases and measure the share that are "fully completed + no user appeal".</li>
<li><strong>Threshold</strong>: ≥ 85% baseline; alert if it drops below 80%.</li>
</ul>
<h3 id="eval-6user-滿意度">Eval 6: User satisfaction</h3>
<ul>
<li><strong>Axes</strong>: Subjective + End-to-end + Quantitative.</li>
<li><strong>Tool</strong>: show thumbs up/down plus an optional comment at the end of each interaction, tracked weekly.</li>
<li><strong>Threshold</strong>: thumbs up rate &gt; 80%, appeal rate &lt; 5%.</li>
</ul>
<h3 id="eval-7failure-mode-pattern持續做">Eval 7: Failure mode patterns (ongoing)</h3>
<ul>
<li><strong>Axes</strong>: Objective / Subjective + End-to-end + Qualitative.</li>
<li><strong>Tool</strong>: each week read 50 sampled traces plus 100% of failure / appeal traces, looking for emerging patterns.</li>
<li><strong>Output</strong>: bug tickets, prompt-change hypotheses, policy-hardening hypotheses.</li>
</ul>
<p>Principles referenced:</p>
<ul>
<li>Three-axis coordinates → <a href="/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13 eval design framework</a></li>
<li>LLM judge rubric → <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge</a></li>
<li>Connecting traces to evals → <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a></li>
</ul>
<h2 id="階段-7iteration-loop">Stage 7: Iteration Loop</h2>
<p>After launch, the job is not to wait for problems but to <strong>keep iterating</strong>. A typical iteration cycle:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Production trace + eval result
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">[Error analysis: find emerging patterns]
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">   Hypothesis: which layer has the problem?
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">   ├── Prompt layer → change the prompt → A/B test → watch the eval converge
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   ├── Tool layer   → change the tool / schema → run the component eval → converge
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">   ├── RAG layer    → change chunking / query rewriting → run retrieval recall → converge
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">   ├── Policy layer → change the deterministic rule → run the step 3 component eval → converge
</span></span><span class="line"><span class="ln">10</span><span class="cl">   └── Model layer  → swap the model → run the full eval set → converge
</span></span><span class="line"><span class="ln">11</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">12</span><span class="cl">[Ship the change to production]
</span></span><span class="line"><span class="ln">13</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">14</span><span class="cl">[Keep the frozen baseline; compare each new version against it so drift stays visible]</span></span></code></pre></div><p>The reflex for judging which layer to change:</p>
<table>
  <thead>
      <tr>
          <th>Failure signal</th>
          <th>Layer to change</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Step 1 extracts the wrong info</td>
          <td>Prompt / structured output schema</td>
      </tr>
      <tr>
          <td>Tool call arguments are wrong</td>
          <td>Tool description / few-shot in the prompt</td>
      </tr>
      <tr>
          <td>Tool crashes</td>
          <td>Tool implementation (not an LLM problem)</td>
      </tr>
      <tr>
          <td>RAG fails to retrieve relevant cases</td>
          <td>Chunking / embedding / query rewriting</td>
      </tr>
      <tr>
          <td>Policy judgment is wrong</td>
          <td>Deterministic rule (not an LLM problem)</td>
      </tr>
      <tr>
          <td>Email tone is off</td>
          <td>Prompt (role / few-shot)</td>
      </tr>
      <tr>
          <td>Email hallucinates promises</td>
          <td>Output validator (not just the prompt)</td>
      </tr>
      <tr>
          <td>Overall latency is too high</td>
          <td>Find the bottleneck in the trace; may need caching / parallelism</td>
      </tr>
  </tbody>
</table>
<p>Principles referenced:</p>
<ul>
<li>Failure diagnosis at the prompt and model layers → systematic vs random error in the <a href="/blog/llm/04-applications/prompt-techniques-landscape/" data-link-title="4.0 Prompt 技術光譜：手法分類、取捨、組合模式" data-link-desc="Zero-shot / few-shot、chain-of-thought、role / template、reflection 等 prompt 技術的分類與取捨、何時 stack 何時不要 stack、跟 fine-tune / RAG / chaining 的邊界">4.0 prompt technique spectrum</a></li>
<li>Judging the overall fuzzy / deterministic boundary → <a href="/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8</a></li>
</ul>
<h2 id="五個容易遺漏的設計反射">五個容易遺漏的設計反射</h2>
<p>實務上常常省略這五個反射動作、走進無收斂迭代：</p>
<h3 id="反射一先觀察再開-ide">反射一：先觀察、再開 IDE</h3>
<p>階段 1 的價值是把 task decomposition 跟真實人類工作流對齊。沒這層對齊、寫出來的 prompt 跟 tool 拆法跟 reality 偏離、三天後重做。階段 1 的兩天比階段 3 的兩週值得。對應反例：「我先寫個 prompt 試試」、跳過觀察直接寫 code。</p>
<h3 id="反射二policy-寫成-codellm-只解析意圖">反射二：Policy 寫成 code、LLM 只解析意圖</h3>
<p>判斷類規則（user tier、訂單狀態、可否操作）走 deterministic code、LLM 只負責「user 想做什麼」這層意圖抽取。這條邊界讓 debug 容易、規則更新不用 prompt iteration。對應反例：「LLM、請判斷這個訂單能不能改地址、規則如下：&hellip;」——把判斷塞進 prompt、debug 困難、規則漂移無從追蹤。對應 <a href="/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8</a> 的「邊界用錯」反模式。</p>
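<h3 id="反射三trace-是-day-1-設計">Reflex 3: Tracing is a day-1 design decision</h3>
<p>From day one, record input / output / latency / token / step name / decision branch / error in the trace, all bound to the same conversation_id. Both eval and debugging depend on traces; without them, nothing downstream is possible. Counter-example: "get the system running first, add tracing later". When a bug appears, debugging starts from zero and production behavior cannot be reconstructed after the fact.</p>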
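<p>One possible shape for such a trace record, as a sketch using only the standard library. The field names mirror the list above; <code>TraceSpan</code>, <code>traced</code>, and the in-memory <code>sink</code> list are hypothetical names, and a real system would more likely ship spans to a tracing backend.</p>

```python
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceSpan:
    conversation_id: str
    step_name: str
    input: str
    output: str = ""
    latency_ms: float = 0.0
    tokens: int = 0            # to be filled from the model provider's usage data
    decision_branch: str = ""
    error: str = ""
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def traced(sink, conversation_id, step_name, fn, payload):
    """Run one step and append its span to `sink`, even when the step raises."""
    span = TraceSpan(conversation_id, step_name, input=str(payload))
    start = time.perf_counter()
    try:
        result = fn(payload)
        span.output = str(result)
        return result
    except Exception as exc:
        span.error = repr(exc)
        raise
    finally:
        span.latency_ms = (time.perf_counter() - start) * 1000
        sink.append(asdict(span))
```

<p>Because the span is written in <code>finally</code>, failed steps leave a record too, which is exactly the case that matters when debugging.</p>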
<h3 id="反射四deterministic-行為用-deterministic-check">Reflex 4: Use deterministic checks for deterministic behavior</h3>
<p>Behavior with ground truth (whether the extraction is correct, whether the API parameters are correct, whether the output conforms to the JSON schema) is verified with Python functions, which are cheap and precise. LLM judges are reserved for subjective behavior without ground truth. Counter-example: using an LLM judge to test "is the step 1 extraction correct", which raises cost while being less precise than a deterministic check. See misselection 1 in <a href="/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13</a>.</p>
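<p>A sketch of what such a Python check might look like for an extraction step. The two-field schema and the function name <code>check_extraction</code> are assumptions made for illustration; the case study does not specify its schema.</p>

```python
import json

# Hypothetical schema for the extraction step: field name -> expected type.
REQUIRED_FIELDS = {"order_id": str, "new_address": str}


def check_extraction(raw_output: str, expected: dict) -> list[str]:
    """Return the list of failure reasons; an empty list means the check passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    failures = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            failures.append(f"field {name!r} missing or wrong type")
        elif data[name] != expected[name]:
            failures.append(f"field {name!r} does not match ground truth")
    return failures
```

<p>Returning reasons rather than a bare boolean keeps the check usable both as a pass/fail metric and as input to error analysis.</p>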
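<h3 id="反射五保留-frozen-baseline">Reflex 5: Keep a frozen baseline</h3>
<p><a href="/blog/llm/knowledge-cards/frozen-baseline/" data-link-title="Frozen baseline" data-link-desc="Eval 系統中固定特定 prompt &#43; model 當長期對照、讓行為漂移可見的標準作法">Frozen baseline</a> means taking one specific prompt + one specific model that has run in production for a while, freezing it, and comparing every new version against it so that drift is visible. Counter-example: comparing only against the previous version; after six months the accumulated drift is invisible and "has it improved overall?" cannot be answered.</p>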
<h2 id="跟其他章節的對應總表">Cross-reference table to other chapters</h2>
<p>Principle chapters referenced by each stage of this case study:</p>
<table>
  <thead>
      <tr>
          <th>Stage</th>
          <th>Referenced chapters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1. Observe the human workflow</td>
          <td><a href="/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8 fuzzy engineering</a></td>
      </tr>
      <tr>
          <td>2. Paradigm positioning</td>
          <td><a href="/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8 fuzzy engineering</a></td>
      </tr>
      <tr>
          <td>3. Workflow design (prompt / tool / RAG / HITL)</td>
          <td><a href="/blog/llm/04-applications/prompt-techniques-landscape/" data-link-title="4.0 Prompt 技術光譜：手法分類、取捨、組合模式" data-link-desc="Zero-shot / few-shot、chain-of-thought、role / template、reflection 等 prompt 技術的分類與取捨、何時 stack 何時不要 stack、跟 fine-tune / RAG / chaining 的邊界">4.0</a>、<a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1</a>、<a href="/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">4.3</a>、<a href="/blog/llm/04-applications/human-ai-collaboration/" data-link-title="4.5 人機協作拓樸：何時人介入、怎麼介入" data-link-desc="Centaur vs Cyborg 工作模式、jagged frontier、HITL 三種觸發時機（pre-act / mid-stream / post-hoc）、確認流程的設計避免橡皮圖章化">4.5</a></td>
      </tr>
      <tr>
          <td>4. Structural decision (multi-call vs agent vs multi-agent)</td>
          <td><a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4</a>、<a href="/blog/llm/04-applications/workflow-patterns/" data-link-title="4.7 Workflow 編排模式" data-link-desc="Pipeline / router / parallel / reflection：多 LLM call 組合的四種基本模式與退化條件">4.7</a>、<a href="/blog/llm/04-applications/multi-agent-topology/" data-link-title="4.8 Multi-Agent 拓樸：flat / hierarchical / agent-as-tool" data-link-desc="從 multi-call workflow 走到 multi-agent system 的判讀、flat vs hierarchical 拓樸、agent-as-tool 的 MCP 視角、specialization 跟 orchestration overhead 的取捨">4.8</a></td>
      </tr>
      <tr>
          <td>5. Trace instrumentation</td>
          <td><a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a></td>
      </tr>
      <tr>
          <td>6. Eval design</td>
          <td><a href="/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13 eval framework</a>、<a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14</a>、<a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21</a></td>
      </tr>
      <tr>
          <td>7. Iteration loop</td>
          <td><a href="/blog/llm/04-applications/prompt-techniques-landscape/" data-link-title="4.0 Prompt 技術光譜：手法分類、取捨、組合模式" data-link-desc="Zero-shot / few-shot、chain-of-thought、role / template、reflection 等 prompt 技術的分類與取捨、何時 stack 何時不要 stack、跟 fine-tune / RAG / chaining 的邊界">4.0 prompt spectrum</a>, the systematic vs random error section</td>
      </tr>
  </tbody>
</table>
<h2 id="下一步">Next steps</h2>
<p>Back to the <a href="/blog/llm/04-applications/" data-link-title="模組四：LLM 應用層原理" data-link-desc="Prompt 技術光譜、RAG、tool use、agent、應用層協議、人機協作、multi-agent、workflow 編排、eval 設計：跨工具不變的概念地圖">Module 4 home page</a>, or return to the <a href="/blog/llm/04-applications/hands-on/" data-link-title="4.x Hands-on：端到端案例" data-link-desc="把模組四的所有原理串成具體 case study：從 task decomposition、workflow 設計、eval 設計到 iteration loop">hands-on index</a>.</p>
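]]></content:encoded></item><item><title>4.13 Eval Design Coordinate System: Three Axes, Eight Quadrants, When to Test What</title><link>https://tarrragon.github.io/blog/llm/04-applications/eval-design-framework/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/04-applications/eval-design-framework/</guid><description>&lt;p>Everyone asks how to test LLM applications, but the answer is often a &lt;strong>tool-level&lt;/strong> one such as "run some benchmark" or "get an &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM judge&lt;/a>". In practice the tool comes last; the design priority is to &lt;strong>choose which axis to test first, then choose the tool&lt;/strong>. If the axis is wrong, even the best tool yields no useful signal. Using a subjective tool on objective behavior (for example, having an LLM judge decide whether an amount was calculated correctly), or an end-to-end tool on a component bug (for example, watching user satisfaction when the real problem is the retrieval pipeline dropping chunks), are both common axis misselections.&lt;/p>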
&lt;p>This chapter lays out a coordinate system for eval design: three binary axes, eight quadrants, which tools map to each quadrant, and how to recognize the signals of a misselected axis. This framing is meta, not a concrete eval method; the concrete methods are in &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 benchmarking&lt;/a> and &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge&lt;/a>.&lt;/p>
&lt;h2 id="本章目標">本章目標&lt;/h2>
&lt;p>讀完本章後你能：&lt;/p>
&lt;ol>
&lt;li>把任何 eval 需求放到三軸座標、定位象限。&lt;/li>
&lt;li>對每個象限選對應的 eval 工具。&lt;/li>
&lt;li>識別軸誤選的訊號、避免「工具對、軸錯」的常見坑。&lt;/li>
&lt;li>規劃 eval 路線：初期該做哪幾個象限、規模化後再補哪些。&lt;/li>
&lt;li>把 eval 設計跟 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 benchmarking&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 tracing&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge&lt;/a> 串成完整 pipeline。&lt;/li>
&lt;/ol>
&lt;h2 id="三軸">三軸&lt;/h2>
&lt;p>Eval 設計的三個正交軸：&lt;/p>
&lt;h3 id="軸-1objective--subjective">軸 1：Objective ↔ Subjective&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Objective&lt;/strong>：有明確 ground truth、檢驗可以寫成 deterministic check（金額對不對、SQL 跑得通不通、JSON schema 合不合法）。&lt;/li>
&lt;li>&lt;strong>Subjective&lt;/strong>：沒有單一正確答案、需要評分或比較（語氣好不好、解釋清楚不清楚、推薦的 trip 合不合用戶）。&lt;/li>
&lt;/ul>
&lt;p>判讀訊號：「能不能用 Python 函數判定對錯」、能 → objective、不能 → subjective。&lt;/p>
&lt;h3 id="軸-2component--end-to-end">軸 2：Component ↔ End-to-End&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Component&lt;/strong>：測單一元件、孤立評估（retrieval 拿對 chunk 沒、tool call 參數對沒、prompt 抽出正確 entity 沒）。&lt;/li>
&lt;li>&lt;strong>End-to-End&lt;/strong>：測完整流程、user 視角結果（user 問題有沒有被解決、訂單有沒有完成、conversation 滿意度）。&lt;/li>
&lt;/ul>
&lt;p>判讀訊號：「失敗時你想知道是哪一段壞掉」→ component；「你只在乎最終體驗」→ end-to-end。&lt;/p>
&lt;h3 id="軸-3quantitative--qualitative">軸 3：Quantitative ↔ Qualitative&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Quantitative&lt;/strong>：產出數字（accuracy / latency / cost / pass rate）、可以追蹤、可以比較、可以 alert。&lt;/li>
&lt;li>&lt;strong>Qualitative&lt;/strong>：產出觀察（error pattern、user 抱怨、reviewer 註記）、無法直接 aggregate、但能引導 hypothesis。&lt;/li>
&lt;/ul>
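&lt;p>Signal: "can the results be averaged?" → quantitative; "you only know after reading them" → qualitative.&lt;/p></description><content:encoded><![CDATA[<p>Everyone asks how to test LLM applications, but the answer is often a <strong>tool-level</strong> one such as "run some benchmark" or "get an <a href="/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM judge</a>". In practice the tool comes last; the design priority is to <strong>choose which axis to test first, then choose the tool</strong>. If the axis is wrong, even the best tool yields no useful signal. Using a subjective tool on objective behavior (for example, having an LLM judge decide whether an amount was calculated correctly), or an end-to-end tool on a component bug (for example, watching user satisfaction when the real problem is the retrieval pipeline dropping chunks), are both common axis misselections.</p>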
<p>This chapter lays out a coordinate system for eval design: three binary axes, eight quadrants, which tools map to each quadrant, and how to recognize the signals of a misselected axis. This framing is meta, not a concrete eval method; the concrete methods are in <a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 benchmarking</a> and <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge</a>.</p>
<h2 id="本章目標">Chapter goals</h2>
<p>After reading this chapter you will be able to:</p>
<ol>
<li>Place any eval requirement on the three-axis coordinates and locate its quadrant.</li>
<li>Choose the matching eval tool for each quadrant.</li>
<li>Recognize the signals of axis misselection and avoid the common "right tool, wrong axis" trap.</li>
<li>Plan an eval roadmap: which quadrants to cover early and which to add after scaling.</li>
<li>Connect eval design with <a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 benchmarking</a> / <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 tracing</a> / <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge</a> into a complete pipeline.</li>
</ol>
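<h2 id="三軸">Three axes</h2>
<p>The three orthogonal axes of eval design:</p>
<h3 id="軸-1objective--subjective">Axis 1: Objective ↔ Subjective</h3>
<ul>
<li><strong>Objective</strong>: there is a clear ground truth, and verification can be written as a deterministic check (is the amount right, does the SQL run, is the JSON schema-valid).</li>
<li><strong>Subjective</strong>: there is no single correct answer; scoring or comparison is needed (is the tone good, is the explanation clear, does the recommended trip suit the user).</li>
</ul>
<p>Signal: "can a Python function decide right from wrong?" If yes → objective; if no → subjective.</p>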
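<h3 id="軸-2component--end-to-end">Axis 2: Component ↔ End-to-End</h3>
<ul>
<li><strong>Component</strong>: tests a single component in isolation (did retrieval fetch the right chunks, are the tool-call arguments correct, did the prompt extract the right entity).</li>
<li><strong>End-to-End</strong>: tests the full flow and the result from the user's point of view (was the user's problem solved, was the order completed, conversation satisfaction).</li>
</ul>
<p>Signal: "when it fails, you want to know which stage broke" → component; "you only care about the final experience" → end-to-end.</p>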
<h3 id="軸-3quantitative--qualitative">Axis 3: Quantitative ↔ Qualitative</h3>
<ul>
<li><strong>Quantitative</strong>: produces numbers (accuracy / latency / cost / pass rate) that can be tracked, compared, and alerted on.</li>
<li><strong>Qualitative</strong>: produces observations (error patterns, user complaints, reviewer notes) that cannot be aggregated directly but can guide hypotheses.</li>
</ul>
<p>Signal: "can the results be averaged?" → quantitative; "you only know after reading them" → qualitative.</p>
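<h3 id="三軸的正交性">Orthogonality of the three axes</h3>
<p>The three axes are orthogonal, not synonyms:</p>
<ul>
<li>"Objective + component + quantitative" is typically a unit test (does the function return the right value).</li>
<li>"Subjective + end-to-end + qualitative" is typically a user interview (overall user satisfaction).</li>
<li>The quadrants in between are various mixes, each with its own tools.</li>
</ul>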
<h2 id="八象限">Eight quadrants</h2>
<p>3 binary axes = 8 quadrants. Common tools for each quadrant:</p>
<table>
  <thead>
      <tr>
          <th>Quadrant</th>
          <th>Typical question</th>
          <th>Tools</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Objective + Component + Quantitative</td>
          <td>Is this function / tool / RAG component correct?</td>
          <td>Unit test, deterministic check, <a href="/blog/llm/knowledge-cards/retrieval-recall/" data-link-title="Retrieval Recall" data-link-desc="衡量 RAG 檢索是否把應該命中的文件或 chunk 放進 top-k 結果，是 component-level eval 的核心指標">retrieval recall@k</a></td>
      </tr>
      <tr>
          <td>Objective + Component + Qualitative</td>
          <td>What is this component's failure pattern?</td>
          <td>Error log analysis, trace inspection</td>
      </tr>
      <tr>
          <td>Objective + End-to-end + Quantitative</td>
          <td>Success rate / latency of the whole system</td>
          <td>E2E test, success metric, latency p95</td>
      </tr>
      <tr>
          <td>Objective + End-to-end + Qualitative</td>
          <td>What are the system's catastrophic failure cases?</td>
          <td>Production incident review, reading sampled traces</td>
      </tr>
      <tr>
          <td>Subjective + Component + Quantitative</td>
          <td>Score for this step's output</td>
          <td>LLM-as-judge pairwise / rubric, human rating</td>
      </tr>
      <tr>
          <td>Subjective + Component + Qualitative</td>
          <td>What about this step's output bothers people?</td>
          <td>Human review, error analysis with comments</td>
      </tr>
      <tr>
          <td>Subjective + End-to-end + Quantitative</td>
          <td>Overall user NPS / satisfaction score</td>
          <td>CSAT, thumbs up/down, appeal rate</td>
      </tr>
      <tr>
          <td>Subjective + End-to-end + Qualitative</td>
          <td>What users want and where they are not being served</td>
          <td>User interviews, open-ended surveys, social listening</td>
      </tr>
  </tbody>
</table>
<p>The point is not to do all eight, but to see which quadrant your question falls in and use the matching tool.</p>
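<p>The lookup can be sketched as a small table plus the three reading signals from the axes section. The tool strings are copied from the table above; the function name <code>locate</code> and its boolean parameters are hypothetical.</p>

```python
from itertools import product

AXES = (
    ("objective", "subjective"),
    ("component", "end-to-end"),
    ("quantitative", "qualitative"),
)

# Quadrant -> tools, copied from the eight-quadrant table.
TOOLS = {
    ("objective", "component", "quantitative"): "unit test, deterministic check, retrieval recall@k",
    ("objective", "component", "qualitative"): "error log analysis, trace inspection",
    ("objective", "end-to-end", "quantitative"): "E2E test, success metric, latency p95",
    ("objective", "end-to-end", "qualitative"): "production incident review, reading sampled traces",
    ("subjective", "component", "quantitative"): "LLM-as-judge pairwise / rubric, human rating",
    ("subjective", "component", "qualitative"): "human review, error analysis with comments",
    ("subjective", "end-to-end", "quantitative"): "CSAT, thumbs up/down, appeal rate",
    ("subjective", "end-to-end", "qualitative"): "user interviews, open-ended surveys, social listening",
}

# Three binary axes give exactly eight quadrants, and the table covers all of them.
assert set(TOOLS) == set(product(*AXES))


def locate(has_ground_truth: bool, need_failing_stage: bool, can_average: bool) -> tuple:
    """Map the three reading signals to a quadrant key."""
    return (
        "objective" if has_ground_truth else "subjective",
        "component" if need_failing_stage else "end-to-end",
        "quantitative" if can_average else "qualitative",
    )
```

<p>For example, <code>TOOLS[locate(True, True, True)]</code> returns the unit test / deterministic check / recall@k entry.</p>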
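<p>The two quadrants most often misjudged, expanded:</p>
<p><strong>Subjective + Component + Quantitative</strong> (scoring a step's output): the tool column lists "LLM-as-judge pairwise / rubric, human rating", but <strong>pairwise is the first choice, not rubric</strong>. Pairwise comparison keeps the judge's bias more controllable (with two answers side by side, deciding which is better is easier), whereas absolute rubric scores are easily skewed by biases such as verbosity bias. Position bias, the sensitivity to which answer is shown first, applies to pairwise setups and needs its own mitigation. Keep rubrics for scenarios that need an absolute score rather than a relative ranking (such as tracking drift in absolute quality). See the bias mitigation section of <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge</a>.</p>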
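<p>Pairwise judging is sensitive to which answer is shown first. One commonly used mitigation, offered here as an assumption and not necessarily what 4.21 prescribes, is to ask the judge twice with the order swapped and accept only verdicts that survive the swap. In this sketch <code>judge</code> is a hypothetical callable wrapping whatever LLM call is in use; it returns <code>"first"</code> or <code>"second"</code>.</p>

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(question, a, b)    # a is shown first
    backward = judge(question, b, a)   # b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with position, so treat it as inconclusive
```

<p>The price is two judge calls per comparison; the share of <code>"tie"</code> results also gives a rough read on how position-sensitive the judge is.</p>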
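<p><strong>Objective + Component + Quantitative</strong> (is the component correct): this quadrant is the easiest and cheapest to cover. It combines deterministic checks with component tests, run in CI, plus on-the-fly verification of sampled production traces. If a production AI system leaves this quadrant uncovered, bugs are only ever discovered through user complaints, and debugging and incident review become expensive. Counter-example: handing this quadrant's tests to an LLM judge (see misselection 1).</p>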
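<p>As an illustration of how small a check in this quadrant can be, here is a sketch of recall@k for a retrieval component. It assumes chunks are identified by string IDs and that a labeled set of relevant IDs exists for each query; both are assumptions about the eval data, not something this chapter specifies.</p>

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k.intersection(relevant)) / len(relevant)
```

<p>Run over a fixed set of labeled queries, the mean of this value is a number that can sit in CI and be tracked per release.</p>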
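<h2 id="軸誤選的訊號">Signals of axis misselection</h2>
<p>When the axis is wrong, the tool produces signals that look reasonable but are useless. Three common misselections:</p>
<h3 id="誤選一用-subjective-工具測-objective-行為">Misselection 1: Using a subjective tool on objective behavior</h3>
<p>Example: to check whether an order amount was calculated correctly, an LLM judge is asked "is this amount reasonable?"</p>
<ul>
<li><strong>Problem</strong>: amount calculation has ground truth and should use a deterministic check (<code>assert order.total == expected</code>). An LLM judge's notion of "reasonable" is biased: it lets obvious errors through and nitpicks answers that are correct but unintuitive.</li>
<li><strong>Signal</strong>: you find yourself writing a judge prompt that describes "what a reasonable amount looks like" when the behavior actually has an objective standard.</li>
<li><strong>Fix</strong>: turn the judge prompt into a deterministic check.</li>
</ul>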
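<h3 id="誤選二用-end-to-end-工具測-component-bug">Misselection 2: Using an end-to-end tool on a component bug</h3>
<p>Example: the system's success rate drops from 90% to 80%; after a week of investigation the cause turns out to be retrieval dropping chunks.</p>
<ul>
<li><strong>Problem</strong>: an E2E metric tells you that something is wrong, not where. Without component evals, debugging means working backwards from traces, which is slow.</li>
<li><strong>Signal</strong>: post-incident root cause analysis regularly takes more than a day, and what it finds is something a component eval would have caught within seconds.</li>
<li><strong>Fix</strong>: add component evals for critical components (retrieval, tool invocation, the parse stage) and run them continuously in production.</li>
</ul>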
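<p>A sketch of what "knowing where" looks like in practice: aggregate component check results by stage, so that a drop in the end-to-end number can be attributed. The <code>(stage_name, passed)</code> record shape is an assumption about how check results are stored.</p>

```python
from collections import defaultdict


def pass_rate_by_stage(records):
    """records: iterable of (stage_name, passed) pairs from component checks."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stage, passed in records:
        totals[stage] += 1
        passes[stage] += int(passed)
    return {stage: passes[stage] / totals[stage] for stage in totals}


def weakest_stage(records):
    """Name of the stage with the lowest pass rate."""
    rates = pass_rate_by_stage(records)
    return min(rates, key=rates.get)
```

<p>With per-stage rates available, a result such as "retrieval is at 0.5 while parse is at 1.0" points at the failing component directly, without reading traces backwards.</p>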
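<h3 id="誤選三用-quantitative-工具找-qualitative-訊號">Misselection 3: Using a quantitative tool to find a qualitative signal</h3>
<p>Example: user satisfaction drops from 4.2 to 4.0; the team stares at the number for a week without learning what happened.</p>
<ul>
<li><strong>Problem</strong>: a quantitative metric only tells you that something changed, not why. Qualitative signals (the content of user complaints, sampled conversations) are what surface hypotheses.</li>
<li><strong>Signal</strong>: the team has watched the dashboard for a long time, yet nobody has read the actual user feedback.</li>
<li><strong>Fix</strong>: use quantitative metrics as the trigger (metric drift) and qualitative review as the follow-up (read samples, find patterns).</li>
</ul>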
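<p>The trigger-then-follow-up pattern can be sketched as a function that turns a metric drop into a reading list. The threshold, the sample size, and the name <code>review_queue</code> are illustrative assumptions; the seeded generator only makes the sample reproducible.</p>

```python
import random


def review_queue(current: float, baseline: float, max_drop: float,
                 conversations: list[str], sample_size: int, seed: int = 0) -> list[str]:
    """Return conversations to read if the metric dropped by at least max_drop, else []."""
    if baseline - current >= max_drop:
        rng = random.Random(seed)
        return rng.sample(conversations, min(sample_size, len(conversations)))
    return []
```

<p>In the 4.2 to 4.0 example, a <code>max_drop</code> of 0.15 would fire and hand the team a concrete set of conversations to read instead of a chart to stare at.</p>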
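<h2 id="eval-演化路徑">Eval evolution path</h2>
<p>Which quadrants to prioritize depends on the stage of the LLM application.</p>
<h3 id="階段-0mvp沒任何-eval">Stage 0: MVP (no eval at all)</h3>
<p>Problem: "it just needs to demo", and correctness is verified entirely by hand.</p>
<ul>
<li><strong>First to add</strong>: Objective + End-to-end + Quantitative. Run at least 10 representative cases; being able to see a "does it run" rate is enough.</li>
<li><strong>Too early for</strong>: subjective eval, or anything that needs a judge / human rating. At the MVP stage, get the system running stably first.</li>
</ul>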
<h3 id="階段-1有-user-在用">Stage 1: Users are using it</h3>
<p>Problem: production has occasional bugs and occasional user complaints, and it is unclear which are systematic and which are random.</p>
<ul>
<li><strong>Second to add</strong>: Objective + End-to-end + Qualitative. Read incidents, read sampled traces, find patterns.</li>
<li><strong>Third to add</strong>: Objective + Component + Quantitative. Add component-level evals for critical components (retrieval / tool call / parse) and run them in production.</li>
<li><strong>Do not do</strong>: a full subjective rubric. Fix the objective failures first.</li>
</ul>
<h3 id="階段-2要持續優化品質">Stage 2: Continuous quality improvement</h3>
<p>Problem: the objective part is stable, and user complaints are mostly at the subjective layer (tone, helpfulness, whether recommendations fit).</p>
<ul>
<li><strong>Fourth to add</strong>: Subjective + Component + Quantitative. Use LLM-as-judge to score each step and run A/B tests to compare prompt changes.</li>
<li><strong>Fifth to add</strong>: Subjective + End-to-end + Quantitative. CSAT, thumbs up/down, appeal rate.</li>
<li><strong>Must do</strong>: subjective eval and qualitative review have to go together; quantitative gives the direction, qualitative gives the hypothesis for the fix.</li>
</ul>
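<h3 id="階段-3規模化跨團隊">Stage 3: Scaling across teams</h3>
<p>Problem: multiple products / teams share the same LLM infra, and eval needs to be cross-cutting.</p>
<ul>
<li><strong>To do</strong>: standardize the eval pipeline, cover quadrants 1-7, and make qualitative review a ritual (weekly incident review, monthly reading of sampled traces).</li>
<li><strong>The point is not "have everything" but "every quadrant has a clear owner"</strong>.</li>
</ul>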
<h2 id="eval-跟-trace-的閉環">The eval and trace loop</h2>
<p>Eval is not isolated; together with <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a> it forms a closed loop:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">[Production traffic]
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">   [LLM trace]  ← every call / agent step / tool is recorded
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">   ├── Real-time monitoring (latency / cost / error rate)
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">   ├── Sampled into the eval set (human labels + LLM judge)
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   └── Failed cases go into the regression set (so a prompt change cannot re-break the same case)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">   [Eval pipeline]
</span></span><span class="line"><span class="ln">10</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln">11</span><span class="cl">   ├── Component eval (per-component accuracy)
</span></span><span class="line"><span class="ln">12</span><span class="cl">   ├── E2E eval (whole-system success rate)
</span></span><span class="line"><span class="ln">13</span><span class="cl">   └── Subjective eval (judge / human rating)
</span></span><span class="line"><span class="ln">14</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln">15</span><span class="cl">   [Insights]
</span></span><span class="line"><span class="ln">16</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln">17</span><span class="cl">   ├── Quantitative: metric drift alert
</span></span><span class="line"><span class="ln">18</span><span class="cl">   └── Qualitative: error pattern → hypothesis → fix prompt / tool / RAG
</span></span><span class="line"><span class="ln">19</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln">20</span><span class="cl">   [Changes ship to production]
</span></span><span class="line"><span class="ln">21</span><span class="cl">       ↓
</span></span><span class="line"><span class="ln">22</span><span class="cl">   [Back to production traffic; watch the metrics converge]</span></span></code></pre></div><p>Production traces are not just a debugging tool; they are the living source of the eval set. For design details of the trace + eval loop, see <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20</a>.</p>
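<p>The fan-out under the trace node of the diagram can be sketched as a routing function. It assumes each trace is a dict with a boolean <code>failed</code> key, a hypothetical shape; the monitoring branch is omitted because it usually lives in the observability stack rather than in eval code.</p>

```python
import random


def route_traces(traces, sample_rate: float, seed: int = 0):
    """Split traces into (eval_candidates, regression_cases)."""
    rng = random.Random(seed)
    eval_candidates, regression_cases = [], []
    for trace in traces:
        if trace.get("failed"):
            regression_cases.append(trace)   # every failure is kept
        elif sample_rate > rng.random():
            eval_candidates.append(trace)    # successes are sampled
    return eval_candidates, regression_cases
```

<p>Keeping every failure but only a sample of successes is one design choice among several: it stops the regression set from missing cases while keeping the labeling workload bounded.</p>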
<h2 id="跟其他-eval-章節的分工">Division of labor with other eval chapters</h2>
<table>
  <thead>
      <tr>
          <th>Chapter</th>
          <th>Focus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/blog/llm/04-applications/eval-design-framework/" data-link-title="4.13 Eval 設計座標系：三軸、八象限、何時測什麼" data-link-desc="Eval 設計三軸（objective↔subjective / component↔end-to-end / quantitative↔qualitative）、八象限的對應 eval 工具、軸選錯的訊號、跟 benchmarking / LLM-as-judge / tracing 的關係">4.13 this chapter</a></td>
          <td><strong>Meta</strong>: the design coordinate system of choosing the axis first, then the tool</td>
      </tr>
      <tr>
          <td><a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 Benchmarking</a></td>
          <td>Methodology for concrete benchmarks and in-house eval sets</td>
      </tr>
      <tr>
          <td><a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a></td>
          <td>How traces feed eval; production observability</td>
      </tr>
      <tr>
          <td><a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge</a></td>
          <td>The core tool for subjective eval; rubric / pairwise / bias mitigation</td>
      </tr>
  </tbody>
</table>
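<p>Reading suggestion: read this chapter first to build the coordinate system, then branch to the chapter that matches your current pain point. Subjective eval pain → 4.21; in-house benchmark design → 4.14; production observability → 4.20.</p>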
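<h2 id="有效-eval-系統的四個設計條件">Four design conditions for an effective eval system</h2>
<p>For an eval system to keep producing useful signals, it must satisfy four conditions. Each condition corresponds to a common degradation mode, so the list doubles as a checklist.</p>
<h3 id="條件一judge-只用在-subjective-軸">Condition 1: Use judges only on the subjective axis</h3>
<p>Reserve LLM-as-judge for subjective behavior without ground truth (tone, helpfulness, clarity of explanation); use deterministic checks for objective behavior (amounts, JSON schema, API parameters). A judge costs 1-2 orders of magnitude more than a deterministic check and is less precise, so it is clearly a bad trade.</p>
<p>Counter-example: "make every eval an LLM judge". The judge is misused on objective behavior, cost rises sharply, and precision drops.</p>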
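<h3 id="條件二每個-metric-有-ownerthresholdaction">Condition 2: Every metric has an owner, a threshold, and an action</h3>
<p>Every production metric must make explicit who watches it (owner), what value triggers an alert (threshold), and what happens after the alert (action). A metric missing any of the three is noise.</p>
<p>Counter-example: a dashboard with 50 metric charts that nobody reviews regularly, so bugs are still discovered only through user complaints.</p>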
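<p>One way to make the three items impossible to omit is to require them in the metric definition itself. This is a sketch; <code>MetricRule</code>, its fields, and the example metric names used with it are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    name: str
    owner: str              # who looks at it
    threshold: float        # value that triggers the alert
    higher_is_better: bool  # which direction counts as bad
    action: str             # what to do once alerted


def breached(rule: MetricRule, value: float) -> bool:
    if rule.higher_is_better:
        return rule.threshold > value   # value fell below the floor
    return value > rule.threshold       # value rose above the ceiling


def alerts(rules, values):
    """Return (owner, metric, action) for every breached rule that has a reading."""
    return [
        (rule.owner, rule.name, rule.action)
        for rule in rules
        if rule.name in values and breached(rule, values[rule.name])
    ]
```

<p>Because the dataclass has no defaults, a metric cannot be registered without naming its owner, threshold, and action.</p>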
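<h3 id="條件三eval-set-跟-production-traffic-同步">Condition 3: Keep the eval set in sync with production traffic</h3>
<p>Continuously sample production traces into the eval set, and review every quarter whether the eval set still matches the traffic distribution.</p>
<p>Counter-example: the eval set was defined two years ago, production traffic has drifted far away, and passing the eval no longer means users are satisfied.</p>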
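<p>The quarterly review can be given a number. As one possible measure (an assumption, not the chapter's prescription), the sketch below computes the total variation distance between the category mix of the eval set and that of recent traffic, given a category label such as intent for each case.</p>

```python
from collections import Counter


def category_shares(labels):
    """Share of each category label in a non-empty list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """0.0 means identical category mix; 1.0 means no overlap at all."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

<p>A value creeping upward from one quarter to the next suggests the eval set needs fresh samples; where to draw the line is a team decision.</p>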
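<h3 id="條件四保留-frozen-baseline">Condition 4: Keep a frozen baseline</h3>
<p><a href="/blog/llm/knowledge-cards/frozen-baseline/" data-link-title="Frozen baseline" data-link-desc="Eval 系統中固定特定 prompt &#43; model 當長期對照、讓行為漂移可見的標準作法">Frozen baseline</a> means taking one specific prompt + one specific model that has run in production for a while, freezing it, comparing every new version against it, and refreshing it periodically with the refresh date recorded. Drift can only be managed once it is visible.</p>
<p>Counter-example: every A/B test compares against the "latest version", so long-term accumulated drift is invisible and "has it improved overall?" cannot be answered.</p>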
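<p>A small sketch of why the comparison target matters. The pass counts are invented for illustration and assume every version was run on the same 1000-case eval set.</p>

```python
def compare(pass_counts: dict, candidate: str, previous: str, frozen: str) -> dict:
    """Pass-count deltas of the candidate against the previous and the frozen version."""
    return {
        "vs_previous": pass_counts[candidate] - pass_counts[previous],
        "vs_frozen": pass_counts[candidate] - pass_counts[frozen],
    }


# Hypothetical pass counts out of 1000 eval cases, one entry per released version.
pass_counts = {"v1-frozen": 900, "v2": 893, "v3": 887, "v4": 880}
deltas = compare(pass_counts, candidate="v4", previous="v3", frozen="v1-frozen")
```

<p>Each release loses only six or seven cases against its predecessor, which is easy to dismiss as noise; against the frozen baseline the same history adds up to twenty lost cases.</p>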
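<h2 id="何時過時--何時不過時">What will and will not become outdated</h2>
<p><strong>What will not become outdated</strong>:</p>
<ul>
<li>The three-axis coordinates (objective / component / quantitative as three binary axes).</li>
<li>The structural classification of tools across the eight quadrants.</li>
<li>The signals and fixes for the three kinds of axis misselection.</li>
<li>The eval evolution path (MVP → users → optimization → scale).</li>
<li>The design of the eval / trace loop.</li>
<li>The four design conditions for an effective eval system.</li>
</ul>
<p><strong>What will change</strong>:</p>
<ul>
<li>Specific eval frameworks (OpenAI Evals, Promptfoo, Braintrust, Langfuse, and others will keep evolving).</li>
<li>Concrete prompt templates and bias mitigation techniques for LLM-as-judge.</li>
<li>The authority of individual benchmarks (it shifts roughly every six months).</li>
</ul>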
<h2 id="下一章">Next chapter</h2>
<p>Next: <a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 Benchmarking and evaluation methodology</a>, which grounds the coordinate system in concrete benchmark design. For subjective eval tooling see <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge</a>; for how production traces feed eval see <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a>; for the relationship to the fuzzy engineering paradigm see <a href="/blog/llm/00-foundations/deterministic-vs-fuzzy-engineering/" data-link-title="0.8 Deterministic vs Fuzzy Engineering：軟體設計典範的位移" data-link-desc="傳統 deterministic 軟體跟 fuzzy LLM 軟體在資料、邏輯、分解、實驗成本四個維度的根本差異、以及哪段該 deterministic、哪段該 fuzzy 的決策框架">0.8</a> (testing fuzzy behavior is, in essence, a distribution metric).</p>
]]></content:encoded></item></channel></rss>