Evals on Tarragon

Beyond LLM: Enhancing LLM Applications (Stanford CS230)

Thu, 14 May 2026 00:00:00 +0000

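Source: Stanford CS230 Deep Learning, lecture “Beyond LLM: Enhancing Large Language Model Applications”.

Editorial principles: keep the speaker's original English to avoid translation distortion, remove verbal filler, and reorganize the material into article structure. Headings and lead-ins in the original post are in zh-Hant.

Lecture positioning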

We started with neurons, then layers, then deep networks, then how to structure projects in C3. This lecture goes one level beyond: what would it look like if you were building agentic AI systems at work, in a startup, in a company?

The goal is not to build an end-to-end product in the next hour, but to give you the breadth of techniques that AI engineers have figured out — and are still exploring — so that after class you have the baggage to dive deeper and learn faster.

Agenda:

Challenges and opportunities for augmenting LLMs
Prompt engineering
Fine-tuning (and why to mostly avoid it)
Retrieval-Augmented Generation (RAG)
Agentic AI workflows
Case study with evals
Multi-agent workflows
What’s next in AI

1. Why augment LLMs?

Limitations that show up when you use a vanilla pre-trained model:

Lacks domain knowledge — e.g. a student project building an autonomous farming device with a camera that classifies sick crops. That data set isn’t out there; a pre-trained vision model lacks that knowledge.
Real-world distribution shift — the model was trained on high-quality data, but data in the wild is much messier.
Lacks current information — retraining from scratch every few months is impractical. Example: during Trump’s first presidency he tweeted “Covfefe.” The word didn’t exist; Twitter’s LLMs couldn’t recognize it, recommender systems went wild. New trends and slang (rizz, mid, etc.) appear constantly and you can’t keep retraining.
Trained for breadth, not depth — fine on a wide range of tasks, but may not be precise enough for narrow, well-defined enterprise applications with high precision / low latency requirements.
Carries unnecessary weight — a massive model where you only use 2% of capability is slow and expensive. Pruning, quantization, and modification are options.

LLMs are hard to control

In 2016 Microsoft launched a Twitter bot that learned from users and quickly became a racist jerk. They removed it 16 hours after launch. Even better-funded teams struggle: there’s an ongoing debate (Elon Musk vs Sam Altman) on whose LLM is the “propaganda machine.” If you hang out on X you’ll see screenshots of LLMs saying controversial things. Even the best-funded labs don’t do a great job of controlling their LLMs.

LLMs may underperform on your task

Specific knowledge gaps (e.g. medical diagnosis)
Missing sources — research, education, legal all require sourcing
Inconsistencies in style / format (e.g. legal contracts where every word counts)
Task-specific understanding — example: a biotech company categorizing reviews as positive / neutral / negative. What counts as “negative” in that industry may differ from a generic LLM’s notion. You need to align the LLM to your task.

Limited context handling

A lot of enterprise applications need large context. Example: an LLM running on top of your entire drive that can answer “what was our Q4 sales performance?” in one shot. In practice the context window is limited (best models today max out around hundreds of thousands of tokens; 200K ≈ two books). For video or large data, you have to chunk and embed.

The attention mechanism doesn’t attend well over very large contexts. The needle-in-a-haystack benchmark tests this: insert a single sentence (“Arun and Max are having coffee at Blue Bottle”) in the middle of a very long text like the Bible, then ask “what were Arun and Max having?” It’s complex not because the question is hard but because the model must find a fact within a huge corpus.

The RAG debate

In theory, with infinite compute, RAG is useless — you could just read a massive corpus immediately and answer. But even then, latency matters; imagine the LLM reading your entire drive on every question. RAG also has other advantages: accuracy, sourcing.

Analogy to search: when you search, you still find sources. There’s detailed traversal that ranks and finds specific links. Without that, you’d be reading the entire web every query — not reasonable. So RAG-like approaches likely stay relevant.

2. Two dimensions of optimization

Two axes when improving LLM-based products:

Foundation model axis — move from GPT-3.5 Turbo → GPT-4 → GPT-4o → GPT-5. Each step (in theory) improves base performance.
Engineering axis — keep the same base model, but engineer how you leverage it: better prompts, RAG, agentic workflow, multi-agent system.

This lecture is about the engineering axis: given the LLM you are using, how do you maximize its performance?

3. Prompt engineering

The BCG / HBS / UPenn / Wharton study

Three groups of BCG consultants:

No AI access
GPT-4 access
GPT-4 + training on how to prompt

Two interesting findings:

The jagged frontier: some tasks fall within the frontier where AI clearly helps; others fall outside, where AI actually makes performance worse. Many tasks fell within, many fell outside. Researchers also observed “falling asleep at the wheel” — relying on AI for a task beyond the frontier, and not reviewing outputs carefully.

Centaurs vs cyborgs: two working modes.

Centaurs divide and delegate — give a big task to the AI, let it work, come back later. (Half human / half horse: clear delegation.)
Cyborgs fully blend with AI — fast back-and-forth, augmented. Students often work like cyborgs; in the enterprise, when you automate a workflow, you’re thinking like a centaur.

The trained group did best. Prompt engineering is a skill everyone should have — not a job title to build a career on, but a powerful skill in your career.

Basic prompt design principles

A weak prompt:

Summarize this document. {document}

The model has no context on length, audience, focus. Better:

Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers.

Common techniques to make it even better:

Give an example of a great summary
Role prompting: “Act as a renewable energy expert giving a conference at Davos”
Praise: “You are the best in the world at this”
Reflection / self-critique: ask the model to critique its own output and revise
Chain of thought: break the task into explicit steps, “think step by step, do not skip any step.” Step 1 identify the three most important findings; Step 2 explain impact; Step 3 write the five-bullet summary.

Andrew Ng recommends looking at other people’s prompts. Repos like “awesome prompt template” on GitHub have many examples engineers have built. Many start with “Act as a Linux terminal”, “Act as an English translator”, “Act as a position interviewer”, etc.

Prompt templates

The advantage of a template is you can put it in your code and scale across many user requests. Example from Workera: the HR system has “Jane is a Product Manager Level 3, US, preferred language English.” That metadata gets inserted into a prompt template that personalizes for Jane. Same template, different metadata for Joe (preferred language Spanish).
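
A minimal sketch of the template idea above, using Python's standard `string.Template`. The template wording, the field names, and every value for Joe other than his preferred language are assumptions made for illustration; they are not Workera's actual schema.

```python
from string import Template

# Hypothetical template; field names are illustrative only.
PROMPT_TEMPLATE = Template(
    "You are a learning assistant. The user is $name, a $role (level $level) "
    "based in $country. Respond in $language.\n\nUser request: $request"
)

def render_prompt(profile: dict, request: str) -> str:
    """Fill the shared template with one user's HR metadata."""
    return PROMPT_TEMPLATE.substitute(request=request, **profile)

# Metadata as it might come from an HR system (values partly assumed).
jane = {"name": "Jane", "role": "Product Manager", "level": 3,
        "country": "US", "language": "English"}
joe = {"name": "Joe", "role": "Product Manager", "level": 3,
       "country": "US", "language": "Spanish"}

print(render_prompt(jane, "Recommend a skill to practice next."))
print(render_prompt(joe, "Recommend a skill to practice next."))
```

The template stays fixed in code while the metadata varies per user, which is what lets one prompt scale across many requests.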

Foundation models likely use system prompts you don’t see — e.g. ChatGPT may inject “Act like a helpful assistant” plus user memories from a database before your prompt. That doesn’t stop you from adding your own template on top.

Zero-shot vs few-shot prompting

Zero-shot:

Classify the tone as positive, negative, or neutral. “The product is fine, but I was expecting more.”

Different humans would label this differently — partially positive, partially negative. Alignment to your task can come from few-shot:

Here are examples of tone classifications:
“These exceeded my expectations completely.” → positive
“It’s OK, but I wish it had more features.” → negative
“The service was adequate. Neither good nor bad.” → neutral
Now classify: “The product is fine, but I was expecting more.”

The model now likely says negative, aligned to the second example.

Sophisticated AI startups keep their few-shot examples up to date — whenever a user says something interesting, a human labels it and it gets appended to the relevant prompt. Like building a dataset, but inserted directly in the prompt. Faster to iterate because you don’t touch model weights.
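
The few-shot prompt and the "append a newly labeled example" loop can be sketched as below. The function names and the in-memory list are assumptions; a real system would presumably keep the examples in a datastore and send the prompt to a model.

```python
# Labeled examples taken from the text; stored in a plain list for this sketch.
examples = [
    ("These exceeded my expectations completely.", "positive"),
    ("It's OK, but I wish it had more features.", "negative"),
    ("The service was adequate. Neither good nor bad.", "neutral"),
]

VALID_LABELS = {"positive", "negative", "neutral"}

def build_few_shot_prompt(examples, text: str) -> str:
    """Assemble the few-shot classification prompt from labeled examples."""
    lines = ["Here are examples of tone classifications:"]
    for sample, label in examples:
        lines.append(f'"{sample}" -> {label}')
    lines.append(f'Now classify: "{text}"')
    return "\n".join(lines)

def add_labeled_example(examples, text: str, label: str) -> None:
    """Append a human-labeled example so later prompts include it."""
    if label not in VALID_LABELS:
        raise ValueError(f"unknown label: {label}")
    examples.append((text, label))

print(build_few_shot_prompt(examples, "The product is fine, but I was expecting more."))
```

No model weights are touched: updating behavior is a list append followed by re-rendering the prompt.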

Q: How long can the prompt be before the model loses itself?

There is research, but it dates fast. Practical example from Workera: a voice conversation eval breaks down after ~8 turns. Mitigation: chapter the conversation, summarize the first part, start over from a new prompt with the summary inserted.

Chaining complex prompts

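Prompt chaining is the most popular technique. It is not the same as chain of thought: chaining splits the task across several separate prompts, each feeding the next.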

Single prompt for a customer review response:

Read this review and write a professional response that acknowledges concerns, explains the issue, offers a resolution. {review}

You get one output. Hard to debug — everything is mixed together.

Chained version, three prompts:

Extract the key issues from this review.
Using these issues, draft an outline.
Using the outline, write the full response.

Advantages:

Each prompt can be tested and optimized independently
You can identify which step is weakest (outline good but email rude? then prompt 3 is the bottleneck)
Easier to debug than one mega-prompt

Tradeoff: latency. Chains add latency, so for certain applications you don’t want long chains.
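
The three-prompt chain above can be sketched as a function that takes any text-in, text-out model call. Here `llm` is a hypothetical callable and `fake_llm` is a stand-in so the chain runs offline; neither is a real library API.

```python
from typing import Callable, Dict

def respond_to_review(review: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run the three chained prompts and keep every intermediate output."""
    issues = llm(f"Extract the key issues from this review.\n\n{review}")
    outline = llm(f"Using these issues, draft an outline.\n\n{issues}")
    response = llm(f"Using the outline, write the full response.\n\n{outline}")
    # Returning intermediates is what makes each step testable on its own.
    return {"issues": issues, "outline": outline, "response": response}

def fake_llm(prompt: str) -> str:
    """Stand-in model: echoes the instruction line of the prompt."""
    return f"[output for: {prompt.splitlines()[0]}]"

result = respond_to_review("The blender broke after two days.", fake_llm)
for step, text in result.items():
    print(step, "=", text)
```

Each call is sequential, which is where the added latency comes from: three round trips instead of one.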

Testing prompts

Start with manual error analysis — a baseline prompt, a refined prompt, a chained workflow; humans rate outputs. Manual is slow but builds intuition.

To scale, use platforms (e.g. Promptfoo) that let you:

Run the same prompt across multiple LLMs side by side in a table
Define LLM judges

Flavors of LLM judges:

Pairwise comparison: “Which summary is better?”
Single-answer grading: “Grade this summary 1–5”
Reference-guided pairwise or rubric-based: e.g. “A 5 is a summary below 100 chars, with three distinct key points, starting with an overview sentence; a 0 fails to summarize.”

You can stack techniques: few-shot the rubric with examples of 5/5, 4/5, 3/5, etc.
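
A rubric-based judge can be sketched as a prompt plus a strict parser for the reply. The `SCORE: N` reply format, the 0 to 5 range accepted by the parser, and the `llm` callable are all assumptions for this sketch; the rubric anchors come from the example above.

```python
import re
from typing import Callable, Optional

RUBRIC = (
    "Grade the summary.\n"
    "5: below 100 characters, three distinct key points, "
    "starts with an overview sentence.\n"
    "0: fails to summarize.\n"
    "Reply with 'SCORE: N' followed by a one-sentence reason."
)

def parse_score(reply: str) -> Optional[int]:
    """Return the integer score, or None if the reply is malformed."""
    match = re.search(r"SCORE:\s*([0-5])\b", reply)
    return int(match.group(1)) if match else None

def judge(summary: str, llm: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model to grade one summary against the rubric."""
    return parse_score(llm(f"{RUBRIC}\n\nSummary:\n{summary}"))

# Stand-in judge so the sketch runs without a model.
print(judge("Solar is cheaper; storage lags; policy matters.",
            lambda prompt: "SCORE: 4 - concise but no overview sentence"))
```

Returning `None` for unparseable replies keeps malformed judge output from silently counting as a grade.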

4. Fine-tuning (and why I steer away)

Reasons to avoid fine-tuning:

Requires substantial labeled data
May overfit to specific data, losing general-purpose utility
Time- and cost-intensive — by the time you’re done, the next base model is out and beating your fine-tuned version

The advantage of prompt engineering is you can drop in the next best pre-trained model directly. Fine-tuning doesn’t work like that.

When fine-tuning still makes sense:

Task requires repeated high-precision outputs (legal, scientific)
The general-purpose LLM struggles with domain-specific language

The Slack fine-tuning cautionary tale

Ross Lazerowitz (Sep 2023) fine-tuned a model on his company’s Slack messages, hoping it would “speak like us.” Then he asked:

Write a 500-word blog post on prompt engineering.

The model: “I shall work on that in the morning.”

He pushes back: “It’s morning now.”

Model: “I’m writing right now.”

“It’s 6:30 AM here. Write it now.”

“OK, I shall write it now. I actually don’t know what you would like me to say about prompt engineering. I can only describe the process…”

It learned how people talk on Slack — not how they write blog posts. Fine-tuning went wrong because the training distribution wasn’t the task distribution.

5. Retrieval-Augmented Generation (RAG)

Why standalone LLMs fall short

Small / hard-to-attend-to context windows
Knowledge gaps and training cutoff dates
Hallucinations — costly in medical, education
Lack of sources — research, education, legal love sources. Vanilla LLMs hallucinate fake research papers.

How a vanilla RAG works

Question-answering in the medical field: “What are the side effects of drug X?”

Knowledge base of documents
Embed documents into lower-dimensional vectors (trade-off: too small → lose info; too big → latency)
Store embeddings in a vector database with efficient retrieval and a distance metric
Embed the user query with the same algorithm
Retrieve the most relevant documents by distance
Pull those documents, paste into a prompt template like:

Answer the user query based on the list of documents. If the answer is not in the documents, say “I don’t know.” Cite exact page, chapter, and line.

You can extend the template to require links to the specific page.
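
The retrieval steps above can be sketched end to end with a toy embedding. The bag-of-words `embed` function stands in for a learned embedding model, the dictionary stands in for a vector database, and the three documents are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: term counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented knowledge base; the dict below plays the role of the vector DB.
documents = {
    "drug_x_leaflet": "Drug X side effects include nausea and headache.",
    "drug_y_leaflet": "Drug Y should be stored below 25 degrees.",
    "clinic_hours": "The clinic is open Monday to Friday.",
}
index = {doc_id: embed(text) for doc_id, text in documents.items()}

def retrieve(query: str, k: int = 1) -> list:
    """Embed the query the same way and return the k closest document ids."""
    q = embed(query)
    ranked = sorted(index, key=lambda d: cosine(q, index[d]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, k: int = 1) -> str:
    """Paste the retrieved documents into the answer template."""
    context = "\n".join(f"[{d}] {documents[d]}" for d in retrieve(query, k))
    return (
        "Answer the user query based on the list of documents. "
        "If the answer is not in the documents, say 'I don't know.'\n\n"
        f"Documents:\n{context}\n\nQuery: {query}"
    )

print(build_prompt("What are the side effects of drug X?"))
```

Keeping the document id in the pasted context is what lets the model cite its source.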

Improving RAGs

Q: Do document embeddings retain location info within large documents?

Vanilla RAGs may not. Example: the giant white paper inside a medication box would not be served well by a vanilla RAG.

Two popular improvements:

Chunking — store both the full document embedding and chapter-level embeddings; retrieve both, sourcing becomes more precise.

HyDE (Hypothetical Document Embeddings) — the user query usually doesn’t look like the documents. Example: “What are the side effects of drug X?” vs a multi-page document. To bridge the gap:

Take the user query
Use a prompt to generate a fake hallucinated document answering it (“write a 5-page report answering this query”)
Embed that fake document
Compare its embedding to the vector DB

The fake document is closer in structure to real documents, so retrieval is more accurate.
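
A small sketch of the HyDE steps, under loud assumptions: `embed` is a toy word-set embedding, `similarity` is Jaccard overlap, `fake_generate` is a canned stand-in for the LLM call, and the two documents are invented. It only illustrates why embedding a hypothetical answer can match documents that the raw query does not.

```python
import re
from typing import Callable, Dict, List, Set

def embed(text: str) -> Set[str]:
    """Toy embedding: the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap, standing in for a vector distance metric."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hyde_retrieve(query: str, generate: Callable[[str], str],
                  doc_vectors: Dict[str, Set[str]], k: int = 1) -> List[str]:
    """Generate a hypothetical document, embed it, and search with it."""
    hypothetical = generate(f"Write a 5-page report answering this query: {query}")
    h_vec = embed(hypothetical)
    ranked = sorted(doc_vectors,
                    key=lambda d: similarity(h_vec, doc_vectors[d]),
                    reverse=True)
    return ranked[:k]

docs = {
    "leaflet": "Adverse reactions reported in trials: nausea, dizziness, headache.",
    "storage": "Store in a cool dry place away from sunlight.",
}
doc_vectors = {name: embed(text) for name, text in docs.items()}

def fake_generate(prompt: str) -> str:
    """Canned stand-in for the model's hallucinated report."""
    return "Reported adverse reactions include nausea, headache and dizziness."

query = "What are the side effects of drug X?"
print(hyde_retrieve(query, fake_generate, doc_vectors))
```

In this toy setup the raw query shares no words with the leaflet, while the hypothetical document shares most of them, which mirrors the structural gap HyDE is meant to bridge.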

These are just two of many RAG variants; research from 2020–2025 has many branches. (See the linked survey paper in the slides.)

6. Agentic AI workflows

Andrew Ng coined “agentic AI workflows” because everyone uses “agent” to mean very different things — sometimes a single prompt, sometimes a complex multi-agent system. Calling everything an “agent” doesn’t do it justice. Better term: agentic workflow — a multi-step process to complete a task, built from prompts, tools, additional resources, and API calls. This also avoids confusion with the RL definition of “agent” (interacts with environment, state transitions, reward, observation).

One-shot vs agentic example

User on a chatbot: “What is your refund policy?”

One-shot + RAG: “Refunds are available within 30 days of purchase.” [link to policy]
Agentic:
1. Agent retrieves refund policy via RAG
2. Agent asks user for order number
3. Agent queries an API to check order details
4. Agent confirms: “Your order qualifies. The amount will be processed in 3–5 business days.”

Much more thoughtful than the vanilla one.

Specialized agents in the wild

In SF you’ll see billboards: AI software engineer, AI skill mentor, AI SDR, AI lawyer, AI specialized cloud engineer. It would be a stretch to say everything works, but work is being done. (Personal opinion: putting a human face behind these is gimmicky and more scary than engaging. In a few years, very few products will use a human face — it’s a marketing tactic.)

Paradigm shift: traditional software vs agentic AI software

Dimension	Traditional software	Agentic AI software
Data	Structured: JSON, databases, forms	Free-form text, images, video; dynamic interpretation
Logic	Deterministic	Fuzzy
Decomposition	Monolith / microservices	Think as a manager: delegate to roles (graphic designer → marketing manager → performance marketing → data scientist)
Cost of experimentation	High; you rarely throw away code	Low; AI companies are more comfortable throwing away code

Fuzzy engineering is truly hard. If you let users ask anything, the chance of breakage and attack is high. Companies have been bitten because a user did something authorized that broke the database.

Example from Workera:

Deterministic item types: multiple choice, multi-select, drag-and-drop, ordering, matching — one correct answer.
Fuzzy item types: voice questions, voice + coding role-plays — the scoring algorithm can make mistakes, and mistakes are costly.

Mitigation: a human in the loop — e.g. the appeal feature at the end of an assessment that lets users challenge the agent, bringing a human in to fix and align it.

Advice for building a company: get as much done deterministically as possible. Then for the fuzzy parts (back-and-forth interaction), design guardrails up front.

Enterprise workflows: the McKinsey credit memo example

A financial institution takes 1–4 weeks to produce a credit risk memo:

Relationship manager gathers data from 15+ sources
RM and credit analyst collaboratively analyze
Credit analyst spends 20+ hours writing the memo
RM and analyst loop on feedback

With Gen AI agents (McKinsey study), time drops 20–60%:

RM works with Gen AI agent, provides materials
Agent decomposes into tasks for specialist sub-agents
Agents gather data, draft memo
RM and analyst review and give feedback

The hardest part is changing people. In theory, this is great. In practice — 100,000-employee enterprises will take 10–20 years to rewire job descriptions, business workflows, incentives, and training to make this real at scale.

Core components of an agent

Take a travel booking agent:

Prompts — the prompts we’ve learned to optimize
Context management / memory:
- Core / working memory: fast access. Things needed every interaction (e.g. user’s name).
- Archival / long-term memory: slower. Things used occasionally (e.g. birthday).
- Why split: imagine ChatGPT had to re-read all memories on every call. If memory lookup takes 3 seconds, every interaction takes 3 seconds. Working memory must be highly optimized.
Tools: flight search API, hotel API, car rental API, weather API, payment processing API. You typically pass API documentation to the LLM — they’re good at reading JSON specs and learning the GET request format.
Resources (Anthropic’s term): data sitting somewhere (e.g. your CRM) that you let the agent read. Provide a lookup tool and access to the resource.
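
The core versus archival split can be sketched as two stores with different access paths. The class and method names are hypothetical, the keyword match stands in for a real (slower) search, and the birthday value is a placeholder.

```python
class AgentMemory:
    """Two-tier memory: core is sent on every call, archival is searched on demand."""

    def __init__(self) -> None:
        self.core = {}       # small, always included in the prompt
        self.archival = {}   # larger, looked up only when needed

    def remember(self, key: str, value: str, core: bool = False) -> None:
        """Store a fact in core memory if flagged, otherwise in archival."""
        (self.core if core else self.archival)[key] = value

    def prompt_context(self) -> str:
        """Cheap path: only core memory is rendered into every prompt."""
        return "\n".join(f"{k}: {v}" for k, v in self.core.items())

    def recall(self, keyword: str) -> dict:
        """Slow path: keyword match standing in for a real archival search."""
        return {k: v for k, v in self.archival.items()
                if keyword.lower() in k.lower()}

memory = AgentMemory()
memory.remember("name", "Jane", core=True)
memory.remember("birthday", "March 3")  # placeholder value
print(memory.prompt_context())
print(memory.recall("birth"))
```

Only `prompt_context` sits on the hot path, so per-interaction latency does not grow with the size of the archive.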

Degrees of autonomy

From least to most autonomous:

Least: hard-code the steps. “First identify intent, then look up history, then call the flight API, …”
Semi: hard-code the tools only. “You’re a travel agent, help the user book travel. Here are your tools.”
Most: agent decides both steps and tools. Give it a code editor; it can ping any web API, perform calculations, generate code to display data.

APIs vs MCP (Model Context Protocol)

With APIs, you teach the LLM to ping a specific API: give it documentation, define how to call it, what it returns. You do this one-off per API. Doesn’t scale well.

With MCP (Anthropic-coined), there’s a system in the middle. Agents communicate with an MCP server:

“What do you need to give me flight info?” “I need origin, destination, and what you’re looking for.” “Here are my requirements.” “You forgot to tell me your budget.”

It’s agent-to-agent communication. Companies publish their MCPs; your agent figures out how to get the data it needs.

Q: Isn’t MCP just a shifted maintenance burden — APIs change, MCPs change?

Yes. But at least the agent can go back and forth and discover requirements. Ideally a startup has documentation, an LLM workflow reads docs and updates code accordingly.

Q: Are there security concerns with MCP?

Likely, depending on the data exposed. Most MCPs have authentication, like APIs. The exact security surface depends on the implementation.

Q: Is MCP about efficiency or accessing more data?

Efficiency. You still control what data is exposed. Compared to one-off API integration, MCP lets a coding agent communicate efficiently with many MCP servers and find what it needs.

Step-by-step workflow example: travel agent

User: “Plan a trip to Paris Dec 15–20 with flights, hotels near the Eiffel Tower, and an itinerary.”
Agent plans steps: find flights, search hotels, generate recommendations, validate preferences/budget, book.
Execute: use tools, combine results.
Proactive interaction: propose to user, validate, iterate.
Update memory: “User only likes direct flights.” “User is fine with 3-star hotels.”

7. Case study: building a customer support agent + evals

PM asks you to build a customer support agent. Example: “I need to change my shipping address for order X — I moved.”

Where to start

Research existing models / benchmarks for customer support
Decompose the task: what would a human support agent do?
Guess what’s fuzzy vs deterministic in advance

Recommended start: sit with a customer support agent for a day or two. Watch their workflow. Ask where they struggle and how much time each step takes. That gives you the task decomposition.

Decomposed task

A human support agent typically:

Extracts key info
Looks up the customer record in the database
Checks policy (allowed to update address?)
Drafts a response email
Sends the email

Designing the agentic workflow

For each step, pick the right primitive:

Step 1 extract info: vanilla LLM call — extract intent, order number, new address
Step 2 lookup + update: tool — connect to database (custom tool or MCP)
Step 3 check policy: RAG or rule lookup
Step 4 draft email: LLM call, with the confirmation pasted in
Step 5 send email: tool — post to email API

Evals: how do you know it works?

Assume you have LLM traces (a must in any AI startup — if a startup doesn’t have traces, debugging is brutal). Several dimensions for evaluation:

End-to-end vs component-based:

End-to-end: user satisfaction rating at the end. If user rates 1, follow up: “What was the issue?” → “Prices were too high” → fix the relevant tool/prompt.
Component-based: error-analyze each tool / prompt independently. “The tool keeps forgetting to update the email field.” “The email-send call uses wrong format.”

Objective vs subjective:

Objective: “LLM extracted the wrong order ID.” You can write Python to check alignment between user input and DB lookup. Catch automatically.
Subjective: “Should we recommend a direct flight or cheaper indirect?” Captured via:
- Curated eval dataset — write 10 prompts where users say “I prefer direct flights, I care about time.” Define what a good output looks like.
- LLM judges grading on a rubric.
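
For the objective case, a check like the following could flag a wrong order ID automatically. It assumes order IDs appear in the user message as `#` followed by digits, which is an assumption of this sketch rather than something the lecture specifies.

```python
import re

def order_ids_in(text: str) -> set:
    """All order ids written as '#digits' in the text (assumed format)."""
    return set(re.findall(r"#(\d+)", text))

def order_id_is_grounded(user_message: str, extracted_id: str) -> bool:
    """True only if the id the LLM extracted literally appears in the message."""
    return extracted_id in order_ids_in(user_message)

message = "I moved. Please change the address for order #12345."
print(order_id_is_grounded(message, "12345"))
print(order_id_is_grounded(message, "12354"))
```

A check like this can run over every trace, with no human or judge model in the loop.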

Quantitative vs qualitative:

Quantitative: % successful address updates; latency per component (e.g. send-email takes 5s — too long).
Qualitative: error analysis on hallucinations, tone mismatch, user confusion. Typically white-glove.

Example of subjective tone eval: error-analyze 20 user interactions, notice the LLM seems rude / overly short. Then build LLM judges with a politeness rubric. Then swap the underlying LLM (GPT-4 → Grok → Llama), run side by side, see which is most polite on average. Or fix the LLM and tweak the prompt (“Act like a travel agent” → “Act like a helpful travel agent”) to measure the word’s influence.

8. Multi-agent workflows

Why multi-agent when a single workflow already has multiple steps?

Parallelism — independent things can run in parallel
Reuse — a design agent built once can serve marketing, product, etc. Many stakeholders benefit from one optimized agent.

Smart home example

Brainstormed by the class:

Biometric / location agent: tracks where you are and how you’re moving
Climate agent: monitors and adjusts room temperature
Energy efficiency agent: tracks usage, gives feedback, may control utilities
Security agent: identifies who’s entering, applies role-based permissions (parent vs kid)
Weather / external API agent: integrates outdoor conditions to control temperature, blinds, etc.
Fridge / grocery agent: knows what’s inside via camera, knows preferences, has e-commerce API access for restocking
Notification / alerts agent: system updates, energy savings
Orchestrator agent: the user-facing entry point that delegates to specialists

Interaction patterns

Flat / all-to-all: every agent can talk to every agent
Hierarchical: orchestrator routes to specialists

Smart home likely wants hierarchical for UX — users want one interface, not one app per agent. Some flat links may still help (climate + energy efficiency probably need to talk directly).

When you allow agents to speak to each other, it’s basically an MCP-style protocol: treat the other agent like a tool. “Here’s how you interact, here’s what it tells you, here’s what it needs from you.”

Advantages

Easier to debug specialized agents than a monolithic system
Parallelization, time savings

9. What’s next in AI

Are we plateauing? (Ilya Sutskever’s question)

The community feeling around the latest GPT release was that the performance jump wasn’t what people expected, though unifying the models under the hood (no model selector) made consumer UX better.

LLM scaling laws say more compute + energy → better performance, but that eventually plateaus. What takes us to the next step is probably architecture search. The human brain operates very differently — much more efficient, much faster, with far less data. Big labs are hiring thousands of engineers precisely to hunt the next architectural breakthrough. Whoever discovered Transformers had tremendous impact on AI’s direction; the next analogous discovery could unlock a 10x reduction in compute and energy needs. (Foundation series analogy: individuals can disproportionately shape the future via their decisions.)

Multi-modality

LLMs started as text-only, added images. Models good at images are also better at text — being good at cat images makes you better at text about cats. Add audio and video, and the whole system improves. Pinnacle: robotics, where all modalities converge — the robot is better at avoiding a cat because it knows what a cat looks like, sounds like, smells like.

Methods working in harmony

Humans probably use a mix of methods:

Meta-learning — survival instinct encoded in DNA (the baby’s “pre-training”)
Supervised — parents pointing and saying “good / bad”
Reinforcement — falling and getting hurt
Unsupervised — observing others

Future AI systems likely combine the methods you saw in CS230, optimizing for speed, latency, cost, and energy.

Human-centric vs non-human-centric research

The human body is limiting. Pure brain-modeled research may miss compute/energy optimizations. Still, the brain has lots to teach — e.g. one research direction asks: does the brain do backpropagation? Probably not — likely only forward propagation. Worth reading if you’re curious about AI’s direction.

Velocity

Things move so fast that we deliberately teach breadth, not depth — because today’s specific RAG technique #17 will be irrelevant in two years. Get the breadth, develop the ability to sprint into depth when needed. The half-life of skills is low.

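Afterword

This post is a write-up of the Stanford CS230 public lecture, keeping the original English to avoid translation distortion. For this blog's corresponding principle-level material in Chinese, continue with:

Module 4: LLM application-layer principles (the cross-tool invariant principles behind RAG / tool use / agent / workflow patterns)
4.1 RAG principles
4.4 Agent architecture principles
4.14 Benchmarking and evaluation methodology
4.21 LLM-as-Judge evaluation methods

Case Study: customer support agent, from task decomposition to eval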

Thu, 14 May 2026 00:00:00 +0000

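This case study's job is to string all the earlier principle chapters of Module 4 into one end-to-end design process, and to show the order of design reflexes when you face a real LLM application task. Each section names the principle chapter it draws on, so readers can see how the principle chapters land in concrete work.

The walkthrough task: the PM assigns “build a customer support agent that can handle user queries and, when needed, complete actions automatically (such as changing an address).” This case study focuses on the high-frequency “change address” query type and walks through the full process.

Design reflexes in this case study

The whole process has seven stages:

Observe the human workflow: interview, decide the task decomposition
Paradigm positioning: which parts should be deterministic, which fuzzy
Workflow design: for each step, pick the matching LLM / tool / RAG / HITL form
Protocol and autonomy decisions: single agent / multi-call / multi-agent
Trace instrumentation: which information to record
Eval design: pick coordinates first, then tools
Iteration loop: error analysis → decide which layer to fix → watch the metric converge

First-time LLM application designers most often skip stages 1, 2, 5, and 6 and jump straight to stage 3 to start writing prompts. That path leads to non-converging iteration: “20 prompt versions later, you cannot tell whether anything improved.” This case study emphasizes the order of design reflexes, not prompt-writing technique.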

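Stage 1: Observe the human workflow

The PM's task description is “handle user queries,” but “queries” can cover a very wide range. The first reflex is to sit next to customer support and observe for two days, not to open the IDE.

What you actually do:

Count the distribution of incoming query types (what share are refunds / address changes / order-status lookups / complaints / open-ended questions).
Look at the human resolution process for each query type (which steps, which systems to check, which policies to follow).
Identify which query types are high volume + low complexity (most worth automating) and which are low volume + high complexity (poor automation ROI).
Note the steps where humans get stuck and the steps that repeatedly require looking up the same data.

When the interviews are done, you have a task decomposition map. This case study assumes a focus on the high-frequency query type “user requests an address change”:

User: “I moved. Order number #12345, new address is ___”
   ↓
1. Parse intent + extract information (order number, new address)
2. Check order status (shipped? not shipped? delivered?)
3. Check policy (can the address be changed for this order status + user tier?)
4. If allowed: execute the address change (call the logistics / inventory API)
5. If not allowed: explain why and offer alternatives
6. Draft the reply email and send it

Principle referenced: this decomposition itself corresponds to 0.8 fuzzy engineering (the deterministic-vs-fuzzy card): “decompose the task first, then judge whether each part should be deterministic or fuzzy.”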

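Stage 2: Paradigm positioning

Position each step on the paradigm (deterministic / fuzzy):

Step	Paradigm	Why
1. Parse intent + extract information	Fuzzy	Free-text input; needs LLM understanding
2. Check order status	Deterministic	Structured query (give order_id, get status back)
3. Check policy	Deterministic	Rules are enumerable; policy as code
4. Execute address change	Deterministic	API call with a schema and error codes
5. Explain / offer alternatives	Fuzzy	Must be written in natural language, tailored to the situation
6. Draft email + send	Fuzzy (draft) + Deterministic (send)	Writing the email is fuzzy; the send API call is deterministic

The key judgment is putting each thing on the right side of the boundary: rules and policy go in code; natural language and intent parsing go to the LLM.

Write the policy check as code (for example, “user tier + order status → can the address be changed” is a deterministic rule). Counterexample: stuffing the rules into a prompt and letting the LLM judge, which will occasionally skip a rule or misjudge the tier.
Yes/no questions of the “can we do this” kind go through rules. Counterexample: having the LLM make the call, which is hard to debug and non-deterministic.
“Helpful replies” are written by the LLM. Counterexample: hard-coding templates in code, which produces a stiff customer-service-bot tone.

The boundary most easily blurred is in step 6: “draft the email” is fuzzy (natural language, tailored to the situation), while “send the email” is deterministic (call the API, handle error codes). With the two split apart, drafting can be retried or re-prompted without touching the send logic, and sending returns structured errors that LLM hallucination cannot mask. Step 4, “execute the address change,” is similar: the tool call itself is deterministic, but the judgment on whether to call it goes back to the policy check in step 3.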
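
"Policy as code" for step 3 can be sketched as a lookup over (tier, status) pairs. The tier names, the status values, and the specific rules in `ALLOWED` are hypothetical; the point is only that the decision is an enumerable, testable rule rather than an LLM judgment.

```python
from enum import Enum

class OrderStatus(Enum):
    NOT_SHIPPED = "not_shipped"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

# Hypothetical rule table: which (user tier, order status) pairs may change address.
ALLOWED = {
    ("standard", OrderStatus.NOT_SHIPPED),
    ("vip", OrderStatus.NOT_SHIPPED),
    ("vip", OrderStatus.SHIPPED),
}

def can_change_address(tier: str, status: OrderStatus) -> bool:
    """Deterministic policy check: same inputs always give the same answer."""
    return (tier, status) in ALLOWED

print(can_change_address("vip", OrderStatus.SHIPPED))
print(can_change_address("standard", OrderStatus.SHIPPED))
```

Because the rule set is finite, every combination can be covered by a unit test, which is not possible when the same rules live inside a prompt.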

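Principle referenced: the “which part should be deterministic / which fuzzy” decision framework in 0.8 fuzzy engineering, especially the anti-pattern section on “misplaced boundaries.”

Stage 3: Workflow design

Pick the matching tool for each step:

Step	Design choice
1. Parse intent + extract information	Vanilla LLM call + structured output (output forced to a JSON schema: intent / order_id / new_address)
2. Check order status	Tool call → internal order API
3. Check policy	Tool call → policy engine (purely deterministic, does not go through the LLM)
4. Execute address change	Tool call → logistics API; pre-act HITL before the write (high risk + irreversible)
5. Explain / offer alternatives	LLM call + few-shot (retrieve “how similar situations were explained” from the case library, paired with RAG)
6. Draft email + send	LLM call writes the email + structured output with subject/body; sending goes through the email API

Two steps that are easy to get wrong, expanded:

Why step 1 needs structured output rather than plain prompt parsing: the extraction result feeds the deterministic tools in steps 2-4, and a wrong order_id breaks the whole flow. A prompt that merely says “please output JSON” is a weak guarantee; structured output / constrained decoding is a strong guarantee (see 3.10 constrained decoding internals). Trade-off: a strict format may cost expressive flexibility, but this step does not need flexibility; it needs reliability.

Why step 5 pairs with RAG rather than plain few-shot: customer support cases span many situations (order shipped / delivered / VIP / regular user / per-country policy), and a fixed set of few-shot examples cannot cover them all. RAG retrieves the most similar explanation examples from the historical case library at request time; this is retrieval-augmented prompting on the context axis of the 4.0 prompt technique spectrum.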
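
For the step 1 point above, an application-side validator might look like the sketch below. It is a check applied after generation, not constrained decoding itself, and the `Extraction` class and `parse_extraction` function are hypothetical names; only the three field names come from the design table.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("intent", "order_id", "new_address")

@dataclass(frozen=True)
class Extraction:
    intent: str
    order_id: str
    new_address: str

def parse_extraction(raw: str) -> Extraction:
    """Parse model output and refuse anything that does not match the schema."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS
               if not isinstance(data.get(f), str) or not data[f].strip()]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return Extraction(**{f: data[f].strip() for f in REQUIRED_FIELDS})

good = parse_extraction(
    '{"intent": "change_address", "order_id": "12345", "new_address": "1 Main St"}'
)
print(good)
```

Failing loudly here keeps a bad extraction from ever reaching the deterministic tools in steps 2-4.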

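Principles referenced:

Structured output in step 1 → 4.6 application-layer protocols
Tool design in steps 2-4 → 4.3 tool use
Pre-act HITL in step 4 → 4.5 human-AI collaboration topologies, pre-act section. By contrast, the Workera appeal feature in the lecture is post-hoc; this case study chooses pre-act because an address change is irreversible and has a large logistics impact, so it must be reviewed before execution
RAG in step 5 → 4.1 RAG principles + the context axis of the 4.0 prompt technique spectrum

Stage 4: Protocol and autonomy decisions

The control flow of this workflow is linear (1→2→3→4→5→6) with one conditional branch (the step 3 result decides between 4 and 5), and the order of steps is fixed. The judgment:

Which structure to use:

Multi-agent does not fit: the step order is fixed, the roles differ little, and orchestration overhead is pure added cost.
Single agent loop (the model decides the next step) does not fit: this case study assumes single-turn or short multi-turn interactions with a clear step order, so there is no need for the agent to decide. If the user interaction runs many turns with an unfixed turn count (for example, the user adds information midway, changes their mind, or asks follow-ups), an agent loop is worth considering.
Adopted: multi-call pipeline + router. Write it as a deterministic pipeline with a router after step 3 that splits the flow.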
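
The pipeline + router structure can be sketched as ordinary control flow over injected step functions. The step names and the stub implementations are hypothetical, and the pre-act human approval before step 4 is left out to keep the sketch short.

```python
from typing import Callable, Dict

def run_pipeline(message: str, steps: Dict[str, Callable]) -> dict:
    """Fixed step order; the only branch is the router after the policy check."""
    info = steps["extract"](message)                   # step 1 (LLM)
    status = steps["order_status"](info["order_id"])   # step 2 (tool)
    allowed = steps["check_policy"](status)            # step 3 (rules)
    if allowed:                                        # router
        steps["change_address"](info["order_id"], info["new_address"])  # step 4
        note = "address updated"
    else:
        note = steps["explain"](status)                # step 5 (LLM)
    email = steps["draft_email"](note)                 # step 6, draft (LLM)
    steps["send_email"](email)                         # step 6, send (tool)
    return {"branch": "change" if allowed else "explain", "email": email}

sent = []  # stands in for the email API
stub_steps = {
    "extract": lambda msg: {"order_id": "12345", "new_address": "1 Main St"},
    "order_status": lambda order_id: "not_shipped",
    "check_policy": lambda status: status == "not_shipped",
    "change_address": lambda order_id, address: None,
    "explain": lambda status: f"cannot change address: order is {status}",
    "draft_email": lambda note: f"Hello, {note}.",
    "send_email": sent.append,
}

result = run_pipeline("I moved, order #12345, new address is 1 Main St", stub_steps)
print(result)
```

Because the code owns the control flow, the model never chooses the next step; swapping a stub for a real LLM or API call does not change the structure.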

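Principles referenced:

4.8 multi-agent topologies: the “multi-call first, multi-agent only if that is not enough” reflex
4.7 workflow patterns: the pipeline + router pattern
4.4 agent architecture: the “single-call first, agent only if that is not enough” reflex

Autonomy:

Steps 1 (parse), 5 (explain), 6 (draft email): full auto.
Steps 2 and 3 (check order, check policy): full auto (read-only).
Step 4 (execute address change): pre-act HITL (high risk + irreversible), with a diff shown; the user can reject.
Step 6 (send email): optional pre-act HITL (depends on company style: the conservative version reviews the email, the aggressive version sends automatically).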

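Stage 5: Trace instrumentation

Before the workflow goes live, design which information to record. Eval and debugging both depend on traces; without traces, nothing later is possible.

Each step records:

Field	Why
Input (complete)	Needed to reproduce when debugging
Output (complete)	Compare against expectations; build the regression set
Latency	Find bottlenecks
Token cost	Compute cost
Step name + version	Track which version of the prompt / tool ran
Decision branch	Which way the step 3 router went
Error (if any)	Structured error, not a string

All traces for one conversation must share the same conversation_id, so they can be joined later to see the full flow.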
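
One possible shape for a per-step trace record, following the field table above. The class name, field names, and example values are assumptions of this sketch, not a real tracing library's schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

@dataclass
class StepTrace:
    conversation_id: str                   # join key across all steps
    step_name: str
    step_version: str                      # which prompt / tool version ran
    input: str                             # complete input, for replay
    output: str                            # complete output, for regression sets
    latency_ms: float
    token_cost: int
    decision_branch: Optional[str] = None  # e.g. which way the router went
    error: Optional[dict] = None           # structured, e.g. {"code": ..., "detail": ...}

def traces_for(conversation_id: str, traces: List[StepTrace]) -> List[StepTrace]:
    """Join step: collect every trace belonging to one conversation."""
    return [t for t in traces if t.conversation_id == conversation_id]

trace = StepTrace(
    conversation_id="c-1", step_name="check_policy", step_version="v3",
    input='{"order_id": "12345"}', output='{"allowed": false}',
    latency_ms=12.5, token_cost=0, decision_branch="explain",
)
log_line = json.dumps(asdict(trace))  # one JSON line per step
print(log_line)
```

Keeping `error` as a dictionary rather than a string is what allows failures to be grouped and counted later.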

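Principle referenced: 4.20 LLM tracing.

Stage 6: Eval design

Pick coordinates first, then tools. For each eval need in this case study, position it with the 4.13 three-axis coordinates. The threshold numbers below (95%, 80%, ≥4, and so on) are illustrative; real numbers depend on the product baseline, user tolerance, and business cost, and are not universal standards.

Eval 1: Is step 1 extraction accurate?

Three axes: Objective (has ground truth) + Component (tests a single step) + Quantitative (accuracy).
Tool: write 100 labeled queries, run step 1, and measure extraction accuracy (the share with both order_id and new_address correct).
Threshold: do not ship below 95%.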
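
Eval 1 reduces to a small computation once predictions and labels are available. The function names and the two-item example are made up for illustration; the 95% gate is the illustrative threshold from the text.

```python
def extraction_accuracy(predictions: list, labels: list) -> float:
    """Share of cases where both order_id and new_address match the label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(
        pred.get("order_id") == gold["order_id"]
        and pred.get("new_address") == gold["new_address"]
        for pred, gold in zip(predictions, labels)
    )
    return correct / len(labels)

def passes_gate(accuracy: float, threshold: float = 0.95) -> bool:
    """Ship only at or above the threshold."""
    return accuracy >= threshold

labels = [
    {"order_id": "12345", "new_address": "1 Main St"},
    {"order_id": "67890", "new_address": "2 Oak Ave"},
]
predictions = [
    {"order_id": "12345", "new_address": "1 Main St"},
    {"order_id": "67890", "new_address": "2 Oak Avenue"},  # address mismatch
]
accuracy = extraction_accuracy(predictions, labels)
print(accuracy, passes_gate(accuracy))
```

Exact string match is strict; a real eval might normalize addresses first, which would be a separate design decision.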

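Eval 2: Steps 2-4 tool-call behavior is correct

Three axes: Objective + Component + Quantitative.
Tool: mock the APIs, give each of steps 2-4 50 cases, and check that the tool-call parameters and the return-value handling are correct.
Threshold: 100% (this is deterministic behavior and should have no errors).

Eval 3: Step 5 explanation quality

Three axes: Subjective (no single correct answer) + Component + Quantitative.
Tool: LLM-as-judge with rubric (clarity / helpfulness / tone), scale 1-5, aggregate average.
Threshold: average ≥ 4, and the share of 1-2 scores below 5%.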
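
The Eval 3 threshold combines two conditions, which a short gate function makes explicit. The function name is hypothetical and the default numbers are the illustrative ones from the text.

```python
def passes_quality_gate(scores: list, min_average: float = 4.0,
                        max_low_share: float = 0.05) -> bool:
    """Average judge score must be high AND very few scores may be 1 or 2."""
    if not scores:
        raise ValueError("no scores to evaluate")
    average = sum(scores) / len(scores)
    low_share = sum(1 for s in scores if s <= 2) / len(scores)
    return average >= min_average and low_share < max_low_share

print(passes_quality_gate([5, 4, 4, 5]))
print(passes_quality_gate([5] * 19 + [1]))
```

The second call shows why the average alone is not enough: nineteen 5s and a single 1 average 4.8, yet the low-score share is exactly 5%, so the gate fails.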

Eval 4: Step 6 email quality

Three axes: Subjective + Component + Quantitative, plus Qualitative human review.
Tool: LLM judge scores + a weekly human review of 20 sampled emails, checking for hallucinated promises and fit with the company tone.
Threshold: judge average ≥ 4, and no critical issue in human review.

Eval 5: E2E success rate

Three axes: Objective + End-to-end + Quantitative.
Tool: run 200 representative cases and measure the share that are “fully completed + no user appeal.”
Threshold: ≥ 85% baseline; alert if it drops below 80%.

Eval 6: User satisfaction

Three axes: Subjective + End-to-end + Quantitative.
Tool: show thumbs up/down + an optional comment at the end of every interaction; track weekly.
Threshold: thumbs up rate > 80%, appeal rate < 5%.

Eval 7: Failure mode patterns (ongoing)

Three axes: Objective / Subjective + End-to-end + Qualitative.
Tool: each week read 50 sampled traces plus 100% of failure / appeal traces, looking for emerging patterns.
Output: bug tickets, prompt-change hypotheses, policy-hardening hypotheses.

Principles referenced:

Three-axis coordinates → 4.13 eval design framework
LLM judge rubric → 4.21 LLM-as-Judge
Traces feeding eval → 4.20 LLM tracing

Stage 7: Iteration Loop

After launch, the job is continuous iteration, not "waiting for problems". A typical iteration cycle:

Production trace + eval result
   ↓
[Error analysis: find emerging patterns]
   ↓
   Hypothesis: which layer is at fault?
   ├── Prompt layer → change prompt → A/B test → watch eval converge
   ├── Tool layer   → change tool / schema → run component eval → converge
   ├── RAG layer    → change chunking / query rewriting → run [retrieval recall](/llm/knowledge-cards/retrieval-recall/) → converge
   ├── Policy layer → change deterministic rule → run step 3 component eval → converge
   └── Model layer  → swap model → run the full eval set → converge
   ↓
[Change goes to production]
   ↓
[Keep the frozen baseline; compare each new version against it so drift stays visible]

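The reflex for deciding "which layer to change":

Failure signal	Layer to change
Step 1 extracts the wrong information	Prompt / structured output schema
Tool-call arguments are wrong	Tool description / few-shot examples in the prompt
Tool crashes	Tool implementation (not an LLM problem)
RAG fails to retrieve relevant cases	Chunking / embedding / query rewriting
Policy judgment is wrong	Deterministic rule (not an LLM problem)
Email tone is off	Prompt (role / few-shot)
Email hallucinates commitments	Output validator (not only the prompt)
Overall latency is too high	Find the bottleneck in the trace; may need caching / parallelism

Principles referenced:

Failure diagnosis at the prompt and model layers → 4.0 prompt technique spectrum, systematic vs random error
Overall reading of the fuzzy / deterministic boundary → 0.8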

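Five design reflexes that are easy to skip

In practice teams often skip these five reflexes and end up in iteration that never converges:

Reflex 1: Observe first, open the IDE second

The value of stage 1 is aligning task decomposition with the real human workflow. Without that alignment, the prompts and tool boundaries you write drift from reality and get redone three days later. Two days spent on stage 1 are worth more than two weeks spent on stage 3. Counterexample: "let me just write a prompt and try it", skipping observation and going straight to code.

Reflex 2: Policy lives in code; the LLM only parses intent

Judgment rules (user tier, order status, whether an operation is allowed) go in deterministic code; the LLM handles only intent extraction, that is, "what does the user want to do". This boundary makes debugging easier and lets rules change without prompt iteration. Counterexample: "LLM, decide whether this order's address can be changed; the rules are: …", which stuffs the judgment into the prompt, makes debugging hard, and leaves rule drift untraceable. This is the "boundary misuse" anti-pattern from 0.8.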
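To make the boundary concrete, here is a minimal sketch in which the LLM's only contribution is an extracted `intent` dict and the verdict comes from a plain function. The rule table, status names, and tier names are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class Order:
    status: str
    user_tier: str

# Hypothetical rule table: order statuses that still allow an address change, per tier.
CHANGEABLE_STATUSES = {
    "standard": {"pending"},
    "premium": {"pending", "packed"},
}

def can_change_address(order: Order) -> bool:
    """Deterministic policy: no model call, trivially unit-testable."""
    return order.status in CHANGEABLE_STATUSES.get(order.user_tier, set())

def decide(intent: Dict[str, str], order: Order) -> str:
    """intent is what the LLM extracted; the verdict comes from code only."""
    if intent.get("action") != "change_address":
        return "unsupported"
    return "allow" if can_change_address(order) else "deny"
```

Changing a rule is then a reviewed code diff with a test, not a prompt edit whose effect has to be re-measured.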

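Reflex 3: Tracing is a day-1 design decision

From day one, put input / output / latency / token / step name / decision branch / error into the trace, tied to one conversation_id. Both eval and debugging depend on traces; without them, nothing downstream is possible. Counterexample: "get the system running first, add tracing later", so that when a bug appears debugging starts from zero and production behavior cannot be reconstructed after the fact.

Reflex 4: Check deterministic behavior with deterministic checks

Behavior with ground truth (is the extraction right, are the API arguments right, does the JSON match the schema) is verified with Python functions: cheap to evaluate and high precision. Keep the LLM judge for subjective behavior with no ground truth. Counterexample: using an LLM judge to test "is step 1 extraction correct", which multiplies the cost and is less precise than a deterministic check. This is axis mismatch 1 in 4.13.

Reflex 5: Keep a frozen baseline

A frozen baseline is one specific prompt + one specific model, frozen after running in production for a while; every new version is compared against it so drift stays visible. Counterexample: comparing only against "the previous version", so that after half a year the accumulated drift is invisible and "did it get better overall?" cannot be answered.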
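The difference between the two comparisons is easy to show with numbers. In the sketch below, metric names and values are made up, and `compare_versions` is a hypothetical helper that reports each metric's delta against both the previous version and the frozen baseline.

```python
from typing import Dict

Metrics = Dict[str, float]

def compare_versions(
    new: Metrics, previous: Metrics, baseline: Metrics
) -> Dict[str, Dict[str, float]]:
    """Delta of each metric against the previous version and the frozen baseline."""
    report = {}
    for name, value in new.items():
        report[name] = {
            "vs_previous": round(value - previous[name], 4),
            "vs_baseline": round(value - baseline[name], 4),
        }
    return report

# Made-up numbers: each release loses about a point, which looks harmless step by step.
report = compare_versions(
    new={"success_rate": 0.85},
    previous={"success_rate": 0.86},
    baseline={"success_rate": 0.90},
)
```

Here `vs_previous` is -0.01 while `vs_baseline` is -0.05: the step-by-step view hides a five-point regression that the baseline view exposes.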

Mapping to other chapters

The principle chapters referenced by each stage of this case:

Stage	Chapters referenced
1. Observe the human workflow	0.8 fuzzy engineering
2. Paradigm positioning	0.8 fuzzy engineering
3. Workflow design (prompt / tool / RAG / HITL)	4.0, 4.1, 4.3, 4.5
4. Structure decision (multi-call vs agent vs multi-agent)	4.4, 4.7, 4.8
5. Trace instrumentation	4.20 LLM tracing
6. Eval design	4.13 eval framework, 4.14, 4.21
7. Iteration loop	4.0 prompt spectrum, systematic vs random error section


4.13 Eval design coordinate system: three axes, eight quadrants, when to test what

Thu, 14 May 2026 00:00:00 +0000

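Everyone asks "how do we test" an LLM application, but the answers are usually tool-level: "run some benchmark", "get an LLM judge". In practice the tool comes last; the design work is choosing which axes to test first, then choosing the tool. With the wrong axis, even a good tool yields no useful signal. Using a subjective tool on objective behavior (for example, an LLM judge checking whether an amount was computed correctly) and using an end-to-end tool on a component bug (for example, watching user satisfaction when the retrieval pipeline is actually dropping chunks) are both common axis mismatches.

This chapter lays out the coordinate system for eval design: three binary axes, eight quadrants, which tools map to each quadrant, and how to recognize the signals of a wrong axis. This framing is meta, not a concrete eval method; concrete methods are in 4.14 benchmarking and 4.21 LLM-as-Judge.

Chapter goals

After reading this chapter you can:

Place any eval need on the three axes and locate its quadrant.
Choose the matching eval tool for each quadrant.
Recognize the signals of an axis mismatch and avoid the common "right tool, wrong axis" trap.
Plan an eval roadmap: which quadrants to cover early, and which to add at scale.
Connect eval design with 4.14 benchmarking / 4.20 tracing / 4.21 LLM-as-Judge into a complete pipeline.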

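Three axes

The three orthogonal axes of eval design:

Axis 1: Objective ↔ Subjective

Objective: there is a clear ground truth, and the check can be written as a deterministic check (is the amount right, does the SQL run, is the JSON schema-valid).
Subjective: there is no single correct answer, so it needs scoring or comparison (is the tone good, is the explanation clear, does the recommended trip suit the user).

Test: "can a Python function decide right or wrong?" If yes, objective; if no, subjective.

Axis 2: Component ↔ End-to-End

Component: test a single component in isolation (did retrieval return the right chunk, were the tool-call arguments right, did the prompt extract the correct entity).
End-to-End: test the full flow from the user's point of view (was the user's problem solved, was the order completed, conversation satisfaction).

Test: "when it fails, you want to know which segment broke" → component; "you only care about the final experience" → end-to-end.

Axis 3: Quantitative ↔ Qualitative

Quantitative: produces numbers (accuracy / latency / cost / pass rate) that can be tracked, compared, and alerted on.
Qualitative: produces observations (error patterns, user complaints, reviewer notes) that cannot be aggregated directly but can guide hypotheses.

Test: "can the results be averaged?" → quantitative; "you only know after reading them" → qualitative.

Orthogonality of the three axes

The three axes are orthogonal, not synonyms:

"Objective + component + quantitative" is typically a unit test (does the function return the right value).
"Subjective + end-to-end + qualitative" is typically a user interview (overall user satisfaction).
The quadrants in between are various mixes, each with its own tools.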

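Eight quadrants

3 binary axes = 8 quadrants. Common tools for each quadrant:

Quadrant	Typical question	Matching tools
Objective + Component + Quantitative	Is this function / tool / RAG component correct?	Unit test, deterministic check, retrieval recall@k
Objective + Component + Qualitative	What is this component's failure pattern?	Error log analysis, trace inspection
Objective + End-to-end + Quantitative	System-wide success rate / latency	E2E test, success metric, latency p95
Objective + End-to-end + Qualitative	What are the system's catastrophic failure cases?	Production incident review, reading sampled traces
Subjective + Component + Quantitative	Score for this step's output	LLM-as-judge pairwise / rubric, human rating
Subjective + Component + Qualitative	What about this step's output makes people uncomfortable?	Human review, error analysis with comments
Subjective + End-to-end + Quantitative	Overall user NPS / satisfaction score	CSAT, thumbs up/down, appeal rate
Subjective + End-to-end + Qualitative	What does the user want, and where is it unmet today?	User interviews, open-ended surveys, social listening

The point is not "do all eight" but "see which quadrant your question is in and use the matching tool".

Two quadrants that are most often misjudged, expanded:

Subjective + Component + Quantitative (scoring this step's output): the tools listed are "LLM-as-judge pairwise / rubric, human rating", but pairwise is the first choice, not rubric. Pairwise comparison makes judge bias more controllable (with two answers side by side, it is easier to say which is better), while rubric scoring is easily affected by verbosity / position bias. Keep rubrics for cases that "need an absolute score rather than a relative ranking" (such as tracking absolute quality drift). See the bias mitigation section of 4.21 LLM-as-Judge.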
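Pairwise judging is itself sensitive to which answer is shown first, so a common mitigation is to ask the judge twice with the order swapped and only accept a verdict that survives the swap. The sketch below assumes a hypothetical `judge` callable that returns `"first"` or `"second"`; in practice it would wrap an LLM call, which is not shown here.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]

def pairwise_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie"; a verdict that flips with order counts as a tie."""
    a_wins_forward = judge(question, a, b) == "first"    # a shown first
    a_wins_backward = judge(question, b, a) == "second"  # a shown second
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # inconsistent across orders: likely position bias
```

A judge that always picks whatever it sees first produces a tie on every pair, which makes the bias visible instead of silently inflating one side's win rate.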

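Objective + Component + Quantitative (is the component right): this quadrant is the easiest to build and the cheapest to run: deterministic checks paired with component tests, executed in CI and on sampled production traces. If a production AI system does not cover this quadrant, bugs are only ever found through user complaints, and debugging and incident review become expensive. Counterexample: handing this quadrant's tests to an LLM judge (see mismatch 1).

Signals of an axis mismatch

When the axis is wrong, the tool produces signals that "look reasonable but are actually useless". Three common mismatches:

Mismatch 1: Using a subjective tool on objective behavior

Example: to check whether an order total is computed correctly, asking an LLM judge "is this amount reasonable?".

Problem: the total has a ground truth and should be a deterministic check (assert order.total == expected). An LLM judge's sense of "reasonable" is biased: it lets obvious errors through and nitpicks answers that are correct but unintuitive.
Signal: you find yourself writing a "judge prompt" that describes "what a reasonable amount looks like" when the behavior actually has an objective standard.
Fix: turn the judge prompt into a deterministic check.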
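As an illustration of the fix, the order-total case can be checked by recomputing the expected value and comparing exactly. The pricing rule here (line items times quantity, plus a flat tax rate, rounded to cents) is an assumption chosen for the example, not a rule taken from the case study.

```python
from decimal import Decimal
from typing import List, Tuple

def expected_total(items: List[Tuple[str, int]], tax_rate: str) -> Decimal:
    """Recompute the total from (unit_price, quantity) pairs; prices are decimal strings."""
    subtotal = sum(Decimal(price) * quantity for price, quantity in items)
    return (subtotal * (1 + Decimal(tax_rate))).quantize(Decimal("0.01"))

def check_total(reported_total: str, items: List[Tuple[str, int]], tax_rate: str) -> bool:
    """Deterministic check: the reported total either matches or it does not."""
    return Decimal(reported_total) == expected_total(items, tax_rate)
```

Using `Decimal` rather than `float` keeps the comparison exact, so the check never needs a tolerance or a judgment call.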

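Mismatch 2: Using an end-to-end tool on a component bug

Example: the system-wide success rate drops from 90% to 80%; after a week of chasing, the cause turns out to be retrieval dropping chunks.

Problem: an E2E metric tells you "something is wrong", not "where". Without component evals, debugging means working backwards from traces, which is slow.
Signal: post-incident root cause analysis regularly takes more than a day, and what it finds is something a component eval would have caught in seconds.
Fix: add component evals for critical components (retrieval, tool calls, the parse stage) and run them continuously in production.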
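For the retrieval component, the usual component-level metric is recall@k: of the chunks known to be relevant, how many appear in the top k results. A minimal sketch, assuming chunk IDs are strings and a labeled set of relevant IDs exists for each query:

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant chunk IDs that appear among the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)
```

Tracking this number per release would flag a "retrieval is dropping chunks" regression directly, instead of letting it surface a week later as a drop in the end-to-end success rate.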

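Mismatch 3: Using a quantitative tool to find a qualitative signal

Example: user satisfaction drops from 4.2 to 4.0; the team stares at the number for a week and does not know what happened.

Problem: a quantitative metric tells you "something changed", not "why". Only qualitative signals (the content of user complaints, sampled conversations) surface hypotheses.
Signal: the team has been looking at the dashboard for a long time, yet nobody has read the actual user feedback.
Fix: quantitative as the trigger (metric drift), qualitative as the follow-up (read samples, find patterns).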

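Eval evolution path

Which quadrants to prioritize depends on the stage of the LLM application.

Stage 0: MVP (no eval at all)

Problem: "a demo is good enough", and correctness relies entirely on manual testing.

First to add: Objective + End-to-end + Quantitative. Run at least 10 representative cases; being able to see a "runs successfully" rate is enough.
Too early: subjective eval, or anything that needs a judge / human rating. At the MVP stage, get the system running stably first.

Stage 1: Real users

Problem: production occasionally has bugs, users occasionally complain, and you cannot tell which issues are systematic and which are random.

Second to add: Objective + End-to-end + Qualitative. Read incidents, read sampled traces, find patterns.
Third to add: Objective + Component + Quantitative. Add component-level evals for critical components (retrieval / tool call / parse) and run them in production.
Not yet: a full subjective rubric. Fix the objective failures first.

Stage 2: Continuous quality improvement

Problem: the objective side is stable, and user complaints are mostly at the subjective layer (tone, helpfulness, whether recommendations fit).

Fourth to add: Subjective + Component + Quantitative. Use LLM-as-judge to score each step and A/B test prompt changes.
Fifth to add: Subjective + End-to-end + Quantitative. CSAT, thumbs up/down, appeal rate.
Required: subjective eval and qualitative review must go together: quantitative gives the direction, qualitative gives the fix hypothesis.

Stage 3: Scale, across teams

Problem: multiple products / teams share one LLM infra, and eval has to be cross-cutting.

To do: standardize the eval pipeline, cover quadrants 1-7, and make qualitative review a ritual (weekly incident review, monthly reading of sampled traces).
The point is not "have everything" but "every quadrant has a clear owner".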

Closing the loop between eval and trace

Eval does not stand alone; it forms a closed loop with 4.20 LLM tracing:

[Production traffic]
       ↓
   [LLM trace]  ← every call / agent step / tool is recorded
       ↓
   ├── Real-time monitoring (latency / cost / error rate)
   ├── Sampled into the eval set (human labels + LLM judge)
   └── Failed cases go into the regression set (so a prompt change cannot break the same case again)
       ↓
   [Eval pipeline]
       ↓
   ├── Component eval (per-component accuracy)
   ├── E2E eval (system-wide success rate)
   └── Subjective eval (judge / human rating)
       ↓
   [Insights]
       ↓
   ├── Quantitative: metric drift alert
   └── Qualitative: error pattern → hypothesis → fix prompt / tool / RAG
       ↓
   [Change goes to production]
       ↓
   [Back to production traffic; watch the metric converge]

Production traces are not only a debugging tool; they are the living source of the eval set. Design details of the trace + eval loop are in 4.20.

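Division of labor with the other eval chapters

Chapter	Focus
4.13 (this chapter)	Meta: the design coordinate system of choosing axes first, then tools
4.14 Benchmarking	Methodology for concrete benchmarks and for your own eval set
4.20 LLM tracing	How traces feed eval; production observability
4.21 LLM-as-Judge	The core tool for subjective eval; rubric / pairwise / bias mitigation

Suggested reading order: read this chapter first to build the coordinate system, then branch into the chapter that matches your current pain point. Subjective eval pain → 4.21; designing your own benchmark → 4.14; production observability → 4.20.

Four design conditions for an effective eval system

For an eval system to keep producing useful signal, it must meet four conditions. Each condition corresponds to a common degradation mode, so the list doubles as a checklist.

Condition 1: Use the judge only on the subjective axis

Keep LLM-as-judge for subjective behavior with no ground truth (tone, helpfulness, clarity of explanation); use deterministic checks for objective behavior (amounts, JSON schema, API arguments). A judge costs 1-2 orders of magnitude more than a deterministic check and is less precise, so it is clearly a bad trade.

Counterexample: "make every eval an LLM judge", where the judge is misused on objective behavior, the cost multiplies, and precision drops.

Condition 2: Every metric has an owner, a threshold, and an action

Every production metric must state: who watches it (owner), what number triggers an alert (threshold), and what happens after the alert (action). A metric without these three is noise.

Counterexample: a dashboard with 50 metric charts that nobody reviews regularly, so bugs are still discovered through user complaints.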
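One way to enforce condition 2 is to make owner, threshold, and action required fields of the metric definition itself, so a metric cannot be registered without them. The sketch below uses assumed names (`MetricSpec`, `check_metric`) and the illustrative 80% alert level from the case study.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MetricSpec:
    name: str
    owner: str          # who watches it
    alert_below: float  # what number triggers an alert
    action: str         # what happens after the alert

def check_metric(spec: MetricSpec, value: float) -> Optional[str]:
    """Return an alert message naming owner and action, or None if healthy."""
    if value >= spec.alert_below:
        return None
    return (f"[{spec.name}] {value:.2f} < {spec.alert_below:.2f}; "
            f"owner={spec.owner}; action={spec.action}")
```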

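Condition 3: The eval set stays in sync with production traffic

Continuously sample production traces into the eval set, and review every quarter whether the eval set still matches the traffic distribution.

Counterexample: the eval set was fixed two years ago, production traffic has since drifted far from it, and passing eval no longer means users are satisfied.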
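The quarterly review can start from a simple distribution comparison. The sketch below assumes each eval case and each sampled production trace carries a category label (for example an intent name), and uses total variation distance as the drift measure; both the labeling scheme and the choice of distance are assumptions, not something the chapter prescribes.

```python
from collections import Counter
from typing import Dict, List

def category_shares(labels: List[str]) -> Dict[str, float]:
    """Share of each category in a list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {name: count / len(labels) for name, count in counts.items()}

def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0 means identical mix, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)
```

A rising distance between the eval set mix and the recent traffic mix is a cue to resample, and the per-category gaps show which kinds of case to add.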

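Condition 4: Keep a frozen baseline

A frozen baseline is one specific prompt + one specific model, frozen after running in production for a while; every new version is compared against it, and it is refreshed periodically with the refresh date recorded. Drift has to be visible before it can be managed.

Counterexample: every A/B test compares against "the latest version", so long-term accumulated drift is invisible and "did it get better overall?" cannot be answered.

What goes stale and what does not

Parts that will not go stale:

The three-axis coordinates (objective / component / quantitative as three binary axes).
The structural classification of tools across the eight quadrants.
The signals and fixes for the three kinds of axis mismatch.
The eval evolution path (MVP → users → optimization → scale).
The design of the eval / trace loop.
The four design conditions for an effective eval system.

Parts that will change:

Specific eval frameworks (OpenAI Evals, Promptfoo, Braintrust, Langfuse, and others will keep evolving).
Specific prompt templates and bias mitigation techniques for LLM-as-judge.
The authority of individual benchmarks (it turns over roughly every six months).

Next chapter: 4.14 Benchmarking and evaluation methodology, which brings the coordinate system down to concrete benchmark design. For subjective eval tools see 4.21 LLM-as-Judge; for how production traces feed eval see 4.20 LLM tracing; for the relationship to the fuzzy engineering paradigm see 0.8 (testing fuzzy behavior is, in essence, a distribution metric).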