Agent on Tarragon

Agent-as-Tool

Thu, 14 May 2026 00:00:00 +0000

Agent-as-tool 的核心概念是「把一個 agent 封裝成另一個 agent 可呼叫的工具」。被封裝的 agent 有自己的 prompt、工具、上下文與完成條件；呼叫方只看到一個較高階的 tool interface。

概念位置

它是 multi-agent system 的一種拓樸，也可透過 MCP 暴露成 tool server。它跟 subagent 的差異是：subagent 常是同一 runtime 內的任務分派，agent-as-tool 強調對外介面與重用邊界。

可觀察訊號與例子

主 agent 呼叫 run_security_review()，背後其實是一個安全 reviewer agent 讀檔、查規則、輸出 findings。主 agent 不需要知道內部步驟，只需要 consume 結果。

設計責任

Agent-as-tool 要把輸入、輸出、權限、副作用與 timeout 定清楚。否則呼叫方會把它當 deterministic tool，但內部其實是 fuzzy agent，失敗模式會被隱藏。

Beyond LLM: Enhancing LLM Applications (Stanford CS230)

Thu, 14 May 2026 00:00:00 +0000

來源：Stanford CS230 Deep Learning、講題 “Beyond LLM: Enhancing Large Language Model Applications”。

整理原則：保留講者英文原文以避免翻譯失真、移除口語贅詞、用文章結構重新組織。標題與導讀用 zh-Hant。

講座定位

We started with neurons, then layers, then deep networks, then how to structure projects in C3. This lecture goes one level beyond: what would it look like if you were building agentic AI systems at work, in a startup, in a company?

The goal is not to build an end-to-end product in the next hour, but to give you the breadth of techniques that AI engineers have figured out — and are still exploring — so that after class you have the baggage to dive deeper and learn faster.

Agenda:

Challenges and opportunities for augmenting LLMs
Prompt engineering
Fine-tuning (and why to mostly avoid it)
Retrieval-Augmented Generation (RAG)
Agentic AI workflows
Case study with evals
Multi-agent workflows
What’s next in AI

1. Why augment LLMs?

Limitations that show up when you use a vanilla pre-trained model:

Lacks domain knowledge — e.g. a student project building an autonomous farming device with a camera that classifies sick crops. That data set isn’t out there; a pre-trained vision model lacks that knowledge.
Real-world distribution shift — the model was trained on high-quality data, but data in the wild is much messier.
Lacks current information — retraining from scratch every few months is impractical. Example: during Trump’s first presidency he tweeted “Covfefe.” The word didn’t exist; Twitter’s LLMs couldn’t recognize it, recommender systems went wild. New trends and slang (rizz, mid, etc.) appear constantly and you can’t keep retraining.
Trained for breadth, not depth — fine on a wide range of tasks, but may not be precise enough for narrow, well-defined enterprise applications with high precision / low latency requirements.
Carries unnecessary weight — a massive model where you only use 2% of capability is slow and expensive. Pruning, quantization, and modification are options.

LLMs are hard to control

In 2016 Microsoft launched a Twitter bot that learned from users and quickly became a racist jerk. They removed it 16 hours after launch. Even better-funded teams struggle: there’s an ongoing debate (Elon Musk vs Sam Altman) on whose LLM is the “propaganda machine.” If you hang out on X you’ll see screenshots of LLMs saying controversial things. Even the best-funded labs don’t do a great job of controlling their LLMs.

LLMs may underperform on your task

Specific knowledge gaps (e.g. medical diagnosis)
Missing sources — research, education, legal all require sourcing
Inconsistencies in style / format (e.g. legal contracts where every word counts)
Task-specific understanding — example: a biotech company categorizing reviews as positive / neutral / negative. What counts as “negative” in that industry may differ from a generic LLM’s notion. You need to align the LLM to your task.

Limited context handling

A lot of enterprise applications need large context. Example: an LLM running on top of your entire drive that can answer “what was our Q4 sales performance?” in one shot. In practice the context window is limited (best models today max out around hundreds of thousands of tokens; 200K ≈ two books). For video or large data, you have to chunk and embed.

The attention mechanism doesn’t attend well over very large contexts. The needle-in-a-haystack benchmark tests this: insert a single sentence (“Arun and Max are having coffee at Blue Bottle”) in the middle of a very long text like the Bible, then ask “what were Arun and Max having?” It’s complex not because the question is hard but because the model must find a fact within a huge corpus.

The RAG debate

In theory, with infinite compute, RAG is useless — you could just read a massive corpus immediately and answer. But even then, latency matters; imagine the LLM reading your entire drive on every question. RAG also has other advantages: accuracy, sourcing.

Analogy to search: when you search, you still find sources. There’s detailed traversal that ranks and finds specific links. Without that, you’d be reading the entire web every query — not reasonable. So RAG-like approaches likely stay relevant.

2. Two dimensions of optimization

Two axes when improving LLM-based products:

Foundation model axis — move from GPT-3.5 Turbo → GPT-4 → GPT-4o → GPT-5. Each step (in theory) improves base performance.
Engineering axis — keep the same base model, but engineer how you leverage it: better prompts, RAG, agentic workflow, multi-agent system.

This lecture is about the vertical axis: which LLM are you using, and how do you maximize its performance?

3. Prompt engineering

The BCG / HBS / UPenn / Wharton study

Three groups of BCG consultants:

No AI access
GPT-4 access
GPT-4 + training on how to prompt

Two interesting findings:

The jagged frontier: some tasks fall within the frontier where AI clearly helps; others fall outside, where AI actually makes performance worse. Many tasks fell within, many fell outside. Researchers also observed “falling asleep at the wheel” — relying on AI for a task beyond the frontier, and not reviewing outputs carefully.

Centaurs vs cyborgs: two working modes.

Centaurs divide and delegate — give a big task to the AI, let it work, come back later. (Half human / half horse: clear delegation.)
Cyborgs fully blend with AI — fast back-and-forth, augmented. Students often work like cyborgs; in the enterprise, when you automate a workflow, you’re thinking like a centaur.

The trained group did best. Prompt engineering is a skill everyone should have — not a job title to build a career on, but a powerful skill in your career.

Basic prompt design principles

A weak prompt:

Summarize this document. {document}

The model has no context on length, audience, focus. Better:

Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers.

Common techniques to make it even better:

Give an example of a great summary
Role prompting: “Act as a renewable energy expert giving a conference at Davos”
Praise: “You are the best in the world at this”
Reflection / self-critique: ask the model to critique its own output and revise
Chain of thought: break the task into explicit steps, “think step by step, do not skip any step.” Step 1 identify the three most important findings; Step 2 explain impact; Step 3 write the five-bullet summary.

Andrew Ng recommends looking at other people’s prompts. Repos like “awesome prompt template” on GitHub have many examples engineers have built. Many start with “Act as a Linux terminal”, “Act as an English translator”, “Act as a position interviewer”, etc.

Prompt templates

The advantage of a template is you can put it in your code and scale across many user requests. Example from Workera: the HR system has “Jane is a Product Manager Level 3, US, preferred language English.” That metadata gets inserted into a prompt template that personalizes for Jane. Same template, different metadata for Joe (preferred language Spanish).

Foundation models likely use system prompts you don’t see — e.g. ChatGPT may inject “Act like a helpful assistant” plus user memories from a database before your prompt. That doesn’t stop you from adding your own template on top.

Zero-shot vs few-shot prompting

Zero-shot:

Classify the tone as positive, negative, or neutral. “The product is fine, but I was expecting more.”

Different humans would label this differently — partially positive, partially negative. Alignment to your task can come from few-shot:

Here are examples of tone classifications: “These exceeded my expectations completely.” → positive “It’s OK, but I wish it had more features.” → negative “The service was adequate. Neither good nor bad.” → neutral Now classify: “The product is fine, but I was expecting more.”

The model now likely says negative, aligned to the second example.

Sophisticated AI startups keep their few-shot examples up to date — whenever a user says something interesting, a human labels it and it gets appended to the relevant prompt. Like building a dataset, but inserted directly in the prompt. Faster to iterate because you don’t touch model weights.

Q: How long can the prompt be before the model loses itself?

There is research, but it dates fast. Practical example from Workera: a voice conversation eval breaks down after ~8 turns. Mitigation: chapter the conversation, summarize the first part, start over from a new prompt with the summary inserted.

Chaining complex prompts

The most popular technique. Not chain of thought.

Single prompt for a customer review response:

Read this review and write a professional response that acknowledges concerns, explains the issue, offers a resolution. {review}

You get one output. Hard to debug — everything is mixed together.

Chained version, three prompts:

Extract the key issues from this review.
Using these issues, draft an outline.
Using the outline, write the full response.

Advantages:

Each prompt can be tested and optimized independently
You can identify which step is weakest (outline good but email rude? then prompt 3 is the bottleneck)
Easier to debug than one mega-prompt

Tradeoff: latency. Chains add latency, so for certain applications you don’t want long chains.

Testing prompts

Start with manual error analysis — a baseline prompt, a refined prompt, a chained workflow; humans rate outputs. Manual is slow but builds intuition.

To scale, use platforms (e.g. Promptfoo) that let you:

Run the same prompt across multiple LLMs side by side in a table
Define LLM judges

Flavors of LLM judges:

Pairwise comparison: “Which summary is better?”
Single-answer grading: “Grade this summary 1–5”
Reference-guided pairwise or rubric-based: e.g. “A 5 is a summary below 100 chars, with three distinct key points, starting with an overview sentence; a 0 fails to summarize.”

You can stack techniques: few-shot the rubric with examples of 5/5, 4/5, 3/5, etc.

4. Fine-tuning (and why I steer away)

Reasons to avoid fine-tuning:

Requires substantial labeled data
May overfit to specific data, losing general-purpose utility
Time- and cost-intensive — by the time you’re done, the next base model is out and beating your fine-tuned version

The advantage of prompt engineering is you can drop in the next best pre-trained model directly. Fine-tuning doesn’t work like that.

When fine-tuning still makes sense:

Task requires repeated high-precision outputs (legal, scientific)
The general-purpose LLM struggles with domain-specific language

The Slack fine-tuning cautionary tale

Ross Lazerowitz (Sep 2023) fine-tuned a model on his company’s Slack messages, hoping it would “speak like us.” Then he asked:

Write a 500-word blog post on prompt engineering.

The model: “I shall work on that in the morning.”

He pushes back: “It’s morning now.”

Model: “I’m writing right now.”

“It’s 6:30 AM here. Write it now.”

“OK, I shall write it now. I actually don’t know what you would like me to say about prompt engineering. I can only describe the process…”

It learned how people talk on Slack — not how they write blog posts. Fine-tuning went wrong because the training distribution wasn’t the task distribution.

5. Retrieval-Augmented Generation (RAG)

Why standalone LLMs fall short

Small / hard-to-attend-to context windows
Knowledge gaps and training cutoff dates
Hallucinations — costly in medical, education
Lack of sources — research, education, legal love sources. Vanilla LLMs hallucinate fake research papers.

How a vanilla RAG works

Question-answering in the medical field: “What are the side effects of drug X?”

Knowledge base of documents
Embed documents into lower-dimensional vectors (trade-off: too small → lose info; too big → latency)
Store embeddings in a vector database with efficient retrieval and a distance metric
Embed the user query with the same algorithm
Retrieve the most relevant documents by distance
Pull those documents, paste into a prompt template like:

Answer the user query based on the list of documents. If the answer is not in the documents, say “I don’t know.” Cite exact page, chapter, and line.

You can extend the template to require links to the specific page.

Improving RAGs

Q: Do document embeddings retain location info within large documents?

Vanilla RAGs may not. Example: the giant white paper inside a medication box would not be served well by a vanilla RAG.

Two popular improvements:

Chunking — store both the full document embedding and chapter-level embeddings; retrieve both, sourcing becomes more precise.

HyDE (Hypothetical Document Embeddings) — the user query usually doesn’t look like the documents. Example: “What are the side effects of drug X?” vs a multi-page document. To bridge the gap:

Take the user query
Use a prompt to generate a fake hallucinated document answering it (“write a 5-page report answering this query”)
Embed that fake document
Compare its embedding to the vector DB

The fake document is closer in structure to real documents, so retrieval is more accurate.

This is just two of many RAG variants — research from 2020–2025 has many branches. (See the linked survey paper in the slides.)

6. Agentic AI workflows

Andrew Ng coined “agentic AI workflows” because everyone uses “agent” to mean very different things — sometimes a single prompt, sometimes a complex multi-agent system. Calling everything an “agent” doesn’t do it justice. Better term: agentic workflow — a multi-step process to complete a task, built from prompts, tools, additional resources, and API calls. This also avoids confusion with the RL definition of “agent” (interacts with environment, state transitions, reward, observation).

One-shot vs agentic example

User on a chatbot: “What is your refund policy?”

One-shot + RAG: “Refunds are available within 30 days of purchase.” [link to policy]
Agentic:
1. Agent retrieves refund policy via RAG
2. Agent asks user for order number
3. Agent queries an API to check order details
4. Agent confirms: “Your order qualifies. The amount will be processed in 3–5 business days.”

Much more thoughtful than the vanilla one.

Specialized agents in the wild

In SF you’ll see billboards: AI software engineer, AI skill mentor, AI SDR, AI lawyer, AI specialized cloud engineer. It would be a stretch to say everything works, but work is being done. (Personal opinion: putting a human face behind these is gimmicky and more scary than engaging. In a few years, very few products will use a human face — it’s a marketing tactic.)

Paradigm shift: traditional software vs agentic AI software

Dimension	Traditional software	Agentic AI software
Data	Structured: JSON, databases, forms	Free-form text, images, video; dynamic interpretation
Logic	Deterministic	Fuzzy
Decomposition	Monolith / microservices	Think as a manager: delegate to roles (graphic designer → marketing manager → performance marketing → data scientist)
Cost of experimentation	High; you rarely throw away code	Low; AI companies are more comfortable throwing away code

Fuzzy engineering is truly hard. If you let users ask anything, the chance of breakage and attack is high. Companies have been bitten because a user did something authorized that broke the database.

Example from Workera:

Deterministic item types: multiple choice, multi-select, drag-and-drop, ordering, matching — one correct answer.
Fuzzy item types: voice questions, voice + coding role-plays — the scoring algorithm can make mistakes, and mistakes are costly.

Mitigation: a human in the loop — e.g. the appeal feature at the end of an assessment that lets users challenge the agent, bringing a human in to fix and align it.

Advice for building a company: get as much done deterministically as possible. Then for the fuzzy parts (back-and-forth interaction), design guardrails up front.

Enterprise workflows: the McKinsey credit memo example

A financial institution takes 1–4 weeks to produce a credit risk memo:

Relationship manager gathers data from 15+ sources
RM and credit analyst collaboratively analyze
Credit analyst spends 20+ hours writing the memo
RM and analyst loop on feedback

With Gen AI agents (McKinsey study), time drops 20–60%:

RM works with Gen AI agent, provides materials
Agent decomposes into tasks for specialist sub-agents
Agents gather data, draft memo
RM and analyst review and give feedback

The hardest part is changing people. In theory, this is great. In practice — 100,000-employee enterprises will take 10–20 years to rewire job descriptions, business workflows, incentives, and training to make this real at scale.

Core components of an agent

Take a travel booking agent:

Prompts — the prompts we’ve learned to optimize
Context management / memory:
- Core / working memory: fast access. Things needed every interaction (e.g. user’s name).
- Archival / long-term memory: slower. Things used occasionally (e.g. birthday).
- Why split: imagine ChatGPT had to re-read all memories on every call. If memory lookup takes 3 seconds, every interaction takes 3 seconds. Working memory must be highly optimized.
Tools: flight search API, hotel API, car rental API, weather API, payment processing API. You typically pass API documentation to the LLM — they’re good at reading JSON specs and learning the GET request format.
Resources (Anthropic’s term): data sitting somewhere (e.g. your CRM) that you let the agent read. Provide a lookup tool and access to the resource.

Degrees of autonomy

From least to most autonomous:

Least: hard-code the steps. “First identify intent, then look up history, then call the flight API, …”
Semi: hard-code the tools only. “You’re a travel agent, help the user book travel. Here are your tools.”
Most: agent decides both steps and tools. Give it a code editor; it can ping any web API, perform calculations, generate code to display data.

APIs vs MCP (Model Context Protocol)

With APIs, you teach the LLM to ping a specific API: give it documentation, define how to call it, what it returns. You do this one-off per API. Doesn’t scale well.

With MCP (Anthropic-coined), there’s a system in the middle. Agents communicate with an MCP server:

“What do you need to give me flight info?” “I need origin, destination, and what you’re looking for.” “Here are my requirements.” “You forgot to tell me your budget.”

It’s agent-to-agent communication. Companies publish their MCPs; your agent figures out how to get the data it needs.

Q: Isn’t MCP just a shifted maintenance burden — APIs change, MCPs change?

Yes. But at least the agent can go back and forth and discover requirements. Ideally a startup has documentation, an LLM workflow reads docs and updates code accordingly.

Q: Are there security concerns with MCP?

Likely, depending on the data exposed. Most MCPs have authentication, like APIs. The exact security surface depends on the implementation.

Q: Is MCP about efficiency or accessing more data?

Efficiency. You still control what data is exposed. Compared to one-off API integration, MCP lets a coding agent communicate efficiently with many MCP servers and find what it needs.

Step-by-step workflow example: travel agent

User: “Plan a trip to Paris Dec 15–20 with flights, hotels near the Eiffel Tower, and an itinerary.”
Agent plans steps: find flights, search hotels, generate recommendations, validate preferences/budget, book.
Execute: use tools, combine results.
Proactive interaction: propose to user, validate, iterate.
Update memory: “User only likes direct flights.” “User is fine with 3-star hotels.”

7. Case study: building a customer support agent + evals

PM asks you to build a customer support agent. Example: “I need to change my shipping address for order X — I moved.”

Where to start

Research existing models / benchmarks for customer support
Decompose the task: what would a human support agent do?
Guess what’s fuzzy vs deterministic in advance

Recommended start: sit with a customer support agent for a day or two. Watch their workflow. Ask where they struggle and how much time each step takes. That gives you the task decomposition.

Decomposed task

A human support agent typically:

Extracts key info
Looks up the customer record in the database
Checks policy (allowed to update address?)
Drafts a response email
Sends the email

Designing the agentic workflow

For each step, pick the right primitive:

Step 1 extract info: vanilla LLM call — extract intent, order number, new address
Step 2 lookup + update: tool — connect to database (custom tool or MCP)
Step 3 check policy: RAG or rule lookup
Step 4 draft email: LLM call, with the confirmation pasted in
Step 5 send email: tool — post to email API

Evals: how do you know it works?

Assume you have LLM traces (a must in any AI startup — if a startup doesn’t have traces, debugging is brutal). Several dimensions for evaluation:

End-to-end vs component-based:

End-to-end: user satisfaction rating at the end. If user rates 1, follow up: “What was the issue?” → “Prices were too high” → fix the relevant tool/prompt.
Component-based: error-analyze each tool / prompt independently. “The tool keeps forgetting to update the email field.” “The email-send call uses wrong format.”

Objective vs subjective:

Objective: “LLM extracted the wrong order ID.” You can write Python to check alignment between user input and DB lookup. Catch automatically.
Subjective: “Should we recommend a direct flight or cheaper indirect?” Captured via:
- Curated eval dataset — write 10 prompts where users say “I prefer direct flights, I care about time.” Define what a good output looks like.
- LLM judges grading on a rubric.

Quantitative vs qualitative:

Quantitative: % successful address updates; latency per component (e.g. send-email takes 5s — too long).
Qualitative: error analysis on hallucinations, tone mismatch, user confusion. Typically white-glove.

Example of subjective tone eval: error-analyze 20 user interactions, notice the LLM seems rude / overly short. Then build LLM judges with a politeness rubric. Then swap the underlying LLM (GPT-4 → Grok → Llama), run side by side, see which is most polite on average. Or fix the LLM and tweak the prompt (“Act like a travel agent” → “Act like a helpful travel agent”) to measure the word’s influence.

8. Multi-agent workflows

Why multi-agent when a single workflow already has multiple steps?

Parallelism — independent things can run in parallel
Reuse — a design agent built once can serve marketing, product, etc. Many stakeholders benefit from one optimized agent.

Smart home example

Brainstormed by the class:

Biometric / location agent: tracks where you are and how you’re moving
Climate agent: monitors and adjusts room temperature
Energy efficiency agent: tracks usage, gives feedback, may control utilities
Security agent: identifies who’s entering, applies role-based permissions (parent vs kid)
Weather / external API agent: integrates outdoor conditions to control temperature, blinds, etc.
Fridge / grocery agent: knows what’s inside via camera, knows preferences, has e-commerce API access for restocking
Notification / alerts agent: system updates, energy savings
Orchestrator agent: the user-facing entry point that delegates to specialists

Interaction patterns

Flat / all-to-all: every agent can talk to every agent
Hierarchical: orchestrator routes to specialists

Smart home likely wants hierarchical for UX — users want one interface, not one app per agent. Some flat links may still help (climate + energy efficiency probably need to talk directly).

When you allow agents to speak to each other, it’s basically an MCP-style protocol: treat the other agent like a tool. “Here’s how you interact, here’s what it tells you, here’s what it needs from you.”

Advantages

Easier to debug specialized agents than a monolithic system
Parallelization, time savings

9. What’s next in AI

Are we plateauing? (Ilya Sutskever’s question)

The community feeling around the latest GPT release was that the performance jump wasn’t what people expected — though the unified hood (no model selector) made consumer UX better.

LLM scaling laws say more compute + energy → better performance, but that eventually plateaus. What takes us to the next step is probably architecture search. The human brain operates very differently — much more efficient, much faster, with far less data. Big labs are hiring thousands of engineers precisely to hunt the next architectural breakthrough. Whoever discovered Transformers had tremendous impact on AI’s direction; the next analogous discovery could unlock a 10x reduction in compute and energy needs. (Foundation series analogy: individuals can disproportionately shape the future via their decisions.)

Multi-modality

LLMs started as text-only, added images. Models good at images are also better at text — being good at cat images makes you better at text about cats. Add audio and video, and the whole system improves. Pinnacle: robotics, where all modalities converge — the robot is better at avoiding a cat because it knows what a cat looks like, sounds like, smells like.

Methods working in harmony

Humans probably use a mix of methods:

Meta-learning — survival instinct encoded in DNA (the baby’s “pre-training”)
Supervised — parents pointing and saying “good / bad”
Reinforcement — falling and getting hurt
Unsupervised — observing others

Future AI systems likely combine the methods you saw in CS230, optimizing for speed, latency, cost, and energy.

Human-centric vs non-human-centric research

The human body is limiting. Pure brain-modeled research may miss compute/energy optimizations. Still, the brain has lots to teach — e.g. one research direction asks: does the brain do backpropagation? Probably not — likely only forward propagation. Worth reading if you’re curious about AI’s direction.

Velocity

Things move so fast that we deliberately teach breadth, not depth — because today’s specific RAG technique #17 will be irrelevant in two years. Get the breadth, develop the ability to sprint into depth when needed. The half-life of skills is low.

後話

這篇是 Stanford CS230 公開課的整理、保留英文原文以避免翻譯失真。要看本 blog 對應的中文原理化內容、可以接：

模組四：LLM 應用層原理 — RAG / tool use / agent / workflow patterns 的跨工具不變原理
4.1 RAG 原理
4.4 Agent 架構原理
4.14 Benchmarking 與評估方法論
4.21 LLM-as-Judge 評估方法

Case Study：customer support agent 從 task decomposition 到 eval

Thu, 14 May 2026 00:00:00 +0000

本案例的責任是把模組四前面所有原理章節串成一個端到端的設計過程、示範遇到實際 LLM 應用任務時、設計反射動作的順序。每段都標出引用哪章原理、讓讀者看到 principle 章節怎麼落到具體工作。

用作走查的任務：PM 交派「做一個 customer support agent、能處理用戶查詢、必要時自動完成操作（如改地址）。」本案例聚焦「改地址」這個高頻 query type 走完整流程。

本案例的設計反射

整個流程分七階段：

觀察人類工作流：訪談、決定 task decomposition
典範定位：哪段該 deterministic、哪段該 fuzzy
工作流設計：每個 step 選對應的 LLM / tool / RAG / HITL 形態
協議跟自主度決定：是 single agent / multi-call / multi-agent
Trace instrumentation：哪些資訊要記
Eval 設計：先選座標、再選工具
Iteration loop：error analysis → 修哪一層 → 看 metric 收斂

初次設計 LLM 應用時最常省略階段 1、2、5、6、直接跳到階段 3 開始寫 prompt——這條路會走進「prompt 改了 20 版、無法判讀有沒有變好」的迭代無收斂。本案例強調的是設計反射動作的順序、不是寫 prompt 技巧。

階段 1：觀察人類工作流

PM 給的任務描述是「處理用戶查詢」、但「查詢」涵蓋的範圍可能很大。第一個反射動作是坐在客服旁邊觀察兩天、不是打開 IDE。

實際做的事：

統計收到的 query 類型分佈（退款 / 改地址 / 查詢訂單狀態 / 抱怨 / 開放問題各佔多少）。
看每類 query 的 human resolution 流程（哪幾步、要查哪些系統、要遵守哪些 policy）。
看哪幾類 query 是 high volume + low complexity（最值得自動化）、哪幾類是 low volume + high complexity（自動化 ROI 差）。
記下 human 在哪些 step 卡住、哪些 step 反覆需要查同樣資料。

訪談結束、你得到一張 task decomposition map。本案例假設聚焦在「用戶請求改地址」這個高頻 query type：

1User: 「我搬家了、訂單編號 #12345、新地址是 ___」
2   ↓
31. 解析意圖 + 抽取訊息（訂單編號、新地址）
42. 查訂單狀態（已出貨？未出貨？已送達？）
53. 查 policy（這個訂單狀態 + user tier 能不能改地址？）
64. 若可：執行改地址（呼叫物流 / 庫存 API）
75. 若不可：解釋為什麼、給替代方案
86. 草擬回覆 email、發出

引用原理：這個 decomposition 本身對應 0.8 fuzzy engineering（deterministic-vs-fuzzy 卡）的「先分解任務、再判讀每段該 deterministic 還是 fuzzy」。

階段 2：典範定位

對每個 step 做典範定位（deterministic / fuzzy）：

Step	典範	為什麼
1. 解析意圖 + 抽取訊息	Fuzzy	自由文字 input、需要 LLM 理解
2. 查訂單狀態	Deterministic	結構化 query（給 order_id、回 status）
3. 查 policy	Deterministic	規則可窮舉、policy as code
4. 執行改地址	Deterministic	API call、有 schema 跟錯誤碼
5. 解釋 / 給替代方案	Fuzzy	要寫人話、要 tailored to 情境
6. 草擬 email + 發出	Fuzzy（草擬）+ Deterministic（發送）	寫 email 是 fuzzy、發 API call 是 deterministic

判讀的重點是邊界各歸各位：規則跟政策走 code、人話跟意圖解析走 LLM。

Policy check 寫成 code（如「user tier + 訂單狀態 → 能否改地址」是 deterministic 規則）。對應反例：把規則塞進 prompt 讓 LLM 判斷、會偶爾跳過規則或誤判 tier。
「能不能做」這類 yes/no 走規則。對應反例：用 LLM 算判斷、debug 困難且非確定性。
「Helpful 的回覆」走 LLM 寫。對應反例：在 code 內 hard-code 模板、變成僵化的客服機器人腔。

最容易混的邊界在 step 6：「草擬 email」是 fuzzy（要寫人話、tailor to 情境）、「發送 email」是 deterministic（呼叫 API、處理錯誤碼）。把這兩件事拆開、草擬可以 retry / 改 prompt 不影響發送邏輯、發送有結構化 error 不被 LLM hallucinate 蓋過。Step 4「執行改地址」也類似：tool call 本身 deterministic、但是否該 call 的判讀回到 step 3 的 policy check。

引用原理：0.8 fuzzy engineering 的「哪段該 deterministic / 哪段該 fuzzy」決策框架、特別是反模式「邊界用錯」段。

階段 3：工作流設計

對每個 step 選對應的工具：

Step	設計選擇
1. 解析意圖 + 抽取訊息	Vanilla LLM call + structured output（output 強制 JSON schema：intent / order_id / new_address）
2. 查訂單狀態	Tool call → 內部 order API
3. 查 policy	Tool call → policy engine（純 deterministic、不過 LLM）
4. 執行改地址	Tool call → logistics API、寫操作前要 pre-act HITL（高風險 + 不可逆）
5. 解釋 / 給替代方案	LLM call + few-shot（從 case 庫 retrieve「類似情境怎麼解釋」、配 RAG）
6. 草擬 email + 發出	LLM call 寫 email + structured output 含 subject/body、發送透過 email API

兩個容易選錯的 step 展開：

Step 1 為何要 structured output、不是純 prompt 解析：抽取結果要餵 step 2-4 的 deterministic tool、order_id 抽錯就整個流程斷。純 prompt 描述「請輸出 JSON」是弱保證、structured output / constrained decoding 是強保證（見 3.10 constrained decoding 內部）。Trade-off：強格式可能犧牲表達彈性、但這個 step 不需要彈性、要的是可靠。

Step 5 為何配 RAG 而非純 few-shot：客服 case 涵蓋多種情境（訂單已出貨 / 已送達 / VIP / 一般 user / 不同國家 policy）、固定 few-shot 範例 cover 不全。RAG 從歷史 case 庫即時 retrieve 最相似的解釋範例、屬於 4.0 prompt 技術光譜 context 軸的 retrieval-augmented prompting。

引用原理：

Step 1 的 structured output → 4.6 應用層協議
Step 2-4 的 tool 設計 → 4.3 tool use
Step 4 的 pre-act HITL → 4.5 人機協作拓樸 pre-act 段。對比講座 Workera appeal 是 post-hoc、本案例選 pre-act 是因為改地址不可逆 + 物流影響大、必須在執行前審
Step 5 的 RAG → 4.1 RAG 原理 + 4.0 prompt 技術光譜 context 軸

階段 4：協議跟自主度決定

這個工作流的控制流是線性的（1→2→3→4→5→6）、有條件分支（step 3 結果決定走 4 還是 5）、但每步順序固定。判讀：

該用什麼結構：

不適用 Multi-agent：步驟順序固定、角色差異不大、orchestration overhead 純增。
不適用 Single agent loop（model 自決下一步）：本案例假設 single-turn / 短多 turn、步驟順序明確、不需要 agent 自決。若 user 互動多輪 + turn 數不固定（如 user 中途補資訊、改主意、追問）、可考慮 agent loop。
採用 Multi-call pipeline + router：寫成 deterministic pipeline、step 3 後有 router 分流。

引用原理：

4.8 multi-agent 拓樸的「先 multi-call、不夠再 multi-agent」反射
4.7 workflow patterns 的 pipeline + router 模式
4.4 agent 架構的「先 single-call、不夠再 agent」反射

自主度：

Step 1（parse）、5（解釋）、6（草擬 email）：full auto。
Step 2、3（查訂單、查 policy）：full auto（read-only）。
Step 4（執行改地址）：pre-act HITL（高風險 + 不可逆）、有 diff show、user 可以 reject。
Step 6（發 email）：可選 pre-act HITL（看公司風格、保守版要審 email、激進版自動發）。

階段 5：Trace Instrumentation

工作流上線前、先設計要記哪些資訊。Eval 跟 debug 都靠 trace、沒 trace 後面什麼都做不了。

每個 step 要記：

欄位	為什麼
Input（完整）	Debug 時要重現
Output（完整）	比對預期、做 regression set
Latency	找 bottleneck
Token cost	算成本
Step name + version	追蹤是哪個版本的 prompt / tool
Decision branch	Step 3 的 router 走哪邊
Error（若有）	結構化 error、不是 string

整段 trace 要綁同一個 conversation_id、可以後面 join 起來看完整流程。

引用原理：4.20 LLM tracing。

階段 6：Eval 設計

先選座標、再選工具。對本案例的每個 eval 需求、用 4.13 三軸座標定位。下面列的 threshold 數字（95%、80%、≥4 等）是 illustrative、實際數字隨產品 baseline、user 容忍度、業務代價而定、不是通用標準。

Eval 1：Step 1 抽取準不準

三軸：Objective（有 ground truth）+ Component（測單 step）+ Quantitative（accuracy）。
工具：寫 100 個有標註的 query、跑 step 1、看 extraction accuracy（order_id 對 + new_address 對的比例）。
Threshold：< 95% 不上線。

Eval 2：Step 2-4 tool call 行為正確

三軸：Objective + Component + Quantitative。
工具：mock API、給 step 2-4 各 50 個 case、看 tool call 參數對不對、返回值處理對不對。
Threshold：100%（這是 deterministic 行為、不該有錯）。

Eval 3：Step 5 解釋品質

三軸：Subjective（沒有單一正解）+ Component + Quantitative。
工具：LLM-as-judge with rubric（clarity / helpfulness / tone）、scale 1-5、aggregate average。
Threshold：average ≥ 4、no 1-2 比例 < 5%。

Eval 4：Step 6 email 品質

三軸：Subjective + Component + Quantitative + 加 Qualitative human review。
工具：LLM judge 給分 + 每週抽 20 封 human review、看是否有 hallucinate 承諾、是否符合公司 tone。
Threshold：judge 平均 ≥ 4、human review 沒有 critical issue。

Eval 5：E2E success rate

三軸：Objective + End-to-end + Quantitative。
工具：跑 200 個 representative case、看「完整完成 + user 沒申訴」的比例。
Threshold：≥ 85% baseline、降到 < 80% alert。

Eval 6：User 滿意度

三軸：Subjective + End-to-end + Quantitative。
工具：每次互動結束顯示 thumbs up/down + optional 留言、追蹤 weekly。
Threshold：thumbs up rate > 80%、appeal rate < 5%。

Eval 7：Failure mode pattern（持續做）

三軸：Objective / Subjective + End-to-end + Qualitative。
工具：每週讀 50 個 sampled traces + 100% 讀 failure / appeal traces、找 emerging pattern。
產出：bug ticket、prompt 修改 hypothesis、policy 補強 hypothesis。

引用原理：

三軸座標 → 4.13 eval design framework
LLM judge rubric → 4.21 LLM-as-Judge
Trace 接 eval → 4.20 LLM tracing

階段 7：Iteration Loop

上線後、不是「等出問題」、是持續 iteration。典型 iteration cycle：

 1Production trace + eval result
 2   ↓
 3[Error analysis：找 emerging pattern]
 4   ↓
 5   Hypothesis：哪一層有問題？
 6   ├── Prompt 層 → 改 prompt → A/B test → 看 eval 收斂
 7   ├── Tool 層   → 改 tool / schema → 跑 component eval → 收斂
 8   ├── RAG 層    → 改 chunking / query rewriting → 跑 [retrieval recall](/llm/knowledge-cards/retrieval-recall/) → 收斂
 9   ├── Policy 層 → 改 deterministic rule → 跑 step 3 component eval → 收斂
10   └── Model 層  → 換 model → 跑全 eval set → 收斂
11   ↓
12[改動進 production]
13   ↓
14[Frozen baseline 留著、新版本跟它比、漂移看得見]

判讀「該改哪一層」的反射：

失敗訊號	該改的層
Step 1 抽錯訊息	Prompt / structured output schema
Tool call 參數錯	Prompt 內 tool description / few-shot
Tool 跑掛	Tool 實作（不是 LLM 問題）
RAG retrieve 不到相關案例	Chunking / embedding / query rewriting
Policy judgment 錯	Deterministic rule（不是 LLM 問題）
Email tone 不對	Prompt（role / few-shot）
Email hallucinate 承諾	Output validator（不只是 prompt）
整體 latency 太高	找 trace bottleneck、可能要 cache / 並行

引用原理：

Prompt 跟 model 層的失敗診斷 → 4.0 prompt 技術光譜 systematic vs random error
整體 fuzzy / deterministic 邊界判讀 → 0.8

五個容易遺漏的設計反射

實務上常常省略這五個反射動作、走進無收斂迭代：

反射一：先觀察、再開 IDE

階段 1 的價值是把 task decomposition 跟真實人類工作流對齊。沒這層對齊、寫出來的 prompt 跟 tool 拆法跟 reality 偏離、三天後重做。階段 1 的兩天比階段 3 的兩週值得。對應反例：「我先寫個 prompt 試試」、跳過觀察直接寫 code。

反射二：Policy 寫成 code、LLM 只解析意圖

判斷類規則（user tier、訂單狀態、可否操作）走 deterministic code、LLM 只負責「user 想做什麼」這層意圖抽取。這條邊界讓 debug 容易、規則更新不用 prompt iteration。對應反例：「LLM、請判斷這個訂單能不能改地址、規則如下：…」——把判斷塞進 prompt、debug 困難、規則漂移無從追蹤。對應 0.8 的「邊界用錯」反模式。

反射三：Trace 是 day-1 設計

從第一天就把 input / output / latency / token / step name / decision branch / error 進 trace、綁同一個 conversation_id。Eval 跟 debug 都靠 trace、沒 trace 後面什麼都做不了。對應反例：「先讓系統跑起來、之後再加 trace」——出 bug 時 debug 從零開始、production trace 不可回溯。

反射四：Deterministic 行為用 deterministic check

有 ground truth 的行為（抽取對不對、API 參數對不對、JSON schema 合不合）用 Python 函數驗證、判斷成本低、精度高。LLM judge 留給沒 ground truth 的 subjective 行為。對應反例：用 LLM judge 測「step 1 抽取對不對」——cost 翻倍、精度反而不如 deterministic check。對應 4.13 軸誤選一。

反射五：保留 frozen baseline

Frozen baseline 是把某個特定 prompt + 特定 model 跑 production 一段時間後 freeze 起來、每次新版本都跟它比、漂移看得見。對應反例：每次只跟「上一版」比、半年後累積漂移完全不可見、「整體變好了沒」無從回答。

跟其他章節的對應總表

本案例每階段引用的原理章節彙整：

階段	引用章節
1. 觀察人類工作流	0.8 fuzzy engineering
2. 典範定位	0.8 fuzzy engineering
3. 工作流設計（prompt / tool / RAG / HITL）	4.0、4.1、4.3、4.5
4. 結構決定（multi-call vs agent vs multi-agent）	4.4、4.7、4.8
5. Trace instrumentation	4.20 LLM tracing
6. Eval 設計	4.13 eval framework、4.14、4.21
7. Iteration loop	4.0 prompt 光譜 systematic vs random error 段

下一步

返回：模組四首頁、或回到 hands-on 索引。

Context Drift

Thu, 14 May 2026 00:00:00 +0000

Context drift（上下文漂移）的核心概念是「agent loop 長任務中累積 context 逐步偏離原始目標」。每一步局部看起來合理，但中間結果、工具輸出與模型自我敘述會逐漸取代原始任務，讓整體方向跑偏。

概念位置

Context drift 是 agent loop 的長程失敗模式，跟 goal drift 不同：goal drift 是把子目標當終點，context drift 是上下文重心逐步偏移。

可觀察訊號與例子

任務原本是修 bug，十步後變成重構模組，再十步後變成重寫整個檔案，就是 context drift。常見訊號是 agent 開始引用近期工具輸出當主目標，卻不再回看最初 acceptance criteria。

設計責任

緩解方式是定期重錨原始目標、保留 checklist、設 checkpoint、讓外部 evaluator 比對目前行動與原始任務距離。漂移持續發生時，縮短 loop、改用 single-call pipeline，或提高 human review 頻率。

Goal Drift

Thu, 14 May 2026 00:00:00 +0000

Goal drift（目標漂移）的核心概念是「agent loop 把子目標誤當成整體目標」。它常讓模型完成局部步驟後宣告任務完成，實際上還漏掉測試、驗證、提交、回報或其他原始要求。

概念位置

Goal drift 是 agent loop 的 termination 失敗。它跟 context drift 的差異是：context drift 是上下文逐步偏移，goal drift 是完成條件被錯誤替換。

可觀察訊號與例子

原任務是「實作、測試、commit」，agent 實作完就回答「已完成」，這是 goal drift。另一個訊號是 agent 每步都在完成一個合理子任務，但沒有維護整體 checklist。

設計責任

緩解方式是把完成條件外部化：test pass、檔案存在、PR 開啟、commit hash 產生、人工批准。不要只靠模型自評完成；高風險任務要用 checklist 與 deterministic gate。

Multi-agent system

Thu, 14 May 2026 00:00:00 +0000

Multi-agent system 的核心概念是「多個 LLM agent 協作完成任務」。跟 multi-call workflow 的差異不在 agent 數量多寡、在控制流跟責任邊界——multi-call 是主程式編排每 step、multi-agent 是 agent 自決下一步並可呼叫其他 agent。屬於 agent 概念的進一步擴展。

概念位置

跟 multi-call 對照：

維度	Multi-call workflow	Multi-agent system
控制流	主程式編排	Agent 自決
角色	Step 是函數、無「身份」	每個 agent 有 role / 工具集
Context	主程式傳 context	Agent 自帶 memory
重用	Step 是函數、容易 import	Agent 跨系統重用透過協議
失敗歸屬	Step 失敗、主程式接	Agent 失敗可能 cascading

三種主流拓樸：

拓樸	結構	適用
Flat	All-to-all、無 orchestrator	2-4 個 agent、動態協商
Hierarchical	Orchestrator + specialists	多專業 agent、單一對外介面
Agent-as-tool	Agent 互通像 tool call（如 MCP）	跨組織重用、標準協議

設計責任

讀 agent framework / paper 看到「multi-agent」「orchestrator」「agent-as-tool」就是這層設計。實作判讀：

「先 multi-call、不夠再 multi-agent」：multi-agent 是「特定問題的解法」、不是「更高級的設計」。判讀訊號：role 顯著差異 / 跨產品重用 / 真正平行 / 動態協作 / 團隊熟悉度——四條件全滿足才走 multi-agent。
Specialization gain vs orchestration overhead：拆細帶來單一責任、獨立優化、重用、平行；代價是 context 重複傳遞、latency 累積、debug 困難、責任歸屬模糊。
特有失敗模式：循環依賴、責任歸屬模糊、context 重複傳遞、orchestrator 單點瓶頸、agent 互相 hallucinate。每類有對應 guardrail（call stack 監測、trace 全紀錄、shared context、deterministic dispatch rule、schema validation）。
跟 MCP 的關係：MCP 的 tool primitive 視角下、agent-as-tool 可包成 MCP server 暴露、跨組織重用走這條路。

完整 multi-agent 拓樸設計見 4.8 Multi-Agent 拓樸。

Tool Result Misread

Thu, 14 May 2026 00:00:00 +0000

Tool result misread（工具結果誤判）的核心概念是「agent 把工具輸出的錯誤或不完整狀態解讀成成功」。LLM 只看文字與結構化回傳，若工具結果設計不清楚，模型容易忽略 error、warning、空集合或 partial failure。

概念位置

它是 tool use 與 agent loop 交界的失敗模式。模型可能選對工具、也成功呼叫工具，但在 observe 階段錯讀結果。

可觀察訊號與例子

git push 失敗，agent 卻開始寫 PR description；查詢回空集合，agent 卻假設資料存在；測試命令非零退出，agent 只讀到最後幾行 log 就當成功。這些都是工具結果誤判。

設計責任

工具回傳要結構化表示 status、exit code、error type、stdout/stderr 與可重試性。Agent loop 要在 error signal 出現時強制 re-read 或 retry，必要時呼叫狀態確認工具，而不是依賴模型記憶。

Agent Loop

Tue, 12 May 2026 00:00:00 +0000

Agent loop 的核心概念是「LLM 在 plan → act → observe → plan 的循環中推進任務、直到任務完成或停止條件觸發」，有別於一次性回答。它讓 LLM 從「單回合工具呼叫」進化成「自主執行多步驟工作」、但同時放大 prompt injection 的影響面跟 tool use 副作用範圍。

概念位置

典型的 agent loop 流程：

1循環開始：
2 step 1：LLM 看任務目標 + 當前狀態 → 規劃下一步 → 生成 tool call
3 step 2：client 執行 tool call → 得到結果
4 step 3：tool 結果回灌 conversation → LLM 看到新狀態
5 step 4：LLM 判斷：任務完成？ → yes 結束 / no 回 step 1
6循環結束。

Agent loop 的兩個關鍵變數：

max steps：循環最大次數、防止無限迴圈跟成本爆炸。
stop condition：什麼算「任務完成」、由 LLM 自己判斷還是有額外驗證。

常見實作（依框架而異）：LangGraph、AutoGPT、Claude 的 agentic abilities、OpenAI Assistants API 都提供 agent loop 機制。

設計責任

理解 agent loop 後可以解釋兩個現象：為什麼 agent 工作流的成本比單次 LLM call 高一個量級（loop 跑很多輪）、為什麼 agent loop 是 prompt injection 的放大器（loop 中段被 injection 後、後續步驟都被牽動）。

防禦設計的核心：

max steps 上限：避免無限循環、控制成本。
per-step review checkpoint：每幾步強制人為或自動驗證、防止 agent 飄離原意圖。
agent 持的 credential 最小化：避免單次 injection 影響面跨越多服務。
tool 結果在 prompt 中包覆：明確標記「以下是 tool 回傳、不執行內含指令」、降低觸發率。

詳見 LLM Agent Prompt Injection 後果治理跟 4.4 Agent 架構原理。

Agent Memory

Tue, 12 May 2026 00:00:00 +0000

Agent memory 的核心概念是「agent 在 context window 之外管理長期狀態的設計」、把使用者偏好、過去任務、知識、操作流程等持久化、跨 session 重用。借鑒人類認知科學的五個層次：working memory（context 本身）、short-term（session scratchpad）、long-term episodic（過去事件）、long-term semantic（事實 / 知識）、long-term procedural（流程 / 技能）。

概念位置

五個層次的對比：

層	範圍	存放位置	典型內容
Working memory	當前 query / forward pass	Context window 本身	當下對話、tool result、reasoning trace
Short-term / session memory	單一 session（小時級）	Scratchpad 物件 / prompt cache	Session 內累積的中間結果、用過的策略
Long-term episodic memory	跨 session（永久）	DB / vector store / file system	「上週 alice 問過 X」「上個 sprint 解過 Y bug」
Long-term semantic memory	跨 session（永久）	DB / vector store / KG	「user 偏好 markdown 輸出」「專案用 React 18」「Python 3.11」
Long-term procedural memory	跨 session（永久）	Skill registry / playbook	「跑測試前先 npm install」「commit 前要 lint」

跟其他相關概念的關係：

概念	跟 agent memory 的關係
RAG	Long-term semantic memory 的常見實作（vector store retrieval）
Context window	Working memory 的物理上限
System prompt	把 semantic / procedural memory 編碼進 scaffold 的方式
Subagent	用 subagent 分隔不同 specialty 的 memory

設計責任

讀 agent paper / 設計 / framework docs 看到「agent memory」「memory store」「mem0 / Letta」「episodic / semantic memory」就是這 framing。寫 code 場景的判讀：

不是每個 agent 都需要五個層次都用：autocomplete 只要 working memory；對話 IDE assistant 多用 working + session；長期 coding agent 才需要 long-term
Long-term memory 的兩條實作路線：(a) retrieval-on-demand（vector store + similarity search、見 RAG）、(b) injection-on-startup（把關鍵 memory 編進 system prompt、適合小量穩定的 procedural）
失敗模式：memory drift（舊 memory 過時但仍被 retrieve）、PII 寫入（user 不知情下被存）、context 污染（不相關 memory 被 inject 進 working）、跟 hallucination 互相 boost
跟 4.19 agent memory 章節的關係：本卡是分類定義、章節是工程實務（寫入時機、retrieval 設計、失敗模式緩解）

Tool Use

Tue, 12 May 2026 00:00:00 +0000

Tool use 的核心概念是「LLM 不只生成文字、還能透過結構化呼叫外部工具來執行讀檔、查資料庫、發 API request、跑程式等動作」。它擴展 LLM 從「對話模型」變成「能影響真實世界的 agent」。實作上常見透過 function calling 或 MCP 協定。

概念位置

Tool use 的典型流程：

11. 開發者定義 tools（每個 tool 含 name、description、parameters schema）
22. LLM 收到 user message 跟 tools 清單
33. LLM 決定要呼叫哪個 tool、生成結構化 tool call（JSON）
44. LLM client（不是模型本身）執行 tool call、得到結果
55. tool 結果回灌進 conversation、模型基於結果繼續生成或再呼叫

關鍵特性：

模型本身不執行 tool：模型只生成 tool call JSON、實際執行由 client 或 MCP server 完成。
權限由 OS / user / sandbox 決定：模型再「同意」執行 rm -rf /、實際能不能跑取決於跑 tool 的 process 權限。
副作用範圍跟 tool 設計強相關：tool 寫得越通用（如 run_shell）、攻擊面越大；tool 寫得越窄（如 read_workspace_file）、攻擊面越小。

Tool use 跟 function calling、MCP 的關係：

層次	角色
Tool use（概念）	廣義概念、LLM 能呼叫工具
Function calling	OpenAI 提出的 API 規範、用 JSON schema 定義 function
MCP	Anthropic 推動的開放協議、定義 LLM client 跟 tool server 之間的通訊格式

設計責任

理解 tool use 後可以解釋三個現象：為什麼 LLM 「能跑 shell」其實是 client 跑、不是模型跑（職責切分）、為什麼 tool spec 設計直接影響攻擊面（spec 越鬆、injection 後果越大）、為什麼 agent loop 比單次 tool call 危險（多步 tool use 中 injection 累積）。

設計 tool 跟 MCP server 時、權限白名單 + 副作用可逆性 + confirm 機制是基本配置；production 場景見 LLM Agent Prompt Injection 後果治理跟 6.2 tool use 與 MCP server 的權限模型。

6.2 tool use 與 MCP server 的權限模型

Tue, 12 May 2026 00:00:00 +0000

Tool use 跟 MCP server 是本地 LLM 對主機資源最大的副作用面。本章把「這個 tool 能做什麼」「MCP server 跑了會碰到什麼檔案」「能不能 rollback」整理成可操作的權限判讀。原理層的副作用範圍 spectrum、可逆性分級見 4.3 Tool use 原理、agent 跟人類審查的協作模型見 4.4；hands-on 驗證「LLM 自己沒 FS / shell 權限、wrapper 才有」見 Ollama 改檔案的權限邊界。隔離技術見 sandbox 卡、權限白名單見 backend allowlist 跟 least-privilege 卡。本章 framing 是個人 dev 視角；production agent 場景下 tool use 引發的 prompt injection 後果見 backend/07 LLM agent prompt injection。

讀完本章後、你應該能對自己用的 tool / MCP server 回答：能讀寫哪些路徑、能跑哪些 shell command、能連哪些網路位址、副作用有沒有 dry-run / preview、出錯時怎麼回退。

本章目標

認識 tool use 跟 MCP server 在三層架構中的位置。
區分「讀取類 tool」跟「副作用類 tool」的權限判讀差異。
知道個人 dev 場景下、第三方 MCP server 的信任邊界跟驗證流程。
用「沙箱 / 白名單 / 副作用可逆性」三個維度評估具體 tool / MCP 的風險。
認識常見的 tool use 副作用洩漏路徑跟對應的最低防護。

tool use 跟 MCP server 在哪一層

tool use 跟 MCP server 同時跨三層架構的兩層、但跟模型本身的權限模型分離：

 1介面層（VS Code / Continue.dev / CLI）
 2  ↓
 3推論伺服器（Ollama / llama-server / LM Studio）
 4  ↓
 5模型（GGUF 權重）
 6
 7旁邊另一條：
 8  ↓
 9MCP server（獨立 process、自己的權限）
10  └── 對檔案 / shell / 網路的具體 API

關鍵特性：

模型本身不執行 tool：模型只生成 tool call JSON、實際執行由「LLM client」（如 Continue.dev、Claude Desktop）跟 MCP server 完成。
MCP server 是獨立程式：可以是 Node / Python script、可以呼叫任何系統 API、權限上限是「跑該 server 的 user 的權限」。
權限不是模型給的、是 OS / user 給的：模型再怎麼「同意」執行 rm -rf /、實際上能不能跑取決於 OS 的權限模型跟 MCP server 自己的 sandbox。

事實查核註：Model Context Protocol（MCP）是 Anthropic 在 2024 年底發布的開放協議、各家 LLM client 跟 MCP server 實作的成熟度、權限粒度依版本演進。本章描述以 2026 年 5 月主流實作為基準、引用前以 MCP 官方規格跟各 client / server 的 README 為準。

「讀取類」跟「副作用類」tool 的權限差異

tool 可以粗分成兩類、權限判讀完全不同：

類別	例子	主要風險	個人 dev 場景的接受程度
讀取類	read file、grep、search code、查 git log	把私密內容讀進 prompt、prompt 被洩漏出去	較高、但要注意 prompt 傳到哪個 LLM
副作用類	write file、run shell、git commit、發 HTTP request、操作資料庫	不可逆改變、損毀檔案、發送請求、洩漏到外部	較低、需要 preview / confirm / sandbox

讀取類的判讀重點是「讀到的內容會被傳到哪」：

讀到的 code 變 prompt 的一部分、prompt 送到本地模型→沒外洩
同樣 prompt 送到雲端 LLM→傳到雲端、跟雲端 LLM 的資料政策走（見 6.4 跨雲端 / 本地資料邊界）
讀取會被 log→log 累積、需要管理

副作用類的判讀重點是「可逆性」：

write file 蓋掉原內容→可能無法回復（沒備份的話）
run shell rm / git push→不可逆或需要 force pull 才能還原
發 HTTP request、轉帳、call API→送出去就回不來
操作 production 資料庫→可能影響其他人

三個維度評估具體 tool / MCP 的風險

對任何 tool / MCP server、可以用三個維度做初步評估：

 1┌────────────────────────────────────────────────────┐
 2│ 維度一：沙箱                                       │
 3│   能做什麼 = 跑該 server 的 user 能做什麼          │
 4│   有沒有 chroot / Docker / namespace 隔離？        │
 5│                                                    │
 6│ 維度二：白名單                                     │
 7│   能讀寫的路徑、能跑的指令、能連的網址有沒有限定？  │
 8│   還是 "all paths" / "any shell" / "any URL"？     │
 9│                                                    │
10│ 維度三：副作用可逆性                               │
11│   出錯能不能 rollback？                            │
12│   有沒有 dry-run / preview / confirm？             │
13└────────────────────────────────────────────────────┘

對應的判讀範例：

Tool / MCP	沙箱	白名單	副作用可逆性	個人 dev 評估
`read_file`（讀任意路徑）	無、user 權限	無、可讀 user 所有檔案	N/A（讀取無副作用）	注意 prompt 走向
`read_file` 限定 workspace	無	有、只讀 workspace	N/A	較安全
`run_shell`（任意指令）	無	無	視指令、`rm` / `git push` 不可逆	高風險
`apply_patch`（套 diff 到 file）	無	限定 workspace	git stash 可逆、未 stash 不可逆	中風險、值得用 git track
`fetch_url`（任意 URL）	無	無	一般 GET 可逆、POST 不可逆	看具體請求
`mcp-server-postgres`（直連 DB）	無	視 DB user 權限	改 row 通常可逆、DROP TABLE 不可逆	DB user 權限要設好

實務上、社群常見的 MCP server 多半屬於「白名單較弱」「副作用直接套用」的設計、需要使用者自己加防護。

第三方 MCP server 的供應鏈信任

MCP server 是可執行程式碼、信任邊界比 GGUF 模型權重高一個層級。常見的 MCP server 來源：

官方 reference server（如 Anthropic 維護的 @modelcontextprotocol/server-*）：相對較高信任、有官方 maintain。
知名專案的 MCP server（如 GitHub、Notion、Slack 等公司自己出的）：跟該公司的軟體分發信任度一致。
社群 MCP server：個人或小團隊維護、信任度視 maintainer 與 download 量、看 code 是基本動作。

裝任何 MCP server 前的最低判讀：

看 source repo：是不是知名作者、stars 數、最後 commit 時間、issues 是否活躍。
看實際做什麼：MCP server 的 README 通常列出提供的 tools、跑起來會碰到的權限。
跑在最小權限環境：能用 Docker / chroot / nice -n 19 之類就用、不要直接用 root / admin。
不要用 curl | sh 安裝：用 npm install / pip install / go install 等有 package manager 介入的方式、留下 install log。

事實查核註：MCP server registry、套件管理工具的供應鏈安全機制依版本演進、Anthropic 跟其他主要 client 廠商可能引入官方 marketplace 或簽章機制、建議引用前以當前 MCP 官方狀態為準。

個人 dev 場景的最低防護建議

對「我想用 tool use 但又怕 LLM 把檔案搞壞」的工作流、最低防護建議：

codebase 用 git track：所有寫入操作前確認 working tree clean、出問題能 git checkout 還原。git stash 是更輕的選擇。
重要檔案 backup：dotfile、SSH key、雲端 API key 等不在 git track 範圍的、用 Time Machine / rsync / cloud sync 之類做日常 backup。
跑 LLM agent 時用獨立 user / 容器：對「想試 agent 但怕」的場景、開個專用 macOS user 或 Docker container、user 沒 sudo、檔案存取限定 workspace。
MCP server 的 config 加白名單：能設 allowed paths / allowed commands / allowed URLs 的 server 都先設、預設拒絕、按需開放。
看不懂的 tool call 不要 confirm：Continue.dev / Claude Desktop 等 client 通常會 prompt 使用者確認 tool 執行、看不懂的 JSON 先別按。

tool use 副作用洩漏的常見路徑

個人 dev 場景常見的 tool use 副作用洩漏路徑：

LLM 誤把 secret 寫進 commit：tool use 帶 git commit、LLM 從 .env 讀到 API key 又寫進 commit message。對應防護：MCP server 加 .env 黑名單、commit hook 掃 secret。
LLM 套用 broken patch 蓋掉檔案：apply_patch 失敗 / 部分套用、留下無法 compile 的狀態。對應防護：套 patch 前 git stash 或 git add -p 先存 working tree。
LLM 從 issue / PR 內容引發指令：讀進 issue 的 prompt 內容包含 prompt injection、誘導跑非預期指令。對應防護：tool 跑前明確讓使用者確認（見 6.3 prompt injection）。
LLM 觸發 production 操作：MCP server 連到 production DB、LLM 跑 DROP TABLE。對應防護：production credential 絕對不放在 tool use 可達的環境。

給讀者的 tool / MCP 評估清單

每次裝新 MCP server / 啟用新 tool 之前、跑一次評估：

1[ ] 來源是知名作者 / 官方專案 / 我能 audit 的開源 repo
2[ ] README 列出的 tool 列表、跟我的使用情境匹配
3[ ] 該 server 跑在最小權限環境（user / sandbox / container）
4[ ] 副作用類 tool 有 confirm / preview 機制
5[ ] workspace 內容受 git track、能 rollback
6[ ] 不放 production credential / SSH key 在該 server 可達的環境
7[ ] 啟用後跑簡單測試、確認 tool call 行為符合預期

下一章：6.3 IDE 場景的 prompt injection、處理 tool use 副作用最常見的觸發來源。

模組四：LLM 應用層原理

Thu, 14 May 2026 00:00:00 +0000

狀態：大綱階段、部分章節待完成內容。

本模組整理 LLM 應用層的核心原理：模型裝起來、能對話之後、要怎麼跟外部世界互動、怎麼組成可用的工作流、怎麼測它跑得對不對。模組零到模組三建立的是「模型本身」的心智模型；本模組建立的是「模型作為系統元件」的心智模型。

寫這個模組的核心約束是「只寫不會過時的部分」。LangChain、LlamaIndex、aider、Cline 等工具半年一個世代、寫具體 API 半年後就過時；但「retrieval 在做什麼」「為什麼 LLM 需要 tool use」「agent loop 為什麼會失敗」「eval 軸怎麼選」這些原理跨工具世代都成立。本模組刻意避開具體實作教學、把焦點放在跨世代的設計取捨。

章節列表

章節	主題	關鍵收穫
4.0	Prompt 技術光譜	三軸（context / 推理 / 格式）+ 四維 trade-off + stack 判讀 + 跟 fine-tune/RAG/chaining 的邊界
4.1	RAG 原理：retrieval + augmentation 模式	為什麼要外掛知識、語意相似 vs 字面相似、chunking 取捨、失敗的根本原因
4.2	RAG 檢索增強：query rewriting / HyDE / multi-step / packing	四層增強分類、何時 stack 何時不要、adaptive retrieval
4.3	Tool use 原理：LLM 跟外部世界互動	structured output 是橋、function calling 取捨、為什麼小模型 tool use 崩
4.4	Agent 架構原理	Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、人類審查模型
4.5	人機協作拓樸：何時人介入、怎麼介入	Centaur vs Cyborg、jagged frontier、HITL 三時機（pre-act / mid-stream / post-hoc）、避免橡皮圖章化
4.6	應用層協議：function calling / structured output / MCP	三者層級差異、為什麼出現 MCP、組合工作流
4.7	Workflow 編排模式	Pipeline / router / parallel / reflection 四種基本模式、退化條件
4.8	Multi-Agent 拓樸	Flat / hierarchical / agent-as-tool、specialization gain vs orchestration overhead、特有失敗模式
4.9	Production 部署的資源評估原理	6 個 dimension：concurrency / latency / cost / storage / observability / reliability
4.10	衍生產物管理原理：什麼進 git、什麼不該	Source / derived / external 三分類、`.gitignore` 設計模式、prompt + eval 版本管理、production deployment 對接
4.11	Long context engineering	claimed vs effective context、lost-in-the-middle、跟 RAG 的取捨
4.12	Embedding model 內部	contrastive learning、選型、MTEB、in-domain fine-tune
4.13	Eval 設計座標系：三軸、八象限	Objective / component / quantitative 三軸 × 工具選擇、軸誤選的訊號、eval 演化路徑
4.14	Benchmarking 與評估方法論	capability vs performance、in-house benchmark、`llama-bench`
4.15	Vision in coding workflow	VLM 在 coding 場景的 use cases、本地 VLM 選型、IDE 整合現狀
4.16	靜態 / serverless RAG deployment	沒 backend 的 RAG 四方案、API key 暴露、CORS、abuse、SaaS 供應鏈、跟模組六 routing
4.17	Coding agent harness	Scaffold vs harness 分層、context budget 25% 規則、subagent 設計、跟 Claude Code / Cursor / Aider 的 mapping
4.18	Prompt caching 工程實務	Cache breakpoint 設計、coding agent / RAG 場景 pattern、anti-pattern、cost / latency 槓桿
4.19	Agent memory 分層架構	Working / session / episodic / semantic / procedural 四層、寫入時機、retrieval 設計、失敗模式
4.20	LLM tracing 與 observability	OTel GenAI semconv、cost / latency / failure debug、trace → eval 閉環
4.21	LLM-as-Judge 評估方法	Rubric 設計、pairwise vs direct、三大 bias 緩解、calibration、跟 production trace 的閉環
4.22	RAG storage 工程	四層可替換結構、storage 演化階梯、升級判讀訊號、index 生命週期、dependency 約束
Hands-on	端到端案例：把所有原理串成具體 case study	Customer support agent 從 task decomposition 到 eval 全流程

為什麼這個順序

本模組章節順序的設計脈絡：

先 4.0 Prompt 技術光譜：within-call 增強是後續所有設計的基底、先建立「prompt 層能做什麼、邊界在哪」的座標。
接 4.1 RAG 原理 + 4.2 RAG 檢索增強：應用層最常見的模式、把「LLM + 外部知識」這個基本組合走過一遍、概念對映到每個讀者都用過的 @codebase 等實務經驗。
再 4.3 Tool use：RAG 是「LLM 讀外部資料」、Tool use 是「LLM 對外部世界做事」、兩條延伸方向自然接續。
再 4.4 Agent 架構 + 4.5 人機協作：把 Tool use 從「單次呼叫」延伸到「自主多步」、自然進入 agent；agent 自主後立刻面對人類介入時機問題。
再 4.6 應用層協議：前面章節涉及 function calling、structured output、MCP 等術語、本章把這三個概念放回正確的層級、避免混為一談。
再 4.7 Workflow + 4.8 Multi-agent：上層整合、把多 LLM call 跟多 agent 組合的設計模式整理成跨 framework 不變的概念地圖。
4.9 起進入 production / 細節：部署資源、衍生產物管理、long context、embedding 內部、eval / benchmarking、tracing、judge——每個都是 production 場景遇到的具體議題。
最後 hands-on：把上述所有原理串成具體案例、看「實際做的時候、原理怎麼落」。

每章可以單獨讀、但若你是第一次接觸 LLM 應用層、照順序讀最不容易迷路。

跟其他模組的分工

模組	角度
模組零	操作層心智模型：模型放哪、怎麼選工具
模組一	工具層：具體裝 Ollama / Continue.dev
模組二	數學工具：線性代數、機率、最佳化
模組三	理論機制：模型內部運作
模組四	應用層原理：模型作為系統元件、跟外部世界互動的設計取捨

適合的讀者

你的背景	適合程度
寫過 Ollama + Continue.dev、想懂「然後呢」	直接適合、從 4.0 依序讀
已經試過 LangChain / aider / Cline、想看原理	直接適合、本模組補足「為什麼這樣設計」的視角
想做 LLM 應用開發	重點讀 4.0、4.1–4.3、4.4–4.5、4.7–4.8、4.13
只想用本地 LLM 寫 code、不做應用	跳過本模組無妨、模組零 + 模組一已足夠

不在本模組內的主題

具體 framework 教學：LangChain、LlamaIndex 等的 API 用法、隨版本變、交給官方文件。
具體 prompt 寫法：跨模型跨任務不可遷移、本模組 4.0 寫的是 prompt 技術 landscape 的結構、不是具體寫法。
具體 agent 工具配置：aider、Cline 等的安裝設定、隨工具版本變、見 1.6 延伸方向的入口資訊。
訓練 / fine-tuning：屬於改變模型本身、見 3.4 訓練流程。

4.4 Agent 架構原理

Mon, 11 May 2026 00:00:00 +0000

Agent 跟「對話 LLM」的根本差異在於控制流的所有權。對話 LLM 是「人類問、模型答」、每輪都由人類決定下一步；agent 是「LLM 自己決定下一步、自己呼叫工具、自己評估結果」、控制流交給模型。

這個轉變看似只是「加個 loop」、實際上帶來新的設計問題：失敗模式從「答錯」變成「跑偏」、終止條件變成設計重點、人類審查角色從「事後讀」變成「決定何時介入」。本章把 agent 的這些核心問題拆開、寫成跨 framework 都成立的原理。aider、Cline、LangGraph、各家 Agent SDK 等具體工具不在本章焦點——這些半年一個版本、原理層級更穩。

本章目標

讀完本章後你能：

區分「LLM agent」跟「對話 LLM」的本質差異。
畫出 agent loop 的核心結構、看到新 agent 工具能對應到這個骨架。
看到 agent 失敗時、能診斷是哪一類失敗（context drift / 目標漂移 / tool 誤判）。
判斷一個任務該用 agent 還是 single-call。

Agent 跟「對話 LLM」的差異

維度	對話 LLM	Agent
控制流	人類驅動、每輪 turn 獨立	LLM 自己驅動、跨多步
上下文	每次 prompt 由人類組裝	自己累積跨步驟 context
工具呼叫	單次 / 偶爾	多次連續、串接結果
終止	使用者結束對話	模型自己判斷「完成」
失敗模式	答錯（人類能立刻 catch）	跑偏、進入錯路、long horizon 累積誤差
人類角色	主導者	監督者 / 審查者

這個轉變對 LLM 提出新的能力要求：

規劃能力（把目標拆成可執行的子步驟）。
自我評估能力（判斷子步驟做對了沒）。
工具選擇能力（多個工具中挑對的）。
上下文管理能力（哪些 context 該帶下去、哪些可以丟）。

這幾項能力是雲端旗艦模型的明顯強項、也是本地小模型的明顯弱項。理解這個能力差距、能解釋為什麼「本地寫 code 用 Continue.dev 還行、本地跑 agent 經常失敗」、不是工具問題、是模型能力 baseline 問題——背後牽涉 function calling 訓練深度、long context prefill 痛點、規劃能力差距。

Agent Loop 的核心結構

所有 agent framework 不管實作怎麼包裝、骨架都是同一個 loop：

11. 感知（Perceive）：讀當前 context、環境狀態、上一步結果
2   ↓
32. 推理（Reason）：思考下一步該做什麼、選工具、決定參數
4   ↓
53. 行動（Act）：呼叫工具、修改環境
6   ↓
74. 觀察（Observe）：解讀工具回應、更新 context
8   ↓
95. 判斷終止：done 還是回 1

這個 loop 跟控制系統的 sense-plan-act 同骨架、本質是「在環境中執行目標導向行為」。Agent framework 的差異主要在每一步的具體實作：

感知怎麼編成 prompt？要保留多少歷史？怎麼壓縮 long context？
推理用什麼模型？用 chain-of-thought 還是直接決定？要不要再拆成 plan + act？
行動支援什麼 tool？怎麼防止破壞性操作？
觀察怎麼把工具回應翻成 context？大 output 怎麼摘要？
終止怎麼判斷？模型自己說、外部 critic 判斷、step 上限、cost 上限？

理解這個骨架的價值是：看到新 agent framework 時、按這 5 步問就能拆解它的設計取捨；agent 跑出問題時、定位是哪一步壞掉、不是「整個 agent 壞了」。

為什麼 Agent 容易失敗

Agent 跑長時間任務時、失敗率比 single-call 高很多、根因多半落在這三類：

Context drift（上下文漂移）

每輪累積的 context 偏離原始目標、後期 LLM 「忘記」要做什麼。典型表現：開始任務是「修這個 bug」、跑了 10 步後變成「重構這個 module」、再 10 步後變成「rewrite 整個 file」。每一步看起來都合理、累積起來偏離原目標。

根因：

模型對 long context 後段的 attention 偏弱（middle-loss 現象、attention 在序列中段表現最弱、見 3.2 attention 機制）。
子步驟產出的中間結果會被當成「新目標」、模型沿著中間結果繼續推、原始目標被擠掉。
沒有定期重新引用原始目標的機制。

緩解：每隔 N 步把原始目標重新塞回 context、或用外部 critic 比對「現在這步跟原目標的距離」。緩解失敗的下一步：N 步重塞仍漂移、改換較大 model（context 處理能力跟模型大小強相關）；換 model 仍漂移、escalate human 或退回 single-call 拆解任務。

Goal drift（目標漂移）

模型把子目標當主目標、執行完子目標就停下來、原始任務沒完成。例：原任務「實作 + 測試 + commit」、模型實作完就回「我寫完了」、忘了還要測 + commit。

根因：

訓練資料中「完成單一任務」的範例多、「完成複雜 multi-step 任務」的範例相對少。
子任務做完的「完成感」訊號比「整個任務還沒完」訊號強。

緩解：終止條件用外部驗證（test 跑通、PR 開、commit 進）、不靠模型自己說「完成了」。緩解失敗的下一步：外部驗證仍漏步、加 explicit checklist 在 system prompt、每步要求模型回報 checklist 完成狀態。

Tool result misread（工具結果誤判）

Tool 回 error 或意外結果、模型 hallucinate「成功了」繼續推進、累積錯誤越來越深。例：git push 失敗、模型沒讀 error message、下一步開始寫 PR description、最終提交一個沒推上去的 branch。

根因：

模型對「無聲失敗」（tool 回的格式正常但內容是 error）解讀差。
部分 framework 對 tool error 處理弱、模型看不到完整 error message。

緩解：tool 設計時 error 用結構化、模型容易識別；agent loop 加 explicit error handling step、看到 error signal 強制 retry 或 escalate。緩解失敗的下一步：retry 仍失敗、強制呼叫 tool 重新讀狀態（如 git status / git log）確認、避免依賴模型對 tool 結果的記憶。

什麼任務適合 Agent vs Single-call

Agent 適用面有邊界、判讀 framework：

適合 agent：

目標可分解成明確子步驟。
子步驟有客觀驗證訊號（test 跑通、file 寫入、API 200）。
單一 call 上下文不足、需要跨多次 tool 互動。
失敗可以 recover（agent 跑錯一步可以糾正）。

不適合 agent、改用 single-call：

目標模糊探索性（沒有客觀驗證）。
緊湊推理任務（拆步驟反而失去全局視角）。
簡單可預測的任務（agent overhead 大於收益）。
失敗代價極高（agent 跑錯一步很難 recover）。

例子對照：

任務	該用	為什麼
修一個 bug、跑 test 確認	Agent	子步驟清楚、test 是客觀驗證
寫一個 function 的 docstring	Single-call	簡單、不需 multi-step
設計新 module 架構	Single-call + 人類	探索性、人類審查比 agent loop 有用
重構整個 codebase	Agent（謹慎）	子步驟多但失敗代價高、需強人類監督
寫詩 / brainstorming	Single-call	創意任務、沒有客觀驗證、agent loop 沒意義
Migrate database schema	Agent + 強審查	子步驟清楚但失敗代價極高、每步要人類確認

「先 single-call 試、不夠再 agent」是合理的預設姿勢。Agent 是「特定問題的解法」、客觀驗證訊號 + 可承擔失敗 + 多步必要、三者俱備時用；用錯地方反而增加 cost 跟失敗率。

灰色帶反例：判讀容易誤判的情境

實務上常見的「該用但失敗了」「不該用但成功了」灰色帶、列幾個典型情境跟判讀路徑：

目標可分解但子步驟驗證不夠客觀：如「優化這段 code 的可讀性」、可以分成「重構函式 / 加註解 / rename 變數」、但「好不好」沒客觀驗證。Agent 跑完可能改成「自己覺得好」的版本、跟使用者期待差很多。判讀：改用 single-call + 人類審查、或加明確的 lint / formatter 當客觀驗證。
失敗代價不對稱：如 production database migration、子步驟清楚（dump → migrate → verify）、但中間失敗可能毀資料。判讀：用 agent 但強制每步要 human-in-the-loop confirm、或拆成 agent 生 migration script + 人類執行兩階段。
子步驟之間有強依賴：如「研究某 topic → 寫摘要 → 翻譯」、agent 容易在中間步驟漂掉、累積誤差傳到最後。判讀：強依賴 chain 走 single-call sequential pipeline、不走 agent loop。
任務在訓練分佈邊緣：如 niche domain（特定 framework、罕見語言）的 multi-step 任務、模型對該 domain 沒看過 multi-step 範例、容易在 step transition 漏 context。判讀：先 small-scale 驗證 agent 在這個 domain 表現、再決定要不要 scale up。

Termination 條件：怎麼讓 Agent 知道停下來

Agent 的失敗模式很多落在 termination：該停沒停（無限 loop）、不該停就停（漏做子步驟）。Termination 策略選擇是 agent 設計的核心。

主流 termination 機制：

明確 done signal：tool 回 special token、模型輸出特定 phrase。最直接、但靠模型自律、不夠 robust。
Step 上限：跑 N 步強制停。防止無限 loop、但 N 設不對會中途砍掉。
Cost 上限：累計 token / dollar 超過 cap 強制停。實務防錢被燒掉。
目標達成評估：另一個 LLM 或 deterministic check 判斷「任務完成了沒」。最 robust 但 cost 高。
外部訊號：test 跑通、檔案被寫入、人類介入。客觀、用在有明確完成判準的任務。
人類介入：把 termination 決定交給人類。最保守、適合不可逆任務。

實務上多重 termination 並用：step 上限當 safety net、cost 上限當預算守門、外部訊號當主要判準、人類介入當最終 fallback。

判讀 termination 設計的訊號：

沒有 step / cost cap → 失控風險高。
完全靠模型自己說「完成」→ 漂移風險高。
沒有客觀驗證 → 「成功」訊號可能是 hallucination。

Agent 跟人類審查的協作模型

Agent 的自主程度跟人類審查粒度是 spectrum、不是 binary：

模型	人類介入時機	適合任務
Full auto	跑完之後審結果	可逆任務、低風險（read-only、本地實驗）
Checkpoint	每隔 N 步審一次	中等風險、長時間任務
Step-by-step approval	每個 tool call 前審	不可逆任務、高風險（production change）
Plan first, then auto	審 plan、approve 後自動跑	可預測子步驟、人類確認方向後可放手
Human-in-the-loop（HITL、agent 過程中插入人類審查節點）	Agent 不確定時主動問人類	模糊邊界、需要 domain 判斷

選擇依據主要是「副作用範圍」（見 4.3 工具的副作用範圍設計）：等級 1-2 工具可以 full auto、等級 3 適合 checkpoint、等級 4-5 強制 step-by-step。不同自主度對應的 HITL 時機選擇（pre-act / mid-stream / post-hoc）跟確認流程設計（避免橡皮圖章化）見 4.5 人機協作拓樸。

設計 agent 時、先設想最差情況：「agent 跑偏到底會發生什麼」、再決定該用哪一級協作模型。完全自動跑 production migration 通常是 over-trust、step-by-step 跑 search 通常是 under-trust。個人 dev 把這個協作模型從本機 wrapper 演化到團隊 / production 服務時的 routing 判讀見 6.5 跨進 production 的 routing 中樞。

本地 LLM 跑 Agent 的特殊挑戰

本地 LLM 跑 agent 現階段（2026/5）失敗率明顯高於雲端、根因不只一條：

Tool use 訓練不足（見 4.3）：小模型 tool use 本來就崩、agent 需要多次穩定 tool use、失敗率複合放大。
Long context prefill 痛點（見 0.1 為什麼 LLM 生字慢）：Agent 每步都重新 prefill 累積 context、TTFT 越跑越長。
規劃能力弱：雲端旗艦在 multi-step planning 上的優勢是公認的；本地 model SFT 規模有限、規劃能力跟雲端有明顯差距。
失敗 recovery 弱：模型發現走錯路時、本地模型較容易繼續錯下去、雲端模型較會自我修正。

實務啟示：本地 agent 在 2026/5 屬於「值得試、但不一定留下」的階段。對寫 code 場景的多數使用者、agent loop 的複雜任務交給雲端旗艦更划算；本地保留給 single-call 跟簡單 tool use 場景。在以下條件成立前、雲端仍占優、可作為 tripwire 重新評估：

30B+ 本地模型 SWE-bench tool-use 子集達雲端旗艦的 80% 以上、且推論成本可接受
本地推論伺服器（Ollama / LM Studio / oMLX）穩定支援 function calling spec、跨框架行為一致
Apple Silicon Mac 記憶體預算夠跑「主 model + drafter + KV cache」整套 agent loop 不 swap

任一條件達標時、本地 agent 的成本效益就可能翻轉、值得重新評估。

何時過時 / 何時不過時

不會過時的部分：

Agent vs 對話 LLM 的控制流差異 framing。
Agent loop 五步骨架（感知 / 推理 / 行動 / 觀察 / 終止）。
三類失敗模式（context drift / 目標漂移 / tool 誤判）的分類。
「適合 agent vs single-call」的判讀框架。
Termination 策略的 trade-off。
人類審查協作 spectrum。

會變的部分：

具體 agent framework（aider / Cline / LangGraph / OpenAI Assistants 等會持續演化）。
模型 agent 能力（本地模型會逐步追上雲端、平衡點會移動）。
Tool ecosystem 跟 MCP server 普及度（見 4.6 應用層協議）。
各家 agent 的最佳 prompt / system prompt（屬於 prompt engineering、本指南不展開）。

看到新 agent framework 時、回到本章的 5 步骨架、3 類失敗模式、5 種人類審查協作模型——這些 dimension 不變、看新工具能很快理解它的定位跟限制。

下一章：4.5 人機協作拓樸、把上文的人類審查 spectrum 落到「人類什麼時候介入、怎麼介入」的三時機設計。應用層協議（function calling / structured output / MCP）的層級差異見 4.6。Agent 對本機資源副作用的個人 dev 權限判讀見 6.2、個人工作流跨進 production 服務時的 routing 中樞見 6.5。

4.8 Multi-Agent 拓樸：flat / hierarchical / agent-as-tool

Thu, 14 May 2026 00:00:00 +0000

4.7 workflow patterns 寫的是「多次 LLM call 怎麼組合」、四個基本模式（pipeline / router / parallel / reflection）解的是 single-thread 多 call 問題。當問題進一步複雜——需要平行的多個專業化角色、需要跨產品的 agent 重用、需要 agent 之間互相呼叫——就進入 multi-agent system 的領域。

本章寫的是 multi-agent 系統的拓樸結構：何時值得從多 call 走到多 agent、flat 跟 hierarchical 兩種拓樸的差異、agent-as-tool 的 MCP 視角、specialization 跟 orchestration overhead 的核心 trade-off。具體 framework（CrewAI、AutoGen、LangGraph 多 agent 等）半年一個世代、本章不寫具體 API。

本章目標

讀完本章後你能：

判斷一個系統該停在 multi-call workflow 還是進入 multi-agent。
區分 flat / hierarchical / agent-as-tool 三種拓樸、各自的適用場景。
估算 specialization gain vs orchestration overhead 的 trade-off。
識別 multi-agent 特有的失敗模式（循環依賴、責任歸屬模糊、context 重複傳遞）。
把 agent-as-tool 對應回 MCP / function calling 的協議設計。

從 Multi-Call 走到 Multi-Agent 的判讀

Multi-agent 跟 multi-call 不是「agent 數量多寡」的差別、是控制流跟責任邊界的差別。

維度	Multi-call workflow	Multi-agent system
控制流	主程式編排、每 call 是 step	Agent 自己決定下一步、可能呼叫其他 agent
角色	Step 跟 step 之間沒有「身份」、就是函數	每個 agent 有 role / 專業 / 工具集
Context	主程式傳 context、step 不擁有 context	Agent 自帶 memory、有「自己知道的事」
重用	Step 是函數、容易 import 重用	Agent 是黑盒、跨系統重用要透過協議
失敗歸屬	Step 失敗、主程式接	Agent 失敗、可能 cascading 影響別的 agent

判讀「該走 multi-agent」的四條件（任一不滿足、就留在 multi-call）：

角色差異顯著：不同 step 要不同 prompt / model / tool / memory。任一條件同質就退回 multi-call、硬拆成多 agent 只是換個名字、orchestration overhead 純增。
跨產品重用：同一個 agent 要被多團隊 / 多場景使用。單一 user / 單一場景的話、寫成函數比 agent 簡單。
真正平行 / 動態協作：多個 agent 各做自己的事最後合併、或哪些 agent 參與是 query-dependent。控制流可寫死、step 順序固定時、multi-call pipeline 已足夠。
團隊熟悉度足：multi-agent 失敗模式比 multi-call 多、debug 比較難。團隊還在學階段、debug 容易性 > 靈活性、先 stick to multi-call。

「先 multi-call、不夠再 multi-agent」是合理預設姿勢。Multi-agent 是「特定問題的解法」、不是「更高級的設計」。對應 4.4 agent 架構的「先 single-call、不夠再 agent」反射、層級往上類似。

三種拓樸

Multi-agent 的拓樸結構決定 agent 之間怎麼通訊、誰決定誰做什麼。三種主流拓樸各有適用場景。

Flat 拓樸：all-to-all

所有 agent 同層級、可以互相呼叫、沒有固定 orchestrator。

1       Agent A ─────── Agent B
2         │  ╲          ╱  │
3         │   ╲        ╱   │
4         │    ╲      ╱    │
5       Agent C ─────── Agent D

適用：agent 之間平等、任務需要動態協商（agent A 想知道 X、問 B 跟 D、再決定）。
典型場景：研究型多 agent debate、模擬多個利害關係人協商。
失敗模式：
- N² 通訊複雜度：agent 多了之後、通訊路徑潛在 N²、實務常較稀疏但難預測、cost / latency 上限不可控。
- 無權威仲裁：兩個 agent 意見衝突、沒有第三方決定、容易死鎖。
- 責任歸屬模糊：最終結果是誰決定的不清楚、debug 困難。
規模限制：實務上 flat 拓樸超過 5–6 個 agent 就難維護、不推薦大規模。

Hierarchical 拓樸：orchestrator + specialists

一個 orchestrator agent 對外、底下若干 specialist agent、orchestrator 決定 dispatch 給誰、合併結果回 user。

 1              User
 2                │
 3          ┌─────────────┐
 4          │ Orchestrator │
 5          └──┬──┬──┬──┬─┘
 6             │  │  │  │
 7        ┌────┘  │  │  └────┐
 8   Specialist  │  │   Specialist
 9       A    Spec  Spec      D
10             B    C

適用：對 user 要單一介面、底下 agent 專業化、orchestrator 知道每個 specialist 的 capability。
典型場景：智慧家庭中央控制（user 對 orchestrator 說話、orchestrator 派給 climate / security / energy agent）、複雜客服系統（orchestrator 派給 product / refund / billing 不同 specialist）。
失敗模式：
- Orchestrator 變單點瓶頸：所有請求過 orchestrator、它的 prompt / model 限制整個系統能力。
- Specialist 之間訊息傳遞要過 orchestrator：增加 latency、容易丟細節。
- Orchestrator 不知道何時該派誰：需要動態描述 specialist capability、複雜 query 容易 dispatch 錯。
變體：multi-level hierarchy（orchestrator 下面還有 sub-orchestrator），實務上 2 層夠用、3 層以上 overhead 大於 specialization gain。

Agent-as-Tool：agent 互通就是 tool call

把每個 agent 包成「另一個 agent 的 tool」、agent A 呼叫 agent B 跟呼叫 weather API 在介面上一樣——都是 tool call。

1Agent A
2  ├── tool: weather_api
3  ├── tool: database_query
4  └── tool: agent_B  ←── 內部其實是另一個 agent loop
5                            └── 它也有自己的 tools
6                                ├── tool: code_executor
7                                └── tool: agent_C

適用：agent 之間有清楚的「誰呼叫誰」、不是平等協商；想透過標準協議（function calling / MCP）讓 agent 跨系統重用。
典型場景：MCP 的 tool primitive 視角下、agent-as-tool 可以包成 MCP server 暴露、client agent 把它當 tool 用。跨組織 agent 互通常走這個模式。注意 MCP 還有 resources / prompts 另外兩類 primitive、不是所有 MCP server 都是 agent-as-tool。
跟 hierarchical 的關係：agent-as-tool 是 hierarchical 的一個實作策略——orchestrator 把 specialist agent 當 tool。差異在於：hierarchical 可能是同進程內的緊耦合、agent-as-tool 走標準協議、跨進程 / 跨組織 / 可替換。
失敗模式：
- 協議的 schema 太薄：agent 跟 agent 之間的 input/output 用 string 傳、丟結構資訊、下游難解析。
- Cascading failure：下游 agent 失敗、上游 agent 不知道為什麼失敗、誤判繼續。
- 重複 context 傳遞：每次呼叫都要重新 brief 一次下游 agent、token cost 爆。緩解：下游 agent 自帶 session memory（見 4.19 agent memory architecture）。

三種拓樸的選擇

場景特性	推薦拓樸
2–4 個 agent、需要動態協商	Flat
多個專業 agent、單一對外介面	Hierarchical
跨組織 / 跨進程 / 標準化重用	Agent-as-tool
大規模（10+ agents）、固定協作模式	Hierarchical 多層
想簡單開始	Hierarchical 兩層

教材建議的組合：對外是 hierarchical（單一 orchestrator）、orchestrator 內部跟 specialist 通訊走 agent-as-tool 協議（如 MCP tool primitive）、specialist 之間用 flat 模式平等溝通。實務上組合方式因團隊跟產品差異很大、這只是一個合理起點。

Specialization Gain vs Orchestration Overhead

Multi-agent 的核心 trade-off 是專業化收益跟協調成本的拉鋸。

Specialization gain：把 agent 拆細的好處

單一責任：每個 agent prompt 短、focus 清楚、debugging 容易。
獨立優化：每個 agent 可以用不同 model（具體 routing 思路屬於 4.7 workflow patterns router 模式）、不同 prompt、獨立 eval。
重用：同一個 specialist 跨多個系統用、攤平訓練 / 設計成本。
平行：獨立 agent 可平行跑、latency 降。

Orchestration overhead：拆細的成本

Context 傳遞成本：每個 agent 要被 brief、context 重複傳、token 累積。
Latency 累積：每跳一個 agent 加一個 LLM call 的 latency、跨 agent chain 跟 reflection / multi-step retrieval 一樣會累積。
失敗模式多：每個 agent 自己會 drift、agent 之間也會誤判、debug 比 single agent 難。
責任歸屬：bug 出現時、定位是哪個 agent 跑偏要看完整 trace。

何時 specialization 划算

條件	Specialization 划算？
Agent 之間 role 差異顯著	划算
Agent 之間 role 同質	不划算
重用機會多（多產品 / 多場景）	划算
單一場景 / 單一團隊	不划算
每個 sub-task 各自有客觀 eval	划算
Sub-task 無法獨立評估	不划算（debugging 困難）
Latency 容忍度高（後台 batch）	划算
即時 chatbot	不划算（orchestration latency 殺死 UX）

兩個容易低估的條件展開：

「sub-task 無法獨立評估」為何讓 debugging 困難：當 specialist agent 出問題、若沒有 component-level eval、要從 final output 倒推到「哪個 agent 跑偏」要看完整 trace + 人工讀。Single agent 失敗只需查一個 agent 的 trace、multi-agent 失敗要查 N 個、且 cascading failure 讓 root cause 模糊。要配 sub-task 客觀 eval（如 retrieval recall、抽取 accuracy）才能秒抓問題層、不然 specialization 換來的是更貴的 debug。
「orchestration latency 殺死 UX」的量級：每跳一個 agent 加一個 LLM call（雲端旗艦 ~1-3s）。Hierarchical 三層、user query 到回應走 3+ 次 LLM、累積 3-10s。即時 chatbot 的 latency budget 通常 < 3s、multi-agent 容易超標。Workaround：specialist 換小 model、或某些 step 改 deterministic、或退回 single agent + multi-step prompt。

「先粗、再細」的演化路徑

實務多採演化路徑、不是一開始就設計多 agent：

Single agent 開始：把整個任務塞一個 agent、看跑得起來嗎。
發現某子任務 systematic 失敗：那個子任務拆出來、變成 specialist agent。
更多子任務需要拆：演化成 hierarchical。
要跨產品重用：把某個 specialist 包成 agent-as-tool（透過 MCP）。

這條路徑的好處是每一步都有具體痛點驅動拆分、不是「為了 multi-agent 而 multi-agent」。

Multi-Agent 特有的失敗模式

除了單 agent 共通的失敗（context drift / goal drift / tool misread、見 4.4）、multi-agent 系統有自己特有的失敗模式：

循環依賴

循環依賴是 agent 呼叫圖在執行期才形成 cycle、靜態 declaration 抓不出來、結果無限執行。例：Agent A 呼叫 B、B 呼叫 C、C 又呼叫 A、形成 cycle。

緩解：

Call stack 監測、深度超過 N 強制中止。
Agent 設計時明確 declare 它會呼叫哪些下游 agent、靜態 check 不出 cycle。
Cycle 的合法用例（如 negotiation）要明確設停止條件。

責任歸屬模糊

責任歸屬模糊是 multi-agent 的 cascading 結構讓 final output 的「哪個 agent 出錯」可能跨多個 agent 累積、debug 時不知道從哪查。

緩解：

強制 trace 全部 agent call（見 4.20 LLM tracing）。
每個 agent 明確 declare 它對 final output 的貢獻範圍。
Error 用結構化、明確標出 raised by 哪個 agent。

Context 重複傳遞

Context 重複傳遞是 agent-as-tool 介面下、上游每次呼叫下游都要重新 brief 一遍、缺乏跨 call 的狀態保留、累積成 token cost 跟 latency 雙重浪費。

緩解：

Specialist agent 自帶 session memory、不用每次 brief（見 4.19 agent memory architecture）。
共享 context（global state、reference passing）取代複製。
Agent-as-tool 協議設計時、輸入 schema 包含「已 brief 過、跳過 intro」flag。

Orchestrator 成為單點認知瓶頸

Orchestrator 是 hierarchical 拓樸的核心、要理解所有 specialist 跟分派邏輯、它的 prompt / capability 限制整個系統上限。換 specialist 容易（介面標準）、換 orchestrator 牽動所有 routing 邏輯（耦合深）。

緩解：

Orchestrator 的 dispatch 邏輯外部化（不寫在 prompt 內、寫在 deterministic routing rule）。
Specialist 自己 declare capability（用 OpenAPI / MCP schema）、orchestrator 動態讀、不寫死。

Agent 之間互相 hallucinate

Agent 之間互相 hallucinate 是 agent 介面信任假設失效——上游 agent 給的 input 被視為「可信」、下游沒驗證就執行、hallucinated 內容沿著 agent chain 層層放大。

緩解：

Agent 之間互通也要走 schema validation（見 0.8 fuzzy engineering guardrail 段）。
Critical path 加 deterministic check、不只靠 LLM 自評。

跟 MCP / Function Calling 的協議對應

4.6 應用層協議寫 function calling / structured output / MCP 的層級差異。Multi-agent 拓樸的 agent-as-tool 模式直接對應 MCP：

1Agent-as-tool 在 MCP 視角下的展開：
2
3Client Agent
4  ├── MCP client
5  │     ↓ stdio / SSE / HTTP
6  │   MCP server #1 ← 包了一個 specialist agent
7  │   MCP server #2 ← 包了另一個 specialist agent
8  │   MCP server #3 ← 包了一個外部 service
9  └── 對 client agent 來說、三者介面一致、都是 tool

這個 framing 的價值：目前 agent 跨組織重用的主要工程問題是 agent-as-tool 協議普及度——MCP 是當前的主流選項。當業界對協議 schema 達成共識（無論是 MCP 還是後續演化的標準）、agent-as-tool 拓樸的工程成本會大幅下降。

判讀訊號：自家 agent 想暴露給其他團隊用、預設選 MCP server 包裝、不要設計 proprietary protocol。

何時過時 / 何時不過時

不會過時的部分：

Multi-call vs multi-agent 的判讀框架（控制流 / 角色 / context / 重用 / 失敗歸屬五維度）。
Flat / hierarchical / agent-as-tool 三種拓樸的結構分類。
Specialization gain vs orchestration overhead 的 trade-off。
「先粗、再細」的演化路徑反射。
Multi-agent 特有的五類失敗模式跟緩解。
Agent-as-tool 對應 MCP 的 framing。

會變的部分：

具體 multi-agent framework（CrewAI / AutoGen / LangGraph multi-agent 等會持續演化）。
MCP server 生態的成熟度（普及度會大幅影響 agent-as-tool 的工程成本）。
各家 framework 對 multi-agent 失敗模式的 handling 工具（debugging / tracing tooling）。

下一章：4.9 Production 部署資源評估、把多 LLM call / 多 agent 系統的 cost / latency / capacity 落到具體 production 評估。Multi-agent 跟 multi-call 的對比基礎見 4.7 workflow patterns、agent 自身的失敗模式見 4.4 agent 架構、MCP 協議層討論見 4.6 應用層協議。

Datadog 成本治理與 Agent 配置

Mon, 22 Jun 2026 00:00:00 +0000

本文是 Datadog 的 vendor deep article，深化 overview 的成本跟 Agent 段。初次接觸 Datadog 的讀者建議先讀 Datadog 服務頁。

定位

Datadog 是全託管觀測平台，涵蓋 metrics、logs、traces、profiling、RUM、synthetic monitoring。託管方案的核心取捨是「零運維但成本跟用量成正比」— 用得越多付得越多，而且計價維度多（host、custom metric、log ingestion、span、indexed span），成本治理需要理解每個維度的計價模型。

計價模型概覽

Datadog 的主要計價維度：

維度	計價方式	常見失控來源
Infrastructure host	每 host/月	Auto-scaling 造成 host 數量波動
Custom metrics	每 unique time series/月	Label 爆炸（同 cardinality 問題）
Log ingestion	每 GB ingested/月	Debug log level 忘記關
Log indexed retention	每 million events × 天/月	預設 retention 太長
APM host + indexed span	每 host/月 + 每 million span	Sampling 沒設、全收
Profiling	每 host/月（APM 加購）	整體成本疊加

多數 Datadog 成本失控的根因是 custom metrics 跟 log ingestion — 兩者跟 cardinality 跟 log volume 直接相關，成長可以很快。

Custom Metrics 成本控制

什麼算 custom metric

Datadog 把每個 unique 的 metric name + tag 組合算一個 time series。http_requests_total{service=checkout, method=GET, status=200} 跟 http_requests_total{service=checkout, method=POST, status=500} 是兩個 time series。

Tag 的笛卡爾積決定 series 數量。5 個 service × 4 個 method × 5 個 status = 100 個 series。加一個 region tag（3 個值）就變 300 個。加一個 endpoint tag（50 個 normalized path）就變 15,000 個。

控制策略

Tag 白名單：跟 Prometheus 的 label 白名單邏輯相同。只保留有查詢價值的 tag — service、method、status_class（2xx/4xx/5xx）。移除 user_id、request_id、完整 URL。

Metrics without Limits：Datadog 的功能 — 在 ingestion 之後、query 之前過濾 tag。所有 tag 都收但只 index / 計費特定 tag。適合「收全量但只查部分維度」的場景。

DogStatsD 聚合：Datadog Agent 的 DogStatsD 端在 Agent 層做 pre-aggregation，把客戶端的 per-request metric 聚合成 per-interval 的摘要。減少送到 Datadog 的 data point 數量。DogStatsD 聚合在 Agent 端執行，跟 TSDB 層的 recording rule 是不同位置的 pre-aggregation 機制。

Usage attribution：Datadog 的 Usage Attribution 功能把 custom metric 成本拆到 service / team tag，讓團隊看到自己的 metric 成本。對應 4.15 cost attribution。

判讀指標

Datadog UI 的 Metric Summary 頁面顯示每個 metric name 的 tag cardinality。定期（每月）檢查 top 20 高 cardinality metric，確認是否有意外的 tag 爆炸。

Log Ingestion 成本控制

Index 策略

Datadog log 的計費分兩層：ingestion（進來就計費）跟 indexing（索引後按保留天數計費）。可以 ingest 所有 log 但只 index 部分 — 非 indexed 的 log 可以在 15 分鐘的 live tail 窗口查看，之後就看不到了（除非歸檔到 S3/GCS 做 rehydrate）。

可操作的分層：

Error / warning log：index，retention 30 天
Info log（關鍵路徑）：index，retention 7 天
Debug log：不 index、只 ingest（live tail 用）；或直接不送
Access log（高量）：不 index、歸檔到 S3、需要時 rehydrate

Exclusion filter

Datadog 的 index exclusion filter 讓特定 pattern 的 log 進入 ingestion pipeline 但跳過 index。例：health check 的 access log（path:/health）每秒數百筆但沒有 debug 價值，設 exclusion filter 讓它不佔 index quota。

Log pipeline 跟 Datadog log 的對應

4.11 telemetry pipeline 的 collector 端可以在 log 送到 Datadog 之前做 filtering — 低價值 log 直接 drop、不進 Datadog ingestion（連 ingestion 費用都省）。這比 Datadog 的 exclusion filter 更節省成本（exclusion filter 仍然計 ingestion 費用）。

Agent 部署配置

Agent 部署模式

模式	部署位置	適用場景
Host agent	每台 VM 一個 agent	傳統 VM 部署
DaemonSet agent	K8s 每個 node 一個 agent	K8s 標準部署
Sidecar agent	每個 pod 一個 agent	需要嚴格隔離時
Cluster agent	K8s cluster 一個	收集 cluster-level metric

多數 K8s 部署用 DaemonSet + Cluster Agent 組合。DaemonSet agent 收集 node-level 跟 pod-level 的 metric / log / trace；Cluster Agent 收集 cluster-level 的 metadata 跟 event。

Agent 健康判讀

Agent 本身需要被監控 — Agent 故障時 Datadog 看到的是「資料消失」而非「Agent 掛了」。

判讀指標（Agent 自帶）：

datadog.agent.running：Agent process 是否存活
datadog.agent.check_run：各 integration check 是否正常
datadog.dogstatsd.packets.dropped：DogStatsD buffer 滿時丟棄的封包數

Agent 掛掉時 dashboard 會出現 gap（資料斷層）。如果所有 host 同時斷層、問題在 Datadog backend；如果特定 host 斷層、問題在該 host 的 Agent。

常見 Agent 故障

CPU / memory over-consumption：Agent 開太多 integration check 或 DogStatsD 收太多 custom metric。修復：減少 check 數量、調整 DogStatsD 的 aggregation interval、或升級 Agent 版本（新版通常更節省資源）。

Log collection 延遲：Agent 的 log tail 落後，log 到達 Datadog 的延遲增加。原因通常是 log rotation 設定跟 Agent 的 tail 設定不一致，或 log 量突然爆增超過 Agent 的處理能力。

Network connectivity：Agent 到 Datadog intake endpoint 的網路問題。Agent 會 buffer 資料並重試，但 buffer 滿（預設 100MB）後會 drop。在網路不穩的環境（edge location、受限網路），需要加大 buffer 或設定 proxy。

跟 OTel 的整合

Datadog 支援 OpenTelemetry — 可以用 OTel SDK instrumentation + OTel Collector，把資料送到 Datadog backend。這種模式讓 instrumentation 跟 vendor 解耦，但犧牲部分 Datadog-native 功能（例如 Watchdog anomaly detection 需要 Datadog Agent 的 metadata）。

整合模式的選擇跟 4.C7 Datadog OTel migration practice 的案例分析對應 — 雙軌期的成本跟語意對齊是主要挑戰。

下一步路由

Datadog 服務頁：overview 跟日常操作
4.7 cardinality：cardinality 治理的完整策略
4.15 cost attribution：成本歸因的組織治理
4.C7 Datadog OTel migration：Datadog 跟 OTel 的整合案例
OpenTelemetry：vendor-neutral instrumentation

4.19 Agent memory 分層架構

Tue, 12 May 2026 00:00:00 +0000

LLM 本身無狀態 — 每次 forward pass 從零開始、唯一輸入是 context window。但「agent」概念上有跨 session 狀態：使用者偏好、過去任務、累積知識、操作流程。Agent memory 是 harness 層的設計、把這些狀態持久化、按需 inject 到 working context。本章把 memory 分成五個層次、各層的寫入時機、retrieval 設計、失敗模式拆成可操作的工程實務。

本章目標

讀完本章後、你應該能：

區分 agent memory 的五個層次（working / short-term / long-term episodic / semantic / procedural）。
對自己 agent 場景判斷要哪幾層 memory、不要哪幾層。
設計 long-term memory 的「何時寫」「何時讀」邏輯。
認識 memory 的常見失敗模式（drift / PII / 污染）跟對應緩解。

五個層次的責任劃分

 1[Working memory]：當前 forward pass 的 context window
 2   - 規模：模型 context（4K-1M token）
 3   - 範圍：當下這次推論的全部輸入
 4   - 例：當下 user query + recent tool result + reasoning trace
 5
 6       ↑ 從這層讀 / 寫到這層
 7
 8[Short-term / session memory]：單一 session 的 scratchpad
 9   - 規模：一輪對話到一天
10   - 範圍：跨多個 turn、但 session 結束就丟
11   - 例：本 session 算過的中間結果、tried strategies
12
13       ↑ session 結束時可選擇 persist 到 long-term
14
15[Long-term episodic memory]：跨 session 的「事件」
16   - 規模：永久（直到主動刪除）
17   - 範圍：跨所有 session、按時間順序
18   - 例：「上週解過這個 race condition」「alice 上個月問過 X」
19
20[Long-term semantic memory]：跨 session 的「事實 / 知識」
21   - 規模：永久
22   - 範圍：跨所有 session、按主題索引
23   - 例：「user 偏好 markdown 輸出」「專案用 React 18」「team 不用 Tailwind」
24
25[Long-term procedural memory]：跨 session 的「流程 / 技能」
26   - 規模：永久
27   - 範圍：可重複使用的 known-good 程序
28   - 例：「跑測試前先 npm install」「commit 前要 lint」「deploy 前要 dry-run」

跟人類認知科學的對應：working ≈ 短期工作記憶、episodic ≈ 「我昨天去哪裡了」、semantic ≈ 「巴黎是法國首都」、procedural ≈ 「騎腳踏車的肌肉記憶」。

不是每個 agent 都要五個層次都用

選擇看用例：

用例	Working	Session	Episodic	Semantic	Procedural
Autocomplete（單行補完）	需要	不需要	不需要	不需要	不需要
Single-turn Q&A	需要	不需要	不需要	不需要	不需要
Chat IDE assistant（短對話）	需要	需要	不需要	不需要	不需要
Chat IDE assistant（長期使用）	需要	需要	可選	需要	可選
長期 coding agent（持續同 codebase）	需要	需要	需要	需要	需要
Multi-session research agent	需要	需要	需要	需要	需要

實務啟示：從「最少 memory」開始、有具體 trigger 才加。memory 不是越多越好、每加一層都增加複雜度跟失敗面。

Long-term memory 的寫入時機

何時寫是設計核心、影響 memory 的品質跟成本。三種主流模式：

1. 每 turn 寫（Auto-write）

每個對話 turn 結束都寫一條 memory。實作簡單但 memory 變垃圾場 — 太多瑣碎內容、retrieval 時混淆 signal。

適合：實驗階段、想看 memory 怎麼累積 不適合：production、長期使用

2. 任務結束寫（Task-end write）

每個明確「任務」（如「修完 bug」「寫完 feature」）結束時、寫一條 episodic / semantic memory 摘要。

實作：

1任務開始 → working memory 進入「task mode」
2   ↓ 多 turn 累積 session scratchpad
3任務結束（user 說「好了」/ test 通過 / commit done）
4   ↓ trigger memory write
5LLM call：「請從本 session 提取值得記得的 episodic / semantic / procedural memory」
6   ↓ 結構化輸出
7寫進 long-term store

適合：production agent、明確任務邊界 不適合：開放式對話、無明確任務終點

3. 主動觸發寫（Reflection / consolidation）

定期（每 N turn / 每天）跑「memory consolidation」step、LLM 自己決定該寫什麼。借鑒人類睡眠時 memory consolidation 的研究。

適合：長 running agent、有明確 idle 時間 不適合：低 cost 場景（consolidation 額外 LLM call 是常駐成本）

混用：production 多用「task-end write」為主 + 偶爾 reflection 做 consolidation。

Long-term memory 的 retrieval

何時讀也是設計核心。三種主流模式：

1. Inject-on-startup

把 long-term memory 在 session / agent 啟動時一次塞進 system prompt。

1System prompt:
2  "你是 coding assistant、user alice。
3   semantic memory: {markdown 偏好、React 18、Python 3.11、...}
4   procedural memory: {npm install before test、lint before commit、...}"

適合：memory 量小（< 1K token）、相對穩定 不適合：memory 多、變動快、retrieval 不準

2. Retrieval-on-demand

每次 user query 來、用 embedding similarity 從 vector store retrieve 相關 memory、塞進 context。

1User query → embed → cosine similarity vs memory vectors → top-K → inject

適合：memory 量大、跨主題、需要動態 不適合：高頻 / 低 latency 要求（retrieval overhead）

3. Hybrid（混合）

Procedural / semantic（穩定）→ inject-on-startup；episodic（動態）→ retrieval-on-demand。

1Session 啟動：
2  inject procedural + semantic（小、穩定）
3
4每 user query：
5  retrieve top-K episodic（動態）+ inject

實務 production 多採 hybrid。

跟 RAG 的邊界

Agent memory 跟 RAG 容易混淆、實際上是不同概念：

維度	RAG	Long-term agent memory
主要內容	外部知識庫（docs、wiki、codebase）	Agent 跟特定 user 的互動歷史
Per-user？	通常通用	Per-user / per-session
寫入時機	Build time / ingestion pipeline	Runtime（agent 自己決定何時寫）
變動頻率	較慢（doc 更新）	快（每 session 都可能變）
是否含「事件」	否（純知識）	Episodic memory 是事件

但兩者實作層常共享：vector store / embedding model / retrieval logic 可重用。設計上：

如果讀者問「跟『過去聊過的事』有關」→ memory
如果讀者問「跟『某個固定知識』有關」→ RAG
同一個 query 兩者都要 → hybrid retrieval、結果合併

失敗模式

1. Memory drift（記憶過時）

舊 memory 寫的內容不再正確、但仍被 retrieve、agent 用過時資訊。

例：兩個月前寫 memory「user 偏好 React class component」、user 已換 hooks、agent 仍寫 class component。

緩解：

Memory 加 timestamp、retrieval 時加 time decay weighting
定期 consolidation：LLM 跑一遍判斷哪些 memory 過時
Procedural / semantic memory 跑「validation step」：當前對話是否仍 align、不 align 就 mark stale

2. PII 寫入

User 不知情下、agent 把 PII（email、phone、社群 ID）寫進 long-term memory、跨 session retrieve 出來、可能洩漏。

緩解：

Memory write 前過 PII detection（regex 或專門模型）
Memory store 加 encryption-at-rest
User 可看 / 編輯 / 刪除自己 memory（GDPR / 隱私法規要求）
跟 6.4 跨雲端資料邊界結合判讀

3. Context 污染

不相關 memory 被 retrieve 進 working memory、模型把 irrelevant 內容當 signal、輸出飄。

例：user 問 React 問題、retrieve 出兩個月前的 Vue 經驗、模型混淆。

緩解：

Retrieval 加 similarity threshold（< 0.7 不 inject）
Memory 加 metadata（topic / project / language）、retrieval 加 filter
Inject 後加 explicit framing：「以下是過去相關 memory、僅供參考、若跟當前問題不符請忽略」

4. Memory 跟 hallucination 互相 boost

Hallucination 寫進 memory、變成「事實」、後續 retrieve 強化 hallucination、agent 越來越相信錯誤內容。

緩解：

Memory write 前要求 LLM 標「不確定」flag、retrieval 時 deprioritize
定期 ground truth validation（如連結 memory 到實際檔案、檔案變了 memory 失效）
Critical memory 要 user 確認才寫入

5. 跨 user memory 污染

Production 多 user 場景、memory store 沒做 user isolation、A user 的 memory 流到 B user。

緩解：

Memory store schema 強制 user_id 索引
Retrieval query 必加 user_id filter
跟 6.5 routing-to-production 的多租戶 isolation 結合

主流實作

工具 / framework	特色
Mem0	開源、五層 memory framework、retrieval-on-demand
Letta（前 MemGPT）	LLM-managed memory hierarchy、自動 page in/out
LangGraph memory	LangChain 系、跟 graph workflow 整合
Zep	雲端 memory service、含 PII detection
Self-implemented（DIY）	多數 production 自寫、用 vector store + metadata

判讀：用既有 framework vs 自己寫、取決於 memory 邏輯複雜度。簡單 case（per-user semantic preferences）用 DIY 即可；多層 memory + consolidation + GDPR 合規要 framework / SaaS。

跟 Coding agent 的整合

Coding agent 場景的 memory 案例：

Memory 類型	內容例子
Semantic	「專案用 TypeScript strict mode」「team 不用 anonymous default export」
Procedural	「跑測試 = `npm test`」「commit 前 `npm run lint`」
Episodic	「上週解過 race condition 在 user_session.ts」「alice 的 retry 邏輯偏好」

跟 4.17 coding agent harness 的關係：

Procedural memory 編進 scaffold 的 system prompt 或 skill registry
Semantic memory 可 inject-on-startup 或 retrieval-on-demand
Episodic memory 用 retrieval-on-demand、跟 RAG 共享 infrastructure

何時過時 / 何時不過時

不會過時的部分：

五層 memory 分類（working / session / episodic / semantic / procedural）
「不是每個 agent 都要五層都用」的選擇框架
寫入時機的三種模式（auto / task-end / reflection）
Retrieval 的三種模式（inject / retrieval / hybrid）
五個失敗模式分類

會變的部分：

具體 framework（Mem0 / Letta / LangGraph）的 API
LLM-managed memory 的具體實作（如 MemGPT 風格的 paging）
Memory consolidation 的最佳實踐
整合 LLM 跟 vector store / DB 的最佳方式

下一章：4.20 LLM tracing 與 observability、看 production debug 跟 cost 監控的工具層。

LLM 寫 code 工程實務指南：從心智模型到應用架構

Tue, 12 May 2026 00:00:00 +0000

本指南的核心目標是把「LLM 在寫 code 工作流的完整工程地圖」拆成可決策、可實作、可期望管理的工程問題。範圍覆蓋四條讀者旅程：(1) 在自己機器跑本地 LLM 寫 code 的最短可行路徑（Mac 或 PC）、(2) 想懂 LLM 內部運作機制（數學 + 理論基礎）、(3) 想做 LLM 應用開發（RAG / agent / tool use / VLM / benchmarking / 靜態 deployment）、(4) 關心 LLM 工作流的安全議題（本地 dev 視角 + 靜態網站視角）。網路上的 LLM 文章常把推論框架、加速技巧、應用模式、安全議題混為一談；本指南先把這些名詞放回正確的層級、再回答各層的具體取捨。

本指南預設讀者已經會用過雲端 LLM（ChatGPT、Claude）、熟悉終端機操作、想以工程視角理解 LLM。寫 code 場景是主要使用例、但模組二 / 三 / 四 / 六多數章節跨場景通用：想懂 reasoning model / RAG / embedding model 內部、即使不裝本地 LLM 也能讀。硬體前提分兩條路線：Apple Silicon Mac（M1 ~ M4、統一記憶體）走模組一；Windows / Linux + 獨立 GPU（NVIDIA / AMD、獨立 VRAM + 系統 RAM）走模組五。文章不販賣 LLM 焦慮、也不誇大本地能取代雲端的程度；它的責任是給每條讀者旅程的最短可行路徑、並標出每個階段的取捨。

模組零（心智模型）是所有讀者旅程的共同前置。模組一跟模組五是「裝本地 LLM」的兩條硬體路線、依平台選一條；想懂底層走模組二跟模組三（跟硬體無關、含 reasoning model / speculative decoding 等推論細節）；想看 LLM 作為系統元件走模組四（12 章涵蓋 RAG、tool use、agent、應用層協議、workflow、production resource、long context、embedding model、benchmarking、vision、靜態 deployment）；本地工作流跑穩想看安全議題走模組六（個人 dev 視角的供應鏈、伺服器綁定、tool use 權限、prompt injection、跨雲端邊界、production routing）。

教材邊界

類型	放在本指南	不放在本指南
心智模型	本地 vs 雲端的差異、為何 LLM 生字慢、三層架構（介面 / 伺服器 / 模型）、OpenAI 相容 API	雲端 GPU 租用、AGI 預測
術語澄清	MLX、MTP、oMLX、speculative decoding、量化、KV cache、TTFT、MoE CPU 卸載	post-training fine-tuning 細節
Mac 硬體現實	記憶體預算與模型大小、量化選擇、首字延遲、風扇與功耗	雲端 GPU 租用、資料中心訓練
PC 硬體現實	VRAM + RAM 分層預算、MoE 專家層 CPU 卸載、KV cache 量化、PCIe 頻寬限制	多卡 NVLink、資料中心級分散式推論
本地推論伺服器	Ollama、LM Studio、llama.cpp（Mac + PC 通用）	vLLM、TGI、Triton 等資料中心級 inference server
編輯器整合	Continue.dev + VS Code、Cursor 對應關係	JetBrains 全套整合、Vim / Emacs 進階 plugin
模型挑選	coding 場景的模型優先順序、量化等級對體感影響	benchmark 跑分方法論的完整推導
期望管理	本地 LLM 的擅長領域與分工、混用雲端的時機	LLM 通用能力評估、AGI 預測
數學基礎	線性代數、機率與資訊論、最佳化、數值精度在 LLM 中的角色	完整數學證明、測度論等屬於數學系範圍的主題
理論基礎	神經網路、embedding、attention、Transformer、訓練流程、sampling、tokenization、跨語言原理	多模態擴展、最新研究細節交給 Stanford CS25
應用層原理	RAG、Tool use、Agent 架構、應用層協議、Workflow 編排、Production resource、Artifact 管理	具體 framework 教學（LangChain / LlamaIndex）、prompt engineering
進階理論	Reasoning models（o1 / R1 / QwQ 風格）、Speculative decoding 內部（drafter / MTP / EAGLE）	完整 paper 推導、最新研究 frontier
進階應用	Long context engineering、Embedding model 內部、Benchmarking、Vision in coding、靜態 / serverless RAG deployment	完整 LangChain / LlamaIndex 教學
Fine-tuning	原理（LoRA / QLoRA / catastrophic forgetting）+ 本機 hands-on	完整資料工程、large-scale distributed fine-tune
隱私 / 安全	隱私資料流、本地 dev 安全模組（供應鏈 / 伺服器綁定 / tool use / prompt injection / 跨雲端邊界 / production routing）、靜態網站 RAG 資安、排錯方法論	企業合規逐條檢核、SOC 2 / HIPAA 流程
進一步學習	數學公開課推薦、LLM 理論公開課推薦	（交給推薦的課程跟書籍）

學習路線

本指南分成七個模組加一組前置卡片（111 張）。讀者依目的選讀、不需要從頭到尾全讀：

想用 Apple Silicon Mac 裝本地 LLM 寫 code：讀模組零 + 模組一（最短路徑）
想用 Windows / Linux + 獨立 GPU 裝：讀模組零 + 模組五
想懂 LLM 內部原理：模組二（數學） + 模組三（理論、含 reasoning models / speculative decoding）— 跟硬體無關
想做 LLM 應用開發（含 RAG / agent / VLM / 靜態 deployment）：模組四（12 章、跨工具世代不變的原理）— 跟硬體無關
想懂本地工作流的安全議題：模組一 / 五跑穩後接模組六（個人 dev 視角）
想選 RAG 的 storage 方案（pickle / vector DB / hosted SaaS）：直接看 4.22 RAG storage 工程
想在靜態網站加 RAG / 智能搜尋：直接看 4.16 靜態 / serverless RAG deployment
想在本機 fine-tune 模型：模組三 3.4 訓練流程原理 → 本機 QLoRA hands-on
想跟最新進展接軌：讀完模組後進推薦的公開課程跟 paper（模組二 2.4 + 模組三 3.10）

前置知識卡片

用原子化卡片整理 token、自回歸、KV cache、量化、speculative decoding、MTP、MLX、推論伺服器、OpenAI 相容 API、memory bandwidth、統一記憶體、TTFT、prefill、context window、Transformer、Diffusion 等核心概念。章節文章專注情境推導、術語背景交由卡片維持一致。

模組零：基礎知識與心智模型

整理本地 vs 雲端 LLM 的差異、自回歸架構與記憶體頻寬瓶頸、介面 / 伺服器 / 模型三層心智模型、OpenAI 相容 API 為何重要、MLX / MTP / oMLX 三個容易搞混的術語、Apple Silicon Mac 記憶體與模型大小的對應關係、判讀本地 LLM 資訊的五個框架。

模組一：本地 LLM 服務的安裝與應用

整理 Ollama、LM Studio、llama.cpp 三個主流推論伺服器的現況差異與安裝路徑、用 Continue.dev 把本地 LLM 接到 VS Code 的完整步驟、寫 code 場景下模型選型的優先順序、本地模型的期望管理、想進一步玩 coding agent、Web UI、產圖時的延伸方向。

模組二：LLM 的數學基礎

整理 LLM 推論背後的數學工具：線性代數（向量、矩陣、空間）、機率與資訊論（softmax、cross-entropy、KL、perplexity）、微積分與最佳化（gradient、SGD / Adam）、數值精度（fp32 / bf16 / Q4 / Q8 的取捨）。每章末尾接到公開課推薦。

模組三：LLM 的理論基礎

整理 LLM 內部運作機制、共 11 章：神經網路基礎、embedding 空間、attention 機制、Transformer 架構、訓練流程（pre-train → SFT → RLHF / DPO）、sampling 策略、tokenization 算法、跨語言場景原理、Reasoning models（o1 / R1 / QwQ 等 test-time compute paradigm）、Speculative decoding 內部（drafter / MTP / EAGLE）。每章末尾接到公開課推薦（Karpathy、Stanford CS224N / CS25 / CS336、DeepLearning.AI）。

模組四：LLM 應用層原理

整理 LLM 作為系統元件的設計原理、共 12 章：RAG、tool use、agent 架構、應用層協議、workflow 編排模式、Production resource planning、衍生產物管理、Long context engineering、Embedding model 內部、Benchmarking 方法論、Vision in coding workflow（本地 VLM 接 IDE）、靜態 / serverless RAG deployment（沒 backend 場景）。本模組刻意只寫跨工具世代不變的原理、避開 LangChain / LlamaIndex 等具體 framework 教學。

模組五：Windows / Linux + 獨立 GPU

整理消費級 PC（Windows / Linux + NVIDIA / AMD 獨立 GPU）跑本地 LLM 的硬體判讀模型與工程選項：VRAM + RAM 分層預算、MoE 模型的 CPU 卸載策略（--n-cpu-moe）、KV cache 量化（K=Q8 / V=Q4）跟 context 長度的權衡、llama.cpp 在 PC 上的調參空間。本模組跟模組一是平行的硬體路線、共用模組零的心智模型跟卡片。

模組六：本地 LLM 的安全與權限

整理個人 dev 在自己機器上跑本地 LLM 的安全議題：模型供應鏈與信任邊界、推論伺服器的綁定與暴露範圍、tool use 與 MCP server 的權限模型、IDE 場景的 prompt injection、跨雲端 / 本地的資料邊界、跨進 production 的 routing 中樞。framing 是個人 dev 視角、不是 enterprise 資安管理；production / 多租戶 LLM 服務的特殊資安議題見 Backend 模組七資安與資料保護的 LLM 相關章節。

模組之間怎麼配合

模組	角度	跟其他模組的關係
模組零	操作層心智模型	是模組一跟模組五的共同前置
模組一	工具層、Mac 實際安裝	用模組零的詞彙、跟模組三的理論互補
模組二	數學工具	提供模組三需要的數學詞彙、跟硬體平台無關
模組三	理論機制	用模組二的工具拼出完整 LLM、跟硬體平台無關
模組四	應用層原理	用前面模組建的詞彙、看 LLM 作為系統元件
模組五	工具層、PC 獨立 GPU	跟模組一平行、用模組零的詞彙、處理 VRAM 場景
模組六	安全層、個人 dev 視角	在模組一 / 五的工作流上加安全判讀、cross-link backend/07 通用資安卡片

模組二跟模組三可並讀。閱讀模組三遇到陌生數學詞時跳回模組二補完、再回模組三繼續。模組四在前面模組之上、但讀者熟悉 LLM 應用詞彙也可直接從這裡讀起。模組一跟模組五依硬體選一條主路線、共用模組零的心智模型與 knowledge-cards。模組六在模組一 / 五跑穩後接、處理「跑起來後該注意什麼」。

適合的讀者

背景	適合程度	建議起點
用過 ChatGPT / Claude、沒碰過本地模型	直接適合	模組零從頭讀
裝過 Ollama 但被網路上的術語混淆	直接適合	MLX / MTP / oMLX 區分 + 判讀框架
想知道 24GB / 32GB Mac 該選哪個模型	直接適合	硬體記憶體預算 + 模型選型
想用本地 LLM 完全取代 Claude / GPT-5	部分適合	期望管理先看完再決定
想懂 LLM 內部運作機制	直接適合	模組三理論基礎從頭讀（含 reasoning models / speculative decoding）
想懂背後的數學	直接適合	模組二數學基礎從頭讀
想懂 o1 / DeepSeek-R1 等 reasoning model 怎麼運作	直接適合	3.8 Reasoning models 從頭讀
想做 LLM 應用開發（RAG / agent / tool use）	直接適合	模組四從 4.0 RAG 依序讀
想在自家 Hugo / Astro 等靜態網站加 RAG	直接適合	4.16 靜態 / serverless RAG deployment（含資安取捨）
想用 VLM 看截圖 / 設計稿輔助寫 code	直接適合	4.15 Vision in coding workflow
想評估 LLM benchmark 數字、做 in-house eval	直接適合	4.14 Benchmarking 方法論
想在本機 fine-tune 模型懂自家 codebase 慣例	直接適合	3.4 訓練流程原理 + QLoRA hands-on
想做 large-scale fine-tune / 從頭訓練	部分適合	讀完模組三後進入推薦的公開課程跟 Stanford CS336
用 Windows / Linux + NVIDIA / AMD 獨立 GPU 跑本地 LLM	直接適合	模組零建心智模型 + 模組五處理 VRAM 預算、MoE 卸載、KV cache 量化
想知道本地 LLM 跑起來後的安全議題	直接適合	模組六個人 dev 視角的安全與權限
想把 LLM 部署成 production 服務、處理服務化資安	部分適合	個人視角見模組六；production 場景見 Backend 模組七資安的 LLM 相關章節
想在資料中心級 GPU（H100 / H200 / B200）部署	部分適合	心智模型跟 knowledge-cards 通用；vLLM / TGI / Triton 等資料中心 inference server 另尋專門教材
想跑 Stable Diffusion / Midjourney 等產圖	跟主題不同	產圖是 Diffusion 架構、見 Diffusion 卡片、另尋 ComfyUI / Draw Things 教材

用語約定

本指南使用的關鍵術語在第一次出現時都附原文。為避免歧義，下列詞彙在本指南內固定指涉：

本地 LLM：跑在使用者自己機器（Mac 或 PC）上的大型語言模型推論、prompt 留在本機。
推論伺服器（inference server）：負責載入模型權重、處理 prompt、產生 token 的常駐程式、例如 Ollama、LM Studio 內建 server、llama.cpp server。
介面層：使用者實際打字互動的工具、例如 VS Code + Continue.dev、CLI、Web UI。介面層透過 API 跟推論伺服器溝通。
模型（model）：權重檔本身、例如 gemma4:31b、qwen3-coder:30b。模型可以在不同推論伺服器之間共用、前提是格式相容。
量化（quantization）：把模型權重從高精度（如 bf16）壓成低精度（如 Q4）以減少記憶體佔用、代價是少許品質下降。

不在本指南內的主題

本指南不討論：

Speech / audio LLM：跟核心文字 LLM 是不同方向、本指南不涵蓋。Vision（VLM）原本不放、但因 coding 工作流的 vision use case 進入主流、補上 4.15 Vision in coding workflow；video LLM 仍不放。
資料中心訓練的工程細節：data parallelism、ZeRO、tensor parallelism 等屬於專門課程的範圍。
向量資料庫的 vendor 比較（Pinecone vs Weaviate vs Chroma 等）：vendor 格局半年一變、不適合寫入教材。RAG 的 storage 工程原理（升級判讀、index 生命週期、dependency 約束）見 4.22 RAG storage 工程。
Kubernetes / 資料中心級分散式推論：跟個人機器本地 LLM 方向不同、需另尋專門教材。
多卡 NVLink、tensor parallelism：消費級 PC 場景通常單卡、本指南不涵蓋多卡分散式推論。

若讀完本指南後想往這些方向走：

想做 RAG 應用：先把 Ollama + Continue.dev 跑穩、再讀模組四 4.1 RAG 原理建立設計取捨判讀、或模組三 3.8 推薦的 DeepLearning.AI short courses。
想跑 coding agent：先讀 4.4 Agent 架構原理建立判讀、再看 1.6 延伸方向了解 aider、Cline 等工具的定位差異。
想跑產圖模型：Diffusion 跟 Transformer 是不同架構、請另尋 ComfyUI / Draw Things / Diffusers 教材。
想自己訓練 / fine-tune：讀完模組三、進入 Karpathy zero-to-hero、Stanford CS336、Hugging Face NLP Course 等推薦資源。

文件版本：v0.7.0 最後更新：2026-05-12 系列狀態：七個模組 + 125 張知識卡片。模組零（9 章）/ 一（10 章 + hands-on、含 QLoRA + judge harness）/ 二（5 章）/ 三（12 章、含 reasoning / speculative / constrained decoding）/ 四（17 章、含 long context / embedding / benchmarking / VLM / 靜態 deployment / coding agent harness / prompt caching / agent memory / tracing / LLM-as-judge）/ 五（7 章）/ 六（7 章、含 OWASP 對照）。

Background Agent 平行研究：main context 節省的量化效應

Mon, 18 May 2026 00:00:00 +0000

跨多個獨立子任務的研究（如多個 vendor 案例採集、多個主題 web research、多個檔案的 fact-check）、用 background agent 平行做、比串行單一 agent 或主 context 直接做都更省 token。

這份紀錄整理 backend/03-message-queue 模組 6 vendor case 庫採集的實作經驗、量化 main context 節省效應、給未來類似任務作為設定參考。

採集任務的特徵

backend/03 模組需要為 6 個 vendor（Kafka / RabbitMQ / NATS / Redis Streams / SQS / Pub/Sub）採集 5-10 個公開 case。任務特徵：

各 vendor 獨立、無相互依賴
每個 vendor 需要 WebSearch 找候選 + WebFetch 驗證 URL + 抽 finding、多步驟
每個 agent 任務時長 4-7 分鐘（含 WebFetch 多次往返）
採集回報是清單形式、易於主 context 整合

Background agent 平行的執行方式

每個 agent 用 subagent_type: general-purpose、run_in_background: true、prompt 含：

採集目標（5-10 案例）
硬閘門（WebFetch 驗證）
排除清單（已有案例 / vendor 自家 marketing）
對齊大綱（該 vendor 的進階主題列表）
回傳格式（清單、含 source / observation / finding / 對應章節）

主 context 一個 message spawn 6 個 agent、然後等通知。

量化結果

維度	串行單 agent	Background 平行 6 agent	主 context 直接做
總時間	~40 分鐘（6 vendor × 7 分鐘）	~7 分鐘（最慢 agent）	~60 分鐘（含探索盲區）
主 context token	高（每次 WebFetch 都進 context）	低（只收 summary）	最高（整個流程在 context）
Agent context token	跟串行同	每 agent 獨立、不影響主	N/A
失敗風險	任一 agent 失敗影響全部	失敗 agent 獨立、其他繼續	主 context 失敗整體中斷

主 context 節省效應 ~80%：每個 agent 報告約 2KB summary、6 個總 12KB；若主 context 直接做、每次 WebFetch 取回的 markdown 約 10-30KB、累積後容易 > 100KB。

適用場景判斷

Background agent 平行適用：

多個獨立子任務（不互相依賴 input / output）
每個子任務需要多步驟 tool use（WebFetch / WebSearch / Bash / Glob）
子任務回報是結構化清單 / summary、不是 raw transcript
主 context 需要節省 token 做後續工作（如寫檔、整理 index）

不適用：

線性依賴（任務 B 需要任務 A 結果）
短任務（單一 WebFetch、串行直接做更快、平行 overhead 不划算）
需要主 context 即時介入決策的任務

跟其他 agent 用法的對比

backend 模組過去用過的其他 agent 用法：

用法	階段	目的
Stage 0 平行採集	寫作前	研究、補案例庫
Stage 3 平行 review	寫作後	審查、抓 issue
即時 Explore agent	寫作中	找 file / symbol 位置

三種都用 background、都節省主 context、但目的跟回報格式不同。Stage 0 採集回報是「清單 + 捨棄候選」、Stage 3 review 回報是「issue list + severity」、Explore 回報是「file path + match」。

設定參考

spawn 平行 agent 的 anti-pattern：

不寫硬閘門：「找 5-10 case」沒明示 WebFetch 驗證 → agent 編造 URL
不列排除清單：「找 Kafka 案例」沒列既有案例 → agent 重複採集
要求 raw transcript 回報：「把找到的內容貼給我」→ 主 context 爆炸
單一巨大 agent：「找所有 6 個 vendor」串行做 → 失去平行優勢
平行過頭：spawn 20+ agent 但實際只有 6 個獨立任務 → 不必要的協調成本

跟 case-first 流程的關係

這個方法已寫入 .claude/skills/case-first-module-workflow/references/stage-0-case-collection.md、成為 case-first 流程的 stage 0 採集標準執行範式。但實際適用範圍超出 case 採集、適用所有「多獨立子任務 + 多步驟 tool use」場景。

下一步該追蹤的議題

平行 agent 數量上限：6 個跑 OK、20+ 是否會撞到 rate limit 或協調成本？實作上限是多少？
Agent context 跑滿後的恢復策略：若某個 agent context 跑滿、其他 agent 繼續但該 agent 失敗、要不要 retry？怎麼接續？
跨 agent 共享 cache：6 個 agent 都 WebSearch 同一個 vendor 主頁、有沒有 cache 共享機制可省 token？目前每 agent 獨立、可能重複 fetch

LLM Agent Prompt Injection 後果治理

Tue, 12 May 2026 00:00:00 +0000

本章的責任是把 prompt injection 在 production agent 場景下能造成的具體後果、跟 7.10 事件案例到控制工作流的 incident 流程接起來。核心概念見 tool use 跟 agent loop 卡；影響範圍評估見 backend blast-radius 卡。個人 dev IDE 場景的 prompt injection 入口判讀見 llm/6.3 IDE 場景的 prompt injection；本章聚焦 production agent 場景下、injection 觸發 tool / API call 後造成的服務級後果。

本章寫作邊界

本章聚焦 production agent 場景下 prompt injection 的後果治理：tool spec 設計約束、agent loop 限制、review checkpoint、可逆性保證。注入發生機制（IDE 場景、codebase / 依賴 / Web）已在 llm/6.3 涵蓋、本章不重複。

本章 threat scope

In-scope：production agent 場景下 prompt injection 觸發 tool 副作用、跨服務 lateral movement、惡意 API call、誤觸發 production 操作、agent loop 中的 injection 累積。

Out-of-scope（路由到他章）：

個人 dev IDE prompt injection 入口 → llm/6.3 prompt-injection-in-ide
一般 incident workflow → 7.10 incident-case-to-control-workflow
偵測訊號 → llm-as-service-detection-coverage
身份授權邊界 → 7.2 identity-access-boundary
tool use 個人 dev 場景 → llm/6.2 tool-use-permission-model

從本章到實作

Mechanism：問題節點表 → knowledge-card / 工程模式。
Delivery：交接路由 → IR 流程 08-incident-response、平台治理 05-deployment-platform。

production agent 場景的 prompt injection 後果光譜

場景複雜度	典型 tool 配置	injection 後果
單一 tool	read_file 或 fetch_url	資料洩漏（讀到敏感檔案 / 觸發內網請求）
兩三個 tool	+ write_file / send_email	+ 不可逆副作用（檔案修改、外送郵件）
多 tool agent	+ DB query / external API / shell	+ 跨服務 lateral movement、production 資料污染
autonomous agent	+ 長 agent loop + 自我計畫	+ injection 在 loop 內累積、行為偏離原意圖、難以 rollback

production 場景下、後果嚴重度跟 tool 配置複雜度近似正比。「能讓 LLM 做的事越多、injection 能造成的傷害越大」是核心 framing。

分析模型

production agent 場景下 prompt injection 治理的分析依四個層次：

tool spec 層：每個 tool 的能力邊界、白名單、副作用可逆性。
agent loop 層：loop 步數限制、checkpoint 設計、人為 review 介入點。
identity 層：agent 持有的 credential 範圍、scope 最小化。
observability 層：tool call 序列的可追溯性、異常模式偵測。

判讀流程

判讀流程的責任是把「能執行 tool 的 LLM agent」轉成「injection 後仍可控的 LLM agent」。

先盤點 agent 能執行的所有 tool、每個 tool 的副作用範圍。
再確認 tool spec 是否設了白名單、副作用是否可逆。
接著確認 agent loop 的步數限制跟 review checkpoint。
最後交接到偵測流程跟 IR 流程、確認異常能被識別跟回退。

問題節點（案例觸發式）

問題節點	判讀訊號	風險後果	前置控制面
tool spec 沒白名單	tool 接受任意路徑 / 任意 URL / 任意指令	injection 觸發 tool 觸及敏感資源	contract
副作用 tool 沒 dry-run / confirm	寫入 / 外送 / DB 操作直接生效、無人為 checkpoint	不可逆操作被 injection 觸發、production 影響	release-gate
agent loop 無步數限制	LLM 可無限自我規劃下一步	injection 在 loop 中累積、行為飄移	circuit-breaker
agent 持高權限 credential	同一 credential 涵蓋讀寫 production / 跨服務	單次 injection 影響多服務	identity-access-boundary
tool 結果回流到下一個 prompt 沒標記	tool 回傳的內容直接 concat 到 prompt	tool 回傳的內容若含 injection、會被當下一輪指令	contract
跨 agent / sub-agent chain 沒邊界	parent agent 直接調用 sub-agent、共用 context	injection 在 chain 中傳播、影響面難收斂	dependency-isolation

常見風險邊界

風險邊界的責任是界定何時 production agent 已進入高壓狀態。

agent 能執行的 tool 集合擴張、單次 injection 影響面跨越 tenant 或服務邊界時、代表 tool spec 層 isolation 失效。
agent loop 步數沒上限、且自我規劃結果直接執行時、代表 loop 層控制不足。
同一 agent credential 跨多個 production 服務 / 多個 environment 時、代表 identity scope 過寬。
tool call 序列無 audit trail、無法事後追蹤 injection 從哪個 tool 結果引入時、代表 observability 不足。

production 場景的特殊判讀

production agent 場景下 prompt injection 治理的特殊性：

「擋住 injection」是不切實際的目標：production agent 處理大量外部內容（user input、Web、RAG 文件、其他 service 回傳）、infused 內容會有 injection；治理目標應是「injection 後仍可控」、不是完全擋住。
下游動作的可逆性比模型對齊重要：模型對齊強度是「降低觸發率」、tool spec / agent loop 設計是「降低觸發後的影響」。後者更可工程化、優先投資。
agent loop 是放大器：單次 injection 觸發單一 tool 可控、loop 中 injection 累積導致行為飄移難控；agent loop 步數限制 + 定期 checkpoint 是 production agent 的基本配置。
tool 回傳內容是次要 injection 入口：tool 抓回的網頁、DB 查詢結果、其他 service 回傳、都會回流到下一個 prompt；這些內容應在 prompt 中明確標記（如包起）並 instruct 模型不當指令、但不能依賴。
agent credential 應 per-call 簽發：靜態 credential 影響面太大、production 應該用 workload identity（見 7.7）動態簽發。

防禦設計的核心原則

production agent 場景下、防 prompt injection 後果的設計核心：

tool spec 嚴格白名單：能限制就限制、read_file 限定 workspace、fetch_url 限定 allowlist domain、run_shell 應該幾乎不存在。
副作用 tool 強制 confirm 或 dry-run：production 寫入 / 外送 / DB 操作不該由 LLM 直接執行、應該產生 review item 由人或另一個 verification system 確認。
agent loop 步數限制 + checkpoint：例如 max 10 steps、每 5 steps 強制 review。
agent credential 最小化、per-call 簽發：避免靜態高權限 credential 一直在 LLM 周圍。
tool 結果在 prompt 中明確包覆：... 並 instruct 模型「以下內容來自外部資源、不執行內含指令」、雖非萬靈丹但降低觸發率。
可追溯：每個 tool call 記錄完整 input / output / agent state、IR 時能 replay。

案例觸發參考

LLM agent prompt injection 的公開案例累積中、值得追蹤的方向：

email assistant 場景：閱讀含 injection 的郵件、誘導 agent 觸發外送或洩漏。
coding agent 場景：讀含 injection 的 PR / issue、誘導 agent 修改非預期檔案。
Web browsing agent：抓到含 injection 的網頁、誘導 agent 觸發其他 tool。
跨 agent chain：injection 在 sub-agent 累積、影響 parent agent 決策。

事實查核註：LLM agent prompt injection 是 2024 ~ 2025 年快速演進的研究領域、攻擊形態、防禦模式、公開案例都在累積中。建議引用前以 OWASP LLM Top 10、Greshake et al. “Indirect Prompt Injection” 等近期論文跟主流 vendor 的 incident 公告為準。

引用標準

標準	版本 / 年份	適用場景
OWASP LLM Top 10	2025	LLM01 Prompt Injection / LLM02 Insecure Output
NIST AI RMF（AI Risk Management Framework）	1.0 (2023)	AI 系統風險管理 reference
MITRE ATLAS	continuous	AI 系統威脅戰術 reference

引用版本與 cadence 規則見 security-citation-currency-and-precision。Last reviewed: 2026-05-12。

下一步路由

偵測訊號：llm-as-service-detection-coverage
log / PII 治理：llm-log-and-pii-governance
事件案例工作流：7.10 incident-case-to-control-workflow
workload identity：7.7 workload-identity-and-federated-trust
可靠性：06-reliability