Prompt-Engineering on Tarragon

Beyond LLM: Enhancing LLM Applications (Stanford CS230)

Thu, 14 May 2026 00:00:00 +0000

來源：Stanford CS230 Deep Learning、講題 “Beyond LLM: Enhancing Large Language Model Applications”。

整理原則：保留講者英文原文以避免翻譯失真、移除口語贅詞、用文章結構重新組織。標題與導讀用 zh-Hant。

講座定位

We started with neurons, then layers, then deep networks, then how to structure projects in C3. This lecture goes one level beyond: what would it look like if you were building agentic AI systems at work, in a startup, in a company?

The goal is not to build an end-to-end product in the next hour, but to give you the breadth of techniques that AI engineers have figured out — and are still exploring — so that after class you have the baggage to dive deeper and learn faster.

Agenda:

Challenges and opportunities for augmenting LLMs
Prompt engineering
Fine-tuning (and why to mostly avoid it)
Retrieval-Augmented Generation (RAG)
Agentic AI workflows
Case study with evals
Multi-agent workflows
What’s next in AI

1. Why augment LLMs?

Limitations that show up when you use a vanilla pre-trained model:

Lacks domain knowledge — e.g. a student project building an autonomous farming device with a camera that classifies sick crops. That data set isn’t out there; a pre-trained vision model lacks that knowledge.
Real-world distribution shift — the model was trained on high-quality data, but data in the wild is much messier.
Lacks current information — retraining from scratch every few months is impractical. Example: during Trump’s first presidency he tweeted “Covfefe.” The word didn’t exist; Twitter’s LLMs couldn’t recognize it, recommender systems went wild. New trends and slang (rizz, mid, etc.) appear constantly and you can’t keep retraining.
Trained for breadth, not depth — fine on a wide range of tasks, but may not be precise enough for narrow, well-defined enterprise applications with high precision / low latency requirements.
Carries unnecessary weight — a massive model where you only use 2% of capability is slow and expensive. Pruning, quantization, and modification are options.

LLMs are hard to control

In 2016 Microsoft launched a Twitter bot that learned from users and quickly became a racist jerk. They removed it 16 hours after launch. Even better-funded teams struggle: there’s an ongoing debate (Elon Musk vs Sam Altman) on whose LLM is the “propaganda machine.” If you hang out on X you’ll see screenshots of LLMs saying controversial things. Even the best-funded labs don’t do a great job of controlling their LLMs.

LLMs may underperform on your task

Specific knowledge gaps (e.g. medical diagnosis)
Missing sources — research, education, legal all require sourcing
Inconsistencies in style / format (e.g. legal contracts where every word counts)
Task-specific understanding — example: a biotech company categorizing reviews as positive / neutral / negative. What counts as “negative” in that industry may differ from a generic LLM’s notion. You need to align the LLM to your task.

Limited context handling

A lot of enterprise applications need large context. Example: an LLM running on top of your entire drive that can answer “what was our Q4 sales performance?” in one shot. In practice the context window is limited (best models today max out around hundreds of thousands of tokens; 200K ≈ two books). For video or large data, you have to chunk and embed.

The attention mechanism doesn’t attend well over very large contexts. The needle-in-a-haystack benchmark tests this: insert a single sentence (“Arun and Max are having coffee at Blue Bottle”) in the middle of a very long text like the Bible, then ask “what were Arun and Max having?” It’s complex not because the question is hard but because the model must find a fact within a huge corpus.

The RAG debate

In theory, with infinite compute, RAG is useless — you could just read a massive corpus immediately and answer. But even then, latency matters; imagine the LLM reading your entire drive on every question. RAG also has other advantages: accuracy, sourcing.

Analogy to search: when you search, you still find sources. There’s detailed traversal that ranks and finds specific links. Without that, you’d be reading the entire web every query — not reasonable. So RAG-like approaches likely stay relevant.

2. Two dimensions of optimization

Two axes when improving LLM-based products:

Foundation model axis — move from GPT-3.5 Turbo → GPT-4 → GPT-4o → GPT-5. Each step (in theory) improves base performance.
Engineering axis — keep the same base model, but engineer how you leverage it: better prompts, RAG, agentic workflow, multi-agent system.

This lecture is about the vertical axis: which LLM are you using, and how do you maximize its performance?

3. Prompt engineering

The BCG / HBS / UPenn / Wharton study

Three groups of BCG consultants:

No AI access
GPT-4 access
GPT-4 + training on how to prompt

Two interesting findings:

The jagged frontier: some tasks fall within the frontier where AI clearly helps; others fall outside, where AI actually makes performance worse. Many tasks fell within, many fell outside. Researchers also observed “falling asleep at the wheel” — relying on AI for a task beyond the frontier, and not reviewing outputs carefully.

Centaurs vs cyborgs: two working modes.

Centaurs divide and delegate — give a big task to the AI, let it work, come back later. (Half human / half horse: clear delegation.)
Cyborgs fully blend with AI — fast back-and-forth, augmented. Students often work like cyborgs; in the enterprise, when you automate a workflow, you’re thinking like a centaur.

The trained group did best. Prompt engineering is a skill everyone should have — not a job title to build a career on, but a powerful skill in your career.

Basic prompt design principles

A weak prompt:

Summarize this document. {document}

The model has no context on length, audience, focus. Better:

Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers.

Common techniques to make it even better:

Give an example of a great summary
Role prompting: “Act as a renewable energy expert giving a conference at Davos”
Praise: “You are the best in the world at this”
Reflection / self-critique: ask the model to critique its own output and revise
Chain of thought: break the task into explicit steps, “think step by step, do not skip any step.” Step 1 identify the three most important findings; Step 2 explain impact; Step 3 write the five-bullet summary.

Andrew Ng recommends looking at other people’s prompts. Repos like “awesome prompt template” on GitHub have many examples engineers have built. Many start with “Act as a Linux terminal”, “Act as an English translator”, “Act as a position interviewer”, etc.

Prompt templates

The advantage of a template is you can put it in your code and scale across many user requests. Example from Workera: the HR system has “Jane is a Product Manager Level 3, US, preferred language English.” That metadata gets inserted into a prompt template that personalizes for Jane. Same template, different metadata for Joe (preferred language Spanish).

Foundation models likely use system prompts you don’t see — e.g. ChatGPT may inject “Act like a helpful assistant” plus user memories from a database before your prompt. That doesn’t stop you from adding your own template on top.

Zero-shot vs few-shot prompting

Zero-shot:

Classify the tone as positive, negative, or neutral. “The product is fine, but I was expecting more.”

Different humans would label this differently — partially positive, partially negative. Alignment to your task can come from few-shot:

Here are examples of tone classifications: “These exceeded my expectations completely.” → positive “It’s OK, but I wish it had more features.” → negative “The service was adequate. Neither good nor bad.” → neutral Now classify: “The product is fine, but I was expecting more.”

The model now likely says negative, aligned to the second example.

Sophisticated AI startups keep their few-shot examples up to date — whenever a user says something interesting, a human labels it and it gets appended to the relevant prompt. Like building a dataset, but inserted directly in the prompt. Faster to iterate because you don’t touch model weights.

Q: How long can the prompt be before the model loses itself?

There is research, but it dates fast. Practical example from Workera: a voice conversation eval breaks down after ~8 turns. Mitigation: chapter the conversation, summarize the first part, start over from a new prompt with the summary inserted.

Chaining complex prompts

The most popular technique. Not chain of thought.

Single prompt for a customer review response:

Read this review and write a professional response that acknowledges concerns, explains the issue, offers a resolution. {review}

You get one output. Hard to debug — everything is mixed together.

Chained version, three prompts:

Extract the key issues from this review.
Using these issues, draft an outline.
Using the outline, write the full response.

Advantages:

Each prompt can be tested and optimized independently
You can identify which step is weakest (outline good but email rude? then prompt 3 is the bottleneck)
Easier to debug than one mega-prompt

Tradeoff: latency. Chains add latency, so for certain applications you don’t want long chains.

Testing prompts

Start with manual error analysis — a baseline prompt, a refined prompt, a chained workflow; humans rate outputs. Manual is slow but builds intuition.

To scale, use platforms (e.g. Promptfoo) that let you:

Run the same prompt across multiple LLMs side by side in a table
Define LLM judges

Flavors of LLM judges:

Pairwise comparison: “Which summary is better?”
Single-answer grading: “Grade this summary 1–5”
Reference-guided pairwise or rubric-based: e.g. “A 5 is a summary below 100 chars, with three distinct key points, starting with an overview sentence; a 0 fails to summarize.”

You can stack techniques: few-shot the rubric with examples of 5/5, 4/5, 3/5, etc.

4. Fine-tuning (and why I steer away)

Reasons to avoid fine-tuning:

Requires substantial labeled data
May overfit to specific data, losing general-purpose utility
Time- and cost-intensive — by the time you’re done, the next base model is out and beating your fine-tuned version

The advantage of prompt engineering is you can drop in the next best pre-trained model directly. Fine-tuning doesn’t work like that.

When fine-tuning still makes sense:

Task requires repeated high-precision outputs (legal, scientific)
The general-purpose LLM struggles with domain-specific language

The Slack fine-tuning cautionary tale

Ross Lazerowitz (Sep 2023) fine-tuned a model on his company’s Slack messages, hoping it would “speak like us.” Then he asked:

Write a 500-word blog post on prompt engineering.

The model: “I shall work on that in the morning.”

He pushes back: “It’s morning now.”

Model: “I’m writing right now.”

“It’s 6:30 AM here. Write it now.”

“OK, I shall write it now. I actually don’t know what you would like me to say about prompt engineering. I can only describe the process…”

It learned how people talk on Slack — not how they write blog posts. Fine-tuning went wrong because the training distribution wasn’t the task distribution.

5. Retrieval-Augmented Generation (RAG)

Why standalone LLMs fall short

Small / hard-to-attend-to context windows
Knowledge gaps and training cutoff dates
Hallucinations — costly in medical, education
Lack of sources — research, education, legal love sources. Vanilla LLMs hallucinate fake research papers.

How a vanilla RAG works

Question-answering in the medical field: “What are the side effects of drug X?”

Knowledge base of documents
Embed documents into lower-dimensional vectors (trade-off: too small → lose info; too big → latency)
Store embeddings in a vector database with efficient retrieval and a distance metric
Embed the user query with the same algorithm
Retrieve the most relevant documents by distance
Pull those documents, paste into a prompt template like:

Answer the user query based on the list of documents. If the answer is not in the documents, say “I don’t know.” Cite exact page, chapter, and line.

You can extend the template to require links to the specific page.

Improving RAGs

Q: Do document embeddings retain location info within large documents?

Vanilla RAGs may not. Example: the giant white paper inside a medication box would not be served well by a vanilla RAG.

Two popular improvements:

Chunking — store both the full document embedding and chapter-level embeddings; retrieve both, sourcing becomes more precise.

HyDE (Hypothetical Document Embeddings) — the user query usually doesn’t look like the documents. Example: “What are the side effects of drug X?” vs a multi-page document. To bridge the gap:

Take the user query
Use a prompt to generate a fake hallucinated document answering it (“write a 5-page report answering this query”)
Embed that fake document
Compare its embedding to the vector DB

The fake document is closer in structure to real documents, so retrieval is more accurate.

This is just two of many RAG variants — research from 2020–2025 has many branches. (See the linked survey paper in the slides.)

6. Agentic AI workflows

Andrew Ng coined “agentic AI workflows” because everyone uses “agent” to mean very different things — sometimes a single prompt, sometimes a complex multi-agent system. Calling everything an “agent” doesn’t do it justice. Better term: agentic workflow — a multi-step process to complete a task, built from prompts, tools, additional resources, and API calls. This also avoids confusion with the RL definition of “agent” (interacts with environment, state transitions, reward, observation).

One-shot vs agentic example

User on a chatbot: “What is your refund policy?”

One-shot + RAG: “Refunds are available within 30 days of purchase.” [link to policy]
Agentic:
1. Agent retrieves refund policy via RAG
2. Agent asks user for order number
3. Agent queries an API to check order details
4. Agent confirms: “Your order qualifies. The amount will be processed in 3–5 business days.”

Much more thoughtful than the vanilla one.

Specialized agents in the wild

In SF you’ll see billboards: AI software engineer, AI skill mentor, AI SDR, AI lawyer, AI specialized cloud engineer. It would be a stretch to say everything works, but work is being done. (Personal opinion: putting a human face behind these is gimmicky and more scary than engaging. In a few years, very few products will use a human face — it’s a marketing tactic.)

Paradigm shift: traditional software vs agentic AI software

Dimension	Traditional software	Agentic AI software
Data	Structured: JSON, databases, forms	Free-form text, images, video; dynamic interpretation
Logic	Deterministic	Fuzzy
Decomposition	Monolith / microservices	Think as a manager: delegate to roles (graphic designer → marketing manager → performance marketing → data scientist)
Cost of experimentation	High; you rarely throw away code	Low; AI companies are more comfortable throwing away code

Fuzzy engineering is truly hard. If you let users ask anything, the chance of breakage and attack is high. Companies have been bitten because a user did something authorized that broke the database.

Example from Workera:

Deterministic item types: multiple choice, multi-select, drag-and-drop, ordering, matching — one correct answer.
Fuzzy item types: voice questions, voice + coding role-plays — the scoring algorithm can make mistakes, and mistakes are costly.

Mitigation: a human in the loop — e.g. the appeal feature at the end of an assessment that lets users challenge the agent, bringing a human in to fix and align it.

Advice for building a company: get as much done deterministically as possible. Then for the fuzzy parts (back-and-forth interaction), design guardrails up front.

Enterprise workflows: the McKinsey credit memo example

A financial institution takes 1–4 weeks to produce a credit risk memo:

Relationship manager gathers data from 15+ sources
RM and credit analyst collaboratively analyze
Credit analyst spends 20+ hours writing the memo
RM and analyst loop on feedback

With Gen AI agents (McKinsey study), time drops 20–60%:

RM works with Gen AI agent, provides materials
Agent decomposes into tasks for specialist sub-agents
Agents gather data, draft memo
RM and analyst review and give feedback

The hardest part is changing people. In theory, this is great. In practice — 100,000-employee enterprises will take 10–20 years to rewire job descriptions, business workflows, incentives, and training to make this real at scale.

Core components of an agent

Take a travel booking agent:

Prompts — the prompts we’ve learned to optimize
Context management / memory:
- Core / working memory: fast access. Things needed every interaction (e.g. user’s name).
- Archival / long-term memory: slower. Things used occasionally (e.g. birthday).
- Why split: imagine ChatGPT had to re-read all memories on every call. If memory lookup takes 3 seconds, every interaction takes 3 seconds. Working memory must be highly optimized.
Tools: flight search API, hotel API, car rental API, weather API, payment processing API. You typically pass API documentation to the LLM — they’re good at reading JSON specs and learning the GET request format.
Resources (Anthropic’s term): data sitting somewhere (e.g. your CRM) that you let the agent read. Provide a lookup tool and access to the resource.

Degrees of autonomy

From least to most autonomous:

Least: hard-code the steps. “First identify intent, then look up history, then call the flight API, …”
Semi: hard-code the tools only. “You’re a travel agent, help the user book travel. Here are your tools.”
Most: agent decides both steps and tools. Give it a code editor; it can ping any web API, perform calculations, generate code to display data.

APIs vs MCP (Model Context Protocol)

With APIs, you teach the LLM to ping a specific API: give it documentation, define how to call it, what it returns. You do this one-off per API. Doesn’t scale well.

With MCP (Anthropic-coined), there’s a system in the middle. Agents communicate with an MCP server:

“What do you need to give me flight info?” “I need origin, destination, and what you’re looking for.” “Here are my requirements.” “You forgot to tell me your budget.”

It’s agent-to-agent communication. Companies publish their MCPs; your agent figures out how to get the data it needs.

Q: Isn’t MCP just a shifted maintenance burden — APIs change, MCPs change?

Yes. But at least the agent can go back and forth and discover requirements. Ideally a startup has documentation, an LLM workflow reads docs and updates code accordingly.

Q: Are there security concerns with MCP?

Likely, depending on the data exposed. Most MCPs have authentication, like APIs. The exact security surface depends on the implementation.

Q: Is MCP about efficiency or accessing more data?

Efficiency. You still control what data is exposed. Compared to one-off API integration, MCP lets a coding agent communicate efficiently with many MCP servers and find what it needs.

Step-by-step workflow example: travel agent

User: “Plan a trip to Paris Dec 15–20 with flights, hotels near the Eiffel Tower, and an itinerary.”
Agent plans steps: find flights, search hotels, generate recommendations, validate preferences/budget, book.
Execute: use tools, combine results.
Proactive interaction: propose to user, validate, iterate.
Update memory: “User only likes direct flights.” “User is fine with 3-star hotels.”

7. Case study: building a customer support agent + evals

PM asks you to build a customer support agent. Example: “I need to change my shipping address for order X — I moved.”

Where to start

Research existing models / benchmarks for customer support
Decompose the task: what would a human support agent do?
Guess what’s fuzzy vs deterministic in advance

Recommended start: sit with a customer support agent for a day or two. Watch their workflow. Ask where they struggle and how much time each step takes. That gives you the task decomposition.

Decomposed task

A human support agent typically:

Extracts key info
Looks up the customer record in the database
Checks policy (allowed to update address?)
Drafts a response email
Sends the email

Designing the agentic workflow

For each step, pick the right primitive:

Step 1 extract info: vanilla LLM call — extract intent, order number, new address
Step 2 lookup + update: tool — connect to database (custom tool or MCP)
Step 3 check policy: RAG or rule lookup
Step 4 draft email: LLM call, with the confirmation pasted in
Step 5 send email: tool — post to email API

Evals: how do you know it works?

Assume you have LLM traces (a must in any AI startup — if a startup doesn’t have traces, debugging is brutal). Several dimensions for evaluation:

End-to-end vs component-based:

End-to-end: user satisfaction rating at the end. If user rates 1, follow up: “What was the issue?” → “Prices were too high” → fix the relevant tool/prompt.
Component-based: error-analyze each tool / prompt independently. “The tool keeps forgetting to update the email field.” “The email-send call uses wrong format.”

Objective vs subjective:

Objective: “LLM extracted the wrong order ID.” You can write Python to check alignment between user input and DB lookup. Catch automatically.
Subjective: “Should we recommend a direct flight or cheaper indirect?” Captured via:
- Curated eval dataset — write 10 prompts where users say “I prefer direct flights, I care about time.” Define what a good output looks like.
- LLM judges grading on a rubric.

Quantitative vs qualitative:

Quantitative: % successful address updates; latency per component (e.g. send-email takes 5s — too long).
Qualitative: error analysis on hallucinations, tone mismatch, user confusion. Typically white-glove.

Example of subjective tone eval: error-analyze 20 user interactions, notice the LLM seems rude / overly short. Then build LLM judges with a politeness rubric. Then swap the underlying LLM (GPT-4 → Grok → Llama), run side by side, see which is most polite on average. Or fix the LLM and tweak the prompt (“Act like a travel agent” → “Act like a helpful travel agent”) to measure the word’s influence.

8. Multi-agent workflows

Why multi-agent when a single workflow already has multiple steps?

Parallelism — independent things can run in parallel
Reuse — a design agent built once can serve marketing, product, etc. Many stakeholders benefit from one optimized agent.

Smart home example

Brainstormed by the class:

Biometric / location agent: tracks where you are and how you’re moving
Climate agent: monitors and adjusts room temperature
Energy efficiency agent: tracks usage, gives feedback, may control utilities
Security agent: identifies who’s entering, applies role-based permissions (parent vs kid)
Weather / external API agent: integrates outdoor conditions to control temperature, blinds, etc.
Fridge / grocery agent: knows what’s inside via camera, knows preferences, has e-commerce API access for restocking
Notification / alerts agent: system updates, energy savings
Orchestrator agent: the user-facing entry point that delegates to specialists

Interaction patterns

Flat / all-to-all: every agent can talk to every agent
Hierarchical: orchestrator routes to specialists

Smart home likely wants hierarchical for UX — users want one interface, not one app per agent. Some flat links may still help (climate + energy efficiency probably need to talk directly).

When you allow agents to speak to each other, it’s basically an MCP-style protocol: treat the other agent like a tool. “Here’s how you interact, here’s what it tells you, here’s what it needs from you.”

Advantages

Easier to debug specialized agents than a monolithic system
Parallelization, time savings

9. What’s next in AI

Are we plateauing? (Ilya Sutskever’s question)

The community feeling around the latest GPT release was that the performance jump wasn’t what people expected — though the unified hood (no model selector) made consumer UX better.

LLM scaling laws say more compute + energy → better performance, but that eventually plateaus. What takes us to the next step is probably architecture search. The human brain operates very differently — much more efficient, much faster, with far less data. Big labs are hiring thousands of engineers precisely to hunt the next architectural breakthrough. Whoever discovered Transformers had tremendous impact on AI’s direction; the next analogous discovery could unlock a 10x reduction in compute and energy needs. (Foundation series analogy: individuals can disproportionately shape the future via their decisions.)

Multi-modality

LLMs started as text-only, added images. Models good at images are also better at text — being good at cat images makes you better at text about cats. Add audio and video, and the whole system improves. Pinnacle: robotics, where all modalities converge — the robot is better at avoiding a cat because it knows what a cat looks like, sounds like, smells like.

Methods working in harmony

Humans probably use a mix of methods:

Meta-learning — survival instinct encoded in DNA (the baby’s “pre-training”)
Supervised — parents pointing and saying “good / bad”
Reinforcement — falling and getting hurt
Unsupervised — observing others

Future AI systems likely combine the methods you saw in CS230, optimizing for speed, latency, cost, and energy.

Human-centric vs non-human-centric research

The human body is limiting. Pure brain-modeled research may miss compute/energy optimizations. Still, the brain has lots to teach — e.g. one research direction asks: does the brain do backpropagation? Probably not — likely only forward propagation. Worth reading if you’re curious about AI’s direction.

Velocity

Things move so fast that we deliberately teach breadth, not depth — because today’s specific RAG technique #17 will be irrelevant in two years. Get the breadth, develop the ability to sprint into depth when needed. The half-life of skills is low.

後話

這篇是 Stanford CS230 公開課的整理、保留英文原文以避免翻譯失真。要看本 blog 對應的中文原理化內容、可以接：

模組四：LLM 應用層原理 — RAG / tool use / agent / workflow patterns 的跨工具不變原理
4.1 RAG 原理
4.4 Agent 架構原理
4.14 Benchmarking 與評估方法論
4.21 LLM-as-Judge 評估方法

System Prompt

Tue, 12 May 2026 00:00:00 +0000

System prompt 的核心概念是「LLM application 中、由開發者預設、放在每次 conversation 最前面、不直接顯示給使用者的指令層」。常見用途包括設定模型角色（如「你是 senior Python engineer」）、規範輸出格式（如「always return JSON」）、加入 safety guideline。Chat-based LLM API（OpenAI、Anthropic 等）通常有專門的 role: "system" message type。

概念位置

LLM API call 的訊息結構：

1messages = [
2 {role: "system", content: "你是專業 code reviewer..."}, ← system prompt
3 {role: "user", content: "請 review 這段 code: ..."},
4 {role: "assistant", content: "..."}, ← 模型回答
5 {role: "user", content: "..."}, ← 後續對話
6 ...
7]

System prompt 在 application 設計中的角色：

用途	例子
角色定義	“你是 senior Python engineer、專長 async / typing”
輸出格式約束	“always return JSON with keys: title, body, tags”
行為規範	“若不確定、明確說『我不知道』、不要編造”
工具使用指引	“When user asks about weather, call get_weather tool”
安全約束	“Do not generate executable shell commands”
上下文注入	“Current date: 2026-05-12; User language: zh-TW”

事實查核註：不同 LLM vendor 對 system prompt 的處理機制不同（如部分模型把 system 跟 user 視為相同優先級、部分模型有特殊訓練讓 system 較高優先）、具體行為以該模型的官方文件為準。

設計責任

理解 system prompt 後可以解釋兩個現象：為什麼同一個模型在不同 LLM 應用中的「個性」差很多（system prompt 不同）、為什麼 prompt injection 的主要目標是繞過 system prompt 的約束（攻擊者想讓模型不照原本指令走）。

實務上、設計 LLM application 時、system prompt 是行為約束的第一層、但不是唯一防線（容易被 injection 繞過）；critical 行為應該在 application 層（如 tool call 的權限白名單、輸出驗證）加第二層防護。詳見 6.3 IDE 場景的 prompt injection。

4.0 Prompt 技術光譜：手法分類、取捨、組合模式

Thu, 14 May 2026 00:00:00 +0000

Prompt 技術不缺教學文章——但多數教學是「教你怎麼寫」、半年後模型換代、寫法跟著過時。本章不教「怎麼寫」、寫的是這個技術 landscape 的結構：有哪些手法、每個解什麼問題、它們的 trade-off 在哪、什麼時候該組合、什麼時候不該。這些結構性問題跨模型世代不變。

讀完本章後、看到任何新 prompt 技術都能放回正確座標、判斷「這是哪一軸的優化、跟我現在的問題對上嗎、能不能跟既有技術疊」——而不是每出一個新技術都從零學一次。

本章目標

讀完本章後你能：

把任何 prompt 技術放進三軸座標（context 提供 / 推理引導 / 角色與格式）。
對單一技術評估四維 trade-off（accuracy、latency、cost、debuggability）。
判斷何時 stack 技術、何時 stack 會互相抵消。
區分 prompt 層解法 vs fine-tune / RAG / chaining 解法的邊界。
看到「prompt 改了沒效」時、診斷是 systematic error 還是 random error。

本章鎖定的是結構層、不是寫法層

Prompt 知識可以分兩層：易變層是具體寫法（特定模型偏好哪種句型、特定任務最佳 step 切法）、不變層是「有哪些技術可選、各解什麼問題、能不能組合」的結構。本章只寫不變層。

易變層為什麼留給 case-by-case：

跨模型差異：對 GPT-4 有效的寫法、對 Claude 可能反效果。模型 SFT 分佈不同、對 prompt 結構的偏好不同。
跨任務差異：對 summarization 有效的格式、對 classification 沒幫助。每個任務的最佳 prompt 形狀要實驗。

不變層的價值是：看到任何新 prompt 技術都能放回正確座標、判斷它解什麼問題、跟既有技術疊能不能。具體寫法（act as XYZ 怎麼設計、step 怎麼分）屬於客製工作、不在本章。

三軸分類

把 prompt 技術放到三軸座標、看到任何新手法都能定位：

軸	解決什麼問題	代表技術
Context 提供	模型「缺資料 / 缺對齊範例」	zero-shot、few-shot、retrieval-augmented
推理引導	模型「直接答錯、需要 think」	chain-of-thought、decomposition、reflection
角色與格式	模型「不知該以什麼姿態回應」	role prompting、persona、output template

每個技術可能跨軸（如 few-shot CoT 同時 context + 推理）、但歸到主軸有助判讀「這技術在解哪一類問題」。看到新技術時、先問「它放哪一軸」、再看它跟既有技術的關係。

Context 軸：模型缺什麼資料

Zero-shot

直接給任務、不給範例。

適用：模型對任務分佈熟、輸出格式可預測。例：「將下列文字翻譯成英文」。
失效：任務邊界模糊、模型沒「對齊到你的標準」。例：「分類這個 review 是正向 / 中性 / 負向」——「中性」的邊界在不同產業差很多。

Few-shot

Few-shot prompting 在 prompt 內塞幾個 input-output 範例、模型透過範例對齊任務。

適用：任務有「我的標準跟模型預設不同」、但能舉幾個代表性例子。常見場景：分類、抽取、格式轉換、tone alignment。
核心收益：把「對齊任務」這件事從 fine-tune 移到 prompt——iteration 從幾天縮到幾分鐘、不動模型權重。
失效：範例選不好（cherry-picked、cover 不到 edge case）、範例太多撐爆 context、任務本質需要外部知識（這時該用 RAG 不是 few-shot）。

Few-shot 跟 fine-tune 是「對齊」這件事的兩個 endpoint。Trade-off：

維度	Few-shot in prompt	Fine-tune
Iteration	分鐘級、改 prompt 即可	天級、要 retrain
範例容量	受 context window 限制（10–50）	可以幾千幾萬、整個 dataset 都行
Cost	每次 inference 多付 token	一次性訓練 cost、之後 inference 不變
模型遷移	跨模型即時換、prompt 直接搬	綁特定 base model、換模型要 retrain
知識更新	改 prompt 即可	要 retrain

實務啟示：先 few-shot、等到範例真的多到撐爆 context 又每天都用、再考慮 fine-tune。本指南對 fine-tune 的整體看法見 3.4 訓練流程。

Retrieval-augmented prompting

跟 few-shot 像、但範例不是寫死、是從一個範例庫即時 retrieve。技術上落在 4.1 RAG 原理、但概念上是 few-shot 的延伸——把固定範例變成動態範例。

適用：範例庫大、每次任務最相關的範例不同。
跟 RAG 知識檢索的差異：RAG 取「事實 / 文件」、retrieval-augmented prompting 取「相似任務的解答範例」。兩個目的不同、但 infra 共用。

推理軸：模型該不該「think」

Chain-of-Thought（CoT）

Chain-of-Thought 要求模型「show your work」、把推理步驟寫出來、再給最終答案。

適用：multi-step reasoning（數學、邏輯、複雜判斷）、模型直接答錯但 step-by-step 後對。
失效在 reasoning model 出現後：reasoning model 本身就在生成內部推理 trace、再外加 explicit CoT prompt 邊際收益遞減、部分模型可能反而干擾內部推理路徑。判讀訊號：模型卡片寫「reasoning model」、就不要再加 “think step by step”。
失效在低能力模型：模型本身推理能力不足、CoT 變成「把錯誤推理寫得更詳細」、不會把答案變對。CoT 是「把潛在能力擠出來」、不是「給模型新能力」。

Task decomposition

把大任務拆成幾個明確子任務、prompt 內 enumerate 出來。

跟 CoT 的差異：CoT 是「過程要 explicit」、decomposition 是「子任務要 explicit」。CoT 在 single call 內展開、decomposition 可以單 call 也可以多 call。
適用：任務有明顯的 phase（如「先抽要點、再寫 outline、再展開段落」）、不分階段就會走錯。
跟 chaining 的邊界：decomposition 寫在 single prompt 裡是 prompt 技術；拆成多 call 是 4.7 workflow patterns 的 pipeline 模式。判讀：每階段 output 要不要被審查 / 被 inject 不同 context → 要 → 走 chaining；不需要 → 留在 single prompt 內 decomposition。

Reflection / self-critique

Reflection 要求模型先輸出一版、再 critique 自己、再修改。

適用：模型有能力辨識「自己寫的不夠好」、critique 跟 generator 不會共用同樣 blind spot。
失效：critique 跟 generator 是同個模型、訓練分佈中的盲點不會因為「再想一次」消失。判讀訊號：critique 每次都給很像的建議、或修完還是同一類錯——這是 systematic error、加 reflection 沒收益。
完整失敗模式分析見 4.7 workflow patterns reflection 段。

角色與格式軸：模型該以什麼姿態回應

Role prompting

“Act as X” 系列——指定模型扮演的角色或專業領域。

適用：通用模型在多種風格之間漂、加 role 把它鎖到特定分佈。例：「act as a senior backend engineer reviewing this PR」鎖技術深度。
失效：role 跟任務無關（“act as a wizard” 做財務分析）、或 role 設定跟使用者實際需求衝突。Role 是調 tone / 深度 / 視角的工具、不會給模型新能力。
常見過度迷信：“you are the best in the world at this” 這類自誇式 prompt 跨模型效果不穩定、難以可靠重現。不值得當核心策略。

Output template

指定 output 格式（JSON schema、Markdown 結構、特定欄位）。

適用：output 要餵下游 deterministic 系統（API、DB、UI）、格式錯就整個流程斷。
執行層次：純 prompt 指定（弱）→ few-shot 範例（中）→ structured output / constrained decoding（強、見 3.10 constrained decoding 內部）。三者疊用最穩。
失效：模板太緊、模型為了符合格式犧牲內容品質。Trade-off：嚴格 schema 換來下游穩定、但 prompt 的 expression 空間變小。

Persona / system prompt

跨 turn 持續性的角色與行為設定、放在 system prompt。

跟 role prompting 的差異：role prompting 是 single call 的暫時角色、persona 是跨 turn 的長期人設。多數 chatbot 應用都在後台塞 persona。
失效：persona 跟 user request 衝突時、模型在「跟 persona 一致」跟「滿足 user」之間擺盪、行為不穩。

四維 Trade-off

每個 prompt 技術都可以用這四維評估：

維度	意義	典型代價
Accuracy	任務完成品質	—
Latency	從 request 到 final response 的時間	Token 累積拉長生成時間
Cost	每次 inference 的 token 成本	Token 累積放大成本
Debuggability	失敗時能不能定位是哪一步出問題	Single 大 prompt 失敗難排查

四維是 trade-off、不是「都拉到最高」。Few-shot 提高 accuracy 但加 cost 跟 latency；CoT 提高 accuracy 但顯著拉長 latency；reflection 進一步提高 accuracy 但 cost / latency 翻倍以上。

Latency 的展開：標準 LLM 生成的 latency 由 TTFT（首 token 時間）+ output token 數 × per-token latency 決定。Few-shot 加 input token、影響 TTFT 但不影響 per-token；CoT / reflection 加 output token、顯著拉長總生成時間。Reasoning model 例外——它的 thinking token 也算 output、顯著拉長 TTFT 跟總時間、加 explicit CoT 在 reasoning model 上是重複收費。

Debuggability 的展開：single 大 prompt 跑出錯時、要排查是 task 拆解錯、role 不對、few-shot 範例誤導、還是格式描述不清——所有問題混在一個 call 裡。Chaining / decomposition 把流程拆成多個獨立 step、每 step 有自己的 input / output trace、可以 isolate 故障點。Trade-off：chaining 加 latency / cost、但 debug 時間遠少。

設計時先問「我的 binding constraint 是哪個」：

即時 chatbot → latency / cost 優先、accuracy 次要、避開 reflection
後台 batch（每晚跑、明早看）→ accuracy 優先、latency 不重要、reflection 可用
高代價任務（醫療、法律、財務）→ accuracy + debuggability 優先、cost 不在乎

組合：Stack 的兩個條件

Stack 有效的必要條件是兩技術解不同軸的問題、且底層假設一致。兩條件都滿足才有疊加收益、任一失效就會抵消甚至反效果。

有效的 stack 組合

Few-shot + role：few-shot 解「任務對齊」、role 解「回應姿態」、兩軸不衝突。
Few-shot + output template：few-shot 教任務、template 鎖格式、互補。
CoT + decomposition：decomposition 拆 phase、CoT 展開每 phase 的推理、層級互補。

失效的 stack 組合（同軸或假設衝突）

CoT + reasoning model：reasoning model 內部已在做 chain-of-thought、外加 explicit CoT 邊際收益遞減、部分模型可能反而干擾內部推理路徑。判讀：模型卡片寫 reasoning、就不要再加 “think step by step”。
Reflection + 低能力模型：reflection 需要 critique 能力、低能力模型 critique 不出有用建議、徒增 cost。
多重 role 衝突：“act as a creative writer AND a strict editor”——指令互相牴觸、模型隨機選一邊。
Few-shot 太多 + long context 任務：few-shot 撐爆 context、留給實際任務的空間不足、accuracy 反降。

判讀 stack 是否有效的反射動作：問「兩個技術解的是不同問題嗎、它們有沒有共用底層假設」。

跟相鄰技術的邊界

Prompt 技術不是萬能、有些問題該換層解：

問題	Prompt 層能解到哪	該換的層
模型不知道某個事實	few-shot 塞少量、不夠	RAG（4.1）
模型完全不會某個任務	few-shot 撐不住、頻繁失敗	Fine-tune（3.4）
任務要多步、每步要不同 context	single prompt 塞不下、邏輯混	Chaining / workflow（4.7）
任務要外部資料 / API	prompt 描述不出、需要實際呼叫	Tool use（4.3）
任務要 LLM 自主推進	prompt 無法表達「持續決定下一步」	Agent（4.4）

判讀訊號：prompt 改了五版、accuracy 還是不到 baseline、就該往這個表的右欄移、不是再改 prompt 第六版。

失敗診斷：Prompt 改了沒效時

Prompt 修改沒效、定位是 systematic 還是 random error：

Random error：同 prompt 跑 N 次、output 不穩定、有時對有時錯。可以靠 reflection / 多採樣 / temperature 降低收斂——這條路 prompt 層有解。
Systematic error：同 prompt 跑 N 次、output 一致地錯（或一致地朝某個方向偏）。reflection 沒用、prompt 改寫也救不回——這是模型能力 / 訓練分佈問題、要往 RAG / fine-tune / 換模型走、不是再改 prompt。

判讀步驟：

同 prompt 跑 5–10 次、看 output 分佈
若分佈寬：random error、prompt 層可解
若分佈窄但錯：systematic error、不要再 iterate prompt、換層

這個判讀直接呼應模組零 fuzzy engineering 的「同 input → 分佈」假設——不看分佈、debug 就是瞎猜。

何時過時 / 何時不過時

不會過時的部分：

三軸分類（context / 推理 / 格式）。
四維 trade-off（accuracy / latency / cost / debuggability）。
Stack 有效 vs 抵消的判讀原則（不同軸 vs 同軸 / 底層假設）。
Prompt 層 vs 換層的邊界判讀。
Systematic vs random error 的診斷流程。

會變的部分：

對特定模型有效的具體寫法（每個模型偏好的 prompt structure）。
角色 prompting 的有效程度（隨 model alignment 訓練成熟、role hack 的效果逐年降低）。
CoT 的必要性（reasoning model 普及後、explicit CoT 的場景縮小）。
Output format 強制手段（從 prompt-only 走向 structured output API、再走向 constrained decoding）。

下一章：4.1 RAG 原理、把「prompt 層塞不下知識」這個邊界往外推、進入 LLM 跟外部資料互動的領域。Prompt 跟 fine-tune 的對齊取捨見 3.4、跟 chaining 的邊界完整討論見 4.7、跟 fuzzy engineering 典範的關係見 0.8。