<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>LLM 課程筆記 on Tarragon</title><link>https://tarrragon.github.io/blog/llm/lectures/</link><description>Recent content in LLM 課程筆記 on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Thu, 14 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/llm/lectures/index.xml" rel="self" type="application/rss+xml"/><item><title>Beyond LLM: Enhancing LLM Applications (Stanford CS230)</title><link>https://tarrragon.github.io/blog/llm/lectures/stanford-cs230-beyond-llm/</link><pubDate>Thu, 14 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/lectures/stanford-cs230-beyond-llm/</guid><description>&lt;blockquote>
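&lt;p>&lt;strong>Source&lt;/strong>: Stanford CS230 Deep Learning, lecture &amp;ldquo;Beyond LLM: Enhancing Large Language Model Applications&amp;rdquo;.&lt;/p>
&lt;p>&lt;strong>Editing principles&lt;/strong>: the speaker&amp;rsquo;s original English is kept to avoid translation distortion, spoken filler is removed, and the material is reorganized as an article. Headings and introductory notes were originally in zh-Hant.&lt;/p>&lt;/blockquote>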
&lt;h2 id="講座定位">Where this lecture fits&lt;/h2>
&lt;p>We started with neurons, then layers, then deep networks, then how to structure projects in C3. This lecture goes one level beyond: what would it look like if you were building agentic AI systems at work, in a startup, in a company?&lt;/p>
&lt;p>The goal is not to build an end-to-end product in the next hour, but to give you the breadth of techniques that AI engineers have figured out — and are still exploring — so that after class you have the baggage to dive deeper and learn faster.&lt;/p>
&lt;p>Agenda:&lt;/p>
&lt;ol>
&lt;li>Challenges and opportunities for augmenting LLMs&lt;/li>
&lt;li>Prompt engineering&lt;/li>
&lt;li>Fine-tuning (and why to mostly avoid it)&lt;/li>
&lt;li>Retrieval-Augmented Generation (RAG)&lt;/li>
&lt;li>Agentic AI workflows&lt;/li>
&lt;li>Case study with evals&lt;/li>
&lt;li>Multi-agent workflows&lt;/li>
&lt;li>What&amp;rsquo;s next in AI&lt;/li>
&lt;/ol>
&lt;h2 id="1-why-augment-llms">1. Why augment LLMs?&lt;/h2>
&lt;p>Limitations that show up when you use a vanilla pre-trained model:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Lacks domain knowledge&lt;/strong> — e.g. a student project building an autonomous farming device with a camera that classifies sick crops. That data set isn&amp;rsquo;t out there; a pre-trained vision model lacks that knowledge.&lt;/li>
&lt;li>&lt;strong>Real-world distribution shift&lt;/strong> — the model was trained on high-quality data, but data in the wild is much messier.&lt;/li>
&lt;li>&lt;strong>Lacks current information&lt;/strong> — retraining from scratch every few months is impractical. Example: during Trump&amp;rsquo;s first presidency he tweeted &amp;ldquo;Covfefe.&amp;rdquo; The word didn&amp;rsquo;t exist; Twitter&amp;rsquo;s LLMs couldn&amp;rsquo;t recognize it, recommender systems went wild. New trends and slang (rizz, mid, etc.) appear constantly and you can&amp;rsquo;t keep retraining.&lt;/li>
&lt;li>&lt;strong>Trained for breadth, not depth&lt;/strong> — fine on a wide range of tasks, but may not be precise enough for narrow, well-defined enterprise applications with high precision / low latency requirements.&lt;/li>
&lt;li>&lt;strong>Carries unnecessary weight&lt;/strong> — a massive model where you only use 2% of capability is slow and expensive. Pruning, quantization, and modification are options.&lt;/li>
&lt;/ul>
&lt;h3 id="llms-are-hard-to-control">LLMs are hard to control&lt;/h3>
&lt;p>In 2016 Microsoft launched a Twitter bot that learned from users and quickly became a racist jerk. They removed it 16 hours after launch. Even better-funded teams struggle: there&amp;rsquo;s an ongoing debate (Elon Musk vs Sam Altman) on whose LLM is the &amp;ldquo;propaganda machine.&amp;rdquo; If you hang out on X you&amp;rsquo;ll see screenshots of LLMs saying controversial things. Even the best-funded labs don&amp;rsquo;t do a great job of controlling their LLMs.&lt;/p></description><content:encoded><![CDATA[<blockquote>
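<p><strong>Source</strong>: Stanford CS230 Deep Learning, lecture &ldquo;Beyond LLM: Enhancing Large Language Model Applications&rdquo;.</p>
<p><strong>Editing principles</strong>: the speaker&rsquo;s original English is kept to avoid translation distortion, spoken filler is removed, and the material is reorganized as an article. Headings and introductory notes were originally in zh-Hant.</p></blockquote>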
<h2 id="講座定位">Where this lecture fits</h2>
<p>We started with neurons, then layers, then deep networks, then how to structure projects in C3. This lecture goes one level beyond: what would it look like if you were building agentic AI systems at work, in a startup, in a company?</p>
<p>The goal is not to build an end-to-end product in the next hour, but to give you the breadth of techniques that AI engineers have figured out — and are still exploring — so that after class you have the baggage to dive deeper and learn faster.</p>
<p>Agenda:</p>
<ol>
<li>Challenges and opportunities for augmenting LLMs</li>
<li>Prompt engineering</li>
<li>Fine-tuning (and why to mostly avoid it)</li>
<li>Retrieval-Augmented Generation (RAG)</li>
<li>Agentic AI workflows</li>
<li>Case study with evals</li>
<li>Multi-agent workflows</li>
<li>What&rsquo;s next in AI</li>
</ol>
<h2 id="1-why-augment-llms">1. Why augment LLMs?</h2>
<p>Limitations that show up when you use a vanilla pre-trained model:</p>
<ul>
<li><strong>Lacks domain knowledge</strong> — e.g. a student project building an autonomous farming device with a camera that classifies sick crops. That data set isn&rsquo;t out there; a pre-trained vision model lacks that knowledge.</li>
<li><strong>Real-world distribution shift</strong> — the model was trained on high-quality data, but data in the wild is much messier.</li>
<li><strong>Lacks current information</strong> — retraining from scratch every few months is impractical. Example: during Trump&rsquo;s first presidency he tweeted &ldquo;Covfefe.&rdquo; The word didn&rsquo;t exist; Twitter&rsquo;s LLMs couldn&rsquo;t recognize it, recommender systems went wild. New trends and slang (rizz, mid, etc.) appear constantly and you can&rsquo;t keep retraining.</li>
<li><strong>Trained for breadth, not depth</strong> — fine on a wide range of tasks, but may not be precise enough for narrow, well-defined enterprise applications with high precision / low latency requirements.</li>
<li><strong>Carries unnecessary weight</strong> — a massive model where you only use 2% of capability is slow and expensive. Pruning, quantization, and modification are options.</li>
</ul>
<h3 id="llms-are-hard-to-control">LLMs are hard to control</h3>
<p>In 2016 Microsoft launched a Twitter bot that learned from users and quickly became a racist jerk. They removed it 16 hours after launch. Even better-funded teams struggle: there&rsquo;s an ongoing debate (Elon Musk vs Sam Altman) on whose LLM is the &ldquo;propaganda machine.&rdquo; If you hang out on X you&rsquo;ll see screenshots of LLMs saying controversial things. Even the best-funded labs don&rsquo;t do a great job of controlling their LLMs.</p>
<h3 id="llms-may-underperform-on-your-task">LLMs may underperform on your task</h3>
<ul>
<li>Specific knowledge gaps (e.g. medical diagnosis)</li>
<li>Missing sources — research, education, legal all require sourcing</li>
<li>Inconsistencies in style / format (e.g. legal contracts where every word counts)</li>
<li>Task-specific understanding — example: a biotech company categorizing reviews as positive / neutral / negative. What counts as &ldquo;negative&rdquo; in that industry may differ from a generic LLM&rsquo;s notion. You need to align the LLM to your task.</li>
</ul>
<h3 id="limited-context-handling">Limited context handling</h3>
<p>A lot of enterprise applications need large context. Example: an LLM running on top of your entire drive that can answer &ldquo;what was our Q4 sales performance?&rdquo; in one shot. In practice the context window is limited (best models today max out around hundreds of thousands of tokens; 200K ≈ two books). For video or large data, you have to chunk and embed.</p>
<p>The <strong>attention mechanism</strong> doesn&rsquo;t attend well over very large contexts. The <strong>needle-in-a-haystack</strong> benchmark tests this: insert a single sentence (&ldquo;Arun and Max are having coffee at Blue Bottle&rdquo;) in the middle of a very long text like the Bible, then ask &ldquo;what were Arun and Max having?&rdquo; It&rsquo;s complex not because the question is hard but because the model must find a fact within a huge corpus.</p>
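<p>A minimal sketch of how a needle-in-a-haystack test case could be assembled. The filler text, the <code>depth</code> parameter, and the keyword-based grading are assumptions made for illustration, not the benchmark&rsquo;s actual implementation; the model call itself is left out.</p>

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    position = int(len(filler_sentences) * depth)
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def contains_answer(model_answer, expected_keyword):
    """Rough grading: did the answer mention the expected keyword?"""
    return expected_keyword.lower() in model_answer.lower()


# Hypothetical filler standing in for a very long text.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "Arun and Max are having coffee at Blue Bottle."
question = "What were Arun and Max having?"

haystack = build_haystack(filler, needle, depth=0.5)
prompt = f"{haystack}\n\nQuestion: {question}"
```

<p>Sweeping <code>depth</code> and the haystack length would show where in the context the model starts missing the fact.</p>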
<h3 id="the-rag-debate">The RAG debate</h3>
<p>In theory, with infinite compute, RAG is useless — you could just read a massive corpus immediately and answer. But even then, latency matters; imagine the LLM reading your entire drive on every question. RAG also has other advantages: accuracy, sourcing.</p>
<p>Analogy to search: when you search, you still find sources. There&rsquo;s detailed traversal that ranks and finds specific links. Without that, you&rsquo;d be reading the entire web every query — not reasonable. So RAG-like approaches likely stay relevant.</p>
<h2 id="2-two-dimensions-of-optimization">2. Two dimensions of optimization</h2>
<p>Two axes when improving LLM-based products:</p>
<ol>
<li><strong>Foundation model axis</strong> — move from GPT-3.5 Turbo → GPT-4 → GPT-4o → GPT-5. Each step (in theory) improves base performance.</li>
<li><strong>Engineering axis</strong> — keep the same base model, but engineer how you leverage it: better prompts, <a href="/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG</a>, agentic workflow, multi-agent system.</li>
</ol>
<p>This lecture is about the vertical axis, the engineering one: given the LLM you are using, how do you maximize its performance?</p>
<h2 id="3-prompt-engineering">3. Prompt engineering</h2>
<h3 id="the-bcg--hbs--upenn--wharton-study">The BCG / HBS / UPenn / Wharton study</h3>
<p>Three groups of BCG consultants:</p>
<ol>
<li>No AI access</li>
<li>GPT-4 access</li>
<li>GPT-4 + training on how to prompt</li>
</ol>
<p>Two interesting findings:</p>
<p><strong>The jagged frontier</strong>: some tasks fall within the frontier where AI clearly helps; others fall outside, where AI actually makes performance worse. Many tasks fell within, many fell outside. Researchers also observed &ldquo;falling asleep at the wheel&rdquo; — relying on AI for a task beyond the frontier, and not reviewing outputs carefully.</p>
<p><strong>Centaurs vs cyborgs</strong>: two working modes.</p>
<ul>
<li><strong>Centaurs</strong> divide and delegate — give a big task to the AI, let it work, come back later. (Half human / half horse: clear delegation.)</li>
<li><strong>Cyborgs</strong> fully blend with AI — fast back-and-forth, augmented. Students often work like cyborgs; in the enterprise, when you automate a workflow, you&rsquo;re thinking like a centaur.</li>
</ul>
<p>The trained group did best. Prompt engineering is a skill everyone should have — not a job title to build a career on, but a powerful skill in your career.</p>
<h3 id="basic-prompt-design-principles">Basic prompt design principles</h3>
<p>A weak prompt:</p>
<blockquote>
<p>Summarize this document. {document}</p></blockquote>
<p>The model has no context on length, audience, focus. Better:</p>
<blockquote>
<p>Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers.</p></blockquote>
<p>Common techniques to make it even better:</p>
<ul>
<li><strong>Give an example</strong> of a great summary</li>
<li><strong>Role prompting</strong>: &ldquo;Act as a renewable energy expert giving a conference at Davos&rdquo;</li>
<li><strong>Praise</strong>: &ldquo;You are the best in the world at this&rdquo;</li>
<li><strong>Reflection / self-critique</strong>: ask the model to critique its own output and revise</li>
<li><strong><a href="/blog/llm/knowledge-cards/chain-of-thought/" data-link-title="Chain-of-Thought（CoT）" data-link-desc="讓 LLM 先輸出推理步驟再給最終答案的 prompting / 訓練方式、reasoning model 的基礎機制">Chain of thought</a></strong>: break the task into explicit steps and tell the model to &ldquo;think step by step, do not skip any step.&rdquo; For example: Step 1, identify the three most important findings; Step 2, explain their impact; Step 3, write the five-bullet summary.</li>
</ul>
<p>Andrew Ng recommends looking at other people&rsquo;s prompts. Repos like &ldquo;awesome <a href="/blog/llm/knowledge-cards/scaffold-vs-harness/" data-link-title="Scaffold vs Harness" data-link-desc="Coding agent 的兩個工程層次：scaffold 是建構時靜態結構、harness 是 runtime 的 tool dispatch &#43; context management &#43; safety">prompt template</a>&rdquo; on GitHub have many examples engineers have built. Many start with &ldquo;Act as a Linux terminal&rdquo;, &ldquo;Act as an English translator&rdquo;, &ldquo;Act as a position interviewer&rdquo;, etc.</p>
<h3 id="prompt-templates">Prompt templates</h3>
<p>The advantage of a template is you can put it in your code and scale across many user requests. Example from Workera: the HR system has &ldquo;Jane is a Product Manager Level 3, US, preferred language English.&rdquo; That metadata gets inserted into a prompt template that personalizes for Jane. Same template, different metadata for Joe (preferred language Spanish).</p>
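<p>A small sketch of the template idea, using Python&rsquo;s standard <code>string.Template</code>. The template wording, the field names, and Joe&rsquo;s role, level, and country are hypothetical; only Jane&rsquo;s metadata and Joe&rsquo;s preferred language come from the example above.</p>

```python
from string import Template

# Hypothetical template text; a production template would be far more detailed.
PROMPT_TEMPLATE = Template(
    "You are a skills coach. The learner is $name, a $role (level $level) "
    "based in $country. Respond in $language.\n\nLearner request: $request"
)


def render_prompt(profile, request):
    """Fill the shared template with one user's HR metadata."""
    return PROMPT_TEMPLATE.substitute(request=request, **profile)


jane = {"name": "Jane", "role": "Product Manager", "level": 3,
        "country": "US", "language": "English"}
# Everything except Joe's preferred language is assumed for illustration.
joe = {"name": "Joe", "role": "Product Manager", "level": 3,
       "country": "US", "language": "Spanish"}

jane_prompt = render_prompt(jane, "Suggest what I should learn next.")
joe_prompt = render_prompt(joe, "Suggest what I should learn next.")
```

<p>The template lives in code once; only the metadata changes per user.</p>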
<p>Foundation models likely use <a href="/blog/llm/knowledge-cards/system-prompt/" data-link-title="System Prompt" data-link-desc="LLM application 中由開發者預設、不直接顯示給使用者的指令層、定義模型的角色、行為規範、輸出格式">system prompts</a> you don&rsquo;t see — e.g. ChatGPT may inject &ldquo;Act like a helpful assistant&rdquo; plus user memories from a database before your prompt. That doesn&rsquo;t stop you from adding your own template on top.</p>
<h3 id="zero-shot-vs-few-shot-prompting">Zero-shot vs <a href="/blog/llm/knowledge-cards/few-shot-prompting/" data-link-title="Few-shot prompting" data-link-desc="在 prompt 內塞 input-output 範例對齊任務、不動模型權重的 in-context learning 技術">few-shot prompting</a></h3>
<p>Zero-shot:</p>
<blockquote>
<p>Classify the tone as positive, negative, or neutral.
&ldquo;The product is fine, but I was expecting more.&rdquo;</p></blockquote>
<p>Different humans would label this differently — partially positive, partially negative. Alignment to your task can come from few-shot:</p>
<blockquote>
<p>Here are examples of tone classifications:
&ldquo;These exceeded my expectations completely.&rdquo; → positive
&ldquo;It&rsquo;s OK, but I wish it had more features.&rdquo; → negative
&ldquo;The service was adequate. Neither good nor bad.&rdquo; → neutral
Now classify: &ldquo;The product is fine, but I was expecting more.&rdquo;</p></blockquote>
<p>The model now likely says negative, aligned to the second example.</p>
<p>Sophisticated AI startups keep their few-shot examples up to date — whenever a user says something interesting, a human labels it and it gets appended to the relevant prompt. Like building a dataset, but inserted directly in the prompt. Faster to iterate because you don&rsquo;t touch model weights.</p>
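<p>One way to treat few-shot examples as a small, growing dataset is to keep them in a list and rebuild the prompt on every request. This is a sketch; the appended fourth example is hypothetical.</p>

```python
def build_few_shot_prompt(examples, text):
    """Render labeled examples followed by the new text to classify."""
    lines = ["Here are examples of tone classifications:"]
    for example_text, label in examples:
        lines.append(f'"{example_text}" -> {label}')
    lines.append(f'Now classify: "{text}"')
    return "\n".join(lines)


examples = [
    ("These exceeded my expectations completely.", "positive"),
    ("It's OK, but I wish it had more features.", "negative"),
    ("The service was adequate. Neither good nor bad.", "neutral"),
]

# A human labels an interesting new user message and appends it (hypothetical).
examples.append(("Works, I guess.", "neutral"))

prompt = build_few_shot_prompt(examples, "The product is fine, but I was expecting more.")
```

<p>Because the examples sit in data rather than in model weights, adding or correcting one takes effect on the next call.</p>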
<blockquote>
<p><strong>Q</strong>: How long can the prompt be before the model loses itself?</p>
<p>There is research, but it dates fast. Practical example from Workera: a voice conversation eval breaks down after ~8 turns. Mitigation: chapter the conversation, summarize the first part, start over from a new prompt with the summary inserted.</p></blockquote>
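<p>The &ldquo;chapter the conversation&rdquo; mitigation could look roughly like this. The threshold of 8 mirrors the figure quoted above; <code>summarize</code> is a placeholder for an LLM summarization call, and keeping the last two turns verbatim is an assumption.</p>

```python
MAX_TURNS = 8  # assumed threshold, taken from the ~8 turns mentioned above


def summarize(turns):
    """Placeholder for an LLM summarization call; here it truncates each turn."""
    return " / ".join(turn[:30] for turn in turns)


def start_new_chapter(turns, max_turns=MAX_TURNS, keep_recent=2):
    """If the conversation is too long, replace older turns with a summary."""
    if max_turns >= len(turns):
        return list(turns)
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = f"Summary of the conversation so far: {summarize(older)}"
    return [summary] + recent


turns = [f"turn {i}" for i in range(1, 11)]
chaptered = start_new_chapter(turns)
```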
<h3 id="chaining-complex-prompts">Chaining complex prompts</h3>
<p>The most popular technique. <strong>Not</strong> chain of thought.</p>
<p>Single prompt for a customer review response:</p>
<blockquote>
<p>Read this review and write a professional response that acknowledges concerns, explains the issue, offers a resolution. {review}</p></blockquote>
<p>You get one output. Hard to debug — everything is mixed together.</p>
<p>Chained version, three prompts:</p>
<ol>
<li>Extract the key issues from this review.</li>
<li>Using these issues, draft an outline.</li>
<li>Using the outline, write the full response.</li>
</ol>
<p>Advantages:</p>
<ul>
<li>Each prompt can be tested and optimized independently</li>
<li>You can identify which step is weakest (outline good but email rude? then prompt 3 is the bottleneck)</li>
<li>Easier to debug than one mega-prompt</li>
</ul>
<p>Tradeoff: latency. Chains add latency, so for certain applications you don&rsquo;t want long chains.</p>
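<p>The three-prompt chain can be sketched as plain function composition. <code>fake_llm</code> is a stand-in so the example runs offline; a real implementation would call a model API in its place.</p>

```python
def run_chain(review, llm):
    """Run the three prompts in sequence and keep every intermediate output."""
    issues = llm(f"Extract the key issues from this review.\n\n{review}")
    outline = llm(f"Using these issues, draft an outline.\n\n{issues}")
    response = llm(f"Using the outline, write the full response.\n\n{outline}")
    return {"issues": issues, "outline": outline, "response": response}


def fake_llm(prompt):
    """Stand-in for a real model call: echoes the instruction line."""
    first_line = prompt.splitlines()[0]
    return f"[output for: {first_line}]"


result = run_chain("The blender arrived late and the lid was cracked.", fake_llm)
```

<p>Returning the intermediate outputs, not just the final response, is what makes each step inspectable and testable on its own.</p>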
<h3 id="testing-prompts">Testing prompts</h3>
<p>Start with manual error analysis: run a baseline prompt, a refined prompt, and a chained workflow on the same inputs, and have humans rate the outputs. Manual is slow but builds intuition.</p>
<p>To scale, use platforms (e.g. <strong>Promptfoo</strong>) that let you:</p>
<ul>
<li>Run the same prompt across multiple LLMs side by side in a table</li>
<li>Define <strong>LLM judges</strong></li>
</ul>
<p>Flavors of <a href="/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM judges</a>:</p>
<ul>
<li><strong>Pairwise comparison</strong>: &ldquo;Which summary is better?&rdquo;</li>
<li><strong>Single-answer grading</strong>: &ldquo;Grade this summary 1–5&rdquo;</li>
<li><strong>Reference-guided pairwise</strong> or <strong>rubric-based</strong>: e.g. &ldquo;A 5 is a summary below 100 chars, with three distinct key points, starting with an overview sentence; a 0 fails to summarize.&rdquo;</li>
</ul>
<p>You can stack techniques: few-shot the rubric with examples of 5/5, 4/5, 3/5, etc.</p>
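<p>A rubric-based judge boils down to a prompt plus a strict output format you can parse. The sketch below is independent of any platform; the <code>Score: N</code> reply format and the 0 to 5 range are assumptions.</p>

```python
import re

RUBRIC = (
    "A 5 is a summary below 100 characters, with three distinct key points, "
    "starting with an overview sentence. A 0 fails to summarize."
)


def build_judge_prompt(summary, rubric=RUBRIC):
    """Assemble the grading prompt sent to the judge model."""
    return (
        "You are grading a summary.\n"
        f"Rubric: {rubric}\n"
        f"Summary: {summary}\n"
        "Reply with 'Score: N' where N is an integer from 0 to 5."
    )


def parse_score(judge_reply):
    """Extract the integer score; return None if the judge ignored the format."""
    match = re.search(r"Score:\s*([0-5])\b", judge_reply)
    return int(match.group(1)) if match else None
```

<p>Treating an unparseable reply as <code>None</code> rather than guessing keeps format failures visible in the eval results.</p>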
<h2 id="4-fine-tuning-and-why-i-steer-away">4. Fine-tuning (and why I steer away)</h2>
<p>Reasons to avoid fine-tuning:</p>
<ul>
<li>Requires substantial labeled data</li>
<li>May overfit to specific data, losing general-purpose utility</li>
<li>Time- and cost-intensive — by the time you&rsquo;re done, the next base model is out and beating your fine-tuned version</li>
</ul>
<p>The advantage of prompt engineering is you can drop in the next best pre-trained model directly. Fine-tuning doesn&rsquo;t work like that.</p>
<p>When fine-tuning still makes sense:</p>
<ul>
<li>Task requires repeated high-precision outputs (legal, scientific)</li>
<li>The general-purpose LLM struggles with domain-specific language</li>
</ul>
<h3 id="the-slack-fine-tuning-cautionary-tale">The Slack fine-tuning cautionary tale</h3>
<p>Ross Lazerowitz (Sep 2023) fine-tuned a model on his company&rsquo;s Slack messages, hoping it would &ldquo;speak like us.&rdquo; Then he asked:</p>
<blockquote>
<p>Write a 500-word blog post on prompt engineering.</p></blockquote>
<p>The model: &ldquo;I shall work on that in the morning.&rdquo;</p>
<p>He pushes back: &ldquo;It&rsquo;s morning now.&rdquo;</p>
<p>Model: &ldquo;I&rsquo;m writing right now.&rdquo;</p>
<p>&ldquo;It&rsquo;s 6:30 AM here. Write it now.&rdquo;</p>
<p>&ldquo;OK, I shall write it now. I actually don&rsquo;t know what you would like me to say about prompt engineering. I can only describe the process&hellip;&rdquo;</p>
<p>It learned how people talk on Slack — not how they write blog posts. Fine-tuning went wrong because the training distribution wasn&rsquo;t the task distribution.</p>
<h2 id="5-retrieval-augmented-generation-rag">5. Retrieval-Augmented Generation (RAG)</h2>
<h3 id="why-standalone-llms-fall-short">Why standalone LLMs fall short</h3>
<ul>
<li>Small / hard-to-attend-to context windows</li>
<li>Knowledge gaps and training cutoff dates</li>
<li>Hallucinations — costly in medical, education</li>
<li>Lack of sources — research, education, legal love sources. Vanilla LLMs hallucinate fake research papers.</li>
</ul>
<h3 id="how-a-vanilla-rag-works">How a vanilla RAG works</h3>
<p>Question-answering in the medical field: &ldquo;What are the side effects of drug X?&rdquo;</p>
<ol>
<li><strong>Knowledge base</strong> of documents</li>
<li><strong>Embed</strong> documents into lower-dimensional vectors (trade-off: too small → lose info; too big → latency)</li>
<li>Store embeddings in a <strong>vector database</strong> with efficient retrieval and a distance metric</li>
<li><strong>Embed the user query</strong> with the same algorithm</li>
<li><strong>Retrieve</strong> the most relevant documents by distance</li>
<li>Pull those documents, paste into a <strong>prompt template</strong> like:</li>
</ol>
<blockquote>
<p>Answer the user query based on the list of documents. If the answer is not in the documents, say &ldquo;I don&rsquo;t know.&rdquo; Cite exact page, chapter, and line.</p></blockquote>
<p>You can extend the template to require links to the specific page.</p>
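<p>The six steps above can be sketched end to end with toy components. The bag-of-words <code>embed</code> is a deliberate simplification standing in for a learned embedding model, the in-memory list stands in for a vector database, and the three documents are invented.</p>

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=1):
    """Return the k documents whose embeddings are closest to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


documents = [  # invented knowledge base
    "Drug X side effects include nausea and headache.",
    "Drug Y should be stored below 25 degrees.",
    "Clinic opening hours are 9 to 5 on weekdays.",
]
index = [(doc, embed(doc)) for doc in documents]

query = "What are the side effects of drug X?"
top_docs = retrieve(query, index, k=1)
docs_text = "\n".join(top_docs)
prompt = (
    "Answer the user query based on the list of documents. "
    "If the answer is not in the documents, say \"I don't know.\"\n\n"
    f"Documents:\n{docs_text}\n\nQuery: {query}"
)
```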
<h3 id="improving-rags">Improving RAGs</h3>
<blockquote>
<p><strong>Q</strong>: Do document embeddings retain location info within large documents?</p>
<p>Vanilla RAGs may not. Example: the giant white paper inside a medication box would not be served well by a vanilla RAG.</p></blockquote>
<p>Two popular improvements:</p>
<p><strong>Chunking</strong>: store both the full-document embedding and chapter-level embeddings, and retrieve at both levels so that sourcing becomes more precise.</p>
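<p>As a sketch of chunking, each document can produce one index entry for the whole text plus one per chapter, each carrying the metadata needed for sourcing. The entry layout and the sample chapters are assumptions.</p>

```python
def build_index_entries(doc_id, chapters):
    """Return one entry for the whole document plus one entry per chapter."""
    full_text = "\n".join(text for _, text in chapters)
    entries = [{"doc_id": doc_id, "chapter": None, "text": full_text}]
    for title, text in chapters:
        entries.append({"doc_id": doc_id, "chapter": title, "text": text})
    return entries


# Invented chapters standing in for a long medication leaflet.
chapters = [
    ("Dosage", "Take one tablet daily."),
    ("Side effects", "Nausea and headache may occur."),
]
entries = build_index_entries("drug-x-leaflet", chapters)
```

<p>Each entry would then be embedded separately; a chapter-level hit lets the answer cite the chapter instead of the whole document.</p>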
<p><strong>HyDE (Hypothetical Document Embeddings)</strong> — the user query usually doesn&rsquo;t look like the documents. Example: &ldquo;What are the side effects of drug X?&rdquo; vs a multi-page document. To bridge the gap:</p>
<ol>
<li>Take the user query</li>
<li>Use a prompt to generate a fake hallucinated document answering it (&ldquo;write a 5-page report answering this query&rdquo;)</li>
<li>Embed that fake document</li>
<li>Compare its embedding to the vector DB</li>
</ol>
<p>The fake document is closer in structure to real documents, so retrieval is more accurate.</p>
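<p>The HyDE steps map onto a very small function: generate, embed, search. In this sketch <code>fake_generate</code> replaces the LLM call, and the word-set embedding with overlap scoring is a toy substitute for a real embedding model and vector database.</p>

```python
def embed(text):
    """Toy embedding: the set of lowercased words."""
    return set(text.lower().replace(".", "").replace("?", "").split())


def make_search(documents):
    """Build a search function over an in-memory list of documents."""
    vectors = [(doc, embed(doc)) for doc in documents]

    def search(query_vec, k):
        def overlap(item):
            union = query_vec | item[1]
            return len(query_vec.intersection(item[1])) / len(union) if union else 0.0
        ranked = sorted(vectors, key=overlap, reverse=True)
        return [doc for doc, _ in ranked[:k]]

    return search


def hyde_retrieve(query, generate, search, k=1):
    """HyDE: embed a generated hypothetical document instead of the raw query."""
    fake_document = generate(f"Write a short report answering this query: {query}")
    return search(embed(fake_document), k)


def fake_generate(prompt):
    """Stand-in for an LLM; returns a plausible but unverified answer."""
    return "Reported side effects of drug X include nausea, headache and dizziness."


documents = [  # invented knowledge base
    "Drug X side effects include nausea and headache.",
    "Drug Y should be stored below 25 degrees.",
]
results = hyde_retrieve("What are the side effects of drug X?", fake_generate, make_search(documents))
```

<p>The hallucinated document is never shown to the user; it only serves as a better-shaped search key.</p>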
<p>These are just two of many RAG variants; research from 2020–2025 has many branches. (See the linked survey paper in the slides.)</p>
<h2 id="6-agentic-ai-workflows">6. Agentic AI workflows</h2>
<p>Andrew Ng coined &ldquo;agentic AI workflows&rdquo; because everyone uses &ldquo;agent&rdquo; to mean very different things — sometimes a single prompt, sometimes a complex multi-agent system. Calling everything an &ldquo;agent&rdquo; doesn&rsquo;t do it justice. Better term: <strong><a href="/blog/llm/knowledge-cards/agent/" data-link-title="LLM Agent" data-link-desc="把控制流交給 LLM 的應用模式：自主決策、跨多步呼叫工具、人類角色從主導變監督">agentic workflow</a></strong> — a multi-step process to complete a task, built from prompts, tools, additional resources, and API calls. This also avoids confusion with the RL definition of &ldquo;agent&rdquo; (interacts with environment, state transitions, reward, observation).</p>
<h3 id="one-shot-vs-agentic-example">One-shot vs agentic example</h3>
<p>User on a chatbot: &ldquo;What is your refund policy?&rdquo;</p>
<ul>
<li><strong>One-shot + RAG</strong>: &ldquo;Refunds are available within 30 days of purchase.&rdquo; [link to policy]</li>
<li><strong>Agentic</strong>:
<ol>
<li>Agent retrieves refund policy via RAG</li>
<li>Agent asks user for order number</li>
<li>Agent queries an API to check order details</li>
<li>Agent confirms: &ldquo;Your order qualifies. The amount will be processed in 3–5 business days.&rdquo;</li>
</ol>
</li>
</ul>
<p>Much more thoughtful than the vanilla one.</p>
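<p>Steps 3 and 4 of the agentic path are mostly deterministic once the order number is known. A sketch, assuming the 30-day window from the policy text and a dictionary standing in for the order API; the order data is invented.</p>

```python
from datetime import date

REFUND_WINDOW_DAYS = 30  # from the policy text quoted in the example


def lookup_order(order_number, orders):
    """Stand-in for the order API call."""
    return orders.get(order_number)


def handle_refund_request(order_number, orders, today):
    """Check the order against the refund window and produce the agent's reply."""
    order = lookup_order(order_number, orders)
    if order is None:
        return "I could not find that order. Could you double-check the number?"
    age_days = (today - order["purchased_on"]).days
    if age_days > REFUND_WINDOW_DAYS:
        return "This order is outside the 30-day refund window."
    return "Your order qualifies. The amount will be processed in 3-5 business days."


orders = {  # invented order records
    "1001": {"purchased_on": date(2025, 1, 10)},
    "1002": {"purchased_on": date(2024, 11, 1)},
}
today = date(2025, 1, 31)
```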
<h3 id="specialized-agents-in-the-wild">Specialized agents in the wild</h3>
<p>In SF you&rsquo;ll see billboards: AI software engineer, AI skill mentor, AI SDR, AI lawyer, AI specialized cloud engineer. It would be a stretch to say everything works, but work is being done. (Personal opinion: putting a human face behind these is gimmicky and more scary than engaging. In a few years, very few products will use a human face — it&rsquo;s a marketing tactic.)</p>
<h3 id="paradigm-shift-traditional-software-vs-agentic-ai-software">Paradigm shift: traditional software vs agentic AI software</h3>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Traditional software</th>
          <th>Agentic AI software</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>Structured: JSON, databases, forms</td>
          <td>Free-form text, images, video; dynamic interpretation</td>
      </tr>
      <tr>
          <td>Logic</td>
          <td>Deterministic</td>
          <td>Fuzzy</td>
      </tr>
      <tr>
          <td>Decomposition</td>
          <td>Monolith / microservices</td>
          <td>Think as a manager: delegate to roles (graphic designer → marketing manager → performance marketing → data scientist)</td>
      </tr>
      <tr>
          <td>Cost of experimentation</td>
          <td>High; you rarely throw away code</td>
          <td>Low; AI companies are more comfortable throwing away code</td>
      </tr>
  </tbody>
</table>
<p>Fuzzy engineering is truly hard. If you let users ask anything, the chance of breakage and attack is high. Companies have been bitten because a user did something authorized that broke the database.</p>
<p>Example from Workera:</p>
<ul>
<li><strong>Deterministic item types</strong>: multiple choice, multi-select, drag-and-drop, ordering, matching — one correct answer.</li>
<li><strong>Fuzzy item types</strong>: voice questions, voice + coding role-plays — the scoring algorithm can make mistakes, and mistakes are costly.</li>
</ul>
<p>Mitigation: a <strong><a href="/blog/llm/knowledge-cards/human-in-the-loop/" data-link-title="Human-in-the-loop（HITL）" data-link-desc="人類介入 LLM 工作流的設計：三種觸發時機（pre-act / mid-stream / post-hoc）、避免橡皮圖章化的四條件">human in the loop</a></strong> — e.g. the appeal feature at the end of an assessment that lets users challenge the agent, bringing a human in to fix and align it.</p>
<p>Advice for building a company: get as much done deterministically as possible. Then for the fuzzy parts (back-and-forth interaction), design guardrails up front.</p>
<h3 id="enterprise-workflows-the-mckinsey-credit-memo-example">Enterprise workflows: the McKinsey credit memo example</h3>
<p>A financial institution takes 1–4 weeks to produce a credit risk memo:</p>
<ol>
<li>Relationship manager gathers data from 15+ sources</li>
<li>RM and credit analyst collaboratively analyze</li>
<li>Credit analyst spends 20+ hours writing the memo</li>
<li>RM and analyst loop on feedback</li>
</ol>
<p>With Gen AI agents (McKinsey study), time drops 20–60%:</p>
<ol>
<li>RM works with Gen AI agent, provides materials</li>
<li>Agent decomposes into tasks for specialist sub-agents</li>
<li>Agents gather data, draft memo</li>
<li>RM and analyst review and give feedback</li>
</ol>
<p>The hardest part is changing people. In theory, this is great. In practice — 100,000-employee enterprises will take 10–20 years to rewire job descriptions, business workflows, incentives, and training to make this real at scale.</p>
<h3 id="core-components-of-an-agent">Core components of an agent</h3>
<p>Take a travel booking agent:</p>
<ul>
<li><strong>Prompts</strong> — the prompts we&rsquo;ve learned to optimize</li>
<li><strong>Context management / memory</strong>:
<ul>
<li><strong>Core / working memory</strong>: fast access. Things needed every interaction (e.g. user&rsquo;s name).</li>
<li><strong>Archival / long-term memory</strong>: slower. Things used occasionally (e.g. birthday).</li>
<li>Why split: imagine ChatGPT had to re-read all memories on every call. If memory lookup takes 3 seconds, every interaction takes 3 seconds. Working memory must be highly optimized.</li>
</ul>
</li>
<li><strong>Tools</strong>: flight search API, hotel API, car rental API, weather API, payment processing API. You typically pass API documentation to the LLM — they&rsquo;re good at reading JSON specs and learning the GET request format.</li>
<li><strong>Resources</strong> (Anthropic&rsquo;s term): data sitting somewhere (e.g. your CRM) that you let the agent read. Provide a lookup tool and access to the resource.</li>
</ul>
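<p>The core versus archival split can be sketched as two stores with different access paths: core memory is rendered into every prompt, archival memory is only read on demand. The class and method names are hypothetical, as is the birthday value.</p>

```python
class AgentMemory:
    """Two-tier memory: core is always in the prompt, archival is looked up."""

    def __init__(self):
        self.core = {}      # small; injected into every prompt
        self.archival = {}  # larger; read only when the agent asks for it

    def remember(self, key, value, core=False):
        target = self.core if core else self.archival
        target[key] = value

    def prompt_context(self):
        """Text prepended to every model call; keep it short."""
        return "; ".join(f"{key}: {value}" for key, value in self.core.items())

    def recall(self, key):
        """Slower path, e.g. a database or vector lookup in a real system."""
        return self.archival.get(key)


memory = AgentMemory()
memory.remember("name", "Jane", core=True)
memory.remember("birthday", "March 3")  # hypothetical value
```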
<h3 id="degrees-of-autonomy">Degrees of autonomy</h3>
<p>From least to most autonomous:</p>
<ul>
<li><strong>Least</strong>: hard-code the steps. &ldquo;First identify intent, then look up history, then call the flight API, &hellip;&rdquo;</li>
<li><strong>Semi</strong>: hard-code the tools only. &ldquo;You&rsquo;re a travel agent, help the user book travel. Here are your tools.&rdquo;</li>
<li><strong>Most</strong>: agent decides both steps and tools. Give it a code editor; it can ping any web API, perform calculations, generate code to display data.</li>
</ul>
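<p>The first two levels of autonomy differ only in who decides which tools run. In this sketch the tools are trivial lambdas and <code>fake_choose_tools</code> is a keyword-matching stand-in for the LLM&rsquo;s decision; all names are hypothetical.</p>

```python
TOOLS = {
    "search_flights": lambda request: f"flights for {request}",
    "search_hotels": lambda request: f"hotels for {request}",
}


def run_hard_coded(request):
    """Least autonomous: the developer fixes the steps and their order."""
    return [TOOLS["search_flights"](request), TOOLS["search_hotels"](request)]


def run_tool_choice(request, choose_tools):
    """Semi-autonomous: the tool set is fixed, the model picks which to call."""
    chosen = choose_tools(request, list(TOOLS))
    return [TOOLS[name](request) for name in chosen if name in TOOLS]


def fake_choose_tools(request, available):
    """Stand-in for an LLM: pick tools whose subject appears in the request."""
    return [name for name in available if name.split("_")[1][:-1] in request.lower()]
```

<p>Filtering on <code>name in TOOLS</code> guards against a model returning a tool name that does not exist.</p>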
<h3 id="apis-vs-mcp-model-context-protocol">APIs vs MCP (Model Context Protocol)</h3>
<p>With <strong>APIs</strong>, you teach the LLM to ping a specific API: give it documentation, define how to call it, what it returns. You do this one-off per API. Doesn&rsquo;t scale well.</p>
<p>With <strong><a href="/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP</a></strong> (Anthropic-coined), there&rsquo;s a system in the middle. Agents communicate with an MCP server:</p>
<blockquote>
<p>&ldquo;What do you need to give me flight info?&rdquo;
&ldquo;I need origin, destination, and what you&rsquo;re looking for.&rdquo;
&ldquo;Here are my requirements.&rdquo;
&ldquo;You forgot to tell me your budget.&rdquo;</p></blockquote>
<p>It&rsquo;s agent-to-agent communication. Companies publish their MCPs; your agent figures out how to get the data it needs.</p>
<blockquote>
<p><strong>Q</strong>: Isn&rsquo;t MCP just a shifted maintenance burden — APIs change, MCPs change?</p>
<p>Yes. But at least the agent can go back and forth and discover requirements. Ideally a startup has documentation, and an LLM workflow reads the docs and updates the code accordingly.</p></blockquote>
<blockquote>
<p><strong>Q</strong>: Are there security concerns with MCP?</p>
<p>Likely, depending on the data exposed. Most MCPs have authentication, like APIs. The exact security surface depends on the implementation.</p></blockquote>
<blockquote>
<p><strong>Q</strong>: Is MCP about efficiency or accessing more data?</p>
<p>Efficiency. You still control what data is exposed. Compared to one-off API integration, MCP lets a coding agent communicate efficiently with many MCP servers and find what it needs.</p></blockquote>
<h3 id="step-by-step-workflow-example-travel-agent">Step-by-step workflow example: travel agent</h3>
<ol>
<li>User: &ldquo;Plan a trip to Paris Dec 15–20 with flights, hotels near the Eiffel Tower, and an itinerary.&rdquo;</li>
<li>Agent plans steps: find flights, search hotels, generate recommendations, validate preferences/budget, book.</li>
<li>Execute: use tools, combine results.</li>
<li>Proactive interaction: propose to user, validate, iterate.</li>
<li>Update memory: &ldquo;User only likes direct flights.&rdquo; &ldquo;User is fine with 3-star hotels.&rdquo;</li>
</ol>
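<p>The plan, execute, and update-memory loop can be sketched with stand-ins for the model and tool calls. The step names and the way preferences are threaded into execution are assumptions.</p>

```python
def plan(request):
    """Stand-in for an LLM planning call; returns a fixed list of steps."""
    return ["find_flights", "search_hotels", "draft_itinerary"]


def execute(step, request, memory):
    """Stand-in for a tool call; reads remembered preferences."""
    prefs = ", ".join(memory) if memory else "no stored preferences"
    return f"{step} for '{request}' ({prefs})"


def run_trip_agent(request, memory, feedback):
    """Plan, execute each step, then store new preferences from user feedback."""
    results = [execute(step, request, memory) for step in plan(request)]
    memory.extend(note for note in feedback if note not in memory)
    return results


memory = []
first = run_trip_agent("Paris Dec 15-20", memory, ["User only likes direct flights."])
second = run_trip_agent("Rome in May", memory, [])
```

<p>The second run sees the preference learned during the first, which is the point of the memory update step.</p>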
<h2 id="7-case-study-building-a-customer-support-agent--evals">7. Case study: building a customer support agent + evals</h2>
<p>PM asks you to build a customer support agent. Example: &ldquo;I need to change my shipping address for order X — I moved.&rdquo;</p>
<h3 id="where-to-start">Where to start</h3>
<ul>
<li><strong>Research existing models / benchmarks</strong> for customer support</li>
<li><strong>Decompose the task</strong>: what would a human support agent do?</li>
<li><strong>Guess what&rsquo;s fuzzy vs deterministic</strong> in advance</li>
</ul>
<blockquote>
<p>Recommended start: sit with a customer support agent for a day or two. Watch their workflow. Ask where they struggle and how much time each step takes. That gives you the task decomposition.</p></blockquote>
<h3 id="decomposed-task">Decomposed task</h3>
<p>A human support agent typically:</p>
<ol>
<li>Extracts key info</li>
<li>Looks up the customer record in the database</li>
<li>Checks policy (allowed to update address?)</li>
<li>Drafts a response email</li>
<li>Sends the email</li>
</ol>
<h3 id="designing-the-agentic-workflow">Designing the agentic workflow</h3>
<p>For each step, pick the right primitive:</p>
<ul>
<li><strong>Step 1 extract info</strong>: vanilla LLM call — extract intent, order number, new address</li>
<li><strong>Step 2 lookup + update</strong>: tool — connect to database (custom tool or MCP)</li>
<li><strong>Step 3 check policy</strong>: RAG or rule lookup</li>
<li><strong>Step 4 draft email</strong>: LLM call, with the confirmation pasted in</li>
<li><strong>Step 5 send email</strong>: tool — post to email API</li>
</ul>
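<p>The five primitives above can be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the lecture&rsquo;s implementation: <code>ORDERS</code>, <code>OUTBOX</code>, <code>extract_info</code>, <code>policy_allows_update</code>, <code>draft_email</code> and <code>handle</code> are hypothetical names, the two LLM calls are replaced by a regex and a template so the code runs offline, and the &ldquo;address may change until the order ships&rdquo; rule is invented for illustration. The sketch also applies the database update only after the policy check passes, which is an assumption about ordering.</p>

```python
import re
from dataclasses import dataclass


@dataclass
class Request:
    intent: str
    order_id: str
    new_address: str


# Hypothetical in-memory stand-ins for the order database and the email API.
ORDERS = {"A1001": {"status": "processing", "address": "1 Old Street"}}
OUTBOX = []


def extract_info(message: str) -> Request:
    """Step 1. A real system would make an LLM call; a regex stands in here."""
    order = re.search(r"order (\w+)", message)
    address = re.search(r"to (.+?)\.", message)
    if order is None or address is None:
        raise ValueError("could not extract order number and address")
    return Request("change_address", order.group(1), address.group(1))


def policy_allows_update(order: dict) -> bool:
    """Step 3. Assumed rule: the address may change until the order ships."""
    return order["status"] != "shipped"


def draft_email(request: Request, updated: bool) -> str:
    """Step 4. A real system would make an LLM call; a template stands in here."""
    if updated:
        return f"Order {request.order_id} now ships to {request.new_address}."
    return f"Order {request.order_id} has shipped, so its address cannot change."


def handle(message: str) -> str:
    request = extract_info(message)             # step 1: LLM call
    order = ORDERS[request.order_id]            # step 2: tool, database lookup
    updated = policy_allows_update(order)       # step 3: policy check
    if updated:
        order["address"] = request.new_address  # step 2: tool, database update
    email = draft_email(request, updated)       # step 4: LLM call
    OUTBOX.append(email)                        # step 5: tool, email API
    return email


email = handle("I moved. Please change the shipping address for order A1001 to 22 New Road.")
print(email)
```

<p>Keeping each step as its own function is what makes the component-based evals discussed next possible: every function can be traced and checked in isolation.</p>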
<h3 id="evals-how-do-you-know-it-works">Evals: how do you know it works?</h3>
<p>Assume you have <strong>LLM traces</strong> (a must in any AI startup — if a startup doesn&rsquo;t have traces, debugging is brutal). Several dimensions for evaluation:</p>
<p><strong>End-to-end vs component-based</strong>:</p>
<ul>
<li>End-to-end: user satisfaction rating at the end. If user rates 1, follow up: &ldquo;What was the issue?&rdquo; → &ldquo;Prices were too high&rdquo; → fix the relevant tool/prompt.</li>
<li>Component-based: error-analyze each tool / prompt independently. &ldquo;The tool keeps forgetting to update the email field.&rdquo; &ldquo;The email-send call uses the wrong format.&rdquo;</li>
</ul>
<p><strong>Objective vs subjective</strong>:</p>
<ul>
<li>Objective: &ldquo;LLM extracted the wrong order ID.&rdquo; You can write Python that checks the extracted ID against the user input and the DB lookup, so these errors are caught automatically.</li>
<li>Subjective: &ldquo;Should we recommend a direct flight or cheaper indirect?&rdquo; Captured via:
<ul>
<li>Curated eval dataset — write 10 prompts where users say &ldquo;I prefer direct flights, I care about time.&rdquo; Define what a good output looks like.</li>
<li>LLM judges grading on a rubric.</li>
</ul>
</li>
</ul>
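<p>An objective check of the kind described above might look like the sketch below. Everything specific is an assumption made for illustration: the order-ID format (one capital letter plus four digits), the <code>KNOWN_ORDERS</code> set standing in for the database, and the function name <code>check_order_id</code>.</p>

```python
import re

# Hypothetical order table standing in for the real database.
KNOWN_ORDERS = {"A1001", "B2002"}


def check_order_id(user_message: str, extracted_id: str) -> list[str]:
    """Return the objective failures for one trace (an empty list means pass)."""
    failures = []
    # Assumed ID format: one capital letter followed by four digits.
    mentioned = set(re.findall(r"\b[A-Z]\d{4}\b", user_message))
    if extracted_id not in mentioned:
        failures.append("extracted ID does not appear in the user message")
    if extracted_id not in KNOWN_ORDERS:
        failures.append("extracted ID is not in the database")
    return failures


# Hypothetical traces: (user message, order ID the LLM extracted).
traces = [
    ("Change the address for order A1001 please.", "A1001"),
    ("Change the address for order B2002 please.", "B2020"),
]
for message, extracted in traces:
    print(extracted, check_order_id(message, extracted))
```

<p>Because the check is deterministic, it can run over every trace without a human or an LLM judge in the loop.</p>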
<p><strong>Quantitative vs qualitative</strong>:</p>
<ul>
<li>Quantitative: % successful address updates; latency per component (e.g. send-email takes 5s — too long).</li>
<li>Qualitative: error analysis on hallucinations, tone mismatch, user confusion. Typically white-glove, meaning someone reads the traces by hand.</li>
</ul>
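<p>The quantitative metrics above can be computed directly from traces. The sketch below assumes a hypothetical trace schema (<code>run</code>, <code>component</code>, <code>seconds</code>, <code>ok</code>) and made-up sample rows; a real tracing tool will have its own format.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records: one row per component call.
TRACES = [
    {"run": 1, "component": "extract", "seconds": 0.8, "ok": True},
    {"run": 1, "component": "send_email", "seconds": 5.2, "ok": True},
    {"run": 2, "component": "extract", "seconds": 0.6, "ok": True},
    {"run": 2, "component": "send_email", "seconds": 4.8, "ok": False},
]


def success_rate(traces: list[dict]) -> float:
    """Share of runs in which every component call succeeded."""
    runs = defaultdict(lambda: True)
    for row in traces:
        runs[row["run"]] = runs[row["run"]] and row["ok"]
    return sum(runs.values()) / len(runs)


def mean_latency(traces: list[dict]) -> dict[str, float]:
    """Average seconds per component."""
    by_component = defaultdict(list)
    for row in traces:
        by_component[row["component"]].append(row["seconds"])
    return {name: mean(values) for name, values in by_component.items()}


print(success_rate(TRACES))
print(mean_latency(TRACES))
```

<p>Grouping latency by component is what surfaces a single slow step, such as the send-email call, instead of hiding it inside an end-to-end average.</p>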
<p>Example of subjective tone eval: error-analyze 20 user interactions and notice the LLM seems rude or overly short. Then build LLM judges with a politeness rubric. Then swap the underlying LLM (GPT-4 → Grok → Llama), run them side by side, and see which is most polite on average. Or hold the LLM fixed and vary the prompt (&ldquo;Act like a travel agent&rdquo; → &ldquo;Act like a helpful travel agent&rdquo;) to measure the influence of that one word.</p>
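<p>The side-by-side comparison can be sketched as a small harness that scores each variant with a judge and averages the scores. The judge here is a keyword counter so the code runs offline; it is only a placeholder for an LLM judge applying a politeness rubric. The names <code>keyword_judge</code> and <code>compare</code> and the sample replies are hypothetical, not outputs from any real model.</p>

```python
from statistics import mean
from typing import Callable

# A judge maps one reply to a politeness score from 1 (rude) to 5 (very polite).
Judge = Callable[[str], int]


def keyword_judge(reply: str) -> int:
    """Placeholder for an LLM judge applying a politeness rubric.

    A real judge would send the rubric and the reply to a model; this
    keyword count only makes the sketch runnable.
    """
    polite_markers = ("please", "thank", "happy to", "sorry")
    hits = sum(marker in reply.lower() for marker in polite_markers)
    return min(5, 1 + 2 * hits)


def compare(variants: dict[str, list[str]], judge: Judge) -> dict[str, float]:
    """Average judge score per variant (a model, or a prompt wording)."""
    return {name: mean(judge(reply) for reply in replies)
            for name, replies in variants.items()}


# Hypothetical replies collected under two prompt wordings.
variants = {
    "Act like a travel agent": [
        "Your flight is booked.",
        "No refunds.",
    ],
    "Act like a helpful travel agent": [
        "Happy to help, your flight is booked.",
        "Sorry, that fare is non-refundable. Thank you for asking.",
    ],
}
scores = compare(variants, keyword_judge)
print(scores)
```

<p>The same <code>compare</code> call covers both experiments in the paragraph above: key the dictionary by model name to compare LLMs, or by prompt wording to measure the effect of one word.</p>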
<h2 id="8-multi-agent-workflows">8. Multi-agent workflows</h2>
<p>Why multi-agent when a single workflow already has multiple steps?</p>
<ul>
<li><strong>Parallelism</strong> — independent things can run in parallel</li>
<li><strong>Reuse</strong> — a design agent built once can serve marketing, product, etc. Many stakeholders benefit from one optimized agent.</li>
</ul>
<h3 id="smart-home-example">Smart home example</h3>
<p>Brainstormed by the class:</p>
<ul>
<li><strong>Biometric / location agent</strong>: tracks where you are and how you&rsquo;re moving</li>
<li><strong>Climate agent</strong>: monitors and adjusts room temperature</li>
<li><strong>Energy efficiency agent</strong>: tracks usage, gives feedback, may control utilities</li>
<li><strong>Security agent</strong>: identifies who&rsquo;s entering, applies role-based permissions (parent vs kid)</li>
<li><strong>Weather / external API agent</strong>: integrates outdoor conditions to control temperature, blinds, etc.</li>
<li><strong>Fridge / grocery agent</strong>: knows what&rsquo;s inside via camera, knows preferences, has e-commerce API access for restocking</li>
<li><strong>Notification / alerts agent</strong>: system updates, energy savings</li>
<li><strong>Orchestrator agent</strong>: the user-facing entry point that delegates to specialists</li>
</ul>
<h3 id="interaction-patterns">Interaction patterns</h3>
<ul>
<li><strong>Flat / all-to-all</strong>: every agent can talk to every agent</li>
<li><strong>Hierarchical</strong>: orchestrator routes to specialists</li>
</ul>
<p>Smart home likely wants <strong>hierarchical</strong> for UX — users want one interface, not one app per agent. Some flat links may still help (climate + energy efficiency probably need to talk directly).</p>
<p>When you allow agents to speak to each other, it&rsquo;s basically an MCP-style protocol: treat the other agent like a tool. &ldquo;Here&rsquo;s how you interact, here&rsquo;s what it tells you, here&rsquo;s what it needs from you.&rdquo;</p>
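<p>A hierarchical setup with one flat link can be sketched as below. This is an illustration under assumptions, not MCP itself: <code>Agent</code>, <code>Orchestrator</code>, the keyword routing table, and the canned replies are all hypothetical, and a real orchestrator would likely use an LLM to pick the specialist from each agent&rsquo;s description.</p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    """A specialist exposed like a tool: a name, a description, a handler."""
    name: str
    description: str
    handle: Callable[[str], str]


class Orchestrator:
    """User-facing entry point that routes each request to one specialist."""

    def __init__(self, agents: list[Agent], keywords: dict[str, str]):
        self.agents = {agent.name: agent for agent in agents}
        # Keyword to agent name; an LLM router would replace this table.
        self.keywords = keywords

    def route(self, request: str) -> str:
        for keyword, name in self.keywords.items():
            if keyword in request.lower():
                return self.agents[name].handle(request)
        return "No specialist can handle this request."


def energy_agent(request: str) -> str:
    return "energy: off-peak rate active"


def climate_agent(request: str) -> str:
    # Flat link: climate consults energy directly, bypassing the orchestrator.
    return f"climate: heating to 21C ({energy_agent('current rate?')})"


home = Orchestrator(
    agents=[
        Agent("climate", "Adjusts room temperature", climate_agent),
        Agent("energy", "Tracks energy usage", energy_agent),
    ],
    keywords={"temperature": "climate", "usage": "energy"},
)
print(home.route("Raise the temperature in the living room"))
```

<p>The user only ever calls <code>home.route</code>, which matches the single-interface UX argued for above, while the climate agent still treats the energy agent as a tool it can call directly.</p>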
<h3 id="advantages">Advantages</h3>
<ul>
<li>Easier to debug specialized agents than a monolithic system</li>
<li>Parallelization, time savings</li>
</ul>
<h2 id="9-whats-next-in-ai">9. What&rsquo;s next in AI</h2>
<h3 id="are-we-plateauing-ilya-sutskevers-question">Are we plateauing? (Ilya Sutskever&rsquo;s question)</h3>
<p>The community feeling around the latest GPT release was that the performance jump wasn&rsquo;t what people expected, though unifying everything under one hood (no model selector) made the consumer UX better.</p>
<p>LLM <strong>scaling laws</strong> say more compute + energy → better performance, but that eventually plateaus. What takes us to the next step is probably <strong>architecture search</strong>. The human brain operates very differently — much more efficient, much faster, with far less data. Big labs are hiring thousands of engineers precisely to hunt the next architectural breakthrough. Whoever discovered Transformers had tremendous impact on AI&rsquo;s direction; the next analogous discovery could unlock a 10x reduction in compute and energy needs. (Foundation series analogy: individuals can disproportionately shape the future via their decisions.)</p>
<h3 id="multi-modality">Multi-modality</h3>
<p>LLMs started as text-only, added images. Models good at images are also better at text — being good at cat images makes you better at text about cats. Add audio and video, and the whole system improves. Pinnacle: robotics, where all modalities converge — the robot is better at avoiding a cat because it knows what a cat looks like, sounds like, smells like.</p>
<h3 id="methods-working-in-harmony">Methods working in harmony</h3>
<p>Humans probably use a mix of methods:</p>
<ul>
<li><strong>Meta-learning</strong> — survival instinct encoded in DNA (the baby&rsquo;s &ldquo;pre-training&rdquo;)</li>
<li><strong>Supervised</strong> — parents pointing and saying &ldquo;good / bad&rdquo;</li>
<li><strong>Reinforcement</strong> — falling and getting hurt</li>
<li><strong>Unsupervised</strong> — observing others</li>
</ul>
<p>Future AI systems likely combine the methods you saw in CS230, optimizing for speed, latency, cost, and energy.</p>
<h3 id="human-centric-vs-non-human-centric-research">Human-centric vs non-human-centric research</h3>
<p>The human body is limiting. Pure brain-modeled research may miss compute/energy optimizations. Still, the brain has lots to teach — e.g. one research direction asks: does the brain do backpropagation? Probably not — likely only forward propagation. Worth reading if you&rsquo;re curious about AI&rsquo;s direction.</p>
<h3 id="velocity">Velocity</h3>
<p>Things move so fast that we deliberately teach <strong>breadth</strong>, not depth, because today&rsquo;s specific RAG technique #17 will be irrelevant in two years. Get the breadth, then develop the ability to sprint into depth when needed. The half-life of skills is short.</p>
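<h2 id="後話">Afterword</h2>
<p>This post is a write-up of the public Stanford CS230 lecture; it keeps the speaker&rsquo;s original English to avoid translation distortion. For this blog&rsquo;s corresponding principle-focused articles (written in Chinese), continue with:</p>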
<ul>
<li><a href="/blog/llm/04-applications/" data-link-title="模組四：LLM 應用層原理" data-link-desc="Prompt 技術光譜、RAG、tool use、agent、應用層協議、人機協作、multi-agent、workflow 編排、eval 設計：跨工具不變的概念地圖">Module 4: LLM Application-Layer Principles</a>: the tool-independent principles behind RAG / tool use / agent / workflow patterns</li>
<li><a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG Principles</a></li>
<li><a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent Architecture Principles</a></li>
<li><a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 Benchmarking and Evaluation Methodology</a></li>
<li><a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-Judge Evaluation Methods</a></li>
</ul>
]]></content:encoded></item></channel></rss>