<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Hands-on：本地 AI 工具實作筆記 on Tarragon</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/</link><description>Recent content in Hands-on：本地 AI 工具實作筆記 on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Mon, 11 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/index.xml" rel="self" type="application/rss+xml"/><item><title>Hands-on：安裝 ComfyUI + SDXL base</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/comfyui-setup/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/comfyui-setup/</guid><description>&lt;p>本篇紀錄裝 ComfyUI 跟 Stable Diffusion XL base 模型、在 Apple Silicon Mac 上跑通最小 text-to-image 流程。ComfyUI 是 2026 年 Apple Silicon 跑 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/diffusion/" data-link-title="Diffusion" data-link-desc="產圖用的生成式 AI 架構：跟寫 code 用的 Transformer 是不同路線">Diffusion&lt;/a> 最主流的選擇——節點式工作流（拖拉節點連線、像 visual programming、每個節點負責一段運算）、跨平台、Python 環境、容易客製化。Draw Things（Mac 原生 GUI）更簡單、但 ComfyUI 接 workflow 跟 custom node 的能力強很多。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>ComfyUI&lt;/strong>：main branch、shallow clone
&lt;strong>示範模型&lt;/strong>：Stable Diffusion XL base 1.0（6.5 GB、&lt;code>stabilityai/stable-diffusion-xl-base-1.0&lt;/code>）
&lt;strong>Python&lt;/strong>：3.14（venv 隔離、不污染系統）&lt;/p>&lt;/blockquote>
&lt;h2 id="前置設定">前置設定&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>項目&lt;/th>
 &lt;th>檢查指令&lt;/th>
 &lt;th>預期&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Git&lt;/td>
 &lt;td>&lt;code>which git&lt;/code>&lt;/td>
 &lt;td>&lt;code>/usr/bin/git&lt;/code> 或 brew 版&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Python 3.10+&lt;/td>
 &lt;td>&lt;code>python3 --version&lt;/code>&lt;/td>
 &lt;td>3.10 ~ 3.14 都可、本 demo 用 3.14&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>磁碟空間&lt;/td>
 &lt;td>&lt;code>df -h ~&lt;/code>&lt;/td>
 &lt;td>至少 15 GB（runtime 3 GB + SDXL 6.5 GB + cache）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/unified-memory/" data-link-title="Unified Memory Architecture" data-link-desc="Apple Silicon 讓 CPU / GPU / NE 共用同一塊記憶體：跑大模型的優勢來源">統一記憶體&lt;/a>&lt;/td>
 &lt;td>&lt;code>system_profiler SPHardwareDataType | grep Memory&lt;/code>&lt;/td>
 &lt;td>至少 16 GB、推薦 32 GB+&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>ComfyUI 在 Apple Silicon 跑 Diffusion 用 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/gpu-compute-backend/" data-link-title="GPU Compute Backend" data-link-desc="GPU 加速計算的底層 API 介面（CUDA / ROCm / Vulkan / Metal / SYCL）、決定推論軟體能否用 GPU 跑得快">MPS（Metal Performance Shaders）backend&lt;/a>、不需要 NVIDIA CUDA。但跑 SDXL 至少要 12 GB 統一記憶體留給 model + activation、16 GB Mac 跟其他 app 一起會吃緊。&lt;/p>
&lt;h2 id="clone-comfyui">Clone ComfyUI&lt;/h2>
&lt;p>放在 &lt;code>~/Projects/&lt;/code> 下、跟其他 dev project 同層：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="nb">cd&lt;/span> ~/Projects
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">git clone --depth &lt;span class="m">1&lt;/span> https://github.com/comfyanonymous/ComfyUI.git
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="nb">cd&lt;/span> ComfyUI&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>--depth 1&lt;/code> 只拉最新 commit、不拉全部歷史、省幾百 MB。要追歷史 / submit PR 才需要 full clone。&lt;/p>
&lt;p>ComfyUI 目錄結構（核心部分）：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">ComfyUI/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">├── main.py # 啟動 entry point
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">├── server.py # HTTP server
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">├── nodes.py # 內建節點實作
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">├── custom_nodes/ # 第三方 / 客製節點放這
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">├── models/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">│ ├── checkpoints/ # SD / SDXL 主 model 檔放這
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">│ ├── loras/ # LoRA 微調權重
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">│ ├── vae/ # VAE 模型
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">│ ├── controlnet/ # ControlNet 模型
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">│ └── ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">├── output/ # 生成的圖
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">├── input/ # 拖進 ComfyUI 的圖片
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">└── requirements.txt&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="建-venv--裝-dependencies">建 venv + 裝 dependencies&lt;/h2>
&lt;p>ComfyUI requirements 含 PyTorch、numpy、PIL、safetensors、einops 等、套件多、版本敏感。用 venv 隔離：&lt;/p></description><content:encoded><![CDATA[<p>本篇紀錄裝 ComfyUI 跟 Stable Diffusion XL base 模型、在 Apple Silicon Mac 上跑通最小 text-to-image 流程。ComfyUI 是 2026 年 Apple Silicon 跑 <a href="/blog/llm/knowledge-cards/diffusion/" data-link-title="Diffusion" data-link-desc="產圖用的生成式 AI 架構：跟寫 code 用的 Transformer 是不同路線">Diffusion</a> 最主流的選擇——節點式工作流（拖拉節點連線、像 visual programming、每個節點負責一段運算）、跨平台、Python 環境、容易客製化。Draw Things（Mac 原生 GUI）更簡單、但 ComfyUI 接 workflow 跟 custom node 的能力強很多。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>ComfyUI</strong>：main branch、shallow clone
<strong>示範模型</strong>：Stable Diffusion XL base 1.0（6.5 GB、<code>stabilityai/stable-diffusion-xl-base-1.0</code>）
<strong>Python</strong>：3.14（venv 隔離、不污染系統）</p></blockquote>
<h2 id="前置設定">前置設定</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>檢查指令</th>
          <th>預期</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Git</td>
          <td><code>which git</code></td>
          <td><code>/usr/bin/git</code> 或 brew 版</td>
      </tr>
      <tr>
          <td>Python 3.10+</td>
          <td><code>python3 --version</code></td>
          <td>3.10 ~ 3.14 都可、本 demo 用 3.14</td>
      </tr>
      <tr>
          <td>磁碟空間</td>
          <td><code>df -h ~</code></td>
          <td>至少 15 GB（runtime 3 GB + SDXL 6.5 GB + cache）</td>
      </tr>
      <tr>
          <td><a href="/blog/llm/knowledge-cards/unified-memory/" data-link-title="Unified Memory Architecture" data-link-desc="Apple Silicon 讓 CPU / GPU / NE 共用同一塊記憶體：跑大模型的優勢來源">統一記憶體</a></td>
          <td><code>system_profiler SPHardwareDataType | grep Memory</code></td>
          <td>至少 16 GB、推薦 32 GB+</td>
      </tr>
  </tbody>
</table>
<p>ComfyUI 在 Apple Silicon 跑 Diffusion 用 <a href="/blog/llm/knowledge-cards/gpu-compute-backend/" data-link-title="GPU Compute Backend" data-link-desc="GPU 加速計算的底層 API 介面（CUDA / ROCm / Vulkan / Metal / SYCL）、決定推論軟體能否用 GPU 跑得快">MPS（Metal Performance Shaders）backend</a>、不需要 NVIDIA CUDA。但跑 SDXL 至少要 12 GB 統一記憶體留給 model + activation、16 GB Mac 跟其他 app 一起會吃緊。</p>
<h2 id="clone-comfyui">Clone ComfyUI</h2>
<p>放在 <code>~/Projects/</code> 下、跟其他 dev project 同層：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects
</span></span><span class="line"><span class="ln">2</span><span class="cl">git clone --depth <span class="m">1</span> https://github.com/comfyanonymous/ComfyUI.git
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nb">cd</span> ComfyUI</span></span></code></pre></div><p><code>--depth 1</code> 只拉最新 commit、不拉全部歷史、省幾百 MB。要追歷史 / submit PR 才需要 full clone。</p>
<p>ComfyUI 目錄結構（核心部分）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">ComfyUI/
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── main.py              # 啟動 entry point
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">├── server.py            # HTTP server
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">├── nodes.py             # 內建節點實作
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">├── custom_nodes/        # 第三方 / 客製節點放這
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── models/
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│   ├── checkpoints/     # SD / SDXL 主 model 檔放這
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│   ├── loras/           # LoRA 微調權重
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│   ├── vae/             # VAE 模型
</span></span><span class="line"><span class="ln">10</span><span class="cl">│   ├── controlnet/      # ControlNet 模型
</span></span><span class="line"><span class="ln">11</span><span class="cl">│   └── ...
</span></span><span class="line"><span class="ln">12</span><span class="cl">├── output/              # 生成的圖
</span></span><span class="line"><span class="ln">13</span><span class="cl">├── input/               # 拖進 ComfyUI 的圖片
</span></span><span class="line"><span class="ln">14</span><span class="cl">└── requirements.txt</span></span></code></pre></div><h2 id="建-venv--裝-dependencies">建 venv + 裝 dependencies</h2>
<p>ComfyUI requirements 含 PyTorch、numpy、PIL、safetensors、einops 等、套件多、版本敏感。用 venv 隔離：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects/ComfyUI
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 -m venv venv
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nb">source</span> venv/bin/activate
</span></span><span class="line"><span class="ln">4</span><span class="cl">python --version  <span class="c1"># 確認在 venv 內</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">pip install --upgrade pip</span></span></code></pre></div><p>裝 dependencies：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">pip install -r requirements.txt</span></span></code></pre></div><p>實測時間：10-15 分鐘（torch + 各種 dep）、首次跑會編譯部分 C extension。完成後預期看到：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Successfully installed Mako-... MarkupSafe-... Pillow-... PyOpenGL-... ...
</span></span><span class="line"><span class="ln">2</span><span class="cl">  torch-... torchvision-... torchaudio-... ...
</span></span><span class="line"><span class="ln">3</span><span class="cl">  safetensors-... transformers-... ...</span></span></code></pre></div><p>驗證 PyTorch + MPS：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python -c <span class="s2">&#34;import torch; print(&#39;torch:&#39;, torch.__version__, &#39;mps:&#39;, torch.backends.mps.is_available())&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># torch: 2.x.x mps: True</span></span></span></code></pre></div><p><code>mps: True</code> 表示 Apple Silicon GPU 加速可用。</p>
<h2 id="下載-sdxl-base-模型">下載 SDXL base 模型</h2>
<p>SDXL base 約 6.5 GB、是 Stable Diffusion XL 的基礎 model。從 Hugging Face 拉到 ComfyUI 的 <code>models/checkpoints/</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mkdir -p ~/Projects/ComfyUI/models/checkpoints
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">cd</span> ~/Projects/ComfyUI/models/checkpoints
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># -L 跟 redirect、--continue-at - 支援中斷後重續、避免 6.5 GB 重下</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">curl -L --continue-at - -o sd_xl_base_1.0.safetensors <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors?download=true&#34;</span></span></span></code></pre></div><p>下載時間視網速、10-30 分鐘 broadband 都正常。網路中斷時重跑同一個指令、<code>--continue-at -</code> 會從中斷處續傳、不用重下 6.5 GB。完成後：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ls -lh sd_xl_base_1.0.safetensors
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 6.5 GB</span></span></span></code></pre></div><p>可選的進階模型：</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>大小</th>
          <th>用途</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SDXL base 1.0</td>
          <td>6.5 GB</td>
          <td>基礎、本 demo 用</td>
      </tr>
      <tr>
          <td>SDXL refiner 1.0</td>
          <td>6.1 GB</td>
          <td>跟 base 配對、提升細節</td>
      </tr>
      <tr>
          <td>SD 1.5</td>
          <td>4.0 GB</td>
          <td>較小、生態最成熟（很多 LoRA）</td>
      </tr>
      <tr>
          <td>Flux.1 schnell</td>
          <td>12 GB</td>
          <td>2024+ 最強開源 SD 級</td>
      </tr>
      <tr>
          <td>Flux.1 dev</td>
          <td>24 GB</td>
          <td>Flux 完整版、品質最佳</td>
      </tr>
  </tbody>
</table>
<p>SDXL 6.5 GB 是「能驗證 + 不過大」的甜蜜點。再小可以選 SD 1.5（4 GB）、跑 Flux 要 24 GB 磁碟 + 16 GB+ 統一記憶體。</p>
<h2 id="啟動-comfyui-server">啟動 ComfyUI Server</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects/ComfyUI
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">source</span> venv/bin/activate
</span></span><span class="line"><span class="ln">3</span><span class="cl">python main.py</span></span></code></pre></div><p>預期輸出：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[Prompt Server] Starting ComfyUI...
</span></span><span class="line"><span class="ln">2</span><span class="cl">Total VRAM 32768 MB, total RAM 32768 MB
</span></span><span class="line"><span class="ln">3</span><span class="cl">pytorch version: 2.x.x
</span></span><span class="line"><span class="ln">4</span><span class="cl">Set vram state to: SHARED
</span></span><span class="line"><span class="ln">5</span><span class="cl">Device: mps
</span></span><span class="line"><span class="ln">6</span><span class="cl">Using sub quadratic attention for cross-attention
</span></span><span class="line"><span class="ln">7</span><span class="cl">...
</span></span><span class="line"><span class="ln">8</span><span class="cl">Starting server
</span></span><span class="line"><span class="ln">9</span><span class="cl">To see the GUI go to: http://127.0.0.1:8188</span></span></code></pre></div><p>Apple Silicon 統一記憶體被 PyTorch 報成 VRAM 是預期、不是 bug：mps backend 把整個統一記憶體當成「GPU 可見記憶體」、所以 32GB Mac 顯示 <code>Total VRAM 32768 MB</code>。實際使用上 ComfyUI、其他 app 跟系統共用同一塊。</p>
<p>關鍵驗證：</p>
<ul>
<li><code>Device: mps</code> → Apple Silicon GPU 啟用</li>
<li><code>Starting server</code> + <code>http://127.0.0.1:8188</code> → server 跑了</li>
</ul>
<p>開瀏覽器到 <code>http://127.0.0.1:8188</code>、看到節點式 UI 就成功。第一次開啟會載入預設 workflow（一個簡單 text-to-image）。</p>
<p>要對外暴露（讓 LAN 內其他機器連）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python main.py --listen 0.0.0.0 --port <span class="m">8188</span></span></span></code></pre></div><p>跟 <a href="/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 隱私資料流</a> 提的一樣、<code>0.0.0.0</code> 等於暴露給整個區網、家用 OK 公共網路要小心。</p>
<h2 id="跑第一張圖">跑第一張圖</h2>
<p>ComfyUI 預設 workflow 是 text-to-image：</p>
<ol>
<li><strong>CheckpointLoader 節點</strong>：選 <code>sd_xl_base_1.0.safetensors</code>。</li>
<li><strong>CLIPTextEncode（Prompt）節點</strong>：輸入 prompt、例如 <code>a photograph of a cat sitting on a wooden chair, natural lighting</code>。</li>
<li><strong>CLIPTextEncode（Negative）節點</strong>：輸入 negative prompt、例如 <code>blurry, low quality, artifacts</code>。</li>
<li><strong>EmptyLatentImage 節點</strong>：設定 1024×1024（SDXL 最佳尺寸）。</li>
<li><strong>KSampler 節點</strong>：steps=20、cfg=7、sampler=<code>euler</code> 或 <code>dpmpp_2m</code>。</li>
<li><strong>VAEDecode 節點</strong>：把 latent 轉成 RGB image。</li>
<li><strong>SaveImage 節點</strong>：存到 <code>output/</code>。</li>
</ol>
<p>點右側 panel 的 <code>Queue Prompt</code>、開始生成。</p>
<p>實測時間（M4 Pro 32GB、SDXL base、1024×1024、MPS backend）：</p>
<table>
  <thead>
      <tr>
          <th>Steps</th>
          <th>第一張（含 model 載入）</th>
          <th>後續同 model</th>
          <th>備註</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>15</td>
          <td>約 100-110 秒</td>
          <td>約 30-40 秒</td>
          <td>本驗證實測 106s（含載入）</td>
      </tr>
      <tr>
          <td>20</td>
          <td>約 130-150 秒</td>
          <td>約 40-60 秒</td>
          <td>ComfyUI 預設值</td>
      </tr>
      <tr>
          <td>30</td>
          <td>約 200 秒</td>
          <td>約 80 秒</td>
          <td>品質更高、邊際效益小</td>
      </tr>
  </tbody>
</table>
<p>16GB Mac 跑 SDXL：每張 60-180 秒、可能會降頻。</p>
<p>生成完成後在 <code>output/</code> 看到 PNG 檔（如 <code>comfyui-test_00001_.png</code>）。</p>
<h2 id="用-rest-api-直接生成不開瀏覽器">用 REST API 直接生成（不開瀏覽器）</h2>
<p>GUI 適合互動探索、自動化要走 REST API。完整 script 在 <code>scripts/comfyui-test/generate.py</code>、實際驗證指令：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects/blog
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/comfyui-test/generate.py --steps <span class="m">15</span></span></span></code></pre></div><p>腳本流程：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">build_workflow</span><span class="p">(</span><span class="n">prompt_text</span><span class="p">,</span> <span class="n">neg_text</span><span class="p">,</span> <span class="n">steps</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">        <span class="s2">&#34;3&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;inputs&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;seed&#34;</span><span class="p">:</span> <span class="mi">42</span><span class="p">,</span> <span class="s2">&#34;steps&#34;</span><span class="p">:</span> <span class="n">steps</span><span class="p">,</span> <span class="s2">&#34;cfg&#34;</span><span class="p">:</span> <span class="mf">7.0</span><span class="p">,</span> <span class="s2">&#34;sampler_name&#34;</span><span class="p">:</span> <span class="s2">&#34;euler&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">                         <span class="s2">&#34;scheduler&#34;</span><span class="p">:</span> <span class="s2">&#34;normal&#34;</span><span class="p">,</span> <span class="s2">&#34;denoise&#34;</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                         <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;4&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s2">&#34;positive&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;6&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">                         <span class="s2">&#34;negative&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;7&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s2">&#34;latent_image&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;5&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">]},</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">              <span class="s2">&#34;class_type&#34;</span><span class="p">:</span> <span class="s2">&#34;KSampler&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        <span class="s2">&#34;4&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;inputs&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;ckpt_name&#34;</span><span class="p">:</span> <span class="s2">&#34;sd_xl_base_1.0.safetensors&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">              <span class="s2">&#34;class_type&#34;</span><span class="p">:</span> <span class="s2">&#34;CheckpointLoaderSimple&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">        <span class="s2">&#34;5&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;inputs&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;width&#34;</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span> <span class="s2">&#34;height&#34;</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span> <span class="s2">&#34;batch_size&#34;</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">              <span class="s2">&#34;class_type&#34;</span><span class="p">:</span> <span class="s2">&#34;EmptyLatentImage&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="s2">&#34;6&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;inputs&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="n">prompt_text</span><span class="p">,</span> <span class="s2">&#34;clip&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;4&#34;</span><span class="p">,</span> <span class="mi">1</span><span class="p">]},</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">              <span class="s2">&#34;class_type&#34;</span><span class="p">:</span> <span class="s2">&#34;CLIPTextEncode&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">        <span class="s2">&#34;7&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;inputs&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="n">neg_text</span><span class="p">,</span> <span class="s2">&#34;clip&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;4&#34;</span><span class="p">,</span> <span class="mi">1</span><span class="p">]},</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">              <span class="s2">&#34;class_type&#34;</span><span class="p">:</span> <span class="s2">&#34;CLIPTextEncode&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">        <span class="s2">&#34;8&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;inputs&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;samples&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;3&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="s2">&#34;vae&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;4&#34;</span><span class="p">,</span> <span class="mi">2</span><span class="p">]},</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">              <span class="s2">&#34;class_type&#34;</span><span class="p">:</span> <span class="s2">&#34;VAEDecode&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">        <span class="s2">&#34;9&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;inputs&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;filename_prefix&#34;</span><span class="p">:</span> <span class="s2">&#34;comfyui-test&#34;</span><span class="p">,</span> <span class="s2">&#34;images&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;8&#34;</span><span class="p">,</span> <span class="mi">0</span><span class="p">]},</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">              <span class="s2">&#34;class_type&#34;</span><span class="p">:</span> <span class="s2">&#34;SaveImage&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">    <span class="p">}</span></span></span></code></pre></div><p><strong>workflow JSON 結構解釋</strong>：</p>
<ul>
<li><strong>每個 key（&ldquo;3&rdquo;、&ldquo;4&rdquo;、…）是節點 ID</strong>。任意整數字串、只要在 workflow 內唯一即可。</li>
<li><strong><code>class_type</code></strong>：節點類型（KSampler、CheckpointLoaderSimple、CLIPTextEncode 等）、ComfyUI 內建。</li>
<li><strong><code>inputs</code></strong>：節點參數。標量值（如 <code>1024</code>、<code>&quot;euler&quot;</code>）直接寫；連到別的節點輸出用 <code>[node_id, output_index]</code> 形式。</li>
<li><strong><code>[&quot;4&quot;, 0]</code></strong> 表示「節點 4 的第 0 個 output」。CheckpointLoaderSimple 有三個 output：<code>model</code>（0）、<code>clip</code>（1）、<code>vae</code>（2）、所以 <code>[&quot;4&quot;, 0]</code> 是 model、<code>[&quot;4&quot;, 1]</code> 是 clip、<code>[&quot;4&quot;, 2]</code> 是 vae。</li>
</ul>
<p><strong>每個節點做什麼</strong>：</p>
<ul>
<li><strong>4 CheckpointLoaderSimple</strong>：載 SDXL safetensors、輸出 model / clip / vae 三個東西。是整條 graph 的根。</li>
<li><strong>5 EmptyLatentImage</strong>：建一張 1024×1024 的空白 latent tensor（不是 RGB 圖、是 4-channel latent space tensor）。SDXL 的 「畫布」。</li>
<li><strong>6 CLIPTextEncode (positive)</strong>：把 prompt 文字用 CLIP text encoder 轉成 conditioning vector。</li>
<li><strong>7 CLIPTextEncode (negative)</strong>：同上、但是 negative prompt（要 avoid 的特徵）。</li>
<li><strong>3 KSampler</strong>：核心 denoising loop。15-30 個 step、把 latent 從噪聲變成跟 conditioning 對齊的 latent。</li>
<li><strong>8 VAEDecode</strong>：把 latent 用 VAE 解碼成 RGB 圖（1024×1024×3）。</li>
<li><strong>9 SaveImage</strong>：寫 PNG 到 <code>output/</code> 目錄、檔名 prefix <code>comfyui-test</code>。</li>
</ul>
<p><strong>為什麼 graph 結構這樣</strong>：</p>
<ul>
<li><strong>為什麼 model / clip / vae 從同一個 checkpoint 拿</strong>：SDXL 設計上三個元件互相 train、必須同源。從不同 checkpoint 拿會造成生成品質崩。</li>
<li><strong>為什麼 EmptyLatentImage 不直接接 KSampler、要設 batch_size</strong>：保留 batch 維度、未來要 batch generation（一次生 4 張）改 <code>batch_size: 4</code> 就好、其他節點不用改。</li>
<li><strong>為什麼 sampler 用 <code>euler</code>、scheduler 用 <code>normal</code></strong>：最簡單的組合、SDXL base 上品質可預測。其他選項（<code>dpmpp_2m</code>、<code>karras</code> scheduler 等）品質可能更好但效果各模型不同。</li>
<li><strong>為什麼 cfg=7.0</strong>：classifier-free guidance scale。SDXL 的標準預設、太低（&lt; 3）模型忽略 prompt、太高（&gt; 12）過 saturated。</li>
<li><strong>為什麼 seed=42</strong>：固定 seed 讓結果可重現。每次跑同 prompt 同 seed 同 model 結果完全一樣——是調 prompt / 比較 model 的必要條件。</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="n">workflow</span> <span class="o">=</span> <span class="n">build_workflow</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">prompt</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">neg</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">steps</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">client_id</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">uuid</span><span class="o">.</span><span class="n">uuid4</span><span class="p">())</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="n">resp</span> <span class="o">=</span> <span class="n">http_post_json</span><span class="p">(</span><span class="s2">&#34;/prompt&#34;</span><span class="p">,</span> <span class="p">{</span><span class="s2">&#34;prompt&#34;</span><span class="p">:</span> <span class="n">workflow</span><span class="p">,</span> <span class="s2">&#34;client_id&#34;</span><span class="p">:</span> <span class="n">client_id</span><span class="p">})</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="n">prompt_id</span> <span class="o">=</span> <span class="n">resp</span><span class="p">[</span><span class="s2">&#34;prompt_id&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        <span class="n">history</span> <span class="o">=</span> <span class="n">http_get_json</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;/history/</span><span class="si">{</span><span class="n">prompt_id</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">        <span class="k">if</span> <span class="n">prompt_id</span> <span class="ow">in</span> <span class="n">history</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">            <span class="n">outputs</span> <span class="o">=</span> <span class="n">history</span><span class="p">[</span><span class="n">prompt_id</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;outputs&#34;</span><span class="p">,</span> <span class="p">{})</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">            <span class="k">break</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="n">img</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="s2">&#34;9&#34;</span><span class="p">][</span><span class="s2">&#34;images&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="n">qs</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">parse</span><span class="o">.</span><span class="n">urlencode</span><span class="p">({</span><span class="s2">&#34;filename&#34;</span><span class="p">:</span> <span class="n">img</span><span class="p">[</span><span class="s2">&#34;filename&#34;</span><span class="p">],</span> <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;output&#34;</span><span class="p">})</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="n">blob</span> <span class="o">=</span> <span class="n">http_get_bytes</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;/view?</span><span class="si">{</span><span class="n">qs</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="n">Path</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">out</span><span class="p">)</span><span class="o">.</span><span class="n">write_bytes</span><span class="p">(</span><span class="n">blob</span><span class="p">)</span></span></span></code></pre></div><p><strong>每段做什麼</strong>：</p>
<ol>
<li><strong><code>client_id = str(uuid.uuid4())</code></strong>：每個 client 識別碼。ComfyUI 用 client_id 把 progress events 路由給正確 WebSocket subscriber。本 demo 用 polling、client_id 隨意產生即可。</li>
<li><strong><code>POST /prompt</code></strong>：送 workflow + client_id、server 回 <code>prompt_id</code>（這次 job 的 UUID）。Server 把 workflow 丟進 internal queue、立刻 return、不會等 generation。</li>
<li><strong><code>while True: time.sleep(2); GET /history/{prompt_id}</code></strong>：polling 等 job 完成。完成的 job 才會出現在 <code>/history</code> 裡（執行中 / queued 都不算）。</li>
<li><strong><code>if prompt_id in history</code></strong>：完成判讀——history 內出現該 prompt_id 表示 generation 結束。</li>
<li><strong><code>outputs[&quot;9&quot;][&quot;images&quot;][0]</code></strong>：節點 9 (SaveImage) 的輸出、含 <code>filename</code>、<code>subfolder</code>、<code>type</code> 等資訊。</li>
<li><strong><code>/view?filename=...&amp;type=output</code></strong>：拿生成的 PNG bytes。<code>type=output</code> 是 ComfyUI 的內部 dir 標記（區分 output / input / temp）。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 polling 而不是 WebSocket</strong>：WebSocket 要 subscribe events、處理 connection lifecycle、邏輯複雜。Polling 兩行解決、對教學 demo 夠用。Production 自動化系統建議用 WebSocket、知道每個 progress event。</li>
<li><strong>為什麼 <code>time.sleep(2)</code></strong>：太短（&lt; 1s）對 server 造成不必要 polling；太長（&gt; 5s）感知延遲明顯。2 秒是 demo 友善平衡。</li>
<li><strong>為什麼用 prompt_id 而不是 client_id 查 history</strong>：一個 client 可能送多個 job、prompt_id 唯一識別 job。client_id 主要用 WebSocket 訂閱、不是 history query 主鍵。</li>
<li><strong>為什麼 <code>Path(args.out).write_bytes(blob)</code></strong>：PNG 是 binary、用 <code>write_bytes</code> 直接寫；改用 <code>open(...).write()</code> 的 text mode 會在編碼轉換時破壞檔案內容。</li>
</ul>
<p><strong>實測</strong>：M4 Pro 32GB、prompt 「a photograph of an orange cat sitting on a wooden chair, soft natural lighting, detailed fur」、15 steps、cfg=7、euler+normal sampler、seed=42 → 106 秒生成 1024×1024 PNG、1.65 MB。</p>
<h2 id="comfyui-的-rest-api-形狀無-openai-相容層">ComfyUI 的 REST API 形狀（無 OpenAI 相容層）</h2>
<p>ComfyUI 沒提供 OpenAI 相容 API、它的 API 是自己的 REST + WebSocket：</p>
<ul>
<li><code>POST /prompt</code>：丟一個 workflow JSON、回傳 job id。</li>
<li><code>GET /history/{prompt_id}</code>：查看生成結果。</li>
<li><code>GET /view?filename=X</code>：拿生成的圖。</li>
<li>WebSocket：訂閱 job progress events。</li>
</ul>
<p>API 形狀跟 Diffusion 任務匹配、跟 LLM 的 <code>/chat/completions</code> 完全不同——這正是 <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 章節</a> 提到「Diffusion 跟 Transformer 工具鏈互不通用」的具體展現。Ollama / LM Studio 對接 Continue.dev 的 OpenAI 相容路徑、跟 ComfyUI 接 SDXL 是完全平行的兩條路。</p>
<h2 id="常用-custom-nodes">常用 Custom Nodes</h2>
<p>ComfyUI 的核心功能來自 custom nodes、社群維護。最常用：</p>
<table>
  <thead>
      <tr>
          <th>Custom Node</th>
          <th>功能</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ComfyUI-Manager</td>
          <td>管理其他 custom node、安裝 / 更新</td>
      </tr>
      <tr>
          <td>ComfyUI-Impact-Pack</td>
          <td>物件偵測、masking、inpainting</td>
      </tr>
      <tr>
          <td>ComfyUI-AnimateDiff</td>
          <td>影片動畫生成</td>
      </tr>
      <tr>
          <td>ComfyUI-ControlNet-Aux</td>
          <td>ControlNet preprocessor</td>
      </tr>
      <tr>
          <td>ComfyUI-IPAdapter-plus</td>
          <td>圖像 reference embedding</td>
      </tr>
  </tbody>
</table>
<p>安裝方式（透過 ComfyUI-Manager）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects/ComfyUI/custom_nodes
</span></span><span class="line"><span class="ln">2</span><span class="cl">git clone https://github.com/ltdrdata/ComfyUI-Manager.git
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># 重啟 ComfyUI、UI 多一個 Manager 按鈕、之後用 Manager 裝其他 node</span></span></span></code></pre></div><h2 id="常見坑">常見坑</h2>
<h3 id="python-版本太新torch-沒-wheel">Python 版本太新、torch 沒 wheel</h3>
<p>PyTorch 對最新 Python（3.13、3.14）的 wheel 發布有 lag、可能 <code>pip install -r requirements.txt</code> 跑 build from source 慢 + 失敗。退到 Python 3.11 / 3.12：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew install python@3.11
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3.11 -m venv venv
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nb">source</span> venv/bin/activate
</span></span><span class="line"><span class="ln">4</span><span class="cl">pip install -r requirements.txt</span></span></code></pre></div><h3 id="mps-false跑在-cpu-上"><code>mps: False</code>、跑在 CPU 上</h3>
<p>確認 PyTorch 是 Apple Silicon 版本（不是 x86_64 emulation）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python -c <span class="s2">&#34;import platform; print(platform.machine())&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># arm64 ← 正確；x86_64 ← 走 Rosetta、要重裝</span></span></span></code></pre></div><p>如果是 x86_64、表示 venv 用了 Intel Python。重建 venv：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">deactivate
</span></span><span class="line"><span class="ln">2</span><span class="cl">rm -rf venv
</span></span><span class="line"><span class="ln">3</span><span class="cl">arch -arm64 python3 -m venv venv</span></span></code></pre></div><h3 id="記憶體不夠推論時-crash">記憶體不夠、推論時 crash</h3>
<p>SDXL 在 16 GB Mac 上吃緊、可能 swap 或 crash。緩解：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 降解析度</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">python main.py --normalvram   <span class="c1"># 預設、~12 GB</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">python main.py --lowvram      <span class="c1"># 較省、~8 GB、慢</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">python main.py --novram       <span class="c1"># 極省、~4 GB、極慢、實用上界</span></span></span></code></pre></div><p>或換 SD 1.5（4 GB checkpoint）、記憶體需求 &lt; SDXL 的一半。</p>
<h3 id="workflow-json-載入失敗">Workflow JSON 載入失敗</h3>
<p>ComfyUI workflow 是 JSON 描述節點 + 連線。如果是別人分享的 workflow、可能用了你沒裝的 custom node。錯誤訊息會列出缺哪些 node、用 ComfyUI-Manager 補裝。</p>
<h3 id="port-8188-被佔">Port 8188 被佔</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">lsof -i :8188
</span></span><span class="line"><span class="ln">2</span><span class="cl">python main.py --port <span class="m">8189</span>  <span class="c1"># 改 port</span></span></span></code></pre></div><h2 id="跟-llm-stack-並存">跟 LLM stack 並存</h2>
<p>ComfyUI 用 port 8188、跟 Ollama (11434) / LM Studio (1234) 完全不撞、可同時跑。實務配置：</p>
<table>
  <thead>
      <tr>
          <th>服務</th>
          <th>Port</th>
          <th>用途</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ollama</td>
          <td>11434</td>
          <td>寫 code、對話</td>
      </tr>
      <tr>
          <td>ComfyUI</td>
          <td>8188</td>
          <td>產圖</td>
      </tr>
      <tr>
          <td>LM Studio</td>
          <td>1234</td>
          <td>探索新 LLM</td>
      </tr>
      <tr>
          <td>Open WebUI</td>
          <td>3000</td>
          <td>ChatGPT 風格瀏覽器介面</td>
      </tr>
  </tbody>
</table>
<p>各服務獨立、不干擾、可以一台 Mac 跑全部（看記憶體預算）。</p>
<h2 id="何時這篇會過時">何時這篇會過時</h2>
<ul>
<li>ComfyUI 主分支 API 短期內穩定（大量社群依賴）。</li>
<li>SDXL base 1.0 不會消失、但會被新版本（SDXL 1.1、Flux 等）取代——「下載 .safetensors 放 models/checkpoints/」流程不變。</li>
<li>MPS backend 持續優化、效能會提升、但介面不變。</li>
<li>Python 版本相容性會持續演化、<code>pip install -r requirements.txt</code> 偶爾要降版 Python。</li>
</ul>
<p>讀的時候若 pip install 失敗、看 ComfyUI GitHub issues 跟 PyTorch release notes 對應的 Python 版本。</p>
<p>跟其他 hands-on 章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、跨服務的 lifecycle / 記憶體管理見 <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management</a>、ComfyUI 跟 Ollama 同台跑的記憶體預算規劃見 <a href="/blog/llm/00-foundations/hardware-memory-budget/" data-link-title="0.5 Apple Silicon 記憶體預算" data-link-desc="記憶體決定能跑什麼，Q4 量化下的可運作模型對照與系統保留">0.5 Apple Silicon 記憶體預算</a>。</p>
]]></content:encoded></item><item><title>Hands-on：安裝 whisper.cpp 做語音轉文字</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/whisper-setup/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/whisper-setup/</guid><description>&lt;p>本篇紀錄在 Apple Silicon Mac 上裝 &lt;code>whisper.cpp&lt;/code> 並驗證英文語音轉文字。選 whisper.cpp 而非 &lt;code>openai-whisper&lt;/code>（Python 版）的理由：&lt;/p>
&lt;ul>
&lt;li>純 C++ 實作、Metal backend 直接吃 Apple Silicon GPU。&lt;/li>
&lt;li>Homebrew bottle、&lt;code>brew install&lt;/code> 一行裝完、不需要 Python 環境跟 torch wheel。&lt;/li>
&lt;li>Binary 名稱是 &lt;code>whisper-cli&lt;/code>、CLI-first、整合到 shell pipeline 容易。&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>whisper-cpp 版本&lt;/strong>：1.8.4
&lt;strong>示範模型&lt;/strong>：&lt;code>ggml-tiny.en.bin&lt;/code>（78 MB、英文專用、最小可用）
&lt;strong>實測&lt;/strong>：7 秒音訊 484ms 轉錄、用 Metal GPU 加速&lt;/p>&lt;/blockquote>
&lt;h2 id="前置設定">前置設定&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>項目&lt;/th>
 &lt;th>檢查指令&lt;/th>
 &lt;th>預期&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Homebrew&lt;/td>
 &lt;td>&lt;code>brew --version&lt;/code>&lt;/td>
 &lt;td>4.x&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>ffmpeg&lt;/td>
 &lt;td>&lt;code>which ffmpeg&lt;/code>&lt;/td>
 &lt;td>&lt;code>/opt/homebrew/bin/ffmpeg&lt;/code>（沒有：&lt;code>brew install ffmpeg&lt;/code>）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>磁碟空間&lt;/td>
 &lt;td>&lt;code>df -h ~&lt;/code>&lt;/td>
 &lt;td>至少 200 MB（whisper-cli + 1 個 small model）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;code>ffmpeg&lt;/code> 是必要的——whisper-cli 接受多種音訊格式、但實際內部會先轉成 16kHz mono WAV、ffmpeg 是這個轉換的依賴。&lt;/p>
&lt;h2 id="安裝-whisper-cpp">安裝 whisper-cpp&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">brew install whisper-cpp&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Homebrew 會裝：&lt;/p>
&lt;ul>
&lt;li>&lt;code>whisper-cli&lt;/code> binary 到 &lt;code>/opt/homebrew/bin/&lt;/code>&lt;/li>
&lt;li>&lt;code>ggml&lt;/code> 共用 lib 到 &lt;code>/opt/homebrew/Cellar/ggml/&lt;/code>&lt;/li>
&lt;li>BLAS / Metal backend 自動配對 Apple Silicon&lt;/li>
&lt;/ul>
&lt;p>驗證 binary 可用：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">which whisper-cli
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1"># /opt/homebrew/bin/whisper-cli&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">whisper-cli --help 2&amp;gt;&lt;span class="p">&amp;amp;&lt;/span>&lt;span class="m">1&lt;/span> &lt;span class="p">|&lt;/span> head -5&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>第一次跑會看到 Metal 初始化訊息：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">ggml_metal_library_init: using embedded metal library
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">ggml_metal_library_init: loaded in 6.883 sec&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>第一次 Metal lib 載入慢（~7 秒）、後續會 cache、變很快。&lt;/p>
&lt;h2 id="下載-model">下載 Model&lt;/h2>
&lt;p>whisper-cpp 跟 OpenAI 原版分離管理 model file、要自己下載 GGML 格式：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">mkdir -p ~/.whisper-models
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="nb">cd&lt;/span> ~/.whisper-models
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">curl -L -o ggml-tiny.en.bin &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="s2">&amp;#34;https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>可用 model 比較（大小越大、品質越好、速度越慢）：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Model&lt;/th>
 &lt;th>大小&lt;/th>
 &lt;th>適合場景&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>ggml-tiny.en.bin&lt;/code>&lt;/td>
 &lt;td>78 MB&lt;/td>
 &lt;td>英文、最小驗證、品質可接受&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>ggml-base.en.bin&lt;/code>&lt;/td>
 &lt;td>148 MB&lt;/td>
 &lt;td>英文、常用入門&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>ggml-small.en.bin&lt;/code>&lt;/td>
 &lt;td>488 MB&lt;/td>
 &lt;td>英文、daily use 甜蜜點&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>ggml-medium.en.bin&lt;/code>&lt;/td>
 &lt;td>1.5 GB&lt;/td>
 &lt;td>英文、品質敏感&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>ggml-small.bin&lt;/code>&lt;/td>
 &lt;td>488 MB&lt;/td>
 &lt;td>多語言（含中文）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>ggml-large-v3.bin&lt;/code>&lt;/td>
 &lt;td>3.1 GB&lt;/td>
 &lt;td>多語言、最佳品質、跑得最慢&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>選 &lt;code>tiny.en&lt;/code> 是因為&lt;strong>只驗證安裝路徑&lt;/strong>、實際日常用要 &lt;code>small.en&lt;/code> 起跳。&lt;/p>
&lt;p>驗證下載：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">ls -lh ~/.whisper-models/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1"># 應該看到 78 MB 的 ggml-tiny.en.bin&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="跑第一次轉錄">跑第一次轉錄&lt;/h2>
&lt;p>需要一段測試音訊。可以用 macOS 內建 &lt;code>say&lt;/code> 生成、再用 ffmpeg 轉成 whisper.cpp 需要的格式（16kHz mono WAV）：&lt;/p></description><content:encoded><![CDATA[<p>本篇紀錄在 Apple Silicon Mac 上裝 <code>whisper.cpp</code> 並驗證英文語音轉文字。選 whisper.cpp 而非 <code>openai-whisper</code>（Python 版）的理由：</p>
<ul>
<li>純 C++ 實作、Metal backend 直接吃 Apple Silicon GPU。</li>
<li>Homebrew bottle、<code>brew install</code> 一行裝完、不需要 Python 環境跟 torch wheel。</li>
<li>Binary 名稱是 <code>whisper-cli</code>、CLI-first、整合到 shell pipeline 容易。</li>
</ul>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>whisper-cpp 版本</strong>：1.8.4
<strong>示範模型</strong>：<code>ggml-tiny.en.bin</code>（78 MB、英文專用、最小可用）
<strong>實測</strong>：7 秒音訊 484ms 轉錄、用 Metal GPU 加速</p></blockquote>
<h2 id="前置設定">前置設定</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>檢查指令</th>
          <th>預期</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Homebrew</td>
          <td><code>brew --version</code></td>
          <td>4.x</td>
      </tr>
      <tr>
          <td>ffmpeg</td>
          <td><code>which ffmpeg</code></td>
          <td><code>/opt/homebrew/bin/ffmpeg</code>（沒有：<code>brew install ffmpeg</code>）</td>
      </tr>
      <tr>
          <td>磁碟空間</td>
          <td><code>df -h ~</code></td>
          <td>至少 200 MB（whisper-cli + 1 個 small model）</td>
      </tr>
  </tbody>
</table>
<p><code>ffmpeg</code> 是必要的——whisper-cli 接受多種音訊格式、但實際內部會先轉成 16kHz mono WAV、ffmpeg 是這個轉換的依賴。</p>
<h2 id="安裝-whisper-cpp">安裝 whisper-cpp</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew install whisper-cpp</span></span></code></pre></div><p>Homebrew 會裝：</p>
<ul>
<li><code>whisper-cli</code> binary 到 <code>/opt/homebrew/bin/</code></li>
<li><code>ggml</code> 共用 lib 到 <code>/opt/homebrew/Cellar/ggml/</code></li>
<li>BLAS / Metal backend 自動配對 Apple Silicon</li>
</ul>
<p>驗證 binary 可用：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">which whisper-cli
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># /opt/homebrew/bin/whisper-cli</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl">whisper-cli --help 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">|</span> head -5</span></span></code></pre></div><p>第一次跑會看到 Metal 初始化訊息：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">ggml_metal_library_init: using embedded metal library
</span></span><span class="line"><span class="ln">2</span><span class="cl">ggml_metal_library_init: loaded in 6.883 sec</span></span></code></pre></div><p>第一次 Metal lib 載入慢（~7 秒）、後續會 cache、變很快。</p>
<h2 id="下載-model">下載 Model</h2>
<p>whisper-cpp 跟 OpenAI 原版分離管理 model file、要自己下載 GGML 格式：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mkdir -p ~/.whisper-models
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">cd</span> ~/.whisper-models
</span></span><span class="line"><span class="ln">3</span><span class="cl">curl -L -o ggml-tiny.en.bin <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin&#34;</span></span></span></code></pre></div><p>可用 model 比較（大小越大、品質越好、速度越慢）：</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>大小</th>
          <th>適合場景</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>ggml-tiny.en.bin</code></td>
          <td>78 MB</td>
          <td>英文、最小驗證、品質可接受</td>
      </tr>
      <tr>
          <td><code>ggml-base.en.bin</code></td>
          <td>148 MB</td>
          <td>英文、常用入門</td>
      </tr>
      <tr>
          <td><code>ggml-small.en.bin</code></td>
          <td>488 MB</td>
          <td>英文、daily use 甜蜜點</td>
      </tr>
      <tr>
          <td><code>ggml-medium.en.bin</code></td>
          <td>1.5 GB</td>
          <td>英文、品質敏感</td>
      </tr>
      <tr>
          <td><code>ggml-small.bin</code></td>
          <td>488 MB</td>
          <td>多語言（含中文）</td>
      </tr>
      <tr>
          <td><code>ggml-large-v3.bin</code></td>
          <td>3.1 GB</td>
          <td>多語言、最佳品質、跑得最慢</td>
      </tr>
  </tbody>
</table>
<p>選 <code>tiny.en</code> 是因為<strong>只驗證安裝路徑</strong>、實際日常用要 <code>small.en</code> 起跳。</p>
<p>驗證下載：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ls -lh ~/.whisper-models/
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 應該看到 78 MB 的 ggml-tiny.en.bin</span></span></span></code></pre></div><h2 id="跑第一次轉錄">跑第一次轉錄</h2>
<p>需要一段測試音訊。可以用 macOS 內建 <code>say</code> 生成、再用 ffmpeg 轉成 whisper.cpp 需要的格式（16kHz mono WAV）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> /tmp
</span></span><span class="line"><span class="ln">2</span><span class="cl">say -o sample.aiff -v Samantha <span class="s2">&#34;Hello world. This is a test of the whisper transcription system. It should produce accurate text from this short audio clip.&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">ffmpeg -loglevel error -y -i sample.aiff -ar <span class="m">16000</span> -ac <span class="m">1</span> sample.wav</span></span></code></pre></div><p><code>-ar 16000 -ac 1</code> 是 whisper.cpp 的標準輸入規格（16 kHz、單聲道、16-bit PCM）。Whisper 模型訓練時用這個 sample rate、輸入不符會降低準確度。</p>
<p>轉錄：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">whisper-cli -m ~/.whisper-models/ggml-tiny.en.bin -f /tmp/sample.wav</span></span></code></pre></div><p>預期輸出（含時間軸）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[00:00:00.000 --&gt; 00:00:03.980]   Hello World, this is a test of the whisper transcription system.
</span></span><span class="line"><span class="ln">2</span><span class="cl">[00:00:03.980 --&gt; 00:00:06.980]   It should produce accurate text from this short audio clip.
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl">whisper_print_timings:     load time =    39.88 ms
</span></span><span class="line"><span class="ln">5</span><span class="cl">whisper_print_timings:   encode time =   220.01 ms
</span></span><span class="line"><span class="ln">6</span><span class="cl">whisper_print_timings:    total time =   484.08 ms</span></span></code></pre></div><p>關鍵觀察：</p>
<ul>
<li><strong>484ms</strong> 處理 7 秒音訊、約 14x 即時速度。</li>
<li>轉錄結果跟原文一致（除了 <code>world</code> 大寫變 <code>World</code>）。</li>
<li>含時間軸（time stamps）、可以做 subtitle / 字幕對齊。</li>
</ul>
<p>要拿不含時間軸的純文字：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">whisper-cli -m ~/.whisper-models/ggml-tiny.en.bin -f /tmp/sample.wav -nt
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># -nt 是 --no-timestamps</span></span></span></code></pre></div><h2 id="常用選項">常用選項</h2>
<table>
  <thead>
      <tr>
          <th>選項</th>
          <th>作用</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>-l zh</code></td>
          <td>指定語言（中文）；多語言 model 用、單語 model 用不到</td>
      </tr>
      <tr>
          <td><code>-otxt</code></td>
          <td>同時輸出 .txt 檔（純文字、無時間軸）</td>
      </tr>
      <tr>
          <td><code>-osrt</code></td>
          <td>同時輸出 .srt 字幕檔</td>
      </tr>
      <tr>
          <td><code>-ovtt</code></td>
          <td>同時輸出 .vtt 字幕檔</td>
      </tr>
      <tr>
          <td><code>-of OUT</code></td>
          <td>設定輸出檔名 prefix</td>
      </tr>
      <tr>
          <td><code>-t N</code></td>
          <td>用 N 個 thread（預設用 CPU 核心數）</td>
      </tr>
      <tr>
          <td><code>-pp</code></td>
          <td>print progress（顯示處理進度條、跑長音訊時開）</td>
      </tr>
  </tbody>
</table>
<p>實務常用組合：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 字幕生成</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">whisper-cli -m ~/.whisper-models/ggml-small.en.bin <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  -f input.wav <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  -osrt <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>  -of output_subtitle
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 中文轉錄</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">whisper-cli -m ~/.whisper-models/ggml-small.bin <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  -f speech.wav <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  -l zh</span></span></code></pre></div><h2 id="跟其他工具串接">跟其他工具串接</h2>
<p>Whisper-cli 的 stdout 是純文字、容易串 pipeline：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 轉錄結果直接餵給 LLM 摘要</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">whisper-cli -m ~/.whisper-models/ggml-small.en.bin -f meeting.wav -nt <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  <span class="p">|</span> curl -s http://localhost:11434/v1/chat/completions <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>    -H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>    -d @- <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s">{
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s">  &#34;model&#34;: &#34;gemma3:1b&#34;,
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s">  &#34;messages&#34;: [
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s">    {&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: &#34;Summarize the meeting transcript in 5 bullet points.&#34;},
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s">    {&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;$(cat)&#34;}
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s">  ]
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s">}
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s">EOF</span></span></span></code></pre></div><p>這個 pipeline 串接到 <a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama</a> 完成「語音 → 文字 → 摘要」流程、整條本地、無雲端 API。</p>
<h2 id="常見坑">常見坑</h2>
<h3 id="audio-file-not-found--format-error">「audio file not found / format error」</h3>
<p>確認 ffmpeg 已轉成 16kHz mono：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ffprobe input.wav 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">|</span> grep -E <span class="s2">&#34;Stream|Audio&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 應該看到：Audio: pcm_s16le, 16000 Hz, mono</span></span></span></code></pre></div><p>不是這個規格就用 ffmpeg 轉：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ffmpeg -i input.mp3 -ar <span class="m">16000</span> -ac <span class="m">1</span> -c:a pcm_s16le output.wav</span></span></code></pre></div><h3 id="model-載入慢">Model 載入慢</h3>
<p>第一次 Metal lib 初始化要 ~7 秒、是 macOS Metal compiler 在 cache shader。後續快很多。</p>
<p>如果每次都慢、看是否 Metal cache 路徑（<code>~/Library/Caches/...</code>）有權限問題。</p>
<h3 id="中文--多語言準確度差">中文 / 多語言準確度差</h3>
<p>確認 model 不是 <code>.en</code> 後綴：<code>.en</code> model 只訓練英文、餵中文會 hallucinate。中文要用 <code>ggml-small.bin</code>、<code>ggml-medium.bin</code>、<code>ggml-large-v3.bin</code>（沒 <code>.en</code>）。</p>
<h3 id="output-拼錯字">Output 拼錯字</h3>
<p>Whisper tiny / base model 對非母音清晰、噪音多、口音重的音訊準確度差。換 small 或 medium 通常解決。</p>
<h2 id="完整-round-trip-驗證">完整 round-trip 驗證</h2>
<p>驗證 Whisper + Piper TTS 完整迴圈：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Piper 生成 WAV</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;Hello world test.&#34;</span> <span class="p">|</span> piper -m ~/.piper-voices/en_US-lessac-low.onnx -f /tmp/out.wav
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Whisper 轉回文字</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">whisper-cli -m ~/.whisper-models/ggml-tiny.en.bin -f /tmp/out.wav -nt
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># 應該回：Hello world test.</span></span></span></code></pre></div><p>兩個都跑得起來表示整條 STT / TTS pipeline 工作。沒裝 Piper 的場景：用任何 16kHz 單聲道 WAV 都能驗證（macOS 內建 <code>say -o sample.aiff</code> + ffmpeg 轉檔、或從 Hugging Face 拉個 sample 音訊）、不一定要用 Piper。</p>
<p>跟其他章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、本地 LLM 加 speech 在隱私 / 資料流上的位置見 <a href="/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 隱私資料流原理</a>、排錯走三層方法論見 <a href="/blog/llm/01-local-llm-services/troubleshooting/" data-link-title="1.7 排錯方法論：用三層架構做故障定位" data-link-desc="故障定位的分層思考、症狀到層級的對應反射、log 在三層的角色差異、最小可重現的縮減策略">1.7 排錯方法論</a>。</p>
<h2 id="何時這篇會過時">何時這篇會過時</h2>
<ul>
<li><code>brew install whisper-cpp</code> 安裝方式短期內不會變。</li>
<li>GGML model 路徑（Hugging Face <code>ggerganov/whisper.cpp</code>）穩定、是 maintainer 官方 repo。</li>
<li>模型版本會更新（large-v3 → large-v4 等）、但「下載 GGML、用 whisper-cli 餵 WAV」流程不變。</li>
<li>Metal backend 自動啟用、不需配置——Apple Silicon GPU 演化會持續增進效能但不影響介面。</li>
</ul>
<p>讀的時候若 brew 跑失敗、查 whisper.cpp GitHub release notes；模型新版本看 Hugging Face <code>ggerganov/whisper.cpp</code> repo 列表。</p>
]]></content:encoded></item><item><title>Hands-on：安裝 Piper TTS 做文字轉語音</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/piper-tts-setup/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/piper-tts-setup/</guid><description>&lt;p>本篇紀錄裝 Piper TTS 並用它合成英文語音、再用 Whisper 轉回文字做 round-trip 驗證。選 Piper 而非雲端 TTS（OpenAI / ElevenLabs）的理由：&lt;/p>
&lt;ul>
&lt;li>完全本地、隱私邊界乾淨。&lt;/li>
&lt;li>ONNX runtime、Apple Silicon 跑得動、不依賴 GPU。&lt;/li>
&lt;li>模型小（low quality ~17-65 MB、medium ~50 MB、high ~125 MB）、適合 minimal 驗證。&lt;/li>
&lt;li>CLI-first、stdin 餵文字、stdout 或檔案輸出 WAV、容易串 pipeline。&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>Piper 版本&lt;/strong>：透過 pip 安裝
&lt;strong>示範 voice&lt;/strong>：&lt;code>en_US-lessac-low.onnx&lt;/code>（63 MB、英文女聲、low quality）
&lt;strong>實測&lt;/strong>：4 秒文字合成 &amp;lt; 1 秒、品質夠日常用&lt;/p>&lt;/blockquote>
&lt;h2 id="前置設定">前置設定&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>項目&lt;/th>
 &lt;th>檢查指令&lt;/th>
 &lt;th>預期&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Python&lt;/td>
 &lt;td>&lt;code>python3 --version&lt;/code>&lt;/td>
 &lt;td>3.11+&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>pip&lt;/td>
 &lt;td>&lt;code>pip3 --version&lt;/code>&lt;/td>
 &lt;td>25+&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>磁碟空間&lt;/td>
 &lt;td>&lt;code>df -h ~&lt;/code>&lt;/td>
 &lt;td>至少 200 MB（Piper + 一個 voice）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Piper 跟 Whisper 一樣分離 binary 跟 model：先裝 runtime、再下載 voice。&lt;/p>
&lt;h2 id="安裝-piper">安裝 Piper&lt;/h2>
&lt;p>&lt;code>piper-tts&lt;/code> 沒有 Homebrew formula、用 pip 裝：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">pip3 install piper-tts --break-system-packages&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>PEP 668&lt;/code> 是 macOS / Homebrew Python 的 external-management 機制、保護系統 Python 不被 pip 安裝污染；&lt;code>--break-system-packages&lt;/code> 是 bypass flag、跳過該檢查直接裝。比較乾淨的做法是用 venv：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">python3 -m venv ~/.piper-venv
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="nb">source&lt;/span> ~/.piper-venv/bin/activate
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">pip install piper-tts&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>但裝完 PATH 要指到 venv 的 piper、稍麻煩。本 demo 用 &lt;code>--break-system-packages&lt;/code> 簡化。實際生產建議用 venv 或 pipx。&lt;/p>
&lt;p>驗證 binary 在 PATH：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">which piper
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1"># /opt/homebrew/bin/piper（若 pip3 來自 Homebrew Python）&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="c1"># 或 ~/Library/Python/3.x/bin/piper（若 pip3 來自系統 Python）&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">piper --help &lt;span class="p">|&lt;/span> head -10&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>which piper&lt;/code> 找不到時、檢查兩個 bin 目錄哪邊有檔案、把該目錄加進 &lt;code>PATH&lt;/code>。&lt;/p>
&lt;h2 id="下載-voice-model">下載 Voice Model&lt;/h2>
&lt;p>Piper 用 ONNX 格式的 voice model、每個 voice 是一對 &lt;code>.onnx&lt;/code>（model 權重）+ &lt;code>.onnx.json&lt;/code>（metadata、含採樣率、phoneme map）。&lt;/p>
&lt;p>從 Hugging Face &lt;code>rhasspy/piper-voices&lt;/code> repo 拉：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">mkdir -p ~/.piper-voices
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="nb">cd&lt;/span> ~/.piper-voices
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="c1"># 英文女聲、low quality（小、快）&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">curl -L -o en_US-lessac-low.onnx &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="s2">&amp;#34;https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/low/en_US-lessac-low.onnx&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">curl -L -o en_US-lessac-low.onnx.json &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="s2">&amp;#34;https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/low/en_US-lessac-low.onnx.json&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>可用 voice quality 等級：&lt;/p></description><content:encoded><![CDATA[<p>本篇紀錄裝 Piper TTS 並用它合成英文語音、再用 Whisper 轉回文字做 round-trip 驗證。選 Piper 而非雲端 TTS（OpenAI / ElevenLabs）的理由：</p>
<ul>
<li>完全本地、隱私邊界乾淨。</li>
<li>ONNX runtime、Apple Silicon 跑得動、不依賴 GPU。</li>
<li>模型小（low quality ~17-65 MB、medium ~50 MB、high ~125 MB）、適合 minimal 驗證。</li>
<li>CLI-first、stdin 餵文字、stdout 或檔案輸出 WAV、容易串 pipeline。</li>
</ul>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>Piper 版本</strong>：透過 pip 安裝
<strong>示範 voice</strong>：<code>en_US-lessac-low.onnx</code>（63 MB、英文女聲、low quality）
<strong>實測</strong>：4 秒文字合成 &lt; 1 秒、品質夠日常用</p></blockquote>
<h2 id="前置設定">前置設定</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>檢查指令</th>
          <th>預期</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Python</td>
          <td><code>python3 --version</code></td>
          <td>3.11+</td>
      </tr>
      <tr>
          <td>pip</td>
          <td><code>pip3 --version</code></td>
          <td>25+</td>
      </tr>
      <tr>
          <td>磁碟空間</td>
          <td><code>df -h ~</code></td>
          <td>至少 200 MB（Piper + 一個 voice）</td>
      </tr>
  </tbody>
</table>
<p>Piper 跟 Whisper 一樣分離 binary 跟 model：先裝 runtime、再下載 voice。</p>
<h2 id="安裝-piper">安裝 Piper</h2>
<p><code>piper-tts</code> 沒有 Homebrew formula、用 pip 裝：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">pip3 install piper-tts --break-system-packages</span></span></code></pre></div><p><code>PEP 668</code> 是 macOS / Homebrew Python 的 external-management 機制、保護系統 Python 不被 pip 安裝污染；<code>--break-system-packages</code> 是 bypass flag、跳過該檢查直接裝。比較乾淨的做法是用 venv：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python3 -m venv ~/.piper-venv
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">source</span> ~/.piper-venv/bin/activate
</span></span><span class="line"><span class="ln">3</span><span class="cl">pip install piper-tts</span></span></code></pre></div><p>但裝完 PATH 要指到 venv 的 piper、稍麻煩。本 demo 用 <code>--break-system-packages</code> 簡化。實際生產建議用 venv 或 pipx。</p>
<p>驗證 binary 在 PATH：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">which piper
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># /opt/homebrew/bin/piper（若 pip3 來自 Homebrew Python）</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># 或 ~/Library/Python/3.x/bin/piper（若 pip3 來自系統 Python）</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl">piper --help <span class="p">|</span> head -10</span></span></code></pre></div><p><code>which piper</code> 找不到時、檢查兩個 bin 目錄哪邊有檔案、把該目錄加進 <code>PATH</code>。</p>
<h2 id="下載-voice-model">下載 Voice Model</h2>
<p>Piper 用 ONNX 格式的 voice model、每個 voice 是一對 <code>.onnx</code>（model 權重）+ <code>.onnx.json</code>（metadata、含採樣率、phoneme map）。</p>
<p>從 Hugging Face <code>rhasspy/piper-voices</code> repo 拉：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mkdir -p ~/.piper-voices
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">cd</span> ~/.piper-voices
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 英文女聲、low quality（小、快）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">curl -L -o en_US-lessac-low.onnx <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/low/en_US-lessac-low.onnx&#34;</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">curl -L -o en_US-lessac-low.onnx.json <span class="se">\
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/low/en_US-lessac-low.onnx.json&#34;</span></span></span></code></pre></div><p>可用 voice quality 等級：</p>
<table>
  <thead>
      <tr>
          <th>Quality</th>
          <th>大小</th>
          <th>用途</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>low</code></td>
          <td>17-65 MB</td>
          <td>快、品質粗糙、適合 prototype</td>
      </tr>
      <tr>
          <td><code>medium</code></td>
          <td>50-100 MB</td>
          <td>平衡、日常用</td>
      </tr>
      <tr>
          <td><code>high</code></td>
          <td>100-200 MB</td>
          <td>品質佳、合成略慢</td>
      </tr>
      <tr>
          <td><code>x_low</code></td>
          <td>&lt; 20 MB</td>
          <td>極小、品質明顯差、適合受限環境</td>
      </tr>
  </tbody>
</table>
<p>語言 / 地區覆蓋（部分）：</p>
<table>
  <thead>
      <tr>
          <th>Locale</th>
          <th>Voice 範例</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>en_US</code></td>
          <td>lessac、ryan、amy、libritts</td>
      </tr>
      <tr>
          <td><code>en_GB</code></td>
          <td>alan、cori、jenny</td>
      </tr>
      <tr>
          <td><code>zh_CN</code></td>
          <td>huayan（北京話）</td>
      </tr>
      <tr>
          <td><code>ja_JP</code>（社群）</td>
          <td>較少</td>
      </tr>
      <tr>
          <td><code>de_DE</code> / <code>fr_FR</code> / <code>es_ES</code> 等</td>
          <td>各有多個</td>
      </tr>
  </tbody>
</table>
<p>完整清單在 <code>rhasspy/piper-voices</code> 的 <a href="https://github.com/rhasspy/piper">VOICES.md</a>。</p>
<p>驗證下載：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ls -lh ~/.piper-voices/
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># en_US-lessac-low.onnx       63M</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># en_US-lessac-low.onnx.json  4.9K</span></span></span></code></pre></div><h2 id="跑第一次合成">跑第一次合成</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;Hello from Piper TTS, this is a synthesized voice test.&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  <span class="p">|</span> piper -m ~/.piper-voices/en_US-lessac-low.onnx -f /tmp/piper-out.wav</span></span></code></pre></div><p>說明：</p>
<ul>
<li>文字從 stdin 進、是 Piper 的標準輸入方式。</li>
<li><code>-m</code>：voice model <code>.onnx</code> path。Piper 自動找同目錄的 <code>.onnx.json</code>。</li>
<li><code>-f</code>：output WAV path。不指定的話直接寫 stdout（可以 pipe 到 <code>aplay</code> / <code>afplay</code> 即時播放）。</li>
</ul>
<p>預期輸出：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ls -lh /tmp/piper-out.wav
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 128 KB</span></span></span></code></pre></div><p>驗證 WAV 規格：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">file /tmp/piper-out.wav
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl">ffprobe -loglevel error -show_format /tmp/piper-out.wav <span class="p">|</span> grep duration
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># duration=3.984000</span></span></span></code></pre></div><p>16-bit PCM、16 kHz mono——跟 <a href="/blog/llm/01-local-llm-services/hands-on/whisper-setup/" data-link-title="Hands-on：安裝 whisper.cpp 做語音轉文字" data-link-desc="brew install whisper-cpp、下載 GGML model、Metal 加速、ffmpeg 餵 WAV、484ms 完成 7 秒音訊轉錄">Whisper</a> 期望的輸入規格一致、可以直接 round-trip。</p>
<p>播放確認：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">afplay /tmp/piper-out.wav</span></span></code></pre></div><h2 id="常用選項">常用選項</h2>
<table>
  <thead>
      <tr>
          <th>選項</th>
          <th>作用</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>-m MODEL</code></td>
          <td>voice model <code>.onnx</code> 路徑（必填）</td>
      </tr>
      <tr>
          <td><code>-c CONFIG</code></td>
          <td>metadata json 路徑（預設自動找同名 <code>.onnx.json</code>）</td>
      </tr>
      <tr>
          <td><code>-i FILE</code></td>
          <td>輸入文字檔（替代 stdin）</td>
      </tr>
      <tr>
          <td><code>-f OUTPUT</code></td>
          <td>輸出 WAV 路徑</td>
      </tr>
      <tr>
          <td><code>-d DIR</code></td>
          <td>輸出目錄（多句時自動分檔）</td>
      </tr>
      <tr>
          <td><code>--length-scale FACTOR</code></td>
          <td>速度調整（&lt; 1 加速、&gt; 1 減速、預設 1.0）</td>
      </tr>
      <tr>
          <td><code>--volume FACTOR</code></td>
          <td>音量調整（0.0-1.0）</td>
      </tr>
      <tr>
          <td><code>-s SPEAKER</code></td>
          <td>多 speaker model 選 speaker（如 libritts）</td>
      </tr>
      <tr>
          <td><code>--cuda</code></td>
          <td>用 CUDA（Apple Silicon 用不到、留 default）</td>
      </tr>
  </tbody>
</table>
<p>典型應用：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 從文字檔合成</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">piper -m ~/.piper-voices/en_US-lessac-low.onnx <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  -i article.txt <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  -f narration.wav
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 多句子分檔</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">piper -m ~/.piper-voices/en_US-lessac-medium.onnx <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  -i script.txt <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  -d ~/audio-output/ <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --output-dir-naming text
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># 慢速朗讀（學習用）</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">piper -m ~/.piper-voices/en_US-lessac-low.onnx <span class="se">\
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="se"></span>  --length-scale 1.4 <span class="se">\
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="se"></span>  -f slow.wav <span class="o">&lt;&lt;&lt;</span> <span class="s2">&#34;Slowly read this sentence.&#34;</span></span></span></code></pre></div><h2 id="round-trip-驗證">Round-Trip 驗證</h2>
<p>確認 TTS + STT 整條串得起來：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. Piper TTS：文字 → WAV</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;The quick brown fox jumps over the lazy dog.&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  <span class="p">|</span> piper -m ~/.piper-voices/en_US-lessac-low.onnx -f /tmp/test.wav
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 2. Whisper STT：WAV → 文字</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">whisper-cli -m ~/.whisper-models/ggml-tiny.en.bin -f /tmp/test.wav -nt</span></span></code></pre></div><p>預期 Whisper 回應接近原文字（可能大小寫 / 標點稍變）。Round-trip 成功表示：</p>
<ul>
<li>Piper 輸出格式（16kHz mono WAV）符合 Whisper 輸入需求。</li>
<li>兩個模型對英文的訓練分佈相容。</li>
</ul>
<h2 id="跟-llm-串接llm-說話的-minimal-pipeline">跟 LLM 串接：「LLM 說話」的 minimal pipeline</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. LLM 生成回答</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="nv">ANSWER</span><span class="o">=</span><span class="k">$(</span>curl -s http://localhost:11434/v1/chat/completions <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  -H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s1">    &#34;model&#34;: &#34;gemma3:1b&#34;,
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s1">    &#34;messages&#34;: [{&#34;role&#34;:&#34;user&#34;,&#34;content&#34;:&#34;Tell me a one-sentence joke.&#34;}],
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s1">    &#34;stream&#34;: false
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s1">  }&#39;</span> <span class="p">|</span> python3 -c <span class="s2">&#34;import json,sys; print(json.load(sys.stdin)[&#39;choices&#39;][0][&#39;message&#39;][&#39;content&#39;])&#34;</span><span class="k">)</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 2. Piper 把回答念出來</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;</span><span class="nv">$ANSWER</span><span class="s2">&#34;</span> <span class="p">|</span> piper -m ~/.piper-voices/en_US-lessac-low.onnx -f /tmp/llm-says.wav
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># 3. 播放</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">afplay /tmp/llm-says.wav</span></span></code></pre></div><p>三行 shell 完成「Local LLM 講笑話」整條 pipeline、無雲端、無 GPU。</p>
<h2 id="常見坑">常見坑</h2>
<h3 id="中文--多語言">中文 / 多語言</h3>
<p><code>en_US-lessac-low</code> 是英文 voice、餵中文會發音怪。中文要下載 <code>zh_CN-huayan-*</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">curl -L -o ~/.piper-voices/zh_CN-huayan-medium.onnx <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/zh/zh_CN/huayan/medium/zh_CN-huayan-medium.onnx&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">curl -L -o ~/.piper-voices/zh_CN-huayan-medium.onnx.json <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/zh/zh_CN/huayan/medium/zh_CN-huayan-medium.onnx.json&#34;</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;你好，這是 Piper TTS 的中文測試。&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="se"></span>  <span class="p">|</span> piper -m ~/.piper-voices/zh_CN-huayan-medium.onnx -f /tmp/zh-out.wav</span></span></code></pre></div><p>zh_CN 預設是北京話腔調。</p>
<h3 id="--break-system-packages-警告"><code>--break-system-packages</code> 警告</h3>
<p>macOS 系統 Python 3.13+ 預設禁止 pip 直接裝。安全做法用 venv 或 pipx；不想搞 venv 就用 <code>--break-system-packages</code> flag（會跳警告但能裝）。長期建議遷到 venv、避免污染系統 Python。</p>
<h3 id="voice-quality-不夠">Voice quality 不夠</h3>
<p><code>low</code> quality 的 voice 適合驗證 / prototype、實際用 <code>medium</code> 或 <code>high</code>。低品質 voice 在長段文字會聽起來機械、自然度差。</p>
<h3 id="sample-rate-mismatch">Sample rate mismatch</h3>
<p>Voice metadata（<code>.onnx.json</code> 內 <code>sample_rate</code>）決定輸出 sample rate、不同 voice 可能不同（多數 22050 或 16000）。Whisper 期望 16000、若 Piper 輸出 22050、可能需要 ffmpeg 降採樣：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ffmpeg -i piper-out.wav -ar <span class="m">16000</span> piper-out-16k.wav</span></span></code></pre></div><p><code>en_US-lessac-low</code> 本來就是 16k、沒這問題。</p>
<h2 id="何時這篇會過時">何時這篇會過時</h2>
<ul>
<li><code>pip install piper-tts</code> 安裝方式可能演化（轉純 binary release？）、但 ONNX model + CLI invocation 形式應該穩定。</li>
<li>Voice model 格式（ONNX）是 web 通用標準、未來增加 quality / locale、現有 voice 不會被 deprecate。</li>
<li>Hugging Face <code>rhasspy/piper-voices</code> repo 是 maintainer 官方、不會消失。</li>
</ul>
<p>讀的時候若 pip install 失敗、查 <a href="https://github.com/rhasspy/piper">piper GitHub</a> 最新 install 路徑；voice 列表看 piper-voices repo。</p>
<p>跟其他 hands-on 章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、語音 round-trip 對接見 <a href="/blog/llm/01-local-llm-services/hands-on/whisper-setup/" data-link-title="Hands-on：安裝 whisper.cpp 做語音轉文字" data-link-desc="brew install whisper-cpp、下載 GGML model、Metal 加速、ffmpeg 餵 WAV、484ms 完成 7 秒音訊轉錄">Whisper STT</a>、跨服務 lifecycle 與記憶體管理見 <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management</a>。</p>
]]></content:encoded></item><item><title>Hands-on：用 blog content 當 corpus 跑 RAG</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-demo/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-demo/</guid><description>&lt;p>本篇把 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &amp;#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理&lt;/a> 的概念落到一個能跑的最小實作：用本 blog 的 &lt;code>content/llm/&lt;/code> 當 corpus、Ollama 的 &lt;code>nomic-embed-text&lt;/code> 做 embedding、&lt;code>gemma3:1b&lt;/code> 做生成、兩個 Python 檔案完成 ingest + query 整條鏈。實作刻意保持 minimal、為的是把每一段都看清楚、跟原理對應。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：macOS、Ollama 0.23.2、&lt;code>nomic-embed-text&lt;/code>、&lt;code>gemma3:1b&lt;/code>
&lt;strong>Corpus&lt;/strong>：本 blog 的 &lt;code>content/llm/&lt;/code>、71 個 markdown 檔
&lt;strong>結果&lt;/strong>：22 秒索引 463 個 chunk、retrieval 命中率好、generation 受 1B 模型能力限制——剛好示範「retrieval 跟 generation 各自會失敗」的兩段式失敗模式&lt;/p>&lt;/blockquote>
&lt;h2 id="前置設定">前置設定&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>項目&lt;/th>
 &lt;th>來源 / 指令&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Ollama 跑著&lt;/td>
 &lt;td>見 &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &amp;#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama 安裝&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Embedding 模型&lt;/td>
 &lt;td>&lt;code>ollama pull nomic-embed-text&lt;/code>（274 MB、768 維）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Chat 模型&lt;/td>
 &lt;td>&lt;code>ollama pull gemma3:1b&lt;/code>（815 MB）。能力弱但夠驗證流程；上 31B 級才能拿到「真正能用」的 answer 品質&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Python&lt;/td>
 &lt;td>3.11+（標準 lib &lt;code>urllib&lt;/code> / &lt;code>pickle&lt;/code> 即可、不需要外部依賴）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h3 id="驗證-embedding-api-可用">驗證 embedding API 可用&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">curl -s http://localhost:11434/api/embeddings &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> -d &lt;span class="s1">&amp;#39;{&amp;#34;model&amp;#34;:&amp;#34;nomic-embed-text&amp;#34;,&amp;#34;prompt&amp;#34;:&amp;#34;hello world&amp;#34;}&amp;#39;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="p">|&lt;/span> python3 -c &lt;span class="s2">&amp;#34;import json,sys; r=json.load(sys.stdin); print(&amp;#39;dim:&amp;#39;, len(r[&amp;#39;embedding&amp;#39;]))&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>逐項說明：&lt;/p>
&lt;ul>
&lt;li>&lt;code>curl -s&lt;/code>：&lt;code>-s&lt;/code> 是 silent 模式、不顯示下載進度條（不然會混進 stdout、後面 python parse 會炸）。&lt;/li>
&lt;li>&lt;code>http://localhost:11434/api/embeddings&lt;/code>：用 Ollama &lt;strong>原生&lt;/strong> embedding endpoint。也有 &lt;code>/v1/embeddings&lt;/code>（OpenAI 相容）、但原生回應結構較簡（直接 &lt;code>{&amp;quot;embedding&amp;quot;: [...]}&lt;/code>、不是 OpenAI 那種 &lt;code>{&amp;quot;data&amp;quot;: [{&amp;quot;embedding&amp;quot;: [...]}]}&lt;/code> 巢狀）。本 demo 用原生、parse 更直接。&lt;/li>
&lt;li>&lt;code>-d '{&amp;quot;model&amp;quot;:&amp;quot;...&amp;quot;,&amp;quot;prompt&amp;quot;:&amp;quot;...&amp;quot;}'&lt;/code>：JSON payload。&lt;code>model&lt;/code> 是 Ollama tag、&lt;code>prompt&lt;/code> 是要 embed 的文字。&lt;/li>
&lt;li>&lt;code>python3 -c &amp;quot;...&amp;quot;&lt;/code>：stdin 接 curl 輸出、parse JSON、印 embedding 長度。&lt;/li>
&lt;li>&lt;strong>為什麼測 &lt;code>dim: 768&lt;/code>&lt;/strong>：&lt;code>nomic-embed-text&lt;/code> 模型架構決定 embedding 維度是 768。每次 embed 任何文字都會回固定 768 維向量、是 retrieval 的基本資料形狀。看到 &lt;code>dim: 768&lt;/code> 表示：API 通了、模型載入了、輸出形狀對。&lt;/li>
&lt;/ul>
&lt;h2 id="設計取捨">設計取捨&lt;/h2>
&lt;p>實作前先對齊 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &amp;#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理&lt;/a> 提的設計取捨、決定每段怎麼做：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>取捨點&lt;/th>
 &lt;th>本 demo 的選擇&lt;/th>
 &lt;th>Trade-off&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Chunking 粒度&lt;/td>
 &lt;td>段落感知 + 軟 token cap（~400 token）&lt;/td>
 &lt;td>簡單、保留段落邊界；不做語意 chunking&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Embedding 模型&lt;/td>
 &lt;td>&lt;code>nomic-embed-text&lt;/code>（768 維）&lt;/td>
 &lt;td>主流、Ollama 內建、英文為主；中文混合場景仍可運作&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>向量儲存&lt;/td>
 &lt;td>Python pickle 檔&lt;/td>
 &lt;td>463 chunks 用 in-memory 完全夠；何時該換見 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/vector-storage-engineering/" data-link-title="4.22 RAG storage 工程：從 pickle 到 vector database 的選型判讀" data-link-desc="RAG storage backend 選型：規模到哪個階段該從 in-memory 升級到 vector DB、dependency chain 如何收窄選項">4.22 RAG storage 工程&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Retrieval&lt;/td>
 &lt;td>Cosine similarity、top-K&lt;/td>
 &lt;td>無 hybrid、無 re-ranker；夠驗證、品質受 embedding 限制&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Generation&lt;/td>
 &lt;td>&lt;code>gemma3:1b&lt;/code> 純 Ollama OpenAI 相容 API&lt;/td>
 &lt;td>1B 模型能力弱、會編造；用來示範 retrieval 跟 generation 兩段分離&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>這些選擇都對應到 4.0 章節的「會變的部分」清單——可預期半年後 embedding 模型有新選擇、chunking 有更好策略、re-ranker 變主流。但骨架（retrieval + augmentation 兩段式）不變。&lt;/p></description><content:encoded><![CDATA[<p>本篇把 <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理</a> 的概念落到一個能跑的最小實作：用本 blog 的 <code>content/llm/</code> 當 corpus、Ollama 的 <code>nomic-embed-text</code> 做 embedding、<code>gemma3:1b</code> 做生成、兩個 Python 檔案完成 ingest + query 整條鏈。實作刻意保持 minimal、為的是把每一段都看清楚、跟原理對應。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：macOS、Ollama 0.23.2、<code>nomic-embed-text</code>、<code>gemma3:1b</code>
<strong>Corpus</strong>：本 blog 的 <code>content/llm/</code>、71 個 markdown 檔
<strong>結果</strong>：22 秒索引 463 個 chunk、retrieval 命中率好、generation 受 1B 模型能力限制——剛好示範「retrieval 跟 generation 各自會失敗」的兩段式失敗模式</p></blockquote>
<h2 id="前置設定">前置設定</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>來源 / 指令</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ollama 跑著</td>
          <td>見 <a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama 安裝</a></td>
      </tr>
      <tr>
          <td>Embedding 模型</td>
          <td><code>ollama pull nomic-embed-text</code>（274 MB、768 維）</td>
      </tr>
      <tr>
          <td>Chat 模型</td>
          <td><code>ollama pull gemma3:1b</code>（815 MB）。能力弱但夠驗證流程；上 31B 級才能拿到「真正能用」的 answer 品質</td>
      </tr>
      <tr>
          <td>Python</td>
          <td>3.11+（標準 lib <code>urllib</code> / <code>pickle</code> 即可、不需要外部依賴）</td>
      </tr>
  </tbody>
</table>
<h3 id="驗證-embedding-api-可用">驗證 embedding API 可用</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">curl -s http://localhost:11434/api/embeddings <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{&#34;model&#34;:&#34;nomic-embed-text&#34;,&#34;prompt&#34;:&#34;hello world&#34;}&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  <span class="p">|</span> python3 -c <span class="s2">&#34;import json,sys; r=json.load(sys.stdin); print(&#39;dim:&#39;, len(r[&#39;embedding&#39;]))&#34;</span></span></span></code></pre></div><p>逐項說明：</p>
<ul>
<li><code>curl -s</code>：<code>-s</code> 是 silent 模式、不顯示下載進度條（不然會混進 stdout、後面 python parse 會炸）。</li>
<li><code>http://localhost:11434/api/embeddings</code>：用 Ollama <strong>原生</strong> embedding endpoint。也有 <code>/v1/embeddings</code>（OpenAI 相容）、但原生回應結構較簡（直接 <code>{&quot;embedding&quot;: [...]}</code>、不是 OpenAI 那種 <code>{&quot;data&quot;: [{&quot;embedding&quot;: [...]}]}</code> 巢狀）。本 demo 用原生、parse 更直接。</li>
<li><code>-d '{&quot;model&quot;:&quot;...&quot;,&quot;prompt&quot;:&quot;...&quot;}'</code>：JSON payload。<code>model</code> 是 Ollama tag、<code>prompt</code> 是要 embed 的文字。</li>
<li><code>python3 -c &quot;...&quot;</code>：stdin 接 curl 輸出、parse JSON、印 embedding 長度。</li>
<li><strong>為什麼測 <code>dim: 768</code></strong>：<code>nomic-embed-text</code> 模型架構決定 embedding 維度是 768。每次 embed 任何文字都會回固定 768 維向量、是 retrieval 的基本資料形狀。看到 <code>dim: 768</code> 表示：API 通了、模型載入了、輸出形狀對。</li>
</ul>
<h2 id="設計取捨">設計取捨</h2>
<p>實作前先對齊 <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理</a> 提的設計取捨、決定每段怎麼做：</p>
<table>
  <thead>
      <tr>
          <th>取捨點</th>
          <th>本 demo 的選擇</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chunking 粒度</td>
          <td>段落感知 + 軟 token cap（~400 token）</td>
          <td>簡單、保留段落邊界；不做語意 chunking</td>
      </tr>
      <tr>
          <td>Embedding 模型</td>
          <td><code>nomic-embed-text</code>（768 維）</td>
          <td>主流、Ollama 內建、英文為主；中文混合場景仍可運作</td>
      </tr>
      <tr>
          <td>向量儲存</td>
          <td>Python pickle 檔</td>
          <td>463 chunks 用 in-memory 完全夠；何時該換見 <a href="/blog/llm/04-applications/vector-storage-engineering/" data-link-title="4.22 RAG storage 工程：從 pickle 到 vector database 的選型判讀" data-link-desc="RAG storage backend 選型：規模到哪個階段該從 in-memory 升級到 vector DB、dependency chain 如何收窄選項">4.22 RAG storage 工程</a></td>
      </tr>
      <tr>
          <td>Retrieval</td>
          <td>Cosine similarity、top-K</td>
          <td>無 hybrid、無 re-ranker；夠驗證、品質受 embedding 限制</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td><code>gemma3:1b</code> 純 Ollama OpenAI 相容 API</td>
          <td>1B 模型能力弱、會編造；用來示範 retrieval 跟 generation 兩段分離</td>
      </tr>
  </tbody>
</table>
<p>這些選擇都對應到 4.0 章節的「會變的部分」清單——可預期半年後 embedding 模型有新選擇、chunking 有更好策略、re-ranker 變主流。但骨架（retrieval + augmentation 兩段式）不變。</p>
<h2 id="ingest把-corpus-變索引">Ingest：把 corpus 變索引</h2>
<p>完整檔案：<code>scripts/rag-demo/ingest.py</code>（本 repo 下）。三段 function：切 chunk、embed、走訪 + 持久化。</p>
<h3 id="1-slice_markdown段落感知的-chunk-切割">1. <code>slice_markdown</code>：段落感知的 chunk 切割</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">slice_markdown</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">soft_token_cap</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">400</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="n">paragraphs</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;\n\s*\n&#34;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span> <span class="k">if</span> <span class="n">p</span><span class="o">.</span><span class="n">strip</span><span class="p">()]</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">chunks</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="n">buf</span><span class="p">,</span> <span class="n">buf_len</span> <span class="o">=</span> <span class="p">[],</span> <span class="mi">0</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">paragraphs</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">        <span class="n">plen</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span>  <span class="c1"># char-count / 2 ≈ token (CJK + English heuristic)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        <span class="k">if</span> <span class="n">buf</span> <span class="ow">and</span> <span class="n">buf_len</span> <span class="o">+</span> <span class="n">plen</span> <span class="o">&gt;</span> <span class="n">soft_token_cap</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">            <span class="n">chunks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n\n</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">buf</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">            <span class="n">buf</span><span class="p">,</span> <span class="n">buf_len</span> <span class="o">=</span> <span class="p">[],</span> <span class="mi">0</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">        <span class="n">buf</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">        <span class="n">buf_len</span> <span class="o">+=</span> <span class="n">plen</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="k">if</span> <span class="n">buf</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">        <span class="n">chunks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n\n</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">buf</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="k">return</span> <span class="n">chunks</span></span></span></code></pre></div><p><strong>每段做什麼</strong>：</p>
<ol>
<li><strong><code>re.split(r&quot;\n\s*\n&quot;, text)</code></strong>：用「空白行」當分隔符切段落。<code>\n\s*\n</code> 比 <code>\n\n</code> 寬一點、允許中間有 whitespace（空白、tab）。Markdown 段落的標準分隔是空白行、這個 regex 捕捉所有段落邊界。</li>
<li><strong><code>[p.strip() for ... if p.strip()]</code></strong>：每段去除前後空白、過濾掉純空段落。</li>
<li><strong><code>buf, buf_len = [], 0</code></strong>：累積一個正在構建的 chunk。<code>buf</code> 是段落 list、<code>buf_len</code> 是該 chunk 的 token 累計估算。</li>
<li><strong><code>plen = len(p) / 2</code></strong>：估算這段的 token 數。</li>
<li><strong><code>if buf and buf_len + plen &gt; soft_token_cap</code></strong>：「greedy pack」邏輯——如果加上這段就會超過 cap、把目前 buffer flush 成一個 chunk、再開新 buffer 裝這段。</li>
<li><strong><code>if buf: chunks.append(...)</code></strong>：迴圈結束後、最後一個 buffer 還沒 flush、補上。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 paragraph-aware、不是固定 token cap</strong>：<a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理</a> 提的 chunking 設計取捨——固定 token cap 容易切過句子或段落中間、語意被截斷。Paragraph-aware 切在自然邊界、保留段落內語意完整。</li>
<li><strong>為什麼 <code>soft</code> token cap（軟限制）而不是硬切</strong>：硬切會把一個 800-token 段落切成兩半；軟切讓「目前 chunk + 下一段超過 cap」時 flush 目前 chunk、下一段獨立成新 chunk（即使超過 cap 也保留段落完整）。代價：個別 chunk 可能超過 cap、retrieval 拿到的塊較大、但內容完整。</li>
<li><strong>為什麼 <code>len(p) / 2</code> 估 token</strong>：英文約 4 字元 / token、中文約 1.5 字元 / token、混合平均 / 2 在兩種場景都合理。要精確用 tokenizer（如 <code>tiktoken</code>）、但 demo 不需要——這個 heuristic 在 ±20% 內、夠用來做 chunking 決策。</li>
<li><strong>為什麼 <code>\n\n</code>.join(buf)`</strong>：flush 成 chunk 時、段落間保留空白行分隔、讀者看到 chunk 仍是合法 markdown 結構、不是平鋪文字。</li>
</ul>
<h3 id="2-embed呼叫-ollama-embedding-api">2. <code>embed</code>：呼叫 Ollama embedding API</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">embed</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]:</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">payload</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">({</span><span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;nomic-embed-text&#34;</span><span class="p">,</span> <span class="s2">&#34;prompt&#34;</span><span class="p">:</span> <span class="n">text</span><span class="p">})</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="n">req</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">request</span><span class="o">.</span><span class="n">Request</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">        <span class="s2">&#34;http://localhost:11434/api/embeddings&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">        <span class="n">data</span><span class="o">=</span><span class="n">payload</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">        <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;Content-Type&#34;</span><span class="p">:</span> <span class="s2">&#34;application/json&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">    <span class="k">with</span> <span class="n">urllib</span><span class="o">.</span><span class="n">request</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">req</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span> <span class="k">as</span> <span class="n">resp</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl">        <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">read</span><span class="p">())[</span><span class="s2">&#34;embedding&#34;</span><span class="p">]</span></span></span></code></pre></div><p><strong>每行做什麼</strong>：</p>
<ol>
<li><strong><code>payload = json.dumps(...).encode()</code></strong>：把 dict 轉成 JSON 字串、再 encode 成 bytes。HTTP body 必須是 bytes、不能直接傳 str。</li>
<li><strong><code>urllib.request.Request(...)</code></strong>：建立 HTTP request 物件。沒寫 <code>method</code> 預設是 GET、但有 <code>data</code> 參數會自動變 POST。</li>
<li><strong><code>headers={&quot;Content-Type&quot;: &quot;application/json&quot;}</code></strong>：告訴 server payload 是 JSON。少了這個、Ollama 可能 parse 不出 body。</li>
<li><strong><code>urlopen(req, timeout=60)</code></strong>：發送 request、<code>timeout=60</code> 是 socket-level timeout（連線 + 讀取總共最多 60 秒）。</li>
<li><strong><code>json.loads(resp.read())[&quot;embedding&quot;]</code></strong>：讀回應 body、parse JSON、取 <code>embedding</code> 欄位（768 維 list of float）。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼用 stdlib <code>urllib</code> 而不是 <code>requests</code></strong>：完全沒有外部 dependency、<code>urllib</code> 是 Python stdlib 內建。<code>requests</code> 較友善但要 <code>pip install</code>、本 demo 想 minimal。</li>
<li><strong>為什麼 timeout=60</strong>：embed 一段文字通常 &lt; 200ms、60 秒夠 buffer 意外（首次 model 載入記憶體可能 5-10 秒）。設無限會在 Ollama 掛掉時整個 script hang。</li>
<li><strong>為什麼 <code>/api/embeddings</code>、不是 <code>/v1/embeddings</code></strong>：兩者都可。原生 endpoint 回應結構平、parse 直接（<code>r[&quot;embedding&quot;]</code>）；OpenAI 相容回應較巢狀（<code>r[&quot;data&quot;][0][&quot;embedding&quot;]</code>）。對 demo、寫法簡單較重要。</li>
</ul>
<h3 id="3-走訪--持久化">3. 走訪 + 持久化</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="n">md_files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">content_root</span><span class="o">.</span><span class="n">rglob</span><span class="p">(</span><span class="s2">&#34;*.md&#34;</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">records</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="k">for</span> <span class="n">md</span> <span class="ow">in</span> <span class="n">md_files</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">md</span><span class="o">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&#34;utf-8&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s2">&#34;^---\n.*?\n---\n&#34;</span><span class="p">,</span> <span class="s2">&#34;&#34;</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">DOTALL</span><span class="p">)</span>  <span class="c1"># 去掉 frontmatter</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="n">chunks</span> <span class="o">=</span> <span class="n">slice_markdown</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chunks</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        <span class="n">vec</span> <span class="o">=</span> <span class="n">embed</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        <span class="n">records</span><span class="o">.</span><span class="n">append</span><span class="p">({</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">            <span class="s2">&#34;source&#34;</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">md</span><span class="o">.</span><span class="n">relative_to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">content_root</span><span class="o">.</span><span class="n">parent</span><span class="p">)),</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">            <span class="s2">&#34;chunk_index&#34;</span><span class="p">:</span> <span class="n">j</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">            <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="n">chunk</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">            <span class="s2">&#34;embedding&#34;</span><span class="p">:</span> <span class="n">vec</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">        <span class="p">})</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&#34;scripts/rag-demo/index.pkl&#34;</span><span class="p">,</span> <span class="s2">&#34;wb&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">records</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span></span></span></code></pre></div><p><strong>每段做什麼</strong>：</p>
<ol>
<li><strong><code>args.content_root.rglob(&quot;*.md&quot;)</code></strong>：recursive glob、回 <code>Path</code> iterator、找出 <code>content_root</code> 下所有 <code>.md</code> 檔（含子目錄）。</li>
<li><strong><code>sorted(...)</code></strong>：排序、讓每次 ingest 順序穩定（git diff 比較友善、retrieval 結果可重現）。</li>
<li><strong><code>text.read_text(encoding=&quot;utf-8&quot;)</code></strong>：讀檔、明確指定 UTF-8（中文 markdown 必要、否則 macOS / Linux 預設可能不一致）。</li>
<li><strong><code>re.sub(r&quot;^---\n.*?\n---\n&quot;, &quot;&quot;, text, count=1, flags=re.DOTALL)</code></strong>：去掉 Hugo frontmatter。
<ul>
<li><code>^---\n</code>：開頭 <code>---\n</code>。</li>
<li><code>.*?</code>：non-greedy match、配到下一個 <code>---</code> 就停。</li>
<li><code>\n---\n</code>：closing fence。</li>
<li><code>count=1</code>：只 strip 第一個（檔案中可能有其他 <code>---</code> 是水平分隔線、不要誤殺）。</li>
<li><code>flags=re.DOTALL</code>：讓 <code>.</code> 也匹配換行符（預設 <code>.</code> 不匹配 <code>\n</code>、規 frontmatter 跨行就吃不到）。</li>
</ul>
</li>
<li><strong><code>records.append({...})</code></strong>：每個 chunk 一個 record、含 source path、chunk index、原文、embedding。</li>
<li><strong><code>md.relative_to(args.content_root.parent)</code></strong>：把絕對 path 變成 <code>llm/00-foundations/xxx.md</code> 形式、retrieval 顯示時短、跨機器可移植。</li>
<li><strong><code>pickle.dump(records, f)</code></strong>：把整個 records list 序列化到 binary 檔。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼要 strip frontmatter</strong>：Frontmatter 是 <code>title</code>、<code>date</code>、<code>tags</code> 等 metadata、不是文章正文。embed 進去會稀釋向量語意（讓「date」「2026-05-11」等 keyword 影響相似度計算）。Strip 後 embedding 只 capture 內容語意。</li>
<li><strong>為什麼 records 是 list of dict 而不是 numpy array</strong>：兩個原因。(1) 每個 record 含 source / chunk_index / text / embedding 四種異質欄位、numpy 處理不直接。(2) 463 chunks 規模、純 Python list 跑 cosine 也只是毫秒級、不需要 vectorize。十萬 chunk 以上才考慮 numpy array + batched dot product。</li>
<li><strong>為什麼 pickle 而不是 JSON</strong>：embedding 是 768-float list、JSON 序列化會把每個 float 變成 ASCII 字串（每個 ~20 bytes）、檔案大很多、parse 也慢。Pickle 是 binary format、保留原本資料結構、檔案小、loader 快。代價：pickle 有 Python 版本相依、跨語言不能讀——但本 demo 索引只給自家 query.py / mcp_server.py 用、可接受。</li>
<li><strong>為什麼存 <code>text</code> 跟 <code>embedding</code>、不只 embedding</strong>：retrieval 要回 chunk 原文給 LLM 看、不能只有 source path（不然每次 query 還要再讀檔）。這裡的 corpus 檔案就是 <a href="/blog/llm/knowledge-cards/retrieval-source/" data-link-title="Retrieval Source" data-link-desc="RAG 從哪個 corpus、index、tool 或外部系統取回內容，決定來源可信度、freshness、權限與引用責任">retrieval source</a>；Pickle 多存原文成本低（~100 byte / chunk）、查詢時方便很多。</li>
</ul>
<h3 id="跑-ingest">跑 ingest</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects/blog
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/rag-demo/ingest.py</span></span></code></pre></div><ul>
<li><code>cd ~/Projects/blog</code>：切到 repo 根、讓相對路徑 <code>content/llm</code> 對得到 corpus、<code>scripts/rag-demo/index.pkl</code> 對得到 output 位置。</li>
<li><code>python3 scripts/rag-demo/ingest.py</code>：跑 ingest script、預設讀 <code>content/llm/</code>、寫 <code>scripts/rag-demo/index.pkl</code>。</li>
</ul>
<p>實測輸出：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Found 71 markdown files under content/llm
</span></span><span class="line"><span class="ln">2</span><span class="cl">  [10/71] 86 chunks in 4.5s
</span></span><span class="line"><span class="ln">3</span><span class="cl">  [20/71] 181 chunks in 8.6s
</span></span><span class="line"><span class="ln">4</span><span class="cl">  ...
</span></span><span class="line"><span class="ln">5</span><span class="cl">  [70/71] 461 chunks in 22.2s
</span></span><span class="line"><span class="ln">6</span><span class="cl">Wrote 463 records to scripts/rag-demo/index.pkl (22.3s)</span></span></code></pre></div><p>463 chunks、22 秒、平均 ~21 chunks/sec。瓶頸是 sequential API call、用 async / batch 能快 5-10 倍、但這個量級不值得。</p>
<h2 id="queryretrieval--augmentation--generation">Query：retrieval + augmentation + generation</h2>
<p>完整檔案：<code>scripts/rag-demo/query.py</code>。三段。</p>
<h3 id="1-cosine-similarity--top-k-retrieval">1. Cosine similarity + top-K retrieval</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">cosine</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="n">dot</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">x</span> <span class="o">*</span> <span class="n">y</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">na</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">x</span> <span class="o">*</span> <span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">a</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="n">nb</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">y</span> <span class="o">*</span> <span class="n">y</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">b</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="k">return</span> <span class="n">dot</span> <span class="o">/</span> <span class="p">(</span><span class="n">na</span> <span class="o">*</span> <span class="n">nb</span><span class="p">)</span> <span class="k">if</span> <span class="n">na</span> <span class="ow">and</span> <span class="n">nb</span> <span class="k">else</span> <span class="mf">0.0</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="k">def</span> <span class="nf">retrieve</span><span class="p">(</span><span class="n">records</span><span class="p">,</span> <span class="n">query_vec</span><span class="p">,</span> <span class="n">top_k</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="n">scored</span> <span class="o">=</span> <span class="p">[(</span><span class="n">cosine</span><span class="p">(</span><span class="n">query_vec</span><span class="p">,</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;embedding&#34;</span><span class="p">]),</span> <span class="n">r</span><span class="p">)</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">records</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="n">scored</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="k">return</span> <span class="n">scored</span><span class="p">[:</span><span class="n">top_k</span><span class="p">]</span></span></span></code></pre></div><p><strong>每行做什麼</strong>：</p>
<ol>
<li><strong><code>dot = sum(x * y for x, y in zip(a, b))</code></strong>：兩個向量的內積（dot product）。<code>zip(a, b)</code> 把兩個 list 對位配對、generator expression 算每對相乘、sum 加起來。</li>
<li><strong><code>na = math.sqrt(sum(x * x for x in a))</code></strong>：a 的 L2 norm（歐氏範數）—— <code>sqrt(x1² + x2² + ... + xn²)</code>。</li>
<li><strong><code>nb = math.sqrt(sum(y * y for y in b))</code></strong>：b 的 L2 norm。</li>
<li><strong><code>return dot / (na * nb) if na and nb else 0.0</code></strong>：cosine = dot / (||a|| × ||b||)。三元運算子防 zero division——若任一向量是零向量、na 或 nb 為 0、回 0.0 而不是 crash。</li>
<li><strong><code>scored = [(cosine(query_vec, r[&quot;embedding&quot;]), r) for r in records]</code></strong>：對每個 record 算相似度、組成 (score, record) tuple 的 list。</li>
<li><strong><code>scored.sort(key=lambda x: x[0], reverse=True)</code></strong>：按 score 從大到小排序。<code>key=lambda x: x[0]</code> 取 tuple 第一個元素（score）當排序 key。</li>
<li><strong><code>return scored[:top_k]</code></strong>：取前 K 個。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 cosine 而不是純 dot product</strong>：純 dot product 受向量長度影響——長向量自動拿高分、跟「相似度」無關。Cosine 把向量正規化到單位長度、純看方向、是「語意相似」的標準衡量。語意相似 embedding 應該方向相近、長度差異不重要。</li>
<li><strong>為什麼用 <code>math.sqrt</code> 而不是 <code>**0.5</code></strong>：兩者數學等價、但 <code>math.sqrt</code> 用 C-level 實作、CPython 中比 Python 級 <code>**0.5</code> 快幾倍。對 463 chunks 影響不大、但 production scale 會放大差異——習慣寫 <code>math.sqrt</code> 的好。</li>
<li><strong>為什麼 <code>if na and nb else 0.0</code></strong>：防禦性程式設計。理論上 embedding 不會是零向量（模型架構保證有非零權重）、但邊界情況（空輸入、API 出錯回 placeholder）可能出現、避免 ZeroDivisionError 整個 query 失敗。回 0.0 表示「無法判斷相似度」、retrieval 排序時自然排到最後。</li>
<li><strong>為什麼 sort 全部、不用 heap</strong>：463 records、Python sort 是 O(n log n)、毫秒級。<code>heapq.nlargest(top_k, ...)</code> 是 O(n log k)、在 k=4、n=463 上實測幾乎沒差。十萬 record 以上才看到顯著差別。</li>
<li><strong>為什麼用 list of tuple、不用 numpy</strong>：跟 ingest 同樣的理由——小規模不需要 vectorize、純 Python 清楚。</li>
</ul>
<h3 id="2-建-augmented-prompt">2. 建 augmented prompt</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="n">context_blocks</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">for</span> <span class="n">score</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">retrieved</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">context_blocks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">        <span class="sa">f</span><span class="s2">&#34;[來源：</span><span class="si">{</span><span class="n">r</span><span class="p">[</span><span class="s1">&#39;source&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">#chunk</span><span class="si">{</span><span class="n">r</span><span class="p">[</span><span class="s1">&#39;chunk_index&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> 相似度：</span><span class="si">{</span><span class="n">score</span><span class="si">:</span><span class="s2">.3f</span><span class="si">}</span><span class="s2">]</span><span class="se">\n</span><span class="si">{</span><span class="n">r</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">system</span> <span class="o">=</span> <span class="p">(</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="s2">&#34;你是一個技術文件問答助手。&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="s2">&#34;依下方 context 內容回答問題、不要編造 context 外的事實。&#34;</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="s2">&#34;若 context 不足以回答、明確說『資料不足』。&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="s2">&#34;回答末尾列出引用的來源 path。&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">user</span> <span class="o">=</span> <span class="s2">&#34;## Context</span><span class="se">\n\n</span><span class="s2">&#34;</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n\n</span><span class="s2">---</span><span class="se">\n\n</span><span class="s2">&#34;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">context_blocks</span><span class="p">)</span> <span class="o">+</span> <span class="sa">f</span><span class="s2">&#34;</span><span class="se">\n\n</span><span class="s2">## Question</span><span class="se">\n\n</span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">system</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">user</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="p">]</span></span></span></code></pre></div><p><strong>每行做什麼</strong>：</p>
<ol>
<li><strong><code>f&quot;[來源：{...} 相似度：{score:.3f}]\n{r['text']}&quot;</code></strong>：每個 retrieved chunk 加 header 標明出處跟相似度、再接原文。<code>:.3f</code> 是 score 格式化到三位小數。</li>
<li><strong><code>&quot;\n\n---\n\n&quot;.join(context_blocks)</code></strong>：用 <code>---</code> 水平分隔線分隔各 chunk、視覺上清楚。</li>
<li><strong><code>{&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system}</code></strong>：system message 給 LLM 設定角色 + 約束。</li>
<li><strong><code>{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: user}</code></strong>：user message 含 context 跟 question、是 LLM 實際讀的內容。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 <a href="/blog/llm/knowledge-cards/system-prompt/" data-link-title="System Prompt" data-link-desc="LLM application 中由開發者預設、不直接顯示給使用者的指令層、定義模型的角色、行為規範、輸出格式">system prompt</a> 約束四件事</strong>（角色、忠於 context、資料不足時明說、引用來源）：
<ul>
<li><strong>角色</strong>：「技術文件問答助手」框定模型行為、減少 off-topic 回應。</li>
<li><strong>忠於 context</strong>：對抗 RAG 最常見的失敗模式——LLM 看到 context 但用自己訓練的 knowledge 補完、結果跟 corpus 不一致。明確要求 follow context 能降低（雖然不能完全消除、見實測 1）。</li>
<li><strong>資料不足時明說</strong>：避免 LLM「硬要回答」造成 hallucination。對 weak model 這條 follow 度差、但對 large model 有效。</li>
<li><strong>引用來源</strong>：traceability。讀者能回查 corpus、驗證模型答案。</li>
</ul>
</li>
<li><strong>為什麼 <code>## Context</code> / <code>## Question</code> 結構</strong>：用 markdown heading 結構幫助 LLM 區分「我要讀什麼」「我要回答什麼」。比平鋪文字穩定（即使對小模型）。</li>
<li><strong>為什麼把 retrieved chunks 全塞 user message、不分開</strong>：MCP / function calling 的更現代做法是把 retrieved 結果做成 tool response、模型主動 call retrieval tool。本 demo 不引入 tool use、直接塞 prompt 較單純——能說明 RAG 核心（augmentation）不必牽扯 tool use。</li>
</ul>
<h3 id="3-呼叫-chat-completions">3. 呼叫 chat completions</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">chat</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">payload</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">({</span><span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="n">model</span><span class="p">,</span> <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="n">messages</span><span class="p">,</span> <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">})</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="n">req</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">request</span><span class="o">.</span><span class="n">Request</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">        <span class="s2">&#34;http://localhost:11434/v1/chat/completions&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">        <span class="n">data</span><span class="o">=</span><span class="n">payload</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">        <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;Content-Type&#34;</span><span class="p">:</span> <span class="s2">&#34;application/json&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">    <span class="k">with</span> <span class="n">urllib</span><span class="o">.</span><span class="n">request</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">req</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">180</span><span class="p">)</span> <span class="k">as</span> <span class="n">resp</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl">        <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">read</span><span class="p">())[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span></span></span></code></pre></div><p><strong>每行做什麼</strong>：</p>
<ol>
<li><strong><code>json.dumps({&quot;model&quot;: ..., &quot;messages&quot;: ..., &quot;stream&quot;: False}).encode()</code></strong>：構造 OpenAI 相容 chat completions request body。<code>stream: False</code> 讓 server 等生成完再一次回、不要 SSE 串流。</li>
<li><strong><code>/v1/chat/completions</code></strong>：OpenAI 相容 endpoint、跟雲端 OpenAI 完全同樣 schema。</li>
<li><strong><code>timeout=180</code></strong>：3 分鐘、給長 context + 慢模型空間。</li>
<li><strong><code>[&quot;choices&quot;][0][&quot;message&quot;][&quot;content&quot;]</code></strong>：parse OpenAI 標準 response 結構、取第一個 choice 的 content。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 <code>stream: False</code></strong>：demo 要把完整 answer 印出、不需要 incremental display。<code>stream: True</code> 要寫 SSE parser、複雜。Production 互動式 UI 才需要 streaming。</li>
<li><strong>為什麼 timeout=180、不是 60</strong>：1B 模型 + 4 個 retrieved chunks 的 context、prefill 可能要 5-30 秒、生成 100-500 token 又要 5-20 秒、保守設 3 分鐘。embed function 用 60 是因為 embedding 是純 forward pass、單一 token 量級操作、不需要這麼長。</li>
<li><strong>為什麼 <code>/v1/...</code> 而不是 <code>/api/...</code></strong>：chat completions 走 OpenAI 相容 endpoint、生態都用這個格式（Continue.dev、Cursor、各家 SDK）。embedding 用 <code>/api/...</code> 是因為原生 schema 簡單；chat 用 <code>/v1/...</code> 是因為 message-based 結構是 OpenAI 標準、跨工具互通。</li>
</ul>
<h2 id="實測結果retrieval-對generation-弱">實測結果：retrieval 對、generation 弱</h2>
<h3 id="測試-1什麼是-mtp為什麼對寫-code-場景特別有效">測試 1：「什麼是 MTP？為什麼對寫 code 場景特別有效？」</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python3 scripts/rag-demo/query.py --show-retrieved <span class="s2">&#34;什麼是 MTP？為什麼對寫 code 場景特別有效？&#34;</span></span></span></code></pre></div><p><code>--show-retrieved</code> 是個 flag、開啟後在 stderr 印 retrieved chunks 跟 score、答案還是進 stdout。是 debug 跟教學用、不會影響 LLM 看到的 prompt。</p>
<p>Retrieval：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">0.870  llm/knowledge-cards/transformer.md#chunk2
</span></span><span class="line"><span class="ln">2</span><span class="cl">0.825  llm/03-theoretical-foundations/sampling-and-decoding.md#chunk8
</span></span><span class="line"><span class="ln">3</span><span class="cl">0.782  llm/knowledge-cards/ttft.md#chunk1
</span></span><span class="line"><span class="ln">4</span><span class="cl">0.771  llm/knowledge-cards/mtp.md#chunk2</span></span></code></pre></div><p>四個 chunk 都跟問題相關、相似度合理。MTP 卡確實被命中（雖然不是 top-1、是因為 transformer.md 該段提到 MTP）。</p>
<p>Generation（1B 模型）：</p>
<blockquote>
<p>MTP 僅指使用 Ollama 進行 Coding 模型訓練與部署、它是一種系統性的方式&hellip;
來源：<a href="https://llm.dev/mti/">llm.dev</a></p></blockquote>
<p><strong>錯</strong>：1B 模型編造了「MTP 僅指使用 Ollama」這個事實（不對、MTP 是 Google 為 Gemma 釋出的、跟 Ollama 沒直接關係）、來源 URL 也是 hallucination。</p>
<h3 id="測試-2mcp-跟-function-calling-有什麼差別">測試 2：「MCP 跟 function calling 有什麼差別？」</h3>
<p>Retrieval：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">0.721  llm/04-applications/application-protocols.md#chunk2
</span></span><span class="line"><span class="ln">2</span><span class="cl">0.704  llm/04-applications/application-protocols.md#chunk1
</span></span><span class="line"><span class="ln">3</span><span class="cl">0.702  llm/04-applications/application-protocols.md#chunk0
</span></span><span class="line"><span class="ln">4</span><span class="cl">0.693  llm/knowledge-cards/function-calling.md#chunk1</span></span></code></pre></div><p>完美命中——4.3 應用層協議章節三個 chunk + function-calling 卡。</p>
<p>Generation：模型把幾段重複拼接、framing 跟原文有出入、但比測試 1 好（因為 context 涵蓋直接答案）。</p>
<h2 id="觀察跟原理對應">觀察跟原理對應</h2>
<p>這個 demo 剛好示範 <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理</a> 提的兩段式失敗模式：</p>
<table>
  <thead>
      <tr>
          <th>階段</th>
          <th>表現</th>
          <th>原因</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retrieval</td>
          <td>命中率好、找到對的 chunks</td>
          <td><code>nomic-embed-text</code> 對技術文件覆蓋好、cosine 對短 query 也 OK</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>內容有時編造、不忠於 context、來源亂寫</td>
          <td><code>gemma3:1b</code> 模型容量不足以可靠 follow system prompt</td>
      </tr>
  </tbody>
</table>
<p>換 31B+ 模型 generation 會改善很多——這也是 4.0 章節提到「retrieval 跟下游 LLM 訓練分佈不一致」會放大失敗的具體例子。寫 RAG 系統時、generation 失敗不一定是「retrieval 沒給對 context」、可能是「模型不夠強」。</p>
<h2 id="何時這份-demo-會過時">何時這份 demo 會過時</h2>
<ul>
<li><strong>Ollama API 形狀</strong>：短期內不會變（生態都依賴）。</li>
<li><strong><code>nomic-embed-text</code> / <code>gemma3:1b</code> 具體 tag</strong>：預期會被新模型取代、但 retrieval + augmentation 結構不變。</li>
<li><strong>Chunking heuristic</strong>：簡單 char-count / 2 很粗、半年後若有便宜的 token counter 直接接會更準。</li>
<li><strong>Pickle 儲存</strong>：production 場景建議換 vector DB、本 demo 是教學用。</li>
</ul>
<p>實作換代時、保留 ingest / retrieve / augment / generate 四段、各段內部換工具即可——這四段是 RAG 的骨架、跨工具世代不變。</p>
<h2 id="跑這個-demo-的指令總結">跑這個 demo 的指令總結</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 一次性建索引（每次 corpus 變動才需要重建）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">cd</span> ~/Projects/blog
</span></span><span class="line"><span class="ln">3</span><span class="cl">python3 scripts/rag-demo/ingest.py</span></span></code></pre></div><ul>
<li><code>cd</code>：切到 repo 根、relative path 對得到。</li>
<li><code>python3 ingest.py</code>：跑索引、預設讀 <code>content/llm/</code>、寫 <code>scripts/rag-demo/index.pkl</code>。每次 corpus 變動才需要重跑、不變的話 index 就一直用。</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 查詢（任意次）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/rag-demo/query.py --show-retrieved <span class="s2">&#34;你的問題&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">python3 scripts/rag-demo/query.py --top-k <span class="m">5</span> --model gemma3:1b <span class="s2">&#34;問題&#34;</span></span></span></code></pre></div><ul>
<li><code>--show-retrieved</code>：教學 / debug 用、列 retrieved chunks 跟 score 到 stderr。</li>
<li><code>--top-k 5</code>：取 top 5 instead of 預設 4。chunks 越多 context 越長、TTFT 越久、但訊息越完整。</li>
<li><code>--model gemma3:1b</code>：指定 chat model。換 <code>gemma3:4b</code>、<code>gemma4:31b-coding-mtp-bf16</code> 等 generation 品質會大幅改善。</li>
</ul>
<p>完整 source 在 <code>scripts/rag-demo/</code> 下、200 行 Python、無外部 dependency。</p>
<p>跟其他 hands-on 章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、把 retrieval 包成 MCP server 暴露給 LLM application 見 <a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo</a>、RAG + MCP 同跑的記憶體 / 程序預算見 <a href="/blog/llm/01-local-llm-services/hands-on/rag-mcp-resources/" data-link-title="Hands-on：RAG / MCP 的資源 footprint" data-link-desc="RAG ingest / query / MCP server 三階段的 RAM / 磁碟 / process 實測、多模型並存的 RAM 衝突、本地 LLM 跑 RAG 跟單純 chat 的差異">RAG + MCP resource footprint</a>、術語見 <a href="/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG</a> 跟 <a href="/blog/llm/knowledge-cards/embedding-model/" data-link-title="Embedding Model" data-link-desc="把文字轉成向量的模型：用於 codebase 索引與語意搜尋">embedding model</a>。</p>
]]></content:encoded></item><item><title>Hands-on：用 blog content 寫一個最小 MCP server</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/mcp-demo/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/mcp-demo/</guid><description>&lt;p>本篇把 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 應用層協議&lt;/a> 的 MCP 概念落到一個可跑的最小實作：用 stdio JSON-RPC 暴露兩個 tool（&lt;code>search_blog&lt;/code>、&lt;code>read_chunk&lt;/code>）、客戶端 spawn server 跟它對話、驗證 protocol initialize / tools/list / tools/call / error 四個基本流程。實作刻意只用 Python stdlib、不依賴 MCP SDK、為的是把 wire protocol 看清楚、跟 4.3 的「server 協議層」framing 對應。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：Python 3.11+、stdlib only（json / subprocess / urllib）
&lt;strong>依賴&lt;/strong>：RAG demo 的 &lt;code>index.pkl&lt;/code>（&lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &amp;#43; cosine retrieval &amp;#43; Ollama chat、validating 4.0 RAG 原理">見 RAG demo&lt;/a>）
&lt;strong>協議版本&lt;/strong>：MCP &lt;code>2025-03-26&lt;/code>&lt;/p>&lt;/blockquote>
&lt;h2 id="mcp-是什麼層的東西">MCP 是什麼層的東西&lt;/h2>
&lt;p>回顧 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 應用層協議&lt;/a> 的層級劃分：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function calling&lt;/strong>：模型訓練建立的能力（模型層）。&lt;/li>
&lt;li>&lt;strong>Structured output&lt;/strong>：sampling 階段約束（推論層）。&lt;/li>
&lt;li>&lt;strong>MCP&lt;/strong>：LLM application ↔ 外部 tool server 的協議（架構層）。&lt;/li>
&lt;/ul>
&lt;p>MCP 不管「模型怎麼呼叫工具」、它管「工具怎麼被暴露給 application」。本 demo 寫的是 server 端：server 不知道是哪個 LLM 在用它、不假設客戶端用 function calling 還是 structured output、它只專注「把 tool 透過 JSON-RPC 暴露出去」。&lt;/p>
&lt;p>這跟 &lt;a href="https://tarrragon.github.io/blog/llm/00-foundations/openai-compatible-api/" data-link-title="0.3 OpenAI 相容 API" data-link-desc="為什麼幾乎所有本地 LLM 工具不用改就能切到本地：背後是同一套 API 形狀">OpenAI 相容 API&lt;/a> 的設計哲學一致：定義最小可用標準、讓生態繞著標準長。&lt;/p>
&lt;h2 id="前置設定">前置設定&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>項目&lt;/th>
 &lt;th>來源&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Ollama + &lt;code>nomic-embed-text&lt;/code>&lt;/td>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &amp;#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama 安裝&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>RAG index（&lt;code>index.pkl&lt;/code>）&lt;/td>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &amp;#43; cosine retrieval &amp;#43; Ollama chat、validating 4.0 RAG 原理">RAG demo&lt;/a> 跑過 &lt;code>ingest.py&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Python&lt;/td>
 &lt;td>3.11+&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>不需要安裝 MCP SDK——本 demo 手寫 JSON-RPC 處理、為了 inspection 透明度。Production server 建議改用 &lt;a href="https://github.com/modelcontextprotocol">官方 SDK&lt;/a>（Python / TypeScript 都有）、處理 framing、capability negotiation、transport edge cases。&lt;/p></description><content:encoded><![CDATA[<p>本篇把 <a href="/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 應用層協議</a> 的 MCP 概念落到一個可跑的最小實作：用 stdio JSON-RPC 暴露兩個 tool（<code>search_blog</code>、<code>read_chunk</code>）、客戶端 spawn server 跟它對話、驗證 protocol initialize / tools/list / tools/call / error 四個基本流程。實作刻意只用 Python stdlib、不依賴 MCP SDK、為的是把 wire protocol 看清楚、跟 4.3 的「server 協議層」framing 對應。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：Python 3.11+、stdlib only（json / subprocess / urllib）
<strong>依賴</strong>：RAG demo 的 <code>index.pkl</code>（<a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">見 RAG demo</a>）
<strong>協議版本</strong>：MCP <code>2025-03-26</code></p></blockquote>
<h2 id="mcp-是什麼層的東西">MCP 是什麼層的東西</h2>
<p>回顧 <a href="/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 應用層協議</a> 的層級劃分：</p>
<ul>
<li><strong>Function calling</strong>：模型訓練建立的能力（模型層）。</li>
<li><strong>Structured output</strong>：sampling 階段約束（推論層）。</li>
<li><strong>MCP</strong>：LLM application ↔ 外部 tool server 的協議（架構層）。</li>
</ul>
<p>MCP 不管「模型怎麼呼叫工具」、它管「工具怎麼被暴露給 application」。本 demo 寫的是 server 端：server 不知道是哪個 LLM 在用它、不假設客戶端用 function calling 還是 structured output、它只專注「把 tool 透過 JSON-RPC 暴露出去」。</p>
<p>這跟 <a href="/blog/llm/00-foundations/openai-compatible-api/" data-link-title="0.3 OpenAI 相容 API" data-link-desc="為什麼幾乎所有本地 LLM 工具不用改就能切到本地：背後是同一套 API 形狀">OpenAI 相容 API</a> 的設計哲學一致：定義最小可用標準、讓生態繞著標準長。</p>
<h2 id="前置設定">前置設定</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>來源</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ollama + <code>nomic-embed-text</code></td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama 安裝</a></td>
      </tr>
      <tr>
          <td>RAG index（<code>index.pkl</code>）</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a> 跑過 <code>ingest.py</code></td>
      </tr>
      <tr>
          <td>Python</td>
          <td>3.11+</td>
      </tr>
  </tbody>
</table>
<p>不需要安裝 MCP SDK——本 demo 手寫 JSON-RPC 處理、為了 inspection 透明度。Production server 建議改用 <a href="https://github.com/modelcontextprotocol">官方 SDK</a>（Python / TypeScript 都有）、處理 framing、capability negotiation、transport edge cases。</p>
<h2 id="mcp-協議的最小子集">MCP 協議的最小子集</h2>
<p><a href="/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP server</a> 要 handle 的核心 method：</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>角色</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>initialize</code></td>
          <td>Client 跟 server 握手、交換 protocol version + capability</td>
      </tr>
      <tr>
          <td><code>notifications/initialized</code></td>
          <td>Client 通知 handshake 完成（notification、無 response）</td>
      </tr>
      <tr>
          <td><code>tools/list</code></td>
          <td>Client 問 server 有哪些 tool</td>
      </tr>
      <tr>
          <td><code>tools/call</code></td>
          <td>Client 呼叫某 tool、傳 arguments</td>
      </tr>
  </tbody>
</table>
<p>四個 method 之外、還可以暴露 resources / prompts / sampling、本 demo 只做 tools。</p>
<h2 id="server-實作">Server 實作</h2>
<p>完整檔案：<code>scripts/mcp-demo/blog_mcp_server.py</code>、約 150 行。</p>
<h3 id="主迴圈讀-stdin分派-method寫-stdout">主迴圈：讀 stdin、分派 method、寫 stdout</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;[blog-mcp-demo] starting, index=</span><span class="si">{</span><span class="n">INDEX_PATH</span><span class="si">}</span><span class="s2">, tools=</span><span class="si">{</span><span class="nb">list</span><span class="p">(</span><span class="n">TOOLS</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">        <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">        <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">            <span class="k">continue</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">            <span class="n">msg</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        <span class="k">except</span> <span class="n">json</span><span class="o">.</span><span class="n">JSONDecodeError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">            <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;  parse error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">            <span class="k">continue</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="n">method</span> <span class="o">=</span> <span class="n">msg</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;method&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">        <span class="n">rid</span> <span class="o">=</span> <span class="n">msg</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;id&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">        <span class="n">params</span> <span class="o">=</span> <span class="n">msg</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;params&#34;</span><span class="p">,</span> <span class="p">{})</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">        <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;  → </span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s2"> (id=</span><span class="si">{</span><span class="n">rid</span><span class="si">}</span><span class="s2">)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">        <span class="k">if</span> <span class="n">method</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">HANDLERS</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">            <span class="n">respond</span><span class="p">(</span><span class="n">rid</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;code&#34;</span><span class="p">:</span> <span class="o">-</span><span class="mi">32601</span><span class="p">,</span> <span class="s2">&#34;message&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;Method not found: </span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">})</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">            <span class="k">continue</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">        <span class="n">handler</span> <span class="o">=</span> <span class="n">HANDLERS</span><span class="p">[</span><span class="n">method</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">        <span class="k">if</span> <span class="n">handler</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">            <span class="k">continue</span>  <span class="c1"># notification, no response expected</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">            <span class="n">result</span> <span class="o">=</span> <span class="n">handler</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">            <span class="n">respond</span><span class="p">(</span><span class="n">rid</span><span class="p">,</span> <span class="n">result</span><span class="o">=</span><span class="n">result</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">            <span class="n">log</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;  ✗ handler error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">            <span class="n">respond</span><span class="p">(</span><span class="n">rid</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;code&#34;</span><span class="p">:</span> <span class="o">-</span><span class="mi">32000</span><span class="p">,</span> <span class="s2">&#34;message&#34;</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)})</span></span></span></code></pre></div><p><strong>每段做什麼</strong>：</p>
<ol>
<li><strong><code>log(...)</code> 開機訊息</strong>：印到 stderr（不是 stdout）、讓人類能看到 server 啟動了、什麼 tools 可用。stdout 完全保留給 JSON-RPC 用。</li>
<li><strong><code>for line in sys.stdin</code></strong>：MCP 的 stdio transport 是 line-delimited JSON—— 每個 message 一行、<code>\n</code> 結束。Python 的 file iteration 自動按行切。</li>
<li><strong><code>line.strip()</code> + <code>if not line</code></strong>：空行 skip（不是 protocol error、只是 idle）。</li>
<li><strong><code>json.loads(line)</code></strong> with <code>try / except</code>：parse 失敗（malformed input）不 crash、log error 繼續下一行。Protocol 訊息該是合法 JSON、parse error 表示 client 出錯。</li>
<li><strong><code>msg.get(&quot;method&quot;)</code> / <code>msg.get(&quot;id&quot;)</code> / <code>msg.get(&quot;params&quot;, {})</code></strong>：JSON-RPC 2.0 標準三個欄位。<code>get</code> 而不是 <code>[]</code>、避免 KeyError；params 預設空 dict、後面 handler 可以安全 <code>.get(&quot;xxx&quot;)</code>。</li>
<li><strong><code>if method not in HANDLERS: respond(rid, error={&quot;code&quot;: -32601, ...})</code></strong>：未知 method 回標準 JSON-RPC error <code>-32601</code>（Method not found）。Client 知道這個 method 不能用、但 server 不死。</li>
<li><strong><code>if handler is None: continue</code></strong>：notification（如 <code>notifications/initialized</code>）對應的 handler 是 <code>None</code>、不該回 response。</li>
<li><strong><code>try: result = handler(params); respond(rid, result=result)</code></strong>：呼叫 handler、把結果回給 client。</li>
<li><strong><code>except Exception as e: ... respond(rid, error={&quot;code&quot;: -32000, ...})</code></strong>：handler 內部錯誤回 <code>-32000</code>（generic server error）。確保 server 任何時候都不 crash、即使工具 bug 也讓 client 拿到 error response。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼用 line-delimited JSON、不是 length-prefixed</strong>：MCP spec 規定 stdio transport 是 newline-delimited。length-prefixed 是 LSP 的做法、解析複雜（要先讀 Content-Length header 再讀 N bytes）；newline-delimited 用 <code>for line in sys.stdin</code> 一行解決。</li>
<li><strong>為什麼 stderr 不能寫 stdout</strong>：stdio transport 的 invariant——stdout 是 protocol channel、只能寫 JSON-RPC message。任何 stray print() / debug output 進 stdout、會被 client parse JSON 時炸（「multiple JSON values on one line」或 invalid JSON）。所有 log / debug / progress message 必須走 stderr。寫錯這條 server 看起來不工作、debug 很久才找到。</li>
<li><strong>為什麼 dispatch 用 dict-of-handlers 而不是 if/elif chain</strong>：擴充性。加新 method 只要往 <code>HANDLERS</code> dict 加一項、不用改 main loop。也讓 dispatch logic 跟 method 實作分離、容易測試。</li>
<li><strong>為什麼每個 handler 都用 try/except 包</strong>：「single point of failure」設計——任何 handler 例外不影響其他 method。Server 應該是 long-running daemon、不能因為一個 tool bug 死掉。</li>
<li><strong>為什麼 errors 用 JSON-RPC error code 而不是 HTTP-style status</strong>：JSON-RPC 2.0 標準。<code>-32700</code> parse error、<code>-32600</code> invalid request、<code>-32601</code> method not found、<code>-32602</code> invalid params、<code>-32603</code> internal error、<code>-32000</code> to <code>-32099</code> 留給應用層自訂。</li>
</ul>
<h3 id="工具search_blog">工具：search_blog</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">tool_search_blog</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="n">records</span> <span class="o">=</span> <span class="n">load_index</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">q_vec</span> <span class="o">=</span> <span class="n">embed</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="n">scored</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">        <span class="p">((</span><span class="n">cosine</span><span class="p">(</span><span class="n">q_vec</span><span class="p">,</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;embedding&#34;</span><span class="p">]),</span> <span class="n">r</span><span class="p">)</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">records</span><span class="p">),</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">        <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="p">)[:</span><span class="n">top_k</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="n">results</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">            <span class="s2">&#34;source&#34;</span><span class="p">:</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;source&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">            <span class="s2">&#34;chunk_index&#34;</span><span class="p">:</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;chunk_index&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">            <span class="s2">&#34;score&#34;</span><span class="p">:</span> <span class="nb">round</span><span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">            <span class="s2">&#34;preview&#34;</span><span class="p">:</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;text&#34;</span><span class="p">][:</span><span class="mi">160</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="s2">&#34;...&#34;</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="s2">&#34;text&#34;</span><span class="p">])</span> <span class="o">&gt;</span> <span class="mi">160</span> <span class="k">else</span> <span class="s2">&#34;&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">        <span class="k">for</span> <span class="n">score</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">scored</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="k">return</span> <span class="p">{</span><span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;text&#34;</span><span class="p">,</span> <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">ensure_ascii</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)}]}</span></span></span></code></pre></div><p><strong>每段做什麼</strong>：</p>
<ol>
<li><strong><code>records = load_index()</code></strong>：lazy load <code>index.pkl</code>、第一次 call 載入記憶體、後續直接用 cached。Server 啟動時 lazy load 而不是 import 時 load、讓 server 即使在 Ollama 還沒起 / index 不存在時也能 boot（之後 call 才會報 error）。</li>
<li><strong><code>q_vec = embed(query)</code></strong>：把 query 轉成 768 維向量、呼叫 Ollama embedding API、跟 RAG demo 的 <code>embed</code> 是同一個 function。</li>
<li><strong><code>sorted((...) for r in records, key=lambda x: x[0], reverse=True)[:top_k]</code></strong>：generator expression + sorted 一次完成「算分 → 排序 → 取 top-K」。</li>
<li><strong><code>results = [{...} for score, r in scored]</code></strong>：把 top-K 整理成 client 友善的 dict 結構、含 source、chunk_index、score、preview（前 160 字 + 省略號）。</li>
<li><strong><code>{&quot;content&quot;: [{&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: json.dumps(...)}]}</code></strong>：MCP <code>tools/call</code> 標準 response 格式——<code>content</code> 是 array、每個元素 type + payload。<code>type: &quot;text&quot;</code> 是文字 content、<code>text</code> 是實際內容（這裡是 JSON 字串、讓 LLM 可以 parse）。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 generator expression 而非 list comprehension</strong>：<code>(... for r in records)</code> 是 generator、<code>sorted</code> 直接消費、不會在記憶體中建中間 list。對 463 records 影響不大、但展現 memory-efficient pattern。</li>
<li><strong>為什麼 preview 切到 160 字</strong>：兩件事的平衡——讓 LLM 看到的 search result 短（不淹沒 LLM 的 context）、但夠判讀（160 中文字約 80 token、能看出 chunk 是不是相關）。如果 LLM 要完整內容、再 call <code>read_chunk</code>。</li>
<li><strong>為什麼回傳 JSON 字串、不是 nested object</strong>：MCP <code>content</code> 規定每個 element 是 <code>{type, payload}</code>、<code>type: &quot;text&quot;</code> 的 <code>text</code> 必須是 string、不能直接放 nested object。要傳結構化資料、就把它 <code>json.dumps</code> 成字串。LLM 看到後可以自己 parse。</li>
<li><strong>為什麼 <code>ensure_ascii=False</code></strong>：預設 <code>json.dumps</code> 把非 ASCII 字元（如中文）轉成 <code>\uXXXX</code>、難讀。<code>ensure_ascii=False</code> 直接輸出 UTF-8、LLM 也能直接讀懂、節省 token 數（一個中文字 1 token vs 6 token 的 <code>中</code>）。</li>
<li><strong>為什麼 <code>round(score, 4)</code></strong>：score 是 float、原始可能是 <code>0.7497284598827362</code>、長且無意義。<code>round(score, 4)</code> 保留 4 位小數、<code>0.7497</code>、夠精確、wire size 短。</li>
</ul>
<h3 id="工具read_chunk">工具：read_chunk</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">tool_read_chunk</span><span class="p">(</span><span class="n">source</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">chunk_index</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="n">records</span> <span class="o">=</span> <span class="n">load_index</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">records</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">        <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;source&#34;</span><span class="p">]</span> <span class="o">==</span> <span class="n">source</span> <span class="ow">and</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;chunk_index&#34;</span><span class="p">]</span> <span class="o">==</span> <span class="n">chunk_index</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">            <span class="k">return</span> <span class="p">{</span><span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;text&#34;</span><span class="p">,</span> <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;text&#34;</span><span class="p">]}]}</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">    <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">        <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;text&#34;</span><span class="p">,</span> <span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;Not found: </span><span class="si">{</span><span class="n">source</span><span class="si">}</span><span class="s2">#chunk</span><span class="si">{</span><span class="n">chunk_index</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">}],</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">        <span class="s2">&#34;isError&#34;</span><span class="p">:</span> <span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl">    <span class="p">}</span></span></span></code></pre></div><p><strong>每段做什麼</strong>：</p>
<ol>
<li><strong><code>for r in records: if r[&quot;source&quot;] == source and r[&quot;chunk_index&quot;] == chunk_index: return ...</code></strong>：linear scan 找匹配的 record、找到回完整 text。</li>
<li><strong>找不到時 <code>return {... &quot;isError&quot;: True}</code></strong>：MCP 標準的「tool 內部失敗」訊號。<code>isError: True</code> 告訴 client「這個 tool call 失敗了」、<code>content</code> 內是 human-readable error message。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 linear scan 而不是 dict lookup</strong>：可以改用 <code>{(source, chunk_index): record}</code> dict 變 O(1)。但 463 records 的 linear scan 是 &lt; 1ms、optimize 不值得。Production 跟 vector DB 整合時、retrieval 系統自帶 indexing。</li>
<li><strong>為什麼 <code>isError: True</code> 而不是 JSON-RPC error</strong>：分兩種錯誤：
<ul>
<li><strong>Protocol error</strong>：method 不存在、params 不合法、JSON parse 失敗——回 JSON-RPC <code>error</code> 物件。</li>
<li><strong>Tool semantic error</strong>：method OK、params OK、但 tool 邏輯上不能 complete（找不到資料、外部 service down）——回 normal response 加 <code>isError: True</code>。
MCP 設計這層分離、讓 client / LLM 區分「我做錯了」（協議層）跟「資料不存在」（語意層）。Production 設計工具時要仔細區分。</li>
</ul>
</li>
</ul>
<h3 id="tool-描述用-json-schema">Tool 描述用 JSON Schema</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="n">TOOLS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="s2">&#34;search_blog&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">        <span class="s2">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Semantic search over blog content. Returns top-K relevant chunks with source paths.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">        <span class="s2">&#34;inputSchema&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">            <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">            <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">                <span class="s2">&#34;query&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">,</span> <span class="s2">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Natural language query&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">                <span class="s2">&#34;top_k&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;integer&#34;</span><span class="p">,</span> <span class="s2">&#34;default&#34;</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span> <span class="s2">&#34;minimum&#34;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">&#34;maximum&#34;</span><span class="p">:</span> <span class="mi">20</span><span class="p">},</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">            <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;query&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="s2">&#34;fn&#34;</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">args</span><span class="p">:</span> <span class="n">tool_search_blog</span><span class="p">(</span><span class="n">args</span><span class="p">[</span><span class="s2">&#34;query&#34;</span><span class="p">],</span> <span class="n">args</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;top_k&#34;</span><span class="p">,</span> <span class="mi">5</span><span class="p">)),</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="s2">&#34;read_chunk&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">        <span class="s2">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Read the full text of a specific chunk by source path and chunk index.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">        <span class="s2">&#34;inputSchema&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">            <span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;object&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">            <span class="s2">&#34;properties&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">                <span class="s2">&#34;source&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;string&#34;</span><span class="p">,</span> <span class="s2">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Markdown file path relative to content/&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">                <span class="s2">&#34;chunk_index&#34;</span><span class="p">:</span> <span class="p">{</span><span class="s2">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;integer&#34;</span><span class="p">,</span> <span class="s2">&#34;minimum&#34;</span><span class="p">:</span> <span class="mi">0</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">            <span class="s2">&#34;required&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;source&#34;</span><span class="p">,</span> <span class="s2">&#34;chunk_index&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">        <span class="p">},</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">        <span class="s2">&#34;fn&#34;</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">args</span><span class="p">:</span> <span class="n">tool_read_chunk</span><span class="p">(</span><span class="n">args</span><span class="p">[</span><span class="s2">&#34;source&#34;</span><span class="p">],</span> <span class="n">args</span><span class="p">[</span><span class="s2">&#34;chunk_index&#34;</span><span class="p">]),</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p><strong>每個 field 角色</strong>：</p>
<ol>
<li><strong><code>description</code></strong>：給 LLM 看的、解釋這個 tool 解什麼問題。LLM 看 description 決定何時 call。<strong>這是模型 follow tool 的最主要訊號</strong>——寫得清晰具體、模型用得對。</li>
<li><strong><code>inputSchema</code></strong>：JSON Schema、描述 tool 接受的參數結構。LLM application 用這個 schema 約束 LLM 生成「合法的呼叫」。</li>
<li><strong><code>properties</code></strong>：每個參數的型別 + 約束。</li>
<li><strong><code>required</code></strong>：必填參數清單。LLM 漏掉時、client 端可以 reject、不會浪費 round-trip。</li>
<li><strong><code>default</code></strong>：可選參數的預設值。傳的時候不給、tool 就用 default。</li>
<li><strong><code>minimum</code> / <code>maximum</code></strong>：數值約束。<code>top_k</code> 設 1-20 是因為 &lt; 1 沒意義、&gt; 20 浪費 retrieval。</li>
<li><strong><code>fn</code></strong>：實際 dispatch 用的 callable。本 demo 用 lambda 把 <code>args</code> dict 轉成 positional / keyword call。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 description 要具體</strong>：LLM 看 description 決定 call 時機。「search the blog」對 LLM 來說太模糊（搜什麼？找什麼？）、改成「Semantic search over blog content. Returns top-K relevant chunks with source paths.」明確描述輸入跟輸出形狀、LLM 能判讀「使用者問技術問題時該 call 這個」。</li>
<li><strong>為什麼 schema 用 JSON Schema、不是自訂格式</strong>：JSON Schema 是 web 標準、所有 LLM application 都認識、跨 framework 可移植。也是 <a href="/blog/llm/knowledge-cards/function-calling/" data-link-title="Function Calling" data-link-desc="模型訓練階段建立的「呼叫工具」能力：知道何時該呼叫、傳什麼參數">function calling</a> 跟 <a href="/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">Tool use 原理</a> 的 schema 描述語言。</li>
<li><strong>為什麼 <code>required</code> 跟 <code>default</code> 兩個機制</strong>：對 LLM 看的 prompt 越清楚越好。<code>required</code> 告訴 LLM「不傳這個會錯」、<code>default</code> 告訴 LLM「可不傳、預設值是 X」。沒分清的話、LLM 可能總是傳所有參數、雜訊多。</li>
<li><strong>為什麼 <code>fn</code> 用 lambda 包</strong>：實際 tool function 是 positional args、但 client 送的是 dict。lambda 把 dict 拆成 function call 的 args。也方便將來如果 tool function signature 變、只要改 lambda 不用改 dispatcher。</li>
</ul>
<h2 id="client-實作測試用">Client 實作（測試用）</h2>
<p>完整檔案：<code>scripts/mcp-demo/test_client.py</code>。實際 production 用 Claude Desktop / Cursor 等 MCP-capable application。本 demo 寫一個 stdio client、模擬 application 行為：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="n">proc</span> <span class="o">=</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">Popen</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="p">[</span><span class="n">sys</span><span class="o">.</span><span class="n">executable</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">SERVER</span><span class="p">)],</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">stdin</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="o">.</span><span class="n">PIPE</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="n">text</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="n">bufsize</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="k">def</span> <span class="nf">send</span><span class="p">(</span><span class="n">method</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">rid</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="n">msg</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&#34;jsonrpc&#34;</span><span class="p">:</span> <span class="s2">&#34;2.0&#34;</span><span class="p">,</span> <span class="s2">&#34;method&#34;</span><span class="p">:</span> <span class="n">method</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="k">if</span> <span class="n">params</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">        <span class="n">msg</span><span class="p">[</span><span class="s2">&#34;params&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="n">params</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="k">if</span> <span class="n">rid</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">        <span class="n">msg</span><span class="p">[</span><span class="s2">&#34;id&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="n">rid</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="n">proc</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="n">proc</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="k">if</span> <span class="n">rid</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">        <span class="k">return</span> <span class="kc">None</span>  <span class="c1"># notification</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">    <span class="n">line</span> <span class="o">=</span> <span class="n">proc</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">    <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">line</span><span class="p">)</span></span></span></code></pre></div><p><strong>每個參數做什麼</strong>：</p>
<ol>
<li><strong><code>subprocess.Popen([sys.executable, str(SERVER)], ...)</code></strong>：spawn server 當 child process。用 <code>sys.executable</code> 確保用同一個 Python interpreter（避免 venv 跟系統 Python 混用）。</li>
<li><strong><code>stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE</code></strong>：三條 pipe 都接到 client、讓我們能讀寫 server 的 stdio。</li>
<li><strong><code>text=True</code></strong>：自動處理 str ↔ bytes 編碼、直接讀寫字串、不用手動 encode/decode。預設是 binary mode。</li>
<li><strong><code>bufsize=1</code></strong>：line buffering、每寫一行就 flush。沒這個的話、Python 預設 block buffering（4KB 才 flush）、client 寫的 message server 看不到、整個卡住。</li>
<li><strong><code>proc.stdin.write(json.dumps(msg) + &quot;\n&quot;)</code></strong>：寫 JSON 訊息、結尾加 <code>\n</code>（line-delimited）。</li>
<li><strong><code>proc.stdin.flush()</code></strong>：強制立刻送出。即使有 <code>bufsize=1</code>、明確 flush 是好習慣、避免任何 buffer 累積。</li>
<li><strong><code>if rid is None: return None</code></strong>：notification 不該等 response。</li>
<li><strong><code>line = proc.stdout.readline()</code> + <code>json.loads(line)</code></strong>：讀一行 response、parse。</li>
</ol>
<p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong>為什麼 stdio 而不是 socket / HTTP</strong>：MCP stdio transport 的主要場景是「application spawn server」(Claude Desktop 開 Python 進程當 MCP server)。Stdio 自然形成 1-to-1 ownership、不需要 port allocation、不需要 auth。HTTP transport 也存在、用在 multi-client 場景。</li>
<li><strong>為什麼 <code>bufsize=1</code> 這麼關鍵</strong>：Python 預設 stdio buffer 4KB。如果 server / client 任一邊寫了 short message 但沒 fill 4KB、message 不會被另一邊看到、protocol 卡死。看起來是 hang、debug 困難。<code>bufsize=1</code> 強制 line buffering、解決這個 deadlock。</li>
<li><strong>為什麼 <code>text=True</code></strong>：JSON-RPC 都是文字、binary mode 要手動 <code>.encode()</code> / <code>.decode()</code>、增加複雜度。<code>text=True</code> 自動處理 UTF-8。</li>
</ul>
<h2 id="跑通整條流程">跑通整條流程</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects/blog
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/mcp-demo/test_client.py</span></span></code></pre></div><ul>
<li><code>cd ~/Projects/blog</code>：切到 repo 根、讓 SERVER 路徑相對解析正確。</li>
<li><code>python3 scripts/mcp-demo/test_client.py</code>：跑 test client、它會 spawn server 跟它對話。</li>
</ul>
<p>預期看到五個階段：</p>
<h3 id="1-initialize握手">1. initialize（握手）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="err">===</span> <span class="mi">1</span><span class="err">.</span> <span class="err">initialize</span> <span class="err">===</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nt">&#34;jsonrpc&#34;</span><span class="p">:</span> <span class="s2">&#34;2.0&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  <span class="nt">&#34;result&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="nt">&#34;protocolVersion&#34;</span><span class="p">:</span> <span class="s2">&#34;2025-03-26&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="nt">&#34;capabilities&#34;</span><span class="p">:</span> <span class="p">{</span><span class="nt">&#34;tools&#34;</span><span class="p">:</span> <span class="p">{}},</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="nt">&#34;serverInfo&#34;</span><span class="p">:</span> <span class="p">{</span><span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;blog-mcp-demo&#34;</span><span class="p">,</span> <span class="nt">&#34;version&#34;</span><span class="p">:</span> <span class="s2">&#34;0.1.0&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p><strong>Protocol 意義</strong>：</p>
<ul>
<li><code>protocolVersion</code>：server 支援的 MCP 版本。Client 要 negotiate（自己 cap 較新時要 downgrade）。</li>
<li><code>capabilities.tools: {}</code>：server 宣告「我支援 tools 功能」、空 object 表示沒額外 sub-feature。Client 拿到後知道可以 call <code>tools/list</code>。</li>
<li><code>serverInfo</code>：server 識別資訊、給 client 顯示用（debug、logging）。</li>
<li><code>id: 1</code>：對應 client 送的 request id、讓 client 知道這個 response 是哪個 request 的。</li>
</ul>
<h3 id="2-toolslist">2. tools/list</h3>
<p>Server 回兩個 tool 的完整 schema：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  <span class="nt">&#34;tools&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">      <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;search_blog&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Semantic search over blog content...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">      <span class="nt">&#34;inputSchema&#34;</span><span class="p">:</span> <span class="p">{</span><span class="err">...JSON</span> <span class="err">Schema...</span><span class="p">}</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">      <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;read_chunk&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Read the full text of a specific chunk...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">      <span class="nt">&#34;inputSchema&#34;</span><span class="p">:</span> <span class="p">{</span><span class="err">...</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p><strong>Protocol 意義</strong>：這個輸出就是 LLM application 會塞給 LLM 的 tool 描述。LLM application 把這份 schema 用 <a href="/blog/llm/knowledge-cards/function-calling/" data-link-title="Function Calling" data-link-desc="模型訓練階段建立的「呼叫工具」能力：知道何時該呼叫、傳什麼參數">function calling</a> 機制給模型看、模型決定何時呼叫、傳什麼參數。Server 跟模型之間靠這層 schema 對齊、模型不直接呼叫 server、是經 application 中介。</p>
<h3 id="3-toolscall-search_blog">3. tools/call: search_blog</h3>
<p>Client 送：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nt">&#34;method&#34;</span><span class="p">:</span> <span class="s2">&#34;tools/call&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nt">&#34;params&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">    <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;search_blog&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="nt">&#34;arguments&#34;</span><span class="p">:</span> <span class="p">{</span><span class="nt">&#34;query&#34;</span><span class="p">:</span> <span class="s2">&#34;什麼是 KV cache？&#34;</span><span class="p">,</span> <span class="nt">&#34;top_k&#34;</span><span class="p">:</span> <span class="mi">3</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">  <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="mi">3</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p><code>params</code> 包兩件事：</p>
<ul>
<li><code>name</code>：要 call 的 tool 名（matches <code>tools/list</code> 內某個 tool）。</li>
<li><code>arguments</code>：實際傳給 tool 的 dict、結構符合該 tool 的 <code>inputSchema</code>。</li>
</ul>
<p>Server 回 cosine 搜尋結果（preview）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="p">{</span><span class="nt">&#34;source&#34;</span><span class="p">:</span> <span class="s2">&#34;llm/00-foundations/hardware-memory-budget.md&#34;</span><span class="p">,</span> <span class="nt">&#34;chunk_index&#34;</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span> <span class="nt">&#34;score&#34;</span><span class="p">:</span> <span class="mf">0.7497</span><span class="p">,</span> <span class="nt">&#34;preview&#34;</span><span class="p">:</span> <span class="s2">&#34;| Context 長度 | KV cache 估算...&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="p">{</span><span class="nt">&#34;source&#34;</span><span class="p">:</span> <span class="s2">&#34;llm/00-foundations/why-llm-feels-slow.md&#34;</span><span class="p">,</span> <span class="nt">&#34;chunk_index&#34;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span> <span class="nt">&#34;score&#34;</span><span class="p">:</span> <span class="mf">0.7212</span><span class="p">,</span> <span class="nt">&#34;preview&#34;</span><span class="p">:</span> <span class="s2">&#34;...&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  <span class="p">{</span><span class="nt">&#34;source&#34;</span><span class="p">:</span> <span class="s2">&#34;llm/03-theoretical-foundations/attention-mechanism.md&#34;</span><span class="p">,</span> <span class="nt">&#34;chunk_index&#34;</span><span class="p">:</span> <span class="mi">7</span><span class="p">,</span> <span class="nt">&#34;score&#34;</span><span class="p">:</span> <span class="mf">0.7176</span><span class="p">,</span> <span class="nt">&#34;preview&#34;</span><span class="p">:</span> <span class="s2">&#34;...&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="p">]</span></span></span></code></pre></div><p>實測命中合理——KV cache 相關段落都被找到。</p>
<h3 id="4-toolscall-read_chunk">4. tools/call: read_chunk</h3>
<p>Client 用 search 拿到的 source + chunk_index、call <code>read_chunk</code> 拿完整內容：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  <span class="nt">&#34;method&#34;</span><span class="p">:</span> <span class="s2">&#34;tools/call&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nt">&#34;params&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;read_chunk&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="nt">&#34;arguments&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">      <span class="nt">&#34;source&#34;</span><span class="p">:</span> <span class="s2">&#34;llm/00-foundations/hardware-memory-budget.md&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">      <span class="nt">&#34;chunk_index&#34;</span><span class="p">:</span> <span class="mi">5</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>Server 回該 chunk 的完整 markdown 文字。這實現了「search → read」的兩段流程——避免 search 一次就把所有 chunk 完整內容塞給 LLM（context 暴炸）、讓 LLM 自己看 preview 決定要 deep dive 哪個。</p>
<h3 id="5-錯誤路徑">5. 錯誤路徑</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="err">===</span> <span class="mi">5</span><span class="err">.</span> <span class="err">unknown</span> <span class="err">method</span> <span class="err">(error</span> <span class="err">path)</span> <span class="err">===</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="p">{</span><span class="nt">&#34;jsonrpc&#34;</span><span class="p">:</span> <span class="s2">&#34;2.0&#34;</span><span class="p">,</span> <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span> <span class="nt">&#34;error&#34;</span><span class="p">:</span> <span class="p">{</span><span class="nt">&#34;code&#34;</span><span class="p">:</span> <span class="mi">-32601</span><span class="p">,</span> <span class="nt">&#34;message&#34;</span><span class="p">:</span> <span class="s2">&#34;Method not found: does/not/exist&#34;</span><span class="p">}}</span></span></span></code></pre></div><p><code>-32601</code> 是 JSON-RPC 標準 error code for unknown method。Server 對未知 method 回標準 error、不 crash。Client 知道這個 method 不能用、繼續其他操作。</p>
<h2 id="跟-claude-desktop--cursor-整合">跟 Claude Desktop / Cursor 整合</h2>
<p>把這個 server 接到實際 MCP-capable application：</p>
<h3 id="claude-desktop">Claude Desktop</h3>
<p>編輯 <code>~/Library/Application Support/Claude/claude_desktop_config.json</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nt">&#34;mcpServers&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="nt">&#34;blog-search&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">      <span class="nt">&#34;command&#34;</span><span class="p">:</span> <span class="s2">&#34;/path/to/python3&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">      <span class="nt">&#34;args&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;&lt;absolute-path-to-blog&gt;/scripts/mcp-demo/blog_mcp_server.py&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p><strong>每個 field 做什麼</strong>：</p>
<ul>
<li><code>mcpServers</code>：MCP server 註冊表、key 是任意名稱（client 識別用）。</li>
<li><code>command</code>：spawn 用的 executable path。要寫絕對路徑、Claude Desktop 啟動時的 PATH 可能不含 <code>python3</code>。</li>
<li><code>args</code>：傳給 command 的 args list。第一個是 script path。</li>
</ul>
<p><strong>為什麼這樣設計</strong>：Claude Desktop 啟動時讀這個 config、對每個 server 用 <code>subprocess.spawn(command, args)</code> 起 child process、用 stdio 跟它對話。跟本 demo 的 <code>test_client.py</code> 做的事完全一樣、只是改成 GUI application 而已。</p>
<p>重啟 Claude Desktop 後、在對話框問「用 search_blog 找 KV cache 相關段落」、Claude 會自動 call tool 並用結果回答。</p>
<h3 id="cursor">Cursor</h3>
<p><code>.cursor/mcp.json</code>（per-project）或全域設定類似結構。具體欄位看當下版本文件。</p>
<p>兩種整合的共通點：<strong>MCP server 自己不變</strong>、只要 application 端配置 path 跟 args、整合就完成。這正是 4.3 章節 N×M → N+M 的具體展現——本 server 不為任何特定 application 客製化、就能被多個 application 接到。</p>
<h2 id="觀察跟原理對應">觀察跟原理對應</h2>
<p>回到 <a href="/blog/llm/04-applications/application-protocols/" data-link-title="4.6 應用層協議：function calling / structured output / MCP" data-link-desc="三個常被混為一談的概念：模型能力、sampling 約束、server 協議，三者的層級差異與組合方式">4.6 應用層協議</a> 的三層 framing：</p>
<table>
  <thead>
      <tr>
          <th>層級</th>
          <th>本 demo 是否實作</th>
          <th>怎麼實作</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>模型能力</td>
          <td>不在本 demo 範圍</td>
          <td>LLM application 自己決定用 GPT/Claude/Gemma</td>
      </tr>
      <tr>
          <td>Sampling 約束</td>
          <td>不在本 demo 範圍</td>
          <td>application + 推論伺服器配合</td>
      </tr>
      <tr>
          <td>Server 協議</td>
          <td><strong>本 demo 焦點</strong></td>
          <td>JSON-RPC over stdio + tools/list / tools/call</td>
      </tr>
  </tbody>
</table>
<p>這個分離正是 MCP 的核心收益：server 寫好之後、用什麼 LLM 跟它互動跟 server 無關。換掉 LLM、換掉 application、server code 完全不動。</p>
<h2 id="何時這份-demo-會過時">何時這份 demo 會過時</h2>
<ul>
<li><strong>MCP protocol version</strong>：目前用 <code>2025-03-26</code>、未來會更新、但「server 暴露 tool 給 application」的 framing 不變。</li>
<li><strong>JSON-RPC 細節</strong>：可能 transport 形式增加（HTTP / WebSocket）、stdio 不會消失。</li>
<li><strong>Tool 描述格式</strong>：JSON Schema 是 web 通用標準、不會被換掉。</li>
</ul>
<p>實作換代時、可以把手寫 JSON-RPC 換成官方 SDK、tool 內部邏輯（embedding / cosine / pickle）依需求換、但 protocol 骨架（initialize / tools/list / tools/call）會保留。</p>
<h2 id="跑這個-demo-的指令總結">跑這個 demo 的指令總結</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 前置：確認 Ollama 跑著、index.pkl 存在</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama list <span class="p">|</span> grep nomic-embed-text
</span></span><span class="line"><span class="ln">3</span><span class="cl">ls scripts/rag-demo/index.pkl</span></span></code></pre></div><ul>
<li><code>ollama list</code>：列已下載 model、<code>grep</code> 過濾出 embedding model。沒看到表示要先 <code>ollama pull nomic-embed-text</code>。</li>
<li><code>ls scripts/rag-demo/index.pkl</code>：確認 RAG ingest 跑過、index 存在。沒看到要先跑 <code>python3 scripts/rag-demo/ingest.py</code>。</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 自動測試 MCP server</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/mcp-demo/test_client.py</span></span></code></pre></div><ul>
<li>跑 test_client、spawn server、依序送 5 個 request 驗證 protocol。stdout 印 protocol 對話、stderr 印 server log。看到全部 5 階段 OK 就成功。</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 手動跟 server 互動（看 protocol 原始 wire format）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/mcp-demo/blog_mcp_server.py
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># 然後手打：{&#34;jsonrpc&#34;:&#34;2.0&#34;,&#34;id&#34;:1,&#34;method&#34;:&#34;initialize&#34;,&#34;params&#34;:{}}</span></span></span></code></pre></div><ul>
<li>直接 invoke server、它讀 stdin 等 request。手打 JSON-RPC 訊息、看 server 回。是學 protocol 最直接的方式——你會看到 wire format 真實長相、跟自動 client 包裝後不一樣。</li>
</ul>
<p>完整 source 在 <code>scripts/mcp-demo/</code>、約 250 行 Python、stdlib only。</p>
<p>跟其他 hands-on 章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、本 demo 依賴的索引由 <a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a> ingest 產生、MCP + RAG 同跑的記憶體 / 程序預算見 <a href="/blog/llm/01-local-llm-services/hands-on/rag-mcp-resources/" data-link-title="Hands-on：RAG / MCP 的資源 footprint" data-link-desc="RAG ingest / query / MCP server 三階段的 RAM / 磁碟 / process 實測、多模型並存的 RAM 衝突、本地 LLM 跑 RAG 跟單純 chat 的差異">RAG + MCP resource footprint</a>、術語見 <a href="/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP</a>。</p>
]]></content:encoded></item><item><title>Hands-on：Ollama 改檔案 / 寫程式碼的權限邊界在哪</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/permission-boundary/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/permission-boundary/</guid><description>&lt;p>「Ollama 自己改檔案要不要 sudo？」「叫它寫 &lt;code>rm -rf&lt;/code> 會直接刪嗎？」這類問題的答案來自一個根本事實：&lt;strong>LLM 是 pure function、文字進、文字出、本身沒任何 file system / shell / network 副作用&lt;/strong>。改檔案、刪檔案、發網路請求、執行 shell command——全部由 &lt;strong>wrapper 或人類&lt;/strong>做。LLM 「以為」自己做了什麼、跟實際發生什麼是兩件事。&lt;/p>
&lt;p>本篇用四組對照實驗證明這個事實、再展開 wrapper 三檔審查粒度的設計取捨。這跟 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">4.3 副作用範圍設計&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent 跟人類審查的協作模型&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 隱私資料流原理&lt;/a> 三個原則章節對應、實作層的權限與供應鏈判讀對應 &lt;a href="https://tarrragon.github.io/blog/llm/06-security/tool-use-permission-model/" data-link-title="6.2 tool use 與 MCP server 的權限模型" data-link-desc="個人 dev 場景下 tool use / MCP server 的副作用權限：檔案系統 / shell / 網路存取邊界、第三方 MCP 信任、副作用的可逆性">6.2 tool use 與 MCP server 的權限模型&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/llm/06-security/model-supply-chain-trust/" data-link-title="6.0 模型供應鏈與信任邊界" data-link-desc="個人 dev 用本地 LLM 時的模型權重來源信任：GGUF 完整性、Hugging Face / Ollama registry 信任、量化版本污染、檔案完整性檢查">6.0 模型供應鏈與信任邊界&lt;/a>。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：Ollama 0.23.2、&lt;code>gemma3:1b&lt;/code>、Python stdlib
&lt;strong>檔案位置&lt;/strong>：&lt;code>scripts/permission-demo/edit_with_llm.py&lt;/code>&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼這個問題重要">為什麼這個問題重要&lt;/h2>
&lt;p>直覺常見的誤判：&lt;/p>
&lt;ul>
&lt;li>「LLM 寫了 &lt;code>rm -rf&lt;/code> 我電腦會壞」——錯。LLM 寫指令不代表執行。&lt;/li>
&lt;li>「Ollama API 改我檔案要 sudo」——錯。Ollama API 根本碰不到檔案。&lt;/li>
&lt;li>「我跑 wrapper 就讓 LLM 改檔案、應該有 confirm 機制吧」——錯。Confirm 機制完全是 wrapper 開發者自己決定要不要寫、LLM 不知道、不在乎。&lt;/li>
&lt;/ul>
&lt;p>理解這個邊界、後續設計 LLM 應用的權限模型才有 ground truth。錯誤的 mental model 會導致兩種 failure：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>過度恐懼&lt;/strong>：因為怕 LLM「亂改」、把所有 LLM 互動關起來、放棄自動化收益。&lt;/li>
&lt;li>&lt;strong>過度信任&lt;/strong>：相信 LLM「不會做壞事」、給 wrapper 自動執行權限、結果小模型亂解 instruction 把資料毀掉。&lt;/li>
&lt;/ol>
&lt;p>實際上權限設計的判讀錨點是：&lt;strong>這個動作有沒有副作用、誰執行&lt;/strong>。LLM 永遠不執行、所以權限不在 LLM 層；wrapper 執行、所以權限完全在 wrapper 設計。&lt;/p>
&lt;h2 id="test-1直接-api-問改檔案看會發生什麼">Test 1：直接 API 問改檔案、看會發生什麼&lt;/h2>
&lt;p>挑一個檔案（token 卡片）、用 curl 送 chat completions、prompt 寫「修改這個檔案」、然後 check 檔案 mtime 跟 md5：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1"># 修改前 snapshot&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">stat -f &lt;span class="s2">&amp;#34;%m %N&amp;#34;&lt;/span> content/llm/knowledge-cards/token.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">md5 -q content/llm/knowledge-cards/token.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1"># 用 system prompt「假裝你有 file 權限」、user 直接指明路徑&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">curl -s http://localhost:11434/v1/chat/completions &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> -H &lt;span class="s2">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> -d &lt;span class="s1">&amp;#39;{
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="s1"> &amp;#34;model&amp;#34;:&amp;#34;gemma3:1b&amp;#34;,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="s1"> &amp;#34;messages&amp;#34;:[
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="s1"> {&amp;#34;role&amp;#34;:&amp;#34;system&amp;#34;,&amp;#34;content&amp;#34;:&amp;#34;You can modify files. The user provides a file. You modify it.&amp;#34;},
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="s1"> {&amp;#34;role&amp;#34;:&amp;#34;user&amp;#34;,&amp;#34;content&amp;#34;:&amp;#34;Please modify /Users/.../token.md to add a sentence...&amp;#34;}
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="s1"> ],
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="s1"> &amp;#34;stream&amp;#34;:false
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="s1"> }&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="c1"># 修改後 snapshot&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">stat -f &lt;span class="s2">&amp;#34;%m %N&amp;#34;&lt;/span> content/llm/knowledge-cards/token.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">md5 -q content/llm/knowledge-cards/token.md&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>實測結果&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<p>「Ollama 自己改檔案要不要 sudo？」「叫它寫 <code>rm -rf</code> 會直接刪嗎？」這類問題的答案來自一個根本事實：<strong>LLM 是 pure function、文字進、文字出、本身沒任何 file system / shell / network 副作用</strong>。改檔案、刪檔案、發網路請求、執行 shell command——全部由 <strong>wrapper 或人類</strong>做。LLM 「以為」自己做了什麼、跟實際發生什麼是兩件事。</p>
<p>本篇用四組對照實驗證明這個事實、再展開 wrapper 三檔審查粒度的設計取捨。這跟 <a href="/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">4.3 副作用範圍設計</a>、<a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent 跟人類審查的協作模型</a>、<a href="/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 隱私資料流原理</a> 三個原則章節對應、實作層的權限與供應鏈判讀對應 <a href="/blog/llm/06-security/tool-use-permission-model/" data-link-title="6.2 tool use 與 MCP server 的權限模型" data-link-desc="個人 dev 場景下 tool use / MCP server 的副作用權限：檔案系統 / shell / 網路存取邊界、第三方 MCP 信任、副作用的可逆性">6.2 tool use 與 MCP server 的權限模型</a> 跟 <a href="/blog/llm/06-security/model-supply-chain-trust/" data-link-title="6.0 模型供應鏈與信任邊界" data-link-desc="個人 dev 用本地 LLM 時的模型權重來源信任：GGUF 完整性、Hugging Face / Ollama registry 信任、量化版本污染、檔案完整性檢查">6.0 模型供應鏈與信任邊界</a>。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：Ollama 0.23.2、<code>gemma3:1b</code>、Python stdlib
<strong>檔案位置</strong>：<code>scripts/permission-demo/edit_with_llm.py</code></p></blockquote>
<h2 id="為什麼這個問題重要">為什麼這個問題重要</h2>
<p>直覺常見的誤判：</p>
<ul>
<li>「LLM 寫了 <code>rm -rf</code> 我電腦會壞」——錯。LLM 寫指令不代表執行。</li>
<li>「Ollama API 改我檔案要 sudo」——錯。Ollama API 根本碰不到檔案。</li>
<li>「我跑 wrapper 就讓 LLM 改檔案、應該有 confirm 機制吧」——錯。Confirm 機制完全是 wrapper 開發者自己決定要不要寫、LLM 不知道、不在乎。</li>
</ul>
<p>理解這個邊界、後續設計 LLM 應用的權限模型才有 ground truth。錯誤的 mental model 會導致兩種 failure：</p>
<ol>
<li><strong>過度恐懼</strong>：因為怕 LLM「亂改」、把所有 LLM 互動關起來、放棄自動化收益。</li>
<li><strong>過度信任</strong>：相信 LLM「不會做壞事」、給 wrapper 自動執行權限、結果小模型亂解 instruction 把資料毀掉。</li>
</ol>
<p>實際上權限設計的判讀錨點是：<strong>這個動作有沒有副作用、誰執行</strong>。LLM 永遠不執行、所以權限不在 LLM 層；wrapper 執行、所以權限完全在 wrapper 設計。</p>
<h2 id="test-1直接-api-問改檔案看會發生什麼">Test 1：直接 API 問改檔案、看會發生什麼</h2>
<p>挑一個檔案（token 卡片）、用 curl 送 chat completions、prompt 寫「修改這個檔案」、然後 check 檔案 mtime 跟 md5：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 修改前 snapshot</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">stat -f <span class="s2">&#34;%m %N&#34;</span> content/llm/knowledge-cards/token.md
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">md5 -q content/llm/knowledge-cards/token.md
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 用 system prompt「假裝你有 file 權限」、user 直接指明路徑</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">curl -s http://localhost:11434/v1/chat/completions <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  -H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s1">    &#34;model&#34;:&#34;gemma3:1b&#34;,
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s1">    &#34;messages&#34;:[
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s1">      {&#34;role&#34;:&#34;system&#34;,&#34;content&#34;:&#34;You can modify files. The user provides a file. You modify it.&#34;},
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s1">      {&#34;role&#34;:&#34;user&#34;,&#34;content&#34;:&#34;Please modify /Users/.../token.md to add a sentence...&#34;}
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s1">    ],
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s1">    &#34;stream&#34;:false
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s1">  }&#39;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"># 修改後 snapshot</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">stat -f <span class="s2">&#34;%m %N&#34;</span> content/llm/knowledge-cards/token.md
</span></span><span class="line"><span class="ln">19</span><span class="cl">md5 -q content/llm/knowledge-cards/token.md</span></span></code></pre></div><p><strong>實測結果</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">=== Before ===
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">1778508712 content/llm/knowledge-cards/token.md
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">d9f2d822f7458af62399076a94ef20f6
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">=== LLM response ===
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Okay, here&#39;s the modified content of `/Users/.../token.md`...
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">=== After ===
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">1778508712 content/llm/knowledge-cards/token.md  ← mtime same
</span></span><span class="line"><span class="ln">10</span><span class="cl">d9f2d822f7458af62399076a94ef20f6                  ← md5 same</span></span></code></pre></div><p>mtime 沒變、md5 沒變、檔案內容完全沒動。但 LLM 用「Okay, here&rsquo;s the modified content」這種口氣回答——它<strong>以為</strong>自己改了、實際上只生成了一段 markdown 文字。</p>
<p><strong>結論</strong>：Ollama HTTP API 是 stateless、pure function。輸入 messages、輸出 message content。整個過程沒寫進 socket 以外的任何地方。</p>
<p>為什麼會這樣設計：</p>
<ul>
<li><strong>沙箱本來就在 API 邊界</strong>：HTTP server 接 request、跑 forward pass、回 response。期間沒呼叫 <code>fs.write()</code> / <code>subprocess.run()</code> / 任何 effectful API。</li>
<li><strong><a href="/blog/llm/knowledge-cards/system-prompt/" data-link-title="System Prompt" data-link-desc="LLM application 中由開發者預設、不直接顯示給使用者的指令層、定義模型的角色、行為規範、輸出格式">system prompt</a> 不是權限授予</strong>：「You can modify files」這句話對模型來說只是文字 context、不會真的給它 file access。Prompt 是「LLM 內部的 context」、不是「runtime capability」。</li>
<li><strong>訓練資料讓 LLM 「以為」自己有能力</strong>：LLM 訓練資料含大量「使用者問問題、AI 改檔案」的範例（如 GitHub Copilot agent traces、tool-use SFT 資料）、模型學會用「我已經改了」這種語氣回答——是 mimic、不是真正的 action。</li>
</ul>
<h2 id="test-2寫-wrapper-用-dry-run-模式安全處理">Test 2：寫 wrapper 用 &ndash;dry-run 模式安全處理</h2>
<p>權限不在 LLM、在 wrapper。寫一個 100 行的 wrapper、看怎麼設計 permission gates。完整檔案：<code>scripts/permission-demo/edit_with_llm.py</code>。</p>
<p>核心 architecture：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="c1"># 1. 讀檔（wrapper 用自己的 fs 權限）</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="n">original</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">file</span><span class="o">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&#34;utf-8&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="c1"># 2. 送 LLM、拿回提議的新內容</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="p">([</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;You modify text files. Output ONLY ...&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        <span class="p">{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;File: </span><span class="si">{</span><span class="n">args</span><span class="o">.</span><span class="n">file</span><span class="si">}</span><span class="se">\n</span><span class="s2">Content:</span><span class="se">\n</span><span class="si">{</span><span class="n">original</span><span class="si">}</span><span class="se">\n</span><span class="s2">Instruction: </span><span class="si">{</span><span class="n">args</span><span class="o">.</span><span class="n">instruction</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="p">])</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="n">new_content</span> <span class="o">=</span> <span class="n">extract_code_block</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="c1"># 3. Diff（純讀、永遠 safe、不需 gate）</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="n">diff</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">difflib</span><span class="o">.</span><span class="n">unified_diff</span><span class="p">(</span><span class="n">original</span><span class="o">.</span><span class="n">splitlines</span><span class="p">(</span><span class="o">...</span><span class="p">),</span> <span class="n">new_content</span><span class="o">.</span><span class="n">splitlines</span><span class="p">(</span><span class="o">...</span><span class="p">)))</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">writelines</span><span class="p">(</span><span class="n">diff</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="c1"># 4. PERMISSION GATE：wrapper 決定要不要 apply</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">auto</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">        <span class="n">args</span><span class="o">.</span><span class="n">file</span><span class="o">.</span><span class="n">write_text</span><span class="p">(</span><span class="n">new_content</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">    <span class="k">elif</span> <span class="n">args</span><span class="o">.</span><span class="n">confirm</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">        <span class="k">if</span> <span class="nb">input</span><span class="p">(</span><span class="s2">&#34;Apply? [y/N] &#34;</span><span class="p">)</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="o">==</span> <span class="s2">&#34;y&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">            <span class="n">args</span><span class="o">.</span><span class="n">file</span><span class="o">.</span><span class="n">write_text</span><span class="p">(</span><span class="n">new_content</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">    <span class="k">else</span><span class="p">:</span>  <span class="c1"># --dry-run，預設</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">        <span class="k">pass</span>  <span class="c1"># 不寫</span></span></span></code></pre></div><p><strong>為什麼這樣設計</strong>：</p>
<ul>
<li><strong><code>extract_code_block</code></strong>：嘗試 well-formed <code>```lang\n...\n```</code> regex、失敗 fallback 到 <code>```lang\n...$</code> 寬鬆版。小模型（1B）常忘記結尾 fence、寬鬆才能用。寫嚴格 regex 失敗時直接 abort、是另一種 permission gate（不應用 = 安全）。</li>
<li><strong>永遠先印 diff</strong>：diff 是純讀操作、無副作用、永遠 safe。讓使用者先看 LLM 提議了什麼、再決定要不要 apply。</li>
<li><strong><code>args.auto</code> 在 <code>elif</code> 鏈最前面、<code>dry-run</code> 預設</strong>：強迫使用者明示 opt-in 才會寫檔。預設不寫、是「safe default」設計原則。</li>
</ul>
<p>跑 <code>--dry-run</code> 預設、看實際發生：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python3 scripts/permission-demo/edit_with_llm.py <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  content/llm/knowledge-cards/token.md <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;把開頭第一段最後加一句『Token 是 embedding 的輸入單位』&#34;</span></span></span></code></pre></div><p>實測輸出（1B 模型）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">[+] Asking gemma3:1b to: &#39;把開頭第一段最後加一句「Token 是 embedding 的輸入單位」&#39;
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">[+] Proposed diff:
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">--- a/token.md
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">+++ b/token.md
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">@@ -6,16 +6,4 @@
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"> tags: [&#34;llm&#34;, &#34;knowledge-cards&#34;]
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"> ---
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">-Token 的核心概念是「LLM 內部處理文字的最小單位」...（整段刪除）
</span></span><span class="line"><span class="ln">10</span><span class="cl">-
</span></span><span class="line"><span class="ln">11</span><span class="cl">-## 概念位置
</span></span><span class="line"><span class="ln">12</span><span class="cl">-...（整段刪除）
</span></span><span class="line"><span class="ln">13</span><span class="cl">-...（後面所有段落都刪除）
</span></span><span class="line"><span class="ln">14</span><span class="cl">+Token 是 embedding 的輸入單位。
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">[+] --dry-run: file unchanged. Use --confirm or --auto to apply.</span></span></code></pre></div><p><strong>驚悚發現</strong>：1B 模型完全沒理解「加一句」、把整篇刪掉只剩一行。但 <code>--dry-run</code> 不寫檔、檔案安全。</p>
<p><strong>重點</strong>：</p>
<ul>
<li>LLM 行為糟、但 wrapper 設計安全、結果 OK。</li>
<li>把同樣 instruction 餵 31B+ 模型結果會合理——模型能力決定 LLM 端品質、wrapper 設計決定<strong>最差情況的後果</strong>。</li>
<li>在 wrapper 端永遠假設 LLM 會亂改、設計 safe default、是 defensive programming。</li>
</ul>
<h2 id="test-3--confirm-模式step-by-step-審查">Test 3：<code>--confirm</code> 模式、step-by-step 審查</h2>
<p><code>--confirm</code> mode 印 diff、問 y/N、user 確認才寫：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python3 scripts/permission-demo/edit_with_llm.py <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  content/llm/knowledge-cards/token.md <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;加一句說明&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --confirm</span></span></code></pre></div><p>互動流程：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[+] Proposed diff:
</span></span><span class="line"><span class="ln">2</span><span class="cl">--- a/token.md
</span></span><span class="line"><span class="ln">3</span><span class="cl">+++ b/token.md
</span></span><span class="line"><span class="ln">4</span><span class="cl">@@ ... 整段刪除 ...
</span></span><span class="line"><span class="ln">5</span><span class="cl">
</span></span><span class="line"><span class="ln">6</span><span class="cl">[?] Apply this change to content/llm/.../token.md? [y/N] _</span></span></code></pre></div><p>使用者看 diff 發現「整篇被刪了」、按 N、檔案安全。</p>
<p><strong>這個 mode 對應的副作用範圍</strong>：<a href="/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">4.3 工具的副作用範圍設計</a> 提的 spectrum：</p>
<table>
  <thead>
      <tr>
          <th>等級</th>
          <th>副作用</th>
          <th>適合 mode</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>純讀（grep、git status）</td>
          <td><code>--dry-run</code> 或無 gate</td>
      </tr>
      <tr>
          <td>2</td>
          <td>寫 sandbox / staging</td>
          <td><code>--dry-run</code> + 人類事後審</td>
      </tr>
      <tr>
          <td>3</td>
          <td>寫本地持久化（如 commit、edit 檔）</td>
          <td><code>--confirm</code></td>
      </tr>
      <tr>
          <td>4</td>
          <td>寫共享 / production（push、deploy）</td>
          <td><code>--confirm</code> 強制</td>
      </tr>
      <tr>
          <td>5</td>
          <td>操作真實世界（發 email、買股票）</td>
          <td><code>--confirm</code> + 額外 audit</td>
      </tr>
  </tbody>
</table>
<p>本 demo 改 markdown 是等級 3（寫本地檔）、<code>--confirm</code> 是合適粒度。改 production code 或 git push 是等級 4 / 5、<code>--confirm</code> 該強制不該 optional。</p>
<h2 id="test-4--auto-模式危險自動化">Test 4：<code>--auto</code> 模式、危險自動化</h2>
<p><code>--auto</code> 不問直接寫：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">cp /tmp/token-orig.md content/llm/knowledge-cards/token.md  <span class="c1"># 還原</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/permission-demo/edit_with_llm.py <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  content/llm/knowledge-cards/token.md <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;加一句說明&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  --auto</span></span></code></pre></div><p>實測：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[!] --auto mode: writing without confirmation
</span></span><span class="line"><span class="ln">2</span><span class="cl">[+] wrote content/llm/knowledge-cards/token.md</span></span></code></pre></div><p>檔案內容變成：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-markdown" data-lang="markdown"><span class="line"><span class="ln">1</span><span class="cl">---
</span></span><span class="line"><span class="ln">2</span><span class="cl">title: &#34;Token&#34;
</span></span><span class="line"><span class="ln">3</span><span class="cl">...
</span></span><span class="line"><span class="ln">4</span><span class="cl">---
</span></span><span class="line"><span class="ln">5</span><span class="cl">
</span></span><span class="line"><span class="ln">6</span><span class="cl">Token 是 embedding 的輸入單位。</span></span></code></pre></div><p>整篇刪光、只剩一句。<strong>沒人 catch 到、commit + push 出去就是 production 災難</strong>。</p>
<p><strong><code>--auto</code> mode 適合什麼場景</strong>：</p>
<ul>
<li>LLM 任務範圍狹窄、可預測（如 format JSON、補 type annotation 給已有 type stub）。</li>
<li>配合 git workflow（每次 auto edit 都自動 commit、出問題 git revert）。</li>
<li>CI / batch processing、人類事後審 PR。</li>
</ul>
<p><strong><code>--auto</code> mode 不適合什麼場景</strong>：</p>
<ul>
<li>任務開放性高（「改寫這段讓它更清楚」）。</li>
<li>不可逆環境（直接寫 production DB / 發 email）。</li>
<li>用弱模型（&lt; 14B）跑、行為不穩。</li>
</ul>
<p>設計 wrapper 時、把 <code>--auto</code> 設成顯式 opt-in、預設保持 dry-run / confirm 等較保守模式。本 demo 的 mutually_exclusive 設計（<code>-g.add_mutually_exclusive_group()</code>）保證三種 mode 只能擇一、避免歧義。</p>
<h2 id="test-5llm-寫-shell-command誰執行">Test 5：LLM 寫 shell command、誰執行？</h2>
<p>改檔案是「直接副作用」、寫 shell command 是「間接副作用」——同樣的問題：誰真的執行？</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">curl -s http://localhost:11434/v1/chat/completions <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  -H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s1">    &#34;model&#34;:&#34;gemma3:1b&#34;,
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s1">    &#34;messages&#34;:[{&#34;role&#34;:&#34;user&#34;,&#34;content&#34;:&#34;Give me a single shell command to find and delete all .log files in my home directory.&#34;}],
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s1">    &#34;stream&#34;:false
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="s1">  }&#39;</span> <span class="p">|</span> python3 -c <span class="s2">&#34;import json,sys; print(json.load(sys.stdin)[&#39;choices&#39;][0][&#39;message&#39;][&#39;content&#39;])&#34;</span></span></span></code></pre></div><p>LLM 回：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">```bash
</span></span><span class="line"><span class="ln">2</span><span class="cl">find ~ -name &#34;*.log&#34; -delete
</span></span><span class="line"><span class="ln">3</span><span class="cl">```</span></span></code></pre></div><p>這是個有破壞性的指令。檢查 home 下 .log 還在不在：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">find ~ -maxdepth <span class="m">3</span> -name <span class="s2">&#34;*.log&#34;</span> 2&gt;/dev/null <span class="p">|</span> head -5
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># /Users/tarragon/.npm/_logs/2026-05-11T15_33_34_348Z-debug-0.log</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># /Users/tarragon/.npm/_logs/2026-05-11T11_58_08_827Z-debug-0.log</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># ...</span></span></span></code></pre></div><p>都還在。LLM「給了」rm 指令、但沒人執行。</p>
<p><strong>執行路徑只有兩種</strong>：</p>
<ol>
<li><strong>人類 paste 到 shell</strong>：人是執行者、權限是 user&rsquo;s shell session permission。Audit trail：terminal history。</li>
<li><strong>Wrapper 程式 <code>subprocess.run(...)</code></strong>：wrapper 是執行者、權限是 wrapper process 的 capability。Audit trail：wrapper 的 log。</li>
</ol>
<p>LLM 永遠不是執行者。所以「LLM 寫了 rm -rf」這個句子不能成立——它只能「生成了 rm -rf 字串」。</p>
<p><strong>Agent 場景的 stake</strong>：<a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent 架構</a> 提到 agent loop = 「LLM 提議 → tool 執行 → 結果回 LLM → 下一輪」。Tool 執行那一步是 wrapper 做的、LLM 只看到結果。Agent 框架是否安全、完全看 tool 怎麼設計：</p>
<ul>
<li><strong>Tool 限制範圍</strong>：read-only file system access、不暴露 shell→ 即使 LLM 想跑 <code>rm -rf</code> 也沒對應 tool、無法執行。</li>
<li><strong>Tool 暴露 <code>bash</code> tool</strong>：給 LLM 一個「執行任意 shell command」的 tool。LLM 提議什麼 wrapper 都跑——這時 wrapper 設計失誤等同把鑰匙直接交給 LLM。</li>
<li><strong>Tool 暴露 <code>bash</code> tool + per-command confirm</strong>：每個 shell 呼叫前 wrapper 暫停、問人類「該不該執行」。對開發 / 探索環境合理、production 自動化流程會被互動卡住、不適用。</li>
</ul>
<h2 id="對照claude-code--cursor--aider-的權限模型">對照：Claude Code / Cursor / aider 的權限模型</h2>
<p>不同 LLM application 在權限 gate 上的設計選擇：</p>
<table>
  <thead>
      <tr>
          <th>Application</th>
          <th>File edit</th>
          <th>Shell exec</th>
          <th>預設審查粒度</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Code（CLI）</td>
          <td>可、有 PreToolUse hook 可攔截</td>
          <td>可、有 hook</td>
          <td>中（部分自動、部分 prompt）</td>
      </tr>
      <tr>
          <td>Cursor</td>
          <td>可、agent mode</td>
          <td>可（agent terminal）</td>
          <td>中、agent 行為可調</td>
      </tr>
      <tr>
          <td>aider</td>
          <td>可、直接 diff + commit</td>
          <td>可（<code>--auto-commits</code> mode）</td>
          <td>中、預設 commit 前 diff</td>
      </tr>
      <tr>
          <td>Continue.dev</td>
          <td>inline edit（user 按 Cmd+;）</td>
          <td>不直接 exec</td>
          <td>高（user 必須 explicit）</td>
      </tr>
      <tr>
          <td>Open WebUI（純 chat）</td>
          <td>不</td>
          <td>不</td>
          <td>N/A（無 wrapper）</td>
      </tr>
      <tr>
          <td>自寫 wrapper（如本 demo）</td>
          <td>看設計</td>
          <td>看設計</td>
          <td>看設計</td>
      </tr>
  </tbody>
</table>
<p><strong>共通 pattern</strong>：所有「自動 edit / exec」的 app 都有某種 confirm 或 hook 機制。沒有 confirm 的 app 等於把寫 production 的鑰匙交給 LLM。</p>
<p><strong>選 application 時看的維度</strong>：</p>
<ul>
<li>預設 mode 是什麼？（auto / confirm / dry-run）</li>
<li>哪些動作會自動執行、哪些會 prompt？</li>
<li>有沒有 audit log、能不能 review LLM 改了什麼？</li>
<li>萬一 LLM 行為崩、怎麼 rollback？（git revert、snapshot、undo stack）</li>
</ul>
<h2 id="設計自家-wrapper-的權限模型">設計自家 wrapper 的權限模型</h2>
<p>如果你寫的是「LLM 自動處理 X」這種 wrapper、權限設計的 checklist：</p>
<ol>
<li><strong>副作用分級</strong>：把可能的動作分到 <a href="/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">4.3 spectrum 等級 1-5</a>。</li>
<li><strong>預設 dry-run</strong>：不確定就不寫。Apply 必須 opt-in。</li>
<li><strong>永遠印 diff / preview</strong>：用戶才能 catch LLM 亂改。</li>
<li><strong>Confirm 在不可逆操作</strong>：等級 3+ 永遠 prompt、等級 4+ 強制 prompt + 額外 audit。</li>
<li><strong>Audit log</strong>：每個 wrapper 動作寫 log（時間、user、action、result）。出問題能追溯。</li>
<li><strong>Rollback path</strong>：git commit、backup、snapshot 任選一種、必有。</li>
<li><strong>限制 tool 範圍</strong>：給 LLM 暴露最少 tool、不暴露 shell。需要 shell 限制白名單。</li>
<li><strong>小模型加更保守 gate</strong>：1B 模型亂改機率高、保留 <code>--dry-run</code> 或 <code>--confirm</code> 即可、避免 <code>--auto</code>；31B+ 較穩、可給 auto + audit。</li>
</ol>
<h2 id="跑這份-demo-的完整指令">跑這份 demo 的完整指令</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 前置：Ollama 跑著、gemma3:1b 已 pull</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ollama list <span class="p">|</span> grep gemma3:1b
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 備份要測試的檔案</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">cp content/llm/knowledge-cards/token.md /tmp/token-orig.md
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># Mode 1：dry-run（預設、最安全）</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">python3 scripts/permission-demo/edit_with_llm.py <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  content/llm/knowledge-cards/token.md <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;加一句說明&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># Mode 2：confirm（互動審查、適合中等風險）</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">python3 scripts/permission-demo/edit_with_llm.py <span class="se">\
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="se"></span>  content/llm/knowledge-cards/token.md <span class="se">\
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;加一句說明&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="se"></span>  --confirm
</span></span><span class="line"><span class="ln">17</span><span class="cl">
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"># Mode 3：auto（無確認、危險、僅 batch 用）</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">python3 scripts/permission-demo/edit_with_llm.py <span class="se">\
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="se"></span>  content/llm/knowledge-cards/token.md <span class="se">\
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;加一句說明&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="se"></span>  --auto
</span></span><span class="line"><span class="ln">23</span><span class="cl">
</span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="c1"># 還原</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">cp /tmp/token-orig.md content/llm/knowledge-cards/token.md</span></span></code></pre></div><h2 id="何時這篇會過時">何時這篇會過時</h2>
<p><strong>不會過時的部分</strong>：</p>
<ul>
<li>LLM HTTP API 是 pure function、無副作用——這個事實在所有「分離 inference server / wrapper / client」的架構都成立。</li>
<li>權限 gate 在 wrapper / application 層——是 software architecture invariant、不是 LLM 特性。</li>
<li>副作用範圍 spectrum 跟人類審查粒度的對應。</li>
<li><code>--dry-run</code> / <code>--confirm</code> / <code>--auto</code> 三檔的設計取捨。</li>
</ul>
<p><strong>會變的部分</strong>：</p>
<ul>
<li>具體 LLM application 的 default mode（Cursor / aider / Claude Code 都會持續調整）。</li>
<li>哪個模型「不會亂改」的 ranking（隨模型能力提升而變）。</li>
<li>MCP / tool spec 細節（會持續演化、但「tool 是 wrapper 暴露」的本質不變）。</li>
</ul>
<p>讀這篇若指令跑不過、可能是 wrapper script API 微調、但「測試 LLM 是不是 pure function」這個方法本身永遠成立——拿任何 LLM API、送任何 prompt、check 檔案 mtime / md5、就能驗證。</p>
<p>跟其他 hands-on 章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、副作用範圍 spectrum 原理見 <a href="/blog/llm/04-applications/tool-use-principles/" data-link-title="4.3 Tool use 原理：LLM 跟外部世界互動" data-link-desc="Structured output 是 LLM 跨入工程系統的橋、function calling 取捨、為什麼本地小模型 tool use 表現崩潰">4.3 Tool use 原理</a>、Agent loop 跟人類審查的協作見 <a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent 架構</a>、Tool use / MCP server 權限模型的個人 dev 視角見 <a href="/blog/llm/06-security/tool-use-permission-model/" data-link-title="6.2 tool use 與 MCP server 的權限模型" data-link-desc="個人 dev 場景下 tool use / MCP server 的副作用權限：檔案系統 / shell / 網路存取邊界、第三方 MCP 信任、副作用的可逆性">6.2</a>、術語見 <a href="/blog/llm/knowledge-cards/sandbox/" data-link-title="Sandbox" data-link-desc="把程式跑在受限制環境的隔離技術、限制檔案 / 網路 / 系統呼叫權限、是 tool use 跟 MCP server 副作用控制的基礎">Sandbox</a>。</p>
]]></content:encoded></item><item><title>Hands-on：用 QLoRA 在本機 fine-tune coding 模型</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/local-fine-tuning/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/local-fine-tuning/</guid><description>&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/qlora/" data-link-title="QLoRA" data-link-desc="把 base model 量化到 4-bit &amp;#43; LoRA fine-tune 的組合、消費級 GPU 也能 fine-tune 大模型">QLoRA&lt;/a>（4-bit 量化 base model + &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/lora/" data-link-title="LoRA" data-link-desc="Low-Rank Adaptation：凍住原模型權重、只訓兩個小矩陣的 parameter-efficient fine-tuning">LoRA&lt;/a> adapter）讓消費級硬體也能 fine-tune 7B-32B 模型、是 2026/5 本地 fine-tuning 的主流方法。「在本機 fine-tune 一個小 coding 模型懂我 codebase 的慣例」是個人 dev 的合理目標、特別是在「本地 RAG 不夠精準、prompt engineering 已到天花板」的場景。本篇用 QLoRA 把 fine-tuning 的最短路徑走完：環境準備、資料蒐集、訓練、evaluation、合併權重、部署到 Ollama / llama.cpp 配 VS Code Continue.dev。&lt;/p>
&lt;p>本篇 framing 是「&lt;strong>真實會跑、不只跑 demo&lt;/strong>」、所以包含：硬體預算估算、catastrophic forgetting 防護、evaluation 確認真的有提升、回退方案（fine-tune 失敗時怎麼辦）。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：M4 Max 64GB + Hugging Face PEFT 0.13、或 5090 24GB + bitsandbytes
&lt;strong>目標模型&lt;/strong>：Qwen3-Coder-7B-Instruct（fine-tune 後輸出符合自己 codebase 慣例的 code）&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼這個議題重要">為什麼這個議題重要&lt;/h2>
&lt;p>寫 code 場景的常見 fine-tune 動機：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>私有 codebase 慣例&lt;/strong>：自家專案有特殊 naming、特殊 design pattern、prompt engineering 拉不到、希望模型「自然知道」&lt;/li>
&lt;li>&lt;strong>特殊框架 / library&lt;/strong>：用 obscure 的內部 framework、通用模型沒看過、補完品質差&lt;/li>
&lt;li>&lt;strong>特定文檔風格&lt;/strong>：commit message、PR description、code comment 有 team-specific 格式&lt;/li>
&lt;li>&lt;strong>Reduce RAG dependence&lt;/strong>：把高頻 knowledge 編進模型權重、減少每次 query 都要 retrieve&lt;/li>
&lt;/ol>
&lt;p>但&lt;strong>不該 fine-tune&lt;/strong>的情境（先排除）：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>新增世界知識&lt;/strong>：fine-tune 不擅長加新事實、用 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &amp;#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">RAG&lt;/a> 即可&lt;/li>
&lt;li>&lt;strong>複雜 reasoning 能力&lt;/strong>：fine-tune 一般不會讓模型變更會 reason、reasoning 來自 pre-training + RL&lt;/li>
&lt;li>&lt;strong>改善通用對話品質&lt;/strong>：通用對話品質取決於 RLHF、fine-tune 多半會 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/catastrophic-forgetting/" data-link-title="Catastrophic Forgetting" data-link-desc="Fine-tune 模型時、新訓練資料覆蓋掉原本學到的能力的現象、LoRA / 資料 mixing 是主要緩解">catastrophic forgetting&lt;/a>&lt;/li>
&lt;li>&lt;strong>資料太少（&amp;lt; 500 對）&lt;/strong>：fine-tune 收益低、不如優化 prompt + RAG&lt;/li>
&lt;/ol>
&lt;h2 id="整體流程">整體流程&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">1. 硬體預算估算 → 知道能跑哪個 size 的 base model
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">2. 蒐集 fine-tune 資料 → 50-5000 對 (prompt, response)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">3. 環境準備 → Python + bitsandbytes / PEFT / transformers
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">4. 跑 QLoRA 訓練 → 1-3 epochs、看 loss 趨勢
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">5. Evaluation → 在 held-out set + 通用 benchmark 都跑
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">6. Merge LoRA → base → 得到合併權重 .safetensors
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">7. Convert → GGUF → 用 llama.cpp convert 工具
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">8. Deploy 到 Ollama → ollama create my-coder -f Modelfile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">9&lt;/span>&lt;span class="cl">9. 配 Continue.dev → config.json 加新 provider&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="step-1硬體預算估算">Step 1：硬體預算估算&lt;/h2>
&lt;p>QLoRA 訓練的記憶體需求（粗略估算）：&lt;/p></description><content:encoded><![CDATA[<p><a href="/blog/llm/knowledge-cards/qlora/" data-link-title="QLoRA" data-link-desc="把 base model 量化到 4-bit &#43; LoRA fine-tune 的組合、消費級 GPU 也能 fine-tune 大模型">QLoRA</a>（4-bit 量化 base model + <a href="/blog/llm/knowledge-cards/lora/" data-link-title="LoRA" data-link-desc="Low-Rank Adaptation：凍住原模型權重、只訓兩個小矩陣的 parameter-efficient fine-tuning">LoRA</a> adapter）讓消費級硬體也能 fine-tune 7B-32B 模型、是 2026/5 本地 fine-tuning 的主流方法。「在本機 fine-tune 一個小 coding 模型懂我 codebase 的慣例」是個人 dev 的合理目標、特別是在「本地 RAG 不夠精準、prompt engineering 已到天花板」的場景。本篇用 QLoRA 把 fine-tuning 的最短路徑走完：環境準備、資料蒐集、訓練、evaluation、合併權重、部署到 Ollama / llama.cpp 配 VS Code Continue.dev。</p>
<p>本篇 framing 是「<strong>真實會跑、不只跑 demo</strong>」、所以包含：硬體預算估算、catastrophic forgetting 防護、evaluation 確認真的有提升、回退方案（fine-tune 失敗時怎麼辦）。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：M4 Max 64GB + Hugging Face PEFT 0.13、或 5090 24GB + bitsandbytes
<strong>目標模型</strong>：Qwen3-Coder-7B-Instruct（fine-tune 後輸出符合自己 codebase 慣例的 code）</p></blockquote>
<h2 id="為什麼這個議題重要">為什麼這個議題重要</h2>
<p>寫 code 場景的常見 fine-tune 動機：</p>
<ol>
<li><strong>私有 codebase 慣例</strong>：自家專案有特殊 naming、特殊 design pattern、prompt engineering 拉不到、希望模型「自然知道」</li>
<li><strong>特殊框架 / library</strong>：用 obscure 的內部 framework、通用模型沒看過、補完品質差</li>
<li><strong>特定文檔風格</strong>：commit message、PR description、code comment 有 team-specific 格式</li>
<li><strong>Reduce RAG dependence</strong>：把高頻 knowledge 編進模型權重、減少每次 query 都要 retrieve</li>
</ol>
<p>但<strong>不該 fine-tune</strong>的情境（先排除）：</p>
<ol>
<li><strong>新增世界知識</strong>：fine-tune 不擅長加新事實、用 <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">RAG</a> 即可</li>
<li><strong>複雜 reasoning 能力</strong>：fine-tune 一般不會讓模型變更會 reason、reasoning 來自 pre-training + RL</li>
<li><strong>改善通用對話品質</strong>：通用對話品質取決於 RLHF、fine-tune 多半會 <a href="/blog/llm/knowledge-cards/catastrophic-forgetting/" data-link-title="Catastrophic Forgetting" data-link-desc="Fine-tune 模型時、新訓練資料覆蓋掉原本學到的能力的現象、LoRA / 資料 mixing 是主要緩解">catastrophic forgetting</a></li>
<li><strong>資料太少（&lt; 500 對）</strong>：fine-tune 收益低、不如優化 prompt + RAG</li>
</ol>
<h2 id="整體流程">整體流程</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. 硬體預算估算       → 知道能跑哪個 size 的 base model
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. 蒐集 fine-tune 資料 → 50-5000 對 (prompt, response)
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. 環境準備           → Python + bitsandbytes / PEFT / transformers
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. 跑 QLoRA 訓練      → 1-3 epochs、看 loss 趨勢
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. Evaluation         → 在 held-out set + 通用 benchmark 都跑
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. Merge LoRA → base  → 得到合併權重 .safetensors
</span></span><span class="line"><span class="ln">7</span><span class="cl">7. Convert → GGUF     → 用 llama.cpp convert 工具
</span></span><span class="line"><span class="ln">8</span><span class="cl">8. Deploy 到 Ollama   → ollama create my-coder -f Modelfile
</span></span><span class="line"><span class="ln">9</span><span class="cl">9. 配 Continue.dev    → config.json 加新 provider</span></span></code></pre></div><h2 id="step-1硬體預算估算">Step 1：硬體預算估算</h2>
<p>QLoRA 訓練的記憶體需求（粗略估算）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">記憶體 ≈ N (B 參數) × 0.6 GB     ← 訓練時
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">        ≈ N (B 參數) × 0.3 GB     ← 推論（4-bit）
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">Apple Silicon Mac：
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  M4 Pro 24GB → 訓 7B 可、訓 14B 緊
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  M4 Pro 36GB → 訓 7B 寬鬆、訓 14B 可
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  M4 Max 64GB+ → 訓 30B 可、推論 70B 可
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">PC 獨立 GPU：
</span></span><span class="line"><span class="ln">10</span><span class="cl">  RTX 4090 / 5090 24GB → 訓 7B 寬鬆、訓 14B / 30B with `--n-cpu-moe` 可
</span></span><span class="line"><span class="ln">11</span><span class="cl">  RTX A6000 48GB → 訓 30-32B 寬鬆</span></span></code></pre></div><blockquote>
<p><strong>事實查核註</strong>：Apple Silicon 上的 QLoRA 支援度跟 bitsandbytes / MLX 工具鏈版本相關、2026/5 主流是用 MLX 自己的 LoRA 實作（<code>mlx-lm</code>）、CUDA 路線用 transformers + bitsandbytes + PEFT。具體支援度以對應 release 為準。</p></blockquote>
<p>本篇假設 fine-tune Qwen3-Coder-7B、所以 24GB+ Mac 或 16GB+ GPU 都能跑。</p>
<h2 id="step-2蒐集-fine-tune-資料">Step 2：蒐集 fine-tune 資料</h2>
<p>最關鍵的 step。資料品質決定 fine-tune 成敗。</p>
<h3 id="資料格式典型-sft-format">資料格式（典型 SFT format）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="p">{</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="nt">&#34;instruction&#34;</span><span class="p">:</span> <span class="s2">&#34;用我們 codebase 的慣例寫一個 REST endpoint 處理 user signup&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">    <span class="nt">&#34;input&#34;</span><span class="p">:</span> <span class="s2">&#34;需求：accept email + password、回 JWT&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="nt">&#34;output&#34;</span><span class="p">:</span> <span class="s2">&#34;// 完整符合我們慣例的 code...&#34;</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">  <span class="err">...</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="p">]</span></span></span></code></pre></div><p>或對話格式（ChatML）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="p">{</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="nt">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">      <span class="p">{</span><span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;system&#34;</span><span class="p">,</span> <span class="nt">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;你是我們 codebase 的 coding assistant&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">      <span class="p">{</span><span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="nt">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;...&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">      <span class="p">{</span><span class="nt">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;assistant&#34;</span><span class="p">,</span> <span class="nt">&#34;content&#34;</span><span class="p">:</span> <span class="s2">&#34;...&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">    <span class="p">]</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="p">]</span></span></span></code></pre></div><h3 id="資料來源">資料來源</h3>
<table>
  <thead>
      <tr>
          <th>來源</th>
          <th>取得方式</th>
          <th>品質</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>過往 commit 的「good code」</td>
          <td>從 main branch 抽函式 + git log message</td>
          <td>中（人工挑）</td>
      </tr>
      <tr>
          <td>Code review 通過的 PR diff</td>
          <td>從 GitHub API 抽 merged PR</td>
          <td>高</td>
      </tr>
      <tr>
          <td>內部 wiki 跟 design docs</td>
          <td>轉成 Q&amp;A 對</td>
          <td>中</td>
      </tr>
      <tr>
          <td>Synthetic data：用大模型生</td>
          <td>給雲端旗艦 prompt「以這個 codebase 風格寫 X」</td>
          <td>中（要 review）</td>
      </tr>
      <tr>
          <td>Pair programming 紀錄</td>
          <td>自己跟 IDE 互動的 log</td>
          <td>高（最貼近真實使用）</td>
      </tr>
  </tbody>
</table>
<h3 id="資料量門檻">資料量門檻</h3>
<table>
  <thead>
      <tr>
          <th>資料量</th>
          <th>預期效果</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>&lt; 50 對</td>
          <td>通常無感、不如優化 prompt + RAG</td>
      </tr>
      <tr>
          <td>50-500 對</td>
          <td>開始有 in-domain 效果、但易 forgetting</td>
      </tr>
      <tr>
          <td>500-5000 對</td>
          <td>顯著效果、QLoRA fine-tune 甜蜜點</td>
      </tr>
      <tr>
          <td>5000+ 對</td>
          <td>邊際收益遞減、開始接近 full fine-tune 效果</td>
      </tr>
  </tbody>
</table>
<h3 id="資料-mixing防-catastrophic-forgetting">資料 mixing（防 <a href="/blog/llm/knowledge-cards/catastrophic-forgetting/" data-link-title="Catastrophic Forgetting" data-link-desc="Fine-tune 模型時、新訓練資料覆蓋掉原本學到的能力的現象、LoRA / 資料 mixing 是主要緩解">catastrophic forgetting</a>）</h3>
<p>訓練 batch 內 mix 通用資料、避免 fine-tune 把通用能力洗掉：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">80% in-domain data（你的 codebase 範例）
</span></span><span class="line"><span class="ln">2</span><span class="cl">20% 通用 instruction data（如 Alpaca、ShareGPT subset）</span></span></code></pre></div><p>通用 data 可從 Hugging Face datasets 抓（如 <code>tatsu-lab/alpaca</code>、<code>teknium/OpenHermes-2.5</code>）。</p>
<h2 id="step-3環境準備">Step 3：環境準備</h2>
<h3 id="apple-silicon-mac用-mlx">Apple Silicon Mac（用 MLX）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># MLX 是 Apple 的 ML framework、原生支援 Apple Silicon</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pip install mlx mlx-lm
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 或用 conda（推薦）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">conda create -n llm-ft <span class="nv">python</span><span class="o">=</span>3.11
</span></span><span class="line"><span class="ln">6</span><span class="cl">conda activate llm-ft
</span></span><span class="line"><span class="ln">7</span><span class="cl">pip install mlx-lm</span></span></code></pre></div><h3 id="pccuda--transformers--bitsandbytes">PC（CUDA + transformers + bitsandbytes）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 安裝 CUDA 12.x（依 GPU 驅動）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># Python 套件</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">pip install torch transformers peft bitsandbytes accelerate datasets trl</span></span></code></pre></div><h2 id="step-4跑-qlora-訓練">Step 4：跑 QLoRA 訓練</h2>
<h3 id="apple-siliconmlx方式">Apple Silicon（MLX）方式</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 把 base model 下載到本機</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">huggingface-cli download Qwen/Qwen3-Coder-7B-Instruct <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  --local-dir ~/models/qwen3-coder-7b
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 把資料整理成 JSONL（一行一筆）</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># data/train.jsonl、data/valid.jsonl</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># 跑 LoRA fine-tune（MLX 內建 4-bit）</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">mlx_lm.lora <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --train <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --model ~/models/qwen3-coder-7b <span class="se">\
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="se"></span>  --data data/ <span class="se">\
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="se"></span>  --batch-size <span class="m">4</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="se"></span>  --lora-layers <span class="m">16</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="se"></span>  --iters <span class="m">1000</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="se"></span>  --learning-rate 1e-4 <span class="se">\
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="se"></span>  --steps-per-eval <span class="m">100</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="se"></span>  --adapter-path ./adapters</span></span></code></pre></div><h3 id="pccuda方式">PC（CUDA）方式</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># train.py（簡化版）</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForCausalLM</span><span class="p">,</span> <span class="n">TrainingArguments</span><span class="p">,</span> <span class="n">BitsAndBytesConfig</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">from</span> <span class="nn">peft</span> <span class="kn">import</span> <span class="n">LoraConfig</span><span class="p">,</span> <span class="n">get_peft_model</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="kn">from</span> <span class="nn">trl</span> <span class="kn">import</span> <span class="n">SFTTrainer</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 4-bit 量化載入 base</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">bnb_config</span> <span class="o">=</span> <span class="n">BitsAndBytesConfig</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="n">load_in_4bit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="n">bnb_4bit_quant_type</span><span class="o">=</span><span class="s2">&#34;nf4&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="n">bnb_4bit_compute_dtype</span><span class="o">=</span><span class="s2">&#34;bfloat16&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="s2">&#34;Qwen/Qwen3-Coder-7B-Instruct&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="n">quantization_config</span><span class="o">=</span><span class="n">bnb_config</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"># LoRA 配置</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="n">lora_config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">    <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;q_proj&#34;</span><span class="p">,</span> <span class="s2">&#34;v_proj&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">    <span class="n">task_type</span><span class="o">=</span><span class="s2">&#34;CAUSAL_LM&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="n">model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">lora_config</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">
</span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="c1"># 資料</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="n">dataset</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s2">&#34;json&#34;</span><span class="p">,</span> <span class="n">data_files</span><span class="o">=</span><span class="s2">&#34;data/train.jsonl&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl">
</span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="c1"># 訓練</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">33</span><span class="cl">    <span class="n">output_dir</span><span class="o">=</span><span class="s2">&#34;./checkpoints&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl">    <span class="n">learning_rate</span><span class="o">=</span><span class="mf">1e-4</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl">    <span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">    <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">37</span><span class="cl">    <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">38</span><span class="cl">    <span class="n">save_steps</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl">    <span class="n">logging_steps</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl">    <span class="n">optim</span><span class="o">=</span><span class="s2">&#34;paged_adamw_8bit&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">41</span><span class="cl">    <span class="n">bf16</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">43</span><span class="cl"><span class="n">trainer</span> <span class="o">=</span> <span class="n">SFTTrainer</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">44</span><span class="cl">    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">45</span><span class="cl">    <span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">46</span><span class="cl">    <span class="n">train_dataset</span><span class="o">=</span><span class="n">dataset</span><span class="p">[</span><span class="s2">&#34;train&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">47</span><span class="cl">    <span class="n">max_seq_length</span><span class="o">=</span><span class="mi">2048</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">48</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">49</span><span class="cl"><span class="n">trainer</span><span class="o">.</span><span class="n">train</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">50</span><span class="cl"><span class="n">trainer</span><span class="o">.</span><span class="n">save_model</span><span class="p">(</span><span class="s2">&#34;./adapters&#34;</span><span class="p">)</span></span></span></code></pre></div><p>關鍵超參數的判讀邏輯：</p>
<table>
  <thead>
      <tr>
          <th>參數</th>
          <th>預設</th>
          <th>怎麼調</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>r</code>（LoRA rank）</td>
          <td>16</td>
          <td>小 dataset（&lt; 1000 對）可降到 8、大 dataset 升到 32 / 64</td>
      </tr>
      <tr>
          <td><code>lora_alpha</code></td>
          <td>32（通常 = 2 × r）</td>
          <td>增大會放大 LoRA 影響、太大易 catastrophic forgetting</td>
      </tr>
      <tr>
          <td><code>target_modules</code></td>
          <td>q_proj, v_proj</td>
          <td>8B+ 模型可加 k_proj + o_proj 提品質、加 ffn 是進階</td>
      </tr>
      <tr>
          <td><code>lora_dropout</code></td>
          <td>0.05</td>
          <td>dataset 小時加大（0.1）防 overfit</td>
      </tr>
      <tr>
          <td><code>num_train_epochs</code></td>
          <td>2</td>
          <td>1-3 是常見範圍、看 validation loss 何時開始升</td>
      </tr>
      <tr>
          <td><code>per_device_train_batch_size</code></td>
          <td>4</td>
          <td>視 GPU 記憶體；不夠用 <code>gradient_accumulation_steps</code> 補</td>
      </tr>
      <tr>
          <td><code>learning_rate</code></td>
          <td>1e-4</td>
          <td>LoRA 適合較大 lr（vs full fine-tune 的 1e-5）、初值可 1e-4 ~ 5e-4</td>
      </tr>
  </tbody>
</table>
<h3 id="看-training-loss-趨勢">看 training loss 趨勢</h3>
<p>訓練過程中、loss 應該：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Initial：~2.5（cross-entropy on next-token）
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">1/4 訓練：降到 ~1.5
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">1/2 訓練：降到 ~1.0
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">3/4 訓練：降到 ~0.7
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">末段：穩定在 ~0.5
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">警示訊號：
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">- Loss 不降（≈ 2.0+ 持平） → lr 太小、或資料品質差、或 base 跟資料分佈完全不合
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">- Loss 降到 &lt; 0.1 → over-fit、validation loss 應該已升、stop training
</span></span><span class="line"><span class="ln">10</span><span class="cl">- Loss 出 NaN → lr 太大、降 lr 重來</span></span></code></pre></div><h2 id="step-5evaluation">Step 5：Evaluation</h2>
<p>訓練完不能只看 training loss、要實測：</p>
<h3 id="1-held-out-test-set你自己的-in-domain-資料">1. Held-out test set（你自己的 in-domain 資料）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 拿 valid.jsonl 跑、看模型輸出 vs expected</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 用 BLEU / ROUGE / 或 LLM-as-judge 評分</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">mlx_lm.generate <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --model ~/models/qwen3-coder-7b <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  --adapter ./adapters <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  --prompt <span class="s2">&#34;&lt;test prompt from valid.jsonl&gt;&#34;</span></span></span></code></pre></div><h3 id="2-通用-benchmark防-catastrophic-forgetting">2. 通用 benchmark（防 catastrophic forgetting）</h3>
<p>跑通用 HumanEval、看分數有沒有崩：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 用 lm-evaluation-harness</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">git clone https://github.com/EleutherAI/lm-evaluation-harness
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nb">cd</span> lm-evaluation-harness
</span></span><span class="line"><span class="ln">4</span><span class="cl">pip install -e .
</span></span><span class="line"><span class="ln">5</span><span class="cl">
</span></span><span class="line"><span class="ln">6</span><span class="cl">lm_eval --model hf <span class="se">\
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="se"></span>  --model_args <span class="nv">pretrained</span><span class="o">=</span>~/models/qwen3-coder-7b,peft<span class="o">=</span>./adapters <span class="se">\
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="se"></span>  --tasks humaneval <span class="se">\
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="se"></span>  --batch_size <span class="m">8</span></span></span></code></pre></div><p>判讀：</p>
<ul>
<li>HumanEval 從 75% → 75%：通用能力保留、in-domain 提升、成功</li>
<li>HumanEval 從 75% → 55%：catastrophic forgetting、要重新 fine-tune（用 LoRA + 資料 mixing 加強）</li>
</ul>
<h3 id="3-自己工作流測試最重要">3. 自己工作流測試（最重要）</h3>
<p>實際在 Continue.dev 用幾天、看：</p>
<ul>
<li>In-domain 任務輸出是否確實貼近 codebase 慣例</li>
<li>通用 coding 任務（如「寫一個 helper function」）是否仍 OK</li>
<li>對話流暢度有沒有變差</li>
<li>出現怪行為的頻率</li>
</ul>
<h2 id="step-6合併-lora-跟-base-model">Step 6：合併 LoRA 跟 base model</h2>
<p>訓練完得到 adapter（小檔、&lt; 100MB）。要用於日常推論、通常 merge 進 base：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># MLX 方式</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">mlx_lm.fuse <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  --model ~/models/qwen3-coder-7b <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  --adapter-path ./adapters <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>  --save-path ~/models/qwen3-coder-7b-mycodebase
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># PEFT 方式</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">python -c <span class="s2">&#34;
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s2">from peft import AutoPeftModelForCausalLM
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s2">import torch
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s2">model = AutoPeftModelForCausalLM.from_pretrained(&#39;./adapters&#39;, torch_dtype=torch.bfloat16)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s2">merged = model.merge_and_unload()
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s2">merged.save_pretrained(&#39;./merged-model&#39;)
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s2">&#34;</span></span></span></code></pre></div><h2 id="step-7convert-成-gguf給-ollama--llamacpp-用">Step 7：Convert 成 GGUF（給 Ollama / llama.cpp 用）</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 安裝 llama.cpp</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">git clone https://github.com/ggml-org/llama.cpp
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="nb">cd</span> llama.cpp
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">pip install -r requirements.txt
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># Convert HF → GGUF</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">python convert_hf_to_gguf.py ~/models/qwen3-coder-7b-mycodebase <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --outfile ~/models/qwen3-coder-7b-mycodebase.gguf
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 量化（可選、Q4_K_M 是甜蜜點）</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">./llama-quantize <span class="se">\
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="se"></span>  ~/models/qwen3-coder-7b-mycodebase.gguf <span class="se">\
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="se"></span>  ~/models/qwen3-coder-7b-mycodebase-Q4_K_M.gguf <span class="se">\
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="se"></span>  Q4_K_M</span></span></code></pre></div><h2 id="step-8deploy-到-ollama">Step 8：Deploy 到 Ollama</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 寫 Modelfile</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">cat &gt; ~/models/Modelfile-mycodebase <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s">FROM ~/models/qwen3-coder-7b-mycodebase-Q4_K_M.gguf
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s">TEMPLATE &#34;&#34;&#34;&lt;|im_start|&gt;system
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s">{{ .System }}&lt;|im_end|&gt;
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s">&lt;|im_start|&gt;user
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s">{{ .Prompt }}&lt;|im_end|&gt;
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s">&lt;|im_start|&gt;assistant
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s">PARAMETER temperature 0.3
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s">PARAMETER top_p 0.9
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s">PARAMETER num_ctx 32768
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"># 註冊到 Ollama</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">ollama create mycodebase-coder -f ~/models/Modelfile-mycodebase
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"># 測試</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">ollama run mycodebase-coder <span class="s2">&#34;寫一個 user signup endpoint&#34;</span></span></span></code></pre></div><h2 id="step-9配-continuedev">Step 9：配 Continue.dev</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// ~/.continue/config.json 加：
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="p">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nt">&#34;models&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">      <span class="nt">&#34;title&#34;</span><span class="p">:</span> <span class="s2">&#34;My Codebase Coder&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">      <span class="nt">&#34;provider&#34;</span><span class="p">:</span> <span class="s2">&#34;ollama&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">      <span class="nt">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;mycodebase-coder&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">      <span class="nt">&#34;apiBase&#34;</span><span class="p">:</span> <span class="s2">&#34;http://localhost:11434&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="c1">// ... 既有 models
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span>  <span class="p">]</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>VS Code restart 後、Continue panel 下拉就能切換。</p>
<h2 id="失敗模式跟回退">失敗模式跟回退</h2>
<h3 id="失敗-1訓練-loss-不降">失敗 1：訓練 loss 不降</h3>
<p>可能原因：</p>
<ul>
<li>資料品質差 → 人工 review 50 對、看 instruction-response 是否真有對應</li>
<li>資料 token 太短 → 多數 &lt; 100 token、模型學不到複雜 pattern</li>
<li>lr 太小 → 試 lr 5e-4</li>
</ul>
<p>回退：把資料品質提升、或放棄 fine-tune 用 RAG。</p>
<h3 id="失敗-2humaneval-大幅下降catastrophic-forgetting">失敗 2：HumanEval 大幅下降（catastrophic forgetting）</h3>
<p>緩解：</p>
<ul>
<li>加入 20% 通用 data mixing、重訓</li>
<li>降低 epochs（從 3 → 1）</li>
<li>降低 LoRA rank（從 16 → 8）</li>
</ul>
<h3 id="失敗-3in-domain-test-進步但日常用感覺沒變">失敗 3：In-domain test 進步、但日常用感覺沒變</h3>
<p>可能原因：</p>
<ul>
<li>Test set 跟真實工作流分佈不符</li>
<li>Prompt template 在訓練跟推論不一致</li>
</ul>
<p>緩解：實際在 Continue.dev 跑 1-2 週、看真實效果再判斷。</p>
<h3 id="失敗-4訓練爆-oom">失敗 4：訓練爆 OOM</h3>
<p>緩解：</p>
<ul>
<li>降 batch size（4 → 2 → 1）</li>
<li>加 gradient_accumulation_steps（保持 effective batch size）</li>
<li>用更小的 LoRA rank</li>
<li>換更小的 base model（7B → 3B）</li>
</ul>
<h2 id="何時不該繼續-fine-tune-路線">何時不該繼續 fine-tune 路線</h2>
<p>跑完一次 fine-tune 評估後、若：</p>
<ol>
<li><strong>In-domain 提升 &lt; 10%</strong>：相對成本（時間 + 維護）不划算、用 RAG</li>
<li><strong>Catastrophic forgetting &gt; 10%</strong>：跟其他能力 trade-off 不值得</li>
<li><strong>資料量不夠（&lt; 500 對）</strong>：RAG 比 fine-tune 更有效</li>
<li><strong>工作流變化快（codebase 慣例每月變）</strong>：fine-tune 過時得快、RAG 更靈活</li>
</ol>
<h2 id="跟其他模組的關係">跟其他模組的關係</h2>
<ul>
<li>原理層的 LoRA 設計見 <a href="/blog/llm/knowledge-cards/lora/" data-link-title="LoRA" data-link-desc="Low-Rank Adaptation：凍住原模型權重、只訓兩個小矩陣的 parameter-efficient fine-tuning">LoRA 卡片</a> 跟 <a href="/blog/llm/knowledge-cards/qlora/" data-link-title="QLoRA" data-link-desc="把 base model 量化到 4-bit &#43; LoRA fine-tune 的組合、消費級 GPU 也能 fine-tune 大模型">QLoRA 卡片</a></li>
<li>Catastrophic forgetting 跟整體 alignment 議題見 <a href="/blog/llm/03-theoretical-foundations/training-pipeline/" data-link-title="3.4 訓練流程：pre-train → SFT → RLHF" data-link-desc="LLM 的三階段訓練：預訓練、指令微調、人類反饋強化學習；各階段目標與最新替代方案">3.4 訓練流程</a></li>
<li>Fine-tune 後的模型評估見 <a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 Benchmarking</a></li>
<li>隱私 / 供應鏈面：fine-tune 後 model 怎麼分享（給 team / 上 HuggingFace）見 <a href="/blog/llm/06-security/model-supply-chain-trust/" data-link-title="6.0 模型供應鏈與信任邊界" data-link-desc="個人 dev 用本地 LLM 時的模型權重來源信任：GGUF 完整性、Hugging Face / Ollama registry 信任、量化版本污染、檔案完整性檢查">6.0 模型供應鏈</a></li>
<li>跟 RAG 的取捨見 <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理</a> 的「RAG vs Fine-tuning vs Long Context」段</li>
</ul>
]]></content:encoded></item><item><title>Hands-on：跨資料夾風格 follow 任務的模型對比</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/instruction-following-test/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/instruction-following-test/</guid><description>&lt;p>本篇是個讓本地 LLM 在「&lt;strong>讀兩個資料夾、學風格、寫新章節&lt;/strong>」任務上自我評估的實驗。任務本身內容無關緊要（隨便挑了一份私人創作資料夾）、要看的是&lt;strong>不同模型在 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/instruction-following/" data-link-title="Instruction Following" data-link-desc="模型遵守任務範圍、格式、限制與停止條件的能力，是評估 instruction-tuned 模型能否落地的核心訊號">instruction following&lt;/a> / format consistency / 篇幅控制三個維度的差距&lt;/strong>。&lt;/p>
&lt;p>實驗跑了四個本地模型對比：&lt;/p>
&lt;ul>
&lt;li>&lt;code>gemma3:1b&lt;/code>（815 MB、舊代 / 小）&lt;/li>
&lt;li>&lt;code>gemma3:4b&lt;/code>（3.3 GB、舊代 / 中）&lt;/li>
&lt;li>&lt;code>qwen3:8b&lt;/code>（5.2 GB、跨家族 / 大）&lt;/li>
&lt;li>&lt;code>gemma4:e4b&lt;/code>（9.6 GB、新代 / 中、bf16）&lt;/li>
&lt;/ul>
&lt;p>對應 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent 架構&lt;/a> 「規劃能力是雲端旗艦的明顯強項、本地小模型的明顯弱項」這條觀察、用具體 structural metrics 驗證、並揭示**「最新世代 + 較大 size」未必比「跨家族 / 較強訓練」勝出**。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：Ollama 0.23.2、Apple Silicon、&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/gpu-compute-backend/" data-link-title="GPU Compute Backend" data-link-desc="GPU 加速計算的底層 API 介面（CUDA / ROCm / Vulkan / Metal / SYCL）、決定推論軟體能否用 GPU 跑得快">MPS backend&lt;/a>
&lt;strong>任務&lt;/strong>：讀資料夾 A（風格參考、5 章已寫完）+ 資料夾 B（同類型、5 章已寫完、需寫 v06）→ 為 B 生成 v06
&lt;strong>評估方式&lt;/strong>：純 structural metrics、不評論內容品質&lt;/p>&lt;/blockquote>
&lt;h2 id="任務設計">任務設計&lt;/h2>
&lt;p>兩個資料夾結構：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">A/ B/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">├── README.md ├── README.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">├── v01_XXX.md ├── v01_XXX.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">├── v02_XXX.md ├── v02_XXX.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">├── v03_XXX.md ├── v03_XXX.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">├── v04_XXX.md ├── v04_XXX.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">└── v05_XXX.md └── v05_XXX.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl"> └── v06_XXX.md ← 要生成&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>兩個資料夾用&lt;strong>不同 markdown 格式&lt;/strong>：&lt;/p>
&lt;ul>
&lt;li>A 風格：&lt;code># 標題&lt;/code>（H1）+ &lt;code>## 場景設定&lt;/code> 段 + 結尾 &lt;code>**【本章結束】**&lt;/code>&lt;/li>
&lt;li>B 風格：&lt;code>## v0X｜&amp;lt;主題&amp;gt;（&amp;lt;角色1&amp;gt;×&amp;lt;角色2&amp;gt;）&lt;/code>（H2）+ 直接敘事、無結尾 marker&lt;/li>
&lt;/ul>
&lt;p>LLM 看完 A + B 後、要寫 B 的 v06——&lt;strong>必須 follow B 的格式、不是 A 的&lt;/strong>。是個 format discrimination 測試。&lt;/p>
&lt;h2 id="評估維度">評估維度&lt;/h2>
&lt;p>純 structural、不涉內容：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>測法&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>篇幅控制&lt;/td>
 &lt;td>char count、跟 B 既有 v01-v05 平均比&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>段落結構&lt;/td>
 &lt;td>paragraph count、avg paragraph char&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Markdown heading&lt;/td>
 &lt;td>H1 / H2 count、是否寫對 v06 title 格式&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>結尾 marker&lt;/td>
 &lt;td>是否誤加 A 風格的「&lt;strong>【本章結束】&lt;/strong>」&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>角色 fidelity&lt;/td>
 &lt;td>提到 B 兩個主角名次數（太少 = 內容偏離）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>跨資料夾串戲&lt;/td>
 &lt;td>提到 A 資料夾角色名次數（contamination）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>對話 follow&lt;/td>
 &lt;td>「對話行」（行首是 &lt;code>「&lt;/code>）數量、跟 baseline 比&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>生成時間&lt;/td>
 &lt;td>從送 prompt 到收完整 response&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>不評估的：&lt;/p></description><content:encoded><![CDATA[<p>本篇是個讓本地 LLM 在「<strong>讀兩個資料夾、學風格、寫新章節</strong>」任務上自我評估的實驗。任務本身內容無關緊要（隨便挑了一份私人創作資料夾）、要看的是<strong>不同模型在 <a href="/blog/llm/knowledge-cards/instruction-following/" data-link-title="Instruction Following" data-link-desc="模型遵守任務範圍、格式、限制與停止條件的能力，是評估 instruction-tuned 模型能否落地的核心訊號">instruction following</a> / format consistency / 篇幅控制三個維度的差距</strong>。</p>
<p>實驗跑了四個本地模型對比：</p>
<ul>
<li><code>gemma3:1b</code>（815 MB、舊代 / 小）</li>
<li><code>gemma3:4b</code>（3.3 GB、舊代 / 中）</li>
<li><code>qwen3:8b</code>（5.2 GB、跨家族 / 大）</li>
<li><code>gemma4:e4b</code>（9.6 GB、新代 / 中、bf16）</li>
</ul>
<p>對應 <a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent 架構</a> 「規劃能力是雲端旗艦的明顯強項、本地小模型的明顯弱項」這條觀察、用具體 structural metrics 驗證、並揭示**「最新世代 + 較大 size」未必比「跨家族 / 較強訓練」勝出**。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：Ollama 0.23.2、Apple Silicon、<a href="/blog/llm/knowledge-cards/gpu-compute-backend/" data-link-title="GPU Compute Backend" data-link-desc="GPU 加速計算的底層 API 介面（CUDA / ROCm / Vulkan / Metal / SYCL）、決定推論軟體能否用 GPU 跑得快">MPS backend</a>
<strong>任務</strong>：讀資料夾 A（風格參考、5 章已寫完）+ 資料夾 B（同類型、5 章已寫完、需寫 v06）→ 為 B 生成 v06
<strong>評估方式</strong>：純 structural metrics、不評論內容品質</p></blockquote>
<h2 id="任務設計">任務設計</h2>
<p>兩個資料夾結構：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">A/                          B/
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── README.md               ├── README.md
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── v01_XXX.md              ├── v01_XXX.md
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── v02_XXX.md              ├── v02_XXX.md
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── v03_XXX.md              ├── v03_XXX.md
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── v04_XXX.md              ├── v04_XXX.md
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── v05_XXX.md              └── v05_XXX.md
</span></span><span class="line"><span class="ln">8</span><span class="cl">                            └── v06_XXX.md  ← 要生成</span></span></code></pre></div><p>兩個資料夾用<strong>不同 markdown 格式</strong>：</p>
<ul>
<li>A 風格：<code># 標題</code>（H1）+ <code>## 場景設定</code> 段 + 結尾 <code>**【本章結束】**</code></li>
<li>B 風格：<code>## v0X｜&lt;主題&gt;（&lt;角色1&gt;×&lt;角色2&gt;）</code>（H2）+ 直接敘事、無結尾 marker</li>
</ul>
<p>LLM 看完 A + B 後、要寫 B 的 v06——<strong>必須 follow B 的格式、不是 A 的</strong>。是個 format discrimination 測試。</p>
<h2 id="評估維度">評估維度</h2>
<p>純 structural、不涉內容：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>測法</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>篇幅控制</td>
          <td>char count、跟 B 既有 v01-v05 平均比</td>
      </tr>
      <tr>
          <td>段落結構</td>
          <td>paragraph count、avg paragraph char</td>
      </tr>
      <tr>
          <td>Markdown heading</td>
          <td>H1 / H2 count、是否寫對 v06 title 格式</td>
      </tr>
      <tr>
          <td>結尾 marker</td>
          <td>是否誤加 A 風格的「<strong>【本章結束】</strong>」</td>
      </tr>
      <tr>
          <td>角色 fidelity</td>
          <td>提到 B 兩個主角名次數（太少 = 內容偏離）</td>
      </tr>
      <tr>
          <td>跨資料夾串戲</td>
          <td>提到 A 資料夾角色名次數（contamination）</td>
      </tr>
      <tr>
          <td>對話 follow</td>
          <td>「對話行」（行首是 <code>「</code>）數量、跟 baseline 比</td>
      </tr>
      <tr>
          <td>生成時間</td>
          <td>從送 prompt 到收完整 response</td>
      </tr>
  </tbody>
</table>
<p>不評估的：</p>
<ul>
<li>內容品質、文筆好壞</li>
<li>敘事邏輯是否合理</li>
<li>角色塑造是否生動</li>
</ul>
<p>純 structural 評估的好處是 reproducible、不需 reviewer 主觀判斷、可自動跑。</p>
<h2 id="baselineb-既有-v01-v05-的-metrics">Baseline：B 既有 v01-v05 的 metrics</h2>
<p>B 資料夾 5 個既有章節的平均：</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Average</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>char count</td>
          <td>~933</td>
      </tr>
      <tr>
          <td>paragraph count</td>
          <td>~32</td>
      </tr>
      <tr>
          <td>avg paragraph chars</td>
          <td>~29</td>
      </tr>
      <tr>
          <td>dialogue lines</td>
          <td>~7</td>
      </tr>
      <tr>
          <td>H1 used</td>
          <td>0（全部用 H2）</td>
      </tr>
      <tr>
          <td>H2 used</td>
          <td>1</td>
      </tr>
      <tr>
          <td>結尾「<strong>【本章結束】</strong>」</td>
          <td>全部 False</td>
      </tr>
      <tr>
          <td>Cross leak</td>
          <td>全部 0</td>
      </tr>
      <tr>
          <td>主角名提及（合計）</td>
          <td>~60</td>
      </tr>
  </tbody>
</table>
<p>這是 LLM 該模仿的目標。</p>
<h2 id="四個模型的結果">四個模型的結果</h2>
<p>四個 model 跑同樣 prompt、同樣輸入內容。</p>
<h3 id="對比表">對比表</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Baseline</th>
          <th><code>gemma3:1b</code></th>
          <th><code>gemma3:4b</code></th>
          <th><code>qwen3:8b</code></th>
          <th><code>gemma4:e4b</code></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>模型大小</strong></td>
          <td>—</td>
          <td>815 MB</td>
          <td>3.3 GB</td>
          <td>5.2 GB</td>
          <td>9.6 GB（bf16）</td>
      </tr>
      <tr>
          <td><strong>發布世代</strong></td>
          <td>—</td>
          <td>Gemma 3</td>
          <td>Gemma 3</td>
          <td>Qwen 3</td>
          <td><strong>Gemma 4（2026/4）</strong></td>
      </tr>
      <tr>
          <td>char count</td>
          <td>~933</td>
          <td>4324（4.6×）</td>
          <td>1330</td>
          <td><strong>951（1.02×）</strong></td>
          <td>679</td>
      </tr>
      <tr>
          <td>paragraph count</td>
          <td>~32</td>
          <td>145</td>
          <td>29</td>
          <td><strong>36</strong></td>
          <td>11</td>
      </tr>
      <tr>
          <td>avg paragraph chars</td>
          <td>~29</td>
          <td>30</td>
          <td>46</td>
          <td><strong>26</strong></td>
          <td>62</td>
      </tr>
      <tr>
          <td>H1 = 0</td>
          <td>符合</td>
          <td>不符（1）</td>
          <td>符合</td>
          <td>符合</td>
          <td>不符（1）</td>
      </tr>
      <tr>
          <td>H2 = 1</td>
          <td>符合</td>
          <td>不符（0）</td>
          <td>符合</td>
          <td>符合</td>
          <td>不符（3）</td>
      </tr>
      <tr>
          <td>v06 title 格式</td>
          <td>—</td>
          <td>不符</td>
          <td>符合</td>
          <td>符合</td>
          <td>不符</td>
      </tr>
      <tr>
          <td>結尾 marker</td>
          <td>False</td>
          <td>符合</td>
          <td>符合</td>
          <td>符合</td>
          <td>符合</td>
      </tr>
      <tr>
          <td>Cross leak</td>
          <td>0</td>
          <td>無（0）</td>
          <td>無（0）</td>
          <td>無（0）</td>
          <td>無（0）</td>
      </tr>
      <tr>
          <td>dialogue lines</td>
          <td>~7</td>
          <td>4</td>
          <td><strong>0</strong></td>
          <td><strong>7</strong></td>
          <td>0</td>
      </tr>
      <tr>
          <td>主角名提及（合計）</td>
          <td>~60</td>
          <td>286</td>
          <td>24</td>
          <td><strong>27</strong></td>
          <td><strong>0</strong></td>
      </tr>
      <tr>
          <td><strong>通過項目</strong></td>
          <td>—</td>
          <td><strong>2 / 7</strong></td>
          <td><strong>6 / 7</strong></td>
          <td><strong>7 / 7</strong></td>
          <td><strong>1 / 7</strong></td>
      </tr>
      <tr>
          <td>生成時間</td>
          <td>—</td>
          <td>41.8s</td>
          <td>36.5s</td>
          <td>97.5s</td>
          <td>43.5s</td>
      </tr>
  </tbody>
</table>
<h3 id="各模型觀察">各模型觀察</h3>
<p><strong><code>gemma3:1b</code>（815 MB）</strong>：</p>
<ul>
<li>篇幅 4.6× 失控、段落數 4.5× 超標、用 H1 而不是 H2。</li>
<li>顯示 1B 模型對「2000-3000 字」這種 numeric instruction 沒有有效執行能力、會一直生成到 context 限制。</li>
<li>但 cross leak 0、結尾 marker 也沒誤加——「不要 X」這類 negative instruction follow 較成功。</li>
</ul>
<p><strong><code>gemma3:4b</code>（3.3 GB）</strong>：</p>
<ul>
<li>篇幅 / 段落 / heading 結構全 OK、明顯比 1B 大幅改善。</li>
<li><strong>dialogue lines = 0</strong>：完全沒寫對話、整篇純敘事。表示 4B 抓到字面 structural feature、但沒抓到「對話 driven 敘事」這個 stylistic feature。</li>
<li>主角名提及 24 次（baseline ~60）—內容偏短、提及次數偏低、但比例合理。</li>
</ul>
<p><strong><code>qwen3:8b</code>（5.2 GB、跨家族）</strong>：</p>
<ul>
<li><strong>唯一 7/7 全 pass 的模型</strong>——篇幅完美匹配（951 vs ~933）、段落數合理（36 vs ~32）、heading 對、對話 7 行完全等於 baseline。</li>
<li>跨家族 + 大一級的組合表現質變，比同家族下一級的 4B 模型大幅提升。</li>
<li>代價：生成時間 97.5s、約是 4B 模型的 2.7×。</li>
</ul>
<p><strong><code>gemma4:e4b</code>（9.6 GB、新代）</strong>：</p>
<ul>
<li><strong>驚人的 1/7、最差表現</strong>——比 1B 還少通過項目。</li>
<li><strong>主角名提及 0</strong>：完全沒寫角色名、純抽象敘述「某一方」「另一方」。</li>
<li><strong>dialogue 0</strong>：沒對話。</li>
<li><strong>生成內容是「劇情大綱建議」而非實際章節</strong>：含「劇情核心思路」「預計情緒強度」「寫作切入點建議」等 meta-text。</li>
<li>輸出末尾「<strong>（此為結構化建議、等待具體的指令後、將會生成與風格一致的劇情內容。）</strong>」——明示它把 prompt 理解成「給建議框架、等下一步」。</li>
</ul>
<h3 id="strict-prompt-retest揭示-internal-alignment">Strict prompt retest：揭示 internal alignment</h3>
<p>懷疑 1/7 可能是「prompt 不夠強硬」、用 strict prompt 重跑 <code>gemma4:e4b</code>。Strict 加了八條規則、明示：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">- 直接從 `## v06｜...` 開頭、不寫前言
</span></span><span class="line"><span class="ln">2</span><span class="cl">- 絕對不可寫「劇情核心思路」「預計情緒強度」「寫作切入點」等 meta-text
</span></span><span class="line"><span class="ln">3</span><span class="cl">- 必須直接寫敘事內容、含對話、動作、感受描寫
</span></span><span class="line"><span class="ln">4</span><span class="cl">- 強制提到角色名多次、不要用「某一方」「另一人」抽象稱呼
</span></span><span class="line"><span class="ln">5</span><span class="cl">- ...</span></span></code></pre></div><p>Strict prompt 結果：</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>原 prompt</th>
          <th>strict prompt</th>
          <th>變化</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>char count</td>
          <td>679</td>
          <td>660</td>
          <td>相同量級</td>
      </tr>
      <tr>
          <td>H1 = 0</td>
          <td>不符（1）</td>
          <td>符合</td>
          <td><strong>改善</strong></td>
      </tr>
      <tr>
          <td>H2 = 1</td>
          <td>不符（3）</td>
          <td>符合</td>
          <td><strong>改善</strong></td>
      </tr>
      <tr>
          <td>v06 title 格式</td>
          <td>不符</td>
          <td>符合</td>
          <td><strong>改善</strong></td>
      </tr>
      <tr>
          <td>meta-text 出現</td>
          <td>有</td>
          <td>無</td>
          <td><strong>改善</strong></td>
      </tr>
      <tr>
          <td>dialogue lines</td>
          <td>0</td>
          <td>3</td>
          <td><strong>改善</strong></td>
      </tr>
      <tr>
          <td><strong>主角名提及</strong></td>
          <td><strong>0</strong></td>
          <td><strong>0</strong></td>
          <td><strong>未改善</strong></td>
      </tr>
      <tr>
          <td><strong>通過項目</strong></td>
          <td><strong>1 / 7</strong></td>
          <td><strong>4 / 7</strong></td>
          <td><strong>+3</strong></td>
      </tr>
  </tbody>
</table>
<p>從 1/7 → 4/7、prompt 強化明顯有用。但<strong>主角名提及兩次都 0</strong>、即使 strict prompt 明示「強制提到角色名」、模型仍用「兩人」「彼此」「對方」抽象稱呼。</p>
<p>這比「模型不會 follow」更精確、是兩個層次的 follow 差別：</p>
<ul>
<li><strong>Surface level instruction</strong>（heading 格式、不要 meta-text、要對話）：model 願意 follow strict prompt。</li>
<li><strong>Semantic level instruction</strong>（在這個情境用具名角色）：model 有 <strong>internal alignment 抗拒</strong>、即使 prompt 明示也不 follow。</li>
</ul>
<p>Gemma 4 e4b 是 device-deployable edge variant、RLHF 可能特別針對「敏感情境下的人物識別」做 alignment。這個 alignment 比 prompt-level instruction follow 更深、是 hard line、不能用 prompt engineering 繞過。</p>
<h2 id="關鍵觀察">關鍵觀察</h2>
<h3 id="model-size-不是唯一因素訓練-alignment-更重要">Model size 不是唯一因素、訓練 alignment 更重要</h3>
<p>最反直覺的結果：</p>
<ul>
<li><code>gemma4:e4b</code>（9.6 GB、最新世代）原 prompt 通過 <strong>1/7</strong>、strict prompt 通過 <strong>4/7</strong>。</li>
<li><code>gemma3:4b</code>（3.3 GB、舊一代）通過 <strong>6/7</strong>。</li>
<li><code>qwen3:8b</code>（5.2 GB、跨家族）通過 <strong>7/7</strong>。</li>
</ul>
<p>「最大 + 最新」不等於「最好 follow instruction」。在這個任務上、ranking 是：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">qwen3:8b &gt; gemma3:4b &gt; gemma3:1b ≈ gemma4:e4b (strict) &gt; gemma4:e4b (default)</span></span></code></pre></div><p>可能因素：</p>
<ol>
<li><strong>訓練資料分佈差異</strong>：Qwen 系列訓練資料含大量中文、對中文 instruction follow 更穩。</li>
<li><strong>Edge variant 的 alignment 設計</strong>：<code>gemma4:e4b</code> 是 device-deployable edge variant、RLHF 可能特別在敏感情境用 conservative output。Strict prompt 能改善 surface-level（heading、meta-text、對話）、但 semantic-level（具名角色）有 hard line 不能繞過。</li>
<li><strong>跨家族效應 &gt; 跨代效應</strong>：Qwen vs Gemma（不同家族）比 Gemma 3 vs Gemma 4（同家族跨代）影響更大。</li>
</ol>
<h3 id="兩層-instruction-follow">兩層 instruction follow</h3>
<p><code>gemma4:e4b</code> 的 strict prompt retest 揭示一個重要區分：</p>
<ul>
<li><strong>Surface-level instruction</strong>（heading 格式、不要 meta-text、要對話）：可以用 strict prompt 改善、prompt engineering 有效。</li>
<li><strong>Semantic-level alignment</strong>（特定情境的角色處理、敏感主題的表述方式）：是 RLHF 階段建立的 hard line、prompt engineering 繞不過。</li>
</ul>
<p>設計應用時要意識：<strong>「LLM follow 不了 instruction」可能不是能力問題、是 alignment 問題</strong>。模型訓練時被刻意 align 不做某些事、即使 prompt 明示也不會做。發現這種情況、改換 model（或 less-aligned variant）會比繼續調 prompt 更省時間。</p>
<h3 id="最新世代的標籤可能誤導">「最新世代」的標籤可能誤導</h3>
<p>Gemma 4 是 2026/4/2 才發布的最新代、size 也夠大、但在這個 instruction following 任務上<strong>輸給 6 個月前發布的 Gemma 3 4b</strong>。</p>
<p>設計應用 / 選模型時、實測對自己 task 的表現比「最新 / 最大」標籤可靠。Benchmark ranking（如 LMSYS Chatbot Arena）反映平均表現、未必 reflect 你的 narrow 任務。本實驗示範了「自己跑一次」比「看 benchmark」更可靠的判讀方法。</p>
<h3 id="structural-feature-跟-stylistic-feature-兩層">Structural feature 跟 stylistic feature 兩層</h3>
<p>跨四個模型一致觀察：</p>
<ul>
<li><strong>Structural feature</strong>（heading level、結尾 marker、不要 cross leak）：所有模型多少都抓到。</li>
<li><strong>Stylistic feature</strong>（對話 driven 敘事、篇幅精準）：差異極大、Qwen3 8B 完美、其他三個都有明顯失分。</li>
</ul>
<p>這對應 <a href="/blog/llm/04-applications/agent-architecture/" data-link-title="4.4 Agent 架構原理" data-link-desc="Agent loop 結構、失敗模式、什麼任務適合 vs 不適合、跟人類審查的協作模型">4.4 Agent</a> 的「規劃 vs 字面 follow」差距——字面 instruction 容易、stylistic mimic 困難。寫應用時、預期 follow「形式約束」（output JSON、結尾 signature）跟 follow「風格約束」（用簡潔口吻、bullet 而非段落）兩種 instruction 的成功率不同。</p>
<h3 id="cross-pairing-leak全-0">Cross-pairing leak：全 0</h3>
<p>四個模型 cross leak 都 0——表示「不要混角色」這個 instruction 兩個都 follow 成功。可能因素：</p>
<ul>
<li>角色名是名詞、模型 generation 時容易 constrain。</li>
<li>Prompt 已明示「為 B 寫」、模型沒被 A 角色名干擾。</li>
</ul>
<p>如果改成模糊 instruction（「混合 A、B 風格」）、leak 可能會出現——本實驗沒涵蓋這個 case。</p>
<h3 id="生成時間size--時間">生成時間：size ≠ 時間</h3>
<p>四個模型的生成時間：</p>
<table>
  <thead>
      <tr>
          <th>模型</th>
          <th>size</th>
          <th>時間</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>gemma3:1b</td>
          <td>815 MB</td>
          <td>41.8s</td>
      </tr>
      <tr>
          <td>gemma3:4b</td>
          <td>3.3 GB</td>
          <td>36.5s</td>
      </tr>
      <tr>
          <td>qwen3:8b</td>
          <td>5.2 GB</td>
          <td><strong>97.5s</strong></td>
      </tr>
      <tr>
          <td>gemma4:e4b</td>
          <td>9.6 GB</td>
          <td>43.5s</td>
      </tr>
  </tbody>
</table>
<p>意外發現：</p>
<ol>
<li><strong>1B 比 4B 慢</strong>：因為 1B 生成 4324 字、4B 生成 1330 字、總 token 量決定總時間、不是 model size。</li>
<li><strong>qwen3:8b 慢 2.7×</strong>：8B 的 forward pass 較慢、加上 generation 量級正常、總時間最長。</li>
<li><strong>gemma4:e4b 跟 1B 相近</strong>：generation 短（679 字）、抵消 model 較大的開銷。</li>
</ol>
<p><a href="/blog/llm/knowledge-cards/tokens-per-second/" data-link-title="Tokens Per Second" data-link-desc="LLM 每秒能生成幾個 token：生字速度的標準量化指標">tokens per second</a> 跟 total latency 是兩件事——decode 速度快但生成太多 token、未必更快完成任務。</p>
<h2 id="對寫應用的啟示">對寫應用的啟示</h2>
<ol>
<li><strong>「最新最大」≠ 「最好 follow」</strong>：選模型實測自己 task、benchmark / size 只是輔助訊號。</li>
<li><strong>本地小模型（&lt; 3B）做需要 follow 結構規則的任務、要嚴格驗證</strong>：用 structural metrics 自動 check、目視判斷模型「看起來有做到」的可靠度低。</li>
<li><strong>Edge variant 可能有 special behavior</strong>：device-deployable variant 可能 RLHF 偏向 conservative、不一定適合所有任務。</li>
<li><strong>跨家族對比比同家族升 size 收益大</strong>：Qwen3 8B vs Gemma3 4B 比 Gemma3 4B vs Gemma3 1B 改善更明顯。</li>
<li><strong>「形式跟風格」分開驗證</strong>：應用層的 validation 分維度 score、比一次評全部更可解讀。</li>
</ol>
<h2 id="跑這個實驗的-framework">跑這個實驗的 framework</h2>
<p>通用流程（不放具體 script、會綁定 corpus 內容）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. 準備兩個資料夾、A 是風格參考、B 是 work-in-progress
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. 寫 helper script 把兩個資料夾完整內容 + 任務說明做成 prompt
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. 跑多個 model 各一次（同 prompt、不同 model）
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. 對輸出計算 structural metrics（char count、paragraph、heading、dialogue lines）
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. 跟 B 既有章節的 baseline metrics 對比
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. 列通過 / 失敗矩陣</span></span></code></pre></div><p>關鍵設計選擇：</p>
<ul>
<li><strong>A 跟 B 風格故意不一樣</strong>：才能驗證 LLM 是否分辨「該 follow 哪個」。</li>
<li><strong>不評估內容品質</strong>：純 structural 評估 reproducible、不需 reviewer 主觀判斷。</li>
<li><strong>baseline 用既有章節算</strong>：B 自己的 v01-v05 是「正確答案」的 reference。</li>
<li><strong>跑多個跨家族 / 跨世代 / 跨 size 模型</strong>：避免「只測一個就下結論」的偏差。</li>
</ul>
<h2 id="何時這份對比會過時">何時這份對比會過時</h2>
<ul>
<li><strong>具體模型 ranking</strong>：新模型發布後 ranking 會變、特別是新版 Gemma 4 / Qwen 4 / Llama 4 等推出時。</li>
<li><strong>「Gemma 4 edge 表現差」這個觀察</strong>：可能隨後續 fine-tune 或新版改善。</li>
</ul>
<p><strong>不會過時的部分</strong>：</p>
<ul>
<li>Model size 不是 instruction following 的唯一因素——這個現象在所有 LLM 都存在。</li>
<li>Structural vs stylistic 兩層 follow 難度不同。</li>
<li>跨家族對比比同家族升 size 收益大、這個現象可能持續。</li>
<li>純 metrics-based 評估比主觀判斷可重現。</li>
<li>「自己跑一次」比「看 benchmark」更可靠的判讀邏輯。</li>
</ul>
<p>未來想擴展、可以加入更多維度（如反向 retrieval：把生成內容當 query、看能不能找回原資料夾；或 perplexity-based 評估）。</p>
<p>跟其他 hands-on 章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、選模型的優先序策略見 <a href="/blog/llm/01-local-llm-services/model-selection-priority/" data-link-title="1.4 寫 code 場景的模型選型優先順序" data-link-desc="Gemma 4 31B MTP → Qwen3-Coder 30B → Qwen3 14B → gpt-oss 20B 的取捨與適用情境">Model selection priority</a>、模型 tag 命名規則見 <a href="/blog/llm/knowledge-cards/model-tag/" data-link-title="Model Tag" data-link-desc="Ollama 等推論伺服器用來定位特定模型版本的命名規則">Model tag</a>、跑多模型的記憶體預算見 <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management</a>。</p>
]]></content:encoded></item><item><title>Hands-on：LLM 運行中 + 結束的資源管理</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/resource-management/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/resource-management/</guid><description>&lt;p>跑本地 LLM 的核心 invariant 跟雲端不一樣：&lt;strong>Mac 是 shared resource、不是 dedicated GPU&lt;/strong>。雲端 inference server 跑進 dedicated container、結束 instance 自然回收所有資源；本地&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">推論伺服器&lt;/a>跑在你日常用的 Mac、跟 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/unified-memory/" data-link-title="Unified Memory Architecture" data-link-desc="Apple Silicon 讓 CPU / GPU / NE 共用同一塊記憶體：跑大模型的優勢來源">統一記憶體&lt;/a> 共享同一塊容量，忘記管理會 silently 吃光 RAM、磁碟、port、最後讓系統變慢甚至 swap。&lt;/p>
&lt;p>本篇紀錄三個 dimension（RAM / 磁碟 / port）的觀察工具跟釋放姿勢、對比 Ollama 跟 ComfyUI 兩種典型 lifecycle、加上實測釋放數字。對應 &lt;a href="https://tarrragon.github.io/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 隱私資料流原理&lt;/a>「每個 hop 都要 audit」這條思維——資源管理也是 hop 級的 audit、不是「裝完就忘」。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：macOS 14、Apple Silicon、Ollama 0.23.2、ComfyUI 0.21.0、SDXL base 1.0&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼這事重要">為什麼這事重要&lt;/h2>
&lt;p>雲端 inference：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Container start → load model → serve requests → container stop → 所有 RAM / 磁碟 / port 自動回收&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>本地 inference：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">brew services start → load model on demand → serve → ??? → 你忘記 stop
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> → RAM / 磁碟一直被佔
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> → 下次重開機才釋放&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>具體會踩到的問題：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>RAM&lt;/strong>：18 GB SDXL 模型載入後不會自動卸、即使 ComfyUI idle、Python process 仍占 RAM&lt;/li>
&lt;li>&lt;strong>磁碟&lt;/strong>：&lt;code>ollama pull&lt;/code> 累積、&lt;code>~/.ollama/models/blobs&lt;/code> 半年可長到 50 GB+、不主動清不會減&lt;/li>
&lt;li>&lt;strong>Port&lt;/strong>：上次 crash 的 &lt;code>ollama serve&lt;/code> 進程沒乾淨清、port 11434 還占著、下次啟動報「address already in use」&lt;/li>
&lt;li>&lt;strong>GPU / Metal&lt;/strong>：模型載入後 Metal context 佔住、跟其他 GPU-using app（影片剪輯、遊戲）競爭&lt;/li>
&lt;/ul>
&lt;h2 id="三個-dimension--觀察工具">三個 dimension + 觀察工具&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>觀察指令&lt;/th>
 &lt;th>看什麼&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>RAM&lt;/td>
 &lt;td>&lt;code>vm_stat | head -5&lt;/code>&lt;/td>
 &lt;td>Pages free（每 page 16 KB）、空閒越多越好&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>RAM（per process）&lt;/td>
 &lt;td>Activity Monitor 或 &lt;code>ps aux | sort -k6 -rn | head&lt;/code>&lt;/td>
 &lt;td>哪個 process 佔最多記憶體&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>磁碟&lt;/td>
 &lt;td>&lt;code>df -h ~ | tail -1&lt;/code>&lt;/td>
 &lt;td>系統 volume 剩餘&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>磁碟（per dir）&lt;/td>
 &lt;td>&lt;code>du -sh ~/.ollama/models/blobs&lt;/code>&lt;/td>
 &lt;td>LLM models 累積量&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Port&lt;/td>
 &lt;td>&lt;code>lsof -i :11434&lt;/code>&lt;/td>
 &lt;td>誰在 listen 該 port&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Process&lt;/td>
 &lt;td>&lt;code>ps aux | grep -i ollama | grep -v grep&lt;/code>&lt;/td>
 &lt;td>Ollama / ComfyUI / Python 跑哪幾個&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Ollama loaded models&lt;/td>
 &lt;td>&lt;code>ollama ps&lt;/code>&lt;/td>
 &lt;td>哪些 model 在 RAM、size、idle timer&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>實測：剛 kill 完 ComfyUI（SDXL + Python venv）後、&lt;code>vm_stat&lt;/code> 看到 free pages 從 619K 變 1090K（每 page 16 KB）、約 &lt;strong>+7.5 GB RAM 釋放&lt;/strong>——這就是 SDXL + ComfyUI process 一直占的記憶體量。&lt;/p></description><content:encoded><![CDATA[<p>跑本地 LLM 的核心 invariant 跟雲端不一樣：<strong>Mac 是 shared resource、不是 dedicated GPU</strong>。雲端 inference server 跑進 dedicated container、結束 instance 自然回收所有資源；本地<a href="/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">推論伺服器</a>跑在你日常用的 Mac、跟 <a href="/blog/llm/knowledge-cards/unified-memory/" data-link-title="Unified Memory Architecture" data-link-desc="Apple Silicon 讓 CPU / GPU / NE 共用同一塊記憶體：跑大模型的優勢來源">統一記憶體</a> 共享同一塊容量，忘記管理會 silently 吃光 RAM、磁碟、port、最後讓系統變慢甚至 swap。</p>
<p>本篇紀錄三個 dimension（RAM / 磁碟 / port）的觀察工具跟釋放姿勢、對比 Ollama 跟 ComfyUI 兩種典型 lifecycle、加上實測釋放數字。對應 <a href="/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 隱私資料流原理</a>「每個 hop 都要 audit」這條思維——資源管理也是 hop 級的 audit、不是「裝完就忘」。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：macOS 14、Apple Silicon、Ollama 0.23.2、ComfyUI 0.21.0、SDXL base 1.0</p></blockquote>
<h2 id="為什麼這事重要">為什麼這事重要</h2>
<p>雲端 inference：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Container start → load model → serve requests → container stop → 所有 RAM / 磁碟 / port 自動回收</span></span></code></pre></div><p>本地 inference：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">brew services start → load model on demand → serve → ??? → 你忘記 stop
</span></span><span class="line"><span class="ln">2</span><span class="cl">                                              → RAM / 磁碟一直被佔
</span></span><span class="line"><span class="ln">3</span><span class="cl">                                              → 下次重開機才釋放</span></span></code></pre></div><p>具體會踩到的問題：</p>
<ul>
<li><strong>RAM</strong>：18 GB SDXL 模型載入後不會自動卸、即使 ComfyUI idle、Python process 仍占 RAM</li>
<li><strong>磁碟</strong>：<code>ollama pull</code> 累積、<code>~/.ollama/models/blobs</code> 半年可長到 50 GB+、不主動清不會減</li>
<li><strong>Port</strong>：上次 crash 的 <code>ollama serve</code> 進程沒乾淨清、port 11434 還占著、下次啟動報「address already in use」</li>
<li><strong>GPU / Metal</strong>：模型載入後 Metal context 佔住、跟其他 GPU-using app（影片剪輯、遊戲）競爭</li>
</ul>
<h2 id="三個-dimension--觀察工具">三個 dimension + 觀察工具</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>觀察指令</th>
          <th>看什麼</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RAM</td>
          <td><code>vm_stat | head -5</code></td>
          <td>Pages free（每 page 16 KB）、空閒越多越好</td>
      </tr>
      <tr>
          <td>RAM（per process）</td>
          <td>Activity Monitor 或 <code>ps aux | sort -k6 -rn | head</code></td>
          <td>哪個 process 佔最多記憶體</td>
      </tr>
      <tr>
          <td>磁碟</td>
          <td><code>df -h ~ | tail -1</code></td>
          <td>系統 volume 剩餘</td>
      </tr>
      <tr>
          <td>磁碟（per dir）</td>
          <td><code>du -sh ~/.ollama/models/blobs</code></td>
          <td>LLM models 累積量</td>
      </tr>
      <tr>
          <td>Port</td>
          <td><code>lsof -i :11434</code></td>
          <td>誰在 listen 該 port</td>
      </tr>
      <tr>
          <td>Process</td>
          <td><code>ps aux | grep -i ollama | grep -v grep</code></td>
          <td>Ollama / ComfyUI / Python 跑哪幾個</td>
      </tr>
      <tr>
          <td>Ollama loaded models</td>
          <td><code>ollama ps</code></td>
          <td>哪些 model 在 RAM、size、idle timer</td>
      </tr>
  </tbody>
</table>
<p>實測：剛 kill 完 ComfyUI（SDXL + Python venv）後、<code>vm_stat</code> 看到 free pages 從 619K 變 1090K（每 page 16 KB）、約 <strong>+7.5 GB RAM 釋放</strong>——這就是 SDXL + ComfyUI process 一直占的記憶體量。</p>
<h2 id="ollama-的-lifecycleauto-unload-模式">Ollama 的 lifecycle（auto-unload 模式）</h2>
<p>Ollama 走「按需 load / idle unload」設計：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">brew services start ollama          → daemon 啟動、沒 model 載入、RAM 占用 ~200 MB
</span></span><span class="line"><span class="ln">2</span><span class="cl">                                     port 11434 listening
</span></span><span class="line"><span class="ln">3</span><span class="cl">ollama run gemma3:4b &#34;hello&#34;        → 把 model 載入 RAM (~4-5 GB)
</span></span><span class="line"><span class="ln">4</span><span class="cl">                                     立刻 generate response
</span></span><span class="line"><span class="ln">5</span><span class="cl">                                     model 留在 RAM
</span></span><span class="line"><span class="ln">6</span><span class="cl">(idle 5 分鐘、無新 request)         → Ollama 自動 unload model
</span></span><span class="line"><span class="ln">7</span><span class="cl">                                     RAM 釋放、daemon 仍跑著
</span></span><span class="line"><span class="ln">8</span><span class="cl">ollama run gemma3:4b &#34;next&#34;         → 重新 load model（~5-10 秒）、generate
</span></span><span class="line"><span class="ln">9</span><span class="cl">brew services stop ollama           → daemon 結束、port 釋放</span></span></code></pre></div><p><strong>關鍵參數 <code>OLLAMA_KEEP_ALIVE</code></strong>（環境變數、預設 <code>5m</code>）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 看當前 loaded models</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># NAME         ID              SIZE      PROCESSOR    UNTIL</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># gemma3:4b    a2af6cc3eb7f    5.5 GB    100% Metal   4 minutes from now</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 啟動時調 keep_alive（持續佔 RAM 直到 ollama 重啟）</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="nv">OLLAMA_KEEP_ALIVE</span><span class="o">=</span>-1 brew services restart ollama
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 啟動時讓 model 用完立即 unload</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="nv">OLLAMA_KEEP_ALIVE</span><span class="o">=</span><span class="m">0</span> brew services restart ollama</span></span></code></pre></div><p>選 keep_alive 的 trade-off：</p>
<table>
  <thead>
      <tr>
          <th>設定</th>
          <th>RAM 占用</th>
          <th>首字延遲</th>
          <th>適合場景</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>0</code></td>
          <td>最低（generate 完立即釋放）</td>
          <td>高（每次都重 load）</td>
          <td>偶爾用、RAM 緊張</td>
      </tr>
      <tr>
          <td><code>5m</code>（預設）</td>
          <td>中（活躍用占住、閒 5 分鐘後釋放）</td>
          <td>低（活躍期不重 load）</td>
          <td>大多場景</td>
      </tr>
      <tr>
          <td><code>-1</code></td>
          <td>高（永久占住）</td>
          <td>最低</td>
          <td>整天頻繁用、RAM 充裕</td>
      </tr>
  </tbody>
</table>
<p><strong>主動 unload 指令</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 把 idle 的 model 立刻從 RAM 卸掉、但 daemon 仍跑</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">curl -s http://localhost:11434/api/generate <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{&#34;model&#34;: &#34;gemma3:4b&#34;, &#34;keep_alive&#34;: 0}&#39;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 或關掉整個 daemon</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">brew services stop ollama</span></span></code></pre></div><h2 id="comfyui-的-lifecycle持續占用模式">ComfyUI 的 lifecycle（持續占用模式）</h2>
<p>ComfyUI 走完全不同模式：<strong>model 載入後一直在 RAM、直到 server process 結束</strong>。沒有 auto-unload 機制。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">python main.py                      → ComfyUI server start、port 8188 listening
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">                                     RAM ~3 GB（Python venv + 框架）
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">第一次 Queue Prompt (用 SDXL)        → 載入 sd_xl_base_1.0.safetensors (~6 GB)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">                                     RAM 跳到 ~9-10 GB
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                                     generate 完成、model 留在 RAM
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">連續多張生成                          → 維持 ~9-10 GB、沒 unload
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">idle 1 小時                          → 仍 ~9-10 GB（沒 timer）
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">切到 ControlNet workflow             → 多載 ControlNet model (~2 GB)、ComfyUI 自動 swap
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">                                     RAM 暫升、SD 部分可能被 evict 到 disk
</span></span><span class="line"><span class="ln">10</span><span class="cl">Ctrl+C / pkill                       → process 結束、RAM 完全釋放</span></span></code></pre></div><p>要釋放 ComfyUI 占的 RAM、<strong>唯一方法是結束 server</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 找 PID</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ps aux <span class="p">|</span> grep <span class="s2">&#34;ComfyUI/main.py&#34;</span> <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 優雅關（讓它 cleanup）</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">pkill -INT -f <span class="s2">&#34;ComfyUI/main.py&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 強制 kill（如果上面沒反應、最多等 5 秒再強制）</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">pkill -KILL -f <span class="s2">&#34;ComfyUI/main.py&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 確認 port 釋放</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">lsof -i :8188 <span class="p">|</span> head -3</span></span></code></pre></div><p>實測：M4 Pro 32GB、SDXL base 載入後 ComfyUI process 占 ~8 GB RAM；<code>pkill -9</code> 後 <code>vm_stat</code> 顯示 free pages 增加 ~470K page（<strong>7.5 GB 釋放</strong>）。</p>
<h3 id="為什麼-ollama-跟-comfyui-設計不同">為什麼 Ollama 跟 ComfyUI 設計不同</h3>
<table>
  <thead>
      <tr>
          <th>因素</th>
          <th>Ollama 設計</th>
          <th>ComfyUI 設計</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>主要使用模式</td>
          <td>API 服務、IDE plugin 透過 HTTP 用</td>
          <td>互動 GUI、user 連續調 prompt</td>
      </tr>
      <tr>
          <td>Model 切換頻率</td>
          <td>高（不同任務換不同 model）</td>
          <td>低（一次 session 通常一個 model）</td>
      </tr>
      <tr>
          <td>User 期待的 latency</td>
          <td>低首字延遲（IDE 補完場景）</td>
          <td>高 throughput（連續生圖）</td>
      </tr>
      <tr>
          <td>結論</td>
          <td>Auto-unload 釋 RAM 給其他 model</td>
          <td>持續載入避免重複 load 浪費</td>
      </tr>
  </tbody>
</table>
<p>兩種設計都 valid、適合不同使用模式。理解差異後就知道 ComfyUI 一直占 RAM「不是 bug」、是設計選擇。</p>
<h2 id="跟其他本地-server-對比">跟其他本地 server 對比</h2>
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>Auto-unload</th>
          <th>主動 unload 指令</th>
          <th>占 RAM 觀察</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ollama</td>
          <td>有（5 分鐘 idle）</td>
          <td><code>keep_alive: 0</code> 或 stop daemon</td>
          <td><code>ollama ps</code></td>
      </tr>
      <tr>
          <td>LM Studio</td>
          <td>無（GUI 主動關閉 model 才釋）</td>
          <td>GUI Eject Model</td>
          <td>Activity Monitor</td>
      </tr>
      <tr>
          <td>llama.cpp <code>llama-server</code></td>
          <td>無</td>
          <td>kill process</td>
          <td><code>lsof -i :8080</code></td>
      </tr>
      <tr>
          <td>ComfyUI</td>
          <td>無</td>
          <td>kill process</td>
          <td><code>ps aux | grep ComfyUI</code></td>
      </tr>
      <tr>
          <td>oMLX</td>
          <td>有（per model 可配）</td>
          <td>API endpoint</td>
          <td>server log</td>
      </tr>
  </tbody>
</table>
<p><strong>結論</strong>：只有 Ollama 跟 oMLX 內建 auto-unload、其他都要手動釋放。GUI server（LM Studio）通常給 user 一個「Eject」按鈕、CLI server 通常要 kill process。</p>
<h2 id="標準釋放程序">標準釋放程序</h2>
<p>寫 code 完一天結束、要釋放所有資源、按下表順序操作：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. 確認當前狀態（記下要還回去多少 RAM）</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">vm_stat <span class="p">|</span> head -3
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">df -h ~ <span class="p">|</span> tail -1
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">ps aux <span class="p">|</span> grep -E <span class="s2">&#34;ollama|ComfyUI|llama-server&#34;</span> <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 2. 釋放當前載入的 LLM models（Ollama）</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">brew services stop ollama
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 或保留 daemon、只 unload model：</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># curl -s http://localhost:11434/api/generate -d &#39;{&#34;model&#34;: &#34;&lt;your model&gt;&#34;, &#34;keep_alive&#34;: 0}&#39;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># 3. 結束 ComfyUI / 其他 GUI server</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">pkill -INT -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">14</span><span class="cl">pkill -INT -f <span class="s2">&#34;llama-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">15</span><span class="cl">sleep <span class="m">5</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"># 強制（如果上面沒清乾淨）</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">pkill -KILL -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">18</span><span class="cl">pkill -KILL -f <span class="s2">&#34;llama-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"># 4. 驗證所有 port 釋放</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">lsof -i :11434 -i :1234 -i :8080 -i :8188 -i :8000 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">|</span> head
</span></span><span class="line"><span class="ln">22</span><span class="cl">
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="c1"># 5. 確認釋放量</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">vm_stat <span class="p">|</span> head -3
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1"># free pages 該明顯增加</span></span></span></code></pre></div><h3 id="容易出錯的釋放方式">容易出錯的「釋放方式」</h3>
<ul>
<li><strong><code>killall Python</code></strong>：會 kill 所有 Python process、包括其他 dev tool（如 jupyter、Django）。用 <code>pkill -f &quot;ComfyUI/main.py&quot;</code> 等明確 pattern。</li>
<li><strong><code>rm -rf ~/.ollama</code></strong>：會清掉所有 model registry、下次要重 pull 全部 model。Cleanup 用 <code>ollama rm &lt;model&gt;</code> 才精準。</li>
<li><strong><code>brew uninstall ollama</code></strong>：直接卸載 Ollama 本身、過 reinstall 麻煩。Stop service 就夠。</li>
<li><strong>重開機釋放</strong>：work 但太重、會中斷其他工作。用 process-level 操作即可。</li>
</ul>
<h2 id="磁碟長期累積管理">磁碟長期累積管理</h2>
<p>Models 一旦 <code>pull</code> 進 <code>~/.ollama/models/blobs</code>、不主動 <code>rm</code> 不會減少。半年累積可長到 50 GB+。</p>
<p>Ollama models 只是磁碟大戶之一。整台 Mac 突然被吃光、要從哪裡查起的全機診斷順序（先排除快照浮動、再用實際佔用值逐層找大戶），見 <a href="/blog/other/macos-%E7%A3%81%E7%A2%9F%E7%A9%BA%E9%96%93%E8%A2%AB%E5%90%83%E5%85%89%E7%9A%84%E8%A8%BA%E6%96%B7%E6%B5%81%E7%A8%8B/" data-link-title="macOS 磁碟空間被吃光的診斷流程" data-link-desc="Mac 空間莫名歸零、清 cache 沒救、或空間掉了又回來時的排查順序。避開 sparse 假大小和本地快照浮動的誤判。含 disk-report 腳本。">macOS 磁碟空間診斷流程</a>——那篇的佔用大戶表也會把 ollama 列為其中一項、再連回本篇的專屬清理 idiom。</p>
<h3 id="觀察累積">觀察累積</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Ollama models 總占用</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">du -sh ~/.ollama/models/blobs
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># 4.1G    /Users/tarragon/.ollama/models/blobs</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 逐 model 看大小</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">ollama list
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># NAME                       ID              SIZE      MODIFIED</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># gemma4:e4b                 c6eb396dbd59    9.6 GB    Less than a second ago</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># nomic-embed-text:latest    0a109f422b47    274 MB    3 hours ago</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># ComfyUI checkpoints 累積</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">du -sh ~/.ollama ~/Projects/ComfyUI/models 2&gt;/dev/null
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># 4.2G    /Users/tarragon/.ollama</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># 7.0G    /Users/tarragon/Projects/ComfyUI/models</span></span></span></code></pre></div><h3 id="清理策略">清理策略</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 刪掉很久沒用的 model</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ollama rm &lt;model-tag&gt;
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 一次清掉所有 Ollama models（保留 daemon）</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">ollama list <span class="p">|</span> tail -n +2 <span class="p">|</span> awk <span class="s1">&#39;{print $1}&#39;</span> <span class="p">|</span> xargs -I <span class="o">{}</span> ollama rm <span class="o">{}</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 看 ComfyUI checkpoints 哪些可清</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">ls -lh ~/Projects/ComfyUI/models/checkpoints/
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 手動刪不要的 .safetensors（小心、不能 undo）</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">rm ~/Projects/ComfyUI/models/checkpoints/&lt;old-model&gt;.safetensors</span></span></code></pre></div><h3 id="磁碟管理-idiom">磁碟管理 idiom</h3>
<p>定期（每月或磁碟剩 &lt; 20% 時）做：</p>
<ol>
<li><code>du -sh ~/.ollama ~/Projects/ComfyUI/models</code> 看當前累積</li>
<li><code>ollama list</code> 看哪些 model 沒在用（看 <code>MODIFIED</code> 欄、太舊的考慮刪）</li>
<li>刪實驗用的 model、保留 daily-driver</li>
<li>ComfyUI checkpoints 同樣 review</li>
</ol>
<h2 id="port--process-排錯">Port / Process 排錯</h2>
<h3 id="啟動報address-already-in-use">啟動報「address already in use」</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 找誰占</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">lsof -i :11434
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># COMMAND  PID  USER   ...   NAME</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># ollama   xxx  ...    ...   TCP localhost:11434 (LISTEN)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 看是不是 zombie process</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">ps aux <span class="p">|</span> grep <span class="k">$(</span>lsof -ti :11434 <span class="p">|</span> head -1<span class="k">)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 清掉</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="nb">kill</span> -9 <span class="k">$(</span>lsof -ti :11434<span class="k">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># 或重啟 service（會自動清舊 instance）</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">brew services restart ollama</span></span></code></pre></div><h3 id="ollama-daemon-掛了不知道">Ollama daemon 掛了不知道</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 健康檢查</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">curl -s http://localhost:11434/api/version
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 沒回應、看 service 狀態</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">brew services list <span class="p">|</span> grep ollama
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 沒在跑、重啟</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">brew services start ollama
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 看 log</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">tail -50 /opt/homebrew/var/log/ollama.log</span></span></code></pre></div><h3 id="comfyui-看似跑著但-queue-不動">ComfyUI 看似跑著但 Queue 不動</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 看 stdout / stderr log</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">tail -30 /tmp/comfyui.log  <span class="c1"># 如果啟動時 redirect 到 log</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 看是不是 GPU / Metal stuck（極少見、但 SDXL 大量並發可能踩到）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 解法：kill + 重啟</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">pkill -9 -f <span class="s2">&#34;ComfyUI/main.py&#34;</span></span></span></code></pre></div><p>完整排錯流程跟「先確認哪一層壞」見 <a href="/blog/llm/01-local-llm-services/troubleshooting/" data-link-title="1.7 排錯方法論：用三層架構做故障定位" data-link-desc="故障定位的分層思考、症狀到層級的對應反射、log 在三層的角色差異、最小可重現的縮減策略">1.7 排錯方法論</a>。</p>
<h2 id="觀察記憶體佔用實測對照">觀察記憶體佔用：實測對照</h2>
<p>跑這幾步紀錄 baseline → load model → kill 的 RAM 變化：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Baseline</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># Pages free:                              1090076.   ← ~17 GB free</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 啟動 Ollama + load 4B model</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">brew services start ollama
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">ollama run gemma3:4b <span class="s2">&#34;hello&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># NAME       SIZE     PROCESSOR    UNTIL</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># gemma3:4b  5.5 GB   100% Metal   4 minutes from now</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># Pages free:                               750000.   ← 跌 ~5 GB（model 載入）</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"># 額外啟動 ComfyUI + load SDXL</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">nohup python main.py &gt; /tmp/comfyui.log 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"># 在 GUI 上 Queue Prompt 跑一次 SDXL generation</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"># Pages free:                               280000.   ← 再跌 ~7.5 GB（SDXL 載入 + Python venv）</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"># kill 全部</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">brew services stop ollama
</span></span><span class="line"><span class="ln">23</span><span class="cl">pkill -9 -f <span class="s2">&#34;ComfyUI/main.py&#34;</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">sleep <span class="m">3</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="c1"># Pages free:                              1090000.   ← 回到 baseline</span></span></span></code></pre></div><p>每 page 16 KB、所以 free pages 數字 × 16 KB = 實際 free RAM bytes。</p>
<h2 id="自動化釋放launchd--shell-alias">自動化釋放：launchd / shell alias</h2>
<p>寫個 shell function 一鍵 cleanup：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 加進 ~/.zshrc</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">llm-cleanup<span class="o">()</span> <span class="o">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Stopping Ollama...&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  brew services stop ollama 2&gt;/dev/null
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Killing ComfyUI...&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  pkill -INT -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  sleep <span class="m">3</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  pkill -KILL -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Killing other model servers...&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  pkill -KILL -f <span class="s2">&#34;llama-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">13</span><span class="cl">  pkill -KILL -f <span class="s2">&#34;lm-studio-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Verifying ports...&#34;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="k">for</span> p in <span class="m">11434</span> <span class="m">1234</span> <span class="m">8080</span> <span class="m">8188</span> 8000<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    lsof -i :<span class="nv">$p</span> 2&gt;/dev/null <span class="p">|</span> head -2
</span></span><span class="line"><span class="ln">18</span><span class="cl">  <span class="k">done</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Free RAM:&#34;</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">  vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="o">}</span></span></span></code></pre></div><p>完事打 <code>llm-cleanup</code> 一鍵釋放、不用記每個 process 怎麼 kill。</p>
<h2 id="何時這篇會過時">何時這篇會過時</h2>
<p><strong>不會過時的部分</strong>：</p>
<ul>
<li>RAM / 磁碟 / port 三個 dimension 是長期 invariant、用什麼 LLM server 都成立。</li>
<li>「Mac 是 shared resource、需要主動管理」這個 framing。</li>
<li>Ollama 跟 ComfyUI 兩種典型 lifecycle 對比（auto-unload vs persistent）。</li>
<li>觀察工具（<code>vm_stat</code>、<code>lsof</code>、<code>ps</code>、<code>du</code>、Activity Monitor）是 macOS 系統 API、不會 deprecate。</li>
<li>標準釋放程序、自動化 shell function 模式。</li>
</ul>
<p><strong>會變的部分</strong>：</p>
<ul>
<li>具體 model size / RAM 占用數字（隨模型架構演化）。</li>
<li><code>OLLAMA_KEEP_ALIVE</code> 等具體環境變數名（Ollama API 演化）。</li>
<li>ComfyUI 可能加 auto-unload feature（社群有 issue 在討論）。</li>
</ul>
<p>讀的時候若指令跑不過、先 <code>--help</code> 看當前版本 flag；釋放 RAM 的「kill process」這個機制本身永遠成立。</p>
<h2 id="跟其他-hands-on-章節的關係">跟其他 hands-on 章節的關係</h2>
<ul>
<li><a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama 安裝</a>：介紹 <code>brew services start/stop</code>、本篇延伸 lifecycle 細節</li>
<li><a href="/blog/llm/01-local-llm-services/hands-on/comfyui-setup/" data-link-title="Hands-on：安裝 ComfyUI &#43; SDXL base" data-link-desc="git clone、venv、pip install requirements、SDXL safetensors 放哪、--listen 啟動 server、瀏覽器 workflow 驗證">ComfyUI 安裝</a>：介紹 ComfyUI 啟動、本篇延伸 RAM 占用 + 釋放</li>
<li><a href="/blog/llm/01-local-llm-services/troubleshooting/" data-link-title="1.7 排錯方法論：用三層架構做故障定位" data-link-desc="故障定位的分層思考、症狀到層級的對應反射、log 在三層的角色差異、最小可重現的縮減策略">1.7 排錯方法論</a>：用三層架構定位故障、本篇是 lifecycle 視角的補完</li>
<li><a href="/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 隱私資料流原理</a>：「每個 hop 都要 audit」延伸到資源層</li>
</ul>
<p>整體心法：本地 LLM 工作流跟雲端不一樣、要主動管理 lifecycle、不能裝完就忘。</p>
]]></content:encoded></item><item><title>Hands-on：用本地 LLM 跑 judge harness（最小可行版）</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/local-llm-judge-harness/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/local-llm-judge-harness/</guid><description>&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-judge&lt;/a> 寫的是原理。本篇用 Ollama / LM Studio 在本地跑一個最小可行的 judge harness、對自己工作流的真實案例做 systematic eval。隱私敏感場景特別合用 — eval 資料（user query、agent output、可能含 PII）不需要送雲端。&lt;/p>
&lt;p>本篇 framing 是「&lt;strong>真的能跑、不只跑 demo&lt;/strong>」、所以包含：硬體預算估算、judge model 選型、bias 緩解、calibration 流程、跟 production trace 串接的延伸；術語對應 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge&lt;/a> 與 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-tracing/" data-link-title="LLM Tracing" data-link-desc="把 LLM 應用的每次 LLM call / tool call / memory op 編成結構化 span、用 OpenTelemetry GenAI semantic conventions 標準化">LLM Tracing&lt;/a>。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：M4 Max 64GB / 或 24GB+ VRAM PC + Ollama
&lt;strong>Judge model&lt;/strong>：DeepSeek-R1-Distill-Qwen-32B 或 QwQ-32B（reasoning model 當 judge 更穩）&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼用本地-llm-當-judge">為什麼用本地 LLM 當 judge&lt;/h2>
&lt;p>跟雲端 judge（GPT-5 / Claude 4）對比：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>本地 judge&lt;/th>
 &lt;th>雲端 judge&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Cost&lt;/td>
 &lt;td>0（電費）&lt;/td>
 &lt;td>$0.001-0.01 per item&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>隱私&lt;/td>
 &lt;td>完全本地、eval 資料不出機器&lt;/td>
 &lt;td>送雲端、依政策&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Latency&lt;/td>
 &lt;td>視硬體、reasoning model 30B 約 30-60s&lt;/td>
 &lt;td>API call 5-30s&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>品質上限&lt;/td>
 &lt;td>本地 30B reasoning 接近 2024 雲端中段&lt;/td>
 &lt;td>雲端旗艦上限高&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>大量 batch&lt;/td>
 &lt;td>慢但 zero cost&lt;/td>
 &lt;td>快但 cost 累積&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>判讀：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>大量 production trace eval（千筆以上）+ 隱私敏感&lt;/strong> → 本地 judge&lt;/li>
&lt;li>&lt;strong>少量 high-stake eval（&amp;lt; 50 筆）&lt;/strong> → 雲端旗艦 judge&lt;/li>
&lt;li>&lt;strong>A/B test 快速 iterate&lt;/strong> → 雲端（latency 重要）&lt;/li>
&lt;/ul>
&lt;h2 id="硬體預算">硬體預算&lt;/h2>
&lt;p>Judge model 選擇看硬體：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>硬體&lt;/th>
 &lt;th>適合 judge model&lt;/th>
 &lt;th>預期 latency / item&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>M4 Pro 24GB / 4090 16GB&lt;/td>
 &lt;td>Qwen2.5-32B Q4 或 DeepSeek-R1-Distill-14B&lt;/td>
 &lt;td>30-60s&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>M4 Pro 36GB&lt;/td>
 &lt;td>DeepSeek-R1-Distill-Qwen-32B Q4&lt;/td>
 &lt;td>60-120s&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>M4 Max 48-64GB / 5090 24GB&lt;/td>
 &lt;td>QwQ-32B 或 DeepSeek-R1-Distill-Qwen-32B Q6&lt;/td>
 &lt;td>60-180s（含 reasoning trace）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>M4 Max 128GB / 多卡 PC&lt;/td>
 &lt;td>Llama 3.3 70B 或 Qwen3-72B&lt;/td>
 &lt;td>120-300s&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>注意：reasoning model 的 thinking trace 拉長 latency、跑大量 batch 要規劃時間（100 item × 60s = 100 min）。&lt;/p></description><content:encoded><![CDATA[<p><a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-judge</a> 寫的是原理。本篇用 Ollama / LM Studio 在本地跑一個最小可行的 judge harness、對自己工作流的真實案例做 systematic eval。隱私敏感場景特別合用 — eval 資料（user query、agent output、可能含 PII）不需要送雲端。</p>
<p>本篇 framing 是「<strong>真的能跑、不只跑 demo</strong>」、所以包含：硬體預算估算、judge model 選型、bias 緩解、calibration 流程、跟 production trace 串接的延伸；術語對應 <a href="/blog/llm/knowledge-cards/llm-as-judge/" data-link-title="LLM-as-Judge" data-link-desc="用 LLM 評估另一個 LLM 的輸出品質、production eval 的主流方法、500-5000× 成本降但有 bias 要處理">LLM-as-Judge</a> 與 <a href="/blog/llm/knowledge-cards/llm-tracing/" data-link-title="LLM Tracing" data-link-desc="把 LLM 應用的每次 LLM call / tool call / memory op 編成結構化 span、用 OpenTelemetry GenAI semantic conventions 標準化">LLM Tracing</a>。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：M4 Max 64GB / 或 24GB+ VRAM PC + Ollama
<strong>Judge model</strong>：DeepSeek-R1-Distill-Qwen-32B 或 QwQ-32B（reasoning model 當 judge 更穩）</p></blockquote>
<h2 id="為什麼用本地-llm-當-judge">為什麼用本地 LLM 當 judge</h2>
<p>跟雲端 judge（GPT-5 / Claude 4）對比：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>本地 judge</th>
          <th>雲端 judge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cost</td>
          <td>0（電費）</td>
          <td>$0.001-0.01 per item</td>
      </tr>
      <tr>
          <td>隱私</td>
          <td>完全本地、eval 資料不出機器</td>
          <td>送雲端、依政策</td>
      </tr>
      <tr>
          <td>Latency</td>
          <td>視硬體、reasoning model 30B 約 30-60s</td>
          <td>API call 5-30s</td>
      </tr>
      <tr>
          <td>品質上限</td>
          <td>本地 30B reasoning 接近 2024 雲端中段</td>
          <td>雲端旗艦上限高</td>
      </tr>
      <tr>
          <td>大量 batch</td>
          <td>慢但 zero cost</td>
          <td>快但 cost 累積</td>
      </tr>
  </tbody>
</table>
<p>判讀：</p>
<ul>
<li><strong>大量 production trace eval（千筆以上）+ 隱私敏感</strong> → 本地 judge</li>
<li><strong>少量 high-stake eval（&lt; 50 筆）</strong> → 雲端旗艦 judge</li>
<li><strong>A/B test 快速 iterate</strong> → 雲端（latency 重要）</li>
</ul>
<h2 id="硬體預算">硬體預算</h2>
<p>Judge model 選擇看硬體：</p>
<table>
  <thead>
      <tr>
          <th>硬體</th>
          <th>適合 judge model</th>
          <th>預期 latency / item</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>M4 Pro 24GB / 4090 16GB</td>
          <td>Qwen2.5-32B Q4 或 DeepSeek-R1-Distill-14B</td>
          <td>30-60s</td>
      </tr>
      <tr>
          <td>M4 Pro 36GB</td>
          <td>DeepSeek-R1-Distill-Qwen-32B Q4</td>
          <td>60-120s</td>
      </tr>
      <tr>
          <td>M4 Max 48-64GB / 5090 24GB</td>
          <td>QwQ-32B 或 DeepSeek-R1-Distill-Qwen-32B Q6</td>
          <td>60-180s（含 reasoning trace）</td>
      </tr>
      <tr>
          <td>M4 Max 128GB / 多卡 PC</td>
          <td>Llama 3.3 70B 或 Qwen3-72B</td>
          <td>120-300s</td>
      </tr>
  </tbody>
</table>
<p>注意：reasoning model 的 thinking trace 拉長 latency、跑大量 batch 要規劃時間（100 item × 60s = 100 min）。</p>
<p><strong>何時不適合用本地 judge</strong>：</p>
<ol>
<li><strong>硬體低於 M4 Pro 24GB / 4090 16GB</strong>（如 M1/M2 16GB、無獨立 GPU PC）：跑 32B reasoning model 太緊、強行跑會 swap、latency 爆 5-10×。改用 14B instruct model（如 Qwen2.5-14B Q4）作 judge、或直接走雲端 judge</li>
<li><strong>Batch × latency &gt; 你可接受的等待時間</strong>：100 item × 60s/item = 100 min；500 item × 120s = 17 hr。預估超過 4 hr 時改雲端 batch API</li>
<li><strong>eval 任務太 nuanced</strong>：細粒度倫理 / 法律 / 高 stake 判讀、本地 32B distill 能力不夠、用雲端旗艦 judge 或人工 review</li>
<li><strong>calibration 階段</strong>：第一次跑、要快速 iterate rubric、雲端 judge latency 短（5-30s）更適合 iterate</li>
</ol>
<h2 id="整體流程">整體流程</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. 蒐集 eval dataset    → JSONL：每行一個 (input, output) 待評
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. 設計 rubric         → 評分維度、scale、明確 anti-pattern
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. 寫 judge prompt     → 4 段式（task / input-output / rubric / format）
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. 跑 harness          → 對每筆 input call judge、parse JSON output
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. Aggregate 結果      → 算平均分數、找 outlier、看 reasoning
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. Calibration（可選）  → 跟 human eval 比對、調 rubric
</span></span><span class="line"><span class="ln">7</span><span class="cl">7. 跟 production trace 串接 → 定期跑 production sample</span></span></code></pre></div><h2 id="step-1蒐集-eval-dataset">Step 1：蒐集 eval dataset</h2>
<p>JSONL format（每行一筆）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">{</span><span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;001&#34;</span><span class="p">,</span> <span class="nt">&#34;input&#34;</span><span class="p">:</span> <span class="s2">&#34;用 Python 寫 fibonacci function&#34;</span><span class="p">,</span> <span class="nt">&#34;output&#34;</span><span class="p">:</span> <span class="s2">&#34;def fib(n):\n    if n &lt;= 1:\n        return n\n    return fib(n-1) + fib(n-2)&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="p">{</span><span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;002&#34;</span><span class="p">,</span> <span class="nt">&#34;input&#34;</span><span class="p">:</span> <span class="s2">&#34;解釋這段 code 在做什麼：[code]&#34;</span><span class="p">,</span> <span class="nt">&#34;output&#34;</span><span class="p">:</span> <span class="s2">&#34;這段 code 實作了 ...&#34;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="p">{</span><span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="s2">&#34;003&#34;</span><span class="p">,</span> <span class="nt">&#34;input&#34;</span><span class="p">:</span> <span class="s2">&#34;[bug 描述]&#34;</span><span class="p">,</span> <span class="nt">&#34;output&#34;</span><span class="p">:</span> <span class="s2">&#34;[suggested fix]&#34;</span><span class="p">}</span></span></span></code></pre></div><p>來源：</p>
<ul>
<li>過往 Continue.dev / Cursor 跟 LLM 的對話 log</li>
<li>Production agent 的 trace（手動 export 或 LangSmith / Phoenix dump）</li>
<li>自己 hand-craft 30-100 個典型 case</li>
</ul>
<p>放在 <code>data/eval.jsonl</code>。</p>
<h2 id="step-2設計-rubric">Step 2：設計 rubric</h2>
<p>依任務類型設計、coding 任務的範例 rubric：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">評分維度：
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">1. Correctness（程式碼能否運作、邏輯是否正確）：1-5
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">2. Style（是否符合 codebase convention、習慣命名）：1-5
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">3. Completeness（是否完整解決 user request）：1-5
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">評分規則：
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">- 5：完美無瑕、可直接 merge
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">- 4：小修可用、整體正確
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">- 3：方向正確、需 substantial 修改
</span></span><span class="line"><span class="ln">10</span><span class="cl">- 2：部分對、主要邏輯有錯
</span></span><span class="line"><span class="ln">11</span><span class="cl">- 1：完全錯、誤導使用者
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl">明確不加分（緩解 verbosity bias）：
</span></span><span class="line"><span class="ln">14</span><span class="cl">- 冗長 / verbose（同樣正確的短答 = 長答）
</span></span><span class="line"><span class="ln">15</span><span class="cl">- 道歉 / 開場白
</span></span><span class="line"><span class="ln">16</span><span class="cl">- 「我希望這有幫助」這類禮貌話
</span></span><span class="line"><span class="ln">17</span><span class="cl">- 過多 markdown 修飾（不加分）</span></span></code></pre></div><h2 id="step-3judge-prompt-模板">Step 3：Judge prompt 模板</h2>
<p>寫成 file <code>prompts/judge.txt</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">你是 LLM 輸出品質評估員、要評估 coding assistant 對使用者請求的回答品質。
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">重要：請保持公正、忽略風格偏好、聚焦在實質品質。
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">User request:
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">{input}
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">Assistant response:
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">{output}
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">評分維度（每維 1-5、加總用 overall）：
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">1. Correctness：程式碼能否運作、邏輯正確
</span></span><span class="line"><span class="ln">13</span><span class="cl">   5: 完美無瑕
</span></span><span class="line"><span class="ln">14</span><span class="cl">   4: 小修可用
</span></span><span class="line"><span class="ln">15</span><span class="cl">   3: 方向正確、需 substantial 修改
</span></span><span class="line"><span class="ln">16</span><span class="cl">   2: 部分對、主要邏輯有錯
</span></span><span class="line"><span class="ln">17</span><span class="cl">   1: 完全錯
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl">2. Style：符合 codebase convention
</span></span><span class="line"><span class="ln">20</span><span class="cl">   1-5 同 scale
</span></span><span class="line"><span class="ln">21</span><span class="cl">
</span></span><span class="line"><span class="ln">22</span><span class="cl">3. Completeness：完整解決 user request
</span></span><span class="line"><span class="ln">23</span><span class="cl">   1-5 同 scale
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl">明確不加分項：
</span></span><span class="line"><span class="ln">26</span><span class="cl">- 冗長 / verbose（同樣正確的短答 = 長答）
</span></span><span class="line"><span class="ln">27</span><span class="cl">- 道歉 / 開場白
</span></span><span class="line"><span class="ln">28</span><span class="cl">- 「我希望這有幫助」這類禮貌話
</span></span><span class="line"><span class="ln">29</span><span class="cl">- 過多 markdown 修飾
</span></span><span class="line"><span class="ln">30</span><span class="cl">
</span></span><span class="line"><span class="ln">31</span><span class="cl">請依下列 JSON 輸出（不要加額外文字、不要 markdown code fence）：
</span></span><span class="line"><span class="ln">32</span><span class="cl">{
</span></span><span class="line"><span class="ln">33</span><span class="cl">  &#34;correctness&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">34</span><span class="cl">  &#34;style&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">35</span><span class="cl">  &#34;completeness&#34;: &lt;1-5&gt;,
</span></span><span class="line"><span class="ln">36</span><span class="cl">  &#34;reasoning&#34;: &#34;&lt;簡短解釋、&lt; 100 字&gt;&#34;,
</span></span><span class="line"><span class="ln">37</span><span class="cl">  &#34;overall&#34;: &lt;1-5&gt;
</span></span><span class="line"><span class="ln">38</span><span class="cl">}</span></span></code></pre></div><h2 id="step-4跑-harness">Step 4：跑 harness</h2>
<p>Python 最小可行版：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># judge_harness.py</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="n">JUDGE_MODEL</span> <span class="o">=</span> <span class="s2">&#34;deepseek-r1:32b&#34;</span>  <span class="c1"># 或 qwq:32b</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">OLLAMA_URL</span> <span class="o">=</span> <span class="s2">&#34;http://localhost:11434/v1/chat/completions&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="k">def</span> <span class="nf">load_dataset</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="s2">&#34;&#34;&#34;Load JSONL eval dataset.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="k">return</span> <span class="p">[</span><span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()]</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="k">def</span> <span class="nf">load_prompt_template</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="k">return</span> <span class="n">Path</span><span class="p">(</span><span class="n">path</span><span class="p">)</span><span class="o">.</span><span class="n">read_text</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="k">def</span> <span class="nf">call_judge</span><span class="p">(</span><span class="n">prompt</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="s2">&#34;&#34;&#34;Call Ollama judge model、回 raw response text.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">    <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">OLLAMA_URL</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">        <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="n">JUDGE_MODEL</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">        <span class="s2">&#34;messages&#34;</span><span class="p">:</span> <span class="p">[{</span><span class="s2">&#34;role&#34;</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="s2">&#34;content&#34;</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}],</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">        <span class="s2">&#34;temperature&#34;</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>  <span class="c1"># judge 用低 temperature 穩定</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">        <span class="s2">&#34;stream&#34;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">    <span class="p">},</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">600</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">    <span class="k">return</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s2">&#34;choices&#34;</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s2">&#34;message&#34;</span><span class="p">][</span><span class="s2">&#34;content&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">
</span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="k">def</span> <span class="nf">parse_judge_output</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl">    <span class="s2">&#34;&#34;&#34;Parse judge 回的 JSON、容錯處理（reasoning model 可能加 &lt;think&gt; 標記）。&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">    <span class="c1"># 跳過 reasoning trace</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl">    <span class="k">if</span> <span class="s2">&#34;&lt;/think&gt;&#34;</span> <span class="ow">in</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl">        <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;&lt;/think&gt;&#34;</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">
</span></span><span class="line"><span class="ln">33</span><span class="cl">    <span class="c1"># 找 JSON 區塊</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl">    <span class="n">start</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">&#34;{&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl">    <span class="n">end</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">rfind</span><span class="p">(</span><span class="s2">&#34;}&#34;</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">    <span class="k">if</span> <span class="n">start</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="ow">or</span> <span class="n">end</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">37</span><span class="cl">        <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="ln">38</span><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">39</span><span class="cl">        <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">40</span><span class="cl">    <span class="k">except</span> <span class="n">json</span><span class="o">.</span><span class="n">JSONDecodeError</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">41</span><span class="cl">        <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="ln">42</span><span class="cl">
</span></span><span class="line"><span class="ln">43</span><span class="cl"><span class="k">def</span> <span class="nf">run_harness</span><span class="p">(</span><span class="n">dataset_path</span><span class="p">,</span> <span class="n">prompt_template_path</span><span class="p">,</span> <span class="n">output_path</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">44</span><span class="cl">    <span class="n">dataset</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="n">dataset_path</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">45</span><span class="cl">    <span class="n">template</span> <span class="o">=</span> <span class="n">load_prompt_template</span><span class="p">(</span><span class="n">prompt_template_path</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">46</span><span class="cl">
</span></span><span class="line"><span class="ln">47</span><span class="cl">    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="ln">48</span><span class="cl">    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">49</span><span class="cl">        <span class="n">prompt</span> <span class="o">=</span> <span class="n">template</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="n">item</span><span class="p">[</span><span class="s2">&#34;input&#34;</span><span class="p">],</span> <span class="n">output</span><span class="o">=</span><span class="n">item</span><span class="p">[</span><span class="s2">&#34;output&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="ln">50</span><span class="cl">        <span class="n">raw</span> <span class="o">=</span> <span class="n">call_judge</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">51</span><span class="cl">        <span class="n">parsed</span> <span class="o">=</span> <span class="n">parse_judge_output</span><span class="p">(</span><span class="n">raw</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">52</span><span class="cl">
</span></span><span class="line"><span class="ln">53</span><span class="cl">        <span class="n">result</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">54</span><span class="cl">            <span class="s2">&#34;id&#34;</span><span class="p">:</span> <span class="n">item</span><span class="p">[</span><span class="s2">&#34;id&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="ln">55</span><span class="cl">            <span class="s2">&#34;scores&#34;</span><span class="p">:</span> <span class="n">parsed</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">56</span><span class="cl">            <span class="s2">&#34;raw_judge_output&#34;</span><span class="p">:</span> <span class="n">raw</span><span class="p">[:</span><span class="mi">500</span><span class="p">],</span>  <span class="c1"># 保留前 500 字便於 debug</span>
</span></span><span class="line"><span class="ln">57</span><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="ln">58</span><span class="cl">        <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">59</span><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;[</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span><span class="si">}</span><span class="s2">] id=</span><span class="si">{</span><span class="n">item</span><span class="p">[</span><span class="s1">&#39;id&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> overall=</span><span class="si">{</span><span class="n">parsed</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;overall&#39;</span><span class="p">)</span> <span class="k">if</span> <span class="n">parsed</span> <span class="k">else</span> <span class="s1">&#39;FAIL&#39;</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">60</span><span class="cl">
</span></span><span class="line"><span class="ln">61</span><span class="cl">    <span class="c1"># 寫出 JSONL</span>
</span></span><span class="line"><span class="ln">62</span><span class="cl">    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_path</span><span class="p">,</span> <span class="s2">&#34;w&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">63</span><span class="cl">        <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">64</span><span class="cl">            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">+</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">65</span><span class="cl">
</span></span><span class="line"><span class="ln">66</span><span class="cl">    <span class="c1"># Aggregate</span>
</span></span><span class="line"><span class="ln">67</span><span class="cl">    <span class="n">valid</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="s2">&#34;scores&#34;</span><span class="p">]]</span>
</span></span><span class="line"><span class="ln">68</span><span class="cl">    <span class="k">if</span> <span class="n">valid</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">69</span><span class="cl">        <span class="n">avg</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="s2">&#34;scores&#34;</span><span class="p">][</span><span class="s2">&#34;overall&#34;</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">valid</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">valid</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">70</span><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">Aggregate: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">valid</span><span class="p">)</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="si">}</span><span class="s2"> valid、avg overall = </span><span class="si">{</span><span class="n">avg</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">71</span><span class="cl">
</span></span><span class="line"><span class="ln">72</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;__main__&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">73</span><span class="cl">    <span class="n">run_harness</span><span class="p">(</span><span class="s2">&#34;data/eval.jsonl&#34;</span><span class="p">,</span> <span class="s2">&#34;prompts/judge.txt&#34;</span><span class="p">,</span> <span class="s2">&#34;results/eval.jsonl&#34;</span><span class="p">)</span></span></span></code></pre></div><p>跑：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 先確認 judge model 已 pull</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama pull deepseek-r1:32b
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 跑 harness</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">python judge_harness.py</span></span></code></pre></div><h2 id="step-5aggregate-跟看-outlier">Step 5：Aggregate 跟看 outlier</h2>
<p>跑完後 results/eval.jsonl 含每筆評分跟 reasoning。看哪些是 outlier：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 找 overall &lt; 3 的 case（低分、值得 review）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">jq <span class="s1">&#39;select(.scores.overall &lt; 3)&#39;</span> results/eval.jsonl
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 看 reasoning 找系統性問題</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">jq <span class="s1">&#39;.scores.reasoning&#39;</span> results/eval.jsonl <span class="p">|</span> sort -u</span></span></code></pre></div><p>判讀：</p>
<ul>
<li><strong>多數 score 4-5、少數 1-2</strong>：整體品質好、focus 在低分 case 找 fix</li>
<li><strong>多數 score 2-3</strong>：系統性問題、改 prompt / model / agent design</li>
<li><strong>分數分佈兩極（很多 5 很多 1）</strong>：可能是 task difficulty 分群、stratified analysis</li>
</ul>
<h2 id="step-6calibration可選但推薦">Step 6：Calibration（可選但推薦）</h2>
<p>跟 human eval 比對、確認 judge 對齊：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. 從 dataset 抽 30 個（覆蓋 difficulty / score 分佈）
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. 自己 human eval（依同樣 rubric）
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. 對比 judge 跟 human 的 overall score
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. 算 Spearman correlation
</span></span><span class="line"><span class="ln">5</span><span class="cl">   - &gt; 0.7：judge 對齊夠好、可信
</span></span><span class="line"><span class="ln">6</span><span class="cl">   - 0.5-0.7：部分問題、改 rubric
</span></span><span class="line"><span class="ln">7</span><span class="cl">   - &lt; 0.5：judge 不可信、換 model 或重寫 rubric</span></span></code></pre></div><p>低 correlation 的常見原因：</p>
<ul>
<li>Rubric 太 vague、judge 自由發揮</li>
<li>Judge model 能力不夠（換更強 judge）</li>
<li>Verbosity / position bias 沒緩解</li>
<li>Eval task 跟 judge 訓練分佈差距大</li>
</ul>
<h2 id="step-7跟-production-trace-串接延伸">Step 7：跟 production trace 串接（延伸）</h2>
<p>把 <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 LLM tracing</a> 蒐集的 production trace export 成 JSONL、定期跑 judge：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 假設用 Langfuse self-host</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">langfuse <span class="nb">export</span> --filter <span class="s2">&#34;user_feedback=negative&#34;</span> --output traces.jsonl
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 轉成 eval format</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">python convert_trace_to_eval.py traces.jsonl &gt; data/eval-from-prod.jsonl
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># 跑 judge</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">python judge_harness.py</span></span></code></pre></div><p>這是 production quality engineering 閉環的本地版本、隱私敏感場景的 cost-free alternative。</p>
<h2 id="失敗模式">失敗模式</h2>
<ol>
<li><strong>Judge 不輸出合法 JSON</strong>：reasoning model 可能在 <code>&lt;think&gt;...&lt;/think&gt;</code> 後仍加 markdown / 解釋</li>
</ol>
<p><strong>緩解</strong>：parse 時跳 <code>&lt;think&gt;</code> 段、容錯處理、或開 <a href="/blog/llm/knowledge-cards/constrained-decoding/" data-link-title="Constrained Decoding" data-link-desc="推論時用 grammar 強制 LLM 輸出符合特定格式（JSON / regex / CFG）的 sampling 機制、把不合法 token 的機率歸零">constrained decoding</a>（llama.cpp grammar）</p>
<ol start="2">
<li><strong>Latency 太長、batch 跑不完</strong>：reasoning model 32B 每 item 60-120s、100 item 要 2 小時</li>
</ol>
<p><strong>緩解</strong>：用較小 judge model（如 Qwen2.5-32B instruct、非 reasoning）、或拆 batch 並行</p>
<ol start="3">
<li><strong>Judge bias 沒緩解</strong>：本地 judge 跟雲端 judge 都會有 verbosity / position bias</li>
</ol>
<p><strong>緩解</strong>：rubric 寫明、pairwise 換位置跑 2 次</p>
<ol start="4">
<li><strong>本地 judge 能力上限</strong>：30B distill 對 nuanced case 判讀不如雲端旗艦</li>
</ol>
<p><strong>緩解</strong>：critical case 加 spot human review、或混用本地（量大）+ 雲端（精選 sample）</p>
<h2 id="跟其他章節的關係">跟其他章節的關係</h2>
<ul>
<li>原理層的 LLM-as-judge 設計見 <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21</a></li>
<li>Production trace 串接見 <a href="/blog/llm/04-applications/llm-tracing-and-observability/" data-link-title="4.20 LLM tracing 與 observability" data-link-desc="OpenTelemetry GenAI semantic conventions、結構化 span 設計、cost / latency 監控、failure debug 流程、跟 LLM-as-judge eval 的串接">4.20 tracing</a></li>
<li>Reasoning model 選型見 <a href="/blog/llm/03-theoretical-foundations/reasoning-models/" data-link-title="3.8 Reasoning models：test-time compute paradigm" data-link-desc="Chain-of-thought 從 prompting 技巧演化成訓練 paradigm、reasoning model 的內部運作、本地可跑的選項與適用任務">3.8</a></li>
<li>隱私 / 跨雲端邊界判讀見 <a href="/blog/llm/06-security/cross-cloud-local-data-boundary/" data-link-title="6.4 跨雲端 / 本地的資料邊界" data-link-desc="個人 dev 場景下混用雲端 LLM 跟本地 LLM 時的 prompt 洩漏點：Continue.dev 多 provider 設定、隱私資料流、按敏感度分流的判讀">6.4</a></li>
<li>Benchmark 跟 in-house eval 的層次見 <a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14</a></li>
</ul>
]]></content:encoded></item><item><title>Hands-on：RAG / MCP 的資源 footprint</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-mcp-resources/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-mcp-resources/</guid><description>&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &amp;#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management 章&lt;/a> 講的是 Ollama / ComfyUI 等&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">推論伺服器&lt;/a>的 lifecycle。但&lt;strong>跑 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP&lt;/a> 應用&lt;/strong>比單純 chat 多吃幾倍資源——&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/embedding-model/" data-link-title="Embedding Model" data-link-desc="把文字轉成向量的模型：用於 codebase 索引與語意搜尋">embedding model&lt;/a>、chat model、index 檔、subprocess、tool 邏輯——而且不同階段（ingest vs query）的瓶頸不一樣。&lt;/p>
&lt;p>本篇紀錄 &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &amp;#43; cosine retrieval &amp;#43; Ollama chat、validating 4.0 RAG 原理">RAG demo&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo&lt;/a> 跑起來的實測資源 footprint、提供本地多模型並存的 baseline、給寫 production 應用前的 sanity check。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：M4 Pro 32 GB、Ollama 0.23.2、Python 3.14
&lt;strong>Corpus&lt;/strong>：本 blog 的 &lt;code>content/llm/&lt;/code>、71 個 markdown 檔、463 chunks&lt;/p>&lt;/blockquote>
&lt;h2 id="各階段資源-footprint">各階段資源 footprint&lt;/h2>
&lt;p>RAG / MCP 工作流通常分三階段、各自吃不同資源：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>階段&lt;/th>
 &lt;th>主要資源消耗&lt;/th>
 &lt;th>持續時間&lt;/th>
 &lt;th>是否常駐&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>RAG ingest&lt;/strong>&lt;/td>
 &lt;td>embedding model RAM + CPU + 磁碟寫&lt;/td>
 &lt;td>one-shot（corpus 更動時跑）&lt;/td>
 &lt;td>否&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>RAG query&lt;/strong>&lt;/td>
 &lt;td>index 載入 RAM + chat model RAM + GPU&lt;/td>
 &lt;td>per-request&lt;/td>
 &lt;td>retrieval index 常駐&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>MCP server&lt;/strong>&lt;/td>
 &lt;td>subprocess 永久跑、tool 呼叫時動態載資源&lt;/td>
 &lt;td>session 內常駐&lt;/td>
 &lt;td>是&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>不同階段的瓶頸不一樣、優化目標也不同。&lt;/p>
&lt;h2 id="rag-ingest-階段one-shot-但批次密集">RAG Ingest 階段：one-shot 但批次密集&lt;/h2>
&lt;p>跑 &lt;code>python3 scripts/rag-demo/ingest.py&lt;/code> 時：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Found 71 markdown files under content/llm
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> [10/71] 86 chunks in 4.5s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> [20/71] 181 chunks in 8.6s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> [70/71] 461 chunks in 22.2s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">Wrote 463 records to scripts/rag-demo/index.pkl (22.3s)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>實測資源消耗：&lt;/p></description><content:encoded><![CDATA[<p><a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management 章</a> 講的是 Ollama / ComfyUI 等<a href="/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">推論伺服器</a>的 lifecycle。但<strong>跑 <a href="/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG</a> / <a href="/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP</a> 應用</strong>比單純 chat 多吃幾倍資源——<a href="/blog/llm/knowledge-cards/embedding-model/" data-link-title="Embedding Model" data-link-desc="把文字轉成向量的模型：用於 codebase 索引與語意搜尋">embedding model</a>、chat model、index 檔、subprocess、tool 邏輯——而且不同階段（ingest vs query）的瓶頸不一樣。</p>
<p>本篇紀錄 <a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a> 跟 <a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo</a> 跑起來的實測資源 footprint、提供本地多模型並存的 baseline、給寫 production 應用前的 sanity check。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：M4 Pro 32 GB、Ollama 0.23.2、Python 3.14
<strong>Corpus</strong>：本 blog 的 <code>content/llm/</code>、71 個 markdown 檔、463 chunks</p></blockquote>
<h2 id="各階段資源-footprint">各階段資源 footprint</h2>
<p>RAG / MCP 工作流通常分三階段、各自吃不同資源：</p>
<table>
  <thead>
      <tr>
          <th>階段</th>
          <th>主要資源消耗</th>
          <th>持續時間</th>
          <th>是否常駐</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RAG ingest</strong></td>
          <td>embedding model RAM + CPU + 磁碟寫</td>
          <td>one-shot（corpus 更動時跑）</td>
          <td>否</td>
      </tr>
      <tr>
          <td><strong>RAG query</strong></td>
          <td>index 載入 RAM + chat model RAM + GPU</td>
          <td>per-request</td>
          <td>retrieval index 常駐</td>
      </tr>
      <tr>
          <td><strong>MCP server</strong></td>
          <td>subprocess 永久跑、tool 呼叫時動態載資源</td>
          <td>session 內常駐</td>
          <td>是</td>
      </tr>
  </tbody>
</table>
<p>不同階段的瓶頸不一樣、優化目標也不同。</p>
<h2 id="rag-ingest-階段one-shot-但批次密集">RAG Ingest 階段：one-shot 但批次密集</h2>
<p>跑 <code>python3 scripts/rag-demo/ingest.py</code> 時：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Found 71 markdown files under content/llm
</span></span><span class="line"><span class="ln">2</span><span class="cl">  [10/71] 86 chunks in 4.5s
</span></span><span class="line"><span class="ln">3</span><span class="cl">  [20/71] 181 chunks in 8.6s
</span></span><span class="line"><span class="ln">4</span><span class="cl">  ...
</span></span><span class="line"><span class="ln">5</span><span class="cl">  [70/71] 461 chunks in 22.2s
</span></span><span class="line"><span class="ln">6</span><span class="cl">Wrote 463 records to scripts/rag-demo/index.pkl (22.3s)</span></span></code></pre></div><p>實測資源消耗：</p>
<table>
  <thead>
      <tr>
          <th>資源</th>
          <th>數字</th>
          <th>為什麼</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RAM（峰值）</td>
          <td>~600 MB</td>
          <td>nomic-embed-text 模型 (274 MB) + Python runtime + 累積 records (~200 MB)</td>
      </tr>
      <tr>
          <td>磁碟寫</td>
          <td><code>index.pkl</code> ~3.7 MB</td>
          <td>463 records、每筆含 chunk text + 768-dim float embedding</td>
      </tr>
      <tr>
          <td>CPU + GPU</td>
          <td>Ollama 推 embedding、Apple Silicon Metal backend</td>
          <td>22 秒處理 463 個 chunk、平均 ~21 chunk/sec</td>
      </tr>
      <tr>
          <td>網路</td>
          <td>0</td>
          <td>完全本地推論</td>
      </tr>
  </tbody>
</table>
<p><strong>Ingest 階段的特性</strong>：</p>
<ul>
<li><strong>One-shot</strong>：corpus 不變不用重跑、index 寫一次永久用。</li>
<li><strong>吃 CPU 多於 RAM</strong>：產生 embedding 是 forward pass、瓶頸在 GPU 算力、RAM 沒太大壓力。</li>
<li><strong>磁碟寫小</strong>：每 chunk 約 8 KB（text 部分 ~5 KB + embedding 768 floats × 4 bytes = ~3 KB）、463 chunks 總共 ~3.7 MB。</li>
<li><strong>可平行</strong>：sequential <code>embed(chunk)</code> 是最慢實作、用 batching API（如果 Ollama 支援）或多 worker、能快 5-10x。</li>
</ul>
<p><strong>規模 extrapolation</strong>：</p>
<table>
  <thead>
      <tr>
          <th>Corpus 大小</th>
          <th>預估 ingest 時間</th>
          <th>index.pkl 大小</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>71 docs / 463 chunks（本 blog）</td>
          <td>22 秒</td>
          <td>3.7 MB</td>
      </tr>
      <tr>
          <td>1000 docs / ~7000 chunks（中型 codebase）</td>
          <td>~5 分鐘</td>
          <td>~55 MB</td>
      </tr>
      <tr>
          <td>10000 docs / ~70000 chunks（大型 codebase）</td>
          <td>~50 分鐘</td>
          <td>~550 MB</td>
      </tr>
      <tr>
          <td>100K docs / ~700K chunks（公司 wiki）</td>
          <td>~8 小時</td>
          <td>~5.5 GB</td>
      </tr>
  </tbody>
</table>
<p>10K docs 以上就應該考慮：</p>
<ul>
<li><a href="/blog/llm/knowledge-cards/batching/" data-link-title="Batching" data-link-desc="多 request 一起跑、攤平 model load 成本：production LLM inference 的核心優化、決定 throughput vs latency 取捨">Batching</a> embedding（單次 request 送 50 個 chunks）</li>
<li>並行 worker（Python multiprocessing、4-8 worker）</li>
<li>換 <a href="/blog/llm/knowledge-cards/vector-database/" data-link-title="Vector Database" data-link-desc="為高維向量 (embedding) 設計的儲存 &#43; 近似最近鄰 (ANN) 檢索系統：RAG 從 prototype 跨到 production 的關鍵元件">vector database</a>（避免把全部資料用 pickle 塞 RAM）</li>
</ul>
<h2 id="rag-query-階段retrieval-加-generation">RAG Query 階段：retrieval 加 generation</h2>
<p>跑 <code>python3 scripts/rag-demo/query.py --show-retrieved &quot;問題&quot;</code> 時：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Loaded 463 chunks from scripts/rag-demo/index.pkl
</span></span><span class="line"><span class="ln">2</span><span class="cl">=== Retrieved chunks ===
</span></span><span class="line"><span class="ln">3</span><span class="cl">  0.870  llm/knowledge-cards/transformer.md#chunk2
</span></span><span class="line"><span class="ln">4</span><span class="cl">  ...
</span></span><span class="line"><span class="ln">5</span><span class="cl">（LLM 生成 response）</span></span></code></pre></div><p>實測資源消耗（單次 query）：</p>
<table>
  <thead>
      <tr>
          <th>階段</th>
          <th>RAM 增量</th>
          <th>時間</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>載 index.pkl 到 RAM</td>
          <td>3.7 MB（小 corpus）/ MB 級（大 corpus）</td>
          <td>&lt; 1 秒</td>
      </tr>
      <tr>
          <td>embed query</td>
          <td>0（已載入的 nomic-embed-text）</td>
          <td>200 ms</td>
      </tr>
      <tr>
          <td>cosine over 463 chunks</td>
          <td>純 Python 計算、暫時用 ~10 MB</td>
          <td>50 ms</td>
      </tr>
      <tr>
          <td>載 chat model（gemma3:1b）</td>
          <td>~1 GB（首次）/ 0（已 cached）</td>
          <td>5-10 秒（首次）/ 0（cached）</td>
      </tr>
      <tr>
          <td>生成 response</td>
          <td>0 額外</td>
          <td>5-30 秒（看 model + prompt 長度）</td>
      </tr>
  </tbody>
</table>
<p><strong>Query 階段的特性</strong>：</p>
<ul>
<li><strong>第一次 cold start</strong>：要載 chat model 進 RAM、5-10 秒首字延遲。</li>
<li><strong>後續 query 都快</strong>：embedding model + chat model 都在 RAM、retrieval 毫秒級、只剩 generation 時間。</li>
<li><strong>RAM 占用 = embedding model + chat model + index</strong>：
<ul>
<li>463 chunks: 274 MB + chat model + 3.7 MB ≈ chat model + 280 MB</li>
<li>100K chunks: 274 MB + chat model + ~800 MB 進 RAM、加上 mmap pickle 額外開銷</li>
</ul>
</li>
<li><strong>瓶頸是 chat model</strong>：retrieval 部分快、瓶頸完全在 generation。</li>
</ul>
<p><strong>多模型並存</strong>（embedding + chat）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 看當前 RAM 占用</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># NAME                       SIZE      UNTIL</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># nomic-embed-text:latest    274 MB    4 minutes from now</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># gemma3:4b                  5.5 GB    4 minutes from now</span></span></span></code></pre></div><p>兩個 model 都載入時、Ollama RAM 占用約 6 GB。Ollama 的 <code>OLLAMA_KEEP_ALIVE</code>（預設 5 分鐘）會 idle 後分別 unload 兩個 model。</p>
<p><strong>規模 sanity check</strong>：</p>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>RAM 需求</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>純 chat（gemma3:1b）</td>
          <td>~1 GB</td>
      </tr>
      <tr>
          <td>RAG with gemma3:1b + nomic-embed-text + 小 index</td>
          <td>~1.5 GB</td>
      </tr>
      <tr>
          <td>RAG with gemma3:4b + nomic-embed-text + 中型 index</td>
          <td>~6 GB</td>
      </tr>
      <tr>
          <td>RAG with gemma4:31b + nomic-embed-text + 大 index</td>
          <td>~20 GB</td>
      </tr>
  </tbody>
</table>
<p>跑 RAG 比 chat 額外要 ~300-1000 MB（embedding model + index）、不會太重。</p>
<h2 id="mcp-server-階段subprocess-常駐">MCP Server 階段：subprocess 常駐</h2>
<p>跑 <code>python3 scripts/mcp-demo/test_client.py</code> 時、client 會 spawn <code>blog_mcp_server.py</code> 當 child process。</p>
<p>實測：</p>
<table>
  <thead>
      <tr>
          <th>資源</th>
          <th>數字</th>
          <th>備註</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Subprocess RAM</td>
          <td>~50 MB</td>
          <td>Python runtime + index.pkl mmap</td>
      </tr>
      <tr>
          <td>stdio pipe 數量</td>
          <td>3（stdin、stdout、stderr）</td>
          <td>每 spawn 一個 server 都要 3 FD</td>
      </tr>
      <tr>
          <td>持續時間</td>
          <td>client 在跑就在跑</td>
          <td>client 結束時 SIGPIPE 自動結束 server</td>
      </tr>
  </tbody>
</table>
<p><strong>MCP server 的特性</strong>：</p>
<ul>
<li><strong>每個 client spawn 一個 server</strong>：Claude Desktop 開 5 個 MCP server、就有 5 個 Python subprocess。</li>
<li><strong>Index lazy load</strong>：本 demo <code>load_index()</code> 第一次 call 才 read pickle、之後 cached。Cold start 第一次 tool call 稍慢。</li>
<li><strong>Process lifecycle 在 client 端</strong>：client 死了、stdin EOF、server 自然結束。Client 沒清乾淨 spawn 多次就 leak process。</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 看當前所有 MCP server</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ps aux <span class="p">|</span> grep blog_mcp_server <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 如果 client crash 留下 zombie：</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">pkill -f <span class="s2">&#34;blog_mcp_server.py&#34;</span></span></span></code></pre></div><p><strong>多 MCP server 並存</strong>（如 Claude Desktop 接 git server + filesystem server + custom server）：</p>
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>RAM</th>
          <th>主要負載</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>git MCP server</td>
          <td>~30 MB</td>
          <td>shell 呼叫</td>
      </tr>
      <tr>
          <td>filesystem MCP server</td>
          <td>~30 MB</td>
          <td>fs 操作</td>
      </tr>
      <tr>
          <td>blog_mcp_server（本 demo）</td>
          <td>~50 MB（含 index）</td>
          <td>embedding + retrieval</td>
      </tr>
      <tr>
          <td>5 個 server 同時</td>
          <td>~200 MB</td>
          <td>累積</td>
      </tr>
  </tbody>
</table>
<p>200 MB 在 32 GB Mac 上不顯眼、但 16 GB Mac + 多 MCP server + 大 chat model 就可能擠到。</p>
<h2 id="rag--mcp-整合完整應用-stack">RAG + MCP 整合：完整應用 stack</h2>
<p>實際應用會疊起來：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">User 在 Claude Desktop 打字
</span></span><span class="line"><span class="ln">2</span><span class="cl">  ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">Claude Desktop (~200 MB)
</span></span><span class="line"><span class="ln">4</span><span class="cl">  ↓ MCP stdio
</span></span><span class="line"><span class="ln">5</span><span class="cl">blog_mcp_server.py (~50 MB)
</span></span><span class="line"><span class="ln">6</span><span class="cl">  ↓ HTTP /api/embeddings + /v1/chat/completions
</span></span><span class="line"><span class="ln">7</span><span class="cl">Ollama daemon (~200 MB)
</span></span><span class="line"><span class="ln">8</span><span class="cl">  ↓ load
</span></span><span class="line"><span class="ln">9</span><span class="cl">nomic-embed-text 模型 (~274 MB) + 主 chat model (~6 GB)</span></span></code></pre></div><p>整體 RAM 占用範圍：</p>
<table>
  <thead>
      <tr>
          <th>配置</th>
          <th>估算</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Minimal（gemma3:1b + 小 index）</td>
          <td>~1.7 GB</td>
      </tr>
      <tr>
          <td>Standard（gemma3:4b + 中 index）</td>
          <td>~6.5 GB</td>
      </tr>
      <tr>
          <td>Heavy（gemma4:31b + 大 index + 多 MCP server）</td>
          <td>~22 GB</td>
      </tr>
  </tbody>
</table>
<p>跟 <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">resource-management 章</a> 比、RAG / MCP 加 ~500 MB-1 GB overhead 在 chat 之上、是合理的 tradeoff（換來 retrieval + tool use 能力）。</p>
<h2 id="各資源類型的關鍵指標">各資源類型的關鍵指標</h2>
<p>整理三 dimension 的關鍵指標跟監控方式：</p>
<h3 id="ram">RAM</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 看 Ollama 載了哪些 model</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 看所有 LLM-related process</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">ps aux <span class="p">|</span> grep -E <span class="s2">&#34;ollama|comfyui|mcp&#34;</span> <span class="p">|</span> grep -v grep <span class="p">|</span> awk <span class="s1">&#39;{print $4, $11, $12, $13}&#39;</span> <span class="p">|</span> sort -rn
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># 系統整體</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">vm_stat <span class="p">|</span> head -3</span></span></code></pre></div><p><strong>告警閾值</strong>：</p>
<ul>
<li>RAM 占用 &gt; 80% 系統總量：開始考慮 unload model 或關掉 ComfyUI</li>
<li>看到 swap 增加（<code>vm_stat | grep &quot;Swapouts&quot;</code>）：已經 swap、要立刻減少 model</li>
</ul>
<h3 id="磁碟">磁碟</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Ollama models 累積</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">du -sh ~/.ollama/models
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># RAG index 累積（多個 corpus）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">du -sh scripts/rag-demo/index*.pkl 2&gt;/dev/null
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># ComfyUI checkpoints / VAE / LoRA / etc</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">du -sh ~/Projects/ComfyUI/models/*</span></span></code></pre></div><p><strong>累積評估</strong>：</p>
<ul>
<li>Ollama: 每 model 1-20 GB、半年累積容易破 50 GB</li>
<li>RAG index: 每 100K chunks ~800 MB、多 corpus 累積要管</li>
<li>ComfyUI: 每 checkpoint 4-7 GB、加 LoRA / VAE / ControlNet 等可達 50+ GB</li>
</ul>
<h3 id="process--port">Process / Port</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 一鍵 audit 所有 LLM service</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="k">for</span> p in <span class="m">11434</span> <span class="m">1234</span> <span class="m">8080</span> <span class="m">8188</span> 8000<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;=== port </span><span class="nv">$p</span><span class="s2"> ===&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  lsof -i :<span class="nv">$p</span> 2&gt;/dev/null <span class="p">|</span> head -2
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="k">done</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># 找 zombie subprocess（沒 parent 的 mcp server）</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">ps aux <span class="p">|</span> grep <span class="s2">&#34;mcp_server&#34;</span> <span class="p">|</span> grep -v grep</span></span></code></pre></div><p><strong>告警訊號</strong>：</p>
<ul>
<li>同 port 兩個 process listen：明顯有 zombie、要 kill</li>
<li>多個 mcp_server PPID = 1（被 reparent 到 init）：原 client 死了沒清乾淨</li>
</ul>
<h2 id="rag-應用的長期累積管理">RAG 應用的長期累積管理</h2>
<p>跑超過幾週、會累積：</p>
<table>
  <thead>
      <tr>
          <th>累積物</th>
          <th>為什麼累積</th>
          <th>怎麼清</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multiple <code>index.pkl</code></td>
          <td>跑不同 corpus 各建 index、舊的沒刪</td>
          <td><code>find scripts -name 'index*.pkl' -mtime +30 -delete</code></td>
      </tr>
      <tr>
          <td>Ollama models</td>
          <td>試了不同 model 沒清</td>
          <td>看 <code>ollama list</code> modified 欄、<code>ollama rm</code> 不用的</td>
      </tr>
      <tr>
          <td>Python <code>__pycache__</code></td>
          <td>每次跑 script 累積</td>
          <td><code>.gitignore</code> 已包、本地 <code>find . -name __pycache__ -exec rm -rf {} +</code></td>
      </tr>
      <tr>
          <td>Embedding cache</td>
          <td>如果你寫了 embedding cache 機制</td>
          <td>各自清理策略</td>
      </tr>
  </tbody>
</table>
<p>清理 idiom：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 每月跑一次的 cleanup</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">llm-rag-cleanup<span class="o">()</span> <span class="o">{</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Old indexes (&gt;30 days):&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  find scripts -name <span class="s1">&#39;index*.pkl&#39;</span> -mtime +30 -ls
</span></span><span class="line"><span class="ln">5</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Ollama models (review):&#34;</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  ollama list
</span></span><span class="line"><span class="ln">7</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Python caches:&#34;</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">  find ~/Projects -name __pycache__ -type d <span class="p">|</span> head -10
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="o">}</span></span></span></code></pre></div><h2 id="跟-production-的差距預告">跟 production 的差距預告</h2>
<p>本篇紀錄的數字、是「single-user、single-machine、no concurrency」的 baseline。Production 場景多了幾個維度：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>本地</th>
          <th>Production</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>並發 user</td>
          <td>1</td>
          <td>10-10000</td>
      </tr>
      <tr>
          <td>Index 大小</td>
          <td>&lt; 100 MB</td>
          <td>TB 級</td>
      </tr>
      <tr>
          <td>Model serving</td>
          <td>Ollama 1 process</td>
          <td>vLLM / TGI / Triton 多 worker</td>
      </tr>
      <tr>
          <td>Vector storage</td>
          <td>pickle</td>
          <td>Pinecone / Weaviate / pgvector</td>
      </tr>
      <tr>
          <td>Latency 要求</td>
          <td>秒級 OK</td>
          <td>p50 &lt; 500ms、p99 &lt; 2s</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>一次性硬體</td>
          <td>$/request、$/token</td>
      </tr>
      <tr>
          <td>Observability</td>
          <td>tail log</td>
          <td>metrics / traces / dashboards</td>
      </tr>
      <tr>
          <td>失敗模式</td>
          <td>crash → 自己重啟</td>
          <td>99.9% uptime SLA</td>
      </tr>
  </tbody>
</table>
<p>Production 視角詳細展開見 <a href="/blog/llm/04-applications/production-resource-planning/" data-link-title="4.9 Production 部署的資源評估原理" data-link-desc="從本地單 user 到 production multi-tenant：concurrent users、cost model、observability、SLA、capacity planning 的設計取捨">4.9 Production 部署的資源評估原理</a>。</p>
<h2 id="何時這篇會過時">何時這篇會過時</h2>
<p><strong>不會過時的部分</strong>：</p>
<ul>
<li>三階段 footprint 分類（ingest / query / server）</li>
<li>RAM / 磁碟 / process 三 dimension 的監控指令</li>
<li>多模型並存的 RAM 預估方法</li>
<li>長期累積管理 idiom</li>
</ul>
<p><strong>會變的部分</strong>：</p>
<ul>
<li>具體 RAM / 磁碟數字（隨模型架構、量化方法演化）</li>
<li><code>OLLAMA_KEEP_ALIVE</code> 等具體環境變數名</li>
<li>哪些 vector DB 主流（會持續演化）</li>
</ul>
<p>讀的時候若 RAM 占用跟本篇對不上、可能是新 model 架構效率改變、用同樣方法量自己環境的 baseline 即可。</p>
<p>跟其他 hands-on 章節的關係：完整 hands-on 系列見 <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on 章節索引</a>、實作配對見 <a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a> 跟 <a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo</a>、Ollama / ComfyUI 共用的 lifecycle 管理見 <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management</a>、Apple Silicon 統一記憶體預算原理見 <a href="/blog/llm/00-foundations/hardware-memory-budget/" data-link-title="0.5 Apple Silicon 記憶體預算" data-link-desc="記憶體決定能跑什麼，Q4 量化下的可運作模型對照與系統保留">0.5 記憶體預算</a>。</p>
<h2 id="跑這篇實測的指令總結">跑這篇實測的指令總結</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. RAG ingest 階段 RAM 量</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ollama ps  <span class="c1"># 先看 baseline</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">python3 scripts/rag-demo/ingest.py <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="nv">INGEST_PID</span><span class="o">=</span><span class="nv">$!</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">ollama ps  <span class="c1"># 看 embedding model 載入後</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">vm_stat <span class="p">|</span> head -3
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="nb">wait</span> <span class="nv">$INGEST_PID</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 2. RAG query 階段 RAM 量</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">ollama ps  <span class="c1"># 看 idle 後 unload</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">python3 scripts/rag-demo/query.py --show-retrieved <span class="s2">&#34;test query&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">ollama ps  <span class="c1"># 看 chat model 載入</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># 3. MCP server 階段 process / RAM</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">python3 scripts/mcp-demo/test_client.py <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="nv">CLIENT_PID</span><span class="o">=</span><span class="nv">$!</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">sleep <span class="m">2</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">ps aux <span class="p">|</span> grep blog_mcp_server <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="nb">wait</span> <span class="nv">$CLIENT_PID</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"># 4. 完成釋放</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">ollama list <span class="p">|</span> tail -n +2 <span class="p">|</span> awk <span class="s1">&#39;{print $1}&#39;</span> <span class="p">|</span> xargs -I <span class="o">{}</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="se"></span>  curl -s http://localhost:11434/api/generate -d <span class="s2">&#34;{\&#34;model\&#34;:\&#34;{}\&#34;,\&#34;keep_alive\&#34;:0}&#34;</span></span></span></code></pre></div>]]></content:encoded></item><item><title>Hands-on Quickstart：clone repo 後跑通所有 demo</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/quickstart/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/quickstart/</guid><description>&lt;p>本篇是 hands-on 系列的&lt;strong>導讀&lt;/strong>——把分散在 &lt;code>ollama-setup&lt;/code> / &lt;code>rag-demo&lt;/code> / &lt;code>mcp-demo&lt;/code> / &lt;code>permission-boundary&lt;/code> 各章節的 setup 步驟整合成一條最短路徑、讓 clone repo 的人能在 15 分鐘內跑通所有 demo（&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP&lt;/a>、權限邊界三個 demo、RAG 是「retrieval 找相關內容 + LLM 回答」、MCP 是「LLM application ↔ tool server 的標準協議」）。&lt;/p>
&lt;p>每篇 hands-on 文章 focus 在「為什麼這樣設計」、本篇 focus 在「按順序跑通」。讀完想懂原理再進對應章節讀。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-12
&lt;strong>環境&lt;/strong>：macOS 14+、Apple Silicon、Ollama 0.23.2、Python 3.11+
&lt;strong>總時間&lt;/strong>：~15 分鐘（含 model 下載）
&lt;strong>磁碟需求&lt;/strong>：Step 1 ~ 4 約 ~5 GB（Ollama 200 MB + nomic-embed-text 274 MB + gemma3:1b 815 MB + room for index）；Step 5 ComfyUI 可選加 ~10 GB（SDXL base 模型）。
&lt;strong>適用平台&lt;/strong>：本快速路徑只在 Apple Silicon Mac 驗證過；Intel Mac / Linux 上 Ollama 仍可裝、但 GPU 加速跟 model tag 行為可能不同、實際以官方 release notes 為準。&lt;/p>&lt;/blockquote>
&lt;h2 id="適合誰讀">適合誰讀&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>你是&lt;/th>
 &lt;th>本篇對你&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>剛 clone 我的 blog repo、想跑 demo 試試看&lt;/td>
 &lt;td>&lt;strong>從本篇開始&lt;/strong>、按步驟做&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>想懂某個 demo 的設計取捨&lt;/td>
 &lt;td>跑通後再進 &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &amp;#43; cosine retrieval &amp;#43; Ollama chat、validating 4.0 RAG 原理">RAG demo&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/permission-boundary/" data-link-title="Hands-on：Ollama 改檔案 / 寫程式碼的權限邊界在哪" data-link-desc="四組對照實驗：Ollama 自己沒 FS / shell 權限、wrapper 才有；--dry-run / --confirm / --auto 三檔審查粒度的取捨">permission-boundary&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>想懂 Ollama / ComfyUI 安裝細節&lt;/td>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &amp;#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama setup&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/comfyui-setup/" data-link-title="Hands-on：安裝 ComfyUI &amp;#43; SDXL base" data-link-desc="git clone、venv、pip install requirements、SDXL safetensors 放哪、--listen 啟動 server、瀏覽器 workflow 驗證">ComfyUI setup&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>想看 production 怎麼想資源評估&lt;/td>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/llm/04-applications/production-resource-planning/" data-link-title="4.9 Production 部署的資源評估原理" data-link-desc="從本地單 user 到 production multi-tenant：concurrent users、cost model、observability、SLA、capacity planning 的設計取捨">4.9 Production resource planning&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="為什麼不是pre-builtclone-就能跑">為什麼不是「pre-built、clone 就能跑」&lt;/h2>
&lt;p>衍生產物（&lt;code>index.pkl&lt;/code>、&lt;code>__pycache__/&lt;/code>、Ollama model weights、即「跑出來的 cache / index / weight」、跟 source code 區別）刻意&lt;strong>不進 git&lt;/strong>、原因見 &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/artifact-management/" data-link-title="4.10 衍生產物管理原理：什麼進 git、什麼不該" data-link-desc="LLM 應用的 source / derived / external 三類產物對應 git / build cache / registry、與 production 部署的 reproducibility / cost / share 取捨">4.10 衍生產物管理原理&lt;/a>。所以 clone repo 後需要：&lt;/p></description><content:encoded><![CDATA[<p>本篇是 hands-on 系列的<strong>導讀</strong>——把分散在 <code>ollama-setup</code> / <code>rag-demo</code> / <code>mcp-demo</code> / <code>permission-boundary</code> 各章節的 setup 步驟整合成一條最短路徑、讓 clone repo 的人能在 15 分鐘內跑通所有 demo（<a href="/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG</a>、<a href="/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP</a>、權限邊界三個 demo、RAG 是「retrieval 找相關內容 + LLM 回答」、MCP 是「LLM application ↔ tool server 的標準協議」）。</p>
<p>每篇 hands-on 文章 focus 在「為什麼這樣設計」、本篇 focus 在「按順序跑通」。讀完想懂原理再進對應章節讀。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-12
<strong>環境</strong>：macOS 14+、Apple Silicon、Ollama 0.23.2、Python 3.11+
<strong>總時間</strong>：~15 分鐘（含 model 下載）
<strong>磁碟需求</strong>：Step 1 ~ 4 約 ~5 GB（Ollama 200 MB + nomic-embed-text 274 MB + gemma3:1b 815 MB + room for index）；Step 5 ComfyUI 可選加 ~10 GB（SDXL base 模型）。
<strong>適用平台</strong>：本快速路徑只在 Apple Silicon Mac 驗證過；Intel Mac / Linux 上 Ollama 仍可裝、但 GPU 加速跟 model tag 行為可能不同、實際以官方 release notes 為準。</p></blockquote>
<h2 id="適合誰讀">適合誰讀</h2>
<table>
  <thead>
      <tr>
          <th>你是</th>
          <th>本篇對你</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>剛 clone 我的 blog repo、想跑 demo 試試看</td>
          <td><strong>從本篇開始</strong>、按步驟做</td>
      </tr>
      <tr>
          <td>想懂某個 demo 的設計取捨</td>
          <td>跑通後再進 <a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a> / <a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo</a> / <a href="/blog/llm/01-local-llm-services/hands-on/permission-boundary/" data-link-title="Hands-on：Ollama 改檔案 / 寫程式碼的權限邊界在哪" data-link-desc="四組對照實驗：Ollama 自己沒 FS / shell 權限、wrapper 才有；--dry-run / --confirm / --auto 三檔審查粒度的取捨">permission-boundary</a></td>
      </tr>
      <tr>
          <td>想懂 Ollama / ComfyUI 安裝細節</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama setup</a> / <a href="/blog/llm/01-local-llm-services/hands-on/comfyui-setup/" data-link-title="Hands-on：安裝 ComfyUI &#43; SDXL base" data-link-desc="git clone、venv、pip install requirements、SDXL safetensors 放哪、--listen 啟動 server、瀏覽器 workflow 驗證">ComfyUI setup</a></td>
      </tr>
      <tr>
          <td>想看 production 怎麼想資源評估</td>
          <td><a href="/blog/llm/04-applications/production-resource-planning/" data-link-title="4.9 Production 部署的資源評估原理" data-link-desc="從本地單 user 到 production multi-tenant：concurrent users、cost model、observability、SLA、capacity planning 的設計取捨">4.9 Production resource planning</a></td>
      </tr>
  </tbody>
</table>
<h2 id="為什麼不是pre-builtclone-就能跑">為什麼不是「pre-built、clone 就能跑」</h2>
<p>衍生產物（<code>index.pkl</code>、<code>__pycache__/</code>、Ollama model weights、即「跑出來的 cache / index / weight」、跟 source code 區別）刻意<strong>不進 git</strong>、原因見 <a href="/blog/llm/04-applications/artifact-management/" data-link-title="4.10 衍生產物管理原理：什麼進 git、什麼不該" data-link-desc="LLM 應用的 source / derived / external 三類產物對應 git / build cache / registry、與 production 部署的 reproducibility / cost / share 取捨">4.10 衍生產物管理原理</a>。所以 clone repo 後需要：</p>
<ol>
<li>裝 Ollama daemon + 拉 model（一次性）</li>
<li>跑 <code>ingest.py</code> 建 RAG index（corpus 變動時重跑）</li>
<li>之後 demo 就能用</li>
</ol>
<p>本篇是這個流程的 step-by-step。</p>
<h2 id="step-1裝-ollama-daemonbrew-install-ollama--brew-services-start">Step 1：裝 Ollama daemon（<code>brew install ollama</code> + <code>brew services start</code>）</h2>
<blockquote>
<p>daemon = 常駐 background process、開機自動啟動、見 <a href="/blog/llm/knowledge-cards/launchd-service/" data-link-title="launchd Service" data-link-desc="macOS 原生的服務管理機制、把 process 註冊成自動啟動的 daemon 或 agent">launchd service 卡</a>。</p></blockquote>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew install ollama
</span></span><span class="line"><span class="ln">2</span><span class="cl">brew services start ollama</span></span></code></pre></div><p>驗證：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">curl -s http://localhost:11434/api/version
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># {&#34;version&#34;:&#34;0.x.x&#34;}</span></span></span></code></pre></div><p>詳細安裝跟 troubleshooting 見 <a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama setup 章節</a>。</p>
<h2 id="step-2拉-modelembed--chat-兩種角色">Step 2：拉 model（embed + chat 兩種角色）</h2>
<blockquote>
<p>為什麼要拉兩個 model：RAG 需要 embedding model 把文字壓成向量做語意比對、chat model 負責根據 retrieval 結果生成回答、兩者訓練目標不同、不能互通（見 <a href="/blog/llm/03-theoretical-foundations/embedding-spaces/" data-link-title="3.1 Embedding 空間" data-link-desc="token 怎麼變成向量、為什麼相似 token 在向量空間中靠近、embedding 是怎麼學出來的">3.1 embedding 空間</a>）。</p></blockquote>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Embedding model（RAG / MCP 都要、274 MB）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama pull nomic-embed-text
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Chat model（推薦從 1B 開始驗證、之後可換大）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">ollama pull gemma3:1b</span></span></code></pre></div><p>驗證：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ollama list
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># NAME                       SIZE      MODIFIED</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># gemma3:1b                  815 MB    ...</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># nomic-embed-text:latest    274 MB    ...</span></span></span></code></pre></div><p>選 chat model 大小的取捨見 <a href="/blog/llm/01-local-llm-services/model-selection-priority/" data-link-title="1.4 寫 code 場景的模型選型優先順序" data-link-desc="Gemma 4 31B MTP → Qwen3-Coder 30B → Qwen3 14B → gpt-oss 20B 的取捨與適用情境">1.4 模型選型優先順序</a>。本 quickstart 用 1B 主要驗證流程跑通；長段 daily use（需要 follow 多段格式指令、複雜推理）建議 4B / 8B 起跳（見 <a href="/blog/llm/01-local-llm-services/hands-on/instruction-following-test/" data-link-title="Hands-on：跨資料夾風格 follow 任務的模型對比" data-link-desc="1B / 4B / 8B / 跨代 4B 在「讀風格參考、follow 既有格式、寫新章節」任務上的 structural metrics 對比、揭示 model size 不是唯一因素">instruction-following-test</a>）、極短句驗證 / 簡單問答 1B 也可。本系列預設用 <a href="/blog/llm/knowledge-cards/instruction-tuned/" data-link-title="Instruction-Tuned Model" data-link-desc="經過指令微調的模型：會跟著 prompt 走、回答使用者問題">instruction-tuned model</a> 變體（tag 含 <code>:Xb</code> 不含 <code>-base</code>）、適合對話 / 寫 code。</p>
<h2 id="step-3建-rag-index跑-ingestpy">Step 3：建 RAG index（跑 <code>ingest.py</code>）</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> /path/to/blog
</span></span><span class="line"><span class="ln">2</span><span class="cl">python3 scripts/rag-demo/ingest.py</span></span></code></pre></div><p>預期輸出：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Found 71 markdown files under content/llm
</span></span><span class="line"><span class="ln">2</span><span class="cl">  [10/71] 86 chunks in 4.5s
</span></span><span class="line"><span class="ln">3</span><span class="cl">  ...
</span></span><span class="line"><span class="ln">4</span><span class="cl">Wrote 463 records to scripts/rag-demo/index.pkl (22.3s)</span></span></code></pre></div><p>實際數字看你的 blog content 量。Index file 在 <code>scripts/rag-demo/index.pkl</code>、3-50 MB 不等。</p>
<p>詳細的 chunking 策略、embedding 設計、為什麼 pickle、見 <a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo 章節</a>。</p>
<h2 id="step-4跑-rag--mcp--permission-demo">Step 4：跑 RAG / MCP / permission demo</h2>
<p>完成 step 1-3 後、四個 demo 都能跑了：</p>
<h3 id="rag-demo語意搜尋--llm-回答">RAG demo（語意搜尋 + LLM 回答）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python3 scripts/rag-demo/query.py --show-retrieved <span class="s2">&#34;你的問題&#34;</span></span></span></code></pre></div><p>例：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python3 scripts/rag-demo/query.py --show-retrieved <span class="s2">&#34;什麼是 MCP？&#34;</span></span></span></code></pre></div><p>預期看到 retrieved chunks（含相似度跟來源 path）+ LLM 用這些 context 生的答案。</p>
<h3 id="mcp-demostdio-json-rpc-server">MCP demo（stdio JSON-RPC server）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">python3 scripts/mcp-demo/test_client.py</span></span></code></pre></div><p>預期看到 5 個階段的 JSON-RPC 對話：initialize / tools/list / tools/call (search_blog) / tools/call (read_chunk) / error。</p>
<h3 id="permission-boundary-demollm-mediated-file-edit">Permission boundary demo（LLM-mediated file edit）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 備份要試的檔案</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">cp content/llm/knowledge-cards/token.md /tmp/token-orig.md
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># Dry-run（預設、不寫檔、印 diff）</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">python3 scripts/permission-demo/edit_with_llm.py <span class="se">\
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="se"></span>  content/llm/knowledge-cards/token.md <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  <span class="s2">&#34;加一句說明&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 還原（如果剛剛沒用 dry-run）</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">cp /tmp/token-orig.md content/llm/knowledge-cards/token.md</span></span></code></pre></div><p>詳細的 <code>--dry-run</code> / <code>--confirm</code> / <code>--auto</code> 三種 mode 取捨見 <a href="/blog/llm/01-local-llm-services/hands-on/permission-boundary/" data-link-title="Hands-on：Ollama 改檔案 / 寫程式碼的權限邊界在哪" data-link-desc="四組對照實驗：Ollama 自己沒 FS / shell 權限、wrapper 才有；--dry-run / --confirm / --auto 三檔審查粒度的取捨">Permission boundary 章節</a>。</p>
<h2 id="step-5可選comfyui-text-to-image-demo">Step 5（可選）：ComfyUI text-to-image demo</h2>
<p>需要額外裝 ComfyUI + 拉 SDXL model（~10 GB 磁碟）、流程獨立：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 跟 step 1 平行的軌道、見 ComfyUI setup 章節</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">cd</span> ~/Projects
</span></span><span class="line"><span class="ln">3</span><span class="cl">git clone --depth <span class="m">1</span> https://github.com/comfyanonymous/ComfyUI.git
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="nb">cd</span> ComfyUI
</span></span><span class="line"><span class="ln">5</span><span class="cl">python3 -m venv venv
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="nb">source</span> venv/bin/activate
</span></span><span class="line"><span class="ln">7</span><span class="cl">pip install -r requirements.txt
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"># 下載 SDXL base：~/Projects/ComfyUI/models/checkpoints/</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"># 見 ComfyUI setup 章節指令</span></span></span></code></pre></div><p>啟動 + 跑 generation：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">cd</span> ~/Projects/ComfyUI <span class="o">&amp;&amp;</span> <span class="nb">source</span> venv/bin/activate <span class="o">&amp;&amp;</span> nohup python main.py &gt; /tmp/comfyui.log 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 等 server ready</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="k">until</span> curl -s -o /dev/null -w <span class="s2">&#34;%{http_code}&#34;</span> http://127.0.0.1:8188/ <span class="p">|</span> grep -q 200<span class="p">;</span> <span class="k">do</span> sleep 2<span class="p">;</span> <span class="k">done</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 跑 generation（用 repo 內的 script）</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="nb">cd</span> /path/to/blog
</span></span><span class="line"><span class="ln">7</span><span class="cl">python3 scripts/comfyui-test/generate.py --steps <span class="m">15</span></span></span></code></pre></div><p>詳細裝法 + workflow JSON 解讀見 <a href="/blog/llm/01-local-llm-services/hands-on/comfyui-setup/" data-link-title="Hands-on：安裝 ComfyUI &#43; SDXL base" data-link-desc="git clone、venv、pip install requirements、SDXL safetensors 放哪、--listen 啟動 server、瀏覽器 workflow 驗證">ComfyUI setup 章節</a>。</p>
<h2 id="cleanup完事釋放資源">Cleanup（完事釋放資源）</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 停 Ollama daemon</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">brew services stop ollama
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># kill ComfyUI（如果有跑）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">pkill -9 -f <span class="s2">&#34;ComfyUI/main.py&#34;</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># 清 build artifact（可選、可重建）</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">rm -f scripts/rag-demo/index.pkl
</span></span><span class="line"><span class="ln">9</span><span class="cl">find scripts -name __pycache__ -type d -exec rm -rf <span class="o">{}</span> +</span></span></code></pre></div><p>詳細的 resource lifecycle 跟 cleanup idiom 見 <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management 章節</a>。</p>
<h2 id="跑通後該往哪讀">跑通後該往哪讀</h2>
<table>
  <thead>
      <tr>
          <th>想懂什麼</th>
          <th>讀哪</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>「RAG 為什麼 retrieval 對 / generation 弱」</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a></td>
      </tr>
      <tr>
          <td>「MCP wire protocol 細節」</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo</a></td>
      </tr>
      <tr>
          <td>「為什麼 LLM 寫 <code>rm -rf</code> 不會真的執行」</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/permission-boundary/" data-link-title="Hands-on：Ollama 改檔案 / 寫程式碼的權限邊界在哪" data-link-desc="四組對照實驗：Ollama 自己沒 FS / shell 權限、wrapper 才有；--dry-run / --confirm / --auto 三檔審查粒度的取捨">Permission boundary</a></td>
      </tr>
      <tr>
          <td>「不同 model 在 instruction following 上的差距」</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/instruction-following-test/" data-link-title="Hands-on：跨資料夾風格 follow 任務的模型對比" data-link-desc="1B / 4B / 8B / 跨代 4B 在「讀風格參考、follow 既有格式、寫新章節」任務上的 structural metrics 對比、揭示 model size 不是唯一因素">Instruction following test</a></td>
      </tr>
      <tr>
          <td>「跑 demo 占多少 RAM、怎麼釋放」</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management</a> + <a href="/blog/llm/01-local-llm-services/hands-on/rag-mcp-resources/" data-link-title="Hands-on：RAG / MCP 的資源 footprint" data-link-desc="RAG ingest / query / MCP server 三階段的 RAM / 磁碟 / process 實測、多模型並存的 RAM 衝突、本地 LLM 跑 RAG 跟單純 chat 的差異">RAG/MCP 資源 footprint</a></td>
      </tr>
      <tr>
          <td>「production 部署該怎麼想」</td>
          <td><a href="/blog/llm/04-applications/production-resource-planning/" data-link-title="4.9 Production 部署的資源評估原理" data-link-desc="從本地單 user 到 production multi-tenant：concurrent users、cost model、observability、SLA、capacity planning 的設計取捨">4.9 Production resource planning</a></td>
      </tr>
      <tr>
          <td>「什麼該進 git、什麼不該」</td>
          <td><a href="/blog/llm/04-applications/artifact-management/" data-link-title="4.10 衍生產物管理原理：什麼進 git、什麼不該" data-link-desc="LLM 應用的 source / derived / external 三類產物對應 git / build cache / registry、與 production 部署的 reproducibility / cost / share 取捨">4.10 衍生產物管理原理</a></td>
      </tr>
  </tbody>
</table>
<h2 id="跑不過時">跑不過時</h2>
<table>
  <thead>
      <tr>
          <th>症狀</th>
          <th>對應章節</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>ollama: command not found</code></td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama setup § 常見前置設定問題</a></td>
      </tr>
      <tr>
          <td><code>curl http://localhost:11434/api/version</code> 沒回應</td>
          <td>同上</td>
      </tr>
      <tr>
          <td><code>python3 ingest.py</code> 報 HTTP error</td>
          <td>確認 Ollama daemon 跑著、nomic-embed-text 已 pull</td>
      </tr>
      <tr>
          <td>RAG retrieval 結果都不相關</td>
          <td><a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG § Retrieval 失敗的根本原因</a></td>
      </tr>
      <tr>
          <td>MCP test_client 卡住</td>
          <td><a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo § subprocess 跟 bufsize</a></td>
      </tr>
      <tr>
          <td>一切都不對</td>
          <td><a href="/blog/llm/01-local-llm-services/troubleshooting/" data-link-title="1.7 排錯方法論：用三層架構做故障定位" data-link-desc="故障定位的分層思考、症狀到層級的對應反射、log 在三層的角色差異、最小可重現的縮減策略">1.7 排錯方法論</a></td>
      </tr>
  </tbody>
</table>
<h2 id="何時這篇會過時">何時這篇會過時</h2>
<p><strong>會變的部分</strong>：</p>
<ul>
<li><code>brew install ollama</code> 流程（macOS 跟 brew 演化）</li>
<li><code>ollama pull</code> 的具體 model tag（model 會新陳代謝）</li>
<li>Python 版本相容性（3.11 → 3.14 各有 quirk）</li>
</ul>
<p><strong>不會過時的部分</strong>：</p>
<ul>
<li>4 步驟的順序（裝 daemon → 拉 model → 建 index → 跑 demo）是 RAG / MCP / 任何 LLM 應用的通用 setup pattern</li>
<li>衍生產物（index、cache）不進 git 的設計取捨</li>
<li>Cleanup 步驟跟釋放邏輯</li>
</ul>
<p>跑指令時報錯先看 step 對應章節的 troubleshooting section、再 Google 或開 issue。</p>
]]></content:encoded></item><item><title>Hands-on：安裝 Ollama + 拉第一個 Gemma 模型</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/ollama-setup/</link><pubDate>Mon, 11 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/ollama-setup/</guid><description>&lt;p>本篇紀錄在 Apple Silicon Mac 上裝 Ollama 並拉一個小模型驗證的完整流程。指令在 macOS 14 (Sonoma) / Homebrew 提供的環境下驗證。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>驗證日期&lt;/strong>：2026-05-11
&lt;strong>Ollama 版本&lt;/strong>：0.23.2
&lt;strong>示範模型&lt;/strong>：&lt;code>gemma3:1b&lt;/code>（約 815 MB、選最小可運行的 Gemma 變體當驗證對象）&lt;/p>&lt;/blockquote>
&lt;h2 id="前置設定">前置設定&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>項目&lt;/th>
 &lt;th>檢查指令&lt;/th>
 &lt;th>預期&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>macOS 版本&lt;/td>
 &lt;td>&lt;code>sw_vers -productVersion&lt;/code>&lt;/td>
 &lt;td>14.x 或更新&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Apple Silicon&lt;/td>
 &lt;td>&lt;code>uname -m&lt;/code>&lt;/td>
 &lt;td>&lt;code>arm64&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Homebrew&lt;/td>
 &lt;td>&lt;code>brew --version&lt;/code>&lt;/td>
 &lt;td>4.x（任何近期版）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>磁碟空間&lt;/td>
 &lt;td>&lt;code>df -h ~&lt;/code>&lt;/td>
 &lt;td>至少 3 GB 剩餘給 runtime + 1B 模型&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>port 11434&lt;/td>
 &lt;td>&lt;code>lsof -i :11434&lt;/code>&lt;/td>
 &lt;td>無輸出（port 沒被佔）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>表中 &lt;code>brew --version&lt;/code> 這關若還沒過、代表 Homebrew 沒裝。新機從零的安裝順序（Homebrew、PATH、bash）見 &lt;a href="https://tarrragon.github.io/blog/other/macos-%E6%96%B0%E6%A9%9F%E5%9F%BA%E7%A4%8E%E5%BB%BA%E8%A8%AD%E5%A5%97%E4%BB%B6%E7%AE%A1%E7%90%86%E8%88%87%E5%80%8B%E4%BA%BA-bin-%E7%9A%84%E8%A8%AD%E5%AE%9A%E9%A0%86%E5%BA%8F/" data-link-title="macOS 新機基礎建設：套件管理與個人 bin 的設定順序" data-link-desc="重灌或換機後底層基礎建設的依賴順序，免得後面工具裝不起來或路徑互相找不到。">macOS 新機基礎建設&lt;/a>。&lt;/p>
&lt;p>選 1B 模型只是為了驗證流程、能力很弱、實際寫 code 場景請用 14B / 31B 級。模型大小跟記憶體 / 磁碟對應關係見 &lt;a href="https://tarrragon.github.io/blog/llm/00-foundations/hardware-memory-budget/" data-link-title="0.5 Apple Silicon 記憶體預算" data-link-desc="記憶體決定能跑什麼，Q4 量化下的可運作模型對照與系統保留">0.5 Apple Silicon 記憶體預算&lt;/a>。&lt;/p>
&lt;h2 id="安裝-ollama">安裝 Ollama&lt;/h2>
&lt;p>用 Homebrew 安裝、是 macOS 上最直接的路徑：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">brew install ollama&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>執行時間在 broadband 大約 30 秒到 2 分鐘、視 dependency cache 是否已有（Ollama 依賴 mlx-c 等 Apple Silicon 加速函式庫、首次裝較久）。&lt;/p>
&lt;p>裝完看到的 caveat 訊息：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">To start ollama now and restart at login:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> brew services start ollama
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">Or, if you don&amp;#39;t want/need a background service you can just run:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> OLLAMA_FLASH_ATTENTION=&amp;#34;1&amp;#34; OLLAMA_KV_CACHE_TYPE=&amp;#34;q8_0&amp;#34; /opt/homebrew/opt/ollama/bin/ollama serve&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>兩種啟動模式：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>launchd service&lt;/strong>（推薦日常用）：開機自動啟動、跑在背景。&lt;/li>
&lt;li>&lt;strong>前景手動跑&lt;/strong>：terminal 開著、關掉就停。&lt;/li>
&lt;/ul>
&lt;p>驗證 binary 路徑：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">which ollama
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1"># 應該回 /opt/homebrew/bin/ollama&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="啟動-ollama-service">啟動 Ollama Service&lt;/h2>
&lt;p>選 launchd service 模式：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">brew services start ollama&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>預期輸出：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">==&amp;gt; Successfully started `ollama` (label: homebrew.mxcl.ollama)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>這個動作做兩件事：&lt;/p>
&lt;ol>
&lt;li>註冊一個 launchd plist（macOS 開機自啟動 / 背景服務的設定檔、見 &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/launchd-service/" data-link-title="launchd Service" data-link-desc="macOS 原生的服務管理機制、把 process 註冊成自動啟動的 daemon 或 agent">launchd-service 卡片&lt;/a>）到 &lt;code>~/Library/LaunchAgents/homebrew.mxcl.ollama.plist&lt;/code>。&lt;/li>
&lt;li>立刻啟動 ollama serve、之後重開機自動啟動。&lt;/li>
&lt;/ol>
&lt;p>驗證 server 真的在跑：&lt;/p></description><content:encoded><![CDATA[<p>本篇紀錄在 Apple Silicon Mac 上裝 Ollama 並拉一個小模型驗證的完整流程。指令在 macOS 14 (Sonoma) / Homebrew 提供的環境下驗證。</p>
<blockquote>
<p><strong>驗證日期</strong>：2026-05-11
<strong>Ollama 版本</strong>：0.23.2
<strong>示範模型</strong>：<code>gemma3:1b</code>（約 815 MB、選最小可運行的 Gemma 變體當驗證對象）</p></blockquote>
<h2 id="前置設定">前置設定</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>檢查指令</th>
          <th>預期</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>macOS 版本</td>
          <td><code>sw_vers -productVersion</code></td>
          <td>14.x 或更新</td>
      </tr>
      <tr>
          <td>Apple Silicon</td>
          <td><code>uname -m</code></td>
          <td><code>arm64</code></td>
      </tr>
      <tr>
          <td>Homebrew</td>
          <td><code>brew --version</code></td>
          <td>4.x（任何近期版）</td>
      </tr>
      <tr>
          <td>磁碟空間</td>
          <td><code>df -h ~</code></td>
          <td>至少 3 GB 剩餘給 runtime + 1B 模型</td>
      </tr>
      <tr>
          <td>port 11434</td>
          <td><code>lsof -i :11434</code></td>
          <td>無輸出（port 沒被佔）</td>
      </tr>
  </tbody>
</table>
<p>表中 <code>brew --version</code> 這關若還沒過、代表 Homebrew 沒裝。新機從零的安裝順序（Homebrew、PATH、bash）見 <a href="/blog/other/macos-%E6%96%B0%E6%A9%9F%E5%9F%BA%E7%A4%8E%E5%BB%BA%E8%A8%AD%E5%A5%97%E4%BB%B6%E7%AE%A1%E7%90%86%E8%88%87%E5%80%8B%E4%BA%BA-bin-%E7%9A%84%E8%A8%AD%E5%AE%9A%E9%A0%86%E5%BA%8F/" data-link-title="macOS 新機基礎建設：套件管理與個人 bin 的設定順序" data-link-desc="重灌或換機後底層基礎建設的依賴順序，免得後面工具裝不起來或路徑互相找不到。">macOS 新機基礎建設</a>。</p>
<p>選 1B 模型只是為了驗證流程、能力很弱、實際寫 code 場景請用 14B / 31B 級。模型大小跟記憶體 / 磁碟對應關係見 <a href="/blog/llm/00-foundations/hardware-memory-budget/" data-link-title="0.5 Apple Silicon 記憶體預算" data-link-desc="記憶體決定能跑什麼，Q4 量化下的可運作模型對照與系統保留">0.5 Apple Silicon 記憶體預算</a>。</p>
<h2 id="安裝-ollama">安裝 Ollama</h2>
<p>用 Homebrew 安裝、是 macOS 上最直接的路徑：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew install ollama</span></span></code></pre></div><p>執行時間在 broadband 大約 30 秒到 2 分鐘、視 dependency cache 是否已有（Ollama 依賴 mlx-c 等 Apple Silicon 加速函式庫、首次裝較久）。</p>
<p>裝完看到的 caveat 訊息：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">To start ollama now and restart at login:
</span></span><span class="line"><span class="ln">2</span><span class="cl">  brew services start ollama
</span></span><span class="line"><span class="ln">3</span><span class="cl">Or, if you don&#39;t want/need a background service you can just run:
</span></span><span class="line"><span class="ln">4</span><span class="cl">  OLLAMA_FLASH_ATTENTION=&#34;1&#34; OLLAMA_KV_CACHE_TYPE=&#34;q8_0&#34; /opt/homebrew/opt/ollama/bin/ollama serve</span></span></code></pre></div><p>兩種啟動模式：</p>
<ul>
<li><strong>launchd service</strong>（推薦日常用）：開機自動啟動、跑在背景。</li>
<li><strong>前景手動跑</strong>：terminal 開著、關掉就停。</li>
</ul>
<p>驗證 binary 路徑：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">which ollama
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 應該回 /opt/homebrew/bin/ollama</span></span></span></code></pre></div><h2 id="啟動-ollama-service">啟動 Ollama Service</h2>
<p>選 launchd service 模式：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew services start ollama</span></span></code></pre></div><p>預期輸出：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">==&gt; Successfully started `ollama` (label: homebrew.mxcl.ollama)</span></span></code></pre></div><p>這個動作做兩件事：</p>
<ol>
<li>註冊一個 launchd plist（macOS 開機自啟動 / 背景服務的設定檔、見 <a href="/blog/llm/knowledge-cards/launchd-service/" data-link-title="launchd Service" data-link-desc="macOS 原生的服務管理機制、把 process 註冊成自動啟動的 daemon 或 agent">launchd-service 卡片</a>）到 <code>~/Library/LaunchAgents/homebrew.mxcl.ollama.plist</code>。</li>
<li>立刻啟動 ollama serve、之後重開機自動啟動。</li>
</ol>
<p>驗證 server 真的在跑：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">curl -s http://localhost:11434/api/version</span></span></code></pre></div><p>預期回：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="p">{</span><span class="nt">&#34;version&#34;</span><span class="p">:</span><span class="s2">&#34;0.23.2&#34;</span><span class="p">}</span></span></span></code></pre></div><p>看到這個 JSON 就證明三件事：Ollama daemon 跑了、port 11434 通了、API 結構正確。</p>
<h2 id="拉第一個模型">拉第一個模型</h2>
<p>Ollama 用 <code>ollama pull</code> 從官方 registry 下載模型：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ollama pull gemma3:1b</span></span></code></pre></div><p>Gemma 3 1B 約 815 MB、broadband 約 1-2 分鐘下載。下載過程顯示多階段：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">pulling 7cd4618c1faf: 100% ▕██████████████████▏ 815 MB
</span></span><span class="line"><span class="ln">2</span><span class="cl">pulling e0a42594d802: 100% ▕██████████████████▏  358 B
</span></span><span class="line"><span class="ln">3</span><span class="cl">pulling dd084c7d92a3: 100% ▕██████████████████▏  8.4 KB
</span></span><span class="line"><span class="ln">4</span><span class="cl">pulling 3116c5225075: 100% ▕██████████████████▏   77 B
</span></span><span class="line"><span class="ln">5</span><span class="cl">pulling 120007c81bf8: 100% ▕██████████████████▏  492 B
</span></span><span class="line"><span class="ln">6</span><span class="cl">verifying sha256 digest
</span></span><span class="line"><span class="ln">7</span><span class="cl">writing manifest
</span></span><span class="line"><span class="ln">8</span><span class="cl">success</span></span></code></pre></div><p>幾個 hash blob 分別是：模型權重（最大那個）、tokenizer、template、license metadata 等。Ollama 把這些統一管理、放在 <code>~/.ollama/models/</code>。</p>
<p>驗證模型已下載：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ollama list</span></span></code></pre></div><p>預期：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">NAME         ID              SIZE      MODIFIED
</span></span><span class="line"><span class="ln">2</span><span class="cl">gemma3:1b    8648f39daa8f    815 MB    35 seconds ago</span></span></code></pre></div><h2 id="驗證-openai-相容-api">驗證 OpenAI 相容 API</h2>
<p>OpenAI 相容 API 是下游所有工具（IDE plugin、RAG pipeline、MCP server、<a href="/blog/llm/01-local-llm-services/vscode-continue-integration/" data-link-title="1.3 VS Code &#43; Continue.dev 整合" data-link-desc="安裝 Continue 擴充套件、config.json 設定、Cmd&#43;L 對話 / Cmd&#43;I 行內編輯快捷鍵">Continue.dev</a> 等）依賴的介面 contract、驗證它能正常回應、整個 stack 才走得通：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">curl -s http://localhost:11434/v1/chat/completions <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  -H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s1">    &#34;model&#34;: &#34;gemma3:1b&#34;,
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s1">    &#34;messages&#34;: [{&#34;role&#34;:&#34;user&#34;,&#34;content&#34;:&#34;Reply in one short sentence: what is 2+2?&#34;}],
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s1">    &#34;stream&#34;: false
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="s1">  }&#39;</span></span></span></code></pre></div><p>預期回 JSON、<code>choices[0].message.content</code> 是模型回答（如 <code>&quot;2 + 2 = 4&quot;</code>）。看到合理回答就證明：</p>
<ol>
<li>Ollama 跟模型權重對接好。</li>
<li>OpenAI 相容 API 格式正常（IDE plugin 可以接）。</li>
<li>推論流程整條通。</li>
</ol>
<p>常見的失敗回應跟下一步：</p>
<ul>
<li><strong><code>{&quot;error&quot;:&quot;model 'gemma3:1b' not found, try pulling it first&quot;}</code></strong>：先跑 <code>ollama pull gemma3:1b</code>、確認 <code>ollama list</code> 看到該 tag。</li>
<li><strong><code>curl: (7) Failed to connect to localhost port 11434: Connection refused</code></strong>：server 沒在跑、回 <code>brew services list</code> 看 status、若是 stopped 跑 <code>brew services start ollama</code>。</li>
<li><strong><code>{&quot;error&quot;:&quot;json: cannot unmarshal ...&quot;}</code></strong>：請求格式錯（例如 messages 寫成 string 不是 array）、檢查 JSON body。</li>
<li><strong>連得上但長時間沒回應</strong>：第一次載入大 model 需要 30 ~ 60 秒、看 <code>~/.ollama/logs/server.log</code> 確認是否還在 loading。</li>
</ul>
<p>用內建 CLI 互動模式也行：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">ollama run gemma3:1b</span></span></code></pre></div><p>進入 REPL、可以打字對話。<code>/bye</code> 離開。</p>
<p>第一次跑 <code>ollama run</code> 會把模型載入記憶體（1B 模型大約 1-2 秒）、之後對話延遲低。如果幾分鐘沒用、模型會被 unload 釋放記憶體、下次 run 又要等載入。控制行為的環境變數是 <code>OLLAMA_KEEP_ALIVE</code>（預設 5 分鐘）。</p>
<h2 id="常見前置設定問題">常見前置設定問題</h2>
<h3 id="port-11434-被佔用">Port 11434 被佔用</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">lsof -i :11434</span></span></code></pre></div><p>若已有 process 占用、可能是先前手動跑過 <code>ollama serve</code> 沒關。kill 後再 start service：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">pkill -f <span class="s2">&#34;ollama serve&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">brew services restart ollama</span></span></code></pre></div><h3 id="ollama-command-not-found裝完還是找不到"><code>ollama: command not found</code>（裝完還是找不到）</h3>
<p>Homebrew 在 Apple Silicon 預設裝到 <code>/opt/homebrew/bin</code>、shell PATH 應該已含。若沒含：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">echo</span> <span class="nv">$PATH</span> <span class="p">|</span> tr <span class="s1">&#39;:&#39;</span> <span class="s1">&#39;\n&#39;</span> <span class="p">|</span> grep homebrew
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 若沒看到 /opt/homebrew/bin、要加進 ~/.zshrc：</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nb">echo</span> <span class="s1">&#39;export PATH=&#34;/opt/homebrew/bin:$PATH&#34;&#39;</span> &gt;&gt; ~/.zshrc
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="nb">source</span> ~/.zshrc</span></span></code></pre></div><h3 id="server-啟動但-curl-失敗">Server 啟動但 curl 失敗</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew services list <span class="p">|</span> grep ollama</span></span></code></pre></div><p>若 status 不是 <code>started</code>、看 log：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">tail -50 /opt/homebrew/var/log/ollama.log</span></span></code></pre></div><p>常見原因：port 衝突、權限問題、上次 crash 沒清乾淨。</p>
<p>完整排錯流程見 <a href="/blog/llm/01-local-llm-services/troubleshooting/" data-link-title="1.7 排錯方法論：用三層架構做故障定位" data-link-desc="故障定位的分層思考、症狀到層級的對應反射、log 在三層的角色差異、最小可重現的縮減策略">1.7 排錯方法論</a>。</p>
<h2 id="之後想做的事">之後想做的事</h2>
<ul>
<li><strong>接 VS Code</strong>：見 <a href="/blog/llm/01-local-llm-services/vscode-continue-integration/" data-link-title="1.3 VS Code &#43; Continue.dev 整合" data-link-desc="安裝 Continue 擴充套件、config.json 設定、Cmd&#43;L 對話 / Cmd&#43;I 行內編輯快捷鍵">1.3 VS Code + Continue.dev 整合</a>。設定 <code>apiBase: http://localhost:11434</code> 就能用。</li>
<li><strong>跑更大模型</strong>：32GB+ Mac 推薦 <code>gemma4:31b-coding-mtp-bf16</code>（18 GB）。模型選擇見 <a href="/blog/llm/01-local-llm-services/model-selection-priority/" data-link-title="1.4 寫 code 場景的模型選型優先順序" data-link-desc="Gemma 4 31B MTP → Qwen3-Coder 30B → Qwen3 14B → gpt-oss 20B 的取捨與適用情境">1.4 模型選型優先順序</a>。</li>
<li><strong>加 embedding</strong>：codebase 索引要 embedding 模型：<code>ollama pull nomic-embed-text</code>（274 MB）、見 <a href="/blog/llm/04-applications/rag-principles/" data-link-title="4.1 RAG 原理：retrieval &#43; augmentation 模式" data-link-desc="為什麼模型需要外掛知識、語意相似 vs 字面相似、chunking 的本質取捨、retrieval 失敗的根本原因">4.1 RAG 原理</a>。</li>
</ul>
<h2 id="升級--移除">升級 / 移除</h2>
<p>升級：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew upgrade ollama
</span></span><span class="line"><span class="ln">2</span><span class="cl">brew services restart ollama</span></span></code></pre></div><p>完整移除：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">brew services stop ollama
</span></span><span class="line"><span class="ln">2</span><span class="cl">brew uninstall ollama
</span></span><span class="line"><span class="ln">3</span><span class="cl">rm -rf ~/.ollama  <span class="c1"># 清模型 cache（可選）</span></span></span></code></pre></div><h2 id="何時這篇會過時">何時這篇會過時</h2>
<ul>
<li><code>brew install ollama</code> 安裝方式跟 OpenAI 相容 API 形狀短期內不會變（生態都依賴）。</li>
<li><code>gemma3:1b</code> 這個具體 tag 預期會被新模型取代、但「拉一個小模型驗證流程」的方法不變。</li>
<li>launchd service 機制是 macOS 系統 API、不會 deprecate。</li>
</ul>
<p>讀的時候若 <code>brew install</code> 跑失敗、查 Ollama GitHub release notes；其餘驗證步驟結構通用。</p>
]]></content:encoded></item></channel></rss>