<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Resource on Tarragon</title><link>https://tarrragon.github.io/blog/tags/resource/</link><description>Recent content in Resource on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 12 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/resource/index.xml" rel="self" type="application/rss+xml"/><item><title>Hands-on: Managing resources while a local LLM runs and after it stops</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/resource-management/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/resource-management/</guid><description>&lt;p>The core invariant of running an LLM locally differs from the cloud: &lt;strong>a Mac is a shared resource, not a dedicated GPU&lt;/strong>. A cloud inference server runs in a dedicated container, and ending the instance reclaims every resource. A local &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">inference server&lt;/a> runs on the Mac you use every day and draws on the same &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/unified-memory/" data-link-title="Unified Memory Architecture" data-link-desc="Apple Silicon 讓 CPU / GPU / NE 共用同一塊記憶體：跑大模型的優勢來源">unified memory&lt;/a> pool as everything else, so left unmanaged it silently eats RAM, disk, and ports until the system slows down or starts swapping.&lt;/p>
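&lt;p>This post records the observation tools and release procedures for three dimensions (RAM, disk, port), contrasts the two typical lifecycles of Ollama and ComfyUI, and adds measured release numbers. It applies the "audit every hop" idea from &lt;a href="https://tarrragon.github.io/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 Privacy and data-flow principles&lt;/a>: resource management is also a hop-level audit, not install-and-forget.&lt;/p>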
&lt;blockquote>
&lt;p>&lt;strong>Verified on&lt;/strong>: 2026-05-12
&lt;strong>Environment&lt;/strong>: macOS 14, Apple Silicon, Ollama 0.23.2, ComfyUI 0.21.0, SDXL base 1.0&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼這事重要">Why this matters&lt;/h2>
&lt;p>Cloud inference:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Container start → load model → serve requests → container stop → all RAM / disk / ports reclaimed automatically&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Local inference:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">brew services start → load model on demand → serve → ??? → you forget to stop
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> → RAM / disk stay occupied
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> → nothing is released until the next reboot&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Concrete problems you will hit:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>RAM&lt;/strong>: a loaded SDXL model is never unloaded automatically; even when ComfyUI is idle, its Python process keeps holding the RAM (about 7.5 GB in the measurement below)&lt;/li>
&lt;li>&lt;strong>Disk&lt;/strong>: every &lt;code>ollama pull&lt;/code> accumulates; &lt;code>~/.ollama/models/blobs&lt;/code> can grow past 50 GB in six months and never shrinks unless you clean it&lt;/li>
&lt;li>&lt;strong>Port&lt;/strong>: an &lt;code>ollama serve&lt;/code> process that was not shut down cleanly still holds port 11434, so the next start fails with "address already in use"&lt;/li>
&lt;li>&lt;strong>GPU / Metal&lt;/strong>: once a model is loaded it holds a Metal context and competes with other GPU-using apps (video editing, games)&lt;/li>
&lt;/ul>
&lt;h2 id="三個-dimension--觀察工具">Three dimensions and their observation tools&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>Command&lt;/th>
 &lt;th>What to look at&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>RAM&lt;/td>
 &lt;td>&lt;code>vm_stat | head -5&lt;/code>&lt;/td>
 &lt;td>Pages free (16 KB per page); more free pages means more headroom&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>RAM (per process)&lt;/td>
 &lt;td>Activity Monitor or &lt;code>ps aux | sort -k6 -rn | head&lt;/code>&lt;/td>
 &lt;td>Which process holds the most memory (column 6 is RSS)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Disk&lt;/td>
 &lt;td>&lt;code>df -h ~ | tail -1&lt;/code>&lt;/td>
 &lt;td>Free space on the volume that holds your home directory&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Disk (per dir)&lt;/td>
 &lt;td>&lt;code>du -sh ~/.ollama/models/blobs&lt;/code>&lt;/td>
 &lt;td>Accumulated size of LLM models&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Port&lt;/td>
 &lt;td>&lt;code>lsof -i :11434&lt;/code>&lt;/td>
 &lt;td>Which process is listening on the port&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Process&lt;/td>
 &lt;td>&lt;code>ps aux | grep -i ollama | grep -v grep&lt;/code>&lt;/td>
 &lt;td>Which Ollama processes are running (swap the pattern for ComfyUI or Python)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Ollama loaded models&lt;/td>
 &lt;td>&lt;code>ollama ps&lt;/code>&lt;/td>
 &lt;td>Which models are in RAM, their size, and the idle timer&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Measured: right after killing ComfyUI (SDXL + Python venv), &lt;code>vm_stat&lt;/code> showed free pages rising from 619K to 1090K (16 KB per page), about &lt;strong>+7.5 GB of RAM released&lt;/strong>. That is how much memory the SDXL + ComfyUI process had been holding the whole time.&lt;/p></description><content:encoded><![CDATA[<p>The core invariant of running an LLM locally differs from the cloud: <strong>a Mac is a shared resource, not a dedicated GPU</strong>. A cloud inference server runs in a dedicated container, and ending the instance reclaims every resource. A local <a href="/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">inference server</a> runs on the Mac you use every day and draws on the same <a href="/blog/llm/knowledge-cards/unified-memory/" data-link-title="Unified Memory Architecture" data-link-desc="Apple Silicon 讓 CPU / GPU / NE 共用同一塊記憶體：跑大模型的優勢來源">unified memory</a> pool as everything else, so left unmanaged it silently eats RAM, disk, and ports until the system slows down or starts swapping.</p>
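<p>This post records the observation tools and release procedures for three dimensions (RAM, disk, port), contrasts the two typical lifecycles of Ollama and ComfyUI, and adds measured release numbers. It applies the "audit every hop" idea from <a href="/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 Privacy and data-flow principles</a>: resource management is also a hop-level audit, not install-and-forget.</p>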
<blockquote>
<p><strong>Verified on</strong>: 2026-05-12
<strong>Environment</strong>: macOS 14, Apple Silicon, Ollama 0.23.2, ComfyUI 0.21.0, SDXL base 1.0</p></blockquote>
<h2 id="為什麼這事重要">Why this matters</h2>
<p>Cloud inference:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Container start → load model → serve requests → container stop → all RAM / disk / ports reclaimed automatically</span></span></code></pre></div><p>Local inference:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">brew services start → load model on demand → serve → ??? → you forget to stop
</span></span><span class="line"><span class="ln">2</span><span class="cl">                                              → RAM / disk stay occupied
</span></span><span class="line"><span class="ln">3</span><span class="cl">                                              → nothing is released until the next reboot</span></span></code></pre></div><p>Concrete problems you will hit:</p>
<ul>
<li><strong>RAM</strong>: a loaded SDXL model is never unloaded automatically; even when ComfyUI is idle, its Python process keeps holding the RAM (about 7.5 GB in the measurement below)</li>
<li><strong>Disk</strong>: every <code>ollama pull</code> accumulates; <code>~/.ollama/models/blobs</code> can grow past 50 GB in six months and never shrinks unless you clean it</li>
<li><strong>Port</strong>: an <code>ollama serve</code> process that was not shut down cleanly still holds port 11434, so the next start fails with "address already in use"</li>
<li><strong>GPU / Metal</strong>: once a model is loaded it holds a Metal context and competes with other GPU-using apps (video editing, games)</li>
</ul>
<h2 id="三個-dimension--觀察工具">Three dimensions and their observation tools</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Command</th>
          <th>What to look at</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RAM</td>
          <td><code>vm_stat | head -5</code></td>
          <td>Pages free (16 KB per page); more free pages means more headroom</td>
      </tr>
      <tr>
          <td>RAM (per process)</td>
          <td>Activity Monitor or <code>ps aux | sort -k6 -rn | head</code></td>
          <td>Which process holds the most memory (column 6 is RSS)</td>
      </tr>
      <tr>
          <td>Disk</td>
          <td><code>df -h ~ | tail -1</code></td>
          <td>Free space on the volume that holds your home directory</td>
      </tr>
      <tr>
          <td>Disk (per dir)</td>
          <td><code>du -sh ~/.ollama/models/blobs</code></td>
          <td>Accumulated size of LLM models</td>
      </tr>
      <tr>
          <td>Port</td>
          <td><code>lsof -i :11434</code></td>
          <td>Which process is listening on the port</td>
      </tr>
      <tr>
          <td>Process</td>
          <td><code>ps aux | grep -i ollama | grep -v grep</code></td>
          <td>Which Ollama processes are running (swap the pattern for ComfyUI or Python)</td>
      </tr>
      <tr>
          <td>Ollama loaded models</td>
          <td><code>ollama ps</code></td>
          <td>Which models are in RAM, their size, and the idle timer</td>
      </tr>
  </tbody>
</table>
<p>Measured: right after killing ComfyUI (SDXL + Python venv), <code>vm_stat</code> showed free pages rising from 619K to 1090K (16 KB per page), about <strong>+7.5 GB of RAM released</strong>. That is how much memory the SDXL + ComfyUI process had been holding the whole time.</p>
<h2 id="ollama-的-lifecycleauto-unload-模式">Ollama's lifecycle (auto-unload mode)</h2>
<p>Ollama is designed to load on demand and unload when idle:</p>
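<p>The arithmetic behind that figure can be sketched with two small helpers. The names <code>free_pages</code> and <code>pages_to_gib</code> are hypothetical, the 619000 baseline is the rounded figure from the text, and the 16384-byte default assumes Apple Silicon (the first line of <code>vm_stat</code> output states the real page size).</p>

```shell
# Hypothetical helpers; vm_stat text is read from stdin so the parsing
# can be checked on any machine.
free_pages() {
  # "Pages free:      1090076." -> 1090076
  awk '/^Pages free:/ { gsub(/\./, "", $3); print $3 }'
}

pages_to_gib() {
  # $1 = page count, $2 = page size in bytes (assumed 16384 if omitted)
  awk -v pages="$1" -v size="${2:-16384}" \
    'BEGIN { printf "%.2f\n", pages * size / 1024 / 1024 / 1024 }'
}

before=619000   # rounded "619K" reading taken before the kill
after=$(printf 'Pages free:                              1090076.\n' | free_pages)
pages_to_gib $((after - before))   # prints 7.19
```

<p>On a real machine, replace the <code>printf</code> with <code>vm_stat</code>. The result is in GiB, which is why it lands slightly under the rounded 7.5 GB quoted above. macOS keeps the free-page count low on purpose, so treat the delta as a relative signal, not as total available memory.</p>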





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">brew services start ollama          → daemon starts, no model loaded, ~200 MB RAM
</span></span><span class="line"><span class="ln">2</span><span class="cl">                                     port 11434 listening
</span></span><span class="line"><span class="ln">3</span><span class="cl">ollama run gemma3:4b &#34;hello&#34;        → loads the model into RAM (~4-5 GB)
</span></span><span class="line"><span class="ln">4</span><span class="cl">                                     generates the response right away
</span></span><span class="line"><span class="ln">5</span><span class="cl">                                     model stays in RAM
</span></span><span class="line"><span class="ln">6</span><span class="cl">(idle 5 min, no new requests)       → Ollama unloads the model automatically
</span></span><span class="line"><span class="ln">7</span><span class="cl">                                     RAM released, daemon keeps running
</span></span><span class="line"><span class="ln">8</span><span class="cl">ollama run gemma3:4b &#34;next&#34;         → reloads the model (~5-10 s), then generates
</span></span><span class="line"><span class="ln">9</span><span class="cl">brew services stop ollama           → daemon exits, port released</span></span></code></pre></div><p><strong>Key parameter <code>OLLAMA_KEEP_ALIVE</code></strong> (environment variable, default <code>5m</code>):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># List the models currently loaded</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># NAME         ID              SIZE      PROCESSOR    UNTIL</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># gemma3:4b    a2af6cc3eb7f    5.5 GB    100% Metal   4 minutes from now</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># Keep models loaded until Ollama restarts. The variable has to reach the launchd job, so set it with launchctl; a prefix on `brew services` only affects the brew command itself</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">launchctl setenv OLLAMA_KEEP_ALIVE -1 <span class="o">&amp;&amp;</span> brew services restart ollama
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Unload each model as soon as its request finishes</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">launchctl setenv OLLAMA_KEEP_ALIVE <span class="m">0</span> <span class="o">&amp;&amp;</span> brew services restart ollama</span></span></code></pre></div><p>The keep_alive trade-offs:</p>
<table>
  <thead>
      <tr>
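          <th>Setting</th>
          <th>RAM usage</th>
          <th>Time to first token</th>
          <th>Best for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>0</code></td>
          <td>Lowest (released as soon as generation finishes)</td>
          <td>High (reloads on every request)</td>
          <td>Occasional use, tight RAM</td>
      </tr>
      <tr>
          <td><code>5m</code> (default)</td>
          <td>Medium (held while active, released after 5 idle minutes)</td>
          <td>Low (no reload while active)</td>
          <td>Most cases</td>
      </tr>
      <tr>
          <td><code>-1</code></td>
          <td>High (held indefinitely)</td>
          <td>Lowest</td>
          <td>Frequent use all day, plenty of RAM</td>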
      </tr>
  </tbody>
</table>
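<p>The same trade-off can also be chosen per request, because the Ollama API accepts a <code>keep_alive</code> field in the request body. A minimal sketch, assuming a hypothetical helper named <code>keep_alive_payload</code> that only builds the JSON body; duration strings such as <code>30m</code> are quoted and plain numbers are passed through as-is.</p>

```shell
# Hypothetical helper: build an /api/generate body that sets keep_alive
# for one model without sending a prompt.
keep_alive_payload() {
  model=$1
  keep_alive=$2   # e.g. 30m (duration string), 0 (unload now), -1 (keep loaded)
  case $keep_alive in
    ''|*[!0-9-]*) printf '{"model": "%s", "keep_alive": "%s"}\n' "$model" "$keep_alive" ;;
    *)            printf '{"model": "%s", "keep_alive": %s}\n' "$model" "$keep_alive" ;;
  esac
}

keep_alive_payload gemma3:4b 30m
keep_alive_payload gemma3:4b 0

# To send it to a running daemon:
# curl -s http://localhost:11434/api/generate -d "$(keep_alive_payload gemma3:4b 30m)"
```

<p>A per-request value applies to that model only, so one model can stay resident for a long session while the rest follow the server-wide default.</p>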
<p><strong>Unloading on demand</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Unload an idle model from RAM right away; the daemon keeps running</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">curl -s http://localhost:11434/api/generate <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  -d <span class="s1">&#39;{&#34;model&#34;: &#34;gemma3:4b&#34;, &#34;keep_alive&#34;: 0}&#39;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># Or stop the daemon entirely</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">brew services stop ollama</span></span></code></pre></div><h2 id="comfyui-的-lifecycle持續占用模式">ComfyUI's lifecycle (always-resident mode)</h2>
<p>ComfyUI works in a completely different way: <strong>once loaded, a model stays in RAM until the server process exits</strong>. There is no idle-timer auto-unload.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">python main.py                      → ComfyUI server starts, port 8188 listening
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">                                     RAM ~3 GB (Python venv + framework)
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">First Queue Prompt (SDXL)           → loads sd_xl_base_1.0.safetensors (~6 GB)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">                                     RAM jumps to ~9-10 GB
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                                     generation finishes, model stays in RAM
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Several images in a row             → stays at ~9-10 GB, no unload
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">Idle for 1 hour                     → still ~9-10 GB (no timer)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">Switch to a ControlNet workflow     → also loads the ControlNet model (~2 GB); ComfyUI swaps models automatically
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">                                     RAM rises briefly; parts of SD may be evicted to disk
</span></span><span class="line"><span class="ln">10</span><span class="cl">Ctrl+C / pkill                      → process exits, RAM fully released</span></span></code></pre></div><p>To get back all the RAM ComfyUI holds, <strong>the dependable way is to stop the server</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Find the PID</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ps aux <span class="p">|</span> grep <span class="s2">&#34;ComfyUI/main.py&#34;</span> <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># Graceful stop (lets it clean up)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">pkill -INT -f <span class="s2">&#34;ComfyUI/main.py&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># Force kill (only if the graceful stop has had no effect after about 5 seconds)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">pkill -KILL -f <span class="s2">&#34;ComfyUI/main.py&#34;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># Confirm the port is released</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">lsof -i :8188 <span class="p">|</span> head -3</span></span></code></pre></div><p>Measured on an M4 Pro with 32 GB: with SDXL base loaded, the ComfyUI process held ~8 GB of RAM; after <code>pkill -9</code>, <code>vm_stat</code> showed free pages up by ~470K (<strong>about 7.5 GB released</strong>).</p>
<h3 id="為什麼-ollama-跟-comfyui-設計不同">Why Ollama and ComfyUI are designed differently</h3>
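<p>The graceful-then-forced stop can also poll instead of waiting a fixed time. A minimal sketch, assuming a hypothetical helper named <code>stop_gracefully</code> that takes a PID and a timeout in seconds; it sends SIGINT first and escalates to SIGKILL only if the process is still alive when the timeout expires.</p>

```shell
# Hypothetical helper: ask a process to exit, escalate only if it does not.
stop_gracefully() {
  pid=$1
  timeout=${2:-5}                             # seconds to wait before SIGKILL
  kill -INT "$pid" 2>/dev/null || return 0    # nothing to do: already gone
  waited=0
  while kill -0 "$pid" 2>/dev/null; do        # kill -0 only tests for existence
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example (pgrep -f matches against the full command line):
# stop_gracefully "$(pgrep -f 'ComfyUI/main.py' | head -1)" 5
```

<p>Working from a PID avoids sending the second signal to an unrelated process that happens to match the same pattern.</p>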
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Ollama</th>
          <th>ComfyUI</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Main usage pattern</td>
          <td>API service, used over HTTP by IDE plugins</td>
          <td>Interactive GUI, the user keeps iterating on prompts</td>
      </tr>
      <tr>
          <td>Model switching frequency</td>
          <td>High (different models for different tasks)</td>
          <td>Low (usually one model per session)</td>
      </tr>
      <tr>
          <td>Latency users expect</td>
          <td>Low time to first token (IDE completion)</td>
          <td>High throughput (continuous image generation)</td>
      </tr>
      <tr>
          <td>Conclusion</td>
          <td>Auto-unload frees RAM for other models</td>
          <td>Stays loaded to avoid wasted reloads</td>
      </tr>
  </tbody>
</table>
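<p>Both designs are valid and suit different usage patterns. Once you understand the difference, ComfyUI holding RAM reads as a design choice, not a bug.</p>
<h2 id="跟其他本地-server-對比">Comparison with other local servers</h2>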
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>Auto-unload</th>
          <th>Manual unload</th>
          <th>How to observe RAM use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ollama</td>
          <td>Yes (after 5 idle minutes)</td>
          <td><code>keep_alive: 0</code> or stop the daemon</td>
          <td><code>ollama ps</code></td>
      </tr>
      <tr>
          <td>LM Studio</td>
          <td>No (released only when you eject the model in the GUI)</td>
          <td>Eject Model in the GUI</td>
          <td>Activity Monitor</td>
      </tr>
      <tr>
          <td>llama.cpp <code>llama-server</code></td>
          <td>No</td>
          <td>kill the process</td>
          <td><code>lsof -i :8080</code></td>
      </tr>
      <tr>
          <td>ComfyUI</td>
          <td>No</td>
          <td>kill the process</td>
          <td><code>ps aux | grep ComfyUI</code></td>
      </tr>
      <tr>
          <td>oMLX</td>
          <td>Yes (configurable per model)</td>
          <td>API endpoint</td>
          <td>server log</td>
      </tr>
  </tbody>
</table>
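<p><strong>Conclusion</strong>: of these, only Ollama and oMLX have built-in auto-unload; the rest must be released by hand. GUI servers (LM Studio) usually give the user an Eject button, while CLI servers usually have to be killed.</p>
<h2 id="標準釋放程序">Standard release procedure</h2>
<p>At the end of a coding day, release everything by running the steps below in order:</p>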





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. Check the current state (note how much RAM you expect to get back)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">vm_stat <span class="p">|</span> head -3
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">df -h ~ <span class="p">|</span> tail -1
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">ps aux <span class="p">|</span> grep -E <span class="s2">&#34;ollama|ComfyUI|llama-server&#34;</span> <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 2. Release the LLM models currently loaded (Ollama)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">brew services stop ollama
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Or keep the daemon and only unload the model:</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># curl -s http://localhost:11434/api/generate -d &#39;{&#34;model&#34;: &#34;&lt;your model&gt;&#34;, &#34;keep_alive&#34;: 0}&#39;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># 3. Stop ComfyUI and any other model server</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">pkill -INT -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">14</span><span class="cl">pkill -INT -f <span class="s2">&#34;llama-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">15</span><span class="cl">sleep <span class="m">5</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"># Force (if anything above is still running)</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">pkill -KILL -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">18</span><span class="cl">pkill -KILL -f <span class="s2">&#34;llama-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"># 4. Verify that every port is released</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">lsof -i :11434 -i :1234 -i :8080 -i :8188 -i :8000 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">|</span> head
</span></span><span class="line"><span class="ln">22</span><span class="cl">
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="c1"># 5. Confirm how much was released</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">vm_stat <span class="p">|</span> head -3
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1"># free pages should be clearly higher</span></span></span></code></pre></div><h3 id="容易出錯的釋放方式">Error-prone ways to "release"</h3>
<ul>
<li><strong><code>killall Python</code></strong>: kills every Python process, including other dev tools (Jupyter, Django). Use an explicit pattern such as <code>pkill -f &quot;ComfyUI/main.py&quot;</code>.</li>
<li><strong><code>rm -rf ~/.ollama</code></strong>: wipes the whole model store, so every model has to be pulled again. For targeted cleanup use <code>ollama rm &lt;model&gt;</code>.</li>
<li><strong><code>brew uninstall ollama</code></strong>: removes Ollama itself, and reinstalling is a hassle. Stopping the service is enough.</li>
<li><strong>Rebooting to release</strong>: it works, but it is heavy-handed and interrupts everything else. Process-level operations are enough.</li>
</ul>
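<h2 id="磁碟長期累積管理">Managing long-term disk accumulation</h2>
<p>Once a model is pulled into <code>~/.ollama/models/blobs</code> with <code>pull</code>, it stays there until you <code>rm</code> it. Six months of pulls can add up to 50 GB+.</p>
<p>Ollama models are only one of the big disk consumers. For the whole-machine diagnosis order when a Mac's disk suddenly fills up (first rule out snapshot fluctuation, then drill down level by level using actual usage), see <a href="/blog/other/macos-%E7%A3%81%E7%A2%9F%E7%A9%BA%E9%96%93%E8%A2%AB%E5%90%83%E5%85%89%E7%9A%84%E8%A8%BA%E6%96%B7%E6%B5%81%E7%A8%8B/" data-link-title="macOS 磁碟空間被吃光的診斷流程" data-link-desc="Mac 空間莫名歸零、清 cache 沒救、或空間掉了又回來時的排查順序。避開 sparse 假大小和本地快照浮動的誤判。含 disk-report 腳本。">the macOS disk-space diagnosis workflow</a>. Its table of top consumers also lists Ollama and links back to the cleanup idiom in this post.</p>
<h3 id="觀察累積">Watching the accumulation</h3>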





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Total size of Ollama models</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">du -sh ~/.ollama/models/blobs
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># 4.1G    /Users/tarragon/.ollama/models/blobs</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># Size per model</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">ollama list
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># NAME                       ID              SIZE      MODIFIED</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># gemma4:e4b                 c6eb396dbd59    9.6 GB    Less than a second ago</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># nomic-embed-text:latest    0a109f422b47    274 MB    3 hours ago</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># Accumulated size, including ComfyUI models</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">du -sh ~/.ollama ~/Projects/ComfyUI/models 2&gt;/dev/null
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># 4.2G    /Users/tarragon/.ollama</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># 7.0G    /Users/tarragon/Projects/ComfyUI/models</span></span></span></code></pre></div><h3 id="清理策略">Cleanup strategy</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Remove a model you have not used in a long time</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ollama rm &lt;model-tag&gt;
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># Remove every Ollama model in one go (the daemon is left alone)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">ollama list <span class="p">|</span> tail -n +2 <span class="p">|</span> awk <span class="s1">&#39;{print $1}&#39;</span> <span class="p">|</span> xargs -I <span class="o">{}</span> ollama rm <span class="o">{}</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># See which ComfyUI checkpoints can go</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">ls -lh ~/Projects/ComfyUI/models/checkpoints/
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># Delete unwanted .safetensors files by hand (careful: this cannot be undone)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">rm ~/Projects/ComfyUI/models/checkpoints/&lt;old-model&gt;.safetensors</span></span></code></pre></div><h3 id="磁碟管理-idiom">Disk management idiom</h3>
<p>Regularly (monthly, or whenever free disk space drops below 20%):</p>
<ol>
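<li>Run <code>du -sh ~/.ollama ~/Projects/ComfyUI/models</code> to see the current total</li>
<li>Run <code>ollama list</code> to spot models you no longer use (check the <code>MODIFIED</code> column and consider deleting the oldest)</li>
<li>Delete experimental models and keep your daily drivers</li>
<li>Review ComfyUI checkpoints the same way</li>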
</ol>
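<p>The 20% trigger and the model review can be sketched as two helpers that read command output from stdin. The names <code>free_percent</code> and <code>models_by_size</code> are hypothetical, and the column positions assume the <code>df -P</code> and <code>ollama list</code> layouts shown in this post. Sorting by size is an extra angle next to the <code>MODIFIED</code> column: it shows which deletions would free the most space.</p>

```shell
# Hypothetical helpers for the periodic review; both read from stdin.
free_percent() {
  # Input: `df -P <path>`; column 5 is the used capacity, e.g. "83%"
  awk 'NR == 2 { gsub(/%/, "", $5); print 100 - $5 }'
}

models_by_size() {
  # Input: `ollama list`; prints model names, largest first
  awk 'NR > 1 {
         mb = $3
         if ($4 == "GB") mb *= 1024
         else if ($4 == "KB") mb /= 1024
         printf "%.1f\t%s\n", mb, $1
       }' | sort -rn | cut -f2
}

# Free space (in percent) on the volume that holds the home directory
df -P "${HOME:-/}" | free_percent

# With Ollama installed:
# ollama list | models_by_size | head -5
```

<p>A result below 20 from the first command is the cue to run the review; the second shows where the space went.</p>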
<h2 id="port--process-排錯">Port / process troubleshooting</h2>
<h3 id="啟動報address-already-in-use">Startup fails with "address already in use"</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Find what holds the port</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">lsof -i :11434
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># COMMAND  PID  USER   ...   NAME</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># ollama   xxx  ...    ...   TCP localhost:11434 (LISTEN)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># Inspect the listening process to see whether it is a leftover (state, elapsed time, command line)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">ps -o pid,stat,etime,command -p <span class="k">$(</span>lsof -t -iTCP:11434 -sTCP:LISTEN <span class="p">|</span> head -1<span class="k">)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Kill the listener only; a bare `lsof -ti :11434` also lists clients connected to the port</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="nb">kill</span> -9 <span class="k">$(</span>lsof -t -iTCP:11434 -sTCP:LISTEN<span class="k">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># Or restart the service (replaces the instance that brew services manages)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">brew services restart ollama</span></span></code></pre></div><h3 id="ollama-daemon-掛了不知道">The Ollama daemon died without you noticing</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Health check</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">curl -s http://localhost:11434/api/version
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># No response: check the service status</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">brew services list <span class="p">|</span> grep ollama
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># Not running: start it</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">brew services start ollama
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># Read the log</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">tail -50 /opt/homebrew/var/log/ollama.log</span></span></code></pre></div><h3 id="comfyui-看似跑著但-queue-不動">ComfyUI looks alive but the queue is stuck</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Read the stdout / stderr log</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">tail -30 /tmp/comfyui.log  <span class="c1"># if output was redirected to a log at startup</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Check for a stuck GPU / Metal context (rare, but heavy concurrent SDXL jobs can trigger it)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># Fix: kill and restart</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">pkill -9 -f <span class="s2">&#34;ComfyUI/main.py&#34;</span></span></span></code></pre></div><p>For the full troubleshooting flow, including how to first confirm which layer is broken, see <a href="/blog/llm/01-local-llm-services/troubleshooting/" data-link-title="1.7 排錯方法論：用三層架構做故障定位" data-link-desc="故障定位的分層思考、症狀到層級的對應反射、log 在三層的角色差異、最小可重現的縮減策略">1.7 Troubleshooting methodology</a>.</p>
<h2 id="觀察記憶體佔用實測對照">Observing memory use: measured comparison</h2>
<p>Run these steps to record how RAM changes from baseline, to model load, to kill:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Baseline</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># Pages free:                              1090076.   ← ~17 GB free</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># Start Ollama + load a 4B model</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">brew services start ollama
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">ollama run gemma3:4b <span class="s2">&#34;hello&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># NAME       SIZE     PROCESSOR    UNTIL</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># gemma3:4b  5.5 GB   100% Metal   4 minutes from now</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># Pages free:                               750000.   ← down ~5.5 GB (model loaded)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"># Also start ComfyUI + load SDXL (launched by full path so the pkill -f pattern ComfyUI/main.py below can match)</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">nohup python ~/Projects/ComfyUI/main.py &gt; /tmp/comfyui.log 2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"># In the GUI, click Queue Prompt to run one SDXL generation</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"># Pages free:                               280000.   ← down another ~7.5 GB (SDXL loaded + Python process)</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"># Kill everything</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">brew services stop ollama
</span></span><span class="line"><span class="ln">23</span><span class="cl">pkill -9 -f <span class="s2">&#34;ComfyUI/main.py&#34;</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">sleep <span class="m">3</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
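</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="c1"># Pages free:                              1090000.   ← back to baseline</span></span></span></code></pre></div><p>Each page is 16 KB on Apple Silicon, so the free-page count × 16 KB gives free RAM (1,090,076 pages × 16 KB ≈ 17 GB). "Pages free" leaves out inactive and purgeable pages that macOS can reclaim, so treat it as a lower bound on available memory.</p>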
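<p>The page arithmetic can be sketched in Python. This is a minimal sketch, not part of the original measurements: the helper name <code>free_ram_gb</code> and the sample text are illustrative assumptions, and the page size is read from the header line of the <code>vm_stat</code> output rather than hard-coded.</p>

```python
import re

def free_ram_gb(vm_stat_output):
    """Convert the 'Pages free' line of vm_stat output into gigabytes (10^9 bytes)."""
    # The header line reports the page size, e.g. "page size of 16384 bytes".
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    # The counter ends with a trailing dot, e.g. "Pages free:   1090076."
    free_pages = int(re.search(r"Pages free:\s+(\d+)\.", vm_stat_output).group(1))
    return free_pages * page_size / 1e9

# Sample text shaped like the baseline reading above (other counters omitted).
sample = (
    "Mach Virtual Memory Statistics: (page size of 16384 bytes)\n"
    "Pages free:                              1090076.\n"
)
print(round(free_ram_gb(sample), 1))
```

<p>To use it on a live system, pass in the captured output of <code>vm_stat</code>, for example via <code>subprocess.run(["vm_stat"], capture_output=True, text=True).stdout</code>.</p>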
<h2 id="自動化釋放launchd--shell-alias">Automating release: launchd / shell alias</h2>
<p>Write a shell function for one-command cleanup:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Add to ~/.zshrc</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">llm-cleanup<span class="o">()</span> <span class="o">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Stopping Ollama...&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  brew services stop ollama 2&gt;/dev/null
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Killing ComfyUI...&#34;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  pkill -INT -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  sleep <span class="m">3</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  pkill -KILL -f <span class="s2">&#34;ComfyUI/main.py&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Killing other model servers...&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  pkill -KILL -f <span class="s2">&#34;llama-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">13</span><span class="cl">  pkill -KILL -f <span class="s2">&#34;lm-studio-server&#34;</span> 2&gt;/dev/null
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Verifying ports...&#34;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="k">for</span> p in <span class="m">11434</span> <span class="m">1234</span> <span class="m">8080</span> <span class="m">8188</span> 8000<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    lsof -i :<span class="nv">$p</span> 2&gt;/dev/null <span class="p">|</span> head -2
</span></span><span class="line"><span class="ln">18</span><span class="cl">  <span class="k">done</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Free RAM:&#34;</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">  vm_stat <span class="p">|</span> grep <span class="s2">&#34;Pages free&#34;</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="o">}</span></span></span></code></pre></div><p>When you are done, run <code>llm-cleanup</code> to release everything in one go, with no need to remember how to kill each process.</p>
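<p>The port verification step can also be done without <code>lsof</code>. The sketch below is an illustrative alternative, not part of the original function: <code>listening_ports</code> is a hypothetical helper that only probes <code>127.0.0.1</code> over IPv4, so a server bound to another interface would be missed.</p>

```python
import socket

def listening_ports(ports, host="127.0.0.1"):
    """Return the subset of ports that still accept a TCP connection on host."""
    still_open = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(0.5)
            # connect_ex returns 0 when something is listening on the port.
            if sock.connect_ex((host, port)) == 0:
                still_open.append(port)
    return still_open

if __name__ == "__main__":
    # Same port list as the shell function above.
    print(listening_ports([11434, 1234, 8080, 8188, 8000]))
```

<p>An empty list after cleanup means none of the usual LLM ports is still being served locally.</p>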
<h2 id="何時這篇會過時">When this article will go out of date</h2>
<p><strong>What will not go out of date</strong>:</p>
<ul>
<li>The three dimensions RAM / disk / port are a long-term invariant and hold for any LLM server.</li>
<li>The framing that "the Mac is a shared resource and needs active management".</li>
<li>The contrast between the two typical lifecycles, Ollama and ComfyUI (auto-unload vs persistent).</li>
<li>The observation tools (<code>vm_stat</code>, <code>lsof</code>, <code>ps</code>, <code>du</code>, Activity Monitor) are built-in macOS tools and are unlikely to be deprecated.</li>
<li>The standard release procedure and the automated shell-function pattern.</li>
</ul>
<p><strong>What will change</strong>:</p>
<ul>
<li>Specific model sizes / RAM usage figures (they shift as model architectures evolve).</li>
<li>Specific environment variable names such as <code>OLLAMA_KEEP_ALIVE</code> (the Ollama API evolves).</li>
<li>ComfyUI may add an auto-unload feature (there are community issues discussing it).</li>
</ul>
<p>If a command fails when you read this, check <code>--help</code> for the current version's flags first; the underlying mechanism of freeing RAM by killing the process always holds.</p>
<h2 id="跟其他-hands-on-章節的關係">Relationship to the other hands-on chapters</h2>
<ul>
<li><a href="/blog/llm/01-local-llm-services/hands-on/ollama-setup/" data-link-title="Hands-on：安裝 Ollama &#43; 拉第一個 Gemma 模型" data-link-desc="brew install ollama、launchd service、ollama pull、curl 驗證 OpenAI 相容 API">Ollama setup</a>: introduces <code>brew services start/stop</code>; this article extends it with lifecycle details</li>
<li><a href="/blog/llm/01-local-llm-services/hands-on/comfyui-setup/" data-link-title="Hands-on：安裝 ComfyUI &#43; SDXL base" data-link-desc="git clone、venv、pip install requirements、SDXL safetensors 放哪、--listen 啟動 server、瀏覽器 workflow 驗證">ComfyUI setup</a>: introduces starting ComfyUI; this article extends it with RAM usage + release</li>
<li><a href="/blog/llm/01-local-llm-services/troubleshooting/" data-link-title="1.7 排錯方法論：用三層架構做故障定位" data-link-desc="故障定位的分層思考、症狀到層級的對應反射、log 在三層的角色差異、最小可重現的縮減策略">1.7 Troubleshooting methodology</a>: locates faults with a three-layer architecture; this article complements it from the lifecycle angle</li>
<li><a href="/blog/llm/00-foundations/privacy-data-flow/" data-link-title="0.7 隱私 / 資安的資料流原理" data-link-desc="從「位置」到「資料流」的思考升級：信任邊界、合約模型、零信任原則套用到 LLM 工作流">0.7 Privacy data-flow principles</a>: "audit every hop" extended to the resource layer</li>
</ul>
<p>Overall takeaway: a local LLM workflow differs from the cloud; you have to manage the lifecycle actively rather than install and forget.</p>
]]></content:encoded></item><item><title>Hands-on: the resource footprint of RAG / MCP</title><link>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-mcp-resources/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-mcp-resources/</guid><description>&lt;p>The &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &amp;#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">resource management chapter&lt;/a> covers the lifecycle of &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">inference servers&lt;/a> such as Ollama / ComfyUI. But &lt;strong>running &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP&lt;/a> applications&lt;/strong> touches more kinds of resources than plain chat (&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/embedding-model/" data-link-title="Embedding Model" data-link-desc="把文字轉成向量的模型：用於 codebase 索引與語意搜尋">embedding model&lt;/a>, chat model, index file, subprocess, tool logic), and the bottleneck differs by stage (ingest vs query).&lt;/p>
&lt;p>This article records the measured resource footprint of running the &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &amp;#43; cosine retrieval &amp;#43; Ollama chat、validating 4.0 RAG 原理">RAG demo&lt;/a> and the &lt;a href="https://tarrragon.github.io/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo&lt;/a>. It gives a baseline for running several local models side by side and a sanity check before you write a production application.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Verified on&lt;/strong>: 2026-05-12
&lt;strong>Environment&lt;/strong>: M4 Pro 32 GB, Ollama 0.23.2, Python 3.14
&lt;strong>Corpus&lt;/strong>: this blog's &lt;code>content/llm/&lt;/code>, 71 markdown files, 463 chunks&lt;/p>&lt;/blockquote>
&lt;h2 id="各階段資源-footprint">Resource footprint by stage&lt;/h2>
&lt;p>RAG / MCP workflows usually split into three stages, each consuming different resources:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Stage&lt;/th>
 &lt;th>Main resource consumption&lt;/th>
 &lt;th>Duration&lt;/th>
 &lt;th>Resident?&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>RAG ingest&lt;/strong>&lt;/td>
 &lt;td>embedding model RAM + CPU + disk writes&lt;/td>
 &lt;td>one-shot (run when the corpus changes)&lt;/td>
 &lt;td>No&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>RAG query&lt;/strong>&lt;/td>
 &lt;td>index loaded into RAM + chat model RAM + GPU&lt;/td>
 &lt;td>per-request&lt;/td>
 &lt;td>retrieval index stays resident&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>MCP server&lt;/strong>&lt;/td>
 &lt;td>subprocess runs for the whole session; resources are loaded on demand at tool calls&lt;/td>
 &lt;td>resident for the session&lt;/td>
 &lt;td>Yes&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Each stage has a different bottleneck, so the optimization target differs too.&lt;/p>
&lt;h2 id="rag-ingest-階段one-shot-但批次密集">RAG ingest stage: one-shot but batch-intensive&lt;/h2>
&lt;p>When running &lt;code>python3 scripts/rag-demo/ingest.py&lt;/code>:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Found 71 markdown files under content/llm
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> [10/71] 86 chunks in 4.5s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> [20/71] 181 chunks in 8.6s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> [70/71] 461 chunks in 22.2s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">Wrote 463 records to scripts/rag-demo/index.pkl (22.3s)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Measured resource consumption:&lt;/p></description><content:encoded><![CDATA[<p>The <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">resource management chapter</a> covers the lifecycle of <a href="/blog/llm/knowledge-cards/inference-server/" data-link-title="Inference Server" data-link-desc="載入模型權重、處理 prompt、產生 token 的常駐 process">inference servers</a> such as Ollama / ComfyUI. But <strong>running <a href="/blog/llm/knowledge-cards/rag/" data-link-title="RAG" data-link-desc="Retrieval-Augmented Generation：動態外掛知識給 LLM、繞開模型參數記憶的靜態限制">RAG</a> / <a href="/blog/llm/knowledge-cards/mcp/" data-link-title="MCP（Model Context Protocol）" data-link-desc="LLM application ↔ 外部 tool server 之間的標準化協議、複用 OpenAI 相容 API 的成功模式">MCP</a> applications</strong> touches more kinds of resources than plain chat (<a href="/blog/llm/knowledge-cards/embedding-model/" data-link-title="Embedding Model" data-link-desc="把文字轉成向量的模型：用於 codebase 索引與語意搜尋">embedding model</a>, chat model, index file, subprocess, tool logic), and the bottleneck differs by stage (ingest vs query).</p>
<p>This article records the measured resource footprint of running the <a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a> and the <a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo</a>. It gives a baseline for running several local models side by side and a sanity check before you write a production application.</p>
<blockquote>
<p><strong>Verified on</strong>: 2026-05-12
<strong>Environment</strong>: M4 Pro 32 GB, Ollama 0.23.2, Python 3.14
<strong>Corpus</strong>: this blog's <code>content/llm/</code>, 71 markdown files, 463 chunks</p></blockquote>
<h2 id="各階段資源-footprint">Resource footprint by stage</h2>
<p>RAG / MCP workflows usually split into three stages, each consuming different resources:</p>
<table>
  <thead>
      <tr>
          <th>Stage</th>
          <th>Main resource consumption</th>
          <th>Duration</th>
          <th>Resident?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RAG ingest</strong></td>
          <td>embedding model RAM + CPU + disk writes</td>
          <td>one-shot (run when the corpus changes)</td>
          <td>No</td>
      </tr>
      <tr>
          <td><strong>RAG query</strong></td>
          <td>index loaded into RAM + chat model RAM + GPU</td>
          <td>per-request</td>
          <td>retrieval index stays resident</td>
      </tr>
      <tr>
          <td><strong>MCP server</strong></td>
          <td>subprocess runs for the whole session; resources are loaded on demand at tool calls</td>
          <td>resident for the session</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p>Each stage has a different bottleneck, so the optimization target differs too.</p>
<h2 id="rag-ingest-階段one-shot-但批次密集">RAG ingest stage: one-shot but batch-intensive</h2>
<p>When running <code>python3 scripts/rag-demo/ingest.py</code>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Found 71 markdown files under content/llm
</span></span><span class="line"><span class="ln">2</span><span class="cl">  [10/71] 86 chunks in 4.5s
</span></span><span class="line"><span class="ln">3</span><span class="cl">  [20/71] 181 chunks in 8.6s
</span></span><span class="line"><span class="ln">4</span><span class="cl">  ...
</span></span><span class="line"><span class="ln">5</span><span class="cl">  [70/71] 461 chunks in 22.2s
</span></span><span class="line"><span class="ln">6</span><span class="cl">Wrote 463 records to scripts/rag-demo/index.pkl (22.3s)</span></span></code></pre></div><p>Measured resource consumption:</p>
<table>
  <thead>
      <tr>
          <th>Resource</th>
          <th>Figure</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RAM (peak)</td>
          <td>~600 MB</td>
          <td>nomic-embed-text model (274 MB) + Python runtime + accumulated records (~200 MB)</td>
      </tr>
      <tr>
          <td>Disk writes</td>
          <td><code>index.pkl</code> ~3.7 MB</td>
          <td>463 records, each holding chunk text + a 768-dim float embedding</td>
      </tr>
      <tr>
          <td>CPU + GPU</td>
          <td>Ollama computes embeddings on the Apple Silicon Metal backend</td>
          <td>463 chunks in 22 seconds, ~21 chunks/sec on average</td>
      </tr>
      <tr>
          <td>Network</td>
          <td>0</td>
          <td>Inference is fully local</td>
      </tr>
  </tbody>
</table>
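<p><strong>Characteristics of the ingest stage</strong>:</p>
<ul>
<li><strong>One-shot</strong>: no need to re-run while the corpus is unchanged; the index is written once and reused.</li>
<li><strong>Compute-bound rather than RAM-bound</strong>: generating an embedding is a forward pass, so the bottleneck is GPU compute and RAM is under little pressure.</li>
<li><strong>Small disk writes</strong>: about 8 KB per chunk (~5 KB of text + an embedding of 768 floats × 4 bytes ≈ 3 KB), so 463 chunks total ~3.7 MB.</li>
<li><strong>Parallelizable</strong>: a sequential <code>embed(chunk)</code> loop is the slowest implementation; a batching API (if Ollama supports it) or multiple workers may run roughly 5-10x faster.</li>
</ul>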
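<p>The batching idea can be sketched as follows. This is a minimal sketch under stated assumptions: <code>embed_batch</code> is a hypothetical caller-supplied function that takes a list of texts and returns one vector per text, and the stand-in used here does no real embedding. Whether the Ollama version in use offers such a batch call is not confirmed here.</p>

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(chunks, embed_batch, batch_size=50):
    """Embed chunks batch by batch; embed_batch is a caller-supplied function."""
    vectors = []
    for batch in batched(chunks, batch_size):
        # One call per batch instead of one call per chunk.
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in embedder for illustration only: maps each text to a 1-dim vector.
fake_embed_batch = lambda texts: [[float(len(t))] for t in texts]
chunks = ["chunk %d" % i for i in range(463)]
vectors = embed_all(chunks, fake_embed_batch)
print(len(vectors), "vectors from", len(list(batched(chunks, 50))), "requests")
```

<p>With 463 chunks and a batch size of 50, the loop issues 10 calls instead of 463, which is where most of the saving would come from.</p>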
<p><strong>Scale extrapolation</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Corpus size</th>
          <th>Estimated ingest time</th>
          <th>index.pkl size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>71 docs / 463 chunks (this blog)</td>
          <td>22 s</td>
          <td>3.7 MB</td>
      </tr>
      <tr>
          <td>1000 docs / ~7000 chunks (medium codebase)</td>
          <td>~5.5 min</td>
          <td>~55 MB</td>
      </tr>
      <tr>
          <td>10000 docs / ~70000 chunks (large codebase)</td>
          <td>~55 min</td>
          <td>~550 MB</td>
      </tr>
      <tr>
          <td>100K docs / ~700K chunks (company wiki)</td>
          <td>~9 hours</td>
          <td>~5.5 GB</td>
      </tr>
  </tbody>
</table>
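<p>The extrapolation is plain linear scaling from the measured run (~21 chunks/sec, ~8 KB per chunk). A minimal sketch, assuming throughput and per-chunk size stay constant as the corpus grows, which a real run may not honor; the function name and labels are illustrative, and the table's figures are rounded:</p>

```python
def extrapolate(chunks, chunks_per_sec=21.0, kb_per_chunk=8.0):
    """Linear estimate of ingest time (seconds) and index size (MB)."""
    seconds = chunks / chunks_per_sec
    size_mb = chunks * kb_per_chunk / 1000
    return seconds, size_mb

for label, chunks in [("blog", 463), ("medium", 7_000), ("large", 70_000), ("wiki", 700_000)]:
    seconds, size_mb = extrapolate(chunks)
    print(f"{label:7s} {seconds / 60:8.1f} min  {size_mb:8.1f} MB")
```

<p>Changing <code>chunks_per_sec</code> shows how much a batching or multi-worker speedup would shorten the larger runs.</p>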
<p>Beyond 10K docs, consider:</p>
<ul>
<li><a href="/blog/llm/knowledge-cards/batching/" data-link-title="Batching" data-link-desc="多 request 一起跑、攤平 model load 成本：production LLM inference 的核心優化、決定 throughput vs latency 取捨">Batching</a> embeddings (send 50 chunks per request)</li>
<li>Parallel workers (Python multiprocessing, 4-8 workers)</li>
<li>Switching to a <a href="/blog/llm/knowledge-cards/vector-database/" data-link-title="Vector Database" data-link-desc="為高維向量 (embedding) 設計的儲存 &#43; 近似最近鄰 (ANN) 檢索系統：RAG 從 prototype 跨到 production 的關鍵元件">vector database</a> (avoids loading the whole dataset into RAM via pickle)</li>
</ul>
<h2 id="rag-query-階段retrieval-加-generation">RAG query stage: retrieval plus generation</h2>
<p>When running <code>python3 scripts/rag-demo/query.py --show-retrieved &quot;question&quot;</code>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Loaded 463 chunks from scripts/rag-demo/index.pkl
</span></span><span class="line"><span class="ln">2</span><span class="cl">=== Retrieved chunks ===
</span></span><span class="line"><span class="ln">3</span><span class="cl">  0.870  llm/knowledge-cards/transformer.md#chunk2
</span></span><span class="line"><span class="ln">4</span><span class="cl">  ...
</span></span><span class="line"><span class="ln">5</span><span class="cl">(LLM generates the response)</span></span></code></pre></div><p>Measured resource consumption (single query):</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>RAM delta</th>
          <th>Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Load index.pkl into RAM</td>
          <td>3.7 MB (small corpus) / hundreds of MB to GB (large corpus)</td>
          <td>&lt; 1 s</td>
      </tr>
      <tr>
          <td>embed query</td>
          <td>0 (nomic-embed-text already loaded)</td>
          <td>200 ms</td>
      </tr>
      <tr>
          <td>cosine over 463 chunks</td>
          <td>pure-Python computation, ~10 MB temporarily</td>
          <td>50 ms</td>
      </tr>
      <tr>
          <td>Load chat model (gemma3:1b)</td>
          <td>~1 GB (first time) / 0 (already cached)</td>
          <td>5-10 s (first time) / 0 (cached)</td>
      </tr>
      <tr>
          <td>Generate response</td>
          <td>no extra</td>
          <td>5-30 s (depends on model + prompt length)</td>
      </tr>
  </tbody>
</table>
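<p><strong>Characteristics of the query stage</strong>:</p>
<ul>
<li><strong>Cold start on the first query</strong>: the chat model has to be loaded into RAM, adding 5-10 s before the first token.</li>
<li><strong>Later queries are fast</strong>: the embedding model and the chat model are both in RAM, retrieval takes milliseconds, and only generation time remains.</li>
<li><strong>RAM usage = embedding model + chat model + index</strong>:
<ul>
<li>463 chunks: 274 MB + chat model + 3.7 MB ≈ chat model + 280 MB</li>
<li>100K chunks: 274 MB + chat model + ~800 MB in RAM, plus the overhead of the unpickled Python objects (<code>pickle.load</code> deserializes the whole file into memory; it does not memory-map it)</li>
</ul>
</li>
<li><strong>The bottleneck is the chat model</strong>: retrieval is fast, so at this corpus size nearly all of the time goes to generation.</li>
</ul>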
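<p>The retrieval step (cosine similarity over every chunk, in pure Python) can be sketched like this. It is a minimal sketch, not the demo's actual code: the record layout of <code>(chunk_id, vector)</code> pairs and the names <code>cosine</code> and <code>top_k</code> are assumptions made for illustration.</p>

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, records, k=3):
    """Return the k (score, chunk_id) pairs with the highest similarity."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in records]
    return sorted(scored, reverse=True)[:k]

# Toy 3-dim vectors; the real index would hold 768-dim embeddings.
records = [("a.md#chunk0", [1.0, 0.0, 0.0]),
           ("b.md#chunk1", [0.0, 1.0, 0.0]),
           ("c.md#chunk2", [1.0, 1.0, 0.0])]
print(top_k([1.0, 0.0, 0.0], records, k=2))
```

<p>A brute-force scan like this is linear in the number of chunks, which is fine for hundreds of chunks and is the part a vector database would replace at larger scale.</p>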
<p><strong>Multiple models side by side</strong> (embedding + chat):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># See which models are loaded and how much memory each uses</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># NAME                       SIZE      UNTIL</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># nomic-embed-text:latest    274 MB    4 minutes from now</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># gemma3:4b                  5.5 GB    4 minutes from now</span></span></span></code></pre></div><p>With both models loaded, Ollama uses about 6 GB of RAM. Ollama's <code>OLLAMA_KEEP_ALIVE</code> (default 5 minutes) unloads each model separately once it has been idle.</p>
<p><strong>Scale sanity check</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>RAM required</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chat only (gemma3:1b)</td>
          <td>~1 GB</td>
      </tr>
      <tr>
          <td>RAG with gemma3:1b + nomic-embed-text + small index</td>
          <td>~1.5 GB</td>
      </tr>
      <tr>
          <td>RAG with gemma3:4b + nomic-embed-text + medium index</td>
          <td>~6 GB</td>
      </tr>
      <tr>
          <td>RAG with gemma4:31b + nomic-embed-text + large index</td>
          <td>~20 GB</td>
      </tr>
  </tbody>
</table>
<p>RAG needs roughly 300-1000 MB more than plain chat (embedding model + index), which is not heavy.</p>
<h2 id="mcp-server-階段subprocess-常駐">MCP server stage: a resident subprocess</h2>
<p>When running <code>python3 scripts/mcp-demo/test_client.py</code>, the client spawns <code>blog_mcp_server.py</code> as a child process.</p>
<p>Measured:</p>
<table>
  <thead>
      <tr>
          <th>Resource</th>
          <th>Figure</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Subprocess RAM</td>
          <td>~50 MB</td>
          <td>Python runtime + the unpickled index.pkl</td>
      </tr>
      <tr>
          <td>Number of stdio pipes</td>
          <td>3 (stdin, stdout, stderr)</td>
          <td>each spawned server needs 3 FDs</td>
      </tr>
      <tr>
          <td>Duration</td>
          <td>runs as long as the client runs</td>
          <td>when the client exits, the server sees EOF on stdin and exits (SIGPIPE only arrives if it then writes to the closed stdout)</td>
      </tr>
  </tbody>
</table>
<p><strong>Characteristics of an MCP server</strong>:</p>
<ul>
<li><strong>One subprocess per server</strong>: if Claude Desktop starts 5 MCP servers, there are 5 Python subprocesses.</li>
<li><strong>Index lazy load</strong>: in this demo, <code>load_index()</code> reads the pickle only on the first call and caches it afterwards, so the first tool call after a cold start is slightly slower.</li>
<li><strong>The process lifecycle is owned by the client</strong>: when the client dies, stdin reaches EOF and the server exits on its own. A client that spawns repeatedly without cleaning up leaks processes.</li>
</ul>
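<p>The lazy load and the exit-on-EOF behavior can be sketched together. This is a minimal sketch and not the demo's <code>blog_mcp_server.py</code>: it speaks a simplified one-JSON-object-per-line exchange rather than the real MCP protocol, and the index contents are placeholder data.</p>

```python
import functools
import io
import json

@functools.lru_cache(maxsize=1)
def load_index():
    """Load the index on first use only; later calls reuse the cached value."""
    # Placeholder data; a real server would read its index file here.
    return {"chunks": ["example chunk"]}

def serve(stdin, stdout):
    """Answer one JSON request per line until stdin reaches EOF."""
    handled = 0
    for line in stdin:  # the loop ends on its own when the client closes the pipe
        request = json.loads(line)
        result = {"id": request.get("id"), "chunks": len(load_index()["chunks"])}
        stdout.write(json.dumps(result) + "\n")
        stdout.flush()
        handled += 1
    return handled

# Simulate a client that sends two requests and then exits (EOF).
fake_stdin = io.StringIO('{"id": 1}\n{"id": 2}\n')
fake_stdout = io.StringIO()
print(serve(fake_stdin, fake_stdout))
```

<p>In a real server, <code>serve(sys.stdin, sys.stdout)</code> would return as soon as the client goes away, which is why a well-behaved stdio server needs no explicit shutdown message.</p>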





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># List all running MCP servers</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ps aux <span class="p">|</span> grep blog_mcp_server <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># If a client crash left orphaned servers behind:</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">pkill -f <span class="s2">&#34;blog_mcp_server.py&#34;</span></span></span></code></pre></div><p><strong>Multiple MCP servers side by side</strong> (e.g. Claude Desktop connected to a git server + filesystem server + custom server):</p>
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>RAM</th>
          <th>Main workload</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>git MCP server</td>
          <td>~30 MB</td>
          <td>shell calls</td>
      </tr>
      <tr>
          <td>filesystem MCP server</td>
          <td>~30 MB</td>
          <td>filesystem operations</td>
      </tr>
      <tr>
          <td>blog_mcp_server (this demo)</td>
          <td>~50 MB (including index)</td>
          <td>embedding + retrieval</td>
      </tr>
      <tr>
          <td>5 servers at once</td>
          <td>~200 MB</td>
          <td>cumulative</td>
      </tr>
  </tbody>
</table>
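<p>200 MB is barely noticeable on a 32 GB Mac, but a 16 GB Mac running several MCP servers plus a large chat model can start to feel the squeeze.</p>
<h2 id="rag--mcp-整合完整應用-stack">RAG + MCP combined: the full application stack</h2>
<p>A real application stacks these layers:</p>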





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">User types in Claude Desktop
</span></span><span class="line"><span class="ln">2</span><span class="cl">  ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">Claude Desktop (~200 MB)
</span></span><span class="line"><span class="ln">4</span><span class="cl">  ↓ MCP stdio
</span></span><span class="line"><span class="ln">5</span><span class="cl">blog_mcp_server.py (~50 MB)
</span></span><span class="line"><span class="ln">6</span><span class="cl">  ↓ HTTP /api/embeddings + /v1/chat/completions
</span></span><span class="line"><span class="ln">7</span><span class="cl">Ollama daemon (~200 MB)
</span></span><span class="line"><span class="ln">8</span><span class="cl">  ↓ load
</span></span><span class="line"><span class="ln">9</span><span class="cl">nomic-embed-text model (~274 MB) + main chat model (~6 GB)</span></span></code></pre></div><p>Overall RAM usage range:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Estimate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Minimal (gemma3:1b + small index)</td>
          <td>~1.7 GB</td>
      </tr>
      <tr>
          <td>Standard (gemma3:4b + medium index)</td>
          <td>~6.5 GB</td>
      </tr>
      <tr>
          <td>Heavy (gemma4:31b + large index + multiple MCP servers)</td>
          <td>~22 GB</td>
      </tr>
  </tbody>
</table>
<p>Compared with the <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">resource-management chapter</a>, RAG / MCP adds ~500 MB-1 GB of overhead on top of chat, a reasonable tradeoff in exchange for retrieval + tool use.</p>
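<p>The stack estimate is a sum of the per-component figures in the diagram. A minimal sketch, assuming those rough figures (~200 MB for Claude Desktop, ~200 MB for the Ollama daemon, 274 MB for nomic-embed-text, ~50 MB per MCP subprocess including a small index); the function name and the <code>extra_index_mb</code> parameter are illustrative, and the results only approximate the table's rounded estimates.</p>

```python
def stack_ram_gb(chat_model_gb, extra_index_mb=0, mcp_servers=1):
    """Sum the per-component figures from the stack diagram, in GB."""
    fixed_mb = {
        "Claude Desktop": 200,
        "Ollama daemon": 200,
        "nomic-embed-text": 274,
    }
    mcp_mb = 50 * mcp_servers  # ~50 MB per Python MCP subprocess (small index included)
    total_mb = sum(fixed_mb.values()) + mcp_mb + extra_index_mb
    return chat_model_gb + total_mb / 1000

print(round(stack_ram_gb(1.0), 1))                      # gemma3:1b, small index
print(round(stack_ram_gb(5.5, extra_index_mb=55), 1))   # gemma3:4b, medium index
```

<p>Plugging in a larger chat model shows that the model dominates the total, while the RAG / MCP layers stay well under 1 GB.</p>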
<h2 id="各資源類型的關鍵指標">Key metrics by resource type</h2>
<p>Key metrics and monitoring methods for the three dimensions:</p>
<h3 id="ram">RAM</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># See which models Ollama has loaded</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">ollama ps
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># List all LLM-related processes (%MEM first; -i so that ComfyUI also matches)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">ps aux <span class="p">|</span> grep -iE <span class="s2">&#34;ollama|comfyui|mcp&#34;</span> <span class="p">|</span> grep -v grep <span class="p">|</span> awk <span class="s1">&#39;{print $4, $11, $12, $13}&#39;</span> <span class="p">|</span> sort -rn
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># System-wide</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">vm_stat <span class="p">|</span> head -3</span></span></code></pre></div><p><strong>Alert thresholds</strong>:</p>
<ul>
<li>RAM usage &gt; 80% of the system total: start considering unloading a model or shutting down ComfyUI</li>
<li>Swapouts increasing (<code>vm_stat | grep &quot;Swapouts&quot;</code>): the system is already swapping; reduce the loaded models right away</li>
</ul>
<h3 id="磁碟">Disk</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Accumulated Ollama models</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">du -sh ~/.ollama/models
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Accumulated RAG indexes (multiple corpora)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">du -sh scripts/rag-demo/index*.pkl 2&gt;/dev/null
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># ComfyUI checkpoints / VAE / LoRA / etc</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">du -sh ~/Projects/ComfyUI/models/*</span></span></code></pre></div><p><strong>Accumulation estimates</strong>:</p>
<ul>
<li>Ollama: 1-20 GB per model; half a year of accumulation easily exceeds 50 GB</li>
<li>RAG index: ~800 MB per 100K chunks; worth managing once several corpora pile up</li>
<li>ComfyUI: 4-7 GB per checkpoint; with LoRA / VAE / ControlNet and so on it can reach 50+ GB</li>
</ul>
<h3 id="process--port">Process / Port</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Audit every LLM service port in one go</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="k">for</span> p in <span class="m">11434</span> <span class="m">1234</span> <span class="m">8080</span> <span class="m">8188</span> 8000<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;=== port </span><span class="nv">$p</span><span class="s2"> ===&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  lsof -i :<span class="nv">$p</span> 2&gt;/dev/null <span class="p">|</span> head -2
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="k">done</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># Find orphaned subprocesses (MCP servers whose parent is now PID 1)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">ps -axo pid,ppid,command <span class="p">|</span> grep <span class="s2">&#34;mcp_server&#34;</span> <span class="p">|</span> grep -v grep <span class="p">|</span> awk <span class="s1">&#39;$2 == 1&#39;</span></span></span></code></pre></div><p><strong>Warning signs</strong>:</p>
<ul>
<li>Two processes listening on the same port: a stale process is clearly left over; kill it</li>
<li>Several mcp_server processes with PPID = 1 (reparented to launchd, the macOS init): the original client died without cleaning up</li>
</ul>
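<p>The PPID check can be automated by parsing <code>ps</code> output. A minimal sketch, assuming input shaped like the output of <code>ps -axo pid,ppid,command</code>; the helper name <code>orphaned_servers</code> and the sample PIDs are made up for illustration.</p>

```python
def orphaned_servers(ps_output, pattern="mcp_server"):
    """Return PIDs whose command contains pattern and whose parent PID is 1."""
    orphans = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)         # pid, ppid, command
        if len(fields) != 3:
            continue
        pid, ppid, command = fields
        if pattern in command and ppid == "1":
            orphans.append(int(pid))
    return orphans

# Sample shaped like `ps -axo pid,ppid,command`; the PIDs are made up.
sample = """  PID  PPID COMMAND
 4101     1 python3 scripts/mcp-demo/blog_mcp_server.py
 4102  4000 python3 scripts/mcp-demo/blog_mcp_server.py
 4200     1 /usr/sbin/syslogd
"""
print(orphaned_servers(sample))
```

<p>Only the first sample row qualifies: the second still has a live parent, and the third has PPID 1 but is not an MCP server.</p>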
<h2 id="rag-應用的長期累積管理">Managing long-term accumulation in RAG applications</h2>
<p>After a few weeks of use, these pile up:</p>
<table>
  <thead>
      <tr>
          <th>累積物</th>
          <th>為什麼累積</th>
          <th>怎麼清</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multiple <code>index.pkl</code></td>
          <td>跑不同 corpus 各建 index、舊的沒刪</td>
          <td><code>find scripts -name 'index*.pkl' -mtime +30 -delete</code></td>
      </tr>
      <tr>
          <td>Ollama models</td>
          <td>試了不同 model 沒清</td>
          <td>看 <code>ollama list</code> modified 欄、<code>ollama rm</code> 不用的</td>
      </tr>
      <tr>
          <td>Python <code>__pycache__</code></td>
          <td>Created whenever a script imports modules</td>
          <td>Already covered by <code>.gitignore</code>; locally, <code>find . -type d -name __pycache__ -prune -exec rm -rf {} +</code></td>
      </tr>
      <tr>
          <td>Embedding cache</td>
          <td>Only if you wrote your own embedding cache</td>
          <td>Depends on that cache's own eviction policy</td>
      </tr>
  </tbody>
</table>
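<p>The <code>-delete</code> one-liner in the table removes files immediately. A more cautious variant, sketched below, lists the candidates first and deletes only on explicit opt-in. The function name <code>purge_stale</code> and its <code>--yes</code> flag are hypothetical, introduced here for illustration; only standard <code>find</code> options are used.</p>

```shell
# purge_stale ROOT NAME_GLOB DAYS [--yes]   (hypothetical helper)
# Without --yes: dry run, prints files older than DAYS days and deletes nothing.
# With --yes: prints the same files and deletes them.
purge_stale() {
  if [ "${4:-}" = "--yes" ]; then
    find "$1" -type f -name "$2" -mtime +"$3" -print -delete
  else
    find "$1" -type f -name "$2" -mtime +"$3" -print
  fi
}

# Example usage:
#   purge_stale scripts 'index*.pkl' 30         # review the list first
#   purge_stale scripts 'index*.pkl' 30 --yes   # then actually delete
```

<p>Keeping the dry run as the default means a mistyped glob or path costs nothing.</p>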
<p>Cleanup idiom:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Monthly cleanup review (lists candidates, deletes nothing)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">llm-rag-cleanup<span class="o">()</span> <span class="o">{</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Old indexes (&gt;30 days):&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  find scripts -name <span class="s1">&#39;index*.pkl&#39;</span> -mtime +30 -ls
</span></span><span class="line"><span class="ln">5</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Ollama models (review):&#34;</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  ollama list
</span></span><span class="line"><span class="ln">7</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;[*] Python caches:&#34;</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">  find ~/Projects -name __pycache__ -type d <span class="p">|</span> head -10
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="o">}</span></span></span></code></pre></div><h2 id="跟-production-的差距預告">Preview: the gap to production</h2>
<p>The numbers recorded in this post are a single-user, single-machine, no-concurrency baseline. A production setting adds several dimensions:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Local</th>
          <th>Production</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Concurrent users</td>
          <td>1</td>
          <td>10-10000</td>
      </tr>
      <tr>
          <td>Index size</td>
          <td>&lt; 100 MB</td>
          <td>TB scale</td>
      </tr>
      <tr>
          <td>Model serving</td>
          <td>Ollama, 1 process</td>
          <td>vLLM / TGI / Triton, multiple workers</td>
      </tr>
      <tr>
          <td>Vector storage</td>
          <td>pickle</td>
          <td>Pinecone / Weaviate / pgvector</td>
      </tr>
      <tr>
          <td>Latency requirement</td>
          <td>Seconds are acceptable</td>
          <td>p50 &lt; 500 ms, p99 &lt; 2 s</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>One-time hardware purchase</td>
          <td>$/request, $/token</td>
      </tr>
      <tr>
          <td>Observability</td>
          <td>tail log</td>
          <td>metrics / traces / dashboards</td>
      </tr>
      <tr>
          <td>Failure mode</td>
          <td>crash → restart it yourself</td>
          <td>99.9% uptime SLA</td>
      </tr>
  </tbody>
</table>
<p>The production perspective is covered in detail in <a href="/blog/llm/04-applications/production-resource-planning/" data-link-title="4.9 Production 部署的資源評估原理" data-link-desc="從本地單 user 到 production multi-tenant：concurrent users、cost model、observability、SLA、capacity planning 的設計取捨">4.9 Principles of resource estimation for production deployment</a>.</p>
<h2 id="何時這篇會過時">When this post will go stale</h2>
<p><strong>Parts that will not go stale</strong>:</p>
<ul>
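<li>The three-phase footprint classification (ingest / query / server)</li>
<li>The monitoring commands for the three dimensions: RAM / disk / process</li>
<li>The method for estimating RAM when several models are loaded side by side</li>
<li>The long-term accumulation management idiom</li>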
</ul>
<p><strong>Parts that will change</strong>:</p>
<ul>
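<li>Specific RAM / disk numbers (they shift as model architectures and quantization methods evolve)</li>
<li>Specific environment variable names such as <code>OLLAMA_KEEP_ALIVE</code></li>
<li>Which vector DBs are mainstream (this keeps evolving)</li>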
</ul>
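<p>If the RAM usage you observe does not match this post, a newer model architecture may have changed the memory efficiency. Measure a baseline for your own environment with the same method.</p>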
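<p>When recording your own baseline, raw <code>vm_stat</code> output is in pages, which makes before/after comparisons awkward. The sketch below converts the free, active, and wired counts to MiB. It assumes the standard macOS <code>vm_stat</code> layout (a header line stating the page size, and counts ending in a period); the function name <code>vm_stat_mib</code> is hypothetical.</p>

```shell
# vm_stat_mib (hypothetical helper): read `vm_stat` output on stdin and
# print free / active / wired memory in MiB.
vm_stat_mib() {
  awk '
    # Header looks like: "... (page size of 16384 bytes)"
    /page size of/ {
      for (i = 1; i <= NF; i++) if ($i == "of") page = $(i + 1)
    }
    /^Pages (free|active|wired down):/ {
      label = $0; sub(/:.*/, "", label)    # keep the text before the colon
      pages = $NF; sub(/\.$/, "", pages)   # drop the trailing period
      printf "%s: %d MiB\n", label, pages * page / 1048576
    }
  '
}

# Example usage: run once before loading a model and once after.
#   vm_stat | vm_stat_mib
```

<p>Reading the page size from the header instead of hard-coding it keeps the conversion correct on machines that use a different page size.</p>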
<p>How this relates to the other hands-on chapters: the full series is listed in the <a href="/blog/llm/01-local-llm-services/hands-on/" data-link-title="Hands-on：本地 AI 工具實作筆記" data-link-desc="Ollama / ComfyUI / Whisper / Piper TTS：實際安裝、驗證、跑通的紀錄。隨工具版本演化、跟 1.x 原理章節互補。">Hands-on chapter index</a>; the paired implementations are the <a href="/blog/llm/01-local-llm-services/hands-on/rag-demo/" data-link-title="Hands-on：用 blog content 當 corpus 跑 RAG" data-link-desc="200 行 Python：embedding &#43; cosine retrieval &#43; Ollama chat、validating 4.0 RAG 原理">RAG demo</a> and the <a href="/blog/llm/01-local-llm-services/hands-on/mcp-demo/" data-link-title="Hands-on：用 blog content 寫一個最小 MCP server" data-link-desc="stdio JSON-RPC、stdlib-only Python、暴露 blog content 給 LLM 用、validating 4.3 應用層協議">MCP demo</a>; the lifecycle management shared by Ollama and ComfyUI is covered in <a href="/blog/llm/01-local-llm-services/hands-on/resource-management/" data-link-title="Hands-on：LLM 運行中 &#43; 結束的資源管理" data-link-desc="RAM / 磁碟 / port 三個 dimension 的觀察跟釋放、Ollama keep_alive 跟 ComfyUI 兩種 lifecycle 對比、實測釋放數字">Resource management</a>; and the Apple Silicon unified memory budget is explained in <a href="/blog/llm/00-foundations/hardware-memory-budget/" data-link-title="0.5 Apple Silicon 記憶體預算" data-link-desc="記憶體決定能跑什麼，Q4 量化下的可運作模型對照與系統保留">0.5 Memory budget</a>.</p>
<h2 id="跑這篇實測的指令總結">Summary of the commands used for these measurements</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. Measure RAM during the RAG ingest phase</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">ollama ps  <span class="c1"># check the baseline first</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">python3 scripts/rag-demo/ingest.py <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="nv">INGEST_PID</span><span class="o">=</span><span class="nv">$!</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">ollama ps  <span class="c1"># check after the embedding model loads (rerun if the list is still empty)</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">vm_stat <span class="p">|</span> head -3
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="nb">wait</span> <span class="nv">$INGEST_PID</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 2. Measure RAM during the RAG query phase</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">ollama ps  <span class="c1"># confirm the model unloaded after idling</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">python3 scripts/rag-demo/query.py --show-retrieved <span class="s2">&#34;test query&#34;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">ollama ps  <span class="c1"># confirm the chat model is loaded</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># 3. Process / RAM during the MCP server phase</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">python3 scripts/mcp-demo/test_client.py <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="nv">CLIENT_PID</span><span class="o">=</span><span class="nv">$!</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">sleep <span class="m">2</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">ps aux <span class="p">|</span> grep blog_mcp_server <span class="p">|</span> grep -v grep
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="nb">wait</span> <span class="nv">$CLIENT_PID</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"># 4. Release everything when done</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">ollama ps <span class="p">|</span> tail -n +2 <span class="p">|</span> awk <span class="s1">&#39;{print $1}&#39;</span> <span class="p">|</span> xargs -I <span class="o">{}</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="se"></span>  curl -s http://localhost:11434/api/generate -d <span class="s2">&#34;{\&#34;model\&#34;:\&#34;{}\&#34;,\&#34;keep_alive\&#34;:0}&#34;</span></span></span></code></pre></div>]]></content:encoded></item></channel></rss>