<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Paradigm-Shift on Tarragon</title><link>https://tarrragon.github.io/blog/tags/paradigm-shift/</link><description>Recent content in Paradigm-Shift on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 16 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/paradigm-shift/index.xml" rel="self" type="application/rss+xml"/><item><title>從 Firestore 遷往自建 relational：撞牆驅動的 Type E 重建模、存取模型反轉與並行期</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/firestore/migrate-to-relational/</link><pubDate>Tue, 16 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/firestore/migrate-to-relational/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/firestore/" data-link-title="Firestore" data-link-desc="Firebase / Google Cloud 的 serverless document database、collection / document 模型、client 直連 &amp;#43; Security Rules、realtime listener 與 offline 同步、BaaS bundle 的資料層面">Firestore&lt;/a> overview 的 migration playbook。寫作參照 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook 寫作方法論&lt;/a>。BaaS 託管平台整場遷出的資產線盤點與並行期總覽見 &lt;a href="https://tarrragon.github.io/blog/backend/10-system-evolution/managed-platform-exit/" data-link-title="10.3 託管形態遷出：資產線盤點與並行期執行" data-link-desc="0.21 升級自建 tripwire 觸發後的執行劇本 — 把遷出拆成資料、身分、流量、整合各自的可攜性與斷點、設計舊平台與新系統的並行期與回切窗口、用部分遷出作為中繼形態">10.3 託管形態遷出&lt;/a>；本文聚焦資料層的跨 paradigm 重建模。&lt;/p>&lt;/blockquote>
&lt;p>「我們把 Firestore 整包匯出，匯進 PostgreSQL 就好。」這句話低估了遷移的真正內容 — Firestore 遷往自建 relational 的難點是&lt;strong>反轉整個存取模型&lt;/strong>，搬資料只是其中最容易的一條線。Firestore 是 client 用 SDK 直連資料庫、授權寫在 Security Rules；自建 relational 是 client 打自己的後端 API、授權在後端中介層。資料可以匯出，但反正規化的 document 形狀、沿查詢限制長出來的資料模型、realtime listener 與 offline 同步能力，都沒有 1:1 的對應物。字面意義的「匯出再匯入」只搬走了最容易的那部分。本文走 paradigm shift 結構：先講為何字面遷移不成立、再講哪些該遷哪些先留、最後才是階段化執行。&lt;/p>
&lt;h2 id="遷移的-driver三面牆不是relational-比較好">遷移的 driver：三面牆，不是「relational 比較好」&lt;/h2>
&lt;p>Firestore 遷往自建很少因為「relational 比較好」這種空泛動機，而是撞到 &lt;a href="https://tarrragon.github.io/blog/backend/00-service-selection/delivery-mode-selection/" data-link-title="0.21 交付形態選型：從全託管到自建的光譜與邊界" data-link-desc="在進入資料庫、快取與部署選型之前、先判斷服務該用託管平台（Wix / Shopify / Google Sites）、辦公生態自動化（Apps Script）、BaaS（Firebase）、半託管 CMS（WordPress）還是自建、並為日後遷往自建保留可遷出路徑">0.21&lt;/a> BaaS 段描述的三面具體的牆。先確認 driver 真的成立、再啟動遷移：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>撞牆訊號&lt;/th>
 &lt;th>遷移要解的問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>報表 / 分析查詢&lt;/td>
 &lt;td>跨 collection 報表查不出來、已經在維護資料複製管線&lt;/td>
 &lt;td>把資料放回支援 JOIN / aggregation 的 relational&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>成本曲線轉折&lt;/td>
 &lt;td>read / write 計費隨流量線性成長、超過自建 + cache 的成本&lt;/td>
 &lt;td>用自管資料庫 + 應用層快取壓低單位成本&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>授權控制面失控&lt;/td>
 &lt;td>Security Rules 長到難以測試 / review、授權邏輯沒有版本治理&lt;/td>
 &lt;td>把授權拉回後端 API 中介層、可測試可審查&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>No-go condition&lt;/strong>：產品仍以多裝置 realtime 同步與 offline-first 為核心賣點、且查詢需求簡單、成本仍在舒適區 → 先不要遷。這些正是 Firestore 的主場，硬遷會把 realtime / offline 這層平台白送的能力變成自己要重建的工程。遷移前先問「撞的是哪面牆」，三面牆都沒撞到就是 &lt;a href="https://tarrragon.github.io/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22&lt;/a> 講的偽自建。&lt;/p>&lt;/blockquote>
&lt;p>逐能力遷出是常態而非整包搬離：&lt;a href="https://tarrragon.github.io/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22 的「成長期 SaaS」例子&lt;/a> 就是只把撞牆的資料層搬到自管 PostgreSQL、認證留在原平台。本文預設的也是這種逐能力遷出 — 遷的是資料層，不一定連認證、儲存一起搬。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is the migration playbook for the <a href="/blog/backend/01-database/vendors/firestore/" data-link-title="Firestore" data-link-desc="Firebase / Google Cloud 的 serverless document database、collection / document 模型、client 直連 &#43; Security Rules、realtime listener 與 offline 同步、BaaS bundle 的資料層面">Firestore</a> overview. It is written according to the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook writing methodology</a>. For the asset inventory and parallel-run overview of a full exit from a BaaS managed platform, see <a href="/blog/backend/10-system-evolution/managed-platform-exit/" data-link-title="10.3 託管形態遷出：資產線盤點與並行期執行" data-link-desc="0.21 升級自建 tripwire 觸發後的執行劇本 — 把遷出拆成資料、身分、流量、整合各自的可攜性與斷點、設計舊平台與新系統的並行期與回切窗口、用部分遷出作為中繼形態">10.3 Managed-platform exit</a>; this article focuses on cross-paradigm remodeling of the data layer.</p></blockquote>
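<p>"We'll just export Firestore wholesale and import it into PostgreSQL." That sentence underestimates what the migration actually involves. The hard part of moving from Firestore to a self-hosted relational database is <strong>inverting the entire access model</strong>; moving the data is only the easiest workstream. With Firestore, the client connects to the database directly through the SDK and authorization lives in Security Rules. With a self-hosted relational database, the client calls your own backend API and authorization lives in the backend middle tier. The data can be exported, but the denormalized document shapes, the data model that grew around query limitations, and the realtime listener and offline sync capabilities have no 1:1 counterpart. A literal "export then import" moves only the easiest part. This article follows a paradigm-shift structure: first why a literal migration does not hold, then what to migrate and what to leave in place, and only then the phased execution.</p>
<h2 id="遷移的-driver三面牆不是relational-比較好">Migration drivers: three walls, not "relational is better"</h2>
<p>Teams rarely leave Firestore for self-hosting out of a vague sense that "relational is better". They leave because they have hit one of the three concrete walls described in the BaaS section of <a href="/blog/backend/00-service-selection/delivery-mode-selection/" data-link-title="0.21 交付形態選型：從全託管到自建的光譜與邊界" data-link-desc="在進入資料庫、快取與部署選型之前、先判斷服務該用託管平台（Wix / Shopify / Google Sites）、辦公生態自動化（Apps Script）、BaaS（Firebase）、半託管 CMS（WordPress）還是自建、並為日後遷往自建保留可遷出路徑">0.21</a>. Confirm that the driver really holds before starting the migration:</p>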
<table>
  <thead>
      <tr>
          <th>Driver</th>
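          <th>Wall-hit signal</th>
          <th>Problem the migration must solve</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reporting / analytical queries</td>
          <td>Cross-collection reports cannot be queried; a data replication pipeline is already being maintained</td>
          <td>Put the data back into a relational store that supports JOIN / aggregation</td>
      </tr>
      <tr>
          <td>Cost curve inflection</td>
          <td>Read / write billing grows linearly with traffic and exceeds the cost of self-hosting + cache</td>
          <td>Lower unit cost with a self-managed database + application-level cache</td>
      </tr>
      <tr>
          <td>Authorization control plane out of control</td>
          <td>Security Rules have grown too large to test / review; authorization logic has no version governance</td>
          <td>Pull authorization back into the backend API middle tier, where it can be tested and reviewed</td>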
      </tr>
  </tbody>
</table>
<blockquote>
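<p><strong>No-go condition</strong>: if the product still sells on multi-device realtime sync and offline-first, queries are simple, and cost is still in the comfort zone, do not migrate yet. That is Firestore's home turf, and forcing a migration turns the realtime / offline capabilities the platform gives away for free into engineering you have to rebuild yourself. Before migrating, ask "which wall did we hit?"; if none of the three walls has been hit, this is the pseudo self-hosting described in <a href="/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22</a>.</p></blockquote>
<p>Capability-by-capability exit is the norm, not a wholesale move: <a href="/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">the "growth-stage SaaS" example in 0.22</a> moves only the data layer that hit the wall to self-managed PostgreSQL and leaves authentication on the original platform. This article assumes the same capability-by-capability exit: what moves is the data layer, not necessarily authentication and storage along with it.</p>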
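<h2 id="6-維-diff-audit主導維度是-paradigm--application-change">6-dimension diff audit: the dominant dimensions are paradigm and application change</h2>
<p>Before migrating, take stock of which dimensions the source and target differ on; that decides the playbook structure:</p>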
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Firestore → self-hosted relational</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>document / collection → normalized tables; SDK query → backend API + SQL</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>serverless fully managed → self-managed / managed database; you own backup / failover</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>client direct connection + rule-based authorization → API mediation + backend authorization</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Component count</td>
          <td>single platform → an added self-built backend service layer + database</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>frontend drops the SDK and calls the API; realtime / offline must be rebuilt</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>platform-managed replication → design your own replicas / multi-region / DR</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
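<p>The dominant dimensions are <strong>paradigm and application change</strong>: five of the six dimensions land on High. That fixes the structure as a <strong>Type E paradigm shift</strong> (ruling out Type A schema translation and Type B drop-in): the access model inverts, some capabilities are rebuilt, and a long-lived hybrid is possible (data layer self-hosted, authentication still on the platform).</p>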
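<h2 id="為什麼字面遷移不成立存取模型反轉">Why a literal migration does not hold: the access model inverts</h2>
<p>Firestore's access model is <em>the frontend is the client, the database faces the public internet directly, and authorization lives in the rules layer</em>. A self-hosted relational setup is <em>the frontend calls the backend, the backend faces the database, and authorization lives in the service layer</em>. This inversion, not the data transfer, is the core difficulty of the migration.</p>
<p><strong>Denormalized documents → normalized schema</strong>:</p>
<ul>
<li>To work around query limitations, Firestore data models often write related data redundantly into the same document (one piece of data copied to several places)</li>
<li>Moving to relational means splitting that redundancy back into normalized tables and rebuilding foreign-key relationships. This is reverse engineering: you first have to understand why the data was stored that way</li>
<li>Conversely, some nested document structures are easier to keep as JSONB in the relational database (see <a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">PostgreSQL jsonb</a>); not every document needs to be split into tables</li>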
</ul>
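<p>As a rough illustration of undoing denormalization, the sketch below splits embedded customer copies out of order documents into two row sets linked by a foreign key. The document shape, the field names (<code>customer</code>, <code>total</code>) and the <code>normalize</code> helper are all hypothetical, not taken from any real project; the point is only that drifted copies should be surfaced, not silently merged.</p>

```python
from typing import Any

# Hypothetical Firestore-style order documents: the customer record is
# copied into every order so a single read returns everything.
order_docs = [
    {"id": "o1", "total": 120, "customer": {"id": "c1", "name": "Ada"}},
    {"id": "o2", "total": 80, "customer": {"id": "c1", "name": "Ada"}},
    {"id": "o3", "total": 45, "customer": {"id": "c2", "name": "Lin"}},
]


def normalize(docs: list[dict[str, Any]]) -> tuple[dict[str, dict], list[dict], list[str]]:
    """Split embedded customer copies into customers and orders row sets.

    Returns (customers keyed by id, order rows carrying a customer_id
    foreign key, messages for copies that disagree with each other).
    """
    customers: dict[str, dict] = {}
    orders: list[dict] = []
    conflicts: list[str] = []
    for doc in docs:
        cust = doc["customer"]
        # The first copy seen becomes the candidate row for this customer.
        seen = customers.setdefault(cust["id"], dict(cust))
        if seen != cust:
            # Redundant copies drifted apart; someone has to pick a winner.
            conflicts.append(f"customer {cust['id']} differs in order {doc['id']}")
        orders.append({"id": doc["id"], "total": doc["total"], "customer_id": cust["id"]})
    return customers, orders, conflicts


customers, orders, conflicts = normalize(order_docs)
```

<p>A non-empty <code>conflicts</code> list is a signal to go back and work out which copy the application actually treated as authoritative before loading anything.</p>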
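<p><strong>Security Rules authorization → backend authorization</strong>:</p>
<ul>
<li>Firestore's authorization logic is scattered across the Security Rules DSL; the migration has to translate every rule into a permission check in the backend API</li>
<li>This translation is security-sensitive: missing one rule opens a hole for unauthorized queries. See <a href="/blog/backend/01-database/red-team-data-layer/" data-link-title="1.5 攻擊者視角（紅隊）：資料層弱點判讀" data-link-desc="從資料存取邊界、外洩路徑與修復代價、盤點 database 的主要弱點">1.5 Data-layer red team</a></li>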
</ul>
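<p>To make the rule-by-rule translation concrete, here is a minimal sketch that assumes a single owner-only read rule of the form <code>allow read: if request.auth.uid == resource.data.ownerId;</code>. The <code>ownerId</code> field, the <code>Caller</code> type and <code>can_read_note</code> are illustrative names, and a real rule set will contain many more conditions than this one.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Caller:
    uid: Optional[str]  # None models an unauthenticated request


def can_read_note(caller: Caller, note: dict) -> bool:
    """Backend counterpart of the assumed owner-only read rule.

    Deny by default: anything not explicitly allowed is refused,
    including documents that lack an ownerId field.
    """
    if caller.uid is None:
        return False
    return note.get("ownerId") == caller.uid


# Hypothetical record used to exercise both the allow and deny paths.
note = {"id": "n1", "ownerId": "alice", "body": "draft"}
owner_allowed = can_read_note(Caller("alice"), note)
stranger_allowed = can_read_note(Caller("bob"), note)
anonymous_allowed = can_read_note(Caller(None), note)
```

<p>Keeping one small function per translated rule makes it possible to test the denied cases explicitly, which is where a missed rule would otherwise go unnoticed.</p>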
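<p><strong>SDK direct connection → API mediation</strong>:</p>
<ul>
<li>The frontend used to read and write directly through the Firestore SDK; after the migration the SDK is removed and the frontend calls the self-built API</li>
<li>This is a major application-layer change, not a matter of swapping a database connection string</li>
</ul>
<p><strong>realtime listener / offline persistence → rebuild it yourself</strong>:</p>
<ul>
<li>Live push from snapshot listeners and the offline read/write cache are capabilities the platform gives away for free</li>
<li>Self-hosting means rebuilding the realtime layer with WebSocket / SSE (see <a href="/blog/backend/03-message-queue/" data-link-title="模組三：訊息佇列與事件傳遞" data-link-desc="整理 durable queue、broker、retry、outbox 與 idempotency 的後端實務">03 Message queues</a> and presence design) and rebuilding offline with frontend local storage. This is the most commonly underestimated workload in the migration</li>
</ul>
<p>So the first step of the migration is not importing data; it is <strong>taking inventory of every surface where the application depends on Firestore</strong>: query paths, authorization rules, realtime subscriptions, and offline behavior. That inventory decides what can be migrated directly, what must be rebuilt, and what stays on the platform for now.</p>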
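<h2 id="哪些該遷哪些先留逐能力混合">What to migrate and what to leave (capability-by-capability hybrid)</h2>
<p>Type E is, by nature, non-convergent: you do not have to move every Firebase capability at once. Decision criteria:</p>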
<table>
  <thead>
      <tr>
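          <th>Workload / capability characteristic</th>
          <th>Destination</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data that needs reporting / JOIN / aggregation</td>
          <td>Migrate to self-hosted relational</td>
      </tr>
      <tr>
          <td>High-read-volume, cost-sensitive data with a stable access pattern</td>
          <td>Migrate to self-hosted + <a href="/blog/backend/02-cache-redis/" data-link-title="模組二：快取與 Redis" data-link-desc="整理快取策略、Redis 資料型別與分散式狀態輔助能力">application-level cache</a></td>
      </tr>
      <tr>
          <td>Data that still centers on realtime sync and has simple queries</td>
          <td>Keep in Firestore for now, or migrate last</td>
      </tr>
      <tr>
          <td>Authentication (Firebase Auth)</td>
          <td>Can stay on the platform; decide per capability (see 0.22)</td>
      </tr>
      <tr>
          <td>File storage (Firebase Storage)</td>
          <td>Can stay on the platform; reassess after decoupling from the data layer</td>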
      </tr>
  </tbody>
</table>
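<p><a href="/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">The growth-stage SaaS in 0.22</a> is the case anchor for this decision: the wall that was hit is the data layer's query complexity and cost, so what moves is the data layer, and authentication stays where it is. A hybrid is not a failed transition; it is the steady state of capability-by-capability selection.</p>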
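<h2 id="phase-plan存取模型反轉的階段化">Phase plan: staging the access-model inversion</h2>
<p>Staging a paradigm shift means putting irreversible actions last and giving every phase its own verification gate:</p>
<h4 id="phase-1依賴面盤點">Phase 1: dependency inventory</h4>
<p>List every read/write path from the application to Firestore, every Security Rules authorization condition, every realtime subscription point, and all offline behavior. Tag each item with its frequency, security sensitivity, and whether it can be rebuilt. Do not enter the next phase until this inventory is complete.</p>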
<h4 id="phase-2relational-重建模">Phase 2: relational remodeling</h4>
<p>Design the denormalized documents back into a normalized schema and decide which nested structures to keep as JSONB. In parallel, design the backend API endpoints and authorization checks, translating Security Rules one by one into service-layer permissions. See <a href="/blog/backend/01-database/schema-design/" data-link-title="1.2 Schema Design 與資料建模" data-link-desc="整理 table、index、key、partition、denormalization 與命名規則">1.2 schema design</a> and <a href="/blog/backend/01-database/red-team-data-layer/" data-link-title="1.5 攻擊者視角（紅隊）：資料層弱點判讀" data-link-desc="從資料存取邊界、外洩路徑與修復代價、盤點 database 的主要弱點">1.5 Data-layer red team</a>.</p>
<h4 id="phase-3自建後端--dual-write">Phase 3: self-built backend + dual-write</h4>
<p>Stand up the self-built backend API and database, and have the critical frontend write paths write to both Firestore and the new backend. Firestore remains the source of truth while the new database accumulates data. Dual-write must handle compensation when one side fails (see <a href="/blog/backend/01-database/reconciliation-data-repair/" data-link-title="1.9 Reconciliation 與 Data Repair" data-link-desc="資料不一致的分類、偵測模式、修復策略、audit trail、跟 backup / PITR 整合">1.9 Reconciliation</a>).</p>
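<p>One possible shape for a dual-write with failure compensation is sketched below. It assumes the primary (Firestore) write goes first and that a secondary failure is recorded for later repair instead of failing the user request; the in-memory dictionaries, <code>repair_queue</code> and <code>dual_write</code> are stand-ins invented for this example, not real client APIs.</p>

```python
from typing import Callable

repair_queue: list[dict] = []  # stand-in for a durable repair / outbox table


def dual_write(doc_id: str, data: dict,
               write_primary: Callable[[str, dict], None],
               write_secondary: Callable[[str, dict], None]) -> bool:
    """Write to the source of truth first, then mirror to the new store.

    A primary failure propagates to the caller. A secondary failure is
    recorded for later reconciliation instead of failing the request.
    Returns True when both writes succeeded.
    """
    write_primary(doc_id, data)  # Firestore stays the source of truth
    try:
        write_secondary(doc_id, data)
        return True
    except Exception as exc:
        repair_queue.append({"doc_id": doc_id, "data": data, "error": str(exc)})
        return False


# In-memory stand-ins for the two stores (hypothetical).
firestore_store: dict[str, dict] = {}
sql_store: dict[str, dict] = {}


def flaky_sql_write(doc_id: str, data: dict) -> None:
    """Simulates the new database being unreachable."""
    raise ConnectionError("new database unavailable")


ok = dual_write("o1", {"total": 120}, firestore_store.__setitem__, sql_store.__setitem__)
failed = dual_write("o2", {"total": 80}, firestore_store.__setitem__, flaky_sql_write)
```

<p>In this sketch the second call leaves <code>o2</code> only in the primary store and adds one entry to <code>repair_queue</code>, which a retry job or a manual reconciliation pass would then drain.</p>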
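<h4 id="phase-4backfill-歷史資料">Phase 4: backfill historical data</h4>
<p>Convert the existing Firestore documents to the new schema and write them into the new database. When backfill runs alongside dual-write, write ordering matters: backfill must not overwrite newer values written by dual-write. Record checksums / row counts during conversion for comparison.</p>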
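<p>The ordering rule can be sketched as a guarded write plus a digest for comparison. This assumes every row carries an <code>updated_at</code> value set by both the backfill and the dual-write path; that column name, <code>apply_backfill</code> and <code>checksum</code> are illustrative choices, and a real backfill would run the guard inside the database (for example as a conditional upsert) instead of in application memory.</p>

```python
import hashlib
import json


def apply_backfill(target: dict[str, dict], doc_id: str, row: dict) -> bool:
    """Write a historical row only when the target holds nothing newer.

    Assumes an `updated_at` value on every row (hypothetical column).
    Returns True when the row was written.
    """
    existing = target.get(doc_id)
    if existing is not None and existing["updated_at"] >= row["updated_at"]:
        return False  # dual-write already stored a newer (or equal) value
    target[doc_id] = row
    return True


def checksum(rows: dict[str, dict]) -> str:
    """Digest that does not depend on key insertion order."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


new_db = {"o1": {"total": 150, "updated_at": 200}}  # written by dual-write
wrote_old = apply_backfill(new_db, "o1", {"total": 120, "updated_at": 100})
wrote_new = apply_backfill(new_db, "o2", {"total": 80, "updated_at": 90})
```

<p>Comparing <code>checksum</code> over the converted source batch and the rows read back from the target gives a cheap per-batch check to log alongside the row counts.</p>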
<h4 id="phase-5shadow-read-驗證">Phase 5: shadow read verification</h4>
<p>The read path queries both Firestore and the new backend, compares the results, and logs differences while still serving users from Firestore. Proceed to cutover only once the difference rate drops to an acceptable level. See the evidence method in <a href="/blog/backend/01-database/schema-migration-rollout-evidence/" data-link-title="1.7 Schema Migration Rollout 證據（Schema Migration Rollout Evidence）實作示範" data-link-desc="以訂單付款狀態欄位演進示範 schema migration 如何產出 evidence、release gate 與 incident decision log。">1.7 Schema Migration Rollout Evidence</a>.</p>
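<p>A shadow read might look roughly like the following sketch: the old store's answer is always the one served, and any disagreement or error on the new path is only recorded. The two dictionaries, <code>diff_log</code> and <code>shadow_read</code> are hypothetical stand-ins for the real read paths and logging sink.</p>

```python
from typing import Any, Callable

diff_log: list[dict] = []  # stand-in for a metrics or logging sink


def shadow_read(key: str,
                read_old: Callable[[str], Any],
                read_new: Callable[[str], Any]) -> Any:
    """Serve the old store's answer and record any disagreement.

    Errors from the new path are logged as differences and never reach
    the user, because the old store is still the source of truth.
    """
    served = read_old(key)
    try:
        candidate = read_new(key)
        if candidate != served:
            diff_log.append({"key": key, "old": served, "new": candidate})
    except Exception as exc:
        diff_log.append({"key": key, "old": served, "error": str(exc)})
    return served


# Hypothetical data: the new store disagrees on one of two records.
old_store = {"o1": {"total": 120}, "o2": {"total": 80}}
new_store = {"o1": {"total": 120}, "o2": {"total": 75}}

results = [shadow_read(k, old_store.get, new_store.get) for k in ("o1", "o2")]
diff_rate = len(diff_log) / len(results)
```

<p>Each logged difference still needs classifying by hand or by rule as an intended modeling difference or a real error before the rate means anything as a gate.</p>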
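<h4 id="phase-6漸進-cutover--重建即時層">Phase 6: gradual cutover + rebuilding the realtime layer</h4>
<p>The frontend gradually moves reads and writes from the Firestore SDK to the self-built API (by percentage / by feature module) while keeping the ability to switch back. If the product needs realtime, this is the phase where snapshot listeners are replaced by a self-built realtime layer (WebSocket / SSE), with latency and reconnect behavior verified. Once cutover completes, the data layer's source of truth is the self-hosted database; capabilities that were not migrated (authentication, storage) remain on the platform, and the hybrid architecture is in place.</p>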
<h2 id="evidence每階段的前進依據">Evidence: the basis for advancing each phase</h2>
<p>Each phase uses data, not gut feeling, to prove it is safe to advance:</p>
<table>
  <thead>
      <tr>
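          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>dual-write</td>
          <td>Dual-write success rate, write-failure compensation records, document / row count difference between the two sides</td>
      </tr>
      <tr>
          <td>backfill</td>
          <td>Percentage converted, conversion error count, checksum comparison, spot checks that denormalized data was restored correctly</td>
      </tr>
      <tr>
          <td>shadow read</td>
          <td>Old-vs-new difference rate, difference classification (modeling difference vs real error), scan for authorization-translation gaps</td>
      </tr>
      <tr>
          <td>cutover</td>
          <td>Traffic-shift percentage, new API latency p99, error rate, realtime push latency, whether rollback was triggered</td>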
      </tr>
  </tbody>
</table>
<p>This evidence aligns with <a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a> (Source / Time range / Query link / Owner / Data quality) and <a href="/blog/backend/06-reliability/release-gate/" data-link-title="6.8 Release Gate 與變更節奏" data-link-desc="把驗證、migration、相容性納入放行判準">6.8 release gate</a>. Authorization translation in particular should be treated as a gate condition: it is a security boundary, not just functional correctness.</p>
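<h2 id="cutover-與-rollback-決策">Cutover and rollback decisions</h2>
<p>A failed database traffic switch is expensive, and authorization correctness is also at stake here, so decision authority must be written down clearly:</p>
<ul>
<li><strong>cutover window</strong>: pick a low-traffic period and define an explicit traffic ladder (for example 1% → 10% → 50% → 100%); switching by feature module is safer than switching the whole site</li>
<li><strong>rollback condition</strong>: new API error rate / latency exceeds thresholds, shadow read difference rate is abnormal, or an authorization-translation gap is found → switch back to Firestore</li>
<li><strong>decision owner</strong>: who has the authority to call a stop, based on which evidence, recorded in the <a href="/blog/backend/08-incident-response/incident-decision-log/" data-link-title="8.19 Incident Decision Log" data-link-desc="把事中假設、決策、證據、回退條件與責任人留下可復盤紀錄">8.19 incident decision log</a></li>
<li><strong>realtime continuity</strong>: if the realtime layer is switched at the same time, verify that subscriptions are not interrupted during the switch, or announce the brief degradation explicitly</li>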
</ul>
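<p>The traffic ladder and rollback condition above could be wired up along these lines. The sketch assumes users are assigned to stable buckets by hashing an ID, so raising the percentage only adds users and never flips anyone back; the threshold values in <code>should_roll_back</code> are placeholders, not recommendations, and all function names are made up for this example.</p>

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in 0..99, so the same user stays on
    the same side while the percentage only moves upward."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_api(user_id: str, percent: int) -> bool:
    """True when this user falls inside the current cutover percentage."""
    return bucket(user_id) < percent


def should_roll_back(error_rate: float, p99_ms: float, diff_rate: float,
                     authz_gap_found: bool) -> bool:
    """Any single tripped condition is enough to switch back.

    The numeric thresholds are placeholders; each team sets its own.
    """
    return (authz_gap_found
            or error_rate > 0.01
            or p99_ms > 500
            or diff_rate > 0.001)
```

<p>An authorization gap triggers rollback regardless of how healthy the latency and error numbers look, which mirrors treating it as a security boundary instead of a performance metric.</p>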
<p>See <a href="/blog/backend/knowledge-cards/rollback-window/" data-link-title="Rollback Window" data-link-desc="說明變更進入 production 後還能用哪種方式回退或改路線的時間與條件">rollback window</a> and <a href="/blog/backend/knowledge-cards/rollback-condition/" data-link-title="Rollback Condition" data-link-desc="說明決策執行後出現哪些訊號時要撤回、回退或改路線">rollback condition</a>.</p>
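<h2 id="cleanup-與長期混合">Cleanup and the long-term hybrid</h2>
<p>Type E cleanup is usually not "shut down all of Firebase"; in most cases authentication and storage stay on the platform:</p>
<ul>
<li>Retire the Firestore collections, Security Rules, and dual-write code paths for the migrated data paths</li>
<li>Remove the shadow read comparison code</li>
<li>Clean out leftover Firestore SDK dependencies in the frontend (the data layer no longer goes through it)</li>
<li>If Firebase Auth / Storage are still in use, keep them; mark explicitly which data paths have the self-hosted database as source of truth and which are still on the platform</li>
<li>Keep the Firestore export backup until the new database is confirmed stable; see the parallel-run retirement criteria in <a href="/blog/backend/10-system-evolution/managed-platform-exit/" data-link-title="10.3 託管形態遷出：資產線盤點與並行期執行" data-link-desc="0.21 升級自建 tripwire 觸發後的執行劇本 — 把遷出拆成資料、身分、流量、整合各自的可攜性與斷點、設計舊平台與新系統的並行期與回切窗口、用部分遷出作為中繼形態">10.3</a></li>
</ul>
<p>A hybrid architecture is not a failed migration; it is the steady state of capability-by-capability selection: the data layer that hit the wall is self-hosted, and the authentication / storage that did not stays on the platform.</p>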
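<h2 id="失敗模式">Failure modes</h2>
<p>Five common production pitfalls:</p>
<h4 id="case-1只匯資料漏了存取模型反轉">Case 1: importing the data but missing the access-model inversion</h4>
<p>The team exports Firestore into PostgreSQL and considers the migration done, forgetting that the frontend still calls the SDK and authorization still lives in Security Rules. Fix: the dependency inventory is Phase 1; data transfer is only one workstream, and the access-model inversion is the main body of work.</p>
<h4 id="case-2security-rules-翻譯漏洞">Case 2: gaps in the Security Rules translation</h4>
<p>One rule is missed when translating rules into backend authorization, which opens a hole for unauthorized queries and leaks data after launch. Fix: translate rule by rule with a side-by-side check, add red-team verification (<a href="/blog/backend/01-database/red-team-data-layer/" data-link-title="1.5 攻擊者視角（紅隊）：資料層弱點判讀" data-link-desc="從資料存取邊界、外洩路徑與修復代價、盤點 database 的主要弱點">1.5</a>), and treat it as a cutover gate condition rather than a functional bug.</p>
<h4 id="case-3反正規化還原錯誤">Case 3: denormalization restored incorrectly</h4>
<p>When redundant copies in documents are split back into tables, the relationships are restored incorrectly and the new database links records wrongly. Fix: in Phase 2, first understand why the data was denormalized; after backfill, spot-check the restored relationships; use shadow read comparison to catch modeling differences.</p>
<h4 id="case-4低估-realtime--offline-重建工作量">Case 4: underestimating the realtime / offline rebuild</h4>
<p>The team assumes migrating the database is enough, then discovers at launch that the snapshot listener and offline sync layer must be rebuilt from scratch, and the schedule blows up. Fix: mark realtime subscription points and offline behavior during the dependency inventory, count them in the workload, and if necessary migrate this layer last or keep it for now.</p>
<h4 id="case-5dual-write-一邊失敗沒補償">Case 5: no compensation when one side of dual-write fails</h4>
<p>During dual-write the new database write succeeds and the Firestore write fails (or the reverse), the two sides diverge, and data is incomplete after cutover. Fix: dual-write needs failure compensation (log, retry, flag for manual reconciliation); see <a href="/blog/backend/01-database/reconciliation-data-repair/" data-link-title="1.9 Reconciliation 與 Data Repair" data-link-desc="資料不一致的分類、偵測模式、修復策略、audit trail、跟 backup / PITR 整合">1.9 Reconciliation</a>.</p>
<p><strong>Anti-recommendation</strong>: if the product still depends heavily on realtime / offline, or the team does not yet have the operational capability for a self-built backend and database (backup, failover, authorization design), do not migrate yet. Instead, pilot with one slice of data where the wall is most obvious and the realtime need is lowest (for example, report source data), and build up self-hosting operational experience before expanding.</p>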
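<h2 id="容量與成本crossover-判讀">Capacity and cost: reading the crossover</h2>
<p>The key to the cost assessment is <em>the total bill after migration</em>, not the Firestore invoice alone:</p>
<ul>
<li><strong>At migration time</strong>: under high read traffic, the unit cost of a self-managed database + application-level cache is often lower than Firestore's per-read billing</li>
<li><strong>But add back the hidden costs of self-hosting</strong>: developing and operating the backend service; database backup / failover / scaling; rebuilding and maintaining the realtime layer; team headcount</li>
<li><strong>Tiered reading</strong>: hit the cost wall and already have a backend team → the self-hosted total usually pays off; still a small team with realtime at the core and modest traffic → Firestore's "capabilities the platform gives away for free" may still be cheaper than the self-hosted total</li>
</ul>
<blockquote>
<p><strong>Scope warning</strong>: the crossover moves with traffic shape, region pricing, and team cost structure; there is no universal threshold. Compare the Firestore bill you save only after subtracting the operating cost of the self-built backend + database + realtime layer, rather than putting the two database invoices side by side.</p></blockquote>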
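<p>The "total bill" comparison can be written down as simple arithmetic. Every number below is a made-up placeholder (including the per-read price, which is not real Firestore pricing), and the model deliberately ignores writes, storage, and the way self-hosted cost also grows with traffic; it only shows which terms belong on each side of the comparison.</p>

```python
def firestore_monthly(reads: float, price_per_100k: float) -> float:
    """Read-billing portion of a per-read priced platform."""
    return reads / 100_000 * price_per_100k


def self_hosted_monthly(infra: float, ops_labor: float, realtime_layer: float) -> float:
    """Self-hosted total: infrastructure plus the hidden operating costs."""
    return infra + ops_labor + realtime_layer


def crossover_reads(price_per_100k: float, self_hosted_total: float) -> float:
    """Monthly read volume at which the two totals are equal,
    treating the self-hosted total as flat (a simplification)."""
    return self_hosted_total / price_per_100k * 100_000


PRICE = 0.5  # hypothetical price per 100k reads, for illustration only
total = self_hosted_monthly(infra=1500.0, ops_labor=2000.0, realtime_layer=500.0)
breakeven = crossover_reads(PRICE, total)
```

<p>With these placeholder inputs the break-even sits at 800 million reads per month; dropping the <code>ops_labor</code> and <code>realtime_layer</code> terms would move it far lower, which is exactly the mistake of comparing database invoices alone.</p>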
<p>This ties back to <a href="/blog/backend/00-service-selection/cost-risk-tradeoffs/" data-link-title="0.6 成本、風險與選型取捨" data-link-desc="用人力成本、雲端成本、操作成本與失敗代價判斷後端能力投入順序">0.6 Cost, risk, and selection trade-offs</a> and <a href="/blog/backend/01-database/kv-document-capacity-planning/" data-link-title="1.10 KV / Document DB 容量規劃" data-link-desc="DynamoDB / Cosmos DB / Bigtable / MongoDB 等 KV / Document DB 的容量設計、partition key 取捨、capacity mode 選擇">1.10 KV / Document DB capacity planning</a>.</p>
<h2 id="邊界與整合">Boundaries and integration</h2>
<h3 id="跟其他遷移路徑的關係">Relationship to other migration paths</h3>
<ul>
<li><strong>Keep the document model</strong>: if the goal is only to escape Firestore's query limitations and the document shape still fits, migrating to <a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a> is a smaller paradigm jump than migrating to relational, with no need to undo the denormalization</li>
<li><strong>Whole-platform exit</strong>: if authentication and storage also leave Firebase, the full asset inventory and parallel-run period follow <a href="/blog/backend/10-system-evolution/managed-platform-exit/" data-link-title="10.3 託管形態遷出：資產線盤點與並行期執行" data-link-desc="0.21 升級自建 tripwire 觸發後的執行劇本 — 把遷出拆成資料、身分、流量、整合各自的可攜性與斷點、設計舊平台與新系統的並行期與回切窗口、用部分遷出作為中繼形態">10.3 Managed-platform exit</a>; this article is the data-layer strand of that plan</li>
<li><strong>Reverse view</strong>: for data that should never have gone into Firestore in the first place (report sources, strongly consistent transactions), see <a href="/blog/backend/01-database/vendors/firestore/#%e4%b8%8d%e9%81%a9%e7%94%a8%e5%a0%b4%e6%99%af" data-link-title="Firestore" data-link-desc="Firebase / Google Cloud 的 serverless document database、collection / document 模型、client 直連 &#43; Security Rules、realtime listener 與 offline 同步、BaaS bundle 的資料層面">the unsuitable scenarios in the Firestore overview</a></li>
</ul>
<h3 id="sibling-與-cross-link">Sibling 與 cross-link</h3>
<ul>
<li><a href="/blog/backend/01-database/vendors/firestore/" data-link-title="Firestore" data-link-desc="Firebase / Google Cloud 的 serverless document database、collection / document 模型、client 直連 &#43; Security Rules、realtime listener 與 offline 同步、BaaS bundle 的資料層面">Firestore overview</a> — 服務定位與查詢邊界</li>
<li><a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">1.6 資料庫轉換實作</a> — 通用 dual-write / shadow read / cutover 框架</li>
<li><a href="/blog/backend/01-database/red-team-data-layer/" data-link-title="1.5 攻擊者視角（紅隊）：資料層弱點判讀" data-link-desc="從資料存取邊界、外洩路徑與修復代價、盤點 database 的主要弱點">1.5 資料層紅隊</a> — Security Rules 授權翻譯的安全驗證</li>
<li><a href="/blog/backend/01-database/reconciliation-data-repair/" data-link-title="1.9 Reconciliation 與 Data Repair" data-link-desc="資料不一致的分類、偵測模式、修復策略、audit trail、跟 backup / PITR 整合">1.9 Reconciliation 與 Data Repair</a> — dual-write 失敗補償與資料對帳</li>
<li><a href="/blog/backend/01-database/vendors/dynamodb/migrate-rds-mongodb-to-dynamodb/" data-link-title="從 RDS / MongoDB 遷移到 DynamoDB：access-pattern-first 重建模、混合架構與 cost crossover" data-link-desc="RDS / MongoDB → DynamoDB 不是搬 schema 而是換 paradigm；本文走 Type E paradigm shift 結構，展開為何字面遷移不成立、access pattern 重建模、哪些 workload 該遷哪些該留的混合架構、dual-write &#43; shadow read 階段化，以及 Zomato cost crossover 的長期成本判讀">從 RDS / MongoDB 遷往 DynamoDB</a> — 同為 Type E paradigm shift 的對照（方向相反：遷入 NoSQL vs 遷出 BaaS）</li>
<li><a href="/blog/backend/00-service-selection/delivery-mode-selection/" data-link-title="0.21 交付形態選型：從全託管到自建的光譜與邊界" data-link-desc="在進入資料庫、快取與部署選型之前、先判斷服務該用託管平台（Wix / Shopify / Google Sites）、辦公生態自動化（Apps Script）、BaaS（Firebase）、半託管 CMS（WordPress）還是自建、並為日後遷往自建保留可遷出路徑">0.21 交付形態選型</a> / <a href="/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22 能力級買 vs 建</a> — 遷移 driver 的選型層背景</li>
</ul>
]]></content:encoded></item><item><title>Docker Swarm → Kubernetes：5 個 Swarm production cluster 撞牆數據</title><link>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/kubernetes/migrate-from-docker-swarm/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/kubernetes/migrate-from-docker-swarm/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link Docker Swarm 跟 &lt;a href="https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/kubernetes/" data-link-title="Kubernetes" data-link-desc="Container orchestration 主流、GKE / EKS / AKS / 自管">Kubernetes&lt;/a>。跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6 維 audit&lt;/a> 後對映 &lt;em>Paradigm = High（Swarm 簡單 container orchestration → K8s declarative resource model）→ Type E paradigm shift&lt;/em>。&lt;/p>&lt;/blockquote>
&lt;h2 id="5-個-swarm-production-cluster-撞牆數據">5 個 Swarm production cluster 撞牆數據&lt;/h2>
&lt;p>從 2020-2024 觀察 5 個中型 organization 的 Swarm production cluster lifecycle、典型撞牆點：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Cluster&lt;/th>
 &lt;th>規模 (peak)&lt;/th>
 &lt;th>撞牆點&lt;/th>
 &lt;th>觸發遷移時間&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>A (SaaS startup)&lt;/td>
 &lt;td>80 service / 12 node&lt;/td>
 &lt;td>service discovery latency 升、無 sidecar mesh&lt;/td>
 &lt;td>2022&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>B (E-commerce)&lt;/td>
 &lt;td>150 service / 25 node&lt;/td>
 &lt;td>rolling update + canary 邏輯自寫複雜&lt;/td>
 &lt;td>2023&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>C (Fintech)&lt;/td>
 &lt;td>60 service / 15 node&lt;/td>
 &lt;td>secret rotation + RBAC 自管、合規難&lt;/td>
 &lt;td>2023&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>D (Media)&lt;/td>
 &lt;td>200 service / 40 node&lt;/td>
 &lt;td>autoscaling 自寫、預測流量失敗&lt;/td>
 &lt;td>2024&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>E (Logistics)&lt;/td>
 &lt;td>100 service / 20 node&lt;/td>
 &lt;td>multi-region 不支援&lt;/td>
 &lt;td>2024&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>5 個共同 pattern：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Swarm 簡單但 ceiling 100-200 service / 20-40 node&lt;/strong>&lt;/li>
&lt;li>&lt;strong>跨 service 治理（mesh / RBAC / secret / autoscale）需要 &lt;em>外掛&lt;/em> 工具、複雜度反超 K8s&lt;/strong>&lt;/li>
&lt;li>&lt;strong>無 multi-region native&lt;/strong>、災備受限&lt;/li>
&lt;li>&lt;strong>生態縮、社群活躍度低、新 feature 緩&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>撞牆點不是「Swarm 跑不動」、是「Swarm 不會幫你解 &lt;em>跨 service 治理&lt;/em> 問題、要自寫」。Kubernetes 不是 simpler、是 &lt;em>把治理問題納入框架&lt;/em>。&lt;/p>
&lt;h2 id="為什麼遷ceiling--ecosystem--multi-region-三條-driver">為什麼遷：ceiling / ecosystem / multi-region 三條 driver&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>觸發&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Ceiling&lt;/td>
 &lt;td>Swarm 跑 100-200 service 後 service discovery latency / scheduling 跟不上&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Ecosystem&lt;/td>
 &lt;td>K8s ecosystem (Helm / Operator / mesh / GitOps) 成熟、Swarm 對等工具缺&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-region&lt;/td>
 &lt;td>Swarm 不支援、K8s 多 cluster federation 成熟&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>反向 driver（K8s → Swarm）：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links Docker Swarm and <a href="/blog/backend/05-deployment-platform/vendors/kubernetes/" data-link-title="Kubernetes" data-link-desc="Container orchestration 主流、GKE / EKS / AKS / 自管">Kubernetes</a>. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Paradigm = High (Swarm's simple container orchestration → K8s declarative resource model) → Type E paradigm shift</em>.</p></blockquote>
<h2 id="5-個-swarm-production-cluster-撞牆數據">Where 5 Swarm production clusters hit the wall</h2>
<p>Observations from 2020-2024 of the Swarm production cluster lifecycle at 5 mid-sized organizations, with the typical point where each hit the wall:</p>
<table>
  <thead>
      <tr>
          <th>Cluster</th>
          <th>Scale (peak)</th>
          <th>Wall hit</th>
          <th>Year migration was triggered</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>A (SaaS startup)</td>
          <td>80 service / 12 node</td>
          <td>service discovery latency rising; no sidecar mesh</td>
          <td>2022</td>
      </tr>
      <tr>
          <td>B (E-commerce)</td>
          <td>150 service / 25 node</td>
          <td>hand-written rolling update + canary logic became complex</td>
          <td>2023</td>
      </tr>
      <tr>
          <td>C (Fintech)</td>
          <td>60 service / 15 node</td>
          <td>self-managed secret rotation + RBAC; compliance was hard</td>
          <td>2023</td>
      </tr>
      <tr>
          <td>D (Media)</td>
          <td>200 service / 40 node</td>
          <td>hand-written autoscaling; traffic prediction failed</td>
          <td>2024</td>
      </tr>
      <tr>
          <td>E (Logistics)</td>
          <td>100 service / 20 node</td>
          <td>multi-region not supported</td>
          <td>2024</td>
      </tr>
  </tbody>
</table>
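<p>Patterns common to all 5:</p>
<ul>
<li><strong>Swarm is simple, but the ceiling is around 100-200 services / 20-40 nodes</strong></li>
<li><strong>Cross-service governance (mesh / RBAC / secret / autoscale) needs <em>add-on</em> tools, and the resulting complexity overtakes K8s</strong></li>
<li><strong>No native multi-region</strong>, which limits disaster recovery</li>
<li><strong>A shrinking ecosystem, low community activity, and slow delivery of new features</strong></li>
</ul>
<p>The wall is not "Swarm can't run the workload"; it is "Swarm won't solve <em>cross-service governance</em> for you, so you write it yourself". Kubernetes is not simpler; it <em>brings the governance problems into the framework</em>.</p>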
<h2 id="為什麼遷ceiling--ecosystem--multi-region-三條-driver">Why migrate: three drivers (ceiling / ecosystem / multi-region)</h2>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ceiling</td>
          <td>Past 100-200 services, Swarm's service discovery latency / scheduling cannot keep up</td>
      </tr>
      <tr>
          <td>Ecosystem</td>
          <td>The K8s ecosystem (Helm / Operator / mesh / GitOps) is mature; Swarm lacks equivalent tools</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>Swarm does not support it; K8s multi-cluster federation is mature</td>
      </tr>
  </tbody>
</table>
<p>Reverse drivers (K8s → Swarm):</p>
<ul>
<li>Purely internal tools / small scale (&lt; 30 services), where K8s is overly complex</li>
<li>Edge / IoT scenarios, where Swarm's small footprint helps</li>
</ul>
<h2 id="6-維-audit">6-dimension audit</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td><strong>High</strong> (docker-compose stack.yml → K8s YAML; the syntax is completely different)</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Medium (self-managed Swarm → K8s self-hosted or managed)</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td><strong>High</strong> (simple container orchestration → declarative resource model)</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low (still a single orchestration system)</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Low (container images unchanged)</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>Schema and Paradigm are both High → primarily a <strong>Type E paradigm shift</strong>, with the high Schema dimension covered in its own section.</p>
<h2 id="paradigm-對位">Paradigm mapping</h2>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>Swarm</th>
          <th>K8s</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Workload unit</td>
          <td>Service</td>
          <td>Deployment + Pod + Service</td>
      </tr>
      <tr>
          <td>Stack definition</td>
          <td>stack.yml (docker-compose format)</td>
          <td>YAML manifest (multiple resources)</td>
      </tr>
      <tr>
          <td>Networking</td>
          <td>Overlay network (built-in)</td>
          <td>CNI plugin (Calico / Cilium / etc)</td>
      </tr>
      <tr>
          <td>Service discovery</td>
          <td>DNS-based built-in</td>
          <td>DNS-based (CoreDNS) + Service object</td>
      </tr>
      <tr>
          <td>Load balancing</td>
          <td>Built-in routing mesh</td>
          <td>Service + Ingress + LoadBalancer</td>
      </tr>
      <tr>
          <td>Secret management</td>
          <td>Docker secrets</td>
          <td>K8s Secret + external Vault / Secrets Manager</td>
      </tr>
      <tr>
          <td>Rolling update</td>
          <td><code>docker service update --image ...</code></td>
          <td>Deployment + rolling update + readiness probe</td>
      </tr>
      <tr>
          <td>Autoscaling</td>
          <td>Manual scaling</td>
          <td>HPA (Horizontal Pod Autoscaler)</td>
      </tr>
      <tr>
          <td>RBAC</td>
          <td>Limited (Swarm enterprise)</td>
          <td>First-class (Role / RoleBinding / ServiceAccount)</td>
      </tr>
      <tr>
          <td>Persistent storage</td>
          <td>Volume + driver plugin</td>
          <td>PV / PVC + CSI driver</td>
      </tr>
      <tr>
          <td>Service mesh</td>
          <td>None (needs an add-on proxy such as Traefik)</td>
          <td>Istio / Linkerd / Cilium</td>
      </tr>
      <tr>
          <td>GitOps</td>
          <td>No native support</td>
          <td>Argo CD / Flux (first-class)</td>
      </tr>
  </tbody>
</table>
<h2 id="schema-gapdocker-compose-vs-k8s-yaml">Schema gap: docker-compose vs K8s YAML</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># Docker Swarm stack.yml</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;3.8&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="nt">webapp</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">myapp:1.0</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="nt">deploy</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">      </span><span class="nt">replicas</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">      </span><span class="nt">update_config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">        </span><span class="nt">parallelism</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">      </span><span class="nt">restart_policy</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="kc">on</span>-<span class="l">failure</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="nt">networks</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">      </span>- <span class="l">frontend</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;8080:8080&#34;</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># K8s equivalent (Deployment + Service; Ingress omitted here)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apps/v1</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Deployment</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">webapp</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">  </span><span class="nt">replicas</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">  </span><span class="nt">strategy</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">RollingUpdate</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="nt">rollingUpdate</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">      </span><span class="nt">maxSurge</span><span class="p">:</span><span class="w"> </span><span class="m">1</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">      </span><span class="nt">maxUnavailable</span><span class="p">:</span><span class="w"> </span><span class="m">0</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="nt">selector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="nt">matchLabels</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">webapp }</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">  </span><span class="nt">template</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">      </span><span class="nt">labels</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">webapp }</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">    </span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">      </span><span class="nt">containers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">        </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">webapp</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">          </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">myapp:1.0</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">          </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">            </span>- <span class="nt">containerPort</span><span class="p">:</span><span class="w"> </span><span class="m">8080</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">          </span><span class="nt">readinessProbe</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w">            </span><span class="nt">httpGet</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w">              </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l">/healthz</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w">              </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="m">8080</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">          </span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="w">            </span><span class="nt">requests</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="w">              </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">100m</span><span class="w">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="w">              </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">128Mi</span><span class="w">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="w">            </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w">              </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">500m</span><span class="w">
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="w">              </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">512Mi</span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w"></span><span class="nn">---</span><span class="w">
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="w"></span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">v1</span><span class="w">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Service</span><span class="w">
</span></span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">webapp</span><span class="w">
</span></span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="w">  </span><span class="nt">selector</span><span class="p">:</span><span class="w"> </span>{<span class="w"> </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">webapp }</span><span class="w">
</span></span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="w">  </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">43</span><span class="cl"><span class="w">    </span>- <span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="m">8080</span><span class="w">
</span></span></span><span class="line"><span class="ln">44</span><span class="cl"><span class="w">      </span><span class="nt">targetPort</span><span class="p">:</span><span class="w"> </span><span class="m">8080</span></span></span></code></pre></div><p>One Swarm service becomes 2-3 K8s resources (Deployment + Service, plus possibly Ingress / HPA); the application is unchanged, but <em>the deployment-side work is roughly 5-10x</em>.</p>
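<p>The Swarm stack publishes port 8080 through the routing mesh, while the Service above is only reachable inside the cluster. A minimal sketch of the Ingress that would typically complete the mapping follows; the host <code>webapp.example.com</code> and the <code>nginx</code> ingress class are assumptions for illustration, not values from the original stack, and an ingress controller must already be installed.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Sketch only: host and ingressClassName are hypothetical placeholders
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp
spec:
  ingressClassName: nginx          # assumes an NGINX ingress controller is installed
  rules:
    - host: webapp.example.com     # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp       # the Service defined above
                port:
                  number: 8080</code></pre></div>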
<h2 id="migration-流程">Migration process</h2>
<h3 id="partial-migration--混合架構">Partial migration + hybrid architecture</h3>
<p>This follows the same Type E pattern as <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> / <a href="/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/" data-link-title="etcd → Consul：KV &#43; N 個 extras feature matrix" data-link-desc="etcd → Consul 是 Type E paradigm shift expansion — 從 pure KV store 升到 service mesh / discovery / health check / multi-DC；本文用對照表 &#43; paradigm expansion 路線、5 個 production 踩雷（API 對位 / lock semantics / watch event model / multi-DC topology / ACL system）">etcd → Consul</a>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">1. Audit applications: list every Swarm stack + service
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">2. Classify and plan:
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">   - Simple stateless: move to K8s first (low risk)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">   - Stateful (DB / queue): evaluate a K8s operator or keep on Swarm
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">   - Critical service: run in parallel to confirm K8s behaves equivalently
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">3. Build the K8s cluster:
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   - Managed (EKS / GKE / AKS) vs self-host (kubeadm)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">   - Set up ingress controller / cert-manager / monitoring
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">4. Migrate applications (per stack)
</span></span><span class="line"><span class="ln">10</span><span class="cl">   - Write K8s YAML / Helm chart
</span></span><span class="line"><span class="ln">11</span><span class="cl">   - Configure readiness/liveness probes / resource requests
</span></span><span class="line"><span class="ln">12</span><span class="cl">   - Map networking + secrets
</span></span><span class="line"><span class="ln">13</span><span class="cl">5. Cutover + Swarm decommission
</span></span><span class="line"><span class="ln">14</span><span class="cl">   - Once some stacks are cut over, decide whether to keep Swarm (legacy / edge)
</span></span><span class="line"><span class="ln">15</span><span class="cl">   - Most organizations decommission Swarm entirely</span></span></code></pre></div><p>Overall 3-6 months, depending on the number of stacks and application complexity.</p>
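<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1networking-model-差cross-service-connectivity-失效">Case 1: Networking model mismatch breaks cross-service connectivity</h3>
<p><strong>Symptom</strong>: after cutover, service A can no longer reach service B; the Swarm-side DNS name <code>tasks.service_b</code> does not resolve once the K8s-side name is <code>service-b.namespace.svc.cluster.local</code>.</p>
<p><strong>Root cause</strong>: inside a Swarm overlay network, services reach each other by short name (<code>service_b</code>); K8s resolves a short name only within the same namespace and otherwise needs the namespace-qualified name or FQDN, and Service names cannot contain underscores. The service URLs are hard-coded in the application.</p>
<p><strong>Fix</strong>:</p>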
<ol>
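<li>Application side: use short names and rely on the cluster DNS search domains</li>
<li>K8s side: keep the default <code>dnsPolicy: ClusterFirst</code> and check with <code>kubectl get svc -A</code> that every Swarm service has a matching Service</li>
<li>NetworkPolicy: default to deny-all and add an explicit allow rule for each service-to-service path</li>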
</ol>
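<p>The deny-all-plus-explicit-allow step can be sketched as two NetworkPolicy objects. This is an illustrative fragment under stated assumptions: the namespace <code>prod</code>, the labels <code>app: service-a</code> / <code>app: service-b</code>, and port 8080 are hypothetical, and the cluster's CNI plugin must enforce NetworkPolicy (Calico and Cilium do).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Sketch only: namespace, labels, and port are hypothetical placeholders
# 1) Deny all ingress to every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}                  # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
---
# 2) Explicitly allow service-a to call service-b on TCP 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-service-a-to-service-b
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: service-b
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: service-a
      ports:
        - protocol: TCP
          port: 8080</code></pre></div>
<p>Because the first policy only restricts ingress, DNS lookups and other egress traffic keep working; restricting egress as well would need an extra rule that allows traffic to the cluster DNS service.</p>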
<h3 id="case-2secret-rotation-從-swarm-secrets-換-vault--secrets-manager">Case 2: Secret rotation moves from Swarm secrets to Vault / Secrets Manager</h3>
<p><strong>Symptom</strong>: secrets were rotated with <code>docker secret</code> on Swarm; after the move, a K8s Secret is a <em>static value</em> and nothing rotates it automatically.</p>
<p><strong>Root cause</strong>: a K8s Secret is K8s-native but <em>not auto-rotated</em>; rotation needs an external Vault / Secrets Manager plus an agent (vault-agent-injector / external-secrets-operator).</p>
<p><strong>Fix</strong>:</p>
<ol>
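<li>K8s side: deploy external-secrets-operator with the AWS Secrets Manager / Vault integration</li>
<li>Application side: read secrets from a mounted file or environment variable, never hard-coded</li>
<li>Rotation happens on the vendor side; a sidecar on the K8s side reloads the new value automatically</li>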
</ol>
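<p>For the "mounted file" option, a minimal sketch of a Deployment that mounts a Secret as read-only files is shown below. The Secret name <code>webapp-db</code> and the mount path <code>/etc/secrets/db</code> are hypothetical. The design point is that the kubelet refreshes mounted Secret files after the Secret object changes, whereas values injected as environment variables are only read when the container starts.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Sketch only: Secret name and mount path are hypothetical placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels: { app: webapp }
  template:
    metadata:
      labels: { app: webapp }
    spec:
      containers:
        - name: webapp
          image: myapp:1.0
          volumeMounts:
            - name: db-credentials
              mountPath: /etc/secrets/db   # each Secret key appears as a file here
              readOnly: true
      volumes:
        - name: db-credentials
          secret:
            secretName: webapp-db          # kept in sync by the external secrets tooling</code></pre></div>
<p>The application still has to re-read the file (or be restarted) to pick up a rotated value.</p>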
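<h3 id="case-3readiness-probe-沒設rolling-update-期間-traffic-loss">Case 3: Missing readiness probe causes traffic loss during rolling updates</h3>
<p><strong>Symptom</strong>: after cutover, 5-10% of requests fail during deploys; pods turn out to receive traffic before startup has finished.</p>
<p><strong>Root cause</strong>: Swarm's simple <code>restart_policy</code> has no equivalent of a probe, and K8s ignores any Docker <code>HEALTHCHECK</code>; without a readiness probe, K8s marks a pod ready as soon as its containers start, so an application with a long startup receives traffic before it is ready.</p>
<p><strong>Fix</strong>:</p>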
<ol>
<li><strong>Always add a readiness probe</strong>: HTTP / TCP / exec check</li>
<li><strong>Set an initial delay</strong>: allow 30-60s for JVM applications</li>
<li><strong>Set <code>minReadySeconds</code></strong>: 30s on the Deployment, so a pod must stay ready that long before it counts as available</li>
</ol>
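<p>The three settings above can be combined in one Deployment. The following is a sketch, assuming the <code>/healthz</code> endpoint from the earlier example; the delay, period, and threshold values are illustrative and should be tuned to the application's real startup time.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Sketch only: timing values are illustrative, not measured
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  minReadySeconds: 30              # pod must stay ready for 30s before it counts as available
  selector:
    matchLabels: { app: webapp }
  template:
    metadata:
      labels: { app: webapp }
    spec:
      containers:
        - name: webapp
          image: myapp:1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30  # give a slow-starting (e.g. JVM) app time to boot
            periodSeconds: 5
            failureThreshold: 3      # removed from Service endpoints after 3 failed checks</code></pre></div>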
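<h3 id="case-4hpa-預設不啟autoscaling-失效">Case 4: HPA is not on by default, so autoscaling stops working</h3>
<p><strong>Symptom</strong>: a cron-based autoscale script was written for Swarm; it stops working after the move to K8s, and nothing scales up at peak traffic.</p>
<p><strong>Root cause</strong>: K8s HPA is not enabled by default; it needs <em>explicit configuration</em> plus a metrics-server install.</p>
<p><strong>Fix</strong>:</p>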





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">autoscaling/v2</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">HorizontalPodAutoscaler</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">webapp-hpa</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="nt">scaleTargetRef</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apps/v1</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Deployment</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">webapp</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="nt">minReplicas</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">  </span><span class="nt">maxReplicas</span><span class="p">:</span><span class="w"> </span><span class="m">20</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">  </span><span class="nt">metrics</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span>- <span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">Resource</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">      </span><span class="nt">resource</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">        </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">cpu</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">        </span><span class="nt">target</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">          </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l">Utilization</span><span class="w">
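</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">          </span><span class="nt">averageUtilization</span><span class="p">:</span><span class="w"> </span><span class="m">70</span></span></span></code></pre></div><p>Install metrics-server (or KEDA for event-driven autoscaling) and configure an HPA per Deployment. A CPU <code>Utilization</code> target is computed against the containers' CPU requests, so the target Deployment must set <code>resources.requests.cpu</code>.</p>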
<h3 id="case-5yaml-維護地獄helm--kustomize-配置遲">Case 5: YAML maintenance hell when Helm / Kustomize arrive late</h3>
<p><strong>Symptom</strong>: after cutover, 5 files (Swarm stacks) have become 50+ K8s manifests; changing a single application config means touching N files.</p>
<p><strong>Root cause</strong>: K8s YAML is <em>very verbose</em>, unlike the compact docker-compose format; there is no templating or environment abstraction.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Helm chart</strong>: package each application as a chart and abstract environment differences in <code>values.yaml</code></li>
<li><strong>Kustomize</strong>: base + overlay pattern, no templating involved</li>
<li><strong>GitOps with Argo CD / Flux</strong>: declarative deployment that cuts down manual kubectl operations</li>
</ol>
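<p>As one illustration of the base + overlay pattern, a production overlay might look like the sketch below. The directory layout (<code>base/</code>, <code>overlays/production/</code>), the replica count, and the image tag are assumptions made for the example; only the <code>webapp</code> Deployment and the <code>myapp</code> image name come from the manifests above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># overlays/production/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                     # shared Deployment + Service manifests
replicas:
  - name: webapp                   # override the replica count for this environment only
    count: 5
images:
  - name: myapp                    # retag the image without editing the base
    newTag: "1.1"</code></pre></div>
<p>Rendering with <code>kubectl apply -k overlays/production</code> keeps the environment difference in one small file instead of a copied manifest per environment.</p>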
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Docker Swarm</th>
          <th>Kubernetes (managed)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster cost (mid-tier)</td>
          <td>$300-800 / mo</td>
          <td>$500-1500 / mo (EKS/GKE/AKS control plane + nodes)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.3-0.8</td>
          <td>0.5-1.5 self-hosted (0.3-0.7 when managed)</td>
      </tr>
      <tr>
          <td>Ecosystem maturity</td>
          <td>Low, declining</td>
          <td>High, active growth</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>Not supported</td>
          <td>Multi-cluster federation is mature</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>2-4 FTE × 3-6 months</td>
      </tr>
      <tr>
          <td>Long-term ROI</td>
          <td>Negative (shrinking community)</td>
          <td>Positive (feature growth)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading</strong>: small organizations with &lt; 30 services can stay on Swarm; at 50+ services you start hitting the Swarm ceiling and an evaluation is worthwhile; at 100+ services or multi-region, migrate.</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-service-mesh-整合">Integrating with a service mesh</h3>
<p>After cutover, <em>also</em> evaluate a service mesh (Istio / Linkerd / Cilium) to cover mTLS / observability / traffic policy; do not roll a mesh out immediately after the Swarm migration, phase it in.</p>
<h3 id="跟-gitops-整合">Integrating with GitOps</h3>
<p>K8s and Argo CD / Flux are a <em>natural pair</em>; go straight to GitOps during the migration so manual kubectl operations do not pile up.</p>
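<p>A minimal sketch of what "going straight to GitOps" could look like with Argo CD follows. The repository URL, path, and target namespace are hypothetical placeholders; the fragment assumes Argo CD is installed in the <code>argocd</code> namespace and deploys into the same cluster it runs in.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"># Sketch only: repoURL, path, and namespaces are hypothetical placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: webapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: webapp
  syncPolicy:
    automated:
      prune: true                  # delete resources that were removed from Git
      selfHeal: true               # revert manual kubectl changes back to the Git state
    syncOptions:
      - CreateNamespace=true</code></pre></div>
<p>With <code>selfHeal</code> enabled, ad hoc kubectl edits are reverted on the next sync, which is what keeps manual operations from accumulating.</p>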
<h3 id="跟-vault--aws-secrets-manager-對齊">Aligning with <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/migrate-to-aws-secrets-manager/" data-link-title="Vault → AWS Secrets Manager：「secret」不是「secret」、identity model 才是核心差異" data-link-desc="Vault → AWS Secrets Manager migration 表面是 secret store 替換、實際核心是 identity model 對位（Vault token &#43; policy vs AWS IAM &#43; resource policy）；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 identity axis 候選 — identity 是否獨立 audit 軸；5 個 production 踩雷（IAM principal 對位 / dynamic credential 對等失敗 / lease lifecycle 模型不同 / audit log 結構差 / 計費模型反轉）">Vault → AWS Secrets Manager</a></h3>
<p>Swarm secrets → K8s Secret → external secrets management is a <em>three-stage progression</em>, not a single jump; use K8s Secrets during the migration and move to Vault / Secrets Manager afterwards.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Target vendor：<a href="/blog/backend/05-deployment-platform/vendors/kubernetes/" data-link-title="Kubernetes" data-link-desc="Container orchestration 主流、GKE / EKS / AKS / 自管">Kubernetes</a></li>
<li>Parallel migration playbooks (Type E): <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a> / <a href="/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/" data-link-title="etcd → Consul：KV &#43; N 個 extras feature matrix" data-link-desc="etcd → Consul 是 Type E paradigm shift expansion — 從 pure KV store 升到 service mesh / discovery / health check / multi-DC；本文用對照表 &#43; paradigm expansion 路線、5 個 production 踩雷（API 對位 / lock semantics / watch event model / multi-DC topology / ACL system）">etcd → Consul</a> / <a href="/blog/backend/04-observability/vendors/honeycomb/migrate-from-sentry/" data-link-title="Sentry → Honeycomb：trace 不是 error、是不同 observability paradigm" data-link-desc="Sentry → Honeycomb 是 paradigm shift — Sentry 主軸是 error tracking &#43; transaction trace、Honeycomb 主軸是 high-cardinality wide-event observability；本文釐清 paradigm 邊界、5 個 production 踩雷（event schema 對位 / sampling 行為 / error grouping 失效 / cost 模型差 / alert paradigm shift）">Sentry → Honeycomb</a></li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>Sentry → Honeycomb: a trace is not an error, it is a different observability paradigm</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/honeycomb/migrate-from-sentry/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/honeycomb/migrate-from-sentry/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook that cross-links &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/sentry/" data-link-title="Sentry" data-link-desc="Error tracking 主流、APM / Profiling / Session Replay 擴展">Sentry&lt;/a> and &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/honeycomb/" data-link-title="Honeycomb" data-link-desc="High-cardinality observability 平台、events-based 模型">Honeycomb&lt;/a>. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> maps it to &lt;em>Paradigm = High (error tracking ↔ wide-event observability) → Type E paradigm shift&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="trace-不是-error是不同-paradigm">A trace is not an error, it is a different paradigm&lt;/h2>
&lt;p>Treating Sentry → Honeycomb as a "trace tool swap" is the most common misjudgment: a Sentry trace is &lt;em>context for an error&lt;/em>, whereas a Honeycomb trace is &lt;em>the first-class unit of observability&lt;/em>:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Concept&lt;/th>
 &lt;th>Sentry&lt;/th>
 &lt;th>Honeycomb&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Core paradigm&lt;/td>
 &lt;td>Error tracking + transaction trace&lt;/td>
 &lt;td>High-cardinality wide-event observability&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>First-class unit&lt;/td>
 &lt;td>Error event&lt;/td>
 &lt;td>Wide event (span with N fields)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Role of traces&lt;/td>
 &lt;td>Incidental context attached to an error&lt;/td>
 &lt;td>The backbone of observability; every event is a trace span&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sampling&lt;/td>
 &lt;td>All errors kept + sampled transactions&lt;/td>
 &lt;td>Adaptive sampling that retains &lt;em>anomalies&lt;/em>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query model&lt;/td>
 &lt;td>Filter + group by + aggregation&lt;/td>
 &lt;td>High-cardinality multi-dimensional queries (BubbleUp / heatmap)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>User base&lt;/td>
 &lt;td>Developer (debug error)&lt;/td>
 &lt;td>SRE + Platform (debug system behavior)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cost model&lt;/td>
 &lt;td>Per-error event + transaction&lt;/td>
 &lt;td>Per-event (wide event volume)&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>The core difference is not that "Honeycomb is a better Sentry"; it is that the two are different observability paradigms&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Sentry suits &lt;em>application-level error debugging&lt;/em>: you get the error stack trace plus minimal context and fix it quickly&lt;/li>
&lt;li>Honeycomb suits &lt;em>system-level behavior debugging&lt;/em>: you look at traffic distribution, multi-dimensional correlation, and outliers to find out &lt;em>why this user is slow on this endpoint during this time window&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Migration scope 包含 &lt;em>paradigm reset&lt;/em> — 不是 SDK 換、是 SRE / Dev team 對 observability 的心智模型重設&lt;/strong>。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is a cross-vendor migration playbook that cross-links <a href="/blog/backend/04-observability/vendors/sentry/" data-link-title="Sentry" data-link-desc="Error tracking 主流、APM / Profiling / Session Replay 擴展">Sentry</a> and <a href="/blog/backend/04-observability/vendors/honeycomb/" data-link-title="Honeycomb" data-link-desc="High-cardinality observability 平台、events-based 模型">Honeycomb</a>. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Paradigm = High (error tracking ↔ wide-event observability) → Type E paradigm shift</em>.</p></blockquote>
<h2 id="trace-不是-error是不同-paradigm">A trace is not an error: two different paradigms</h2>
<p>Treating Sentry → Honeycomb as a "trace tool swap" is the most common misjudgment. A Sentry trace is <em>context attached to an error</em>; a Honeycomb trace is the <em>first-class unit of observability</em>:</p>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>Sentry</th>
          <th>Honeycomb</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Core paradigm</td>
          <td>Error tracking + transaction trace</td>
          <td>High-cardinality wide-event observability</td>
      </tr>
      <tr>
          <td>First-class unit</td>
          <td>Error event</td>
          <td>Wide event (span with N fields)</td>
      </tr>
      <tr>
          <td>Role of traces</td>
          <td>Supplementary context for an error</td>
          <td>The main axis of observability; every event is a trace span</td>
      </tr>
      <tr>
          <td>Sampling</td>
          <td>All errors kept + sampled transactions</td>
          <td>Adaptive sampling that keeps <em>anomalies</em></td>
      </tr>
      <tr>
          <td>Query model</td>
          <td>Filter + group by + aggregation</td>
          <td>High-cardinality multi-dimensional queries (BubbleUp / heatmap)</td>
      </tr>
      <tr>
          <td>User base</td>
          <td>Developer (debug error)</td>
          <td>SRE + Platform (debug system behavior)</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>Per-error event + transaction</td>
          <td>Per-event (wide event volume)</td>
      </tr>
  </tbody>
</table>
<p><strong>The core difference is not that Honeycomb is a "better Sentry"; the two are different observability paradigms</strong>:</p>
<ul>
<li>Sentry fits <em>application-level error debugging</em>: get the error stack trace plus minimal context, then fix quickly</li>
<li>Honeycomb fits <em>system-level behavior debugging</em>: inspect traffic distribution, multi-dimensional correlation, and outliers to answer <em>why this user is slow on this endpoint during this time window</em></li>
</ul>
<p><strong>Migration scope includes a <em>paradigm reset</em>: it is not just an SDK swap; it resets how the SRE and dev teams think about observability</strong>.</p>
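<p>To make the two units concrete, here is a minimal sketch in plain Python. The class names <code>ErrorEvent</code> and <code>WideEvent</code> and the helper <code>to_wide_event</code> are hypothetical illustrations, not types from either vendor's SDK. The point is only that an error-centric record exists when something fails, while a wide event exists for every operation and carries the error as one field among many.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ErrorEvent:
    """Error-centric record: the exception is the unit, context is secondary."""
    exception_type: str
    message: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class WideEvent:
    """One record per operation, emitted whether or not the operation failed."""
    name: str
    duration_ms: float
    fields: Dict[str, Any] = field(default_factory=dict)


def to_wide_event(name: str, duration_ms: float, context: Dict[str, Any],
                  error: Optional[BaseException] = None) -> WideEvent:
    """Build a wide event; an error becomes extra fields instead of a separate record."""
    fields = dict(context)
    fields["error"] = error is not None
    if error is not None:
        fields["error.type"] = type(error).__name__
        fields["error.message"] = str(error)
    return WideEvent(name=name, duration_ms=duration_ms, fields=fields)


context = {"user.id": "u-1", "order.region": "us-west-2"}
ok_event = to_wide_event("process_order", 42.0, context)
failed_event = to_wide_event("process_order", 310.0, context, error=ValueError("bad amount"))
error_only = ErrorEvent("ValueError", "bad amount", tags={"region": "us-west-2"})
```

<p>In this sketch the successful call still produces a queryable record, which is what makes questions about slow-but-successful requests answerable at all.</p>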
<h2 id="為什麼遷observability-成熟度--cardinality--cost-三條-driver">Why migrate: three drivers (observability maturity, cardinality, cost)</h2>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Observability maturity</td>
          <td>The application has grown to <em>multiple services / multi-tenant</em>, Sentry error tracking is no longer fine-grained enough, and SREs need <em>high-cardinality</em> multi-dimensional queries</td>
      </tr>
      <tr>
          <td>High-cardinality</td>
          <td>Sentry's tag system limits cardinality (~1000 unique values), while Honeycomb natively supports fields with millions of unique values</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>Per-error pricing blows up in high-error-volume scenarios; Honeycomb's per-event pricing is more predictable for <em>wide-event</em> workloads</td>
      </tr>
  </tbody>
</table>
<p>Reverse drivers (Honeycomb → Sentry):</p>
<ul>
<li>Pure error-tracking scenarios, where Honeycomb's wide-event model is over-engineering</li>
<li>Frontend / mobile client error tracking, where Sentry's web / mobile / desktop SDKs are more mature</li>
</ul>
<h2 id="6-維-audit">6-dimension audit</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Medium (different event schema concepts; the SDK is replaced entirely)</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Low (both are SaaS; operationally equivalent)</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td><strong>High</strong> (error tracking ↔ wide-event observability)</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low (still a single observability vendor)</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td><strong>High</strong> (SDK swap + instrumentation redesign)</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
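<p>Paradigm = High (Schema / API, Operational, Components, and Data topology are Low to Medium) → Type E paradigm shift. Application change is also High, but it is a downstream consequence of the paradigm shift rather than an independent driver.</p>
<h2 id="結構partial-migration--混合架構是-long-term-default">Structure: partial migration, with a hybrid architecture as the long-term default</h2>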
<p>This follows the same Type E pattern as <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> and <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a>:</p>
<ul>
<li><strong>There is no complete migration</strong>: Sentry is strong at <em>frontend error tracking</em>, Honeycomb at <em>backend system observability</em></li>
<li><strong>Long-term hybrid architecture</strong>: frontend / mobile stay on Sentry; backend / SRE move to Honeycomb</li>
<li><strong>Application redesign</strong>: instrument with OpenTelemetry to avoid vendor SDK lock-in</li>
</ul>
<h2 id="application-重設計範例">Application redesign example</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Both snippets assume process_order, order_id, user_id, order and region
# are defined by the surrounding application.

# Before: Sentry SDK
import sentry_sdk
sentry_sdk.init(dsn='https://x@sentry.io/y')

try:
    process_order(order_id)
except Exception as e:
    # Only the failure path produces a record.
    sentry_sdk.capture_exception(e)
    raise

# After: OpenTelemetry + Honeycomb
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Register a tracer provider that batches spans and ships them over OTLP.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='https://api.honeycomb.io', headers={'x-honeycomb-team': 'YOUR_API_KEY'}))
)
tracer = trace.get_tracer(__name__)

# Every call produces a span (wide event), whether it succeeds or fails.
with tracer.start_as_current_span('process_order') as span:
    span.set_attribute('order.id', order_id)
    span.set_attribute('user.id', user_id)
    span.set_attribute('order.amount', order.amount)  # high-cardinality fields are expected here
    span.set_attribute('order.region', region)
    try:
        process_order(order_id)
        span.set_status(trace.Status(trace.StatusCode.OK))
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise</code></pre></div><p>Differences:</p>
<ul>
<li>Sentry captures only the exception plus minimal context</li>
<li>Honeycomb records a <em>wide event</em> for every operation, including high-cardinality fields (user.id / order.amount / order.region)</li>
<li>SREs can run multi-dimensional queries such as <code>WHERE order.region = &quot;us-west-2&quot; AND duration &gt; 5000</code></li>
</ul>
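<p>The kind of multi-dimensional question described above can be illustrated without any vendor tooling. The following is a rough sketch over an in-memory list of wide events; the sample data, the <code>duration_ms</code> field name, and the helpers <code>slow_events</code> and <code>count_by</code> are all made up for illustration and are not Honeycomb's query engine or API.</p>

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical wide events, one per handled request.
events: List[Dict[str, Any]] = [
    {"order.region": "us-west-2", "user.id": "u-1", "duration_ms": 6200},
    {"order.region": "us-west-2", "user.id": "u-2", "duration_ms": 120},
    {"order.region": "eu-west-1", "user.id": "u-3", "duration_ms": 7400},
    {"order.region": "us-west-2", "user.id": "u-1", "duration_ms": 5100},
]


def slow_events(rows: List[Dict[str, Any]], region: str, threshold_ms: float) -> List[Dict[str, Any]]:
    """Filter on two dimensions at once: region and latency."""
    return [
        row for row in rows
        if row["order.region"] == region and row["duration_ms"] > threshold_ms
    ]


def count_by(rows: List[Dict[str, Any]], field_name: str) -> Dict[Any, int]:
    """Group the filtered events by any field, including a high-cardinality one."""
    counts: Dict[Any, int] = defaultdict(int)
    for row in rows:
        counts[row[field_name]] += 1
    return dict(counts)


slow = slow_events(events, "us-west-2", 5000)
slow_by_user = count_by(slow, "user.id")
```

<p>Grouping the slow requests by <code>user.id</code> is only possible because each event already carries that field; an error-only record would have nothing to group for requests that were slow but succeeded.</p>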
<h2 id="migration-流程">Migration process</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text">1. Audit the application: list every Sentry SDK usage and capture pattern
2. Classify each usage and plan its handling:
   - Pure error tracking (frontend): keep Sentry
   - Backend system traces: move to Honeycomb / OTel
   - Error + context (mixed): evaluate during a dual-write period
3. Instrument with OpenTelemetry:
   - Replace the vendor SDK with the OTel SDK
   - Honeycomb is an OTLP target, which decouples instrumentation from the vendor
4. Cut backend applications over to Honeycomb (3-6 months)
5. Keep frontend / mobile on Sentry
6. SRE training: Honeycomb BubbleUp / heatmap / multi-dimensional queries</code></pre></div><h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1event-schema-對位失敗sre-不會用-bubbleup">Case 1: Event schema mapping fails and SREs cannot use BubbleUp</h3>
<p><strong>Symptom</strong>: after the cutover, SREs keep working with a Sentry mindset (find the error, fix it). Nobody knows how to use Honeycomb BubbleUp or heatmaps, and observability degrades to <em>watching the error count</em>.</p>
<p><strong>Root cause</strong>: a Sentry → Honeycomb migration changes the <em>observability mindset</em>, not just the tool, and the SREs were never trained on wide-event queries or BubbleUp anomaly detection.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>SRE training</strong>: 1-2 weeks of hands-on practice with Honeycomb BubbleUp, heatmaps, and multi-dimensional queries</li>
<li><strong>Include a sample-query playbook in the migration scope</strong>: write the Honeycomb query for each incident type into a runbook</li>
<li><strong>Keep Sentry for frontend / mobile</strong>: do not force a full cutover; keep the parts where the <em>paradigm fits</em></li>
</ol>
<h3 id="case-2sampling-行為差production-cost-飛">Case 2: Sampling behaves differently and production cost spikes</h3>
<p><strong>Symptom</strong>: in the first month after the cutover, event volume is 100x what Sentry received, and the bill spikes.</p>
<p><strong>Root cause</strong>: Sentry samples transactions (a 10% sample rate in this setup) and keeps all errors. On the Honeycomb side <em>every span is a wide event</em>; the application sent everything with no sampling configured, so event volume exploded.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Honeycomb Refinery (sampling proxy)</strong>: deploy Refinery between the application and Honeycomb for tail-based sampling</li>
<li><strong>Sampling rules</strong>: keep <em>anomalies</em> (errors / slow requests / outliers) and drop 90%+ of <em>boring successes</em></li>
<li><strong>Intensive cost monitoring in the first week</strong>: build a dashboard for cardinality, event volume, and cost to catch unexpected spikes</li>
</ol>
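<p>The sampling rule in the fix list can be sketched as a small decision function. This is an illustration of the idea only, not Refinery's configuration format or implementation; the function name <code>keep_trace</code>, the 1000 ms slow threshold, and the keep-1-in-10 rate are assumptions chosen for the example. The decision is made after the trace has finished (tail-based), so the error flag and duration are known.</p>

```python
import hashlib
from typing import Tuple


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0,
               success_keep_1_in: int = 10) -> Tuple[bool, int]:
    """Return (keep, sample_rate) for a finished trace.

    Errors and slow traces are always kept (sample_rate 1).
    Fast successes are kept 1 in N, chosen deterministically from the trace id
    so that every span of one trace gets the same decision.
    """
    if has_error or duration_ms >= slow_threshold_ms:
        return True, 1
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % success_keep_1_in
    return bucket == 0, success_keep_1_in


decisions = [keep_trace(f"trace-{i}", has_error=False, duration_ms=50.0) for i in range(1000)]
kept_successes = sum(1 for keep, _ in decisions if keep)
```

<p>Returning the sample rate alongside the decision matters: a kept success stands in for roughly N original events, and a backend that knows the rate can re-weight counts instead of under-reporting traffic.</p>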
<h3 id="case-3error-grouping-失效">Case 3: Error grouping stops working</h3>
<p><strong>Symptom</strong>: after the cutover, <em>similar errors</em> are no longer grouped into a single issue. SREs see every event on its own, and failure patterns drown in noise.</p>
<p><strong>Root cause</strong>: Sentry groups errors automatically (by stack trace fingerprint); Honeycomb has no equivalent. Wide events are first-class, so event grouping requires the application to explicitly set an <code>error.type</code> field.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Set an error type field in the application</strong>: <code>span.set_attribute('error.type', exception_class)</code></li>
<li><strong>Honeycomb derived column</strong>: compute an error fingerprint with a derived column</li>
<li><strong>Keep Sentry for error tracking</strong>: pure error grouping is where Sentry is strong; do not force the cutover</li>
</ol>
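<p>As one way to picture an application-side fingerprint, the sketch below hashes the exception class together with the function names of the innermost stack frames. This is not Sentry's grouping algorithm; the helper name <code>error_fingerprint</code>, the choice of three frames, and the idea of storing the result in a span attribute such as <code>error.fingerprint</code> are assumptions made for this example.</p>

```python
import hashlib
import traceback


def error_fingerprint(exc: BaseException, max_frames: int = 3) -> str:
    """Build a stable grouping key from the exception class and innermost frames.

    The message and line numbers are deliberately excluded, so errors that
    differ only by an order id or a small code shift land in the same group.
    """
    frames = traceback.extract_tb(exc.__traceback__)[-max_frames:]
    parts = [type(exc).__name__] + [frame.name for frame in frames]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:12]


def _fail_value(message: str) -> None:
    raise ValueError(message)


def _fail_key(message: str) -> None:
    raise KeyError(message)


def capture(fn, message: str) -> str:
    """Run fn and return the fingerprint of the exception it raises."""
    try:
        fn(message)
    except Exception as exc:
        return error_fingerprint(exc)
    return ""


fp_a = capture(_fail_value, "order 1 failed")
fp_b = capture(_fail_value, "order 2 failed")
fp_c = capture(_fail_key, "order 1 failed")
```

<p>In an instrumented handler the returned string would be set next to <code>error.type</code> on the span, giving queries a single field to group on.</p>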
<h3 id="case-4cost-模型差預估錯">Case 4: The cost models differ and the estimate is wrong</h3>
<p><strong>Symptom</strong>: the estimate was a 50% cost saving from the cutover; the actual saving is only 10-15%.</p>
<p><strong>Root cause</strong>: Sentry's per-error pricing is expensive for error-heavy applications; Honeycomb's per-event pricing is expensive for applications with high <em>wide-event volume</em>. For an application with <em>high event volume but few errors</em>, Honeycomb can end up more expensive.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-migration estimate</strong>: run an OTel pilot for 1-2 weeks to measure real event volume</li>
<li><strong>Retention design</strong>: 7 days hot + 30 days cold + 1 year archive to lower cost</li>
<li><strong>Keep the hybrid architecture</strong>: frontend / mobile on Sentry, backend on Honeycomb, so neither side's cost blows up</li>
</ol>
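<p>The pre-migration estimate is simple arithmetic once the pilot numbers exist. The sketch below extrapolates a pilot to a 30-day, full-traffic event volume after sampling. The function name <code>monthly_events</code> and all input figures are hypothetical, and no vendor price is assumed; the output is an event count to hold against whatever plan tiers apply.</p>

```python
def monthly_events(pilot_events: int, pilot_days: int,
                   traffic_percent: int, keep_1_in: int) -> int:
    """Extrapolate pilot volume to a 30-day, full-traffic volume after sampling.

    pilot_events     events received during the pilot
    pilot_days       length of the pilot in days
    traffic_percent  share of production traffic the pilot covered (1-100)
    keep_1_in        planned sampling: keep one event out of every N
    """
    daily_full_traffic = pilot_events / pilot_days * 100 / traffic_percent
    return round(daily_full_traffic * 30 / keep_1_in)


# Hypothetical pilot: 14 days on 10% of traffic produced 14 million events.
unsampled = monthly_events(14_000_000, 14, traffic_percent=10, keep_1_in=1)
sampled = monthly_events(14_000_000, 14, traffic_percent=10, keep_1_in=10)
```

<p>With these made-up inputs the unsampled projection is 300 million events per month and the 1-in-10 projection is 30 million, which shows how strongly the sampling decision drives the estimate.</p>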
<h3 id="case-5alert-paradigm-不對等">Case 5: The alert paradigms do not map one-to-one</h3>
<p><strong>Symptom</strong>: Sentry alerts are simple (error rate / latency p99 thresholds), while Honeycomb triggers are more complex to configure (SLO + burn rate + BubbleUp); the SRE learning curve is 1-2 months.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Include alert rebuild in the migration scope</strong>: Honeycomb triggers do not map directly onto Sentry alerts and have to be rewritten</li>
<li><strong>SLO-driven alerts</strong>: replace Sentry threshold alerts with Honeycomb SLOs to reduce alert fatigue</li>
<li><strong>PagerDuty integration</strong>: both vendors support it; review routing rules and deduplication</li>
</ol>
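<p>For readers coming from threshold alerts, the burn-rate idea behind SLO-driven alerting can be sketched in a few lines. This is the generic calculation, not Honeycomb's SLO feature or API; the function names and the 14.4 paging threshold are assumptions used for illustration.</p>

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being consumed in a window.

    1.0 means the budget is being spent exactly at the rate the SLO allows;
    5.0 means five times too fast.
    """
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% target
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn too fast.

    Requiring both windows filters out brief blips (short window only)
    and incidents that have already recovered (long window only).
    """
    return short_window_rate >= threshold and long_window_rate >= threshold


# Hypothetical window: 50 bad events out of 10,000 against a 99.9% target.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
```

<p>Here the window burns budget about five times faster than allowed, which is below the paging threshold used in this sketch; a plain error-rate threshold has no notion of budget and would need a separately tuned number for every service.</p>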
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Sentry</th>
          <th>Honeycomb</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pricing model</td>
          <td>Per-error + transaction</td>
          <td>Per-event (wide event)</td>
      </tr>
      <tr>
          <td>Cost (mid-tier)</td>
          <td>$500-2000 / mo</td>
          <td>$400-3000 / mo (depends on event volume)</td>
      </tr>
      <tr>
          <td>Sampling</td>
          <td>Built-in transaction sampling</td>
          <td>Refinery (additional component)</td>
      </tr>
      <tr>
          <td>Cardinality</td>
          <td>~1000 unique value / tag</td>
          <td>Millions / field</td>
      </tr>
      <tr>
          <td>Application complexity</td>
          <td>Low (SDK + capture exception)</td>
          <td>Medium (OTel + wide event instrument)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>2-4 FTE × 2-3 months</td>
      </tr>
  </tbody>
</table>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-opentelemetry-整合">Integrating with OpenTelemetry</h3>
<p>OTel is vendor-neutral instrumentation and Honeycomb is an OTLP backend. Once the application is instrumented with OTel, it can ship to multiple backends at the same time (Jaeger in dev, Honeycomb in production, Tempo as a fallback).</p>
<h3 id="跟-datadog--grafana-stack-對位">Comparison with <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack</a></h3>
<p>Two observability routes:</p>
<ul>
<li>Grafana Stack (Mimir / Loki / Tempo): self-hosted or Grafana Cloud, with an open source baseline</li>
<li>Honeycomb: SaaS-only, focused on wide-event observability</li>
</ul>
<p>The choice depends on the <em>observability paradigm</em>: trace-heavy workloads go to Tempo / Honeycomb, metric-heavy workloads to Mimir / Datadog.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/04-observability/vendors/sentry/" data-link-title="Sentry" data-link-desc="Error tracking 主流、APM / Profiling / Session Replay 擴展">Sentry</a></li>
<li>Target vendor: <a href="/blog/backend/04-observability/vendors/honeycomb/" data-link-title="Honeycomb" data-link-desc="High-cardinality observability 平台、events-based 模型">Honeycomb</a></li>
<li>Parallel migration playbooks (Type E): <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a> / <a href="/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/" data-link-title="etcd → Consul：KV &#43; N 個 extras feature matrix" data-link-desc="etcd → Consul 是 Type E paradigm shift expansion — 從 pure KV store 升到 service mesh / discovery / health check / multi-DC；本文用對照表 &#43; paradigm expansion 路線、5 個 production 踩雷（API 對位 / lock semantics / watch event model / multi-DC topology / ACL system）">etcd → Consul</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>etcd → Consul：KV + N 個 extras feature matrix</title><link>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/consul/migrate-from-etcd/</guid><description>&lt;blockquote>
&lt;p>This is a cross-vendor migration playbook that cross-links &lt;a href="https://etcd.io/">etcd&lt;/a> and &lt;a href="https://tarrragon.github.io/blog/backend/05-deployment-platform/vendors/consul/" data-link-title="Consul" data-link-desc="Service registry / mesh / KV / DNS">Consul&lt;/a>. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> maps it to &lt;em>Paradigm = High (pure KV → service mesh paradigm) → Type E paradigm shift&lt;/em>. It is the dual of &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &amp;#43; 資料結構 &amp;#43; pub/sub &amp;#43; Lua &amp;#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &amp;#43; paradigm reduction 路線">Redis → Memcached&lt;/a> (paradigm reduction): this article covers the &lt;em>paradigm expansion&lt;/em> (upgrade) direction.&lt;/p>&lt;/blockquote>
&lt;h2 id="kv--n-個-extrasfeature-matrix">KV + N extras: feature matrix&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Concept&lt;/th>
 &lt;th>etcd&lt;/th>
 &lt;th>Consul&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Core paradigm&lt;/td>
 &lt;td>Pure KV with Raft consensus&lt;/td>
 &lt;td>Service mesh (KV + 6 other features)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data store&lt;/td>
 &lt;td>KV with versioned values + watch&lt;/td>
 &lt;td>KV + service catalog + health checks + sessions&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>API style&lt;/td>
 &lt;td>gRPC + HTTP/REST&lt;/td>
 &lt;td>HTTP/REST + gRPC（Connect）+ DNS&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Service discovery&lt;/td>
 &lt;td>None (application-managed)&lt;/td>
 &lt;td>Built-in (DNS / HTTP API)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Health check&lt;/td>
 &lt;td>None&lt;/td>
 &lt;td>Built-in (HTTP / TCP / script / TTL)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Service mesh&lt;/td>
 &lt;td>None&lt;/td>
 &lt;td>Connect (mTLS + intentions + service-to-service)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-DC&lt;/td>
 &lt;td>Not supported (per-cluster only)&lt;/td>
 &lt;td>Built-in WAN federation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>ACL system&lt;/td>
 &lt;td>RBAC (etcd v3 auth: users and roles)&lt;/td>
 &lt;td>Token-based ACL + namespaces (Enterprise)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Lock primitive&lt;/td>
 &lt;td>Lease + transaction&lt;/td>
 &lt;td>Session + KV check-and-set&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Watch event model&lt;/td>
 &lt;td>Event stream（gRPC stream）&lt;/td>
 &lt;td>Long-polling blocking query (X-Consul-Index)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Distributed config&lt;/td>
 &lt;td>KV + watch&lt;/td>
 &lt;td>KV + watch + template rendering (consul-template)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Use-case fit&lt;/td>
 &lt;td>K8s control plane / pure distributed KV&lt;/td>
 &lt;td>Service mesh + service discovery + config + KV&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>The core difference is not that Consul has more features; Consul is a service mesh paradigm&lt;/strong>: service discovery, health checks, and Connect mTLS are first-class, and KV is just one sub-feature.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is a cross-vendor migration playbook that cross-links <a href="https://etcd.io/">etcd</a> and <a href="/blog/backend/05-deployment-platform/vendors/consul/" data-link-title="Consul" data-link-desc="Service registry / mesh / KV / DNS">Consul</a>. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Paradigm = High (pure KV → service mesh paradigm) → Type E paradigm shift</em>. It is the dual of <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a> (paradigm reduction): this article covers the <em>paradigm expansion</em> (upgrade) direction.</p></blockquote>
<h2 id="kv--n-個-extrasfeature-matrix">KV + N extras: feature matrix</h2>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>etcd</th>
          <th>Consul</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Core paradigm</td>
          <td>Pure KV with Raft consensus</td>
          <td>Service mesh (KV + 6 other features)</td>
      </tr>
      <tr>
          <td>Data store</td>
          <td>KV with versioned values + watch</td>
          <td>KV + service catalog + health checks + sessions</td>
      </tr>
      <tr>
          <td>API style</td>
          <td>gRPC + HTTP/REST</td>
          <td>HTTP/REST + gRPC（Connect）+ DNS</td>
      </tr>
      <tr>
          <td>Service discovery</td>
          <td>None (application-managed)</td>
          <td>Built-in (DNS / HTTP API)</td>
      </tr>
      <tr>
          <td>Health check</td>
          <td>None</td>
          <td>Built-in (HTTP / TCP / script / TTL)</td>
      </tr>
      <tr>
          <td>Service mesh</td>
          <td>None</td>
          <td>Connect (mTLS + intentions + service-to-service)</td>
      </tr>
      <tr>
          <td>Multi-DC</td>
          <td>Not supported (per-cluster only)</td>
          <td>Built-in WAN federation</td>
      </tr>
      <tr>
          <td>ACL system</td>
          <td>RBAC (etcd v3 auth: users and roles)</td>
          <td>Token-based ACL + namespaces (Enterprise)</td>
      </tr>
      <tr>
          <td>Lock primitive</td>
          <td>Lease + transaction</td>
          <td>Session + KV check-and-set</td>
      </tr>
      <tr>
          <td>Watch event model</td>
          <td>Event stream（gRPC stream）</td>
          <td>Long-polling blocking query (X-Consul-Index)</td>
      </tr>
      <tr>
          <td>Distributed config</td>
          <td>KV + watch</td>
          <td>KV + watch + template rendering (consul-template)</td>
      </tr>
      <tr>
          <td>Use-case fit</td>
          <td>K8s control plane / pure distributed KV</td>
          <td>Service mesh + service discovery + config + KV</td>
      </tr>
  </tbody>
</table>
<p><strong>The core difference is not that Consul has more features; Consul is a service mesh paradigm</strong>: service discovery, health checks, and Connect mTLS are first-class, and KV is just one sub-feature.</p>
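<p>One row of the matrix that changes application code directly is the watch model. An etcd watch is a stream that pushes each event, whereas a Consul blocking query is a long-poll that returns the current value plus an index (<code>X-Consul-Index</code>) to pass into the next call. The sketch below shows the client-side loop only. The <code>fetch</code> callable is a hypothetical stand-in for the HTTP request and the fake responses are made up, so nothing here talks to a real Consul agent.</p>

```python
from typing import Callable, Iterator, List, Optional, Tuple

# fetch(index) -> (new_index, value). Hypothetical stand-in for a blocking
# HTTP GET on a KV key that passes the last seen index and waits for a change.
Fetch = Callable[[int], Tuple[int, Optional[str]]]


def watch_key(fetch: Fetch, max_polls: int) -> Iterator[str]:
    """Long-poll loop: yield the value each time the index moves forward."""
    index = 0
    for _ in range(max_polls):
        new_index, value = fetch(index)
        if new_index < index:
            # The index went backwards (for example after a restore): start over.
            index = 0
            continue
        if new_index == index:
            # The wait timed out with no change: poll again.
            continue
        index = new_index
        if value is not None:
            yield value


def make_fake_fetch(responses: List[Tuple[int, Optional[str]]]) -> Fetch:
    """Return a fetch function that replays canned (index, value) responses."""
    iterator = iter(responses)

    def fetch(index: int) -> Tuple[int, Optional[str]]:
        return next(iterator)

    return fetch


fake = make_fake_fetch([(5, "a"), (5, "a"), (7, "b"), (3, "c"), (4, "c")])
changes = list(watch_key(fake, max_polls=5))
```

<p>Because each poll returns the current state rather than a list of events, several writes between two polls collapse into one observed change. Code ported from an etcd watch that relied on seeing every intermediate revision needs to be re-examined.</p>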
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>KV API maps over + N extra APIs</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Both Raft-based; similar operations</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Pure KV → service mesh</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Still a single cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>KV API calls change + new service registration / health checks</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Single DC → multi-DC (if federation is used)</td>
          <td>Low-Medium</td>
      </tr>
  </tbody>
</table>
<p>Paradigm = High (the rest are Low to Medium) → <strong>Type E paradigm shift</strong>; KV is a sub-feature, not the whole migration scope.</p>
<h2 id="為什麼遷3-條-expansion-driver">Why migrate: 3 expansion drivers</h2>
<ul>
<li><strong>Service mesh adoption</strong>: etcd was originally there for the K8s control plane; now the application side needs a service mesh (mTLS / intentions / traffic shifting), and Consul covers all of it in one place</li>
<li><strong>Multi-DC strategy</strong>: etcd has no built-in cross-DC support, so multi-DC means active-passive failover; Consul WAN federation supports active-active across DCs</li>
<li><strong>Configuration management</strong>: consul-template + envconsul is simpler than etcd watch + a hand-written reloader</li>
</ul>
<p>Reverse drivers (Consul → etcd):</p>
<ul>
<li>Pure K8s control plane scenario with no need for service discovery / health checks / mesh: etcd is simple and sufficient</li>
<li>Resource constraints: the Consul agent is more resource-hungry than etcd, and low-end VMs fall short</li>
</ul>
<h2 id="paradigm-expansion-路線">Paradigm expansion roadmap</h2>
<p>This is the dual of <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached paradigm reduction</a> (removing features): moving to Consul <em>adds features</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">etcd KV pattern         → Consul KV API (1:1 mapping)
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">etcd watch              → Consul blocking query / consul-template
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">etcd lease + lock       → Consul session + KV CAS
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">(added on top)
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">none                    → Consul service registration (services.json / API)
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">none                    → Consul health check (HTTP / TCP / TTL)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">none                    → Consul service discovery (DNS / HTTP)
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">none                    → Consul Connect (mTLS + intentions)
</span></span><span class="line"><span class="ln">10</span><span class="cl">none                    → Consul WAN federation (multi-DC)
</span></span><span class="line"><span class="ln">11</span><span class="cl">none                    → Consul ACL token + policy</span></span></code></pre></div><p>Migration is more than mapping the KV API; it <em>adds capabilities to the application</em>.</p>
<h2 id="api-對位">API mapping</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># etcd basic KV</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">etcdctl put /myapp/config/db_url <span class="s1">&#39;postgres://...&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">etcdctl get /myapp/config/db_url
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># Consul KV (equivalent)</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">consul kv put myapp/config/db_url <span class="s1">&#39;postgres://...&#39;</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">consul kv get myapp/config/db_url</span></span></code></pre></div>
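<p>The commands above store the same setting under <code>/myapp/config/db_url</code> in etcd and <code>myapp/config/db_url</code> in Consul, so a bulk copy needs a key-mapping step. A minimal sketch of that step, assuming the keys differ only by the leading slash; <code>etcd_key_to_consul_key</code> and <code>plan_kv_copy</code> are hypothetical helper names, not part of either SDK:</p>

```python
from typing import Dict


def etcd_key_to_consul_key(etcd_key: str) -> str:
    """Map an etcd key to the slash-less form used in the Consul examples."""
    # Assumption: the only difference is the leading "/" that the etcd
    # examples use and the Consul KV path (/v1/kv/<key>) does not.
    key = etcd_key.lstrip("/")
    if not key:
        raise ValueError(f"cannot map empty key from {etcd_key!r}")
    return key


def plan_kv_copy(etcd_items: Dict[str, bytes]) -> Dict[str, bytes]:
    """Build the Consul-side key/value map and refuse colliding keys."""
    plan: Dict[str, bytes] = {}
    for etcd_key, value in sorted(etcd_items.items()):
        consul_key = etcd_key_to_consul_key(etcd_key)
        if consul_key in plan:
            # Two distinct etcd keys would overwrite each other in Consul.
            raise ValueError(f"{etcd_key!r} collides on {consul_key!r}")
        plan[consul_key] = value
    return plan
```

<p>Building the plan first, before any write, surfaces key collisions during the pre-migration audit rather than during cutover.</p>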




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># etcd watch</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">etcdctl watch --prefix /myapp/config/
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Consul blocking query (long polling)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">curl <span class="s1">&#39;http://consul:8500/v1/kv/myapp/config?recurse&amp;index=5&amp;wait=10s&#39;</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># The X-Consul-Index response header is the watch cursor; pass it back as index= on the next call</span></span></span></code></pre></div>
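<p>A blocking query only behaves like a watch if the client feeds each returned <code>X-Consul-Index</code> back as the next <code>index</code>. A minimal sketch of that loop, with the HTTP call hidden behind a hypothetical <code>fetch(index, wait)</code> callable (an assumption for illustration) so that the cursor handling can be read on its own; resetting the cursor when the index goes backwards follows the guidance in Consul's blocking query documentation:</p>

```python
from typing import Callable, Iterator, Optional, Tuple

# fetch(index, wait) -> (new_index, payload). In a real client this would issue
# GET /v1/kv/<prefix>?recurse&index=<index>&wait=<wait> and read X-Consul-Index.
Fetch = Callable[[int, str], Tuple[int, Optional[list]]]


def watch_prefix(
    fetch: Fetch, wait: str = "10s", max_rounds: Optional[int] = None
) -> Iterator[Optional[list]]:
    """Yield the payload each time the index moves forward."""
    index = 0  # index=0 makes the first call return the current state at once
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        new_index, payload = fetch(index, wait)
        if new_index <= 0 or new_index < index:
            # Invalid or backwards index (e.g. after a restore): restart the cursor.
            index = 0
            continue
        if new_index == index:
            # The wait timed out with no change; re-issue the query.
            continue
        index = new_index
        yield payload
```

<p>Every yielded payload costs one more round trip before the next change can be seen, which is the structural difference from a push-based gRPC stream.</p>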




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># etcd transaction (atomic compare-then-put; stdin is compares, blank line, success ops, blank line, failure ops, blank line)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">etcdctl txn <span class="s">&lt;&lt;EOF
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s">mod(&#34;/myapp/lock&#34;) = &#34;0&#34;
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s">put /myapp/lock &#34;owner1&#34;
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s">EOF</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># Consul session + KV acquire (equivalent)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="nv">SESSION_ID</span><span class="o">=</span><span class="k">$(</span>curl -X PUT <span class="s1">&#39;http://consul:8500/v1/session/create&#39;</span> <span class="p">|</span> jq -r .ID<span class="k">)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">curl -X PUT <span class="s1">&#39;http://consul:8500/v1/kv/myapp/lock?acquire=&#39;</span><span class="nv">$SESSION_ID</span> -d <span class="s1">&#39;owner1&#39;</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># A response of false means the lock is already held by another session</span></span></span></code></pre></div><h2 id="application-重設計">Application redesign</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Before: etcd</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">etcd3</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">etcd</span> <span class="o">=</span> <span class="n">etcd3</span><span class="o">.</span><span class="n">client</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">&#39;etcd&#39;</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">2379</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">etcd</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="s1">&#39;/myapp/config/db_url&#39;</span><span class="p">,</span> <span class="s1">&#39;postgres://...&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="n">db_url</span> <span class="o">=</span> <span class="n">etcd</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;/myapp/config/db_url&#39;</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># After: Consul (KV-only)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="kn">import</span> <span class="nn">consul</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">c</span> <span class="o">=</span> <span class="n">consul</span><span class="o">.</span><span class="n">Consul</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">&#39;consul&#39;</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">8500</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">c</span><span class="o">.</span><span class="n">kv</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="s1">&#39;myapp/config/db_url&#39;</span><span class="p">,</span> <span class="s1">&#39;postgres://...&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="n">_</span><span class="p">,</span> <span class="n">kv</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">kv</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;myapp/config/db_url&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">db_url</span> <span class="o">=</span> <span class="n">kv</span><span class="p">[</span><span class="s1">&#39;Value&#39;</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># (Added on top) After: Consul service discovery</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">c</span><span class="o">.</span><span class="n">agent</span><span class="o">.</span><span class="n">service</span><span class="o">.</span><span class="n">register</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="n">name</span><span class="o">=</span><span class="s1">&#39;myapp&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="n">service_id</span><span class="o">=</span><span class="s1">&#39;myapp-1&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="n">address</span><span class="o">=</span><span class="s1">&#39;10.0.0.10&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">    <span class="n">port</span><span class="o">=</span><span class="mi">8080</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">    <span class="n">check</span><span class="o">=</span><span class="n">consul</span><span class="o">.</span><span class="n">Check</span><span class="o">.</span><span class="n">http</span><span class="p">(</span><span class="s1">&#39;http://10.0.0.10:8080/health&#39;</span><span class="p">,</span> <span class="s1">&#39;10s&#39;</span><span class="p">,</span> <span class="s1">&#39;5s&#39;</span><span class="p">,</span> <span class="s1">&#39;30s&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="c1"># DNS-based discovery (how other services find myapp)</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="c1"># dig +short myapp.service.consul SRV</span></span></span></code></pre></div><h2 id="migration-流程">Migration flow</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">1. Pre-migration audit
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   - List every application that uses etcd
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">   - Assess whether each application *needs* the Consul extras (service discovery / health / mesh)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">   - Tag pure-KV use cases as *low-effort migration* and users of the extras as *value-add migration*
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">2. Consul cluster build
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   - Cross-DC design (WAN federation planning)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">   - Configure the ACL system (do not leave it default-open)
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">   - Performance sizing (the Consul agent is heavier than etcd)
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl">3. Application migration (per app)
</span></span><span class="line"><span class="ln">12</span><span class="cl">   - Pure KV: swap the SDK, map the API, cut over
</span></span><span class="line"><span class="ln">13</span><span class="cl">   - Service discovery: add registration + health check + DNS lookup
</span></span><span class="line"><span class="ln">14</span><span class="cl">   - Service mesh: add Connect proxy + intentions
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">4. Dual-run period
</span></span><span class="line"><span class="ln">17</span><span class="cl">   - etcd keeps running while applications move to Consul incrementally
</span></span><span class="line"><span class="ln">18</span><span class="cl">   - Verify after each application cutover
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">5. etcd decommission
</span></span><span class="line"><span class="ln">21</span><span class="cl">   - Confirm every application has cut over
</span></span><span class="line"><span class="ln">22</span><span class="cl">   - If the K8s control plane is the only remaining etcd user, keep that etcd and do not migrate it</span></span></code></pre></div><p>Overall 2-4 months, depending on the number of applications and how many of the extras are adopted.</p>
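<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1kv-api-對位看似-11watch-event-model-不同">Case 1: the KV API looks 1:1, but the watch event model differs</h3>
<p><strong>Symptom</strong>: after the application switches from etcd watch to Consul blocking queries, event-handling latency rises from 50ms to 1-5s; the application assumes events are pushed in real time, but it is now polling.</p>
<p><strong>Root cause</strong>: etcd watch is a gRPC stream that pushes events as they happen; a Consul blocking query is long-polling with a <code>wait</code> timeout, so an event is delivered immediately only if it arrives while a request is outstanding, and the client has to re-issue the request after every response.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Lower the <code>wait</code> timeout</strong> to match the business requirement (default 5min; it can be set to 10s)</li>
<li><strong>Concurrent polling across instances</strong>: each of the N application instances polls on its own, which reduces single-point event delay</li>
<li><strong>Architecture</strong>: send critical events through the Consul event API (<code>PUT /v1/event/fire/&lt;name&gt;</code>) plus a blocking query on the event endpoint, separate from KV changes</li>
<li><strong>Keep etcd for critical watches</strong>: mission-critical watches stay on etcd and are not migrated</li>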
</ol>
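<h3 id="case-2session-based-lock-跟-etcd-lease-差">Case 2: session-based locks differ from etcd leases</h3>
<p><strong>Symptom</strong>: with an etcd lease at a 5s TTL, the lock was released automatically within 5s when the lease holder lost contact; after switching to Consul sessions the session TTL still applies, but the health check integration is complex and locks are occasionally not released.</p>
<p><strong>Root cause</strong>: a Consul session has two behaviors: <code>release</code> (the default: the lock is released when the session is invalidated, and the key is kept) vs <code>delete</code> (the key holding the lock is deleted); combining a TTL with health checks makes the resulting behavior complex.</p>
<p><strong>Fix</strong>:</p>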





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Set the session behavior explicitly</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">session_id</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="n">name</span><span class="o">=</span><span class="s1">&#39;myapp-lock&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">    <span class="n">ttl</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>           <span class="c1"># 15s TTL</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="n">behavior</span><span class="o">=</span><span class="s1">&#39;delete&#39;</span> <span class="c1"># the locked key is deleted when the session is invalidated</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="n">c</span><span class="o">.</span><span class="n">kv</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="s1">&#39;myapp/lock&#39;</span><span class="p">,</span> <span class="s1">&#39;owner1&#39;</span><span class="p">,</span> <span class="n">acquire</span><span class="o">=</span><span class="n">session_id</span><span class="p">)</span></span></span></code></pre></div><p>Session TTLs range from 10s to 86400s and cannot go below 10s (etcd allows 1s); Consul is not a fit for critical low-latency locks.</p>
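<p>A sketch of wrapping the acquire step so that a lost race does not leave an idle session behind until its TTL expires. It assumes a python-consul style client such as <code>c</code> above; <code>try_acquire</code> is a hypothetical helper name, and the TTL bounds are the ones stated above:</p>

```python
from typing import Optional

MIN_TTL_S, MAX_TTL_S = 10, 86400  # session TTL bounds quoted in the text


def try_acquire(client, key: str, owner: str, ttl_s: int = 15) -> Optional[str]:
    """Return the session ID if the lock was acquired, otherwise None."""
    if not MIN_TTL_S <= ttl_s <= MAX_TTL_S:
        # An etcd lease TTL below 10s cannot be carried over unchanged.
        raise ValueError(f"session TTL must be {MIN_TTL_S}-{MAX_TTL_S}s, got {ttl_s}")
    session_id = client.session.create(
        name=f"{owner}-lock", ttl=ttl_s, behavior="delete"
    )
    if client.kv.put(key, owner, acquire=session_id):
        return session_id
    # Lost the race: destroy the session now instead of waiting for its TTL.
    client.session.destroy(session_id)
    return None
```

<p>The holder still has to renew the session before the TTL elapses; that renew loop is left out of the sketch.</p>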
<h3 id="case-3multi-dc-failoverkv-寫到-wrong-dc">Case 3: multi-DC failover, KV written to the wrong DC</h3>
<p><strong>Symptom</strong>: after a cross-DC deployment, an application writes a KV entry but cannot read it back; it turns out the application hardcodes one DC endpoint, so the write goes to us-east while the read comes from us-west.</p>
<p><strong>Root cause</strong>: Consul WAN federation does not synchronize KV across DCs automatically; KV is <em>per-DC</em>, and cross-DC sync requires a <em>Consul Enterprise license</em> or a self-managed <em>consul-replicate</em>.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Each application instance connects to the Consul in its local DC</strong>: writes and reads stay in the same DC</li>
<li><strong>Cross-DC KV replication</strong>: self-manage it with consul-replicate, or upgrade to Enterprise</li>
<li><strong>Architecture</strong>: move config shared across DCs to <em>DB-backed config</em> (durable + cross-DC) and keep only DC-local config in Consul KV</li>
</ol>
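<p>Fix 2 amounts to a one-way copy of a key prefix from a primary DC into each replica DC. A minimal sketch of the diff step, using plain dicts as stand-ins for the KV contents of the two DCs; <code>plan_replication</code> and the <code>global/</code> prefix are hypothetical, and this is not how consul-replicate itself is implemented:</p>

```python
from typing import Dict, List, Tuple


def plan_replication(
    source: Dict[str, bytes], replica: Dict[str, bytes], prefix: str
) -> Tuple[Dict[str, bytes], List[str]]:
    """Return (puts, deletes) that make the replica match the source under prefix."""
    src = {k: v for k, v in source.items() if k.startswith(prefix)}
    dst = {k: v for k, v in replica.items() if k.startswith(prefix)}
    # Keys that are missing from the replica or hold a different value there.
    puts = {k: v for k, v in src.items() if dst.get(k) != v}
    # Keys that exist only in the replica are stale and get removed.
    deletes = sorted(k for k in dst if k not in src)
    return puts, deletes
```

<p>Restricting the copy to one prefix leaves DC-local keys untouched, which matches the split between shared and DC-local config in fix 3.</p>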
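<h3 id="case-4acl-system-預設-opencutover-後曝險">Case 4: the ACL system is open by default, exposure after cutover</h3>
<p><strong>Symptom</strong>: one month after the Consul cluster goes live, the SOC runs an audit and finds that any application can read any KV entry; ACLs were never configured, so every token has full permissions.</p>
<p><strong>Root cause</strong>: Consul ACLs are disabled by default and need to be <em>bootstrapped</em>; many setup tutorials skip ACLs for brevity, and nobody adds them back after cutover.</p>
<p><strong>Fix</strong>:</p>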





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Bootstrap the ACL system (ACLs must already be enabled in the agent configuration)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">consul acl bootstrap
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># Generates the management token; keep it as the root credential</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># Create a policy</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">consul acl policy create -name <span class="s1">&#39;myapp-readonly&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  -rules <span class="s1">&#39;key_prefix &#34;myapp/&#34; { policy = &#34;read&#34; }&#39;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Create a token for the application</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">consul acl token create -policy-name <span class="s1">&#39;myapp-readonly&#39;</span></span></span></code></pre></div><p>Bootstrap ACLs as the very first step of a production setup; do not defer it.</p>
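<h3 id="case-5health-check-failure-連鎖service-discovery-失效">Case 5: health check failures cascade, service discovery breaks</h3>
<p><strong>Symptom</strong>: an application instance does not respond to its health check for 5 seconds during a GC pause and Consul marks it failed; DNS queries stop returning the instance; traffic shifts away; after the GC ends the instance is healthy again, but Consul still shows it as failed and takes minutes to recover.</p>
<p><strong>Root cause</strong>: after a health check fails, the check enters the critical state and needs <em>N consecutive successes</em> to return to passing; by default 1-2 successes are enough, but the actual time depends on the check interval.</p>
<p><strong>Fix</strong>:</p>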
<ol>
<li><strong>Set <code>success_before_passing</code> low</strong> (1) for fast recovery</li>
<li><strong>Set <code>failures_before_critical</code> high</strong> (3-5) to tolerate transient failures</li>
<li><strong>Multi-check strategy</strong>: HTTP + TCP + script checks on three axes, rather than relying on a single check</li>
<li><strong>Application-side hint</strong>: on JVM applications, set <code>MaxGCPauseMillis</code> (a pause-time goal, not a hard limit) so that GC pauses stay below the health check interval</li>
</ol>
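<p>To see how the two thresholds in fixes 1 and 2 interact, the sketch below replays a sequence of raw check results through a simplified counter model. It is an illustration under the assumption that the status flips once the run of consecutive results reaches the threshold, not Consul's actual implementation; <code>check_status</code> is a hypothetical name:</p>

```python
from typing import Iterable, List


def check_status(
    results: Iterable[bool],
    failures_before_critical: int = 3,
    success_before_passing: int = 1,
    initial: str = "passing",
) -> List[str]:
    """Replay raw check results (True = success); return the status after each."""
    status, ok_run, fail_run = initial, 0, 0
    timeline = []
    for ok in results:
        if ok:
            ok_run, fail_run = ok_run + 1, 0
            if ok_run >= success_before_passing:
                status = "passing"
        else:
            fail_run, ok_run = fail_run + 1, 0
            if fail_run >= failures_before_critical:
                status = "critical"
        timeline.append(status)
    return timeline
```

<p>With a 2s interval, a 5s GC pause shows up as roughly two failed probes: <code>check_status([True, False, False, True])</code> stays <code>passing</code> throughout with the thresholds above, while <code>failures_before_critical=1</code> flips the instance to <code>critical</code> on the first miss.</p>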
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>etcd</th>
          <th>Consul</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster baseline</td>
          <td>3-5 node Raft cluster</td>
          <td>3-5 server + N agent (per host)</td>
      </tr>
      <tr>
          <td>Memory per node</td>
          <td>2-8GB</td>
          <td>4-16GB (including agent)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.2-0.5</td>
          <td>0.5-1.0 (more features, more operations work)</td>
      </tr>
      <tr>
          <td>Feature surface</td>
          <td>Pure KV</td>
          <td>KV + service mesh + multi-DC + ACL</td>
      </tr>
      <tr>
          <td>Setup complexity</td>
          <td>Low</td>
          <td>Medium-High</td>
      </tr>
      <tr>
          <td>Multi-DC support</td>
          <td>Not supported</td>
          <td>Built-in WAN federation</td>
      </tr>
      <tr>
          <td>License</td>
          <td>Apache 2.0 (open)</td>
          <td>BUSL 1.1 (community; MPL 2.0 for releases before the 2023 license change) / commercial (enterprise)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-3 FTE × 2-4 months</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading</strong>: pure KV use cases stay on etcd; heavy service mesh / multi-DC / discovery needs go to Consul; a mixed deployment is the long-term default (the K8s control plane keeps running etcd, the service mesh runs on Consul).</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-kubernetes-對位">Relation to Kubernetes</h3>
<p>The K8s control plane <em>always</em> uses etcd and does not move to Consul; Consul provides the service mesh + cross-cluster discovery <em>outside</em> K8s. The two coexist and are not mutually exclusive.</p>
<h3 id="跟-vault-整合">Integration with <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/" data-link-title="HashiCorp Vault" data-link-desc="Self-hosted secret management 與 dynamic credential / encryption-as-a-service / PKI engine、跨雲跨環境的 secret 控制面">Vault</a></h3>
<p>Consul and Vault belong to the same HashiCorp ecosystem: Consul handles service discovery / mesh, Vault handles secrets; Consul ACL tokens can be obtained from a Vault dynamic secrets engine.</p>
<h3 id="跟-istio--linkerd-對位">Comparison with <a href="https://istio.io/">Istio / Linkerd</a></h3>
<p>Consul Connect belongs to the service mesh paradigm, alongside Istio / Linkerd; most K8s-native organizations use Istio / Linkerd, while Consul's strength is a mesh that <em>spans K8s + VMs + multiple DCs</em>.</p>
<h3 id="反向-migrationconsul--etcd">Reverse migration (Consul → etcd)</h3>
<p>A few organizations do this when simplifying their stack. The flow mirrors this one, but <em>dropping the service mesh / multi-DC is a deliberate downgrade</em>, and the result cannot be presented as functionally equivalent.</p>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Consul Connect production rollout</strong>: mesh adoption is incremental, with per-service intentions introduced gradually</li>
<li><strong>Multi-DC topology design</strong>: active-active vs active-passive, traded off against RPO/RTO and cost</li>
<li><strong>Integration with the Kubernetes Gateway API</strong>: integration strategy for the service mesh paradigm inside vs outside K8s</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Target vendor：<a href="/blog/backend/05-deployment-platform/vendors/consul/" data-link-title="Consul" data-link-desc="Service registry / mesh / KV / DNS">Consul</a></li>
<li>Parallel migration playbooks (Type E): <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/" data-link-title="Redis → Memcached：Memcached 不是 simpler Redis、是 cache paradigm" data-link-desc="Redis → Memcached 是 Type E paradigm reduction migration — 從 multi-paradigm（KV &#43; 資料結構 &#43; pub/sub &#43; Lua &#43; streams）退到 pure cache；不是「remove Redis features」、是「重新分配 Redis-specific feature 到對應 specialized 服務」；5 個 production 踩雷 &#43; paradigm reduction 路線">Redis → Memcached</a> (the paradigm reduction dual) / <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a></li>
<li>Parallel integration: <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/" data-link-title="HashiCorp Vault" data-link-desc="Self-hosted secret management 與 dynamic credential / encryption-as-a-service / PKI engine、跨雲跨環境的 secret 控制面">HashiCorp Vault</a></li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>Redis → Memcached: Memcached is not a simpler Redis, it is a cache paradigm</title><link>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-memcached/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook that cross-links &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis&lt;/a> and &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/memcached/" data-link-title="Memcached" data-link-desc="純記憶體 key-value cache、無持久化">Memcached&lt;/a>. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> maps it to &lt;em>Paradigm = High (multi-paradigm → pure cache) → Type E paradigm shift&lt;/em>; this article is the dogfood for &lt;em>paradigm reduction&lt;/em> (the downgrade direction).&lt;/p>&lt;/blockquote>
&lt;h2 id="memcached-不是-simpler-redis是-cache-paradigm">Memcached is not a simpler Redis, it is a cache paradigm&lt;/h2>
&lt;p>Treating Redis → Memcached as "removing Redis features" is the most common misjudgment:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Concept&lt;/th>
 &lt;th>Redis&lt;/th>
 &lt;th>Memcached&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Core paradigm&lt;/td>
 &lt;td>Multi-paradigm (KV + data structures + pub/sub + scripting)&lt;/td>
 &lt;td>Pure cache（KV + TTL）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Value types&lt;/td>
 &lt;td>String / Hash / List / Set / Sorted Set / Stream / Bitmap / HyperLogLog&lt;/td>
 &lt;td>byte string only&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Atomic operations&lt;/td>
 &lt;td>100+（INCR / LPUSH / ZADD / &amp;hellip;）&lt;/td>
 &lt;td>INCR / DECR / APPEND / CAS&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Server-side scripting&lt;/td>
 &lt;td>Lua scripts (&lt;code>EVAL&lt;/code>)&lt;/td>
 &lt;td>None&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Pub/Sub&lt;/td>
 &lt;td>Native&lt;/td>
 &lt;td>None&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Persistence&lt;/td>
 &lt;td>RDB / AOF&lt;/td>
 &lt;td>None (everything is lost on restart)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replication&lt;/td>
 &lt;td>Async / sync replication&lt;/td>
 &lt;td>None&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cluster&lt;/td>
 &lt;td>Redis Cluster + Sentinel HA&lt;/td>
 &lt;td>Memcached cluster（client-side sharding）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Eviction policy&lt;/td>
 &lt;td>8 policies (LRU / LFU / random / &amp;hellip;)&lt;/td>
 &lt;td>LRU only&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Expiration accuracy&lt;/td>
 &lt;td>TTL precise to the ms&lt;/td>
 &lt;td>TTL precise to the second, lazy expiration&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>The core difference is not that Memcached lacks Redis features; it is that Memcached is a different cache paradigm.&lt;/strong> Most Redis features (hash / sorted set / pub/sub) &lt;em>should not simply be removed&lt;/em>; they should be &lt;em>reassigned to the matching specialized service&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>Hash / sorted set → JSON plus a self-managed index on the application side&lt;/li>
&lt;li>Pub/Sub → message queue (NATS / Redis Streams / Kafka)&lt;/li>
&lt;li>Lua scripts → application code&lt;/li>
&lt;li>Persistence → data that truly needs it belongs in a DB, not a cache&lt;/li>
&lt;li>Replication / cluster → Memcached's own cluster strategy&lt;/li>
&lt;/ul>
&lt;h2 id="為什麼遷simplification--cost--ops-三條-driver">Why migrate: three drivers (simplification / cost / ops)&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Operational simplification&lt;/strong>: Memcached has no persistence / replication / cluster mode, so the ops surface shrinks and the team does not need to know Redis's 25+ command families&lt;/li>
&lt;li>&lt;strong>Cost&lt;/strong>: for &lt;em>pure cache use cases&lt;/em>, Memcached is cheaper per GB than Redis (slightly better memory efficiency + no persistence overhead)&lt;/li>
&lt;li>&lt;strong>Strict cache discipline&lt;/strong>: Memcached &lt;em>forces&lt;/em> application code to separate "real cache" from "semi-persistent state", which keeps Redis from turning into a &lt;em>poor man&amp;rsquo;s database&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>Reverse drivers (Memcached → Redis):&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor migration playbook、cross-link <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis</a> 跟 <a href="/blog/backend/02-cache-redis/vendors/memcached/" data-link-title="Memcached" data-link-desc="純記憶體 key-value cache、無持久化">Memcached</a>。跑 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6 維 audit</a> 後對映 <em>Paradigm = High（multi-paradigm → pure cache）→ Type E paradigm shift</em>；本文是 <em>paradigm reduction</em>（downgrade 方向）的 dogfood。</p></blockquote>
<h2 id="memcached-不是-simpler-redis是-cache-paradigm">Memcached is not a simpler Redis; it is a cache paradigm</h2>
<p>Treating Redis → Memcached as "removing Redis features" is the most common misjudgment:</p>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>Redis</th>
          <th>Memcached</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Core paradigm</td>
          <td>Multi-paradigm (KV + data structures + pub/sub + scripting)</td>
          <td>Pure cache (KV + TTL)</td>
      </tr>
      <tr>
          <td>Value types</td>
          <td>String / Hash / List / Set / Sorted Set / Stream / Bitmap / HyperLogLog</td>
          <td>byte string only</td>
      </tr>
      <tr>
          <td>Atomic operations</td>
          <td>100+ (INCR / LPUSH / ZADD / &hellip;)</td>
          <td>INCR / DECR / APPEND / CAS</td>
      </tr>
      <tr>
          <td>Server-side scripting</td>
          <td>Lua scripts (<code>EVAL</code>)</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Pub/Sub</td>
          <td>Native</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Persistence</td>
          <td>RDB / AOF</td>
          <td>None (all data lost on restart)</td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Asynchronous replication (<code>WAIT</code> for acknowledged writes)</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Cluster</td>
          <td>Redis Cluster + Sentinel HA</td>
          <td>Memcached cluster (client-side sharding)</td>
      </tr>
      <tr>
          <td>Eviction policy</td>
          <td>8 policies (LRU / LFU / random / &hellip;)</td>
          <td>LRU only</td>
      </tr>
      <tr>
          <td>Expiration accuracy</td>
          <td>TTL with ms precision</td>
          <td>TTL with second precision, lazy expiration</td>
      </tr>
  </tbody>
</table>
<p><strong>The core difference is not that Memcached lacks Redis features; it is that Memcached is a different cache paradigm.</strong> Most Redis features (hash / sorted set / pub/sub) should <em>not simply be removed</em>; they should be <em>redistributed to a matching specialized service</em>:</p>
<ul>
<li>Hash / sorted set → JSON plus a self-managed index on the application side</li>
<li>Pub/Sub → message queue (NATS / Redis Streams / Kafka)</li>
<li>Lua scripts → application code</li>
<li>Persistence → data that truly needs to survive belongs in a DB, not in a cache</li>
<li>Replication / cluster → the Memcached-native cluster strategy (client-side sharding)</li>
</ul>
<h2 id="為什麼遷simplification--cost--ops-三條-driver">Why migrate: three drivers (simplification / cost / cache discipline)</h2>
<ul>
<li><strong>Operational simplification</strong>: Memcached has no persistence, replication, or cluster mode, so the ops surface shrinks and the team no longer has to know the 25+ Redis command families</li>
<li><strong>Cost</strong>: for a <em>pure cache use case</em>, Memcached is cheaper per GB than Redis (slightly better memory efficiency and no persistence overhead)</li>
<li><strong>Strict cache discipline</strong>: Memcached <em>forces</em> application code to separate true cache data from semi-persistent state, which keeps Redis from turning into a <em>poor man&rsquo;s database</em></li>
</ul>
<p>Reverse driver (Memcached → Redis):</p>
<ul>
<li>If an application built on Memcached turns out to need an <em>atomic counter / leaderboard / queue / lock</em>, it should move up to Redis rather than keep wrapping Memcached</li>
</ul>
<h2 id="跑-6-維-audit">Running the 6-dimension audit</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Redis command set → Memcached command set, compatibility &lt; 20%</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Both are simple; Memcached is slightly simpler</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Multi-paradigm → pure cache</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Still a single cache service</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Mandatory (any hash / list / sorted set / pubsub usage)</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Same single instance / cluster</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>Three dimensions are High (Schema / Paradigm / Application change), so this is a multi-axis migration; the dominant dimension is Paradigm → <strong>Type E paradigm shift</strong>, with Schema and Application change pulled out into their own supplementary sections.</p>
<h2 id="結構類-type-e--paradigm-reduction-分配路線">Structure: Type E-like plus a paradigm reduction redistribution route</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">1. Memcached is not a simpler Redis (concept-reversal opening)
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">2. Why migrate
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">3. 6-dimension audit
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">4. Paradigm reduction route (specialized services matching Redis features)
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">5. Schema diff section (Redis vs Memcached command set)
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">6. Application redesign (per-call-site refactor)
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">7. Migration flow (incremental, cutting over some use cases first)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">8. Production failure drills
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">9. Capacity / cost
</span></span><span class="line"><span class="ln">10</span><span class="cl">10. Integration / next steps</span></span></code></pre></div><p>10 sections, 220-260 lines. Compared with the Type E reference (<a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a>), this adds a <em>paradigm reduction route</em> section.</p>
<h2 id="paradigm-reduction-路線">Paradigm reduction route</h2>
<p>Specialized services matching each Redis feature:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Redis Hash           → App-side JSON serialization + Memcached SET
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">                       (or store in the DB directly + Memcached cache layer)
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">Redis List (queue)   → NATS / Kafka / RabbitMQ / SQS
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Redis List (stack)   → App-side array + self-managed LIFO
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">Redis Set            → App-side array + dedup OR DB unique index
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">Redis Sorted Set     → App-side ordered list + comparator
</span></span><span class="line"><span class="ln">11</span><span class="cl">                       OR PostgreSQL + index
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl">Redis Stream         → Kafka / Redis Streams (retained) / NATS JetStream
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl">Redis Pub/Sub        → NATS Core / Redis Streams / Kafka
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl">Redis Lua script     → Application code (do not assume atomicity)
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl">Redis distributed lock → Consul / etcd / DB advisory lock / Redis (retained)
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl">Redis Bitmap         → DB bit column / app-side bitset
</span></span><span class="line"><span class="ln">22</span><span class="cl">
</span></span><span class="line"><span class="ln">23</span><span class="cl">Redis HyperLogLog    → DB approx_count_distinct / app-side cardinality estimator</span></span></code></pre></div><p>The migration scope includes <em>evaluating a target service for every Redis-specific feature use case</em>; the work is redistribution, not removal.</p>
<h2 id="application-重設計">Application redesign</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Before: Redis hash</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">redis</span><span class="o">.</span><span class="n">hset</span><span class="p">(</span><span class="s1">&#39;user:123&#39;</span><span class="p">,</span> <span class="s1">&#39;email&#39;</span><span class="p">,</span> <span class="s1">&#39;a@b.com&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">redis</span><span class="o">.</span><span class="n">hset</span><span class="p">(</span><span class="s1">&#39;user:123&#39;</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">,</span> <span class="s1">&#39;Alice&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">user</span> <span class="o">=</span> <span class="n">redis</span><span class="o">.</span><span class="n">hgetall</span><span class="p">(</span><span class="s1">&#39;user:123&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># After: Memcached + JSON</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="kn">import</span> <span class="nn">json</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">user_data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;email&#39;</span><span class="p">:</span> <span class="s1">&#39;a@b.com&#39;</span><span class="p">,</span> <span class="s1">&#39;name&#39;</span><span class="p">:</span> <span class="s1">&#39;Alice&#39;</span><span class="p">}</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">mc</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;user:123&#39;</span><span class="p">,</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">user_data</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="n">user</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">mc</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;user:123&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="s1">&#39;</span><span class="si">{}</span><span class="s1">&#39;</span><span class="p">)</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Before: Redis sorted set (leaderboard)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">redis</span><span class="o">.</span><span class="n">zadd</span><span class="p">(</span><span class="s1">&#39;leaderboard&#39;</span><span class="p">,</span> <span class="p">{</span><span class="s1">&#39;alice&#39;</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span> <span class="s1">&#39;bob&#39;</span><span class="p">:</span> <span class="mi">95</span><span class="p">})</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">top_10</span> <span class="o">=</span> <span class="n">redis</span><span class="o">.</span><span class="n">zrevrange</span><span class="p">(</span><span class="s1">&#39;leaderboard&#39;</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">withscores</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># After: PostgreSQL + index + Memcached cache</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># Persistent: write to DB</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># Cache: pre-compute top 10 in DB query, cache in Memcached (column is username: USER is reserved in PostgreSQL)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="n">mc</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;leaderboard:top10&#39;</span><span class="p">,</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="s1">&#39;SELECT username, score FROM scores ORDER BY score DESC LIMIT 10&#39;</span><span class="p">)))</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Before: Redis distributed lock</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="k">with</span> <span class="n">redis</span><span class="o">.</span><span class="n">lock</span><span class="p">(</span><span class="s1">&#39;resource:1&#39;</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="n">process_resource</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># After: PostgreSQL advisory lock OR Consul session (db.advisory_lock is a hypothetical wrapper around pg_advisory_lock)</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="k">with</span> <span class="n">db</span><span class="o">.</span><span class="n">advisory_lock</span><span class="p">(</span><span class="n">resource_id</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">    <span class="n">process_resource</span><span class="p">()</span></span></span></code></pre></div><p>Every Redis-specific pattern needs a per-call-site refactor; swapping the SDK is not enough.</p>
<h2 id="migration-流程">Migration flow</h2>
<p>As with <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a>, this is a <em>partial migration</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">1. Audit the application code; list every Redis call site and the feature it uses
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">2. Plan the handling per feature:
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">   - Pure KV (GET/SET/DEL/TTL): cut over to Memcached directly
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">   - Hash → JSON + Memcached: per-call-site refactor
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">   - List/Sorted Set: decide whether it is a queue / leaderboard / something else, then pick the matching service
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">   - Pub/Sub: move to a message queue
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   - Lock: move to the DB or keep Redis
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">3. Cut over part of the application first (pure KV use cases)
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">4. Refactor complex use cases step by step onto their matching services
</span></span><span class="line"><span class="ln">10</span><span class="cl">5. Once Memcached is in production, Redis can shrink to a *narrow scope* (only the remaining Redis-specific features)
</span></span><span class="line"><span class="ln">11</span><span class="cl">   or be retired entirely (if the application has been refactored cleanly)
</span></span><span class="line"><span class="ln">12</span><span class="cl">6. Long-term mixed architecture: Memcached cache layer + DB persistent state + optional Redis (locks / specialty)</span></span></code></pre></div><p>Overall 3-12 months, depending on how deeply Redis-specific features are used.</p>
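<p>Step 1 of the flow above (listing every Redis call site and the feature it uses) can be roughed out with a small script. This is only a sketch under stated assumptions: the method-to-feature mapping is hypothetical and incomplete, and the regular expression assumes the client object is literally named <code>redis</code> at every call site, which a real codebase may not guarantee.</p>

```python
import re
from collections import Counter

# Hypothetical mapping from redis-py style method names to the feature
# families used in the plan above; extend it to match your own codebase.
FEATURE_BY_METHOD = {
    "get": "pure_kv", "set": "pure_kv", "delete": "pure_kv", "expire": "pure_kv",
    "hset": "hash", "hget": "hash", "hgetall": "hash",
    "lpush": "list", "rpush": "list", "lpop": "list", "rpop": "list",
    "zadd": "sorted_set", "zrevrange": "sorted_set",
    "publish": "pubsub", "pubsub": "pubsub",
    "lock": "lock",
    "eval": "lua",
}

# Assumes the client object is literally named `redis` at every call site.
CALL_PATTERN = re.compile(r"\bredis\.(\w+)\(")


def audit_call_sites(source):
    """Count Redis call sites per feature family in one source string."""
    counts = Counter()
    for method in CALL_PATTERN.findall(source):
        counts[FEATURE_BY_METHOD.get(method, "unclassified")] += 1
    return counts


sample = """
redis.hset('user:123', 'email', 'a@b.com')
user = redis.hgetall('user:123')
redis.set('page:home', html)
redis.zadd('leaderboard', {'alice': 100})
with redis.lock('resource:1', timeout=10):
    process_resource()
redis.xadd('events', {'k': 'v'})
"""

report = audit_call_sites(sample)
for family, count in sorted(report.items()):
    print(family, count)
```

<p>Anything that lands in <code>unclassified</code> (here the stream write) is a call site the plan has not covered yet, which is usually the most useful part of the report.</p>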
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1hash--json-後-getset-round-trip-變-n1">Case 1: after Hash → JSON, GET/SET round-trips multiply</h3>
<p><strong>Symptom</strong>: after cutover, application latency p99 rises from 5ms to 50ms; profiling shows that changing user.email now requires GET user object → modify → SET, so what was a single Redis <code>HSET</code> round-trip is now two.</p>
<p><strong>Root cause</strong>: a JSON-encoded value cannot be partially updated, so every single-field change becomes a read-modify-write.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Cache the JSON object in application memory</strong>: the read-modify-write still costs one SET, but the read is served from memory</li>
<li><strong>Compare-and-swap (CAS)</strong>: Memcached CAS prevents lost updates under concurrent writers</li>
<li><strong>Field-level cache key</strong>: split the hash into N Memcached keys (<code>user:123:email</code> / <code>user:123:name</code>) and avoid JSON altogether</li>
</ol>
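<p>The CAS option can be sketched as a retry loop. The <code>FakeMemcached</code> class below is an in-memory stand-in written for this example so that it runs without a server; it is modeled on the <code>gets</code> / <code>cas</code> contract of a pymemcache-style client (<code>cas</code> returning True on success, False when another writer got in first, None when the key is gone), and <code>update_field</code> is a hypothetical helper name, not part of any library.</p>

```python
import json


class FakeMemcached:
    """In-memory stand-in that mimics a pymemcache-style gets/cas contract."""

    def __init__(self):
        self._data = {}       # key -> (value, cas_token)
        self._next_cas = 0

    def set(self, key, value):
        self._next_cas += 1
        self._data[key] = (value, self._next_cas)
        return True

    def gets(self, key):
        # Returns (value, cas_token), or (None, None) on a miss.
        return self._data.get(key, (None, None))

    def cas(self, key, value, cas_token):
        if key not in self._data:
            return None       # key vanished (expired or evicted)
        if self._data[key][1] != cas_token:
            return False      # another writer updated the key first
        return self.set(key, value)


def update_field(mc, key, field, new_value, max_retries=5):
    """Read-modify-write one JSON field, retrying when a concurrent writer wins."""
    for _ in range(max_retries):
        raw, token = mc.gets(key)
        if raw is None:
            return False      # cache miss: let the caller reload from the DB
        doc = json.loads(raw)
        doc[field] = new_value
        if mc.cas(key, json.dumps(doc), token):
            return True
    return False


mc = FakeMemcached()
mc.set("user:123", json.dumps({"email": "a@b.com", "name": "Alice"}))
ok = update_field(mc, "user:123", "email", "alice@example.com")
user = json.loads(mc.gets("user:123")[0])
```

<p>CAS removes the lost-update risk but not the second round-trip, so it addresses correctness rather than the latency symptom; the in-memory or field-level options are the ones that cut round-trips.</p>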
<h3 id="case-2sorted-set-leaderboard-退化recomputation-cost-爆">Case 2: sorted set leaderboard degrades and recomputation cost explodes</h3>
<p><strong>Symptom</strong>: the Redis leaderboard served <code>ZADD</code> + <code>ZREVRANGE</code> in &lt; 1ms; after switching to a DB-backed leaderboard, <code>SELECT ... ORDER BY ... LIMIT 10</code> takes 100-500ms on 1M+ rows.</p>
<p><strong>Root cause</strong>: Memcached has no sorted set, so the leaderboard must be computed in the DB, and without an index on the sort column the sort gets slow as N grows.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Cache pre-computed top N</strong>: a scheduled job computes the top 100 from the DB every minute and writes it to Memcached; the application reads the cache instead of querying the DB directly</li>
<li><strong>Materialized view + index</strong>: use a materialized view plus an index on the DB side for millisecond-level queries</li>
<li><strong>Keep the Redis sorted set</strong>: leaderboards are a Redis strength and should not be pushed down to Memcached; go with a mixed architecture</li>
</ol>
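<p>The pre-computed top N option splits into a job side and a request side. The sketch below is illustrative only: <code>sqlite3</code> stands in for PostgreSQL and <code>DictCache</code> stands in for a Memcached client so that it runs anywhere, and the table, column, and function names are assumptions made for the example.</p>

```python
import json
import sqlite3


class DictCache:
    """In-memory stand-in for a Memcached client; it ignores the TTL."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, expire=0):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


def refresh_top_n(db, cache, n=10, ttl_seconds=120):
    """Scheduled-job side: query the DB once and publish the result to the cache."""
    rows = db.execute(
        "SELECT username, score FROM scores ORDER BY score DESC LIMIT ?", (n,)
    ).fetchall()
    cache.set("leaderboard:top%d" % n, json.dumps(rows), ttl_seconds)
    return rows


def read_top_n(cache, n=10):
    """Request side: read only from the cache; None means the job has not run yet."""
    raw = cache.get("leaderboard:top%d" % n)
    return None if raw is None else json.loads(raw)


# sqlite3 stands in for PostgreSQL here so the sketch runs anywhere.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scores (username TEXT PRIMARY KEY, score INTEGER)")
db.execute("CREATE INDEX idx_scores_score ON scores (score DESC)")
db.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("alice", 100), ("bob", 95), ("carol", 70)],
)

cache = DictCache()
refresh_top_n(db, cache, n=2)
top2 = read_top_n(cache, n=2)
```

<p>The TTL is deliberately longer than the one-minute job interval so that a slightly late run does not leave readers with an empty cache; the trade-off is that the leaderboard can be up to one interval stale, which the Redis sorted set never was.</p>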
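<h3 id="case-3pubsub-移除缺-fan-out-機制">Case 3: Pub/Sub removed, no fan-out mechanism left</h3>
<p><strong>Symptom</strong>: Redis Pub/Sub used to broadcast cache invalidations, and all N application instances received each invalidation message; after switching to Memcached the broadcast is gone and caches go stale.</p>
<p><strong>Root cause</strong>: Memcached has no Pub/Sub; the application needs an external fan-out mechanism.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>NATS / Redis Streams</strong>: every application instance subscribes on its own (a plain NATS subscription, or one consumer group per instance on Redis Streams) so that each instance receives every invalidation; instances sharing a single consumer group would split the messages between them instead of each seeing all of them</li>
<li><strong>Database trigger + LISTEN/NOTIFY</strong>: PostgreSQL <code>LISTEN/NOTIFY</code> is sufficient for medium-sized fan-out</li>
<li><strong>Architecture rethink</strong>: is broadcast invalidation really needed? A <em>TTL-based cache</em> plus <em>cache key versioning</em> usually covers most invalidation use cases</li>
</ol>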
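<p>Cache key versioning, mentioned in the last option, replaces the broadcast with a shared version counter: readers build their keys from the current version, and invalidation bumps it. The following is a minimal sketch with an in-memory <code>DictCache</code> stand-in; the <code>ver:</code> prefix and both helper names are assumptions made for illustration.</p>

```python
class DictCache:
    """In-memory stand-in for a Memcached client (values kept as str)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return True

    def add(self, key, value):
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def incr(self, key, delta):
        if key not in self._data:
            return None
        self._data[key] = str(int(self._data[key]) + delta)
        return int(self._data[key])


def versioned_key(cache, namespace, key):
    """Build the real cache key from the namespace's current version."""
    version = cache.get("ver:" + namespace)
    if version is None:
        cache.add("ver:" + namespace, "1")   # put-if-absent, safe under races
        version = cache.get("ver:" + namespace) or "1"
    return "%s:v%d:%s" % (namespace, int(version), key)


def invalidate_namespace(cache, namespace):
    """Bump the version so every instance stops reading the old keys."""
    cache.add("ver:" + namespace, "1")
    return cache.incr("ver:" + namespace, 1)


cache = DictCache()
old_key = versioned_key(cache, "user", "123")
cache.set(old_key, "cached profile")
invalidate_namespace(cache, "user")
new_key = versioned_key(cache, "user", "123")
```

<p>The cost is one extra lookup for the version on every read, and old entries are never deleted explicitly, so they should carry a TTL. If the version key itself is evicted the counter restarts at 1, which is another reason to keep TTLs on the data keys short.</p>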
<h3 id="case-4atomic-incr-沒對等race-condition">Case 4: atomic INCR is not equivalent, race condition</h3>
<p><strong>Symptom</strong>: a rate limiter / counter pattern moves to Memcached, and <code>mc.incr(key)</code> returns None when the key does not exist (no auto-init to 0); the application-side <code>if None: mc.set(key, 1)</code> is a race condition that occasionally resets the counter.</p>
<p><strong>Root cause</strong>: unlike Redis, Memcached INCR does not auto-initialize a missing key, and application-side init logic is prone to races.</p>
<p><strong>Fixes</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Use ADD (atomic put-if-absent)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="n">mc</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>  <span class="c1"># only sets if missing</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="n">mc</span><span class="o">.</span><span class="n">incr</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>    <span class="c1"># succeeds unless the key was evicted in between</span></span></span></code></pre></div><p><code>ADD</code> and <code>INCR</code> are each atomic, so concurrent initializers can no longer reset the counter; the only remaining gap is the key being evicted or expiring between the two calls, in which case <code>INCR</code> fails again and can simply be retried.</p>
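<p>A slightly more defensive variant tries <code>INCR</code> first (one round-trip on the common path) and falls back to <code>ADD</code> only on a miss, retrying a bounded number of times. This is a sketch, not a library API: <code>safe_incr</code> is a hypothetical helper, and <code>DictCache</code> is an in-memory stand-in whose <code>incr</code> returns None on a missing key, matching the symptom described above.</p>

```python
class DictCache:
    """In-memory stand-in for a Memcached client; it ignores the TTL."""

    def __init__(self):
        self._data = {}

    def add(self, key, value, expire=0):
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def incr(self, key, delta):
        if key not in self._data:
            return None              # same contract as the symptom above
        self._data[key] = str(int(self._data[key]) + delta)
        return int(self._data[key])

    def delete(self, key):
        self._data.pop(key, None)


def safe_incr(mc, key, ttl_seconds=60, attempts=3):
    """Increment a counter, initializing it with ADD when it is missing."""
    for _ in range(attempts):
        value = mc.incr(key, 1)
        if value is not None:
            return value
        mc.add(key, "0", ttl_seconds)   # loses harmlessly if another client won
    return None


mc = DictCache()
first = safe_incr(mc, "rate:client-42")
second = safe_incr(mc, "rate:client-42")
mc.delete("rate:client-42")              # simulate eviction or expiry
after_eviction = safe_incr(mc, "rate:client-42")
```

<p>With the TTL set on <code>ADD</code>, the counter behaves like a fixed-window rate limiter: the window starts when the key is first created and the count resets when it expires.</p>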
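<h3 id="case-5eviction-policy-差異production-cache-hit-rate-降">Case 5: eviction policy differences lower the production cache hit rate</h3>
<p><strong>Symptom</strong>: after cutover the cache hit rate drops from 95% to 80%; profiling shows important keys missing from the cache because new keys keep pushing out hot keys.</p>
<p><strong>Root cause</strong>: the Redis deployment ran with <code>maxmemory-policy allkeys-lfu</code> (least frequently used; this has to be configured, since the Redis default is <code>noeviction</code>), so long-lived hot keys were not evicted; Memcached only offers LRU, which goes purely by access recency, so cold keys touched in a burst push out long-tail hot keys.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Memory headroom</strong>: raise the Memcached memory limit by 30-50% to relieve eviction pressure</li>
<li><strong>Application-side cache priority</strong>: store critical keys <em>without expiration</em> and refresh them proactively (no expiration alone does not exempt a key from LRU eviction; the regular refresh is what keeps it recent)</li>
<li><strong>Keep Redis for LFU workloads</strong>: for long-tail hot key scenarios Redis LFU is the better fit and should not be downgraded to Memcached</li>
</ol>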
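<p>To tell whether added headroom is actually relieving eviction pressure, the hit rate and eviction count can be tracked across the cutover. The sketch below works on two snapshots of the <code>get_hits</code>, <code>get_misses</code>, and <code>evictions</code> counters that the Memcached <code>stats</code> command reports; it assumes the snapshots have already been fetched and decoded into plain dicts with string keys, and the numbers are made up to mirror the symptom above.</p>

```python
def cache_health(before, after):
    """Compare two stats snapshots and report hit rate and evictions in between."""
    hits = after["get_hits"] - before["get_hits"]
    misses = after["get_misses"] - before["get_misses"]
    lookups = hits + misses
    return {
        "hit_rate": hits / lookups if lookups else None,
        "evictions": after["evictions"] - before["evictions"],
    }


# Hypothetical snapshots taken one interval apart.
before = {"get_hits": 1000, "get_misses": 100, "evictions": 5}
after = {"get_hits": 1800, "get_misses": 300, "evictions": 45}
health = cache_health(before, after)
print(health)
```

<p>Working on deltas rather than the raw counters matters because the counters are cumulative since server start; a hit rate computed from the totals would hide a recent regression.</p>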
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Redis</th>
          <th>Memcached</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Memory efficiency</td>
          <td>baseline</td>
          <td>+10-20% (lower per-item metadata overhead)</td>
      </tr>
      <tr>
          <td>Throughput</td>
          <td>~100K ops/s single-thread</td>
          <td>~500K-1M ops/s multi-threaded</td>
      </tr>
      <tr>
          <td>Latency p99</td>
          <td>1-3ms</td>
          <td>0.5-1ms</td>
      </tr>
      <tr>
          <td>Persistence overhead</td>
          <td>5-15% CPU</td>
          <td>0</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.3-0.8</td>
          <td>0.1-0.3</td>
      </tr>
      <tr>
          <td>Application complexity</td>
          <td>Low (rich feature set)</td>
          <td>Higher (features move into the application)</td>
      </tr>
      <tr>
          <td>Cost per GB memory</td>
          <td>baseline</td>
          <td>Slightly lower (no persistence I/O / replication overhead)</td>
      </tr>
  </tbody>
</table>
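<p><strong>Reading</strong>: for pure cache use cases Memcached saves ops effort and slightly reduces cost; applications that already rely on Redis-specific features should not switch; a mixed architecture is the long-term default.</p>
<h2 id="整合--下一步">Integration / next steps</h2>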
<h3 id="跟-redis--dragonflydb-對比">Comparison with <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a></h3>
<p>Two paths:</p>
<ul>
<li>DragonflyDB: keeps the Redis paradigm and optimizes throughput + memory; the application does not need to change</li>
<li>Memcached: drops back to the pure cache paradigm; the application must change, but ops gets simpler</li>
</ul>
<p>The choice depends on <em>whether the Redis multi-paradigm features are really needed</em>: if they are, use DragonflyDB / Redis; if not, use Memcached.</p>
<h3 id="跟-nats-整合">Integration with <a href="/blog/backend/03-message-queue/vendors/nats/" data-link-title="NATS" data-link-desc="Lightweight messaging、JetStream 加持久化與 streams">NATS</a></h3>
<p>Once Redis Pub/Sub is removed, application-side fan-out / messaging needs move to NATS / Redis Streams / Kafka; the cross-linked migration playbook <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> provides a reference process for the paradigm shift.</p>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Memcached Cluster strategy</strong>: client-side consistent hashing vs proxy-based routing (Memcached itself has no server-side cluster mode), and the trade-off between ops simplicity and scalability</li>
<li><strong>Long-term mixed architecture</strong>: 80% Memcached + 20% Redis is a common stable state; Redis does not have to be eliminated completely</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis</a></li>
<li>Target vendor: <a href="/blog/backend/02-cache-redis/vendors/memcached/" data-link-title="Memcached" data-link-desc="純記憶體 key-value cache、無持久化">Memcached</a></li>
<li>Parallel migration playbook (Type E): <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> / <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">PostgreSQL → CockroachDB</a></li>
<li>Parallel Type B comparison: <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a> (paradigm retained)</li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL 5.7 → 8.0 Major Version Upgrade: character set / authentication / atomic DDL, three paradigms switching tracks at once</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/major-version-upgrade/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/major-version-upgrade/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> 內 version upgrade migration playbook、走 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type E paradigm shift 結構。&lt;/p>&lt;/blockquote>
&lt;p>5.7 → 8.0 看起來是 &lt;em>minor bump&lt;/em>（從 5.7.40 升到 8.0.36）、但不是。Oracle 把這個 release boundary 當成 &lt;em>清庫存的機會&lt;/em> — 同時推出 3 個 &lt;em>behavioral paradigm shift&lt;/em>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Paradigm&lt;/th>
 &lt;th>5.7 default&lt;/th>
 &lt;th>8.0 default&lt;/th>
 &lt;th>Impact&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Character set&lt;/td>
 &lt;td>latin1 / utf8（=utf8mb3）&lt;/td>
 &lt;td>utf8mb4&lt;/td>
 &lt;td>string column storage + emoji / 4-byte UTF-8&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Authentication plugin&lt;/td>
 &lt;td>mysql_native_password&lt;/td>
 &lt;td>caching_sha2_password&lt;/td>
 &lt;td>clients / libraries must support the new plugin&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DDL atomicity&lt;/td>
 &lt;td>Non-atomic (a crash leaves orphans)&lt;/td>
 &lt;td>Atomic (clean crash recovery)&lt;/td>
 &lt;td>developer confidence, crash recovery behavior&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>A mistake in upgrading &lt;em>any one&lt;/em> of these paradigms can take production down. All three change at once, so &lt;em>all three must be planned&lt;/em>.&lt;/p>
&lt;p>This upgrade is more work than &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &amp;#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">PostgreSQL major-version-upgrade&lt;/a>: a PG major upgrade is mainly the &lt;em>pg_upgrade&lt;/em> tool workflow, whereas MySQL requires a &lt;em>behavioral compatibility audit + full ecosystem review&lt;/em>.&lt;/p>
&lt;h2 id="為什麼是-type-e不是-minor-upgrade">Why Type E (not a minor upgrade)&lt;/h2>
&lt;p>Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff dimension audit&lt;/a>:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is the version upgrade migration playbook within <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a>, following the Type E paradigm shift structure of the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a>.</p></blockquote>
<p>5.7 → 8.0 looks like a <em>minor bump</em> (from 5.7.40 to 8.0.36), but it is not. Oracle treated this release boundary as <em>a chance to clear the backlog</em> and shipped 3 <em>behavioral paradigm shifts</em> at once:</p>
<table>
  <thead>
      <tr>
          <th>Paradigm</th>
          <th>5.7 default</th>
          <th>8.0 default</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Character set</td>
          <td>latin1 / utf8 (= utf8mb3)</td>
          <td>utf8mb4</td>
          <td>String column storage; emoji and other 4-byte UTF-8</td>
      </tr>
      <tr>
          <td>Authentication plugin</td>
          <td>mysql_native_password</td>
          <td>caching_sha2_password</td>
          <td>Clients and libraries must support the new plugin</td>
      </tr>
      <tr>
          <td>DDL atomicity</td>
          <td>Non-atomic (a crash can leave orphans)</td>
          <td>Atomic (clean crash recovery)</td>
          <td>Developer confidence, crash-recovery behavior</td>
      </tr>
  </tbody>
</table>
<p>A mistake on <em>any one</em> of these paradigms can take production down. All three change at once, so <em>all three must be planned</em>.</p>
<p>This upgrade is more work than a <a href="/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">PostgreSQL major-version-upgrade</a>: a PG major upgrade is mostly the <em>pg_upgrade</em> tool flow, whereas MySQL needs a <em>behavioral compatibility audit plus a full ecosystem review</em>.</p>
<h2 id="為什麼是-type-e不是-minor-upgrade">Why this is Type E (not a minor upgrade)</h2>
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">six-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Rating</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>Medium</td>
          <td>Same SQL dialect; new reserved keywords; default collation changes</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Medium-High</td>
          <td>The binary upgrade flow is simple, but auditing ecosystem tool compatibility is a lot of work</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>High</td>
          <td>Three default paradigm shifts (charset / auth / atomic DDL)</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low</td>
          <td>Same MySQL engine; no new components</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>Medium-High</td>
          <td>Client library, driver, and connection string may all need changes</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low</td>
          <td>Deployment topology unchanged</td>
      </tr>
  </tbody>
</table>
<p>Paradigm = High + App change = Medium-High → <strong>Type E paradigm shift</strong>.</p>
<p>Although this is a <em>major version from the same vendor</em>, the actual <em>difference in application behavior</em> spans several paradigms, so the 6-type framework still applies and the structure converges on a partial migration.</p>
<h2 id="4-phase-upgrade">4-phase upgrade</h2>
<h3 id="phase-1pre-check-audit">Phase 1: Pre-check audit</h3>
<p>Before upgrading to 8.0, run the <em>MySQL Shell upgrade checker</em> and a manual audit:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mysqlsh root@5.7-primary.example.com -- util check-for-server-upgrade</span></span></code></pre></div><p>The upgrade checker reports:</p>
<ul>
<li><em>Reserved keyword</em> conflicts (words that are not keywords in 5.7 but are in 8.0, such as <code>WINDOW</code> / <code>RANK</code> / <code>LATERAL</code>)</li>
<li>Uses of legacy character sets / collations (latin1 / utf8mb3)</li>
<li>Use of deprecated features (for example, relying on the implicit ORDER BY of GROUP BY)</li>
<li>Data type changes (subtle differences in DATETIME behavior)</li>
</ul>
<p>Manual audit:</p>
<ul>
<li>Whether each application's driver / library version supports caching_sha2_password</li>
<li>Any <code>default-authentication-plugin</code> setting in connection strings</li>
<li>Whether the ORM / framework assumes utf8 rather than utf8mb4</li>
</ul>
<p>Done when: there is a written <em>blocker list</em> (must be fixed before the upgrade) and a <em>warning list</em> (can be handled after the upgrade).</p>
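<p>The upgrade checker inspects schema objects, so raw SQL kept in application code still needs its own pass. A minimal sketch of such a pass follows, assuming the SQL can be piped in as text. The function name <code>find_unquoted_reserved</code> and the three-word list are hypothetical and only illustrate the idea; a real audit would use the full 8.0 reserved-word list and will also flag legitimate uses such as <code>RANK()</code> calls.</p>

```shell
#!/usr/bin/env bash
# Heuristic scan for identifiers that became reserved words in MySQL 8.0.
# RESERVED_WORDS is a small illustrative subset, not the complete 8.0 list.
RESERVED_WORDS='window|rank|lateral'

# Reads SQL text on stdin and prints "line_number:line" for every line that
# uses one of the words as a whole word without surrounding backticks.
find_unquoted_reserved() {
  pattern='(^|[^`[:alnum:]_])('"$RESERVED_WORDS"')([^`[:alnum:]_]|$)'
  # grep exits non-zero when nothing matches; treat that as a clean result.
  grep -n -i -E "$pattern" || true
}
```

<p>Piping a repository's <code>.sql</code> files or extracted query strings through the function gives a first candidate list for the blocker list; every hit still needs a human look.</p>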
<h3 id="phase-2shadow-upgrade--replica-升-80">Phase 2: Shadow upgrade, replica to 8.0</h3>
<p>Start with a <em>non-critical replica</em>. Upgrade one replica first and let it serve production (read-only) traffic for 2-4 weeks:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. Back up while the server is still running (xtrabackup --backup needs a live server)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">xtrabackup --backup --target-dir<span class="o">=</span>/backup/pre-upgrade
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 2. Stop the replica</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">systemctl stop mysql
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 3. Install the MySQL 8.0 binaries (apt / yum upgrade)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">apt-get install mysql-server-8.0
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 4. Start 8.0; the data dictionary is upgraded automatically</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">systemctl start mysql
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># 5. From 8.0.16 the server runs the upgrade steps itself at startup (the mysql_upgrade utility is deprecated)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># Only when the target is an 8.0 release before 8.0.16: run mysql_upgrade -u root -p by hand</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"># 6. Re-attach as a replica of the 5.7 primary (an 8.0 replica can follow a 5.7 primary); these are SQL statements, so send them through the mysql client</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">mysql -e <span class="s2">&#34;CHANGE MASTER TO MASTER_AUTO_POSITION=1;&#34;</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">mysql -e <span class="s2">&#34;START SLAVE;&#34;</span></span></span></code></pre></div><p>While it serves production read traffic, watch:</p>
<ul>
<li>Whether query results match 5.7 (especially anything character-set related)</li>
<li>Whether replication lag stays within the baseline range</li>
<li>Whether any 8.0-specific feature is needed (hash join / window functions, etc.)</li>
</ul>
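<p>The replication-lag check can be automated with a small gate. The sketch below assumes the output of <code>SHOW SLAVE STATUS\G</code> is piped in as text; the function name <code>lag_within_baseline</code> and the threshold value are hypothetical, and the baseline itself has to come from your own monitoring.</p>

```shell
#!/usr/bin/env bash
# Reads `SHOW SLAVE STATUS\G` output on stdin and takes the allowed lag in
# seconds as its first argument.
# Exit status: 0 = lag within the threshold, 1 = lag too high,
#              2 = field missing or NULL (replication is not running).
lag_within_baseline() {
  awk -v max="$1" '
    $1 == "Seconds_Behind_Master:" { seen = 1; lag = $2 }
    END {
      if (!seen || lag !~ /^[0-9]+$/) exit 2
      if (lag + 0 > max + 0) exit 1
      exit 0
    }'
}
```

<p>A possible use is <code>mysql -e 'SHOW SLAVE STATUS\G' | lag_within_baseline 30</code> from a cron job or health check, so the 2-4 week soak produces a pass / fail history rather than a visual impression.</p>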
<h3 id="phase-3promote-80-為-primary">Phase 3: Promote 8.0 to primary</h3>
<p>Once the shadow replica is confirmed stable:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. Upgrade the remaining replicas to 8.0</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"># (repeat the Phase 2 flow per replica)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 2. Switch applications to an 8.0-compatible driver</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># Add default-authentication-plugin=caching_sha2_password to the connection string,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># or stay on mysql_native_password (configured per user)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># 3. Failover: promote an 8.0 replica to primary</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Use Orchestrator or your own failover procedure</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># 4. Upgrade the old 5.7 primary to 8.0 and rejoin it as a replica</span></span></span></code></pre></div><p>Done when: every server runs 8.0 and applications connect to the 8.0 endpoint without errors.</p>
<h3 id="phase-4decommission-57--套用-80-paradigm">Phase 4: Decommission 5.7 + adopt the 8.0 paradigms</h3>
<p>Finishing the binary upgrade is not the real finish line; the paradigms still have to be migrated step by step:</p>
<ul>
<li>
<p><strong>Character set upgrade</strong>: convert legacy latin1 / utf8 tables to utf8mb4</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">CONVERT</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="nb">CHARACTER</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">utf8mb4</span><span class="w"> </span><span class="k">COLLATE</span><span class="w"> </span><span class="n">utf8mb4_0900_ai_ci</span><span class="p">;</span></span></span></code></pre></div><p>Run each table through gh-ost / pt-osc so the change does not block production</p>
</li>
<li>
<p><strong>Authentication upgrade</strong>: gradually move users from <code>mysql_native_password</code> to <code>caching_sha2_password</code></p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">USER</span><span class="w"> </span><span class="s1">&#39;app&#39;</span><span class="o">@</span><span class="s1">&#39;%&#39;</span><span class="w"> </span><span class="n">IDENTIFIED</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="n">caching_sha2_password</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="s1">&#39;new_password&#39;</span><span class="p">;</span></span></span></code></pre></div><p>Confirm first that the application driver supports the new plugin (most modern drivers do; legacy ones may need an upgrade)</p>
</li>
<li>
<p><strong>Reserved keyword handling</strong>: rename columns / tables whose names collide with new reserved words</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="o">`</span><span class="n">window</span><span class="o">`</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">event_window</span><span class="p">;</span></span></span></code></pre></div></li>
</ul>
<p>Most organizations sit at the end of Phase 3 for a long time: the paradigm migration is gradual, not a single big bang.</p>
<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-authentication-plugin--application-突然連不上">1. Authentication plugin: applications suddenly cannot connect</h3>
<p>After the upgrade, <em>newly created users</em> default to caching_sha2_password. Older application drivers (roughly, releases that predate MySQL 8.0) do not support it and fail to connect with <code>Authentication plugin 'caching_sha2_password' cannot be loaded</code>.</p>
<p>Fixes:</p>
<ul>
<li><em>Upgrade drivers first</em>: move every application's mysql-connector-* to a version that supports caching_sha2_password (most current releases do)</li>
<li>Short-term workaround: keep using <code>mysql_native_password</code> (create new users explicitly with <code>IDENTIFIED WITH mysql_native_password</code>)</li>
<li>Set <code>default_authentication_plugin=mysql_native_password</code> to force the old default</li>
</ul>
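<p>To track the workaround and the later Phase 4 cleanup, it helps to know which accounts are still on the old plugin. A minimal sketch, assuming the account list is exported as tab-separated rows with <code>mysql -N -B -e "SELECT user, host, plugin FROM mysql.user"</code>; the function name <code>list_native_password_users</code> is hypothetical.</p>

```shell
#!/usr/bin/env bash
# Reads tab-separated rows "user, host, plugin" on stdin (one account per
# line) and prints user@host for every account that still authenticates
# with mysql_native_password.
list_native_password_users() {
  awk -F '\t' '$3 == "mysql_native_password" { printf "%s@%s\n", $1, $2 }'
}
```

<p>An empty result is one possible completion signal for the authentication part of Phase 4; until then the list doubles as the inventory of applications whose drivers still need attention.</p>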
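<h3 id="2-character-set-4-byte-utf-8--emoji-進不去">2. 4-byte UTF-8 character set: emoji cannot be stored</h3>
<p>Columns that were latin1 or utf8 (= utf8mb3) in 5.7 <em>keep that character set</em> after the upgrade; nothing is converted to utf8mb4 automatically. When the application writes emoji (4-byte UTF-8), the value is <em>rejected</em> under strict SQL mode or <em>truncated with a warning</em> otherwise.</p>
<p>Fixes:</p>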
<ul>
<li><em>Convert table by table</em>: run <code>ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4</code> through gh-ost / pt-osc</li>
<li>Default new tables to utf8mb4 (set <code>character_set_server=utf8mb4</code>)</li>
<li>Keep the application's connection charset consistent (<code>character_set_client / connection / results</code>)</li>
</ul>
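<p>The table-by-table conversion needs a work list. The sketch below assumes tab-separated rows of schema, table, and table collation, for example exported from <code>information_schema.tables</code> with <code>mysql -N -B</code> and already filtered to application schemas. The function name <code>emit_utf8mb4_conversions</code> is hypothetical, and the statements it prints are meant for review or for feeding into an OSC tool, not for direct execution on a busy primary. It only looks at the table default collation, so columns with their own explicit charset need a separate check.</p>

```shell
#!/usr/bin/env bash
# Reads tab-separated rows "schema, table, table_collation" on stdin and
# prints one ALTER statement for every table whose default collation is not
# already a utf8mb4 collation.
emit_utf8mb4_conversions() {
  awk -F '\t' '$3 !~ /^utf8mb4_/ {
    printf "ALTER TABLE `%s`.`%s` CONVERT TO CHARACTER SET utf8mb4;\n", $1, $2
  }'
}
```

<p>Sorting the resulting list by table size before scheduling is a reasonable next step, since large tables dominate the conversion time.</p>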
<h3 id="3-reserved-keyword--application-query-突然-syntax-error">3. Reserved keywords: application queries suddenly hit syntax errors</h3>
<p>A query that runs fine on 5.7:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">window</span><span class="p">,</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span></span></span></code></pre></div><p>8.0 rejects it: <code>window</code> and <code>rank</code> are both reserved keywords and must be backtick-quoted:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">`</span><span class="n">window</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="o">`</span><span class="n">rank</span><span class="o">`</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span></span></span></code></pre></div><p>Fixes:</p>
<ul>
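<li>The Phase 1 upgrade checker has already flagged the conflicting names; fix the SQL during application code review</li>
<li>Prefer a policy of <em>backtick-quoting every table / column name</em> (always quote, so future reserved words cannot collide)</li>
<li>Most ORMs add backticks automatically; raw SQL is where this bites</li>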
</ul>
<h3 id="4-group-replication--新-feature-開了就不能-rollback">4. Group Replication and other new features: once enabled, no rollback</h3>
<p>After the upgrade it is <em>tempting to adopt 8.0-only features</em>:</p>
<ul>
<li>Group Replication (also in 5.7, but more stable in 8.0)</li>
<li>Resource Groups (not in 5.7)</li>
<li>Histograms (not in 5.7)</li>
<li>CTEs / window functions (not in 5.7)</li>
</ul>
<p>Once the application uses these features, rolling back to 5.7 is off the table (the feature does not exist there, so the queries fail).</p>
<p>Fixes:</p>
<ul>
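<li><em>Ban 8.0-only features during Phases 1-3</em> to keep the rollback option open</li>
<li>Start evaluating 8.0-only features only after <em>Phase 4 is complete</em> and the system has run stably for 30+ days</li>
<li>When adopting an 8.0-only feature, <em>record explicitly that it rules out rollback</em></li>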
</ul>
<h3 id="5-collation-default-變動--sort-order-跟-unique-行為改變">5. Collation default change: sort order and unique behavior shift</h3>
<p>The default collation for utf8mb4 is <code>utf8mb4_general_ci</code> in 5.7 and <code>utf8mb4_0900_ai_ci</code> in 8.0. The two sort differently:</p>
<ul>
<li><code>utf8mb4_general_ci</code>: a simplified collation that does not strictly follow Unicode</li>
<li><code>utf8mb4_0900_ai_ci</code>: based on Unicode 9.0; accent-insensitive and case-insensitive</li>
</ul>
<p>For <em>existing tables</em> the upgrade does not change the collation (the 5.7 setting is kept). <em>New tables</em>, however, default to 0900_ai_ci, and a UNION / JOIN across columns with different collations can fail with <code>Illegal mix of collations</code>.</p>
<p>Fixes:</p>
<ul>
<li>Unify the collation: either <em>move every table to 0900_ai_ci</em> or <em>keep every table on general_ci</em></li>
<li>Run the schema migration through an OSC tool</li>
<li>Validate sort-dependent application logic (leaderboards / search ranking) against the new collation</li>
</ul>
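<p>Before choosing which way to unify, it is useful to see where collations are already mixed. A minimal sketch, assuming tab-separated rows of schema, table, column, and collation (for example from <code>information_schema.columns</code>, restricted to rows with a non-NULL collation); the function name <code>report_mixed_collations</code> is hypothetical.</p>

```shell
#!/usr/bin/env bash
# Reads tab-separated rows "schema, table, column, collation" on stdin and
# prints every schema that uses more than one distinct collation, together
# with the number of distinct collations found.
report_mixed_collations() {
  awk -F '\t' '
    !(($1, $4) in seen) { seen[$1, $4] = 1; count[$1]++ }
    END { for (s in count) if (count[s] > 1) print s ": " count[s] " collations" }
  ' | sort
}
```

<p>Schemas that show up here are the ones where a cross-table JOIN or UNION can hit <code>Illegal mix of collations</code>, so they are natural candidates to unify first.</p>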
<h2 id="capability-gap57-有但-80-沒有">Capability gap: in 5.7 but not in 8.0</h2>
<p>A few capabilities that 8.0 <em>takes away</em>:</p>
<ul>
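<li><strong>Query Cache</strong>: built into 5.7 (already deprecated there) and <em>removed entirely</em> in 8.0. The query cache tends to slow things down under high concurrency, so its removal is a net win</li>
<li><strong>MEMORY engine for internal temporary tables</strong>: 8.0 defaults in-memory internal temporary tables to the new TempTable engine instead of MEMORY, so tuning built around MEMORY behaves differently</li>
<li><strong>Some MyISAM optimizations</strong>: 8.0 is InnoDB-first (the system tables moved to InnoDB), and MyISAM-specific workflows break</li>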
</ul>
<p>For Query Cache users: before upgrading, assess how much you depend on it and consider an application-side cache (Redis).</p>
<h2 id="容量與成本對照">Capacity and cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Item</th>
          <th>5.7</th>
          <th>8.0</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cost</td>
          <td>Free (CE) / Enterprise</td>
          <td>Free (CE) / Enterprise</td>
      </tr>
      <tr>
          <td>Hosts to upgrade × time</td>
          <td>-</td>
          <td>~30 minutes of binary upgrade per instance</td>
      </tr>
      <tr>
          <td>Application changes</td>
          <td>-</td>
          <td>driver upgrade + SQL review</td>
      </tr>
      <tr>
          <td>Character set conversion</td>
          <td>-</td>
          <td>Per-table OSC; hours for large tables</td>
      </tr>
      <tr>
          <td>Ops headcount</td>
          <td>-</td>
          <td>1-2 DBAs × 2-4 weeks</td>
      </tr>
      <tr>
          <td>Impact on production</td>
          <td>-</td>
          <td>Gradual upgrade across Phases 2-3; no major downtime</td>
      </tr>
  </tbody>
</table>
<p>The 5.7 → 8.0 upgrade costs on the order of <em>1-2 FTE-months</em> overall. Mid-sized deployments (100+ databases) may need more.</p>
<h2 id="何時不升">When not to upgrade</h2>
<ul>
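<li><strong>Heavy reliance on Query Cache</strong>: it is gone in 8.0, so the application needs rework</li>
<li><strong>A driver that cannot be upgraded</strong>: a legacy enterprise application on a ten-year-old driver whose vendor has folded, with no 8.0-compatible release available</li>
<li><strong>Compliance freeze</strong>: some finance / healthcare environments freeze technology for years, and an upgrade triggers re-audit and recertification</li>
<li><strong>Staying on 5.7 after its EOL (2023-10)</strong>: this is the counter-case; the security risk is high and the upgrade should be <em>prioritized</em></li>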
</ul>
<h2 id="跟-postgresql-major-version-upgrade-對比">Comparison with PostgreSQL Major Version Upgrade</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>MySQL 5.7 → 8.0</th>
          <th>PostgreSQL N → N+1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tool</td>
          <td>Binary upgrade + automatic server upgrade (8.0.16+; earlier releases use mysql_upgrade)</td>
          <td>pg_upgrade (in-place)</td>
      </tr>
      <tr>
          <td>Downtime</td>
          <td>&lt; 5 minutes per instance (binary + DD upgrade)</td>
          <td>&lt; 1 minute per instance (pg_upgrade)</td>
      </tr>
      <tr>
          <td>Paradigm shift</td>
          <td>Three (charset / auth / atomic DDL)</td>
          <td>Usually 0-1 (PG majors mostly preserve compatibility)</td>
      </tr>
      <tr>
          <td>Required app changes</td>
          <td>Many (driver + queries)</td>
          <td>Few (most queries stay compatible)</td>
      </tr>
      <tr>
          <td>Risk</td>
          <td>High (many paradigms)</td>
          <td>Medium to low</td>
      </tr>
      <tr>
          <td>Rollback</td>
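          <td>Not in place (once 8.0 has upgraded the data dictionary, 5.7 cannot read the data directory)</td>
          <td>Limited (in copy mode the old cluster survives; with --link it is unusable once the new cluster starts)</td>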
      </tr>
  </tbody>
</table>
<p>A PG major upgrade is simpler than MySQL's. MySQL 5.7 → 8.0 is a <em>special case</em>: Oracle cleared years of deprecations in one release. 8.0 → 8.4 / 9.x is expected to be smoother.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-replication-topology">With replication topology</h3>
<p>An 8.0 replica can attach to a 5.7 primary, but a 5.7 replica <em>cannot attach to an 8.0 primary</em>. The upgrade order must therefore be <em>replicas first, primary last</em>. See <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>.</p>
<h3 id="跟-innodb-tuning">With InnoDB tuning</h3>
<p>8.0 reworked the InnoDB redo log. From 8.0.30 its size is set through <code>innodb_redo_log_capacity</code>, which supersedes <code>innodb_log_file_size</code> and can be <em>changed online</em> without a restart. See <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>.</p>
<h3 id="跟-modern-sql-features">With modern SQL features</h3>
<p>8.0 adds CTEs / window functions / JSON_TABLE / hash join, which is the main <em>reason to upgrade to 8.0</em>. See <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">Modern SQL Features</a>.</p>
<h3 id="跟-group-replication">With Group Replication</h3>
<p>GR exists in 5.7 but only matures in 8.0. The full <em>MySQL Shell + Router</em> stack around Group Replication is mostly complete only in 8.0. See <a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">Group Replication</a>.</p>
<h3 id="跟-aurora--planetscale-等-managed">With managed services such as Aurora / PlanetScale</h3>
<p>Upgrading from 5.7 to 8.0 is a good moment to <em>also evaluate</em> a move to Aurora / PlanetScale: if you are doing a paradigm shift anyway, you may as well do it in one go. See <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">migrate-to-aurora</a> / <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">migrate-to-planetscale</a>.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a> (replica-first upgrade order)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a> (the main driver for moving to 8.0)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a> (mature in 8.0)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">MySQL InnoDB Tuning</a> (the 8.0 redo log rework)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">migrate-to-aurora</a> / <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">migrate-to-planetscale</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">PostgreSQL Major Version Upgrade</a>（PG sibling）</li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a> (Type E paradigm shift)</li>
<li>Official: <a href="https://dev.mysql.com/doc/refman/8.0/en/upgrading.html">MySQL 8.0 Upgrade Guide</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL → PlanetScale: the hybrid shift of managed Vitess + branch-based schema workflow</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook, cross-linked to &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> and PlanetScale. It follows the Type E paradigm shift structure of the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a>.&lt;/p>&lt;/blockquote>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>Self-managed MySQL&lt;/th>
 &lt;th>PlanetScale&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Sharding&lt;/td>
 &lt;td>Configure Vitess yourself, or do not shard&lt;/td>
 &lt;td>Vitess, transparently (even a single keyspace goes through Vitess)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema migration&lt;/td>
 &lt;td>Run ALTER through gh-ost / pt-osc&lt;/td>
 &lt;td>&lt;strong>Branch + Deploy Request&lt;/strong> workflow&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failover&lt;/td>
 &lt;td>Self-managed with Orchestrator&lt;/td>
 &lt;td>Automatic, handled by PlanetScale&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Branching&lt;/td>
 &lt;td>No such concept&lt;/td>
 &lt;td>&lt;strong>DB branch（git-like）+ revert&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection limit&lt;/td>
 &lt;td>Set max_connections yourself&lt;/td>
 &lt;td>PlanetScale connection pool / per-plan limit&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Foreign key&lt;/td>
 &lt;td>Supported&lt;/td>
 &lt;td>Limited support (Vitess 18+ / since 2023; must be enabled explicitly)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>SUPER&lt;/code> privilege&lt;/td>
 &lt;td>You hold it&lt;/td>
 &lt;td>&lt;strong>None&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-region&lt;/td>
 &lt;td>Set up binlog shipping yourself&lt;/td>
 &lt;td>Built into PlanetScale (read-only regions)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Per-month cost&lt;/td>
 &lt;td>EC2 + EBS + ops&lt;/td>
 &lt;td>per-row-read + per-row-written + storage&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>From the &lt;em>application connection&lt;/em> point of view, the effort is as low as for an &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">Aurora MySQL migration&lt;/a>: swap the connection string and you are done. From the &lt;em>schema management&lt;/em> point of view, PlanetScale pushes a &lt;em>branch-based workflow&lt;/em> hard: a schema change is no longer "run gh-ost" but "open a branch → Deploy Request → review → merge". The whole schema change flow has the same shape as git and shares a workflow with application code review.&lt;/p>
&lt;p>This is a &lt;em>workflow + schema-tooling shift&lt;/em>: Aurora is "same workflow + managed", while PlanetScale is "same protocol + different schema workflow + branch tooling". The database paradigm (OLTP relational) and the application change are both Low; the main shift is in the operating interface for DBAs and developers.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook, cross-linked to <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> and PlanetScale. It follows the Type E paradigm shift structure of the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a>.</p></blockquote>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed MySQL</th>
          <th>PlanetScale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sharding</td>
          <td>Configure Vitess yourself, or do not shard</td>
          <td>Vitess, transparently (even a single keyspace goes through Vitess)</td>
      </tr>
      <tr>
          <td>Schema migration</td>
          <td>gh-ost / pt-osc 跑 ALTER</td>
          <td><strong>Branch + Deploy Request</strong> workflow</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Orchestrator 自管</td>
          <td>PlanetScale 自動</td>
      </tr>
      <tr>
          <td>Branching</td>
          <td>不存在概念</td>
          <td><strong>DB branch（git-like）+ revert</strong></td>
      </tr>
      <tr>
          <td>Connection limit</td>
          <td>max_connections 自己設</td>
          <td>PlanetScale connection pool / per-plan limit</td>
      </tr>
      <tr>
          <td>Foreign key</td>
          <td>支援</td>
          <td>有限支援（Vitess 18+ / 2023 起、需明確啟用）</td>
      </tr>
      <tr>
          <td><code>SUPER</code> privilege</td>
          <td>自己有</td>
          <td><strong>無</strong></td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>自己配 binlog ship</td>
          <td>PlanetScale 內建（Boost feature）</td>
      </tr>
      <tr>
          <td>Per-month cost</td>
          <td>EC2 + EBS + ops</td>
          <td>per-row-read + per-row-written + storage</td>
      </tr>
  </tbody>
</table>
<p>From the <em>application connection</em> perspective, the effort is as low as the <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">Aurora MySQL migration</a>: swap the connection string and the application side is done. From the <em>schema management</em> perspective, PlanetScale pushes hard toward a <em>branch-based workflow</em>: a schema change is no longer "run gh-ost" but "open a branch → Deploy Request → review → merge". The whole schema change workflow has the same shape as git and shares the workflow used for application code review.</p>
<p>This is a <em>workflow + schema-tooling shift</em>: Aurora is "same workflow + managed", while PlanetScale is "same protocol + different schema workflow + branch tooling". The database paradigm (OLTP relational) and the application change are both Low; the main shift is in the interface that DBAs and developers operate.</p>
<h2 id="為什麼是-type-eparadigm--operational--schema-多軸">Why Type E (multi-axis: Paradigm + Operational + Schema)</h2>
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Rating</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>Medium-High</td>
          <td>MySQL wire protocol is identical; FK support is limited (Vitess 18+); some INSTANT DDL behavior differs</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>High</td>
          <td>Branch lifecycle, Deploy Request workflow, and connection pooler all differ</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>High</td>
          <td>Branch-based schema management, a completely different mindset from self-managed gh-ost / pt-osc</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Medium</td>
          <td>PlanetScale CLI / Console / API / connection pooler all join the team's toolset</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>Low</td>
          <td>Connection string + FK behavior verification (application-level handling only for cross-shard cascades)</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low-Medium</td>
          <td>Vitess transparent sharding, even with a single keyspace</td>
      </tr>
  </tbody>
</table>
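<p>Paradigm and Operational rate High, and Schema rates Medium-High. By the priority order Schema &gt; Paradigm &gt; Operational, the default choice would be Type A. But what <em>readers care about most</em> is the paradigm change in schema workflow, not schema field translation, and the Type E structure is a better fit for the real migration flow of "no convergence, partial adoption".</p>
<p>→ <strong>Type E paradigm shift</strong>, with a 4-phase partial migration (most orgs stop in a Phase 2-3 hybrid).</p>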
<h2 id="driverbranch-based-workflow--vitess-transparent-sharding--zero-dba">Driver: Branch-based workflow + Vitess transparent sharding + zero DBA</h2>
<p>There are three core drivers for migrating from self-managed MySQL to PlanetScale:</p>
<p><strong>Branch-based schema workflow</strong>:</p>
<ul>
<li>To change a schema, open a branch (<code>pscale branch create</code>), run the ALTER on that branch, update the application code against it, and put the change through Deploy Request review before it merges into main</li>
<li>A Deploy Request shows the schema diff, the same idea as a GitHub PR</li>
<li>After merge, PlanetScale automatically runs a <em>no-downtime schema migration</em> (VReplication internally)</li>
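<li>If something goes wrong you can <em>revert</em> within a limited window after the deploy (PlanetScale documents a 30-minute revert window), with Vitess VReplication shipping the data in reverse</li>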
</ul>
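<p>This workflow is a strong pull on <em>developer ergonomics</em>: a schema change is no longer "a DBA job" but something developers handle themselves, in the same flow as code review.</p>
<p><strong>Vitess transparent sharding</strong>:</p>
<ul>
<li>PlanetScale forces every cluster through Vitess (even a single keyspace that looks unsharded)</li>
<li>When write throughput grows to the point of needing shards, adding shards is an internal PlanetScale operation that the application does not see</li>
<li>There is no need to staff a Vitess SRE team</li>
</ul>
<p><strong>Zero DBA</strong>:</p>
<ul>
<li>PlanetScale takes over all ops (failover / backup / parameter / scaling)</li>
<li>The same level of "managed" as Aurora, plus the branch workflow</li>
</ul>
<p>FK handling: early Vitess (&lt; 18) did not support FKs, and during that period PlanetScale recommended dropping all FKs and enforcing them in the application. Vitess 18 (late 2023) added FK support, which PlanetScale lets you enable on suitable plans, but <em>cross-shard FKs</em> remain restricted. The Phase 1 audit therefore no longer focuses on "drop all FKs" but on "verifying that FK behavior (especially cascade / cross-shard) matches what self-managed MySQL would do".</p>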
<h2 id="4-phase-partial-migration不收斂">4-phase partial migration (non-converging)</h2>
<h3 id="phase-1fk-行為驗證--schema-auditplanetscale-shadow-cluster-起來">Phase 1: FK behavior verification + schema audit, bring up the PlanetScale shadow cluster</h3>
<p>The first step is <em>FK behavior verification</em> plus a schema layout audit. Vitess 18+ / PlanetScale supports FKs, but the behavior differs from self-managed MySQL:</p>
<ul>
<li>List every FK: <code>SELECT * FROM information_schema.KEY_COLUMN_USAGE WHERE REFERENCED_TABLE_NAME IS NOT NULL</code></li>
<li>For each FK, evaluate:
<ul>
<li><em>Cross-shard FK</em>: PlanetScale does not allow an FK to span shards; parent and child must live on the same shard (through Vindex design)</li>
<li><em>Cascade behavior</em>: a cross-shard DELETE cascade is not executed on PlanetScale and has to be handled in the application layer</li>
<li><em>Native FK vs application enforcement</em>: decide whether to keep the FK or move it to the app level based on Vitess 18+ behavior</li>
</ul>
</li>
<li>Bring up the <em>PlanetScale shadow cluster</em>, load the application schema, and use Vitess Connector to ship data from the self-managed binlog</li>
</ul>
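<p>Part of this audit can be mechanized. The following is a minimal sketch, assuming the FK list has already been exported from <code>information_schema.KEY_COLUMN_USAGE</code> and that each table's sharding column is known. The <code>ForeignKey</code> type, <code>classify_foreign_keys</code>, and the <code>shard_columns</code> mapping are hypothetical names for illustration, not part of Vitess or PlanetScale. The check is deliberately conservative: anything it cannot prove co-located goes to manual review.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignKey:
    child_table: str
    child_column: str
    parent_table: str
    parent_column: str


def classify_foreign_keys(fks, shard_columns):
    """Split FKs into provably co-located ones and ones needing manual review.

    shard_columns maps table name -> sharding column (hypothetical input;
    tables missing from the mapping are treated as unsharded).
    """
    colocated, needs_review = [], []
    for fk in fks:
        child_key = shard_columns.get(fk.child_table)
        parent_key = shard_columns.get(fk.parent_table)
        if child_key is None and parent_key is None:
            # Both tables unsharded: the FK never crosses a shard boundary.
            colocated.append(fk)
        elif child_key == fk.child_column and parent_key == fk.parent_column:
            # Child is sharded on the FK column and parent on the referenced
            # column, so matching rows land on the same shard (assuming both
            # tables use the same sharding function).
            colocated.append(fk)
        else:
            needs_review.append(fk)
    return colocated, needs_review


# Illustrative schema, not taken from the article.
fks = [
    ForeignKey("orders", "user_id", "users", "id"),
    ForeignKey("order_items", "order_id", "orders", "id"),
    ForeignKey("audit_log", "actor_id", "staff", "id"),
]
shard_columns = {"users": "id", "orders": "user_id", "order_items": "user_id"}
colocated, needs_review = classify_foreign_keys(fks, shard_columns)
print([fk.child_table for fk in needs_review])
```

<p>In this example <code>order_items</code> is flagged even though it shares the <code>user_id</code> sharding column with <code>orders</code>: the heuristic cannot prove that an item and its order carry the same <code>user_id</code>, so a person has to confirm it.</p>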
<p>The main blocks of work:</p>
<ul>
<li>FK behavior audit + reworking cross-shard cascades (weeks of work, depending on the number of FKs)</li>
<li>Schema dump → PlanetScale import (with <code>pscale shell</code>)</li>
<li>Configure the Vitess Connector binlog stream</li>
</ul>
<p>Exit criteria: the PlanetScale shadow cluster holds the complete production schema, cross-shard FKs have been handled, and binlog stream lag is &lt; 1 second.</p>
<h3 id="phase-2read-traffic-切-planetscale">Phase 2: Switch read traffic to PlanetScale</h3>
<p>Same idea as Phase 2 of the <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">Aurora migration</a>: read queries switch to the PlanetScale connection string while writes stay on self-managed MySQL.</p>
<p>Differences:</p>
<ul>
<li>PlanetScale connections have a <em>per-plan limit</em> (Scaler Plan: 10K connections, Enterprise: 100K)</li>
<li>Traffic must go through the <em>PlanetScale connection pool</em> (not a direct connection; there is SSL handshake overhead)</li>
<li>Monitor <code>pscale_io_read_query_throttled_total</code> to confirm the plan limit is not being hit</li>
</ul>
<p>Run this for 2-4 weeks and confirm:</p>
<ul>
<li>PlanetScale read latency is close to self-managed replica latency (with the PlanetScale Boost cache it may be faster than self-managed)</li>
<li>The Vitess Connector stream is stable</li>
<li>The application's PlanetScale row-read volume matches the cost forecast</li>
</ul>
<h3 id="phase-3schema-workflow-切-planetscale--write-cutover">Phase 3: Move the schema workflow to PlanetScale + write cutover</h3>
<p>The key paradigm shift: <em>stop using gh-ost / pt-osc</em> and switch to the PlanetScale branch workflow.</p>
<p>Training steps:</p>
<ol>
<li>Run the <em>first small schema change</em> through a PlanetScale branch + Deploy Request</li>
<li>Get the development team comfortable with the <code>pscale branch create</code> / <code>pscale deploy-request create</code> CLI</li>
<li>CI integration: add the PlanetScale CLI to the deploy pipeline</li>
<li>Retire the gh-ost / pt-osc CI integration</li>
</ol>
<p>Once schema workflow training is complete, do the write cutover:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 1. Promote the PlanetScale shadow cluster to primary (via the PlanetScale console / API)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    Enable production writes in the PlanetScale Console, or use the matching `pscale` CLI promotion command</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    (CLI command names change across pscale versions; treat `pscale --help` as authoritative)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Switch the application connection string to the PlanetScale writer</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    self-managed → mysql://primary.example.com:3306/production</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    PlanetScale  → mysql://...@xxx.connect.psdb.cloud/production?sslaccept=strict</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Reverse the Vitess Connector (PlanetScale → self-managed) as rollback insurance</span></span></span></code></pre></div><p>Exit criteria: 100% of write traffic goes to PlanetScale, and self-managed MySQL consumes the PlanetScale binlog (rollback buffer).</p>
<h3 id="phase-4自管-mysql-退役--保留作-rollback-buffer">Phase 4: Retire self-managed MySQL / keep it as a rollback buffer</h3>
<p>Same pattern as Phase 4 of the Aurora migration:</p>
<ul>
<li>Keep the self-managed cluster for 30-90 days as a cold buffer</li>
<li>Confirm that actual PlanetScale cost matches the forecast (per-row read / write billing can exceed expectations)</li>
<li>Confirm the branch workflow has actually been adopted by the production team (not the stuck state of "PlanetScale is in use, but the team still runs gh-ost on staging")</li>
</ul>
<p>Most orgs stay in <em>Phase 3</em> longer (six months to a year): the reverse Vitess Connector binlog shipping is a stable rollback path, so Phase 4 is not urgent.</p>
<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-cross-shard-fk--planetscale-跟-native-mysql-行為不同">1. Cross-shard FK: PlanetScale behaves differently from native MySQL</h3>
<p>Vitess 18+ / PlanetScale supports FKs, but <em>cross-shard cascades</em> are not executed. Within a shard, FKs behave like native MySQL; when parent and child sit on different shards, <code>ON DELETE CASCADE</code> does not trigger the child delete across shards on PlanetScale, and the application ends up seeing <em>orphan rows</em>.</p>
<p>Fixes:</p>
<ul>
<li>In the Phase 1 audit, identify which FKs cross shards (Vindex design decides whether parent and child share a shard)</li>
<li>Same-shard FKs: keep them as-is; behavior matches self-managed MySQL</li>
<li>Cross-shard cascades: move to an explicit child DELETE inside an application-level transaction, or a <em>background reconciliation job</em> (periodically scanning for orphans)</li>
<li><em>Forcing parent and child onto the same shard</em> (using the same Vindex column) is the root fix that prevents cross-shard FK issues</li>
</ul>
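<p>The background reconciliation job mentioned above boils down to an anti-join between child and parent. Here is a minimal sketch using an in-memory SQLite database so it runs anywhere; the <code>find_orphans</code> helper and the <code>users</code> / <code>orders</code> tables are illustrative assumptions. Against a sharded keyspace the same query becomes a scatter query, so a real job would likely scan in batches by key range.</p>

```python
import sqlite3


def find_orphans(conn, child, fk_column, parent, parent_pk="id"):
    """Return child primary keys whose referenced parent row does not exist."""
    # Identifiers are interpolated, so they must come from trusted config,
    # never from user input.
    query = (
        f"SELECT c.id FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk_column} = p.{parent_pk} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{parent_pk} IS NULL "
        f"ORDER BY c.id"
    )
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users (id) VALUES (1), (2);
    INSERT INTO orders (id, user_id) VALUES (10, 1), (11, 2), (12, 3), (13, NULL);
    """
)
before = find_orphans(conn, "orders", "user_id", "users")

# Simulate a parent delete whose cascade never reached the child table.
conn.execute("DELETE FROM users WHERE id = 2")
after = find_orphans(conn, "orders", "user_id", "users")
print(before, after)
```

<p>Rows with a NULL foreign key are excluded on purpose: they reference nothing, so they are not orphans. What the job does with the result (delete, archive, or alert) is a policy decision for the team.</p>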
<h3 id="2-deploy-request-思維轉換不到位--團隊仍用跑-alter心智模型">2. Incomplete Deploy Request mindset shift: the team still thinks in "run ALTER"</h3>
<p>DBAs / SREs are used to <em>connecting directly to PlanetScale and running ALTER</em>, but PlanetScale <em>blocks DDL</em> on production branches (changes must go through a Deploy Request). The failure message is <em>not actionable</em> (ERROR: not authorized), the DBA cannot find the cause, and production maintenance stalls.</p>
<p>Fixes:</p>
<ul>
<li>The Phase 3 <em>training steps</em> cannot be skipped: take one small schema change through the full branch workflow on staging, and make sure every DBA / SRE on the team has done it hands-on</li>
<li>State in the ops runbook that <em>production schema change must go through Deploy Request</em>, and include CLI command templates</li>
<li>Emergency schema changes (mid-incident) also go through branch + Deploy Request; PlanetScale can expedite the deploy, but the workflow cannot be bypassed</li>
</ul>
<h3 id="3-schema-diff-邊界--planetscale-看不到-application-level-insert-changes">3. Schema diff boundary: PlanetScale does not see application-level INSERT changes</h3>
<p>A Deploy Request shows the <em>schema-level diff</em> (CREATE / ALTER / DROP), not a <em>data diff</em>. If rows were INSERTed on the branch (test data / seed data), <em>the data is not carried over</em> when merging into main (only the schema is), so the application expects data that production does not have.</p>
<p>Fixes:</p>
<ul>
<li>Put <em>seed data INSERTs</em> in application migrations / fixtures, not inside the PlanetScale branch</li>
<li>Use the PlanetScale CLI to <em>export branch data</em> and <em>import to main</em> (a manual operation) as an escape hatch</li>
<li>Educate the team: a PlanetScale branch is a <em>schema branch</em>, not a git-like <em>data branch</em></li>
</ul>
<h3 id="4-branch-lifecycle-ops-cost--100-個-stale-branch">4. Branch lifecycle ops cost: 100 stale branches</h3>
<p>Every PR opens a PlanetScale branch, nobody deletes it after the PR merges, and 100 stale branches pile up. Each branch incurs storage cost, and the PlanetScale plan also caps the number of branches.</p>
<p>Fixes:</p>
<ul>
<li>CI integration: on PR close, automatically run <code>pscale branch delete &lt;database&gt; &lt;branch-name&gt;</code></li>
<li>Set a <em>branch retention policy</em> (auto-delete after 30 days of inactivity)</li>
<li>Monitor the count from <code>pscale branch list &lt;database&gt; | wc -l</code> and alert above a threshold</li>
<li>Write the branch lifecycle into the <em>team playbook</em> (PlanetScale will not teach this; it is an internal team norm)</li>
</ul>
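<p>The retention policy above can be expressed as a small pure function that a scheduled CI job calls before deleting anything. This is a sketch under stated assumptions: the branch records are plain dicts with hypothetical <code>name</code> and <code>last_activity</code> fields that you would populate from whatever the PlanetScale CLI or API actually returns, and the 30-day threshold comes from the list above.</p>

```python
from datetime import datetime, timedelta, timezone


def stale_branches(branches, now, max_idle_days=30, protected=("main",)):
    """Return sorted names of unprotected branches idle longer than max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(
        b["name"]
        for b in branches
        if b["name"] not in protected and b["last_activity"] < cutoff
    )


# Illustrative data; field names are assumptions, not a PlanetScale schema.
now = datetime(2026, 6, 16, tzinfo=timezone.utc)
branches = [
    {"name": "main", "last_activity": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"name": "add-index", "last_activity": datetime(2026, 6, 10, tzinfo=timezone.utc)},
    {"name": "old-experiment", "last_activity": datetime(2026, 4, 1, tzinfo=timezone.utc)},
]
to_delete = stale_branches(branches, now)
print(to_delete)
```

<p>Keeping the decision separate from the deletion makes the policy easy to dry-run: the job can print the list for a week before it is allowed to delete.</p>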
<h3 id="5-無-super-privilege--部分操作不可行">5. No <code>SUPER</code> privilege: some operations are not possible</h3>
<p>The MySQL user you get on a PlanetScale connection does not have the <code>SUPER</code> privilege. Operations that need <code>SUPER</code> fail outright:</p>
<ul>
<li><code>SET GLOBAL</code> (runtime variables cannot be changed)</li>
<li><code>KILL</code> on someone else's query (the PlanetScale console offers an alternative interface)</li>
<li><code>SHOW MASTER STATUS</code> / <code>SHOW SLAVE STATUS</code> (abstracted away by PlanetScale, not exposed)</li>
<li><code>INSTALL PLUGIN</code> (managed, not allowed)</li>
<li><code>STOP SLAVE</code> / <code>START SLAVE</code> (internal to Vitess)</li>
</ul>
<p>Fixes:</p>
<ul>
<li>Assess whether the application and ops tools depend on the <code>SUPER</code> privilege</li>
<li>Switch to the equivalent PlanetScale console / API operations</li>
<li>Replace some monitoring queries (<code>SHOW SLAVE STATUS</code>) with the <em>built-in PlanetScale dashboard</em></li>
</ul>
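<p>A first pass at the dependency assessment can be a text scan over ops scripts and runbooks. The sketch below is a rough heuristic, assuming the scripts are plain SQL text; <code>SUPER_PATTERNS</code> and <code>find_privileged_statements</code> are hypothetical names, and the patterns cover only the statements listed above (plus the MySQL 8 <code>REPLICA</code> spellings). It will miss statements built dynamically in application code.</p>

```python
import re

# Statement families from the list above; REPLICA is the MySQL 8 spelling.
SUPER_PATTERNS = {
    "SET GLOBAL": re.compile(r"\bSET\s+GLOBAL\b", re.IGNORECASE),
    "KILL": re.compile(r"\bKILL\s+(QUERY\s+|CONNECTION\s+)?\d+", re.IGNORECASE),
    "SHOW MASTER/SLAVE STATUS": re.compile(
        r"\bSHOW\s+(MASTER|SLAVE|REPLICA)\s+STATUS\b", re.IGNORECASE
    ),
    "INSTALL PLUGIN": re.compile(r"\bINSTALL\s+PLUGIN\b", re.IGNORECASE),
    "STOP/START SLAVE": re.compile(
        r"\b(STOP|START)\s+(SLAVE|REPLICA)\b", re.IGNORECASE
    ),
}


def find_privileged_statements(sql_text):
    """Return (line_number, label) for each line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(sql_text.splitlines(), start=1):
        for label, pattern in SUPER_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


sample = """SELECT 1;
SET GLOBAL max_connections = 500;
show slave status;
KILL 42;
"""
findings = find_privileged_statements(sample)
print(findings)
```

<p>Each finding still needs a human decision: some have a PlanetScale console or API equivalent, others mean the tool has to be retired.</p>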
<h2 id="schema-translation-主要工作量塊">Schema translation: main blocks of work</h2>
<p>Although the Type E structure is not centered on schema translation, schema diffs still take most of the time in Phase 1:</p>
<table>
  <thead>
      <tr>
          <th>Self-managed MySQL</th>
          <th>PlanetScale (Vitess)</th>
          <th>Translation difficulty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FOREIGN KEY constraint</td>
          <td>Limited support (Vitess 18+, must be enabled); cross-shard cases need application enforcement</td>
          <td>High</td>
      </tr>
      <tr>
          <td>INSTANT DDL</td>
          <td>Partially supported; the rest goes through Vitess online DDL</td>
          <td>Low-Medium</td>
      </tr>
      <tr>
          <td>Stored procedure</td>
          <td>Supported</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Trigger</td>
          <td>Supported</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>User-defined function</td>
          <td>Restricted</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Multi-table INSERT (CTE)</td>
          <td>Supported</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Cross-shard JOIN</td>
          <td>Requires Vindex design (joined tables share a shard key such as user_id)</td>
          <td>Medium-High</td>
      </tr>
      <tr>
          <td><code>SUPER</code> behavior</td>
          <td>Not supported</td>
          <td>Medium (ops tool changes)</td>
      </tr>
      <tr>
          <td><code>RELOAD</code> privilege</td>
          <td>Not supported</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
<h2 id="容量與成本對照">Capacity and cost comparison</h2>
<p>PlanetScale billing is <em>very different</em>:</p>
<table>
  <thead>
      <tr>
          <th>Item</th>
          <th>Self-managed MySQL (EC2)</th>
          <th>PlanetScale Scaler Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Per-row read</td>
          <td>Not billed</td>
          <td>Usage-based, $1 per 1B rows read</td>
      </tr>
      <tr>
          <td>Per-row written</td>
          <td>Not billed</td>
          <td>Usage-based, $1.50 per 1M rows written</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>EBS, $0.10/GB-month</td>
          <td>$1.50/GB-month + replication overhead</td>
      </tr>
      <tr>
          <td>Connection limit</td>
          <td>max_connections, set yourself</td>
          <td>Per-plan limit; a connection pooler can be added</td>
      </tr>
      <tr>
          <td>Branch</td>
          <td>N/A</td>
          <td>Each branch carries storage cost</td>
      </tr>
      <tr>
          <td>Boost cache</td>
          <td>N/A</td>
          <td>Additional cost</td>
      </tr>
      <tr>
          <td>Ops headcount</td>
          <td>1-2 FTE</td>
          <td>&lt; 0.2 FTE</td>
      </tr>
  </tbody>
</table>
<p>PlanetScale fits <em>small-to-medium scale + high priority on developer productivity</em>:</p>
<ul>
<li>Traffic &lt; 10K WPS: cost is close to self-managed, with a significant gain in developer productivity</li>
<li>Traffic 10-50K WPS: cost starts to get expensive, but the ops savings still exceed the cost increase</li>
<li>Traffic &gt; 100K WPS: negotiate PlanetScale Enterprise pricing; expect commit pricing</li>
</ul>
<p>For high-traffic scenarios, the cost forecast has to be run against a <em>real workload trace</em>: PlanetScale provides <code>pscale analytics</code> to estimate read / write volume; replay the production binlog on staging and estimate the row read / write charges from it.</p>
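<p>Once a trace gives monthly row counts, the forecast itself is simple arithmetic. The sketch below uses the per-unit rates from the table above as default parameters; treat them as placeholders and check current PlanetScale pricing before relying on the result. <code>estimate_monthly_cost</code> and the workload numbers are hypothetical.</p>

```python
def estimate_monthly_cost(
    rows_read,
    rows_written,
    storage_gb,
    read_rate_per_billion=1.00,   # USD, from the table above (placeholder)
    write_rate_per_million=1.50,  # USD, from the table above (placeholder)
    storage_rate_per_gb=1.50,     # USD per GB-month, from the table above
):
    """Estimate a usage-based monthly bill from row counts and storage."""
    read_cost = rows_read / 1_000_000_000 * read_rate_per_billion
    write_cost = rows_written / 1_000_000 * write_rate_per_million
    storage_cost = storage_gb * storage_rate_per_gb
    return {
        "read": read_cost,
        "write": write_cost,
        "storage": storage_cost,
        "total": read_cost + write_cost + storage_cost,
    }


# Hypothetical monthly totals taken from a replayed workload trace.
forecast = estimate_monthly_cost(
    rows_read=300_000_000_000,   # 300B rows read
    rows_written=2_000_000_000,  # 2B rows written
    storage_gb=200,
)
print(forecast)
```

<p>With these rates, writes dominate the bill even though reads outnumber them 150 to 1, which is why the write volume in the trace deserves the most scrutiny. Replication overhead, branches, and Boost are not modeled here.</p>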
<h2 id="何時不要遷">When not to migrate</h2>
<ul>
<li><strong>FKs are a core application constraint</strong>: cascade DELETE / SET NULL is used widely and the application cannot be changed</li>
<li><strong>Heavy automation of <code>SUPER</code>-required ops</strong>: DBA tools / monitoring hard-code <code>SUPER</code> and cannot be changed</li>
<li><strong>OS-level customization needs</strong>: as with Aurora, PlanetScale is fully managed</li>
<li><strong>Very high traffic + budget-sensitive</strong>: at &gt; 100K WPS, row-read billing can be 5x the cost of EC2 and needs Enterprise commit pricing</li>
<li><strong>Cross-cloud portability is a requirement</strong>: PlanetScale runs on its own cloud (AWS / GCP underneath), unlike self-managed Vitess, which can span clouds</li>
</ul>
<h2 id="跟-aurora-mysql-對比同-batch-的選擇">Comparison with Aurora MySQL (the alternative in the same batch)</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Aurora MySQL</th>
          <th>PlanetScale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Type</td>
          <td>C operational hybrid</td>
          <td>E paradigm shift</td>
      </tr>
      <tr>
          <td>Main workload axis</td>
          <td>parameter group + IAM + endpoint</td>
          <td>FK audit + branch workflow</td>
      </tr>
      <tr>
          <td>Sharding</td>
          <td>No sharding, single-region scaling</td>
          <td>Vitess transparent sharding</td>
      </tr>
      <tr>
          <td>Schema workflow</td>
          <td>Still gh-ost / pt-osc</td>
          <td>Branch + Deploy Request</td>
      </tr>
      <tr>
          <td>FK</td>
          <td>Supported</td>
          <td>Limited support (Vitess 18+; cross-shard restricted)</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>per-hour instance + per-GB storage</td>
          <td>per-row read / write + per-GB storage</td>
      </tr>
      <tr>
          <td>Suitable scale</td>
          <td>100 GB - 50 TB</td>
          <td>100 GB - 1 PB</td>
      </tr>
      <tr>
          <td>Cross-cloud</td>
          <td>AWS-only</td>
          <td>PlanetScale, with AWS / GCP underneath</td>
      </tr>
  </tbody>
</table>
<p>Selection logic:</p>
<ul>
<li><em>AWS-heavy ecosystem + no appetite for a schema workflow paradigm shift</em> → Aurora</li>
<li><em>Developer-first culture + wants a branch-based schema workflow + accepts the FK restrictions</em> → PlanetScale</li>
</ul>
<p>The two are not mutually exclusive: some orgs use Aurora for the OLTP core and PlanetScale for newer microservices (where the branch workflow adds value).</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Parallel batch: → Aurora MySQL migration playbook (same batch, different paradigm)</li>
<li>Upstream: <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a> / <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding design</a></li>
<li>Cross-chapter: <a href="/blog/backend/06-reliability/performance-regression-gate/" data-link-title="6.13 Performance Regression Gate" data-link-desc="把效能 baseline 從一次性壓測變成持續對齊的 release gate，涵蓋 baseline 設定、判讀方法、variance 控制與退化定位">6.13 Performance Regression Gate</a>, on how the Deploy Request workflow affects the release gate</li>
<li>Existing vendor references: <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a> / <a href="https://planetscale.com/">PlanetScale official site</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a> (explains the Type E paradigm shift structure)</li>
<li>Official: <a href="https://planetscale.com/docs/imports/migrate-from-mysql">PlanetScale Migration Guide</a></li>
</ul>
]]></content:encoded></item><item><title>從 RDS / MongoDB 遷移到 DynamoDB：access-pattern-first 重建模、混合架構與 cost crossover</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/dynamodb/migrate-rds-mongodb-to-dynamodb/</link><pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/dynamodb/migrate-rds-mongodb-to-dynamodb/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/dynamodb/" data-link-title="DynamoDB" data-link-desc="AWS managed key-value、partition-based scaling、9000 萬 RPS sustained 實戰證據">DynamoDB&lt;/a> overview 的 migration playbook。寫作參照 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook 寫作方法論&lt;/a>。&lt;/p>&lt;/blockquote>
&lt;p>「我們要把 RDS 整個搬到 DynamoDB。」這句話本身就藏著最大的誤解 — DynamoDB 遷移不是把 table schema 1:1 搬過去。RDS 的 normalized schema、JOIN、ad-hoc query 在 DynamoDB 沒有對應物；MongoDB 的彈性 document、二級索引、aggregation pipeline 也不能直接映射。字面意義的「遷移」不成立 — 遷移的動作是 &lt;em>從 access pattern 重新設計資料模型&lt;/em>、搬資料只是最後一步。能不能遷、該遷多少，取決於 workload 的查詢形狀是否固定、一致性需求是否能放寬。本文走 paradigm shift 結構：先講為何字面遷移不成立、再講哪些該遷哪些該留、最後才是階段化執行。&lt;/p>
&lt;h2 id="6-維-diff-audit主導維度是-paradigm">6 維 diff audit：主導維度是 paradigm&lt;/h2>
&lt;p>遷移前先盤點 source 跟 target 的差異落在哪幾維、決定 playbook 結構：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>RDS / MongoDB → DynamoDB&lt;/th>
 &lt;th>程度&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>SQL / document query → KV &lt;code>GetItem&lt;/code> / &lt;code>Query&lt;/code>、無 JOIN&lt;/td>
 &lt;td>High&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational model&lt;/td>
 &lt;td>self-managed / RDS-managed → fully managed serverless&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Paradigm&lt;/td>
 &lt;td>relational / document model → access-pattern-first KV&lt;/td>
 &lt;td>High&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Components 數量&lt;/td>
 &lt;td>單 DB → 單 DB（不拆分）&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>ORM / query layer 全改、access pattern 先行&lt;/td>
 &lt;td>High&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data topology&lt;/td>
 &lt;td>partition key 設計、無跨 region transaction&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>主導維度是 &lt;strong>paradigm&lt;/strong>（其次 schema / application change）。這定義了結構 — &lt;strong>Type E paradigm shift&lt;/strong>（排除 schema 翻譯 Type A 和 drop-in Type B）：部分遷移、長期混合架構、不收斂到「全部搬完」。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>No-go condition&lt;/strong>：workload 需要 ad-hoc 分析查詢、跨實體 JOIN、頻繁 schema 變動下的彈性查詢、或複雜多表交易 → 不該遷 DynamoDB。這些是 relational / document 的主場、硬遷會把複雜度推給 application 層（自己做 JOIN、自己維護冗餘）。&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼字面遷移不成立paradigm-gap">為什麼字面遷移不成立：paradigm gap&lt;/h2>
&lt;p>RDS / MongoDB 是 &lt;em>先有資料模型、再支援任意查詢&lt;/em>；DynamoDB 是 &lt;em>先有查詢、才設計資料模型&lt;/em>。這個順序顛倒是遷移的核心難點。&lt;/p>
&lt;p>&lt;strong>relational → DynamoDB 的斷層&lt;/strong>：&lt;/p>
&lt;ul>
&lt;li>JOIN 消失：relational 用 JOIN 組合多表、DynamoDB 要嘛預先反正規化（把關聯資料寫在同一 item / 同一 partition）、要嘛 application 多次查詢自己組&lt;/li>
&lt;li>ad-hoc query 消失：RDS 可以對任意欄位下 &lt;code>WHERE&lt;/code>、DynamoDB 只能用 PK/SK 或預建 GSI 查（對應 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/dynamodb/gsi-lsi-design/" data-link-title="DynamoDB GSI 與 LSI 設計：access pattern 補位、projection、consistency 跟 DAX 補位" data-link-desc="GSI / LSI 是 single-table 沒覆蓋的 access pattern 補位、不是萬靈丹；本文涵蓋 projection 三型選擇、sparse index、GSI 自己會 hot partition、DAX 讀峰值補位的觸發條件（含 Capcom 是 derive vs Lemino 是 case fact 的分層）">gsi-lsi-design&lt;/a>）&lt;/li>
&lt;li>強一致交易縮窄：relational 任意多表交易 → DynamoDB 有限的 TransactWriteItems（對應 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/dynamodb/transactions-conditional-writes/" data-link-title="DynamoDB Transaction 與 Conditional Write：跨 item 原子性、optimistic locking 與 idempotency" data-link-desc="DynamoDB 的寫原子性不是免費 ACID；本文展開 TransactWriteItems 跨 item 原子性、ConditionExpression 條件寫、version-based optimistic locking、ClientRequestToken idempotency，以及 transaction 2x 成本邊界與何時用單 item conditional write 取代 transaction">transactions-conditional-writes&lt;/a>）&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>document（MongoDB）→ DynamoDB 的斷層&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
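<p>This article is the migration playbook for the <a href="/blog/backend/01-database/vendors/dynamodb/" data-link-title="DynamoDB" data-link-desc="AWS managed key-value、partition-based scaling、9000 萬 RPS sustained 實戰證據">DynamoDB</a> overview. It is written following the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook writing methodology</a>.</p></blockquote>
<p>"We are moving all of RDS to DynamoDB." That sentence already hides the biggest misconception: a DynamoDB migration is not a 1:1 move of table schemas. RDS's normalized schema, JOINs, and ad-hoc queries have no counterpart in DynamoDB; MongoDB's flexible documents, secondary indexes, and aggregation pipeline cannot be mapped directly either. "Migration" in the literal sense does not hold: the real work is <em>redesigning the data model from access patterns</em>, and moving the data is only the last step. Whether you can migrate, and how much, depends on whether the workload's query shapes are fixed and whether its consistency requirements can be relaxed. This article follows the paradigm shift structure: first why literal migration does not hold, then what should move and what should stay, and only then the phased execution.</p>
<h2 id="6-維-diff-audit主導維度是-paradigm">6-dimension diff audit: the dominant dimension is paradigm</h2>
<p>Before migrating, take stock of which dimensions the differences between source and target fall in; that determines the playbook structure:</p>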
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>RDS / MongoDB → DynamoDB</th>
          <th>Degree</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>SQL / document query → KV <code>GetItem</code> / <code>Query</code>, no JOIN</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>self-managed / RDS-managed → fully managed serverless</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>relational / document model → access-pattern-first KV</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Component count</td>
          <td>single DB → single DB (no split)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>ORM / query layer fully rewritten, access patterns first</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>partition key design, no cross-region transactions</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
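<p>The dominant dimension is <strong>paradigm</strong> (followed by schema / application change). That fixes the structure as a <strong>Type E paradigm shift</strong> (ruling out the schema-translation Type A and the drop-in Type B): partial migration, a long-lived hybrid architecture, and no convergence to "everything moved".</p>
<blockquote>
<p><strong>No-go condition</strong>: if the workload needs ad-hoc analytical queries, cross-entity JOINs, flexible queries under frequent schema change, or complex multi-table transactions, it should not move to DynamoDB. Those are home turf for relational / document stores; forcing the migration pushes the complexity into the application layer (doing your own JOINs, maintaining your own redundancy).</p></blockquote>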
<h2 id="為什麼字面遷移不成立paradigm-gap">Why literal migration does not hold: the paradigm gap</h2>
<p>RDS / MongoDB <em>start from a data model and then support arbitrary queries</em>; DynamoDB <em>starts from the queries and only then designs the data model</em>. This reversed order is the core difficulty of the migration.</p>
<p><strong>The relational → DynamoDB gap</strong>:</p>
<ul>
<li>JOINs disappear: relational combines tables with JOIN; in DynamoDB you either denormalize up front (write related data into the same item / same partition) or have the application issue multiple queries and assemble the result itself</li>
<li>Ad-hoc queries disappear: RDS can put a <code>WHERE</code> on any column; DynamoDB can only query by PK/SK or a pre-built GSI (see <a href="/blog/backend/01-database/vendors/dynamodb/gsi-lsi-design/" data-link-title="DynamoDB GSI 與 LSI 設計：access pattern 補位、projection、consistency 跟 DAX 補位" data-link-desc="GSI / LSI 是 single-table 沒覆蓋的 access pattern 補位、不是萬靈丹；本文涵蓋 projection 三型選擇、sparse index、GSI 自己會 hot partition、DAX 讀峰值補位的觸發條件（含 Capcom 是 derive vs Lemino 是 case fact 的分層）">gsi-lsi-design</a>)</li>
<li>Strongly consistent transactions narrow: arbitrary multi-table transactions in relational → the limited TransactWriteItems in DynamoDB (see <a href="/blog/backend/01-database/vendors/dynamodb/transactions-conditional-writes/" data-link-title="DynamoDB Transaction 與 Conditional Write：跨 item 原子性、optimistic locking 與 idempotency" data-link-desc="DynamoDB 的寫原子性不是免費 ACID；本文展開 TransactWriteItems 跨 item 原子性、ConditionExpression 條件寫、version-based optimistic locking、ClientRequestToken idempotency，以及 transaction 2x 成本邊界與何時用單 item conditional write 取代 transaction">transactions-conditional-writes</a>)</li>
</ul>
<p><strong>The document (MongoDB) → DynamoDB gap</strong>:</p>
<ul>
<li>They look close (both NoSQL / document-ish), but MongoDB's flexible secondary indexes, aggregation pipeline, and flexible queries have no counterpart in DynamoDB</li>
<li>MongoDB lets you "store first, figure out how to query later"; DynamoDB does not: build the table before the access patterns are thought through and you will redo it later</li>
</ul>
<p>So the first step of the migration is not importing data but <strong>enumerating access patterns</strong>: list every read and write path the application has against this data, and map each path to a DynamoDB PK/SK/GSI design. Until the access pattern list is complete, the migration cannot start.</p>
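<p>One way to make "the list is complete" checkable is to keep the access patterns as data and lint them. The sketch below assumes a hypothetical <code>AccessPattern</code> record and an <code>unresolved_patterns</code> check; the entity names and key formats are illustrative. It flags patterns with no key design yet, and patterns that ask for a strongly consistent read on a GSI, which DynamoDB does not offer (GSI reads are eventually consistent).</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    name: str
    index: str          # "table" or a GSI name; "" means not designed yet
    key_condition: str  # e.g. "PK = USER#{id} AND begins_with(SK, ORDER#)"
    consistent_read: bool = False


def unresolved_patterns(patterns):
    """Return (name, reason) for patterns the current key design cannot serve."""
    problems = []
    for p in patterns:
        if not p.index or not p.key_condition:
            problems.append((p.name, "no key design"))
        elif p.consistent_read and p.index != "table":
            problems.append((p.name, "strongly consistent read on a GSI"))
    return problems


# Illustrative list; a real one comes from reading the application's queries.
patterns = [
    AccessPattern("get user by id", "table", "PK = USER#{id}", consistent_read=True),
    AccessPattern("orders by user", "table", "PK = USER#{id} AND begins_with(SK, ORDER#)"),
    AccessPattern("orders by status", "", ""),
    AccessPattern("latest invoice", "GSI1", "GSI1PK = INVOICE#{day}", consistent_read=True),
]
problems = unresolved_patterns(patterns)
print(problems)
```

<p>An empty result does not prove the list is exhaustive; it only shows that every pattern written down so far has a home in the key design.</p>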
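<h2 id="哪些-workload-該遷哪些該留混合架構">Which workloads should move and which should stay (hybrid architecture)</h2>
<p>The essence of Type E is <em>non-convergence</em>: not all data belongs in DynamoDB, and the hybrid architecture will exist for a long time. The criteria:</p>
<table>
  <thead>
      <tr>
          <th>Workload characteristic</th>
          <th>Destination</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fixed access patterns, key-based queries, high throughput</td>
          <td>Move to DynamoDB</td>
      </tr>
      <tr>
          <td>Eventual consistency is acceptable</td>
          <td>Move to DynamoDB</td>
      </tr>
      <tr>
          <td>Needs ad-hoc analytics / reports / JOINs</td>
          <td>Stay on RDS, or move to an analytics system</td>
      </tr>
      <tr>
          <td>Needs strongly consistent complex transactions</td>
          <td>Stay on RDS</td>
      </tr>
      <tr>
          <td>Schema evolves frequently, query needs are unstable</td>
          <td>Stay on MongoDB / RDS</td>
      </tr>
  </tbody>
</table>
<p><code>9.C20 Zomato</code> is the case anchor for these criteria: Zomato migrated its <em>billing platform</em> (billing events, fixed access patterns, eventual consistency acceptable), not the whole company's databases. After the billing system moved from TiDB to DynamoDB, throughput went from 2,000 to 8,000 RPM (4x), latency dropped 90%, and cost dropped 50%. The motivation was that TiDB had to be over-provisioned ahead of traffic spikes, whereas DynamoDB on-demand ("pay only for what we use") avoids that standing waste.</p>
<blockquote>
<p><strong>Scope warning</strong>: Zomato's "50% cost reduction" is a comparison <em>at the traffic level of that time</em>, not a permanent conclusion; the "90% latency reduction" may be mostly p50, and p99/p999 improvements are usually smaller. The case source already states both caveats; do not upgrade them to "DynamoDB is always cheaper and faster" when citing. See the capacity section below for the crossover analysis.</p></blockquote>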
<h2 id="phase-planaccess-pattern-first-階段化">Phase plan: access-pattern-first staging</h2>
<p>A paradigm-shift migration is staged so that irreversible actions come last and every phase has its own verification gate:</p>
<h4 id="phase-1access-pattern-窮舉">Phase 1: enumerate access patterns</h4>
<p>List every read and write path the application uses against the target data, and mark each with its frequency, its consistency requirement, and whether that requirement can be relaxed. This list is the input to every later design step; do not enter the next phase while it is incomplete.</p>
<h4 id="phase-2dynamodb-資料建模">Phase 2: DynamoDB data modeling</h4>
<p>Design the PK/SK, the single-table layout, the required GSIs, and the capacity mode from the access patterns. See <a href="/blog/backend/01-database/vendors/dynamodb/single-table-design-pattern/" data-link-title="DynamoDB Single-Table Design：從適用度前置判讀到 access pattern 反推 PK/SK" data-link-desc="DynamoDB single-table 設計不是「資料表越少越好」，而是 access pattern 反推 PK/SK 跟 GSI；本文先做 DynamoDB 適用度 4 軸前置判讀（PK 天然均勻 / control plane vs data plane / consistency / access pattern 穩定），再展開設計流程、failure modes 與 durable queue 正向用例">single-table-design-pattern</a> and <a href="/blog/backend/01-database/vendors/dynamodb/partition-key-antipatterns/" data-link-title="DynamoDB Partition Key 反模式與 Write Sharding：composite key 修復跟 mode × partition 交叉判讀" data-link-desc="DynamoDB partition 上限 1000 WCU 是 hot partition 的根因；composite key（event_id &#43; shard suffix）跟 calculated shard（hash % N）兩種修法、mode × partition 在 provisioned / on-demand 不同表現，以及 9.C15 Tixcraft 6750x 擴展的工程細節">partition-key-antipatterns</a>.</p>
<h4 id="phase-3dual-write">Phase 3: dual-write</h4>
<p>The application writes to both the old store (RDS / MongoDB) and the new one (DynamoDB). The old system remains the source of truth while DynamoDB accumulates data. Dual-write must handle partial write failure: decide how to compensate when only one side succeeds.</p>
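<p>A minimal Go sketch of the dual-write rule, under stated assumptions: <code>Store</code>, <code>memStore</code>, and <code>DualWriter</code> are hypothetical names invented for illustration, and the in-memory maps stand in for the real RDS / MongoDB and DynamoDB clients. It shows one possible compensation policy (queue the key for later repair), not the only one.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a hypothetical write interface; real code would wrap the old
// database client and the DynamoDB client behind it.
type Store interface {
	Put(key, value string) error
}

// memStore is an in-memory stand-in used only for this sketch.
type memStore struct {
	data map[string]string
	fail bool // when true, every Put fails
}

func (m *memStore) Put(key, value string) error {
	if m.fail {
		return errors.New("simulated write failure")
	}
	m.data[key] = value
	return nil
}

// DualWriter writes to the old store first (still the source of truth) and
// then to the new store. A failed new-store write is queued for repair
// instead of failing the request.
type DualWriter struct {
	Old, New Store
	Pending  []string // keys whose new-store write must be retried or reconciled
}

func (d *DualWriter) Put(key, value string) error {
	if err := d.Old.Put(key, value); err != nil {
		return fmt.Errorf("source-of-truth write failed: %w", err)
	}
	if err := d.New.Put(key, value); err != nil {
		d.Pending = append(d.Pending, key)
	}
	return nil
}

func main() {
	oldStore := &memStore{data: map[string]string{}}
	newStore := &memStore{data: map[string]string{}, fail: true}
	w := &DualWriter{Old: oldStore, New: newStore}
	_ = w.Put("order#1", "paid")
	fmt.Println(len(w.Pending), oldStore.data["order#1"])
}
```

<p>The length of <code>Pending</code> over time is one way to feed the "write-failure compensation records" evidence for this phase.</p>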
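<h4 id="phase-4backfill-歷史資料">Phase 4: backfill historical data</h4>
<p>Convert the old system's existing data to the new model and write it into DynamoDB. When backfill runs alongside dual-write, write ordering matters: backfill must not overwrite a newer value that dual-write has already stored.</p>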
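<p>One way to enforce that ordering is a version-guarded write. The Go sketch below assumes every record carries a monotonically increasing <code>Version</code> (for example the source row's update timestamp); <code>Item</code> and <code>PutIfNewer</code> are hypothetical names, and the map stands in for the DynamoDB table. On DynamoDB the same guard would normally be expressed as a conditional write rather than a read followed by a write.</p>

```go
package main

import "fmt"

// Item is a hypothetical record shape; Version is assumed to increase with
// every update in the source system.
type Item struct {
	Value   string
	Version int64
}

// PutIfNewer stores item only when the key is absent or the stored version
// is older, so a late backfill write cannot clobber a newer dual-write value.
func PutIfNewer(table map[string]Item, key string, item Item) bool {
	cur, exists := table[key]
	if exists && cur.Version >= item.Version {
		return false
	}
	table[key] = item
	return true
}

func main() {
	table := map[string]Item{}
	PutIfNewer(table, "order#1", Item{Value: "paid", Version: 2})              // dual-write
	applied := PutIfNewer(table, "order#1", Item{Value: "pending", Version: 1}) // backfill
	fmt.Println(applied, table["order#1"].Value)
}
```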
<h4 id="phase-5shadow-read-驗證">Phase 5: shadow read validation</h4>
<p>The read path queries both old and new, compares the results, and records the differences, while still serving users from the old system. Shadow read is the source of confidence before cutover: proceed only once the difference rate has fallen to an acceptable level. See the evidence method in <a href="/blog/backend/01-database/schema-migration-rollout-evidence/" data-link-title="1.7 Schema Migration Rollout 證據（Schema Migration Rollout Evidence）實作示範" data-link-desc="以訂單付款狀態欄位演進示範 schema migration 如何產出 evidence、release gate 與 incident decision log。">1.7 Schema Migration Rollout Evidence</a>.</p>
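<p>A small Go sketch of the shadow read loop, as an illustration only: <code>Reader</code>, <code>ShadowStats</code>, and <code>ShadowRead</code> are hypothetical names, and a real implementation would also classify each difference (acceptable eventual-consistency lag versus a real error) rather than only counting it.</p>

```go
package main

import "fmt"

// Reader is a hypothetical read function for one store.
type Reader func(key string) (string, error)

// ShadowStats counts comparisons and mismatches.
type ShadowStats struct {
	Total    int
	Mismatch int
}

// Rate returns the mismatch ratio, or 0 when nothing was compared.
func (s ShadowStats) Rate() float64 {
	if s.Total == 0 {
		return 0
	}
	return float64(s.Mismatch) / float64(s.Total)
}

// ShadowRead always answers from the old store; the new store's answer is
// only compared and counted, never returned to the caller.
func ShadowRead(oldR, newR Reader, key string, stats *ShadowStats) (string, error) {
	want, err := oldR(key)
	if err != nil {
		return "", err
	}
	stats.Total++
	got, newErr := newR(key)
	if newErr != nil || got != want {
		stats.Mismatch++
	}
	return want, nil
}

func main() {
	oldData := map[string]string{"a": "1", "b": "2"}
	newData := map[string]string{"a": "1", "b": "stale"}
	oldR := func(k string) (string, error) { return oldData[k], nil }
	newR := func(k string) (string, error) { return newData[k], nil }

	var stats ShadowStats
	for _, k := range []string{"a", "b"} {
		v, _ := ShadowRead(oldR, newR, k, &stats)
		fmt.Println(k, v)
	}
	fmt.Println(stats.Rate())
}
```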
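<h4 id="phase-6漸進-cutover">Phase 6: gradual cutover</h4>
<p>Shift read traffic from old to new step by step (by percentage or by user segment), keeping the ability to switch back at any time. Once cutover completes, DynamoDB is the source of truth for that workload, while workloads that were not migrated stay on RDS / MongoDB: this is the hybrid architecture.</p>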
<h2 id="evidence每階段的前進依據">Evidence: the basis for advancing each phase</h2>
<p>Each phase advances on data, not on gut feeling:</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>dual-write</td>
          <td>Dual-write success rate, write-failure compensation records, row-count difference between the two sides</td>
      </tr>
      <tr>
          <td>backfill</td>
          <td>Share of data backfilled, conversion error count, checksum comparison</td>
      </tr>
      <tr>
          <td>shadow read</td>
          <td>Old-vs-new result difference rate, difference classification (acceptable eventual-consistency lag vs. real errors)</td>
      </tr>
      <tr>
          <td>cutover</td>
          <td>Traffic split ratio, new-system p99 latency, error rate, whether rollback was triggered</td>
      </tr>
  </tbody>
</table>
<p>This evidence lines up with the <a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a> (Source / Time range / Query link / Owner / Data quality) and with the gate decisions in <a href="/blog/backend/06-reliability/release-gate/" data-link-title="6.8 Release Gate 與變更節奏" data-link-desc="把驗證、migration、相容性納入放行判準">6.8 release gate</a>.</p>
<h2 id="cutover-與-rollback-決策">Cutover and rollback decisions</h2>
<p>A failed database cutover is expensive, so decision ownership has to be written down:</p>
<ul>
<li><strong>cutover window</strong>: pick a low-traffic period and define an explicit traffic ladder (for example 1% → 10% → 50% → 100%)</li>
<li><strong>rollback condition</strong>: switch back to the old system when the new system's error rate or latency exceeds its threshold, or when the shadow read difference rate is abnormal</li>
<li><strong>decision owner</strong>: who may call a stop and on what evidence, recorded in the <a href="/blog/backend/08-incident-response/incident-decision-log/" data-link-title="8.19 Incident Decision Log" data-link-desc="把事中假設、決策、證據、回退條件與責任人留下可復盤紀錄">8.19 incident decision log</a> (Timestamp / Decision / Context / Evidence / Owner / Rollback condition)</li>
<li><strong>data freeze strategy</strong>: if writes must be frozen during cutover, state the scope and duration of the freeze</li>
</ul>
<p>See <a href="/blog/backend/knowledge-cards/rollback-window/" data-link-title="Rollback Window" data-link-desc="說明變更進入 production 後還能用哪種方式回退或改路線的時間與條件">rollback window</a> and <a href="/blog/backend/knowledge-cards/rollback-condition/" data-link-title="Rollback Condition" data-link-desc="說明決策執行後出現哪些訊號時要撤回、回退或改路線">rollback condition</a>.</p>
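<p>The rollback condition can be written down as an explicit check so that the decision owner acts on recorded evidence. The Go sketch below is illustrative: <code>Thresholds</code>, <code>Snapshot</code>, and <code>RollbackReasons</code> are hypothetical names, and the threshold values in <code>main</code> are invented numbers, not recommendations.</p>

```go
package main

import "fmt"

// Thresholds holds the agreed rollback limits (hypothetical fields).
type Thresholds struct {
	MaxErrorRate      float64 // fraction of failed requests
	MaxP99Millis      float64 // p99 latency in milliseconds
	MaxShadowDiffRate float64 // fraction of shadow reads that differ
}

// Snapshot is one observation of the new system during cutover.
type Snapshot struct {
	ErrorRate      float64
	P99Millis      float64
	ShadowDiffRate float64
}

// RollbackReasons lists every breached threshold; an empty result means the
// current ladder step may continue.
func RollbackReasons(s Snapshot, t Thresholds) []string {
	var reasons []string
	if s.ErrorRate > t.MaxErrorRate {
		reasons = append(reasons, "error rate above threshold")
	}
	if s.P99Millis > t.MaxP99Millis {
		reasons = append(reasons, "p99 latency above threshold")
	}
	if s.ShadowDiffRate > t.MaxShadowDiffRate {
		reasons = append(reasons, "shadow read difference rate above threshold")
	}
	return reasons
}

func main() {
	limits := Thresholds{MaxErrorRate: 0.01, MaxP99Millis: 200, MaxShadowDiffRate: 0.001}
	now := Snapshot{ErrorRate: 0.002, P99Millis: 350, ShadowDiffRate: 0.0005}
	fmt.Println(RollbackReasons(now, limits))
}
```

<p>The returned reasons map naturally onto the Evidence and Rollback condition fields of the decision log.</p>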
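<h2 id="cleanup-與長期混合">Cleanup and the long-term hybrid</h2>
<p>Type E cleanup does not necessarily mean "retire the old system"; in most cases the old system keeps serving workloads that were not migrated:</p>
<ul>
<li>Retire the old schema, old writers, and dual-write code paths of the migrated workload</li>
<li>Remove the shadow read comparison code</li>
<li>Keep RDS / MongoDB itself (it serves analytics, strongly consistent, and flexible-query workloads)</li>
<li>State explicitly which data paths have DynamoDB as their source of truth and which still have RDS / MongoDB, to avoid "which one is the real one" confusion</li>
</ul>
<p>A hybrid architecture is not a failed transition; it is the steady state of a paradigm shift, with each workload on the storage layer that suits it best.</p>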
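<h2 id="失敗模式">Failure modes</h2>
<p>Five common pitfalls in production:</p>
<h4 id="case-1先匯資料才想-access-pattern">Case 1: importing data before thinking about access patterns</h4>
<p>The RDS table structure is copied straight into DynamoDB items; after launch the required queries cannot be served and the table has to be rebuilt. Fix: access pattern enumeration is Phase 1 and data modeling is Phase 2; the order cannot be reversed.</p>
<h4 id="case-2把-join-邏輯推給-application-卻沒評估成本">Case 2: pushing JOIN logic to the application without costing it</h4>
<p>Relational data is migrated and the application issues N DynamoDB calls per query to assemble the JOIN itself, so latency and cost explode. Fix: denormalize related data at modeling time (same partition or same item); if a relational query cannot be denormalized, that workload may not be a fit for migration.</p>
<h4 id="case-3dual-write-一邊失敗沒補償">Case 3: one side of dual-write fails with no compensation</h4>
<p>During dual-write the DynamoDB write succeeds but the RDS write fails (or the reverse), the two sides diverge, and after cutover the new system turns out to be missing data. Fix: dual-write needs failure compensation (record the failure, retry, or flag the record for manual reconciliation); see <a href="/blog/backend/01-database/reconciliation-data-repair/" data-link-title="1.9 Reconciliation 與 Data Repair" data-link-desc="資料不一致的分類、偵測模式、修復策略、audit trail、跟 backup / PITR 整合">1.9 Reconciliation and Data Repair</a>.</p>
<h4 id="case-4跳過-shadow-read-直接-cutover">Case 4: skipping shadow read and cutting over directly</h4>
<p>The team trusts its own model and skips shadow read, then discovers after cutover that the access pattern list missed a query path, causing production errors. Fix: shadow read is the only pre-cutover phase that validates the new model under real traffic; it cannot be skipped.</p>
<h4 id="case-5只看當下成本忽略-crossover">Case 5: looking only at current cost and ignoring crossover</h4>
<p>The decision is made on a 50% cost reduction computed at migration time; once traffic grows, DynamoDB's per-request cost adds up to more than a self-managed cluster. Fix: compute the cost curve over 12-24 months at expected traffic, not a point-in-time snapshot (see the capacity section).</p>
<p><strong>Anti-recommendation</strong>: if the workload's query needs are still changing quickly, or the team has no experience with access-pattern-first modeling, do not migrate yet. Pilot with one low-risk workload whose access patterns are already stable (such as Zomato's billing platform) and expand only after that experience is in hand.</p>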
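<h2 id="容量與成本crossover-判讀">Capacity and cost: reading the crossover</h2>
<p>The key to reading DynamoDB cost is the <em>future traffic curve</em>, not the snapshot at migration time:</p>
<ul>
<li><strong>At migration time</strong>: compared with an over-provisioned self-managed cluster, DynamoDB on-demand is often cheaper (Zomato: -50%)</li>
<li><strong>After traffic grows</strong>: DynamoDB's request cost grows linearly with usage; under high, predictable traffic a self-managed cluster has a crossover point beyond which it may become the cheaper option</li>
<li><strong>Reading by tier</strong>: small or medium traffic, or unpredictable traffic → DynamoDB is economical; large, predictable traffic plus an existing DBA team → compute the self-managed crossover</li>
</ul>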
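<p>The crossover reading can be made concrete with a small projection. The Go sketch below is a simplification under loud assumptions: pay-per-request cost is modeled as purely linear, the self-managed cluster as a flat monthly cost, and every number in <code>main</code> (request volume, growth rate, price, cluster cost) is invented for illustration rather than taken from any price list or from the Zomato case.</p>

```go
package main

import "fmt"

// OnDemandCost returns the monthly pay-per-request cost for a request volume,
// given a price per million requests.
func OnDemandCost(requests, pricePerMillion float64) float64 {
	return requests / 1e6 * pricePerMillion
}

// FirstCrossoverMonth returns the first month (0-based) within horizon in
// which the on-demand cost exceeds the flat monthly cost of a self-managed
// cluster, or -1 if that never happens within the horizon.
func FirstCrossoverMonth(baseRequests, monthlyGrowth, pricePerMillion, fixedMonthly float64, horizon int) int {
	requests := baseRequests
	for m := 0; m < horizon; m++ {
		if OnDemandCost(requests, pricePerMillion) > fixedMonthly {
			return m
		}
		requests *= 1 + monthlyGrowth
	}
	return -1
}

func main() {
	// Hypothetical inputs: 1e9 requests/month today, 10% monthly growth,
	// 1.25 per million requests, 5000 per month for a self-managed cluster.
	month := FirstCrossoverMonth(1e9, 0.10, 1.25, 5000, 24)
	fmt.Println("on-demand becomes more expensive in month", month)
}
```

<p>With these made-up inputs on-demand starts at a quarter of the cluster cost and overtakes it in month 15, which is why a 12-24 month curve says more than the snapshot at migration time.</p>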
<p>This vendor-level cost axis is written up primarily in <a href="/blog/backend/01-database/vendors/dynamodb/on-demand-vs-provisioned/#%e8%bb%b8-6dynamodb-vs-%e8%87%aa%e7%ae%a1-cluster-cost-crossover" data-link-title="DynamoDB On-Demand vs Provisioned：6 軸決策、auto-scaling 邊界與 cost crossover" data-link-desc="capacity mode 選擇不是單軸 peak/avg ratio；本文展開 6 軸決策（peak/avg / 讀寫比 trend / surge 暫時 vs 永久 baseline / predictable-peak vs flash-sale / DBA 工時釋放 / vendor vs 自管 cost crossover），含 Zomato 50% 成本下降、Zoom 30x permanent surge、Amazon Ads sustained workload 等 case 分軸 anchor">on-demand-vs-provisioned axis 6</a>; this article cites it from the migration-decision angle and does not re-expand the six axes.</p>
<blockquote>
<p><strong>Scope warning</strong>: the crossover point moves with region pricing, workload shape, and team cost structure, so there is no universal threshold; Zomato's specific percentages are a single-case comparison at that point in time and must not be extrapolated.</p></blockquote>
<p>This connects back to <a href="/blog/backend/09-performance-capacity/" data-link-title="模組九：效能工程與容量規劃" data-link-desc="把『目前配置能撐多少、要加多少機器』變成可量化、可驗證、可改進的工程流程">9.7 cost boundaries and efficiency</a> and <a href="/blog/backend/01-database/kv-document-capacity-planning/" data-link-title="1.10 KV / Document DB 容量規劃" data-link-desc="DynamoDB / Cosmos DB / Bigtable / MongoDB 等 KV / Document DB 的容量設計、partition key 取捨、capacity mode 選擇">1.10 KV / Document DB capacity planning</a>.</p>
<h2 id="邊界與整合">Boundaries and integration</h2>
<h3 id="跟其他遷移路徑的關係">Relationship to other migration paths</h3>
<ul>
<li><strong>DynamoDB → SQL / search / analytics split</strong> (the outbound direction): when a DynamoDB workload grows ad-hoc query needs and its analytical part is split off to OpenSearch or a data warehouse, that is the reverse path and belongs to a separate playbook</li>
<li><strong>MongoDB → Atlas</strong>: if the goal is only managed MongoDB rather than a paradigm change, take <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a>; there is no need to move to DynamoDB (the document paradigm is kept)</li>
<li><strong>Cross-platform equivalents</strong>: RDS → Aurora (stays relational) and MongoDB → Cosmos DB (stays document) both span a smaller paradigm gap than moving to DynamoDB; first confirm that a paradigm change is really needed</li>
</ul>
<h3 id="sibling-與-cross-link">Siblings and cross-links</h3>
<ul>
<li><a href="/blog/backend/01-database/vendors/dynamodb/single-table-design-pattern/" data-link-title="DynamoDB Single-Table Design：從適用度前置判讀到 access pattern 反推 PK/SK" data-link-desc="DynamoDB single-table 設計不是「資料表越少越好」，而是 access pattern 反推 PK/SK 跟 GSI；本文先做 DynamoDB 適用度 4 軸前置判讀（PK 天然均勻 / control plane vs data plane / consistency / access pattern 穩定），再展開設計流程、failure modes 與 durable queue 正向用例">single-table-design-pattern</a>: the core of Phase 2 data modeling</li>
<li><a href="/blog/backend/01-database/vendors/dynamodb/partition-key-antipatterns/" data-link-title="DynamoDB Partition Key 反模式與 Write Sharding：composite key 修復跟 mode × partition 交叉判讀" data-link-desc="DynamoDB partition 上限 1000 WCU 是 hot partition 的根因；composite key（event_id &#43; shard suffix）跟 calculated shard（hash % N）兩種修法、mode × partition 在 provisioned / on-demand 不同表現，以及 9.C15 Tixcraft 6750x 擴展的工程細節">partition-key-antipatterns</a>: judging PK uniformity during modeling</li>
<li><a href="/blog/backend/01-database/vendors/dynamodb/transactions-conditional-writes/" data-link-title="DynamoDB Transaction 與 Conditional Write：跨 item 原子性、optimistic locking 與 idempotency" data-link-desc="DynamoDB 的寫原子性不是免費 ACID；本文展開 TransactWriteItems 跨 item 原子性、ConditionExpression 條件寫、version-based optimistic locking、ClientRequestToken idempotency，以及 transaction 2x 成本邊界與何時用單 item conditional write 取代 transaction">transactions-conditional-writes</a>: how write consistency is rebuilt on DynamoDB after migration</li>
<li><a href="/blog/backend/01-database/vendors/dynamodb/on-demand-vs-provisioned/" data-link-title="DynamoDB On-Demand vs Provisioned：6 軸決策、auto-scaling 邊界與 cost crossover" data-link-desc="capacity mode 選擇不是單軸 peak/avg ratio；本文展開 6 軸決策（peak/avg / 讀寫比 trend / surge 暫時 vs 永久 baseline / predictable-peak vs flash-sale / DBA 工時釋放 / vendor vs 自管 cost crossover），含 Zomato 50% 成本下降、Zoom 30x permanent surge、Amazon Ads sustained workload 等 case 分軸 anchor">on-demand-vs-provisioned</a>: the SSoT for cost crossover axis 6</li>
<li><a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">1.6 Database conversion in practice</a>: the general dual-write / shadow read / cutover framework</li>
<li>Cross-referenced with <a href="/blog/backend/09-performance-capacity/cases/zomato-tidb-to-dynamodb-migration/" data-link-title="9.C20 Zomato：從 TiDB 遷移到 DynamoDB、吞吐 4 倍、延遲降 90%、成本減 50%" data-link-desc="Zomato 帳單系統從 TiDB 遷移到 DynamoDB、吞吐 2K→8K RPM、延遲降 90%、成本減 50%">Zomato 9.C20</a>: the quantified comparison for the billing platform migration and the cost crossover warning</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB&lt;/a>。本文是 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 多重歸類跟 tie-breaking&lt;/a> 規則的實證 — 三維皆 High 配對的處理方式不是「選 type A 或 type C 或 type E」、是 &lt;em>主導維度走 Type E、其他高維度獨立加段&lt;/em>。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關。&lt;/p>&lt;/blockquote>
&lt;h2 id="三維皆-high決策矩陣">三維皆 High：決策矩陣&lt;/h2>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> 對 PostgreSQL → CockroachDB：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>評估&lt;/th>
 &lt;th>等級&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>PostgreSQL wire protocol 兼容、但 SQL feature set 部分缺（CTE recursive 部分 / window function 部分 / extension 完全缺）&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational model&lt;/td>
 &lt;td>Single-node + Patroni → distributed Raft + 自動 rebalance；HA / backup / topology 全換&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Abstraction / paradigm&lt;/td>
 &lt;td>Single-node MVCC + transaction → distributed Serializable Snapshot Isolation (SSI)&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Number of components&lt;/td>
 &lt;td>同 1 個 DB cluster&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>Transaction retry pattern 必須改、ORM 可能需 patch&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>3 維 High + 1 維 Medium。按 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">methodology audit Step 5&lt;/a> 的多重歸類處理規則：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook, cross-linked to <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> and <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a>. It is a worked example of the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 multi-classification and tie-breaking</a> rule: a pairing with three High dimensions is not handled by "picking Type A, Type C, or Type E" but by <em>following Type E for the dominant dimension and adding separate sections for the other High dimensions</em>. Each phase transition is guarded by a <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a>.</p></blockquote>
<h2 id="三維皆-high決策矩陣">Three High dimensions: decision matrix</h2>
<p>Running the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> on PostgreSQL → CockroachDB:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Compatible with the PostgreSQL wire protocol, but parts of the SQL feature set are missing (recursive CTEs partial / window functions partial / extensions entirely absent)</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Single-node + Patroni → distributed Raft + automatic rebalancing; HA / backup / topology all change</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Single-node MVCC transactions → distributed Serializable Snapshot Isolation (SSI)</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>Still one DB cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Transaction retry pattern must change; the ORM may need a patch</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
<p>Three dimensions are High and one is Medium. Applying the multi-classification rule from <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">methodology audit Step 5</a>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text">Dominant-dimension reading (priority order): Schema &gt; Paradigm &gt; Operational &gt; Components

Applied here: Schema High + Paradigm High + Operational High
- Schema is High, but CRDB provides PostgreSQL wire protocol compatibility
- Paradigm is High: a fundamental *single-node → distributed* shift, and what readers care about most
- Operational is High, but is largely downstream of Paradigm

→ Main structure follows Paradigm (Type E); Schema + Operational become separate supplementary sections</code></pre></div><p>No single type label is forced: this article takes a multi-axis form, <em>Type E as the main structure plus Type A / C supplements for the other High dimensions</em>.</p>
<h2 id="結構-differentiatortype-e-主結構--多軸增補段">Structural differentiator: Type E main structure + multi-axis supplementary sections</h2>
<p>Compared with the five migration playbooks from the previous batch:</p>
<table>
  <thead>
      <tr>
          <th>Structural element</th>
          <th>Type A Splunk → Elastic</th>
          <th>Type B Redis → DragonflyDB</th>
          <th>Type C PostgreSQL → Aurora</th>
          <th>Type D Datadog → Grafana</th>
          <th>Type E Kafka ↔ NATS</th>
          <th><strong>This article (three High)</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Phased translation</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>partial</td>
      </tr>
      <tr>
          <td>Compatibility audit</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Operational redesign mapping</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td><strong>yes (separate section)</strong></td>
      </tr>
      <tr>
          <td>Schema gap mapping</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td><strong>yes (separate section)</strong></td>
      </tr>
      <tr>
          <td>Parallel streams</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Paradigm contrast</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Application redesign</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Long-term hybrid architecture</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>partial (some workloads)</td>
      </tr>
  </tbody>
</table>
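<p>This article is a hybrid of "Type E main structure + Type A schema gap section + Type C operational redesign section", with 9-10 sections and 260-300 lines.</p>
<h2 id="維度-1paradigm-shift主導">Dimension 1: Paradigm shift (dominant)</h2>
<p>CRDB is a <em>distributed SQL database</em>, not a "multi-node edition of PostgreSQL". The core differences:</p>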
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>PostgreSQL</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transaction isolation</td>
          <td>MVCC, Read Committed by default</td>
          <td>Serializable Snapshot Isolation (SSI), strongly consistent</td>
      </tr>
      <tr>
          <td>Transaction conflict</td>
          <td>First writer wins</td>
          <td>Retry-on-conflict; the application must handle the <code>40001</code> retry code</td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Streaming replication + standby</td>
          <td>Raft consensus: every write needs a quorum, with automatic rebalancing</td>
      </tr>
      <tr>
          <td>Partition</td>
          <td>Declarative partitioning (manual)</td>
          <td>Automatic range-based + locality-aware</td>
      </tr>
      <tr>
          <td>Latency p99</td>
          <td>1-10ms (single region)</td>
          <td>5-50ms (cross-AZ Raft quorum)</td>
      </tr>
      <tr>
          <td>Throughput limit</td>
          <td>Single-primary ceiling of roughly 10-50K TPS</td>
          <td>Scales roughly linearly by adding nodes, about 5K TPS per node</td>
      </tr>
  </tbody>
</table>
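<p>The key paradigm change: <em>a transaction is a retryable operation that is not guaranteed to commit on its first attempt</em> (each attempt is still atomic). All transaction code needs a retry loop; for client-side retries CRDB provides the <code>cockroach_restart</code> savepoint.</p>
<h2 id="維度-2schema-gappostgresql-features-crdb-不支援">Dimension 2: Schema gap (PostgreSQL features CRDB does not support)</h2>
<p>CRDB is billed as PostgreSQL-compatible, but <em>feature coverage is roughly 80-90%</em>; common gaps:</p>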
<table>
  <thead>
      <tr>
          <th>PostgreSQL feature</th>
          <th>CRDB status</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stored procedure / function (PL/pgSQL)</td>
          <td>Limited (partially supported from CRDB 22.2)</td>
          <td>Must be audited and rewritten within the migration scope</td>
      </tr>
      <tr>
          <td>Common Table Expression (CTE) recursive</td>
          <td>Limited (depth + structure)</td>
          <td>Complex CTEs may not run; queries must be refactored</td>
      </tr>
      <tr>
          <td>Full window function set</td>
          <td>Partial</td>
          <td>Reporting queries need case-by-case verification</td>
      </tr>
      <tr>
          <td>Extensions (pg_repack / pgaudit / TimescaleDB)</td>
          <td><strong>Not supported</strong></td>
          <td>Use CRDB's own alternative or handle it in the application layer</td>
      </tr>
      <tr>
          <td>Triggers</td>
          <td>Limited</td>
          <td>Move audit / data-integrity logic to the application layer</td>
      </tr>
      <tr>
          <td>Custom types / domain</td>
          <td>Partial</td>
          <td>Use CHECK constraints instead</td>
      </tr>
      <tr>
          <td>Geographic types (PostGIS)</td>
          <td>CRDB native geo support (different syntax)</td>
          <td>Rewrite spatial queries</td>
      </tr>
      <tr>
          <td><code>SELECT FOR UPDATE</code> semantics</td>
          <td>Equivalent, but with a different underlying mechanism (distributed lock)</td>
          <td>Watch for different deadlock patterns</td>
      </tr>
      <tr>
          <td>Advisory locks</td>
          <td><strong>Not supported</strong></td>
          <td>Use another distributed lock on the application side (Redis / Consul)</td>
      </tr>
  </tbody>
</table>
<p>A migration must <em>first audit all SQL feature usage</em>, list the gaps, and then decide for each one whether to work around it or retire the feature.</p>
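<p>A first pass of that audit can be automated with a plain text scan over the SQL an application issues. The Go sketch below is deliberately crude and rests on assumptions: <code>gapPatterns</code> and <code>AuditSQL</code> are hypothetical names, the pattern list only mirrors a few rows of the gap table, and substring matching will miss dynamically built SQL and can produce false positives. It is a way to seed the gap list, not a replacement for testing queries against CRDB.</p>

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// gapPatterns maps an upper-case SQL fragment to the gap it signals.
// Illustrative only; extend it from the gap table for a real audit.
var gapPatterns = map[string]string{
	"CREATE EXTENSION": "extension",
	"CREATE TRIGGER":   "trigger",
	"PG_ADVISORY_LOCK": "advisory lock",
	"WITH RECURSIVE":   "recursive CTE",
	"LANGUAGE PLPGSQL": "PL/pgSQL routine",
}

// AuditSQL returns the gap labels found in each statement, keyed by the
// statement's index. Statements with no findings are omitted.
func AuditSQL(statements []string) map[int][]string {
	found := map[int][]string{}
	for i, stmt := range statements {
		upper := strings.ToUpper(stmt)
		for pattern, label := range gapPatterns {
			if strings.Contains(upper, pattern) {
				found[i] = append(found[i], label)
			}
		}
		sort.Strings(found[i]) // map iteration order is random; keep output stable
	}
	return found
}

func main() {
	statements := []string{
		"SELECT pg_advisory_lock(42)",
		"SELECT id FROM orders WHERE status = 'paid'",
		"CREATE EXTENSION IF NOT EXISTS postgis",
	}
	for idx, gaps := range AuditSQL(statements) {
		fmt.Println(idx, gaps)
	}
}
```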
<h2 id="維度-3operational-redesign">Dimension 3: Operational redesign</h2>
<p>CRDB's operational model is completely different:</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>PostgreSQL self-managed</th>
          <th>CRDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Patroni / Stolon + manual</td>
          <td><code>cockroach init</code> + automatic Raft formation</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Patroni + DCS + watchdog</td>
          <td>Built-in Raft, no single primary</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Patroni-managed, 15-60s</td>
          <td>Transparent Raft re-election, &lt; 5s</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest + WAL archive</td>
          <td><code>BACKUP ... INTO</code> (full + incremental; the older <code>BACKUP ... TO</code> form is deprecated)</td>
      </tr>
      <tr>
          <td>Restore</td>
          <td><code>pgBackRest restore</code> + PITR</td>
          <td><code>RESTORE FROM</code></td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Streaming + logical</td>
          <td>Built-in; no direct counterpart to logical replication</td>
      </tr>
      <tr>
          <td>Schema migration</td>
          <td><code>pg_dump</code> / Flyway / Liquibase</td>
          <td><code>cockroach sql</code> + online schema change (no lock)</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>pg_stat_* views + Prometheus exporter</td>
          <td>CRDB admin UI + Prometheus (different metric schema)</td>
      </tr>
      <tr>
          <td>Sizing</td>
          <td>Vertical scale (one large node)</td>
          <td>Horizontal scale (many smaller nodes)</td>
      </tr>
  </tbody>
</table>
<p>The SRE mental model has to be retrained from scratch: <em>no primary / no streaming lag / no standby promotion</em>.</p>
<h2 id="migration-流程混合形態">Migration flow (hybrid form)</h2>
<p>This is not a linear phased plan but a mix of <em>phased + parallel + partial</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text">Phase 0: scope assessment
  - List applications; separate "fits CRDB" from "stays on PostgreSQL"
  - SQL feature audit
  - Application transaction pattern audit

Phase 1: schema port + application rewrite
  - Convert DDL to CRDB syntax
  - Find alternatives for unsupported extensions
  - Add retry loops to application transaction code

Phase 2: dual-write period (some applications start using CRDB)
  - New applications use CRDB
  - Old applications stay on PostgreSQL
  - CDC bridge (Debezium → Kafka → CRDB consumer)

Phase 3: cut over the applications that fit
  - Each application cuts over independently
  - Not "the whole DB in one switch"

Phase 4: long-term hybrid architecture
  - Some workloads stay on PostgreSQL permanently (not suited to distributed)
  - CRDB runs the workloads that suit distributed operation</code></pre></div><p>Overall this takes 3-6 months and does not converge on an all-CRDB end state.</p>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1transaction-retry-沒處理application-大量-40001-error">Case 1: Transaction retries not handled, application floods with <code>40001</code> errors</h3>
<p><strong>Symptom</strong>: after cutover, 5-10% of application transactions report <code>restart transaction: TransactionRetryWithProtoRefreshError</code> and the business operation fails.</p>
<p><strong>Root cause</strong>: PostgreSQL Read Committed does not require the application to handle conflicts, whereas CRDB's serializable isolation requires <em>retry-on-conflict</em>; the application code has no retry loop.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go">// CRDB transaction with retry; needs imports database/sql, errors, time.
// Assumes the driver error exposes SQLState() (pgx and lib/pq both do).
func runTx(db *sql.DB, fn func(*sql.Tx) error) error {
    for retries := 0; retries &lt; 10; retries++ {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        if err = fn(tx); err == nil { // ... transaction logic ...
            err = tx.Commit()
        }
        if err == nil {
            return nil
        }
        _ = tx.Rollback() // ignored: returns ErrTxDone if Commit already ran
        var st interface{ SQLState() string }
        if !errors.As(err, &amp;st) || st.SQLState() != "40001" {
            return err // not a serialization failure: do not retry
        }
        time.Sleep(backoff(retries)) // backoff: caller-defined helper
    }
    return errors.New("transaction retries exhausted")
}</code></pre></div><p>At the framework level, CRDB-provided client libraries (cockroach-go / crdb-jdbc) ship a retry helper.</p>
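<h3 id="case-2extension-缺位application-feature-整段掉">Case 2: Missing extension, an entire application feature drops out</h3>
<p><strong>Symptom</strong>: after cutover, a geo-computation feature errors out because the PostGIS functions do not exist; the migration plan missed it.</p>
<p><strong>Root cause</strong>: CRDB's native geo support has different syntax and APIs, so the PostGIS extension cannot be moved over directly.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Always run an extension audit before migration</strong>: list everything in <code>pg_extension</code> and find the matching CRDB feature, or retire it</li>
<li><strong>PostGIS replacement</strong>: CRDB's native ST_* functions; part of the syntax lines up, but spatial indexes differ</li>
<li><strong>Features with no replacement</strong>: retire them, or evaluate keeping that part on PostgreSQL (hybrid architecture)</li>
</ol>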
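<h3 id="case-3sequential-pk-撞-raft-quorum-瓶頸">Case 3: Sequential primary keys create a single-range write hotspot</h3>
<p><strong>Symptom</strong>: after cutover, write throughput and latency fall short of expectations; cluster CPU stays below 30% while p99 write latency is high.</p>
<p><strong>Root cause</strong>: the application uses monotonically increasing primary keys (a sequence-backed <code>SERIAL</code>, or <code>AUTO_INCREMENT</code> in schemas that came from MySQL). CockroachDB stores adjacent keys in the <em>same range</em>, which is a single Raft group, so every new row lands on that one range; writes serialize there and adding nodes does not help.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Use randomly distributed keys or a hash-sharded index</strong>: <code>gen_random_uuid()</code> (UUID v4) scatters inserts across ranges. Time-ordered IDs such as UUID v7 or <code>unique_rowid()</code> keep rows sortable but still cluster new writes at the end of the key space, so pair them with a hash-sharded primary key (<code>USING HASH</code>) when ordering must be preserved</li>
<li><strong><code>PRIMARY KEY (region, id)</code></strong>: in multi-region or multi-tenant schemas, a leading region or tenant column splits the write load naturally</li>
<li><strong>Keep unsuitable workloads on PostgreSQL</strong>: not every schema benefits from a distributed database</li>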
</ol>
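<p>A toy simulation can make the hotspot visible. The sketch below cuts a 32-bit key space into eight equal "ranges" and counts where 1,000 new rows land. Real CockroachDB ranges split by size and load rather than into fixed slices, so treat <code>rangeOf</code> and every constant here as illustrative assumptions.</p>

```javascript
// Toy model: the key space is cut into `rangeCount` equal slices, one per "range".
function rangeOf(key, rangeCount, keySpace = 2 ** 32) {
  return Math.min(rangeCount - 1, Math.floor((key / keySpace) * rangeCount));
}

// Count how many writes each range receives for a list of keys.
function writesPerRange(keys, rangeCount) {
  const counts = new Array(rangeCount).fill(0);
  for (const key of keys) counts[rangeOf(key, rangeCount)] += 1;
  return counts;
}

const RANGES = 8;
const WRITES = 1000;

// Sequential IDs: a burst of new rows occupies one narrow, adjacent band of keys.
const firstId = 3e9;
const sequentialKeys = Array.from({ length: WRITES }, (_, i) => firstId + i);

// Random IDs: each new row can land anywhere in the key space.
const randomKeys = Array.from({ length: WRITES }, () => Math.floor(Math.random() * 2 ** 32));

const sequentialCounts = writesPerRange(sequentialKeys, RANGES);
const randomCounts = writesPerRange(randomKeys, RANGES);

console.log('sequential:', sequentialCounts);
console.log('random:    ', randomCounts);
```

<p>With sequential keys a single slice absorbs the whole burst, while random keys spread it across slices. That skew is what shows up in production as high write latency on an otherwise idle cluster.</p>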
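<h3 id="case-4long-transaction-對-raft-衝擊">Case 4: Long transactions and the contention they cause</h3>
<p><strong>Symptom</strong>: transactions that run longer than a minute (batch processing, large ETL) retry heavily and eventually fail; during the same window, the retry rate of unrelated short transactions also rises.</p>
<p><strong>Root cause</strong>: a long CockroachDB transaction holds write intents on every range it touches and blocks other transactions. Under serializable isolation, the chance of a conflict grows with how long the transaction stays open and how many keys it touches.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Break long transactions up</strong>: run the batch as many short transactions and keep the checkpoint in the application layer</li>
<li><strong>Keep heavy ETL off CockroachDB</strong>: export through CockroachDB CDC to an OLAP store (Snowflake / BigQuery) and run the batch there</li>
<li><strong>Use historical reads for long read-only work</strong>: <code>AS OF SYSTEM TIME</code> queries read a past snapshot, so they do not contend with in-flight writes and can be served as follower reads; a good fit for reporting</li>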
</ol>
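<p>The "many short transactions plus an application-level checkpoint" pattern can be sketched as below. <code>runInChunks</code>, <code>processChunk</code>, and <code>saveCheckpoint</code> are hypothetical names; the in-memory arrays stand in for the database and the checkpoint store.</p>

```javascript
// Process `items` in chunks; each chunk is meant to map to one short transaction.
// `processChunk` and `saveCheckpoint` are hypothetical callbacks supplied by the caller.
function runInChunks(items, chunkSize, processChunk, saveCheckpoint, startAt = 0) {
  if (chunkSize <= 0) throw new RangeError('chunkSize must be positive');
  let offset = startAt;
  while (offset < items.length) {
    const chunk = items.slice(offset, offset + chunkSize);
    processChunk(chunk);      // one short transaction: BEGIN ... COMMIT
    offset += chunk.length;
    saveCheckpoint(offset);   // persisted by the application, outside the transaction
  }
  return offset;
}

// Demo with in-memory stand-ins for the database and the checkpoint store.
const rows = Array.from({ length: 10 }, (_, i) => i + 1);
const committedChunks = [];
let checkpoint = 0;

runInChunks(
  rows,
  4,
  (chunk) => committedChunks.push(chunk),
  (offset) => { checkpoint = offset; },
);
console.log(committedChunks.length, 'chunks committed, checkpoint at', checkpoint);
```

<p>If the process dies after a commit but before the checkpoint is saved, the last chunk runs again on resume, so chunk processing should be idempotent (for example, upserts keyed by primary key).</p>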
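<h3 id="case-5backup--restore-行為跟-postgresql-不同sre-runbook-失效">Case 5: Backup / restore behaves differently from PostgreSQL and the SRE runbook stops working</h3>
<p><strong>Symptom</strong>: a DBA tries <code>pg_restore</code> and it fails because the CockroachDB backup format is entirely different; incident response stalls for 1-2 hours.</p>
<p><strong>Root cause</strong>: CockroachDB backups use a <em>cluster-internal format</em> that PostgreSQL tooling cannot read; the SRE runbook still assumes a PostgreSQL world, so the team's mental model is wrong exactly when it matters.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Rewrite the runbook</strong>: document the CockroachDB-specific backup / restore flow and train the SRE team on it</li>
<li><strong>DR drill</strong>: run a full DR drill before cutover, completed with CockroachDB tooling only, without leaning on PostgreSQL habits</li>
<li><strong>Multi-region backup</strong>: configure cross-region backups so that a single-region failure does not take the backups with it</li>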
</ol>
<h2 id="capacity-規劃">Capacity planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PostgreSQL self-managed</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-node ceiling</td>
          <td>~10-50K TPS (scale vertically to 32-128 vCPU)</td>
          <td>~5K TPS per node (scale horizontally by adding nodes)</td>
      </tr>
      <tr>
          <td>Cross-region</td>
          <td>High-latency cross-region streaming replication</td>
          <td>Native by design; locality-aware queries</td>
      </tr>
      <tr>
          <td>Sharding</td>
          <td>Manual partitioning / pg_partman</td>
          <td>Automatic, range-based</td>
      </tr>
      <tr>
          <td>Storage footprint</td>
          <td>Unchanged (baseline)</td>
          <td>3x across nodes (Raft replication, 3 replicas by default)</td>
      </tr>
      <tr>
          <td>Total cost (10 TB)</td>
          <td>$2-4K USD / month (self-managed)</td>
          <td>$5-10K USD / month (CRDB Cloud + 3x storage)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading the table</strong>: CockroachDB costs significantly more, so choosing it has to be driven by a <em>paradigm requirement</em> (distributed transactions / multi-region / linear scale-out); if the goal is only lower cost or better availability, <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">Aurora</a> is the better deal.</p>
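<p>The table's ballpark figures can be turned into a first-pass sizing estimate. This is a rough sketch that assumes throughput scales linearly with node count, which real workloads rarely do; <code>estimateCluster</code> is a hypothetical helper, and the defaults (5K TPS per node, replication factor 3) are taken from the table and should be replaced with your own benchmark numbers.</p>

```javascript
// Rough sizing from the capacity table's ballpark figures.
// Every default here is an assumption; substitute measured numbers.
function estimateCluster({ targetTps, dataTb, tpsPerNode = 5000, replicationFactor = 3 }) {
  const nodesForThroughput = Math.ceil(targetTps / tpsPerNode);
  // Each range needs `replicationFactor` replicas on distinct nodes,
  // so the cluster can never be smaller than that.
  const nodes = Math.max(nodesForThroughput, replicationFactor);
  const rawStorageTb = dataTb * replicationFactor;
  return {
    nodes,
    rawStorageTb,
    storagePerNodeTb: rawStorageTb / nodes,
  };
}

const midSize = estimateCluster({ targetTps: 20000, dataTb: 10 });
const small = estimateCluster({ targetTps: 3000, dataTb: 10 });
console.log(midSize);
console.log(small);
```

<p>The small case is the instructive one: a workload that one PostgreSQL node handles comfortably still needs three CockroachDB nodes and three times the storage, which is where the cost gap in the table comes from.</p>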
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-postgresql--aurora-migration-對比">Comparison with <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a></h3>
<p>Two ways out of self-managed PostgreSQL:</p>
<ul>
<li><strong>Aurora</strong>: operational simplification, protocol drop-in, moderate cost increase; suits production systems that <em>do not need distributed transactions</em></li>
<li><strong>CRDB</strong>: a shift to a distributed paradigm, application changes required, significant cost increase; suits workloads that <em>genuinely need to be distributed</em></li>
</ul>
<p>Most applications do not need distributed transactions, so Aurora is the more reasonable choice; go to CockroachDB only when you truly need cross-region strong consistency or linear scale-out by adding nodes.</p>
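<h3 id="跟-application-transaction-pattern-重設計">Redesigning application transaction patterns</h3>
<p>CockroachDB forces the application to change its transaction code: a retry loop is mandatory. Shifting the team's mental model is the main migration effort; the purely technical part is comparatively small.</p>
<h3 id="下一步議題">Topics for follow-up</h3>
<ul>
<li><strong>CRDB → PostgreSQL reverse migration</strong>: if the business simplifies and distribution is no longer needed, migrating back is expensive; in practice CockroachDB behaves as a <em>single-direction lock-in</em></li>
<li><strong>CRDB Serverless</strong>: low entry cost and a good fit for bursty workloads; steady workloads still belong on a dedicated cluster</li>
<li><strong>Multi-region active-active</strong>: CockroachDB's real strength, but network cost climbs steeply; the ROI tends to work only for customers such as finance and government</li>
</ul>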
<h2 id="相關連結">相關連結</h2>
<ul>
<li>Source / target vendor：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> / <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a></li>
<li>對位 migration：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a>（另一條 PostgreSQL 出路）</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content 結構由最大差異維度決定</a>（本文驗證 <em>多重歸類 multi-axis 處理</em>）</li>
</ul>
]]></content:encoded></item><item><title>JMeter → k6：k6 不是 JMeter 的「script 版本」、是 VU model 取代 thread model</title><link>https://tarrragon.github.io/blog/backend/09-performance-capacity/vendors/k6/migrate-from-jmeter/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/09-performance-capacity/vendors/k6/migrate-from-jmeter/</guid><description>&lt;p>k6 不是 JMeter 的 &lt;em>「script 版本」&lt;/em>。&lt;/p>
&lt;p>這個誤解是 JMeter → k6 migration 第一週最常見的事故來源。Migration 啟動會議常聽到「JMeter 的 thread group 翻成 k6 的 VU 就好了吧」、然後團隊把 &lt;code>.jmx&lt;/code> 內 100 thread → k6 &lt;code>vus: 100&lt;/code>、跑下去發現 RPS 差三倍、p95 延遲表完全不同形狀、以為 k6 壞了。&lt;/p>
&lt;p>實際上 k6 的 &lt;em>Virtual User (VU)&lt;/em> 跟 JMeter 的 &lt;em>Thread&lt;/em> 是 &lt;em>兩種不同的使用者行為建模方式&lt;/em>：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>JMeter Thread&lt;/strong>：一個 OS thread = 一個 user、&lt;code>numThreads=100&lt;/code> 就 &lt;em>固定 100 個 concurrent 使用者一直跑&lt;/em>、ramp-up period 控制怎麼啟動、無 explicit arrival rate 概念&lt;/li>
&lt;li>&lt;strong>k6 VU&lt;/strong>：一個 goroutine-like execution context、預設 &lt;code>vus&lt;/code> 是 &lt;em>concurrent VU pool&lt;/em>、但 k6 更推薦用 &lt;code>arrival-rate executor&lt;/code> — 直接表達 &lt;em>每秒進來幾個 request&lt;/em>、VU 是 &lt;em>為了達到 arrival rate 動態起的 worker&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>差別在 &lt;em>測量視角&lt;/em>：JMeter 預設視角是 &lt;em>「我有 100 個使用者在用系統」&lt;/em>、k6 預設視角是 &lt;em>「我每秒有 N 個請求進來」&lt;/em>。兩種視角下 &lt;em>同一個系統的瓶頸結果完全不同&lt;/em>：100 concurrent user 模型在 server 慢時 throughput 會自動降（user 等回應）、100 RPS arrival rate 模型在 server 慢時 queue 會累積、暴露 &lt;em>真實 production behavior&lt;/em>（user 不會體諒、會繼續送請求）。&lt;/p>
&lt;p>這篇 migration playbook 不是 schema translation 文（&lt;code>.jmx&lt;/code> 翻成 &lt;code>.js&lt;/code> 只是表面）、是 &lt;em>paradigm shift&lt;/em> — 從 closed-system model（thread）到 open-system model（arrival rate）的視角轉換。&lt;/p>
&lt;h2 id="為什麼是-type-eschema--paradigm-同-high">為什麼是 Type E（schema + paradigm 同 High）&lt;/h2>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#6-%e7%b6%ad-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit&lt;/a>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>評&lt;/th>
 &lt;th>說明&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema&lt;/td>
 &lt;td>High&lt;/td>
 &lt;td>&lt;code>.jmx&lt;/code> XML vs JavaScript scenario、test plan 完全不同 file format / DSL&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;td>CLI / distributed run 接近、CI integration 差別大、distributed runner 模型不同&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Paradigm&lt;/td>
 &lt;td>High&lt;/td>
 &lt;td>thread group closed model → arrival rate open model、測試思維不同&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Components&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;td>都是 load test runner、no multi-tool decomposition&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>App change&lt;/td>
 &lt;td>N/A&lt;/td>
 &lt;td>是 test code、不是 production code&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Topology&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;td>都是 CLI / runner 跑、無 sharding&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Schema High + Paradigm High 兩軸 High。按優先序 Schema &amp;gt; Paradigm、預設選 Type A。但對 JMeter → k6 的讀者來說、&lt;em>paradigm shift 才是難關&lt;/em> — schema translation 是工作量、但搞錯 paradigm 會讓 migration 後的測試結果 &lt;em>跟 production 不對應&lt;/em>。所以選 &lt;strong>Type E paradigm shift&lt;/strong> 結構、schema translation 抽出 Phase 1-2 補充。&lt;/p></description><content:encoded><![CDATA[<p>k6 不是 JMeter 的 <em>「script 版本」</em>。</p>
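<p>This misunderstanding is the most common source of incidents in the first week of a JMeter → k6 migration. At the kickoff meeting someone usually says "just translate the JMeter thread group into k6 VUs"; the team then maps the 100 threads in the <code>.jmx</code> to k6 <code>vus: 100</code>, runs it, sees RPS off by a factor of three and a p95 latency curve with a completely different shape, and concludes that k6 is broken.</p>
<p>In reality a k6 <em>Virtual User (VU)</em> and a JMeter <em>Thread</em> are <em>two different ways of modeling user behavior</em>:</p>
<ul>
<li><strong>JMeter Thread</strong>: one OS thread = one user; <code>numThreads=100</code> means <em>a fixed 100 concurrent users looping continuously</em>; the ramp-up period controls how they start; the thread group has no explicit notion of arrival rate</li>
<li><strong>k6 VU</strong>: a goroutine-like execution context; plain <code>vus</code> is a <em>pool of concurrent VUs</em>, but k6 also offers arrival-rate executors (<code>constant-arrival-rate</code> / <code>ramping-arrival-rate</code>) that express <em>how many iterations start per second</em> directly, with VUs acting as <em>workers allocated to sustain that rate</em></li>
</ul>
<p>The difference is the <em>measurement perspective</em>: JMeter's default view is <em>"I have 100 users using the system"</em>, while the arrival-rate view is <em>"N requests arrive every second"</em>. Under the two views <em>the same system shows different bottlenecks</em>: with 100 concurrent users, throughput drops automatically when the server slows down (users wait for responses); with a 100 RPS arrival rate, a slow server accumulates a queue, which exposes <em>real production behavior</em> (users do not wait politely, they keep sending requests).</p>
<p>This migration playbook is therefore not a schema translation guide (turning <code>.jmx</code> into <code>.js</code> is only the surface); it is about a <em>paradigm shift</em> from a closed-system model (threads) to an open-system model (arrival rate).</p>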
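<p>The two views can be compared with a few lines of arithmetic based on Little's Law. This is a back-of-the-envelope sketch, not a simulation of either tool; <code>closedModelThroughput</code> and <code>openModelInFlight</code> are hypothetical helper names, and each user or iteration is assumed to issue exactly one request.</p>

```javascript
// Closed model: N users each loop "send, wait for the response, think".
// Little's Law gives the throughput: X = N / (R + Z).
function closedModelThroughput(users, responseSec, thinkSec = 0) {
  return users / (responseSec + thinkSec);
}

// Open model: requests arrive at a fixed rate no matter how the server is doing.
// Requests in flight (the concurrency the server must absorb) = rate * R.
function openModelInFlight(ratePerSec, responseSec) {
  return ratePerSec * responseSec;
}

// Same system, three response times: watch what each model does as it slows down.
for (const responseSec of [0.1, 0.5, 2.0]) {
  const closedRps = closedModelThroughput(100, responseSec);
  const inFlight = openModelInFlight(100, responseSec);
  console.log(
    `R=${responseSec}s  closed (100 users): ${closedRps.toFixed(0)} req/s  ` +
    `open (100 req/s): ${inFlight.toFixed(0)} in flight`,
  );
}
```

<p>In the closed model the offered load shrinks as the server slows down, which hides the problem; in the open model the load stays constant and the backlog grows instead.</p>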
<h2 id="為什麼是-type-eschema--paradigm-同-high">Why this is Type E (schema and paradigm both High)</h2>
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#6-%e7%b6%ad-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Rating</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>High</td>
          <td><code>.jmx</code> XML vs JavaScript scenarios; the test plan uses a completely different file format / DSL</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Medium</td>
          <td>CLI usage is similar; CI integration differs a lot and the distributed runner model is different</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>High</td>
          <td>Thread-group closed model → arrival-rate open model; a different way of thinking about tests</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low</td>
          <td>Both are load test runners; no multi-tool decomposition</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>N/A</td>
          <td>This is test code, not production code</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low</td>
          <td>Both run from a CLI / runner; no sharding</td>
      </tr>
  </tbody>
</table>
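<p>Schema and Paradigm both rate High. By the default precedence (Schema &gt; Paradigm) this would be Type A. For readers moving from JMeter to k6, though, <em>the paradigm shift is the hard part</em>: schema translation is just labor, whereas getting the paradigm wrong produces post-migration test results that <em>do not correspond to production</em>. Hence the <strong>Type E paradigm shift</strong> structure, with schema translation folded into Phases 1-2 as supporting material.</p>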
<h2 id="driverdeveloper-ergonomic--ci-gate-friendly">Driver: developer ergonomics + CI-gate friendliness</h2>
<p>The core pull from JMeter toward k6 is <em>developer ergonomics plus CI friendliness</em>:</p>
<ul>
<li><strong><code>.jmx</code> XML diffs are unreadable in git</strong>: the diff between two <code>.jmx</code> PRs is mostly XML attribute-reordering noise, and a reviewer cannot tell what logic changed; JavaScript is plain source text, so the PR diff is directly readable</li>
<li><strong>GUI learning curve</strong>: the JMeter GUI is not a modern IDE, and an engineer unfamiliar with it can spend half a day finding the right sampler and listener for one scenario; JavaScript works in the IDE people already use (VS Code / IntelliJ), with autocomplete, lint, and formatting</li>
<li><strong>CI integration effort</strong>: JMeter in CI needs plugin packaging, non-GUI mode, and a JTL result parser; k6 is just <code>k6 run script.js</code>, results come out as JSON / Prometheus metrics, and threshold pass/fail maps directly to the exit code</li>
<li><strong>Single-machine VU capacity</strong>: JMeter typically handles ~500-1000 threads per machine (bounded by the JVM and OS thread limits); k6 can run 30K-50K VUs per machine (Go runtime + goroutines), depending on script complexity and hardware, which reduces the need for distributed runners</li>
<li><strong>Workload model expressiveness</strong>: k6's arrival-rate executors (<code>constant-arrival-rate</code> / <code>ramping-arrival-rate</code>), <code>ramping-vus</code>, and <code>constant-vus</code> map directly to the <em>open system / ramping / closed system</em> views, whereas JMeter needs a combination of Constant Throughput Timer + Synchronizing Timer + thread group to get there</li>
</ul>
<p>This driver has no pull in an org where <em>a QA team maintains .jmx assets through the GUI</em> (there the GUI is an advantage), but it is a strong pull where <em>dev / SRE write performance tests into CI</em>. Different audience, completely different migration value.</p>
<h2 id="4-phase-partial-migration不收斂">4-phase partial migration (non-converging)</h2>
<p>A defining trait of Type E is that it <em>does not converge</em>: most orgs never retire every <code>.jmx</code> and instead settle at some phase as a hybrid:</p>
<h3 id="phase-1學會-k6-paradigm不寫實際-test">Phase 1: Learn the k6 paradigm (no real tests yet)</h3>
<p>Write a throwaway script against a production-like API, not to migrate anything but to get the k6 paradigm straight:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kr">import</span> <span class="nx">http</span> <span class="nx">from</span> <span class="s1">&#39;k6/http&#39;</span><span class="p">;</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kr">import</span> <span class="p">{</span> <span class="nx">check</span> <span class="p">}</span> <span class="nx">from</span> <span class="s1">&#39;k6&#39;</span><span class="p">;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="kr">export</span> <span class="kr">const</span> <span class="nx">options</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  <span class="c1">// 不要用 vus: 100、用 arrival rate
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span>  <span class="nx">scenarios</span><span class="o">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="nx">open_model</span><span class="o">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">      <span class="nx">executor</span><span class="o">:</span> <span class="s1">&#39;constant-arrival-rate&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">      <span class="nx">rate</span><span class="o">:</span> <span class="mi">100</span><span class="p">,</span>           <span class="c1">// 每秒 100 request
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span>      <span class="nx">timeUnit</span><span class="o">:</span> <span class="s1">&#39;1s&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">      <span class="nx">duration</span><span class="o">:</span> <span class="s1">&#39;5m&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">      <span class="nx">preAllocatedVUs</span><span class="o">:</span> <span class="mi">200</span><span class="p">,</span> <span class="c1">// 預先準備 VU 數
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span>      <span class="nx">maxVUs</span><span class="o">:</span> <span class="mi">500</span><span class="p">,</span>          <span class="c1">// 上限
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span>    <span class="p">},</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="nx">thresholds</span><span class="o">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="nx">http_req_duration</span><span class="o">:</span> <span class="p">[</span><span class="s1">&#39;p(95)&lt;500&#39;</span><span class="p">],</span> <span class="c1">// p95 &lt; 500ms
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span>    <span class="nx">http_req_failed</span><span class="o">:</span> <span class="p">[</span><span class="s1">&#39;rate&lt;0.01&#39;</span><span class="p">],</span>   <span class="c1">// 失敗率 &lt; 1%
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"></span>  <span class="p">},</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="p">};</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="kr">export</span> <span class="k">default</span> <span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">  <span class="kr">const</span> <span class="nx">res</span> <span class="o">=</span> <span class="nx">http</span><span class="p">.</span><span class="nx">get</span><span class="p">(</span><span class="s1">&#39;https://api.example.com/orders&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">  <span class="nx">check</span><span class="p">(</span><span class="nx">res</span><span class="p">,</span> <span class="p">{</span> <span class="s1">&#39;status 200&#39;</span><span class="o">:</span> <span class="p">(</span><span class="nx">r</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="nx">r</span><span class="p">.</span><span class="nx">status</span> <span class="o">===</span> <span class="mi">200</span> <span class="p">});</span>
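</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>Compare this with the shape of the same test written as a <code>.jmx</code>, and work out <em>why arrival rate and thread group measure different things</em>. The goal of this phase is <em>internalizing the paradigm</em>, not producing a migration artifact. Everyone on the team who writes performance tests has to go through it; it cannot be skipped.</p>
<p>Done when: the author can explain clearly how "an arrival rate of 100/s for 5 minutes" differs in production behavior from "100 threads with a 5-minute ramp-up".</p>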
<h3 id="phase-2高價值-critical-path-改-k6gui-留-jmeter">Phase 2: Move high-value critical paths to k6 (GUI-maintained tests stay in JMeter)</h3>
<p>Rewrite only the 1-3 scenarios that are <em>run most often and matter most</em>; do not convert everything at once. Typical candidates:</p>
<ul>
<li>Pre-release smoke test (baseline check of the core APIs)</li>
<li>Nightly regression (the recurring performance gate over that day's commits)</li>
<li>Peak readiness rehearsal scenario (the stress test run at T-7 before an event)</li>
</ul>
<p>The <code>.jmx</code> files maintained by the GUI / QA team <em>stay untouched</em>: they are usually multi-protocol (JDBC / JMS / FTP) and outside the scope k6 is suited for.</p>
<p>Main work items:</p>
<ul>
<li>A <em>paradigm-correct</em> translation of <code>.jmx</code> thread groups into k6 scenario executors (not a field-by-field translation)</li>
<li>Translating HTTP requests and assertions (payload / headers / cookies)</li>
<li>CSV data sources (JMeter CSV Data Set Config) → k6 <code>SharedArray</code> loaded from JSON</li>
<li>A different result output schema (JTL in CSV or XML → JSON / Prometheus / k6 Cloud)</li>
<li>Redoing CI integration (GitHub Actions / GitLab CI call <code>k6 run</code> directly; no packaging needed)</li>
</ul>
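<p>For the CSV data source item, one option is a one-off conversion step that turns the JMeter CSV into a JSON file. The sketch below is a deliberately minimal parser run under Node, not inside k6; <code>csvToRecords</code> is a hypothetical helper and it assumes a header row with no quoted fields or embedded commas.</p>

```javascript
// Minimal CSV parser for simple test data: the first row is the header,
// and fields contain no quotes, commas, or newlines.
// Use a real CSV library for anything richer.
function csvToRecords(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) return [];
  const headers = lines[0].split(',').map((header) => header.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(',');
    const record = {};
    headers.forEach((header, i) => {
      // Missing trailing cells become empty strings rather than undefined.
      record[header] = cells[i] === undefined ? '' : cells[i].trim();
    });
    return record;
  });
}

// Demo input in the shape a CSV Data Set Config file often has.
const users = csvToRecords('username,password\nalice,s3cret\nbob,hunter2\n');
console.log(JSON.stringify(users, null, 2));
```

<p>Inside the k6 script, the resulting file can then be loaded once and shared across VUs with <code>new SharedArray('users', () =&gt; JSON.parse(open('./users.json')))</code>, where <code>SharedArray</code> comes from <code>k6/data</code> and the file name is an assumption.</p>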
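<p>Done when: the k6 baseline for each critical path matches the <code>.jmx</code> baseline (p50 / p95 / throughput within a 10% margin), and where behavior differs, the team can tell whether the cause is the paradigm difference or a bug.</p>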
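<p>The 10% criterion is easy to automate. A minimal sketch, assuming both tools' results have already been reduced to the same metric names; <code>compareBaselines</code> and the sample numbers are illustrative, not measurements.</p>

```javascript
// Relative difference of each metric against the JMeter baseline.
// Both inputs are plain objects with the same numeric metrics.
function compareBaselines(jmeter, k6, tolerance = 0.10) {
  const report = {};
  for (const metric of Object.keys(jmeter)) {
    const diff = Math.abs(k6[metric] - jmeter[metric]) / jmeter[metric];
    report[metric] = { diff, withinTolerance: diff <= tolerance };
  }
  return report;
}

// Made-up numbers: latencies in ms, throughput in requests per second.
const jmeterBaseline = { p50: 120, p95: 480, throughput: 950 };
const k6Baseline = { p50: 126, p95: 610, throughput: 940 };

const report = compareBaselines(jmeterBaseline, k6Baseline);
for (const [metric, { diff, withinTolerance }] of Object.entries(report)) {
  const verdict = withinTolerance ? 'ok' : 'INVESTIGATE';
  console.log(`${metric}: ${(diff * 100).toFixed(1)}% ${verdict}`);
}
```

<p>A metric outside the tolerance is a prompt to investigate, not an automatic failure: a p95 gap with matching p50 and throughput often points at pacing or executor choice rather than a translation bug.</p>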
<h3 id="phase-3qa-團隊雙工具技能hybrid-穩定形態">Phase 3: Two tools, two skill sets (the stable hybrid state)</h3>
<p>Many orgs stop here: the QA team maintains multi-protocol <code>.jmx</code> files through the GUI (covering JDBC / JMS / LDAP / SOAP / FTP), while dev / SRE maintain HTTP / gRPC / WebSocket performance tests in k6 as part of CI. A two-tool stack is not a broken state; it is <em>not-converged-by-design</em>.</p>
<p>Main work items in this phase:</p>
<ul>
<li>Documentation: which kinds of test go to k6, which go to JMeter, with the decision tree written into the team handbook</li>
<li>Result integration: metrics from both tools land in the same Grafana dashboard (k6 → Prometheus directly, JMeter → InfluxDB / Prometheus exporter)</li>
<li>Release gates rely mainly on k6 (direct CI integration); JMeter serves manual QA campaigns and multi-protocol scenarios</li>
</ul>
<p>Most orgs never enter Phase 4.</p>
<h3 id="phase-4jmeter-退役少見">Phase 4: Retiring JMeter (rare)</h3>
<p>JMeter can be fully retired only when <em>every protocol has moved to a k6 extension</em> or <em>multi-protocol coverage has been given up</em>. Common paths:</p>
<ul>
<li>Fill protocol gaps with xk6 extensions (xk6-sql for JDBC, xk6-kafka for Kafka, xk6-amqp for RabbitMQ, xk6-mqtt for MQTT)</li>
<li>Assess each extension's maturity and community support: the xk6 ecosystem is much smaller than JMeter's plugin ecosystem</li>
<li>Accept that part of the legacy <code>.jmx</code> suite is simply deprecated (covered by integration tests instead of load tests)</li>
</ul>
<p>Done when: every protocol can be expressed in k6 + xk6 and all <code>.jmx</code> files are archived.</p>
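<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-thread-group--vu-直接翻譯最常見phase-2-必踩">1. Translating thread group → VU one-to-one (the most common; nearly guaranteed in Phase 2)</h3>
<p>Mapping <code>numThreads=100</code> to <code>vus: 100</code> and calling it done yields RPS that disagrees with JMeter and a p95 with a completely different shape. Both are <em>closed models</em>: a JMeter thread and a k6 VU each wait for the response before sending the next request. The <em>throughput</em> gap comes from pacing: a JMeter plan usually carries timers (think time, Constant Throughput Timer), while a k6 default function without <code>sleep()</code> starts the next iteration the moment the previous one ends, so the request rate is governed by response time alone.</p>
<p>Fix:</p>
<ul>
<li>Instead of <code>vus: N</code>, use <code>constant-arrival-rate</code> or <code>ramping-arrival-rate</code> and state <em>requests per second</em> directly</li>
<li>If a closed model is required (to compare against a pre-existing JMeter scenario), add <code>sleep(thinkTime)</code> inside the default function to mimic JMeter's think time</li>
</ul>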
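<p>When a closed model has to reproduce a known request rate, the think time follows from Little's Law. A minimal sketch, assuming one request per iteration; <code>thinkTimeFor</code> and <code>closedModelOptions</code> are hypothetical helpers, and in an actual k6 script the options object would be exported as <code>options</code> and the think time passed to <code>sleep()</code>.</p>

```javascript
// Closed model: X = N / (R + Z)  =>  Z = N / X - R.
// Returns the think time (seconds) each VU needs per iteration to hit `targetRps`.
function thinkTimeFor(vus, targetRps, avgResponseSec) {
  const thinkSec = vus / targetRps - avgResponseSec;
  if (thinkSec < 0) {
    throw new RangeError('not enough VUs to reach targetRps at this response time');
  }
  return thinkSec;
}

// Scenario options for a closed-model run using the constant-vus executor.
function closedModelOptions(vus, duration) {
  return {
    scenarios: {
      closed_model: { executor: 'constant-vus', vus, duration },
    },
  };
}

// 100 VUs, 50 req/s target, 200 ms average response time.
const thinkSec = thinkTimeFor(100, 50, 0.2);
const closedOptions = closedModelOptions(100, '5m');
console.log('sleep per iteration (s):', thinkSec);
console.log(JSON.stringify(closedOptions, null, 2));
```

<p>The calculation also explains the "RPS off by three" symptom: with zero think time the same 100 VUs would attempt 100 / 0.2 = 500 req/s, far above what the paced JMeter plan produced.</p>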
<h3 id="2-arrival-rate-vs-concurrent-vu-混淆">2. Confusing arrival rate with concurrent VUs</h3>
<p>On an arrival-rate executor, <code>rate: 100</code> means <em>100 iterations start every second</em>, and <code>preAllocatedVUs: 200</code> means <em>a pool of 200 VU workers is created up front</em>. By Little's Law, the VUs in use are roughly rate × response time: if the service slows down (response time drifting from 100 ms to 500 ms), demand jumps from 100/s × 0.1 s = 10 to 100/s × 0.5 s = 50 VUs. Once demand exceeds the pre-allocated pool and <code>maxVUs</code>, k6 logs an insufficient-VUs warning, iterations are dropped, and the actual arrival rate falls short of the spec.</p>
<p>Fix:</p>
<ul>
<li>Set <code>preAllocatedVUs</code> to <code>maxVUs / 2</code></li>
<li>Set <code>maxVUs</code> to <code>rate * worst_case_response_time_seconds * 5</code> (a 5x safety margin)</li>
<li>Monitor the <code>dropped_iterations</code> metric: it should stay at 0, and any value &gt; 0 means the worker pool is too small</li>
</ul>
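<p>The sizing rule above fits in a small helper. A minimal sketch of the article's heuristic (5x margin, half pre-allocated); <code>sizeVuPool</code> and <code>arrivalRateScenario</code> are hypothetical names, while the option keys are the ones already used in the Phase 1 script.</p>

```javascript
// Pool sizing per the rule above:
// maxVUs = rate * worst-case response time * safety factor, preAllocatedVUs = maxVUs / 2.
function sizeVuPool(ratePerSec, worstCaseResponseSec, safetyFactor = 5) {
  const maxVUs = Math.ceil(ratePerSec * worstCaseResponseSec * safetyFactor);
  const preAllocatedVUs = Math.ceil(maxVUs / 2);
  return { preAllocatedVUs, maxVUs };
}

// Builds a constant-arrival-rate scenario with the pool sized from the expected worst case.
function arrivalRateScenario(ratePerSec, duration, worstCaseResponseSec) {
  return {
    executor: 'constant-arrival-rate',
    rate: ratePerSec,
    timeUnit: '1s',
    duration,
    ...sizeVuPool(ratePerSec, worstCaseResponseSec),
  };
}

// 100 iterations per second, worst-case response time of 500 ms.
const scenario = arrivalRateScenario(100, '5m', 0.5);
console.log(JSON.stringify(scenario, null, 2));
```

<p>Deriving the pool from the worst-case response time rather than the typical one is the point: the pool is only stressed when the service is already slow, which is exactly when a shortfall would distort the measurement.</p>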
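<h3 id="3-protocol-gapk6-沒原生對應-jmeter-的部分">3. Protocol gap (JMeter samplers with no native k6 counterpart)</h3>
<p>k6 natively supports HTTP/1.1, HTTP/2, gRPC, and WebSocket; SSE requires an extension (xk6-sse). There is <strong>no</strong> native support for:</p>
<ul>
<li>JDBC (requires the xk6-sql extension)</li>
<li>JMS (no direct equivalent; xk6-amqp / xk6-kafka help only when the broker speaks AMQP or Kafka)</li>
<li>LDAP (no extension; needs an external LDAP client)</li>
<li>FTP (no extension)</li>
<li>SMTP / IMAP / POP3 (no extension)</li>
<li>SOAP (hand-written XML bodies through the HTTP module; no helper)</li>
</ul>
<p>If the <code>.jmx</code> uses any of these protocols, assess the maturity of the relevant xk6 extension (GitHub stars, recent commits, issue volume); if it is not mature enough, leave those tests in JMeter.</p>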
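<h3 id="4-結果輸出-schema-改變result-post-processing-全部要重寫">4. The result output schema changes (all result post-processing has to be rewritten)</h3>
<p>JMeter writes per-sample JTL files (CSV by default, XML optionally) and post-processes them through listeners. k6 prints a summary to stdout by default, with optional JSON / CSV / Prometheus / k6 Cloud outputs. If there is an existing <em>result analysis pipeline</em> (pulling data from JTL into a BI tool to produce trend charts), it has to be rewritten in Phase 2.</p>
<p>Fix:</p>
<ul>
<li>Evaluate going straight to Prometheus + Grafana (supported by k6) to replace the existing BI dashboard</li>
<li>Or write a transformation script from k6 JSON output into the in-house BI tool</li>
</ul>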
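<p>For the transformation-script route, k6's JSON output is line-delimited: each line is an object whose <code>type</code> is either <code>Metric</code> (a declaration) or <code>Point</code> (one sample, with the value under <code>data.value</code>). The sketch below extracts one metric and computes a nearest-rank percentile; <code>metricValues</code>, <code>percentile</code>, and the sample lines are illustrative, and k6's own summary may interpolate percentiles differently.</p>

```javascript
// Collect the sample values of one metric from k6's line-delimited JSON output.
function metricValues(ndjson, metricName) {
  const values = [];
  for (const line of ndjson.split('\n')) {
    if (line.trim() === '') continue;
    const entry = JSON.parse(line);
    if (entry.type === 'Point' && entry.metric === metricName) {
      values.push(entry.data.value);
    }
  }
  return values;
}

// Nearest-rank percentile over a list of numbers.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hand-written sample in the same shape as the JSON output.
const sampleOutput = [
  '{"type":"Metric","metric":"http_req_duration","data":{"type":"trend"}}',
  '{"type":"Point","metric":"http_req_duration","data":{"time":"2024-01-01T00:00:00Z","value":120,"tags":{"status":"200"}}}',
  '{"type":"Point","metric":"http_req_duration","data":{"time":"2024-01-01T00:00:01Z","value":80,"tags":{"status":"200"}}}',
  '{"type":"Point","metric":"http_reqs","data":{"time":"2024-01-01T00:00:01Z","value":1,"tags":{"status":"200"}}}',
  '{"type":"Point","metric":"http_req_duration","data":{"time":"2024-01-01T00:00:02Z","value":450,"tags":{"status":"200"}}}',
  '{"type":"Point","metric":"http_req_duration","data":{"time":"2024-01-01T00:00:03Z","value":95,"tags":{"status":"200"}}}',
].join('\n');

const durations = metricValues(sampleOutput, 'http_req_duration');
console.log('samples:', durations.length, 'p95:', percentile(durations, 95));
```

<p>For long runs the JSON file grows quickly, so a real pipeline would stream it line by line instead of loading it whole.</p>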
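<h3 id="5-ci-integration-重做distributed-runner-模型不同">5. CI integration has to be redone (the distributed runner model differs)</h3>
<p>Running JMeter in CI involves provisioning a JVM, installing plugins, uploading the <code>.jmx</code>, running in non-GUI mode, parsing the JTL results, and mapping thresholds to an exit code. Running k6 in CI is <code>k6 run script.js</code>: threshold pass / fail drives the exit code directly, and results go to Prometheus / k6 Cloud.</p>
<p>k6 looks simpler, but there are traps:</p>
<ul>
<li>The distributed run model differs: JMeter uses a built-in master-slave (controller / worker) mode, while k6 OSS has no built-in distribution and needs Grafana Cloud k6 or a self-hosted k6-operator on Kubernetes</li>
<li>Large loads (&gt; 50K VUs) must be distributed; confirm during the Phase 2 evaluation that the distributed setup is not a blocker</li>
<li>CI runner resources: k6 is a native binary and uses less CPU / memory than JMeter (JVM), but the runner spec still has to be sized for the maximum VU count</li>
</ul>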
<h2 id="protocol-gap-詳表">Protocol gap in detail</h2>
<table>
  <thead>
      <tr>
          <th>Protocol</th>
          <th>JMeter sampler</th>
          <th>k6 counterpart</th>
          <th>Maturity / alternative</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTTP/1.1</td>
          <td>HTTP Request</td>
          <td><code>k6/http</code></td>
          <td>Native, mature</td>
      </tr>
      <tr>
          <td>HTTP/2</td>
          <td>HTTP/2 sampler (plugin)</td>
          <td><code>k6/http</code> (auto)</td>
          <td>Native, mature</td>
      </tr>
      <tr>
          <td>gRPC</td>
          <td>(none native; needs a plugin)</td>
          <td><code>k6/net/grpc</code></td>
          <td>Native, mature</td>
      </tr>
      <tr>
          <td>WebSocket</td>
          <td>WebSocket sampler (plugin)</td>
          <td><code>k6/ws</code></td>
          <td>Native, mature</td>
      </tr>
      <tr>
          <td>SSE</td>
          <td>(none native)</td>
          <td>xk6-sse</td>
          <td>Extension, moderate</td>
      </tr>
      <tr>
          <td>JDBC</td>
          <td>JDBC Request</td>
          <td>xk6-sql</td>
          <td>Extension; assess maturity, default to keeping JMeter</td>
      </tr>
      <tr>
          <td>JMS</td>
          <td>JMS sampler</td>
          <td>xk6-amqp / xk6-kafka</td>
          <td>Extension, protocol-specific (no JMS API)</td>
      </tr>
      <tr>
          <td>LDAP</td>
          <td>LDAP Request</td>
          <td>(none)</td>
          <td>External client / keep in JMeter</td>
      </tr>
      <tr>
          <td>FTP</td>
          <td>FTP Request</td>
          <td>(none)</td>
          <td>Keep in JMeter</td>
      </tr>
      <tr>
          <td>SMTP / IMAP</td>
          <td>Mail sampler</td>
          <td>(none)</td>
          <td>Keep in JMeter</td>
      </tr>
      <tr>
          <td>SOAP / XML-RPC</td>
          <td>SOAP / XML-RPC Request</td>
          <td><code>k6/http</code> with hand-written XML body</td>
          <td>High effort; keep in JMeter</td>
      </tr>
      <tr>
          <td>TCP socket</td>
          <td>TCP sampler</td>
          <td>(no native module; community xk6 extension)</td>
          <td>Basic only; keep complex protocols in JMeter</td>
      </tr>
  </tbody>
</table>
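<p>The table can double as a triage script for an existing <code>.jmx</code> inventory. A minimal sketch that transcribes a subset of the rows into a lookup; <code>ROUTES</code> and <code>planMigration</code> are hypothetical names, and anything not listed is flagged for manual review instead of being guessed.</p>

```javascript
// Routing table transcribed from a subset of the protocol gap table.
const ROUTES = {
  'http/1.1': { tool: 'k6', via: 'k6/http' },
  'http/2': { tool: 'k6', via: 'k6/http' },
  grpc: { tool: 'k6', via: 'k6/net/grpc' },
  websocket: { tool: 'k6', via: 'k6/ws' },
  sse: { tool: 'k6-extension', via: 'xk6-sse' },
  jdbc: { tool: 'k6-extension', via: 'xk6-sql' },
  ldap: { tool: 'jmeter', via: 'LDAP Request' },
  ftp: { tool: 'jmeter', via: 'FTP Request' },
  smtp: { tool: 'jmeter', via: 'Mail sampler' },
};

// Groups the protocols found in a test inventory by where they should live.
// Unknown protocols go to `review` for a human decision.
function planMigration(protocolsInUse) {
  const plan = { k6: [], 'k6-extension': [], jmeter: [], review: [] };
  for (const name of protocolsInUse) {
    const route = ROUTES[name.toLowerCase()];
    plan[route ? route.tool : 'review'].push(name);
  }
  return plan;
}

const plan = planMigration(['HTTP/1.1', 'gRPC', 'JDBC', 'LDAP', 'MQTT']);
console.log(plan);
```

<p>Anything landing in the <code>k6-extension</code> bucket still needs the maturity assessment described above before it is treated as migratable.</p>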
<h2 id="容量與成本對照">Capacity and cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Item</th>
          <th>JMeter</th>
          <th>k6 OSS</th>
          <th>Grafana Cloud k6</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cost</td>
          <td>Free (Apache 2.0)</td>
          <td>Free (AGPL-3.0)</td>
          <td>$49+ / mo (Pro)</td>
      </tr>
      <tr>
          <td>Single-machine VU capacity</td>
          <td>~500-1000 threads</td>
          <td>30K-50K VUs</td>
          <td>Unlimited (cloud runners)</td>
      </tr>
      <tr>
          <td>Distributed</td>
          <td>Built-in master-slave</td>
          <td>Not built in; needs k6-operator</td>
          <td>Cloud-native</td>
      </tr>
      <tr>
          <td>Result store</td>
          <td>JTL file (CSV / XML, local)</td>
          <td>stdout / JSON / Prometheus</td>
          <td>Retained in the cloud</td>
      </tr>
      <tr>
          <td>CI integration</td>
          <td>Needs packaging</td>
          <td>Native CLI</td>
          <td>Native + cloud</td>
      </tr>
      <tr>
          <td>Multi-protocol coverage</td>
          <td>Broad</td>
          <td>Narrow (HTTP/gRPC/WS) + xk6</td>
          <td>Same as OSS</td>
      </tr>
  </tbody>
</table>
<p>For the dev-driven CI gate use case, k6 OSS is enough; Grafana Cloud k6 pays off only when <em>cross-region runners, result retention, and dashboard integration</em> are needed. For existing multi-protocol <code>.jmx</code> assets, aim for the Phase 3 hybrid stable state and do not force Phase 4.</p>
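<h2 id="何時不要切">When not to switch</h2>
<ul>
<li><strong>Multi-protocol coverage is a core requirement</strong>: JDBC + JMS + LDAP + FTP are all needed and the xk6 extensions are not mature enough; stay on JMeter</li>
<li><strong>The QA team maintains <code>.jmx</code> through the GUI</strong>: QA does not write code and the GUI-maintained <code>.jmx</code> files are a team asset; forcing a move to k6 amounts to throwing away the QA team's investment</li>
<li><strong>A large multi-year <code>.jmx</code> estate</strong>: with 500+ scenarios, the cost of translating everything exceeds the ergonomic gains from k6; consider the Phase 3 stable hybrid</li>
<li><strong>Very large distributed runs (&gt; 100K VUs) on a tight ops budget</strong>: k6-operator on Kubernetes is not cheap, the matching Grafana Cloud k6 tier is not cheap either, and JMeter master-slave remains a cost-effective option</li>
</ul>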
<h2 id="下一步路由">下一步路由</h2>
<ul>
<li>Parallel batch: <a href="/blog/backend/09-performance-capacity/vendors/datadog-continuous-profiler/migrate-from-pyroscope/" data-link-title="Pyroscope → Datadog Continuous Profiler：profiling deployment lifecycle 各階段 operational ownership 轉手" data-link-desc="Pyroscope → Datadog Continuous Profiler 是 Type C operational hybrid migration — pprof data model 接近、profile lifecycle 五階段（install / instrument / ingest / query / cost）的 ops ownership 從 self-host 轉到 SaaS。本文走 6 維 audit（Operational High 其他 Low）、4-phase migration（operational audit &#43; agent parallel &#43; tag reconcile &#43; cutover）、5 production 踩雷（agent 重複 overhead / tag schema 不一致 / trace_id correlation 斷 / cost 突增 / retention 政策變動）、何時保留 Pyroscope（資料主權 / 內網 / OSS-first / cost sensitive）">Pyroscope → Datadog Profiler</a> (Type C operational hybrid)</li>
<li>Same-batch Type E: <a href="/blog/backend/08-incident-response/vendors/pagerduty/migrate-to-incident-io/" data-link-title="PagerDuty → incident.io：「On-call」是個 retconned word、同名不同 contract" data-link-desc="PagerDuty → incident.io 不是 schema translation — 兩家的「on-call」字面相同、contract 不同（alert routing vs IR coordination &#43; Slack-native &#43; retrospective）。本文走 Type E paradigm shift、6 維 audit 顯示 paradigm / schema / operational 三軸 High、用 4-phase partial migration（不收斂、Phase 1-2 多數 org 停留）、5 個 production 踩雷（雙系統 state drift / severity 翻譯失真 / schedule layer 漏 / Slack channel 過載 / retrospective 斷層）、跟 PagerDuty Process Automation / AIOps 沒對應的 capability gap">PagerDuty → incident.io</a> (IR paradigm shift)</li>
<li>Upstream: <a href="/blog/backend/09-performance-capacity/load-test-tooling/" data-link-title="9.3 壓測工具選型" data-link-desc="k6 / JMeter / Gatling / Locust / Vegeta / Production Replay 的工程選型">9.3 Load Test Tool Selection</a> / <a href="/blog/backend/09-performance-capacity/workload-modeling/" data-link-title="9.2 Workload Modeling" data-link-desc="把 production traffic shape 翻成可重播的壓測模型">9.2 Workload Modeling</a></li>
<li>Downstream: <a href="/blog/backend/06-reliability/performance-regression-gate/" data-link-title="6.13 Performance Regression Gate" data-link-desc="把效能 baseline 從一次性壓測變成持續對齊的 release gate，涵蓋 baseline 設定、判讀方法、variance 控制與退化定位">6.13 Performance Regression Gate</a> (CI gate integration)</li>
<li>Vendor reference: <a href="/blog/backend/09-performance-capacity/vendors/jmeter/" data-link-title="Apache JMeter" data-link-desc="用 GUI、plugin 與多 protocol sampler 承接企業壓測資產的效能工程工具">JMeter</a> / <a href="/blog/backend/09-performance-capacity/vendors/k6/" data-link-title="k6" data-link-desc="用 scriptable scenario 建立 API、protocol 與 CI 友善壓測的效能工程工具">k6</a> / <a href="/blog/backend/09-performance-capacity/vendors/gatling/" data-link-title="Gatling" data-link-desc="用 JVM DSL、simulation 與 injection profile 表達複雜 scenario 的效能工程工具">Gatling</a> / <a href="/blog/backend/09-performance-capacity/vendors/locust/" data-link-title="Locust" data-link-desc="用 Python user behavior 與 distributed worker 表達高自訂負載模型的效能工程工具">Locust</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a> (explains the Type E paradigm shift structure)</li>
</ul>
]]></content:encoded></item><item><title>PagerDuty → incident.io: "On-call" is a retconned word, same name with a different contract</title><link>https://tarrragon.github.io/blog/backend/08-incident-response/vendors/pagerduty/migrate-to-incident-io/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/08-incident-response/vendors/pagerduty/migrate-to-incident-io/</guid><description>&lt;p>"On-call" is a retconned word. PagerDuty spent a decade defining it as &lt;em>alert routing + schedule + escalation&lt;/em>, where the point is "who gets woken up". When incident.io launched its On-call module in 2024 it kept the same word but changed the contract: in incident.io, On-call is the paging entry point to &lt;em>IR coordination + Slack-native workflow + retrospective integration&lt;/em>, where the point is "what happens after you are woken up".&lt;/p>
&lt;p>This semantic retcon is the first thing this migration playbook has to make clear. A reader who opens a comparison table sees "PagerDuty has schedules, incident.io has schedules; PagerDuty has escalation policies, incident.io has escalation policies" and assumes this is a schema translation article. In practice schema translation is only one block of the work; the harder part is &lt;em>the org's incident behavior shifting from "wait for PagerDuty to page" to "run the lifecycle inside a Slack channel"&lt;/em>.&lt;/p>
&lt;h2 id="為什麼是-type-e不是-type-a">Why Type E (not Type A)&lt;/h2>
&lt;p>Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#6-%e7%b6%ad-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit&lt;/a>:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>Rating&lt;/th>
 &lt;th>Notes&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema&lt;/td>
 &lt;td>High&lt;/td>
 &lt;td>service / escalation policy / schedule / integration have no 1:1 mapping to incident / role / action / catalog&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational&lt;/td>
 &lt;td>High&lt;/td>
 &lt;td>alert routing → Slack-native IR coordination + retrospective workflow&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Paradigm&lt;/td>
 &lt;td>High&lt;/td>
 &lt;td>"alert someone" → "coordinate full incident lifecycle from declare to retro"&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Components&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;td>incident.io integrates Slack / Linear / Jira / Confluence, making it multi-component&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>App change&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;td>webhooks / integration keys / IaC all need changes&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Topology&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;td>Both are cloud SaaS; no sharding / region concerns&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Three axes rate High (schema / operational / paradigm). By the default priority order schema &amp;gt; paradigm &amp;gt; operational, the choice would be Type A. But that priority order is an &lt;em>audience-dependent heuristic&lt;/em>: pick Type A for the reader who says "I need to translate my PagerDuty config into incident.io", and Type E for the reader who says "I need to move our incident management paradigm from paging-first to Slack-first".&lt;/p></description><content:encoded><![CDATA[<p>"On-call" is a retconned word. PagerDuty spent a decade defining it as <em>alert routing + schedule + escalation</em>, where the point is "who gets woken up". When incident.io launched its On-call module in 2024 it kept the same word but changed the contract: in incident.io, On-call is the paging entry point to <em>IR coordination + Slack-native workflow + retrospective integration</em>, where the point is "what happens after you are woken up".</p>
<p>This semantic retcon is the first thing this migration playbook has to make clear. A reader who opens a comparison table sees "PagerDuty has schedules, incident.io has schedules; PagerDuty has escalation policies, incident.io has escalation policies" and assumes this is a schema translation article. In practice schema translation is only one block of the work; the harder part is <em>the org's incident behavior shifting from "wait for PagerDuty to page" to "run the lifecycle inside a Slack channel"</em>.</p>
<h2 id="為什麼是-type-e不是-type-a">Why Type E (not Type A)</h2>
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#6-%e7%b6%ad-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Rating</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>High</td>
          <td>service / escalation policy / schedule / integration have no 1:1 mapping to incident / role / action / catalog</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>High</td>
          <td>alert routing → Slack-native IR coordination + retrospective workflow</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>High</td>
          <td>"alert someone" → "coordinate full incident lifecycle from declare to retro"</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Medium</td>
          <td>incident.io integrates Slack / Linear / Jira / Confluence, making it multi-component</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>Medium</td>
          <td>webhooks / integration keys / IaC all need changes</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low</td>
          <td>Both are cloud SaaS; no sharding / region concerns</td>
      </tr>
  </tbody>
</table>
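<p>Three axes rate High (schema / operational / paradigm). By the default priority order schema &gt; paradigm &gt; operational, the choice would be Type A. But that priority order is an <em>audience-dependent heuristic</em>: pick Type A for the reader who says "I need to translate my PagerDuty config into incident.io", and Type E for the reader who says "I need to move our incident management paradigm from paging-first to Slack-first".</p>
<p>The deciding factor is <em>what the reader cares about most</em>. An org evaluating incident.io from a PagerDuty starting point usually <em>already feels the pain of running IR in Slack channels</em> (dual-system state drift, context-switching cost, Slack bots patching the gaps in PagerDuty); it comes looking for a unified paradigm, not a field-by-field translation. Schema translation is real work, but it is not the question the reader is trying to answer. So this article uses the <strong>Type E paradigm shift</strong> structure and pulls schema translation out into its own supplementary section.</p>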
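<h2 id="為什麼遷im-native-coordination-的拉力">Why migrate: the pull of IM-native coordination</h2>
<p>In an org that is already Slack-centric, incident response <em>happens naturally from Slack</em>: observability alerts land in Slack, an SRE opens a thread, a PM jumps in to ask about impact, the customer-facing team follows updates in the incident channel, and all the context lives in IM. In that reality PagerDuty becomes a <em>second system of record</em>: the incident is opened in PagerDuty and again in Slack, the PagerDuty timeline and the Slack scrollback are two separate timelines, status updates have to be mirrored twice, and responsibilities are assigned by talking in Slack but clicking in PagerDuty.</p>
<p>PagerDuty noticed the problem and later added Status Updates, Slack integration, and a Postmortem module to pull Slack back into PagerDuty. Structurally, though, <em>PagerDuty is still primary and Slack is the mirror</em>: the source of truth for the incident object lives in PagerDuty, and Slack messages are only attachments. For a <em>Slack-first</em> org that ownership is backwards: the Slack channel is the ground truth while an incident is in progress, and the PagerDuty incident should be an artifact of the paging entry point.</p>
<p>incident.io is designed to flip that relationship: the Slack channel is the IR ground truth and the incident object is a metadata projection of the channel. Incidents are declared in Slack, roles are assigned through Slack bot prompts, status updates are channel replies, and the retrospective is stitched automatically from channel messages. The incident.io dashboard is a <em>management view</em>, not the view where the incident is <em>worked</em>. With the On-call module added, even the paging entry point converges with IR coordination into the same system of record.</p>
<p>This pull is the <em>driver</em> of the migration. Schema translation is just the work of landing it.</p>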
<h2 id="4-phase-partial-migration不收斂">4-phase partial migration (non-converging)</h2>
<p>A defining trait of a Type E paradigm shift is that it <em>does not converge</em>: most orgs never fully retire PagerDuty and instead settle at some phase as a stable hybrid. The 4 phases below are a <em>common evolution path</em>, not <em>steps that must all be completed</em>:</p>
<h3 id="phase-1slack-first-responsepaging-留-pagerduty">Phase 1: Slack-first response (paging stays in PagerDuty)</h3>
<p>incident.io subscribes to the PagerDuty incident webhook: when PagerDuty opens an incident, incident.io automatically opens a Slack channel and runs the response lifecycle (declare / role / status / close / retro). PagerDuty still owns the paging schedule and escalation; incident.io owns response coordination.</p>
<p>The main blocks of work in this phase:</p>
<ul>
<li>Wire two-way webhooks between incident.io and PagerDuty (PD incident.trigger → IO open channel, IO incident.resolved → PD ack)</li>
<li>Slack workspace integration (permissions, channel naming, stakeholder broadcast channel)</li>
<li>Severity mapping table (PagerDuty P1-P5 to incident.io SEV1-SEV4, with the semantics reconciled)</li>
<li>Run 2-4 weeks of dual ops and train SREs to run the lifecycle inside Slack instead of going back to PagerDuty to click through the timeline</li>
</ul>
<p>Done when: the incident commander no longer needs the PagerDuty UI, and status updates, role assignment, and action items all happen in Slack.</p>
<h3 id="phase-2catalog--service-ownership-migrate">Phase 2: Migrate Catalog + service ownership</h3>
<p>Extract PagerDuty's service registry (the relationships among service / team / escalation policy) into the incident.io Catalog. The Catalog is incident.io's <em>source of truth for service metadata</em>: it ties a service to its team, Slack channel, Linear project, and runbook URL, so that when an incident occurs roles are recommended and stakeholders notified automatically.</p>
<p>Main blocks of work:</p>
<ul>
<li>Export services / teams / escalation policies from the PagerDuty API (REST endpoints <code>/services</code>, <code>/teams</code>, <code>/escalation_policies</code>)</li>
<li>Schema mapping: PagerDuty service → incident.io catalog entry; escalation policy → leave untouched for now (stays in PagerDuty)</li>
<li>Fill in fields PagerDuty does not have: Slack channel, Linear project, runbook URL, tier (the catalog carries more metadata dimensions than a PagerDuty service)</li>
<li>Service ownership reconcile (PagerDuty team grants are usually out of sync with GitHub teams / IAM groups; the Catalog is a chance to realign them)</li>
</ul>
<p>Done when: an incident automatically resolves to its owner team and the matching Slack channel, with no manual lookup.</p>
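<p>A minimal sketch of the service → catalog entry mapping step, in Python. It assumes the service objects were already exported from <code>/services</code> and carry <code>id</code>, <code>name</code>, and <code>teams</code> references with a <code>summary</code> field. <code>CatalogEntry</code>, <code>to_catalog_entry</code>, and the <code>extras</code> lookup are hypothetical names for illustration and are not the incident.io Catalog API schema.</p>
<pre><code class="language-python">from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical draft shape; not the incident.io Catalog API schema."""
    external_id: str
    name: str
    owner_teams: list
    slack_channel: str = ""  # not present in PagerDuty, filled in by hand
    runbook_url: str = ""    # not present in PagerDuty, filled in by hand
    missing: list = field(default_factory=list)


def to_catalog_entry(pd_service: dict, extras: dict) -> CatalogEntry:
    """Map one exported PagerDuty service object to a draft catalog entry.

    `extras` is a hand-maintained lookup keyed by PagerDuty service id
    (an assumption of this sketch).
    """
    extra = extras.get(pd_service["id"], {})
    entry = CatalogEntry(
        external_id=pd_service["id"],
        name=pd_service["name"],
        owner_teams=[team["summary"] for team in pd_service.get("teams", [])],
        slack_channel=extra.get("slack_channel", ""),
        runbook_url=extra.get("runbook_url", ""),
    )
    # Record the gaps so the ownership reconcile step has a worklist.
    for attr in ("owner_teams", "slack_channel", "runbook_url"):
        if not getattr(entry, attr):
            entry.missing.append(attr)
    return entry
</code></pre>
<p>Entries that come back with a non-empty <code>missing</code> list are the ones that still need a human to supply ownership or channel metadata before Phase 2 can be called done.</p>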
<h3 id="phase-3schedule--escalation-移到-incidentio-on-call">Phase 3: Move schedule + escalation to incident.io On-call</h3>
<p>PagerDuty schedules and escalation policies move into incident.io On-call. This is the <em>ownership transfer of the paging entry point</em>: in Phase 1 PD triggers the IO response; in Phase 3 IO receives alert sources directly and triggers paging itself.</p>
<p>Main blocks of work:</p>
<ul>
<li>Rewire alert sources: webhooks from Splunk / Datadog / Cloudflare WAF / cloud control planes move from the PagerDuty Events API to incident.io webhook endpoints, and deduplication keys / severity mappings are redone</li>
<li>Rebuild schedules: the PagerDuty schedule layer model (multiple stacked layers + restrictions + overrides) does not map 1:1 to incident.io schedule rules (plain weekly rotation + overrides), so complex schedules need to be redesigned</li>
<li>Rebuild escalation policies: PagerDuty's multi-step escalation with level-based timeouts maps to an incident.io escalation path; the policy is simpler than PagerDuty's, but failover behavior must be retested</li>
<li>Mobile app switch: on-call staff install the incident.io app and keep the PagerDuty app as backup paging (fully dropped only in Phase 4)</li>
</ul>
<p>Done when: day-to-day paging all goes through incident.io, and PagerDuty is kept as a fallback or retired.</p>
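<p>For the alert source rewiring, one way to keep deduplication intact is a small translation shim in front of the new endpoint. The sketch below, in Python, reads a PagerDuty Events API v2 style event (<code>event_action</code>, <code>dedup_key</code>, <code>payload.summary</code>, <code>payload.source</code>, <code>payload.severity</code>) and produces a draft payload. The output keys (<code>title</code>, <code>status</code>, <code>deduplication_key</code>, <code>metadata</code>) are assumptions about the receiving endpoint, and <code>translate_event</code> is a hypothetical helper; check the incident.io alert source documentation for the real schema.</p>
<pre><code class="language-python">def translate_event(pd_event: dict) -> dict:
    """Translate a PagerDuty Events API v2 style event into a draft alert payload.

    The output keys are assumptions about the receiving endpoint, not a
    documented incident.io schema.
    """
    action = pd_event["event_action"]  # "trigger", "acknowledge" or "resolve"
    dedup_key = pd_event.get("dedup_key")
    if not dedup_key:
        # Design choice of this sketch: refuse events without a stable key,
        # because the target cannot group repeats of one alert without it.
        raise ValueError("event has no dedup_key; set one at the alert source")
    payload = pd_event.get("payload") or {}  # resolve events may omit it
    return {
        "title": payload.get("summary", ""),
        "status": "resolved" if action == "resolve" else "firing",
        "deduplication_key": dedup_key,
        "metadata": {
            "source": payload.get("source", ""),
            # Kept raw here; severity remapping is a separate decision.
            "pd_severity": payload.get("severity", ""),
        },
    }
</code></pre>
<p>Carrying the original key across unchanged means a trigger and its later resolve still refer to the same alert after the cutover.</p>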
<h3 id="phase-4retrospective--完全退役-pagerduty">Phase 4: Retrospective + full PagerDuty retirement</h3>
<p>Move the retrospective workflow to incident.io's built-in post-incident flow and drop the PagerDuty Postmortems / Jeli integration. incident.io's retro template stitches the timeline automatically from Slack channel messages, pushes action items to Linear / Jira, and structures the learning review.</p>
<p>Main blocks of work:</p>
<ul>
<li>Export existing Jeli / PagerDuty Postmortems history (the PagerDuty REST API does not offer postmortem export directly; it has to be exported manually from the Jeli web app)</li>
<li>Map the retrospective template onto the org's existing post-incident review structure</li>
<li>Action item lifecycle integration (incident.io pushes to Linear / Jira → close → the retrospective is marked done automatically)</li>
</ul>
<p>Most orgs stop at Phase 2 or Phase 3. Fully retiring PagerDuty in Phase 4 is not required, and common choices are to <em>keep PagerDuty as a backup paging route</em> or to <em>keep using it for specific integrations</em> (see the capability gap section below).</p>
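<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<p>Five typical problems hit during real migrations:</p>
<h3 id="1-雙系統-state-driftphase-1-最常見">1. Dual-system state drift (most common in Phase 1)</h3>
<p>A PagerDuty incident.trigger opens an incident.io channel, but when the incident is later auto-resolved on the PagerDuty side (for example, the monitoring tool considers the issue cleared) incident.io never receives the corresponding webhook, and the Slack channel stays active showing in-progress. The fix is to wire webhooks in both directions (PD resolved → IO auto-closes the channel), and because webhooks can still arrive out of order, to also run a nightly reconcile job that compares state on both sides.</p>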
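<p>The comparison at the heart of such a nightly reconcile job can be sketched in a few lines of Python. This assumes both systems have already been reduced to a mapping from a shared incident key to <code>"open"</code> or <code>"resolved"</code>; how those mappings are fetched is out of scope, and <code>find_drift</code> is a hypothetical name.</p>
<pre><code class="language-python">def find_drift(pd_status: dict, io_status: dict) -> list:
    """Return (incident_key, pd_state, io_state) for every disagreement.

    Both inputs map a shared incident key to "open" or "resolved".
    A key present on only one side is reported with state "missing".
    """
    drift = []
    for key in sorted(set(pd_status) | set(io_status)):
        pd_state = pd_status.get(key, "missing")
        io_state = io_status.get(key, "missing")
        if pd_state != io_state:
            drift.append((key, pd_state, io_state))
    return drift
</code></pre>
<p>Each returned tuple is a candidate for an automatic close or a message to the on-call channel, depending on how much the org trusts either side as the source of truth in that phase.</p>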
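<h3 id="2-severity-翻譯失真">2. Severity lost in translation</h3>
<p>PagerDuty's P1-P5 and incident.io's SEV1-SEV4 are not a 5:4 mapping; they are two independent schemas. An incident that is P2 in PagerDuty (high priority but not a full outage) may become SEV2 (partial service impact) or SEV1 in incident.io, depending on how the org defined its custom severities. During the Phase 1 dual-run, SREs see SEV1 in Slack and go into war-room mode while the same incident is P2 in PagerDuty and no stakeholder bridge is pulled: the two sides disagree on severity and the response cadence falls apart. The fix is to hard-code a mapping table up front (PD P1 → IO SEV1, PD P2 → IO SEV2, no case-by-case judgment) and, after Phase 3, make incident.io severity the single source of truth.</p>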
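<p>A hard-coded mapping can be as small as the Python sketch below. Only the P1 and P2 rows come from the text above; the P3 to P5 rows are placeholder assumptions that each org has to decide for itself, and <code>map_severity</code> is a hypothetical helper name.</p>
<pre><code class="language-python">PD_TO_IO_SEVERITY = {
    "P1": "SEV1",  # stated in the text
    "P2": "SEV2",  # stated in the text
    "P3": "SEV3",  # assumption: agree on this row inside your org
    "P4": "SEV4",  # assumption
    "P5": "SEV4",  # assumption: five priorities collapse into four severities
}


def map_severity(pd_priority: str) -> str:
    """Look up the incident.io severity for a PagerDuty priority.

    Unknown priorities raise instead of guessing, so nobody falls back
    to case-by-case judgment during an incident.
    """
    try:
        return PD_TO_IO_SEVERITY[pd_priority]
    except KeyError:
        raise ValueError(f"no mapping for PagerDuty priority {pd_priority!r}") from None
</code></pre>
<p>Failing loudly on an unmapped value is deliberate: a silent default is exactly how the two systems end up disagreeing about severity.</p>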
<h3 id="3-schedule-layer-漏-holiday-override--restriction-layer">3. Schedule export misses the holiday override / restriction layers</h3>
<p>A PagerDuty schedule is a layer model: primary rotation (layer 1) + secondary rotation (layer 2) + holiday override (layer 3) + restrictions (per-layer time-of-day limits) can all be stacked. An export that only looks at layer 1 usually misses the holiday override and restriction layers, and an incident.io schedule rule is a single rotation + override list that does not cover stacked layers. The fix is to pull the full layers together with final_schedule from the PagerDuty API <code>/schedules/{id}</code> at export time and use the incident.io override list to emulate the stacking; a complex schedule (for example follow-the-sun + 4 regions + holiday override) may have to be split into several incident.io schedules chained through an escalation chain.</p>
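<p>Before rebuilding anything, it helps to triage which exported schedules are simple enough to recreate directly. The Python sketch below flags the complex ones. The field names <code>schedule_layers</code>, <code>restrictions</code>, <code>overrides_subschedule</code>, and <code>rendered_schedule_entries</code> are assumptions about the <code>/schedules/{id}</code> response and should be verified against the PagerDuty API reference; <code>audit_schedule</code> is a hypothetical helper.</p>
<pre><code class="language-python">def audit_schedule(schedule: dict) -> list:
    """List the reasons an exported PagerDuty schedule needs manual redesign.

    An empty list means a single unrestricted rotation with no overrides
    in the exported window.
    """
    reasons = []
    layers = schedule.get("schedule_layers", [])
    if len(layers) > 1:
        reasons.append(f"{len(layers)} layers stacked")
    restricted = [layer.get("name", "?") for layer in layers if layer.get("restrictions")]
    if restricted:
        reasons.append("restrictions on: " + ", ".join(restricted))
    overrides = schedule.get("overrides_subschedule", {}).get("rendered_schedule_entries", [])
    if overrides:
        reasons.append(f"{len(overrides)} override entries in the exported window")
    return reasons
</code></pre>
<p>Schedules with an empty result can be recreated as a single rotation; the rest go on the redesign list, where splitting into several schedules is decided case by case.</p>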
<h3 id="4-slack-channel-過載">4. Slack channel overload</h3>
<p>By default incident.io opens one channel per incident. After Phase 1 goes live, SREs receive 50+ channel notifications a week, channels are opened even for P3 / P4, and the Slack sidebar is flooded. The fix is to design incident types so that low severities (SEV3 / SEV4) either <em>don&rsquo;t auto-create a channel</em> or <em>use a shared low-severity channel</em>, and only SEV1 / SEV2 get a dedicated channel. incident.io has this configuration, but it is off by default and must be set deliberately.</p>
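<p>The routing policy itself is small enough to write down and review before touching any settings. The Python sketch below models the policy only; it is not incident.io configuration syntax, and the channel names and <code>channel_for</code> helper are hypothetical.</p>
<pre><code class="language-python"># Policy from the text: only SEV1 / SEV2 get a dedicated channel.
DEDICATED_CHANNEL_SEVERITIES = {"SEV1", "SEV2"}
SHARED_LOW_SEVERITY_CHANNEL = "#incidents-low-sev"  # hypothetical channel name


def channel_for(severity: str, incident_ref: str) -> str:
    """Return the Slack channel an incident should use under this policy."""
    if severity in DEDICATED_CHANNEL_SEVERITIES:
        return f"#inc-{incident_ref}"  # hypothetical naming scheme
    return SHARED_LOW_SEVERITY_CHANNEL
</code></pre>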
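<h3 id="5-retrospective-切換時歷史-learning-斷層">5. Historical learning gap when switching retrospectives</h3>
<p>After switching from Jeli / PagerDuty Postmortems to incident.io retros, the past 2 years of postmortems stay in the old system, search cannot span both, the new retro template differs in structure from the old one, and trend analysis in learning reviews breaks. The fix is, before Phase 4, to export existing postmortems as markdown into a GitHub Wiki / Confluence for central storage and have incident.io retros export automatically to the same place, so retro search does not depend on any one vendor.</p>
<h2 id="schema-translation-主要工作量塊">Schema translation: the main blocks of work</h2>
<p>Although the Type E structure does not center on schema translation, the translation work still takes most of the time in Phases 2-3:</p>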
<table>
  <thead>
      <tr>
          <th>Source (PagerDuty)</th>
          <th>Target (incident.io)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Service</td>
          <td>Catalog entry</td>
          <td>Adds Slack channel / Linear project metadata</td>
      </tr>
      <tr>
          <td>Team</td>
          <td>Catalog team</td>
          <td>Additionally mapped to a GitHub team / IAM group</td>
      </tr>
      <tr>
          <td>Escalation policy</td>
          <td>Escalation path</td>
          <td>Simpler than PD; complex escalations must be split</td>
      </tr>
      <tr>
          <td>Schedule (multi-layer)</td>
          <td>Schedule + override list</td>
          <td>Not 1:1; complex schedules must be split into several</td>
      </tr>
      <tr>
          <td>Integration (webhook)</td>
          <td>Webhook endpoint</td>
          <td>Every alert source must be rewired</td>
      </tr>
      <tr>
          <td>Incident workflow</td>
          <td>Incident type + role</td>
          <td>Redesigned, not translated directly</td>
      </tr>
      <tr>
          <td>Event Orchestration rule</td>
          <td>Workflows</td>
          <td>incident.io workflows are simpler than EO; complex routing must be handled externally</td>
      </tr>
      <tr>
          <td>AIOps / Process Automation</td>
          <td>(no equivalent)</td>
          <td>See the capability gap section</td>
      </tr>
      <tr>
          <td>Postmortem / Jeli</td>
          <td>Post-incident flow</td>
          <td>Template rewritten; history preserved separately</td>
      </tr>
  </tbody>
</table>
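<h2 id="capability-gappagerduty-有但-incidentio-沒有">Capability gap: what PagerDuty has and incident.io lacks</h2>
<p>Not every feature has an incident.io equivalent. Before pushing into Phases 3-4, confirm whether these capabilities are in use and whether you are willing to drop them or source them externally:</p>
<ul>
<li><strong>AIOps (intelligent grouping / noise reduction)</strong>: PagerDuty's Enterprise tier uses ML to group alerts automatically; incident.io has no equivalent, and grouping relies on deduplication keys set at the alert source</li>
<li><strong>Process Automation (runbook automation)</strong>: PagerDuty acquired Rundeck and offers automated remediation steps; incident.io has no equivalent, so you need an external tool such as Tines / n8n / a custom Lambda</li>
<li><strong>Status Page integration (built into PagerDuty)</strong>: PagerDuty ships a Status Page module; incident.io's status page is a separate product with different pricing and features</li>
<li><strong>Multi-region / strict compliance (FedRAMP / IL5)</strong>: PagerDuty is mature in finance / government / high-compliance deployments; incident.io has SOC 2 + ISO 27001 but is still working toward FedRAMP</li>
</ul>
<p>If AIOps + Process Automation are in use and matter, either do not do this migration, or keep PagerDuty as the AIOps + Automation backend while incident.io handles response coordination: a permanent Phase 1 hybrid.</p>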
<h2 id="容量與成本對照">Capacity and cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Item</th>
          <th>PagerDuty</th>
          <th>incident.io</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Billing model</td>
          <td>Per-user / month, tier-based (Pro / Business / Enterprise)</td>
          <td>Per-user / month, On-call module billed separately</td>
      </tr>
      <tr>
          <td>Hidden capacity limit</td>
          <td>API rate limit (10K / minute)</td>
          <td>Slack workspace seat limit (IR participants ≤ workspace users)</td>
      </tr>
      <tr>
          <td>AIOps surcharge</td>
          <td>Enterprise tier + AIOps add-on</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Status page</td>
          <td>Built in (Business tier+)</td>
          <td>Separate product</td>
      </tr>
      <tr>
          <td>Process Auto</td>
          <td>Rundeck-based, separate pricing</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
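<p>A real cost comparison needs an RFP. As a rough figure for a 50-person SRE org: PD Business + AIOps is about $30-40 / user / mo and incident.io Pro + On-call about $25-35 / user / mo. The cost gap is usually not the main reason to migrate (paradigm fit and Slack-native workflow are).</p>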
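<p>To put those rough per-user figures on an annual basis, the arithmetic can be sketched in Python. The only inputs are the 50 users and the price ranges quoted above; <code>annual_range</code> is a hypothetical helper, and real quotes will differ.</p>
<pre><code class="language-python">def annual_range(users: int, low_per_month: int, high_per_month: int) -> tuple:
    """Return (low, high) annual spend in dollars for a per-user monthly price range."""
    return users * low_per_month * 12, users * high_per_month * 12


pagerduty = annual_range(50, 30, 40)    # (18000, 24000)
incident_io = annual_range(50, 25, 35)  # (15000, 21000)
# Gap at the low end and at the high end of the two ranges.
gap = (pagerduty[0] - incident_io[0], pagerduty[1] - incident_io[1])  # (3000, 3000)
</code></pre>
<p>On these figures the gap is about $3,000 a year for the whole org, which supports the point that cost is rarely the deciding factor.</p>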
<h2 id="何時不要做這個-migration">When not to do this migration</h2>
<ul>
<li><strong>Slack is not the IR ground truth</strong>: in an org where Discord / Teams is primary or a ticket system leads, incident.io's Slack-first design cannot land</li>
<li><strong>AIOps + Process Automation are core capabilities</strong>: you rely on PD AIOps to group alerts and Rundeck to remediate automatically, and that chain matters; incident.io has no equivalent</li>
<li><strong>Scale &lt; 20 SRE / 50 eng</strong>: incident.io's catalog + opinionated workflow is designed for mid-size and large orgs; for a small team PagerDuty Lite or Grafana OnCall is enough</li>
<li><strong>Strict compliance scenarios (FedRAMP / IL5 / financial SOC 1 Type II)</strong>: PagerDuty's compliance maturity is high, incident.io is still catching up, and the compliance team will not sign off</li>
<li><strong>No intent to change incident behavior</strong>: if the org only wants to switch vendors and does not want to adopt the working mode where <em>incidents run their lifecycle in Slack</em>, this migration loses half its value; it is better to take <a href="/blog/backend/08-incident-response/vendors/opsgenie/migrate-from-pagerduty/" data-link-title="PagerDuty → Opsgenie：Atlassian 全家桶整合 vs Opsgenie 2027 EOL 的 vendor consolidation 取捨" data-link-desc="PagerDuty → Opsgenie 是 Type A phased schema translation、但 Atlassian 已宣布 Opsgenie 2027-04 EOL — 這條 migration 只在 Atlassian-heavy org &#43; 明確 JSM unification roadmap 下成立、本質是 PD → Opsgenie → JSM Cloud 的雙 hop migration。本文走 6 維 audit（Schema Medium-High 其他 Low）、PagerDuty ↔ Opsgenie ↔ JSM field mapping 對照、5 production 踩雷（escalation step / Heartbeat 缺對應 / integration key dedup 重設 / schedule 時區 / Atlassian Identity SSO 整合）、何時直接走 PD → JSM 跳過 Opsgenie">PagerDuty → Opsgenie</a> (Type A schema translation, same paradigm)</li>
</ul>
<h2 id="下一步路由">Next steps</h2>
<ul>
<li>Parallel batch: <a href="/blog/backend/08-incident-response/vendors/opsgenie/migrate-from-pagerduty/" data-link-title="PagerDuty → Opsgenie：Atlassian 全家桶整合 vs Opsgenie 2027 EOL 的 vendor consolidation 取捨" data-link-desc="PagerDuty → Opsgenie 是 Type A phased schema translation、但 Atlassian 已宣布 Opsgenie 2027-04 EOL — 這條 migration 只在 Atlassian-heavy org &#43; 明確 JSM unification roadmap 下成立、本質是 PD → Opsgenie → JSM Cloud 的雙 hop migration。本文走 6 維 audit（Schema Medium-High 其他 Low）、PagerDuty ↔ Opsgenie ↔ JSM field mapping 對照、5 production 踩雷（escalation step / Heartbeat 缺對應 / integration key dedup 重設 / schedule 時區 / Atlassian Identity SSO 整合）、何時直接走 PD → JSM 跳過 Opsgenie">PagerDuty → Opsgenie</a> (Type A, same paradigm with a different vendor) / <a href="/blog/backend/08-incident-response/vendors/atlassian-statuspage/migrate-to-instatus/" data-link-title="Atlassian Statuspage → Instatus：status page 成本下降、但 compatibility audit 不能跳" data-link-desc="Atlassian Statuspage → Instatus 是 Type B drop-in migration、6 維 audit 全 Low；典型情境是從 Statuspage Business / Enterprise 降到 Instatus Pro / Business、但 savings 取決於 subscriber、SSO、audit 與 SLA report 需求。本文走 compatibility audit prefix（subscriber channel 完整度 / SAML SSO / audit log / metrics integration / SLA report / API parity）、4 階段 cutover（DNS TTL &#43; parallel run）、5 個 production 踩雷（SSO tier 選錯、metrics 來源整合斷、subscriber import format / SLA report 缺、custom CSS 不完全相容）、何時不要切（enterprise compliance / 強 Atlassian 整合）">Atlassian Statuspage → Instatus</a> (Type B drop-in)</li>
<li>Same-batch Type E: <a href="/blog/backend/09-performance-capacity/vendors/k6/migrate-from-jmeter/" data-link-title="JMeter → k6：k6 不是 JMeter 的「script 版本」、是 VU model 取代 thread model" data-link-desc="JMeter → k6 是 Type E paradigm shift、不是把 .jmx XML 翻成 JavaScript — VU (virtual user) model 跟 thread group model 是兩種對「使用者行為」不同的建模方式。本文走 6 維 audit（Schema High / Paradigm High / Operational Medium）、釐清反向定義、4-phase partial migration（多數 org 停 Phase 2-3 hybrid）、5 production 踩雷（thread group 翻譯失真 / arrival rate vs concurrent VU 混淆 / protocol gap / 結果 schema 改 / CI integration 重做）、protocol gap（JDBC / JMS / LDAP 在 k6 沒原生對應）、何時不要切">JMeter → k6</a> (scripting paradigm shift)</li>
<li>Upstream: <a href="/blog/backend/08-incident-response/incident-workflow-automation-boundary/" data-link-title="8.21 Incident Workflow Automation Boundary" data-link-desc="定義哪些事故流程適合自動化，哪些決策需要保留人工確認">8.21 Incident Workflow Automation Boundary</a> (automation handoff)</li>
<li>Downstream: <a href="/blog/backend/08-incident-response/post-incident-review/" data-link-title="8.5 復盤與改進追蹤" data-link-desc="把 RCA 與 action items 轉成可驗證閉環">8.5 Post-Incident Review</a> (incident.io retrospective workflow)</li>
<li>Vendor reference: <a href="/blog/backend/08-incident-response/vendors/pagerduty/" data-link-title="PagerDuty" data-link-desc="On-call / alerting 主流 SaaS、IR 平台演化">PagerDuty</a> / <a href="/blog/backend/08-incident-response/vendors/incident-io/" data-link-title="incident.io" data-link-desc="Slack-native IR 平台、整合 paging / response / retro">incident.io</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a> (explains the Type E paradigm shift structure)</li>
</ul>
]]></content:encoded></item></channel></rss>