<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Managed on Tarragon</title><link>https://tarrragon.github.io/blog/tags/managed/</link><description>Recent content in Managed on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 16 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/managed/index.xml" rel="self" type="application/rss+xml"/><item><title>AWS ElastiCache 的責任邊界：managed 接手了什麼、又默默留下什麼</title><link>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/aws-elasticache/managed-responsibility-boundary/</link><pubDate>Tue, 16 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/aws-elasticache/managed-responsibility-boundary/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/aws-elasticache/" data-link-title="AWS ElastiCache" data-link-desc="AWS managed Redis / Valkey / Memcached">AWS ElastiCache&lt;/a> overview 的 implementation-layer deep article。選型層（為何用 managed、engine 選擇、跟自管取捨）見 overview；本文只處理「決定用 ElastiCache 後，哪些是 AWS 的責任、哪些仍是你的」。CLI 與計費以 &lt;a href="https://docs.aws.amazon.com/elasticache/">AWS ElastiCache 官方文件&lt;/a>、&lt;a href="https://aws.amazon.com/elasticache/pricing/">ElastiCache 定價&lt;/a> 為準、最後檢查日 2026-06-16（managed 服務的引數與價格會變、以官方為準）。&lt;/p>&lt;/blockquote>
&lt;h2 id="managed-不等於-hands-off">managed 不等於 hands-off&lt;/h2>
&lt;p>把 cache 換成 ElastiCache 之後，最危險的心態是「現在 AWS 全包了」。AWS 確實接走了一大塊運維——它幫你做 failover、patching、snapshot、跨 AZ 複製，你不用再自己部署 Sentinel、不用半夜起來手動切 master。但有一類問題 ElastiCache 一個都沒幫你解，而且因為「以為 AWS 會處理」，這些問題在 managed 環境反而更容易被忽略到上線才爆。&lt;/p>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/backend/09-performance-capacity/cases/tinder-elasticache-valkey-matching/" data-link-title="9.C6 Tinder：ElastiCache for Valkey 撐 4700 萬月活的配對引擎" data-link-desc="Tinder 用 Amazon ElastiCache for Valkey 提供配對引擎所需的次毫秒延遲快取層">Tinder 的配對引擎&lt;/a>跑在 ElastiCache for Valkey 上、4700 萬月活、sub-millisecond 延遲——這證明 managed 撐得起極大規模，但 Tinder 仍要自己設計 key、處理 cache miss、控制 client 行為。ElastiCache for Redis 7.1 在 r7g.4xlarge 上單 node 可達約 100 萬 RPS、單 cluster 約 5 億 RPS（引自 &lt;a href="https://aws.amazon.com/blogs/database/achieve-over-500-million-requests-per-second-per-cluster-with-amazon-elasticache-for-redis-7-1/">AWS Database Blog&lt;/a>）——這個吞吐是 AWS 給的，但用不用得好取決於你的 key 分布與 client 設計。&lt;/p>
&lt;p>理解 ElastiCache 就是劃清這條責任邊界。本文按 shared responsibility 展開：AWS 管什麼、你管什麼、邊界上的踩坑在哪。&lt;/p>
&lt;h2 id="核心概念shared-responsibility-的兩側">核心概念：shared responsibility 的兩側&lt;/h2>
&lt;p>ElastiCache 的責任劃分可以列成一張清楚的表，這張表是判讀所有 ElastiCache 事故的起點：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>AWS 的責任（managed）&lt;/th>
 &lt;th>你的責任（仍要自己做）&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>硬體 / OS / patching&lt;/td>
 &lt;td>全包&lt;/td>
 &lt;td>—&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>failover&lt;/td>
 &lt;td>自動偵測 + replica 晉升&lt;/td>
 &lt;td>client 要有 reconnect 邏輯&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>跨 AZ 複製&lt;/td>
 &lt;td>Multi-AZ 自動複製&lt;/td>
 &lt;td>接受非同步複製的 stale window&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>snapshot / backup&lt;/td>
 &lt;td>自動 + 手動 snapshot&lt;/td>
 &lt;td>決定保留策略、驗證能還原&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>eviction&lt;/td>
 &lt;td>提供 maxmemory-policy 參數&lt;/td>
 &lt;td>選對 policy、設對 TTL&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>cache stampede&lt;/td>
 &lt;td>不管&lt;/td>
 &lt;td>client-side jitter / singleflight 自己做&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>key 設計 / hot key&lt;/td>
 &lt;td>不管&lt;/td>
 &lt;td>key 分布、hot key 兩層 cache 自己處理&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>連線管理&lt;/td>
 &lt;td>提供 endpoint&lt;/td>
 &lt;td>連線池、socket timeout 自己設&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>左欄是用 managed 換到的，右欄是用 managed 換不掉的。&lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/cases/failure-cache-stampede-rollout-regression/" data-link-title="2.C9 反例：快取切換引發 Stampede 回歸" data-link-desc="快取策略切換若缺乏保護，會導致回源壓力與錯誤率連鎖上升。">2.C9 cache stampede&lt;/a> 的雪崩、&lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/connection-pipeline-latency/" data-link-title="Redis 連線與 pipeline：RTT 稅、連線池與一次往返打包多命令" data-link-desc="Redis 單命令通常微秒級執行，但 application 端量到的延遲是毫秒級——差距全在網路往返（RTT）。pipelining 的本質不是『批次發命令』，是把 N 次 RTT 壓成 1 次。本文展開 RTT 會計、連線池配置、pipeline 與 MULTI 的差異、5 個把連線與往返寫成延遲與正確性問題的 production 踩坑，以及連線模型撞牆的邊界">連線風暴&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/memory-eviction-tuning/" data-link-title="Redis 記憶體與淘汰調校：maxmemory-policy、LFU 與碎片化的實戰判讀" data-link-desc="Redis 的記憶體是一條會在半夜爆掉的曲線：maxmemory 設多少、policy 選 LRU 還 LFU、碎片化什麼時候開始吃掉 30% RAM、OOM 時 noeviction 怎麼讓寫入全部失敗。本文展開 Redis 記憶體會計模型、eviction policy 的選型判讀、5 個把記憶體配置寫成 production 事故的踩坑，以及單機記憶體撞牆後該往 cluster 還是 DragonflyDB 走的邊界">eviction 選錯&lt;/a> 在 ElastiCache 上跟自管 Redis 一模一樣會發生——因為這些是 cache 使用方式的問題，不是運維的問題。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/" data-link-title="AWS ElastiCache" data-link-desc="AWS managed Redis / Valkey / Memcached">AWS ElastiCache</a> overview 的 implementation-layer deep article。選型層（為何用 managed、engine 選擇、跟自管取捨）見 overview；本文只處理「決定用 ElastiCache 後，哪些是 AWS 的責任、哪些仍是你的」。CLI 與計費以 <a href="https://docs.aws.amazon.com/elasticache/">AWS ElastiCache 官方文件</a>、<a href="https://aws.amazon.com/elasticache/pricing/">ElastiCache 定價</a> 為準、最後檢查日 2026-06-16（managed 服務的引數與價格會變、以官方為準）。</p></blockquote>
<h2 id="managed-不等於-hands-off">managed 不等於 hands-off</h2>
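<p>After moving a cache to ElastiCache, the most dangerous assumption is "AWS handles everything now." AWS does take over a large share of operations: failover, patching, snapshots, and cross-AZ replication, so you no longer deploy Sentinel yourself or get up at night to promote a master by hand. But one class of problems, those rooted in how the application uses the cache (key design, cache misses, client behavior), is left entirely to you. Because teams assume AWS will handle them, these problems tend to go unnoticed in a managed environment until they fail in production.</p>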
<p><a href="/blog/backend/09-performance-capacity/cases/tinder-elasticache-valkey-matching/" data-link-title="9.C6 Tinder：ElastiCache for Valkey 撐 4700 萬月活的配對引擎" data-link-desc="Tinder 用 Amazon ElastiCache for Valkey 提供配對引擎所需的次毫秒延遲快取層">Tinder 的配對引擎</a>跑在 ElastiCache for Valkey 上、4700 萬月活、sub-millisecond 延遲——這證明 managed 撐得起極大規模，但 Tinder 仍要自己設計 key、處理 cache miss、控制 client 行為。ElastiCache for Redis 7.1 在 r7g.4xlarge 上單 node 可達約 100 萬 RPS、單 cluster 約 5 億 RPS（引自 <a href="https://aws.amazon.com/blogs/database/achieve-over-500-million-requests-per-second-per-cluster-with-amazon-elasticache-for-redis-7-1/">AWS Database Blog</a>）——這個吞吐是 AWS 給的，但用不用得好取決於你的 key 分布與 client 設計。</p>
<p>理解 ElastiCache 就是劃清這條責任邊界。本文按 shared responsibility 展開：AWS 管什麼、你管什麼、邊界上的踩坑在哪。</p>
<h2 id="核心概念shared-responsibility-的兩側">核心概念：shared responsibility 的兩側</h2>
<p>ElastiCache 的責任劃分可以列成一張清楚的表，這張表是判讀所有 ElastiCache 事故的起點：</p>
<table>
  <thead>
      <tr>
          <th>面向</th>
          <th>AWS 的責任（managed）</th>
          <th>你的責任（仍要自己做）</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>硬體 / OS / patching</td>
          <td>全包</td>
          <td>—</td>
      </tr>
      <tr>
          <td>failover</td>
          <td>自動偵測 + replica 晉升</td>
          <td>client 要有 reconnect 邏輯</td>
      </tr>
      <tr>
          <td>跨 AZ 複製</td>
          <td>Multi-AZ 自動複製</td>
          <td>接受非同步複製的 stale window</td>
      </tr>
      <tr>
          <td>snapshot / backup</td>
          <td>自動 + 手動 snapshot</td>
          <td>決定保留策略、驗證能還原</td>
      </tr>
      <tr>
          <td>eviction</td>
          <td>提供 maxmemory-policy 參數</td>
          <td>選對 policy、設對 TTL</td>
      </tr>
      <tr>
          <td>cache stampede</td>
          <td>不管</td>
          <td>client-side jitter / singleflight 自己做</td>
      </tr>
      <tr>
          <td>key 設計 / hot key</td>
          <td>不管</td>
          <td>key 分布、hot key 兩層 cache 自己處理</td>
      </tr>
      <tr>
          <td>連線管理</td>
          <td>提供 endpoint</td>
          <td>連線池、socket timeout 自己設</td>
      </tr>
  </tbody>
</table>
<p>左欄是用 managed 換到的，右欄是用 managed 換不掉的。<a href="/blog/backend/02-cache-redis/cases/failure-cache-stampede-rollout-regression/" data-link-title="2.C9 反例：快取切換引發 Stampede 回歸" data-link-desc="快取策略切換若缺乏保護，會導致回源壓力與錯誤率連鎖上升。">2.C9 cache stampede</a> 的雪崩、<a href="/blog/backend/02-cache-redis/vendors/redis/connection-pipeline-latency/" data-link-title="Redis 連線與 pipeline：RTT 稅、連線池與一次往返打包多命令" data-link-desc="Redis 單命令通常微秒級執行，但 application 端量到的延遲是毫秒級——差距全在網路往返（RTT）。pipelining 的本質不是『批次發命令』，是把 N 次 RTT 壓成 1 次。本文展開 RTT 會計、連線池配置、pipeline 與 MULTI 的差異、5 個把連線與往返寫成延遲與正確性問題的 production 踩坑，以及連線模型撞牆的邊界">連線風暴</a>、<a href="/blog/backend/02-cache-redis/vendors/redis/memory-eviction-tuning/" data-link-title="Redis 記憶體與淘汰調校：maxmemory-policy、LFU 與碎片化的實戰判讀" data-link-desc="Redis 的記憶體是一條會在半夜爆掉的曲線：maxmemory 設多少、policy 選 LRU 還 LFU、碎片化什麼時候開始吃掉 30% RAM、OOM 時 noeviction 怎麼讓寫入全部失敗。本文展開 Redis 記憶體會計模型、eviction policy 的選型判讀、5 個把記憶體配置寫成 production 事故的踩坑，以及單機記憶體撞牆後該往 cluster 還是 DragonflyDB 走的邊界">eviction 選錯</a> 在 ElastiCache 上跟自管 Redis 一模一樣會發生——因為這些是 cache 使用方式的問題，不是運維的問題。</p>
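<p>The stampede row of the table leaves jitter and singleflight to the client. A minimal Python sketch of both ideas, assuming a threaded application. The names <code>jittered_ttl</code> and <code>SingleFlight</code> and the 10% spread are illustrative assumptions, not part of ElastiCache or any Redis client library:</p>

```python
import random
import threading


def jittered_ttl(base_seconds, spread=0.1, rng=random):
    """Shift the base TTL by up to +/- spread so keys written together do not expire together."""
    delta = base_seconds * spread
    return base_seconds + rng.uniform(-delta, delta)


class SingleFlight:
    """Collapse concurrent loads of the same key into a single call to the loader."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_box)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, box = entry
        if leader:
            try:
                box["value"] = loader()  # only the leader hits the origin
            except Exception as exc:
                box["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # wake every follower waiting on this key
        else:
            done.wait()
        if "error" in box:
            raise box["error"]
        return box["value"]
```

<p>On a cache miss, the application would call <code>SingleFlight.do(key, loader)</code> and write the result back with <code>jittered_ttl(...)</code>. The sketch only deduplicates within one process; deduplicating across hosts needs a different mechanism.</p>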
<h3 id="engine-選擇與-cluster-mode">engine 選擇與 cluster mode</h3>
<p>ElastiCache 的兩個結構性決策：</p>
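<p><strong>engine</strong>: Since 2024 the default is Valkey (roughly 20% cheaper, OSI-approved open source, forked from Redis 7.2.4, API-compatible). Redis OSS is still selectable, but it is no longer the option AWS recommends. Memcached is a separate line (plain key-value, no cluster mode concept). Choose Valkey for new deployments and for migrations from existing Redis (compatible and cheaper); consider Memcached only for pure caching.</p>
<p><strong>cluster mode</strong>: disabled gives 1 primary plus up to 5 replicas in a single shard, capped at roughly 340GB; enabled gives multiple shards (up to 500) with automatic sharding and horizontal scaling. Rule of thumb: if the dataset is &lt; 300GB and needs no sharding, use disabled (simpler); if it is &gt; 300GB or must scale horizontally, use enabled (the client must then be cluster-aware).</p>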
<h2 id="配置建立與治理的設定路徑">配置：建立與治理的設定路徑</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Create a Valkey replication group (Multi-AZ, automatic failover, cluster mode disabled)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># --num-cache-clusters 3: 1 primary + 2 replicas</span>
</span></span><span class="line"><span class="cl"><span class="c1"># --snapshot-retention-limit 7: keep automatic snapshots for 7 days</span>
</span></span><span class="line"><span class="cl"><span class="c1"># (comments stay above the command: text after a trailing backslash breaks the line continuation)</span>
</span></span><span class="line"><span class="cl">aws elasticache create-replication-group \
</span></span><span class="line"><span class="cl">  --replication-group-id prod-cache \
</span></span><span class="line"><span class="cl">  --replication-group-description <span class="s2">&#34;prod cache&#34;</span> \
</span></span><span class="line"><span class="cl">  --engine valkey \
</span></span><span class="line"><span class="cl">  --cache-node-type cache.r7g.large \
</span></span><span class="line"><span class="cl">  --num-cache-clusters 3 \
</span></span><span class="line"><span class="cl">  --automatic-failover-enabled \
</span></span><span class="line"><span class="cl">  --multi-az-enabled \
</span></span><span class="line"><span class="cl">  --snapshot-retention-limit 7 \
</span></span><span class="line"><span class="cl">  --at-rest-encryption-enabled \
</span></span><span class="line"><span class="cl">  --transit-encryption-enabled
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Custom parameter group (maxmemory-policy and similar knobs remain your responsibility)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># It only takes effect once attached to the replication group via --cache-parameter-group-name,</span>
</span></span><span class="line"><span class="cl"><span class="c1"># and its family must match the engine version actually deployed</span>
</span></span><span class="line"><span class="cl">aws elasticache create-cache-parameter-group \
</span></span><span class="line"><span class="cl">  --cache-parameter-group-name prod-params \
</span></span><span class="line"><span class="cl">  --cache-parameter-group-family valkey8 \
</span></span><span class="line"><span class="cl">  --description <span class="s2">&#34;prod cache params&#34;</span>
</span></span><span class="line"><span class="cl">aws elasticache modify-cache-parameter-group \
</span></span><span class="line"><span class="cl">  --cache-parameter-group-name prod-params \
</span></span><span class="line"><span class="cl">  --parameter-name-values <span class="s2">&#34;ParameterName=maxmemory-policy,ParameterValue=allkeys-lru&#34;</span></span></span></code></pre></div><p>How to read this configuration:</p>
<ul>
<li><code>--automatic-failover-enabled</code> + <code>--multi-az-enabled</code> 是 HA 的核心，把 <a href="/blog/backend/02-cache-redis/vendors/redis/sentinel-ha-failover/" data-link-title="Redis Sentinel 與 failover 時序：從 master 死掉到 client 重連的每一段" data-link-desc="Redis Sentinel 的 failover 不是一個瞬間動作，是 down 偵測 → quorum 確認 → 選主 → 提升 → 配置廣播 → client 重連的一條時序鏈，每一段都有自己的延遲與失敗模式。本文展開 Sentinel 的判定模型與這條時序、5 個讓 failover 卡住或丟資料的 production 踩坑，以及 Sentinel 撐不住該往 Cluster 或 managed 走的邊界">Sentinel 那條 failover 時序鏈</a>託管掉</li>
<li><code>maxmemory-policy</code> 透過 parameter group 設定——AWS 給旋鈕、選哪個是你的責任（見 <a href="/blog/backend/02-cache-redis/vendors/redis/memory-eviction-tuning/" data-link-title="Redis 記憶體與淘汰調校：maxmemory-policy、LFU 與碎片化的實戰判讀" data-link-desc="Redis 的記憶體是一條會在半夜爆掉的曲線：maxmemory 設多少、policy 選 LRU 還 LFU、碎片化什麼時候開始吃掉 30% RAM、OOM 時 noeviction 怎麼讓寫入全部失敗。本文展開 Redis 記憶體會計模型、eviction policy 的選型判讀、5 個把記憶體配置寫成 production 事故的踩坑，以及單機記憶體撞牆後該往 cluster 還是 DragonflyDB 走的邊界">eviction 調校</a>）</li>
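<li><code>--transit-encryption-enabled</code> turns on TLS in transit. The TLS handshake makes every new connection more expensive, so connection pooling (reusing established connections) matters even more</li>
<li>IAM authentication (Redis OSS 7.0+ or Valkey, with in-transit encryption enabled) can replace a long-lived AUTH password; see the <a href="/blog/backend/07-security-data-protection/" data-link-title="模組七：資安與資料保護" data-link-desc="以問題驅動方式擴充資安知識網：先定義服務環節問題，再以案例作為觸發式參考">security module</a></li>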
</ul>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1failover-期間-client-持續-error">Case 1：failover 期間 client 持續 error</h3>
<p><strong>徵兆</strong>：ElastiCache 觸發 failover（看 <code>describe-events</code>），AWS 端 replica 晉升完成，但 application 持續 30 秒到幾分鐘大量連線 error。</p>
<p><strong>根因</strong>：failover 時 primary endpoint 的 DNS 切到新 primary，但 client 的連線池還握著舊 primary 的連線、DNS 也可能有快取。AWS 完成了 failover，但 client 重連是你的責任——ElastiCache 不會幫你的 application 重連。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>client 用支援自動重連的 library，設合理的 socket timeout 與 retry（見 <a href="/blog/backend/02-cache-redis/vendors/redis/connection-pipeline-latency/" data-link-title="Redis 連線與 pipeline：RTT 稅、連線池與一次往返打包多命令" data-link-desc="Redis 單命令通常微秒級執行，但 application 端量到的延遲是毫秒級——差距全在網路往返（RTT）。pipelining 的本質不是『批次發命令』，是把 N 次 RTT 壓成 1 次。本文展開 RTT 會計、連線池配置、pipeline 與 MULTI 的差異、5 個把連線與往返寫成延遲與正確性問題的 production 踩坑，以及連線模型撞牆的邊界">連線調校</a>）</li>
<li>連到 primary endpoint（會跟著 failover 更新 DNS），不要連到特定 node 的 endpoint</li>
<li>縮短 client 的 DNS 快取 TTL，讓 failover 後的 DNS 切換更快被看到</li>
<li>failover 期間的寫入中斷無法完全避免（非同步複製 + 重連時間），latency-sensitive 服務要設計降級</li>
</ol>
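<p>The reconnect responsibility described in this case can be sketched as a retry wrapper that throws away the broken connection and backs off with jitter. This is a minimal sketch under stated assumptions: <code>call_with_reconnect</code>, <code>connect</code>, and <code>operation</code> are hypothetical names, the delays are placeholder values, and whether a new connection actually re-resolves the primary endpoint depends on the DNS caching of your runtime and client library:</p>

```python
import random
import time


def call_with_reconnect(connect, operation, attempts=5,
                        base_delay=0.05, max_delay=1.0, sleep=time.sleep):
    """Run operation(conn); on ConnectionError discard the connection and reconnect."""
    conn = None
    for attempt in range(attempts):
        try:
            if conn is None:
                # Assumption: connect() dials the primary endpoint by name,
                # so a fresh connection can land on the newly promoted primary.
                conn = connect()
            return operation(conn)
        except ConnectionError:
            conn = None  # never reuse a socket that pointed at the old primary
            if attempt == attempts - 1:
                raise  # retry budget exhausted: let the caller degrade
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # jitter avoids a synchronized reconnect storm
```

<p>Most production Redis clients ship their own retry and reconnect settings; the point of the sketch is only that the retry budget, the backoff, and the degraded path when retries run out are decisions the application owns.</p>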
<h3 id="case-2跨-az-replication-lag-造成-stale-read">Case 2：跨 AZ replication lag 造成 stale read</h3>
<p><strong>徵兆</strong>：寫入 primary 後立刻從 replica 讀，偶爾讀到舊值；CloudWatch 的 <code>ReplicationLag</code> 在高寫入時段上升。</p>
<p><strong>根因</strong>：ElastiCache 的跨 AZ 複製是非同步的，replica 有 lag。AWS 保證複製會發生，但不保證即時——read-from-replica 在寫後立即讀的場景會看到 stale window。這跟自管 Redis 的 replica 行為一致，managed 沒有消除它。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>寫後需要立即一致讀的路徑，強制 read from primary</li>
<li>監控 CloudWatch <code>ReplicationLag</code>，持續高代表寫入超過複製能力，要 scale up node 或降寫入</li>
<li>接受 cache 的最終一致性——這是 cache copy 的本質，不是 bug（見 <a href="/blog/backend/02-cache-redis/cache-copy-freshness-boundary/" data-link-title="2.7 Cache Copy Boundary 與 Freshness" data-link-desc="說明快取何時只是可重建副本，何時會影響交易、權限或配額正確性。">cache copy boundary</a>）</li>
<li>需要強一致 + durability 走 MemoryDB（見本文 Capacity / cost 邊界段）</li>
</ol>
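<p>Forcing reads to the primary only on paths that need read-after-write consistency can be sketched as a small router. This is an illustrative sketch: <code>ReadRouter</code> and the 1-second pin window are assumptions (the window would be chosen from the observed <code>ReplicationLag</code>), and the bookkeeping is per process, so it does not cover writes made by other application instances:</p>

```python
import time


class ReadRouter:
    """Send reads for a key to the primary for a short window after that key was written."""

    def __init__(self, pin_seconds=1.0, clock=time.monotonic):
        self._pin_seconds = pin_seconds
        self._clock = clock
        self._last_write = {}  # key -> time of the most recent local write

    def record_write(self, key):
        self._last_write[key] = self._clock()

    def target_for_read(self, key):
        written_at = self._last_write.get(key)
        if written_at is not None and self._clock() - written_at < self._pin_seconds:
            return "primary"  # the replica may not have applied the write yet
        return "replica"
```

<p>A long-running process would also need to prune <code>_last_write</code>; the sketch omits that to keep the routing decision visible.</p>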
<h3 id="case-3serverless-計費超出預期">Case 3：Serverless 計費超出預期</h3>
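<p><strong>Symptom</strong>: The team adopted ElastiCache Serverless to avoid capacity planning, and the month-end bill came in far above expectations.</p>
<p><strong>Root cause</strong>: Serverless bills by ECPU (compute) plus storage. Traffic spikes and inefficient access patterns (many small commands, large values) drive ECPU consumption up. Serverless solves "I do not want to plan capacity", not "it is always cheaper": for predictable steady-state traffic, node-based capacity with reserved nodes is usually cheaper.</p>
<p><strong>Fix</strong>:</p>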
<ol>
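<li>For predictable, steadily high workloads, use node-based capacity with reserved nodes (1-year or 3-year commitment, roughly 30-60% discount)</li>
<li>Serverless fits workloads with unpredictable traffic and long idle periods</li>
<li>Monitor ECPU consumption and find the access patterns that drive cost. ECPUs are metered per command by vCPU time and data transferred, so trimming value sizes and removing unneeded requests lowers the bill; pipelining mainly saves round trips, and whether it reduces billed ECPUs should be checked against the pricing page</li>
<li>Compare cost models against the actual workload; do not assume Serverless is the cheaper option</li>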
</ol>
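<p>The cost comparison in the last step can be sketched as plain arithmetic. Every price is a parameter the caller must fill in from the current pricing page; the function names, the 730-hour month, and the GB-hour storage unit are assumptions of this sketch rather than figures from the article:</p>

```python
def serverless_monthly_cost(ecpus_per_month, avg_gb_stored,
                            price_per_million_ecpu, price_per_gb_hour, hours=730):
    """Serverless bill = ECPU charge + storage charge (storage assumed billed per GB-hour)."""
    ecpu_charge = ecpus_per_month / 1_000_000 * price_per_million_ecpu
    storage_charge = avg_gb_stored * hours * price_per_gb_hour
    return ecpu_charge + storage_charge


def node_monthly_cost(node_count, price_per_node_hour, reserved_discount=0.0, hours=730):
    """Node-based bill = node-hours, optionally reduced by a reserved-node discount (0.0 to 1.0)."""
    return node_count * price_per_node_hour * hours * (1 - reserved_discount)


def cheaper_option(serverless_cost, node_cost):
    """Name the cheaper model for this workload; ties go to node-based."""
    return "serverless" if serverless_cost < node_cost else "node-based"
```

<p>Feeding in measured ECPU and storage numbers for a real month, rather than estimates, is what makes the comparison meaningful.</p>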
<h3 id="case-4cluster-mode-enabled-但-client-不是-cluster-aware">Case 4：cluster mode enabled 但 client 不是 cluster-aware</h3>
<p><strong>徵兆</strong>：建了 cluster mode enabled 的 cluster，application 連線報 <code>MOVED</code> redirect 或連不上某些 key。</p>
<p><strong>根因</strong>：cluster mode enabled 把 keyspace 分到多 shard，client 必須 cluster-aware（懂 <code>CLUSTER SLOTS</code>、處理 <code>MOVED</code>/<code>ASK</code> redirect）才能正確路由。普通 standalone client 連 cluster mode enabled 會失敗。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>cluster mode enabled 一律用 cluster-aware client（連 configuration endpoint 不是單一 node）</li>
<li>確認 application 的多 key 操作用 hash tag 把相關 key co-locate 同 slot（見 <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">cluster re-sharding</a>）</li>
<li>dataset &lt; 300GB 且不需 sharding，用 cluster mode disabled 省掉這層複雜度</li>
<li>從 disabled 升 enabled 是有成本的架構變更，初期規劃就要決定</li>
</ol>
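<p>What a cluster-aware client computes before routing a command, and why a hash tag co-locates related keys, can be sketched in a few lines. The sketch follows the key-to-slot rule of the Redis Cluster specification (CRC16 of the key modulo 16384, hashing only the text inside the first non-empty <code>{...}</code>); the function names are illustrative, and real clients use a table-driven CRC instead of this bitwise loop:</p>

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used for cluster key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Map a key to one of 16384 slots, honoring a non-empty {hash tag} if present."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]  # hash only the tag, so tagged keys share a slot
    return crc16_xmodem(raw) % 16384
```

<p>Keys such as <code>{user:42}:profile</code> and <code>{user:42}:settings</code> hash to the same slot, which is what allows multi-key commands on them in cluster mode enabled.</p>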
<h3 id="case-5snapshot-期間記憶體尖峰node-不穩">Case 5：snapshot 期間記憶體尖峰、node 不穩</h3>
<p><strong>徵兆</strong>：自動 snapshot 時段 node 延遲上升、<code>DatabaseMemoryUsagePercentage</code> 衝高，偶爾 snapshot 失敗。</p>
<p><strong>根因</strong>：Redis engine 的 snapshot 靠 fork（見 <a href="/blog/backend/02-cache-redis/vendors/redis/persistence-fork-latency/" data-link-title="Redis 持久化與 fork latency：AOF、RDB 與那一次卡住整個 cluster 的 fork" data-link-desc="Redis 的 RDB save 與 AOF rewrite 都靠一次 fork()，而 fork 在大記憶體實例上會凍結主執行緒數百毫秒、複製分頁讓記憶體逼近翻倍。本文展開 AOF / RDB 的機制與 fsync 取捨、copy-on-write 的記憶體放大、5 個把持久化寫成延遲尖峰與資料遺失的 production 踩坑，以及 cache 場景到底要不要持久化的邊界">persistence / fork latency</a>），fork 期間 copy-on-write 推高記憶體。如果 node 記憶體已吃緊，snapshot 的 fork 把它推爆。AWS 託管了 snapshot 排程，但 fork 的記憶體成本仍在 engine 層存在。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>node 記憶體留 headroom（不要長期 &gt; 80%），給 snapshot 的 fork copy-on-write 空間</li>
<li>snapshot window 設在低流量時段，減少 fork 期間被改的 page</li>
<li>監控 CloudWatch <code>DatabaseMemoryUsagePercentage</code>，&gt; 80% 考慮 scale up node type</li>
<li>Valkey engine 繼承 Redis 的 fork 模型，這個成本換 engine 到 Valkey 也還在（fork-less 要 DragonflyDB、但 ElastiCache 不提供）</li>
</ol>
<h2 id="capacity--cost-邊界">Capacity / cost 邊界</h2>
<p>ElastiCache 的容量判讀，混合了 AWS 的 metric 與 engine 層的行為：</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>健康區間</th>
          <th>警戒與動作</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>DatabaseMemoryUsagePercentage</code></td>
          <td>&lt; 80%</td>
          <td>&gt; 80% → scale up node 或調 maxmemory-policy</td>
      </tr>
      <tr>
          <td><code>ReplicationLag</code></td>
          <td>&lt; 1 秒</td>
          <td>持續高 → 寫入超過複製能力</td>
      </tr>
      <tr>
          <td><code>CurrConnections</code></td>
          <td>遠低於 node 上限</td>
          <td>接近上限 → client 連線池問題</td>
      </tr>
      <tr>
          <td><code>CacheHitRate</code></td>
          <td>&gt; 90%（多數 cache）</td>
          <td>下滑 → TTL / eviction / key 設計問題</td>
      </tr>
      <tr>
          <td>Serverless ECPU</td>
          <td>對齊預算</td>
          <td>Spike → inefficient access pattern; trim value sizes and unneeded requests (pipelining saves round trips, not necessarily ECPUs)</td>
      </tr>
  </tbody>
</table>
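<p>The table above can be encoded as a simple check that turns metric samples into follow-up actions. This is a sketch under assumptions: the caller supplies the node connection limit, the 80% connection warning ratio is a placeholder for "close to the limit", and a single sample stands in for what should be a sustained condition in a real alarm:</p>

```python
def capacity_actions(memory_pct, replication_lag_s, curr_connections,
                     max_connections, hit_rate_pct, connection_warn_ratio=0.8):
    """Return the follow-up actions suggested by the capacity table for one metric sample."""
    actions = []
    if memory_pct > 80:
        actions.append("DatabaseMemoryUsagePercentage: scale up node or revisit maxmemory-policy")
    if replication_lag_s >= 1:
        actions.append("ReplicationLag: writes may exceed replication capacity")
    if curr_connections >= max_connections * connection_warn_ratio:
        actions.append("CurrConnections: check client connection pools")
    if hit_rate_pct < 90:
        actions.append("CacheHitRate: review TTL, eviction, and key design")
    return actions
```

<p>In practice these thresholds would live in CloudWatch alarms; the function only makes the mapping from signal to action explicit.</p>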
<p>撞牆後的路由判斷：</p>
<ul>
<li><strong>需要 source-of-truth 的 Redis API（不是 cache）</strong>：ElastiCache 是 cache 語意（資料可重建）。需要 durability 走 <strong>AWS MemoryDB</strong>——Redis-compatible 但有 multi-AZ transaction log、提供 source-of-truth 語意，成本約 ElastiCache 的 2-3 倍。判讀：<a href="/blog/backend/09-performance-capacity/cases/tubi-elasticache-ml-feature-store/" data-link-title="9.C25 Tubi：從 ScyllaDB 遷到 ElastiCache、ML feature store 達 sub-10ms p99" data-link-desc="Tubi 把 ML 推薦的 feature store 從 ScyllaDB 遷到 ElastiCache for Redis、99 百分位延遲降到 10ms 以下">Tubi 把 feature store 從 ScyllaDB 遷到 ElastiCache</a> 的前提是「feature 可重新計算」——可重建選 ElastiCache，不可重建選 MemoryDB 或 database。</li>
<li><strong>跨雲 / 不在 AWS 生態</strong>：ElastiCache 綁 AWS，跨雲走自管 <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis / Valkey</a> 或 GCP Memorystore / Azure Cache。</li>
<li><strong>極端單機 throughput</strong>：要榨單機多核走自管 <a href="/blog/backend/02-cache-redis/vendors/dragonflydb/" data-link-title="DragonflyDB" data-link-desc="高效能 Redis / Memcached 相容替代、多核架構">DragonflyDB</a>（ElastiCache 不提供 Dragonfly engine）。</li>
<li><strong>Cross-region active-passive DR</strong>: use ElastiCache Global Datastore (1 primary region plus read-only secondary regions, with cross-region lag typically &lt; 1 second). It does not support active-active multi-master.</li>
</ul>
<h2 id="整合--下一步">整合 / 下一步</h2>
<p>ElastiCache 的 deep article 本質是「劃清 managed 邊界」，它跟 engine 層的調校知識緊密相連：</p>
<ul>
<li><strong>跟 <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis 全系列 deep article</a></strong>：eviction、persistence/fork、連線的調校在 ElastiCache 上仍適用（engine 是 Redis/Valkey），AWS 託管的是 failover/patching/snapshot 排程，不是這些 engine 行為。</li>
<li><strong>跟 <a href="/blog/backend/02-cache-redis/vendors/valkey/redis-compatibility-and-io-threads/" data-link-title="Valkey 相容性驗證與 io-threads 調校：drop-in 切換與多執行緒的實機判讀" data-link-desc="Valkey 跟 Redis 100% 相容這句話要怎麼驗證、切換才敢上線。本文用 INFO server 的雙版本回報拆解相容性的真實邊界、展開 Valkey 8 的 io-threads 多執行緒調校、5 個把 drop-in 切換或執行緒配置寫成事故的 production 踩坑，以及相容性撞牆該怎麼判斷的邊界">Valkey 相容性</a></strong>：ElastiCache 的 default engine 就是 Valkey，相容性與 io-threads 的判讀直接適用。</li>
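<li><strong>With <a href="/blog/backend/02-cache-redis/cases/netflix-evcache-global-cache-layer/" data-link-title="2.C6 Netflix：EVCache 全域快取層" data-link-desc="快取從本地層演進為跨區分散式能力的案例。">Netflix EVCache</a></strong>: EVCache is Netflix's self-managed, Memcached-based global cache. The managed counterpart for cross-region replication is Global Datastore, which is offered for the Valkey and Redis OSS engines but not for ElastiCache for Memcached, so the comparison shows the tradeoff between self-managed cross-region caching and managed cross-region replication.</li>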
<li><strong>跟 <a href="/blog/backend/09-performance-capacity/cases/tinder-elasticache-valkey-matching/" data-link-title="9.C6 Tinder：ElastiCache for Valkey 撐 4700 萬月活的配對引擎" data-link-desc="Tinder 用 Amazon ElastiCache for Valkey 提供配對引擎所需的次毫秒延遲快取層">Tinder</a> / <a href="/blog/backend/09-performance-capacity/cases/tubi-elasticache-ml-feature-store/" data-link-title="9.C25 Tubi：從 ScyllaDB 遷到 ElastiCache、ML feature store 達 sub-10ms p99" data-link-desc="Tubi 把 ML 推薦的 feature store 從 ScyllaDB 遷到 ElastiCache for Redis、99 百分位延遲降到 10ms 以下">Tubi</a></strong>：兩個 ElastiCache 規模化案例，一個是 sub-ms 配對引擎、一個是 ML feature store p99&lt;10ms，都展示了「AWS 給吞吐、你給設計」的邊界。</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/" data-link-title="AWS ElastiCache" data-link-desc="AWS managed Redis / Valkey / Memcached">AWS ElastiCache</a></li>
<li>engine 層 deep article：<a href="/blog/backend/02-cache-redis/vendors/redis/memory-eviction-tuning/" data-link-title="Redis 記憶體與淘汰調校：maxmemory-policy、LFU 與碎片化的實戰判讀" data-link-desc="Redis 的記憶體是一條會在半夜爆掉的曲線：maxmemory 設多少、policy 選 LRU 還 LFU、碎片化什麼時候開始吃掉 30% RAM、OOM 時 noeviction 怎麼讓寫入全部失敗。本文展開 Redis 記憶體會計模型、eviction policy 的選型判讀、5 個把記憶體配置寫成 production 事故的踩坑，以及單機記憶體撞牆後該往 cluster 還是 DragonflyDB 走的邊界">Redis 記憶體與淘汰</a>、<a href="/blog/backend/02-cache-redis/vendors/redis/persistence-fork-latency/" data-link-title="Redis 持久化與 fork latency：AOF、RDB 與那一次卡住整個 cluster 的 fork" data-link-desc="Redis 的 RDB save 與 AOF rewrite 都靠一次 fork()，而 fork 在大記憶體實例上會凍結主執行緒數百毫秒、複製分頁讓記憶體逼近翻倍。本文展開 AOF / RDB 的機制與 fsync 取捨、copy-on-write 的記憶體放大、5 個把持久化寫成延遲尖峰與資料遺失的 production 踩坑，以及 cache 場景到底要不要持久化的邊界">persistence 與 fork latency</a>、<a href="/blog/backend/02-cache-redis/vendors/redis/sentinel-ha-failover/" data-link-title="Redis Sentinel 與 failover 時序：從 master 死掉到 client 重連的每一段" data-link-desc="Redis Sentinel 的 failover 不是一個瞬間動作，是 down 偵測 → quorum 確認 → 選主 → 提升 → 配置廣播 → client 重連的一條時序鏈，每一段都有自己的延遲與失敗模式。本文展開 Sentinel 的判定模型與這條時序、5 個讓 failover 卡住或丟資料的 production 踩坑，以及 Sentinel 撐不住該往 Cluster 或 managed 走的邊界">Sentinel 與 failover 時序</a>、<a href="/blog/backend/02-cache-redis/vendors/valkey/redis-compatibility-and-io-threads/" data-link-title="Valkey 相容性驗證與 io-threads 調校：drop-in 切換與多執行緒的實機判讀" data-link-desc="Valkey 跟 Redis 100% 相容這句話要怎麼驗證、切換才敢上線。本文用 INFO server 的雙版本回報拆解相容性的真實邊界、展開 Valkey 8 的 io-threads 多執行緒調校、5 個把 drop-in 切換或執行緒配置寫成事故的 production 踩坑，以及相容性撞牆該怎麼判斷的邊界">Valkey 相容性</a></li>
<li>上游能力：<a href="/blog/backend/00-service-selection/cost-risk-tradeoffs/" data-link-title="0.6 成本、風險與選型取捨" data-link-desc="用人力成本、雲端成本、操作成本與失敗代價判斷後端能力投入順序">0.6 成本取捨</a>、<a href="/blog/backend/02-cache-redis/cache-copy-freshness-boundary/" data-link-title="2.7 Cache Copy Boundary 與 Freshness" data-link-desc="說明快取何時只是可重建副本，何時會影響交易、權限或配額正確性。">cache copy boundary</a></li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>MongoDB → Atlas：Atlas 不是 MongoDB + managed、是另一個 product</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB&lt;/a> 跟 MongoDB Atlas。本文是 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type C operational redesign hybrid 的標準形態實證。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關 — 4 phase 之間的驗證條件就是 gate。&lt;/p>&lt;/blockquote>
&lt;h2 id="atlas-不是-mongodb--managed是另一個-product">Atlas 不是 MongoDB + managed、是另一個 product&lt;/h2>
&lt;p>「MongoDB Atlas 是 MongoDB 的 managed 版本」這個 framing 看似合理、實際誤導：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Protocol 相容&lt;/strong>：MongoDB wire protocol 一致、driver 不改、&lt;code>mongosh&lt;/code> 連線跟 self-managed 一樣&lt;/li>
&lt;li>&lt;strong>Storage 一致&lt;/strong>：WiredTiger storage engine 一樣、document model 一樣&lt;/li>
&lt;li>&lt;strong>API 一致&lt;/strong>：Aggregation framework、indexing、change stream 都一樣&lt;/li>
&lt;/ul>
&lt;p>但 &lt;em>operational surface 完全不同&lt;/em>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Operational concept&lt;/th>
 &lt;th>Self-managed MongoDB&lt;/th>
 &lt;th>Atlas&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Cluster bootstrap&lt;/td>
 &lt;td>mongod + replica set config + cfgsvr + shard 手動&lt;/td>
 &lt;td>UI / API 一鍵建集群、全自動&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>HA&lt;/td>
 &lt;td>Replica set 自管 + arbiter + priority&lt;/td>
 &lt;td>自動跨 AZ replica + automatic failover&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup&lt;/td>
 &lt;td>mongodump + S3 archive 自管&lt;/td>
 &lt;td>內建 cloud backup + PITR（按 region 設）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Network access&lt;/td>
 &lt;td>VPC + security group + IP whitelist 自管&lt;/td>
 &lt;td>Atlas private endpoint / VPC peering / IP access list&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Authentication&lt;/td>
 &lt;td>mongod 內部 user / x.509 自管&lt;/td>
 &lt;td>Atlas Database User + 整合 LDAP / SSO / AWS IAM&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Monitoring&lt;/td>
 &lt;td>Self-deploy Prometheus + grafana&lt;/td>
 &lt;td>Atlas Performance Advisor + APM 內建&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sizing&lt;/td>
 &lt;td>Manual instance class + scale&lt;/td>
 &lt;td>Auto-tier scaling + tier-based pricing&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Patching&lt;/td>
 &lt;td>Manual + outage window&lt;/td>
 &lt;td>Automatic（可配置 maintenance window）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Migration 主要工作不在 &lt;em>資料層&lt;/em> — protocol drop-in 已 cover；是 &lt;em>operational stack 全換&lt;/em>：SRE runbook、monitoring dashboard、access control、IAM 整合、cost 預估全要重做。「Atlas 是 managed MongoDB」這個 framing 低估了 operational 工作量。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a> 跟 MongoDB Atlas。本文是 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> Type C operational redesign hybrid 的標準形態實證。每階段切換用 <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a> 把關 — 4 phase 之間的驗證條件就是 gate。</p></blockquote>
<h2 id="atlas-不是-mongodb--managed是另一個-product">Atlas 不是 MongoDB + managed、是另一個 product</h2>
<p>「MongoDB Atlas 是 MongoDB 的 managed 版本」這個 framing 看似合理、實際誤導：</p>
<ul>
<li><strong>Protocol 相容</strong>：MongoDB wire protocol 一致、driver 不改、<code>mongosh</code> 連線跟 self-managed 一樣</li>
<li><strong>Storage 一致</strong>：WiredTiger storage engine 一樣、document model 一樣</li>
<li><strong>API 一致</strong>：Aggregation framework、indexing、change stream 都一樣</li>
</ul>
<p>但 <em>operational surface 完全不同</em>：</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>Self-managed MongoDB</th>
          <th>Atlas</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Manual setup of mongod, replica set config, config servers, and shards</td>
          <td>UI / API 一鍵建集群、全自動</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Self-managed replica set + arbiter + priority</td>
          <td>Automatic cross-AZ replicas + automatic failover</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>Self-managed mongodump + S3 archive</td>
          <td>Built-in cloud backup + PITR (configured per region)</td>
      </tr>
      <tr>
          <td>Network access</td>
          <td>Self-managed VPC + security groups + IP whitelist</td>
          <td>Atlas private endpoint / VPC peering / IP access list</td>
      </tr>
      <tr>
          <td>Authentication</td>
          <td>Self-managed mongod internal users / x.509</td>
          <td>Atlas Database User + LDAP / SSO / AWS IAM integration</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Self-deployed Prometheus + Grafana</td>
          <td>Built-in Atlas Performance Advisor + APM</td>
      </tr>
      <tr>
          <td>Sizing</td>
          <td>Manual instance class + scale</td>
          <td>Auto-tier scaling + tier-based pricing</td>
      </tr>
      <tr>
          <td>Patching</td>
          <td>Manual + outage window</td>
          <td>Automatic (configurable maintenance window)</td>
      </tr>
  </tbody>
</table>
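<p>The bulk of the migration work is not in the <em>data layer</em>, which the drop-in protocol compatibility already covers; it is in <em>replacing the entire operational stack</em>: SRE runbooks, monitoring dashboards, access control, IAM integration, and cost estimates all have to be redone. The "Atlas is managed MongoDB" framing underestimates that operational work.</p>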
<p>Running the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> gives:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>MongoDB protocol / API fully compatible</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>HA / backup / monitoring / IAM / network all replaced</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Same document DB</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>Same single cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Connection string / IAM integration change; application logic unchanged</td>
          <td>Low/Medium</td>
      </tr>
  </tbody>
</table>
<p>The dominant dimension is Operational = High, while Schema and Paradigm are both Low, which maps to <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Type C operational redesign hybrid</a>.</p>
<h2 id="結構4-phase-operational--drop-in-cutover">Structure: 4-phase operational + drop-in cutover</h2>
<p>The structure mirrors <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> (also Type C):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Phase 0: Pre-migration audit (1-2 weeks)
</span></span><span class="line"><span class="cl">  - Workload sizing (IOPS / connections / storage)
</span></span><span class="line"><span class="cl">  - Application connection pattern audit
</span></span><span class="line"><span class="cl">  - Compliance requirement audit
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Phase 1: Operational infrastructure preparation (2-3 weeks)
</span></span><span class="line"><span class="cl">  - Atlas cluster creation
</span></span><span class="line"><span class="cl">  - VPC peering / private endpoint
</span></span><span class="line"><span class="cl">  - IAM role + Atlas Database User
</span></span><span class="line"><span class="cl">  - Monitoring + alerts
</span></span><span class="line"><span class="cl">  - Backup retention configuration
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Phase 2: Data migration (duration depends on dataset size)
</span></span><span class="line"><span class="cl">  - mongomirror / Atlas Live Migration tool
</span></span><span class="line"><span class="cl">  - or mongodump → mongorestore (small databases)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Phase 3: Cutover and verification
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Phase 4: Cleanup (decommission the self-managed deployment)</span></span></code></pre></div><p>The whole migration takes 4-12 weeks, depending on dataset size and the complexity of the organization's processes.</p>
<h2 id="phase-0pre-migration-audit">Phase 0: Pre-migration audit</h2>
<h3 id="workload-sizing--atlas-tier">Workload sizing → Atlas tier</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Self-managed observations:
</span></span><span class="line"><span class="cl">- Peak IOPS: 8000
</span></span><span class="line"><span class="cl">- P99 read latency: 5ms
</span></span><span class="line"><span class="cl">- Connection count peak: 1500
</span></span><span class="line"><span class="cl">- Storage: 800GB
</span></span><span class="line"><span class="cl">- Cross-region replication needed: yes
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Atlas tier mapping:
</span></span><span class="line"><span class="cl">- M40 (4 vCPU, 16GB RAM): IOPS 3000, not enough
</span></span><span class="line"><span class="cl">- M60 (16 vCPU, 64GB RAM): IOPS 6000, borderline (still below the 8000 peak)
</span></span><span class="line"><span class="cl">- M80 (32 vCPU, 128GB RAM): IOPS 9000, safe (chosen)
</span></span><span class="line"><span class="cl">- Storage: 1TB tier (enough for 800GB + 25% buffer)
</span></span><span class="line"><span class="cl">- Cross-region replication add-on</span></span></code></pre></div><p>Atlas does not offer <em>free-form instance classes</em>; it offers <em>fixed tiers</em>. When a workload straddles a tier boundary, choose the <em>next tier up</em> rather than pushing the lower tier to its limit.</p>
<h3 id="connection-pattern-audit">Connection pattern audit</h3>





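<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="c1">// Application connection pool config
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="p">{</span> <span class="nx">MongoClient</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">&#39;mongodb&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Connection string from the environment (variable name is illustrative)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="nx">uri</span> <span class="o">=</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">MONGODB_URI</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nx">maxPoolSize</span><span class="o">:</span> <span class="mi">100</span><span class="p">,</span>     <span class="c1">// counts against the Atlas tier-specific connection limit
</span></span></span><span class="line"><span class="cl"><span class="c1"></span>  <span class="nx">minPoolSize</span><span class="o">:</span> <span class="mi">10</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nx">maxIdleTimeMS</span><span class="o">:</span> <span class="mi">60000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">});</span></span></span></code></pre></div><p>Each Atlas tier caps the number of <em>concurrent connections</em> (this article works with roughly 1500 for M40 and 3000 for M80; confirm the current per-tier limits in the Atlas documentation). Many application instances connecting to the same cluster can hit that cap. Compute the total up front, total connections = <code>pod_count × maxPoolSize</code>, and compare it against the tier limit.</p>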
<h3 id="compliance-audit">Compliance audit</h3>
<ul>
<li><strong>Data residency</strong>: whether the Atlas deployment region satisfies GDPR and customer contracts</li>
<li><strong>Encryption at rest</strong>: enabled by default in Atlas, but the <em>encryption key is Atlas-managed</em>; strict compliance regimes require CMK / BYOK</li>
<li><strong>Audit log</strong>: Atlas provides audit logs that can be exported to S3 / Splunk</li>
</ul>
<h2 id="phase-1operational-infrastructure-準備">Phase 1: Operational infrastructure preparation</h2>
<h3 id="atlas-cluster-配置">Atlas cluster configuration</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="cl"><span class="c"># Terraform, mongodbatlas provider</span>
</span></span><span class="line"><span class="cl">resource &#34;mongodbatlas_cluster&#34; &#34;production&#34; {
</span></span><span class="line"><span class="cl">  project_id   = var.project_id
</span></span><span class="line"><span class="cl">  name         = &#34;production-cluster&#34;
</span></span><span class="line"><span class="cl">  cluster_type = &#34;REPLICASET&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  provider_name               = &#34;AWS&#34;
</span></span><span class="line"><span class="cl">  provider_region_name        = &#34;US_EAST_1&#34;
</span></span><span class="line"><span class="cl">  provider_instance_size_name = &#34;M80&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  cloud_backup           = true  <span class="c"># Cloud Backup; required by pit_enabled and the schedule below</span>
</span></span><span class="line"><span class="cl">  pit_enabled            = true  <span class="c"># PITR</span>
</span></span><span class="line"><span class="cl">  mongo_db_major_version = &#34;7.0&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  advanced_configuration {
</span></span><span class="line"><span class="cl">    javascript_enabled           = false
</span></span><span class="line"><span class="cl">    minimum_enabled_tls_protocol = &#34;TLS1_2&#34;
</span></span><span class="line"><span class="cl">    no_table_scan                = false
</span></span><span class="line"><span class="cl">    oplog_size_mb                = 51200
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c"># Backup retention</span>
</span></span><span class="line"><span class="cl">resource &#34;mongodbatlas_cloud_backup_schedule&#34; &#34;production&#34; {
</span></span><span class="line"><span class="cl">  project_id   = var.project_id
</span></span><span class="line"><span class="cl">  cluster_name = mongodbatlas_cluster.production.name
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  reference_hour_of_day    = 3
</span></span><span class="line"><span class="cl">  reference_minute_of_hour = 0
</span></span><span class="line"><span class="cl">  restore_window_days      = 7
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  policy_item_daily {
</span></span><span class="line"><span class="cl">    frequency_interval = 1
</span></span><span class="line"><span class="cl">    retention_unit     = &#34;days&#34;
</span></span><span class="line"><span class="cl">    retention_value    = 7
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}</span></span></code></pre></div><h3 id="vpc-peering--private-endpoint">VPC peering / private endpoint</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Pattern A: VPC Peering
</span></span><span class="line"><span class="cl">  AWS VPC &lt;──peering──&gt; Atlas project VPC
</span></span><span class="line"><span class="cl">  - Works across regions; route tables must be aligned
</span></span><span class="line"><span class="cl">  - Suits medium / large workloads with a stable network topology
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Pattern B: Private Endpoint (Atlas private link)
</span></span><span class="line"><span class="cl">  AWS VPC ──private link──&gt; Atlas
</span></span><span class="line"><span class="cl">  - No route table changes needed
</span></span><span class="line"><span class="cl">  - Suits complex multi-account / multi-region setups
</span></span><span class="line"><span class="cl">  - Slightly higher cost</span></span></code></pre></div><p>For production, default to Private Endpoint: it is simple to set up and integrates well with IAM.</p>
<h3 id="atlas-database-user-跟-iam-整合">Atlas Database User and IAM integration</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Pattern A: Traditional username / password
</span></span><span class="line"><span class="cl">  - Create a Database User; the application connects with SCRAM-SHA-256
</span></span><span class="line"><span class="cl">  - Suits legacy applications
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Pattern B: AWS IAM authentication (recommended)
</span></span><span class="line"><span class="cl">  - Atlas Database User type: &#34;AWS IAM&#34;
</span></span><span class="line"><span class="cl">  - Application uses an AWS IAM role + the MongoDB driver (MONGODB-AWS mechanism)
</span></span><span class="line"><span class="cl">  - Token rotates every 15 minutes; the application side must handle refresh</span></span></code></pre></div><p>Put the IAM authentication migration into the cutover schedule; do not bolt it on afterwards.</p>
<h2 id="phase-2data-migration">Phase 2: Data migration</h2>
<h3 id="atlas-live-migration-tool小到中型">Atlas Live Migration tool (small to medium)</h3>
<p>The Atlas UI has a built-in Live Migration tool:</p>
<ol>
<li>Provide the source cluster URI (self-managed MongoDB)</li>
<li>Select the Atlas target cluster</li>
<li>The tool runs a full sync followed by oplog tailing automatically</li>
<li>Perform the final cutover inside the cutover window</li>
</ol>
<p>Datasets under 100GB are straightforward; 100GB-1TB requires batching and a planned collection order.</p>
<h3 id="mongomirror大型">mongomirror (large)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># mongomirror: source → Atlas (password taken from the ATLAS_PASSWORD environment variable)</span>
</span></span><span class="line"><span class="cl">mongomirror <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --host source-replicaset/host1:27017,host2:27017 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --destination atlas-cluster-host:27017 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --destinationUsername admin <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --destinationPassword <span class="s2">&#34;</span><span class="nv">$ATLAS_PASSWORD</span><span class="s2">&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  --ssl</span></span></code></pre></div><p>mongomirror runs in two stages:</p>
<ol>
<li>Initial sync (full dump + restore)</li>
<li>Oplog tailing (continuous CDC)</li>
</ol>
<p>During cutover the application switches its connection string while mongomirror keeps streaming until the last changes have been applied.</p>
<h2 id="phase-3cutover--verification">Phase 3: Cutover + verification</h2>
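<p>One check that can gate the connection-string switch is a per-collection document count comparison between source and target. The sketch below only covers the comparison step; <code>diffCollectionCounts</code> is a hypothetical helper, and gathering the counts (for example with the driver's <code>countDocuments</code> on each collection) is left out.</p>

```javascript
// Compare per-collection document counts gathered from source and target.
// Inputs are plain objects such as { users: 1200, orders: 98000 }.
function diffCollectionCounts(sourceCounts, targetCounts) {
  const names = new Set([...Object.keys(sourceCounts), ...Object.keys(targetCounts)]);
  const mismatches = [];
  for (const name of [...names].sort()) {
    const source = sourceCounts[name] ?? null; // null = collection missing
    const target = targetCounts[name] ?? null;
    if (source !== target) mismatches.push({ name, source, target });
  }
  return mismatches;
}

// An empty array means the counts agree and the gate can open.
console.log(diffCollectionCounts({ users: 1200, orders: 98000 }, { users: 1200, orders: 97990 }));
```

<p>Matching counts do not prove the data is identical, so the sample queries remain a separate check.</p>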





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">1. Put the application in maintenance mode (block writes)
</span></span><span class="line"><span class="cl">2. Wait for mongomirror to catch up (oplog gap → 0)
</span></span><span class="line"><span class="cl">3. Verify collection counts + sample queries on the Atlas side
</span></span><span class="line"><span class="cl">4. Switch the application connection string to Atlas
</span></span><span class="line"><span class="cl">5. Lift maintenance mode; monitor for 24-48 hours
</span></span><span class="line"><span class="cl">6. Keep the self-managed mongo as a read-only standby for 1-2 weeks</span></span></code></pre></div><h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1atlas-tier-connection-limit-撞牆">Case 1: Hitting the Atlas tier connection limit</h3>
<p><strong>Symptom</strong>: after cutover the application sees a flood of <code>Connection refused</code> errors at peak traffic, and Atlas reports that the connection limit has been reached; the problem never appeared on the self-managed deployment.</p>
<p><strong>Root cause</strong>: the M80 connection limit assumed here is ~3000, while the application runs 100 pods × maxPoolSize=50 = 5000 connections, which exceeds the limit.</p>
<p><strong>Fix</strong>:</p>
<ol>
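<li><strong>Pre-migration calculation</strong>: compare total connections against the Atlas tier limit and move up a tier if they exceed it</li>
<li><strong>Lower maxPoolSize</strong>: 100 pods × 30 = 3000 lands exactly on the cap, so a burst can still hit it</li>
<li><strong>Add a connection proxy</strong>: place a connection pooler between the application and Atlas (for example mongos in a sharded deployment, or a ProxySQL-style proxy)</li>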
</ol>
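<p>Because 100 × 30 sits exactly on the cap, it helps to derive <code>maxPoolSize</code> from the limit with some headroom instead of picking it by hand. A minimal sketch, assuming the ~3000 limit used in this case; <code>maxPoolSizePerPod</code> and the 20% default reserve are hypothetical.</p>

```javascript
// Largest per-pod maxPoolSize that keeps podCount × maxPoolSize within the
// tier limit after reserving headroomPct percent for bursts and ops access.
function maxPoolSizePerPod(tierLimit, podCount, headroomPct = 20) {
  const usable = Math.floor((tierLimit * (100 - headroomPct)) / 100);
  return Math.floor(usable / podCount);
}

console.log(maxPoolSizePerPod(3000, 100));    // 24
console.log(maxPoolSizePerPod(3000, 100, 0)); // 30
```

<p>If the resulting pool size is too small for the workload, that is the signal to move up a tier or add a pooler rather than to trim the headroom.</p>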
<h3 id="case-2ip-whitelist-漏-application-vpccutover-後完全連不上">Case 2: IP whitelist misses an application VPC; nothing can connect after cutover</h3>
<p><strong>Symptom</strong>: after cutover the application reports <code>connection timeout</code> outright and the Atlas dashboard shows zero traffic; it takes an hour of troubleshooting to discover that the IP access list is missing one application VPC CIDR.</p>
<p><strong>Root cause</strong>: the Atlas IP access list denies everything by default, so every application VPC must be added explicitly; the Phase 1 setup overlooked one VPC (for example a staging account inside a multi-account organization).</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-cutover connectivity test</strong>: run a sample MongoDB connection from every application VPC and confirm the ping succeeds</li>
<li><strong>Switch to Private Endpoint</strong>: rely on PrivateLink routing instead of the IP whitelist</li>
<li><strong>Backup access</strong>: keep a bastion host with a whitelisted IP so a direct connection remains possible during an incident</li>
</ol>
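<p>Alongside the live connectivity test, the access list can be checked offline against the inventory of application VPC CIDRs. The sketch below is a plain IPv4 containment check under stated assumptions: <code>findUncoveredVpcs</code> and its helpers are hypothetical, and both lists are expected to be exported beforehand (from Atlas and from the cloud accounts) as arrays of CIDR strings.</p>

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// Parse "a.b.c.d/n" into its network base address, mask, and prefix length.
function parseCidr(cidr) {
  const [ip, bits = '32'] = cidr.split('/');
  const prefix = Number(bits);
  const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
  return { base: (ipToInt(ip) & mask) >>> 0, mask, prefix };
}

// True when the `inner` range lies entirely inside the `outer` range.
function cidrContains(outer, inner) {
  const o = parseCidr(outer);
  const i = parseCidr(inner);
  return o.prefix <= i.prefix && ((i.base & o.mask) >>> 0) === o.base;
}

// VPC CIDRs that no access-list entry covers.
function findUncoveredVpcs(accessList, vpcCidrs) {
  return vpcCidrs.filter((vpc) => !accessList.some((entry) => cidrContains(entry, vpc)));
}

console.log(findUncoveredVpcs(
  ['10.0.0.0/16', '10.20.0.0/16'],
  ['10.0.4.0/24', '10.20.0.0/16', '10.30.0.0/16'],
)); // [ '10.30.0.0/16' ]
```

<p>A non-empty result before cutover points at exactly the kind of forgotten staging VPC described in this case.</p>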
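<h3 id="case-3backup-retention-設不夠compliance-audit-抓到">Case 3: Backup retention set too short and caught by a compliance audit</h3>
<p><strong>Symptom</strong>: three months after cutover a SOX audit finds backup retention set to 7 days against a compliance requirement of 90 days; the Atlas config is hurriedly changed to 90 days, but <em>the backups from the past three months are already unrecoverable</em>.</p>
<p><strong>Root cause</strong>: Atlas backup retention only <em>applies going forward</em> and cannot be extended retroactively; the Phase 1 default configuration never went through a compliance review.</p>
<p><strong>Fix</strong>:</p>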
<ol>
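<li><strong>Run a compliance review before Phase 1</strong>: confirm retention / data residency / audit log requirements with the legal and security teams</li>
<li><strong>Default to a conservative retention value</strong> (30 / 60 days): lowering it later is easy, whereas raising it cannot bring back snapshots that have already expired</li>
<li><strong>Configure PITR and backup retention separately</strong>: PITR window 7-30 days, full backups 90-365 days</li>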
</ol>
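<p>The compliance review can leave behind a machine-checkable requirement that is compared against the planned backup settings before Phase 1 is signed off. A minimal sketch: <code>findRetentionGaps</code> and the field names <code>snapshotRetentionDays</code> and <code>pitrWindowDays</code> are hypothetical, not Atlas API fields.</p>

```javascript
// Report every retention setting that falls short of the required value.
// Both arguments map a setting name to a number of days.
function findRetentionGaps(configured, required) {
  const gaps = [];
  for (const key of Object.keys(required)) {
    const have = configured[key] ?? 0; // an unset value counts as zero days
    if (have < required[key]) {
      gaps.push({ setting: key, configuredDays: have, requiredDays: required[key] });
    }
  }
  return gaps;
}

// The situation from this case: 7 days configured, 90 days required.
const gaps = findRetentionGaps(
  { snapshotRetentionDays: 7, pitrWindowDays: 7 },
  { snapshotRetentionDays: 90, pitrWindowDays: 7 },
);
console.log(gaps);
```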
<h3 id="case-4iam-token-過期application-端-reconnect-storm">Case 4: IAM token expiry causes an application-side reconnect storm</h3>
<p><strong>Symptom</strong>: after production switches to IAM authentication, a wave of connection failures appears every 15 minutes; the Atlas log shows "auth token expired".</p>
<p><strong>Root cause</strong>: the AWS IAM token rotates every 15 minutes and the application reconnects with the stale token; the token refresh logic was not written correctly.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="c1">// Let the MongoDB driver resolve AWS credentials itself
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="p">{</span> <span class="nx">MongoClient</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">&#39;mongodb&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Connection string without username/password (variable name is illustrative)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="nx">uri</span> <span class="o">=</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">MONGODB_URI</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nx">authMechanism</span><span class="o">:</span> <span class="s1">&#39;MONGODB-AWS&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="c1">// With @aws-sdk/credential-providers installed, the driver picks up temporary
</span></span></span><span class="line"><span class="cl"><span class="c1">  // credentials from the environment (env vars, web identity, ECS / EC2 role)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">});</span></span></span></code></pre></div><p>Do not manage token rotation yourself; let the vendor SDK abstract it away.</p>
<h3 id="case-5billing-暴漲iops-跟-backup-storage-超預估">Case 5: Billing spike; IOPS and backup storage exceed estimates</h3>
<p><strong>Symptom</strong>: the first monthly Atlas bill is $15K USD against an $8K estimate; the Atlas dashboard shows backup storage and IOPS each running 1.5-2x over the estimate.</p>
<p><strong>Root cause</strong>:</p>
<ul>
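<li>Atlas backups were <em>replicated across regions</em>, doubling the storage cost</li>
<li>An IOPS-heavy workload can exhaust burst credits within an M tier, and auto-scaling then temporarily moves the cluster to a more expensive tier</li>
<li>Cross-region / cross-cloud data transfer charges were left out of the estimate</li>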
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-migration cost estimate</strong>: estimate IOPS / bandwidth from the self-managed metrics and apply Atlas pricing</li>
<li><strong>Keep backups in a single region</strong>: if cross-region DR is not required, same-region backup saves 50%</li>
<li><strong>Reserved Instance</strong>: prepay 1-3 years for stable workloads and save 30-40%</li>
<li><strong>Use Performance Advisor early</strong>: run it in the first week to find inefficient queries and reduce IOPS</li>
</ol>
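<p>A pre-migration estimate is more useful when it itemizes the components that caused the surprise here: backup copies per region and data transfer. The sketch below is a rough model only; <code>estimateMonthlyCost</code> is hypothetical and every rate passed to it is a placeholder to be replaced with figures from the Atlas pricing page.</p>

```javascript
// Itemized monthly estimate. All rates are placeholders, not real prices.
function estimateMonthlyCost({
  clusterUsd,           // flat tier price per month
  backupGb,             // backup storage footprint
  backupUsdPerGb,       // backup storage rate
  backupRegions = 1,    // each extra region stores another copy
  transferGb = 0,       // cross-region / cross-cloud data transfer
  transferUsdPerGb = 0, // transfer rate
}) {
  const backup = backupGb * backupUsdPerGb * backupRegions;
  const transfer = transferGb * transferUsdPerGb;
  return { cluster: clusterUsd, backup, transfer, total: clusterUsd + backup + transfer };
}

// Same workload with one backup region versus two.
const single = estimateMonthlyCost({ clusterUsd: 3000, backupGb: 1000, backupUsdPerGb: 0.5 });
const dual = estimateMonthlyCost({ clusterUsd: 3000, backupGb: 1000, backupUsdPerGb: 0.5, backupRegions: 2 });
console.log(single.total, dual.total);
```

<p>Comparing the itemized estimate with the first real invoice line by line shows which assumption was off, instead of only showing that the total was.</p>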
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed MongoDB</th>
          <th>Atlas</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster cost (M80)</td>
          <td>EC2 r6g.4xlarge × 3 ≈ $1.5K / mo</td>
          <td>M80 + storage + backup ≈ $3K / mo</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-1.5 FTE</td>
          <td>0.1-0.3 FTE</td>
      </tr>
      <tr>
          <td>Backup cost</td>
          <td>Self-managed S3 + tooling</td>
          <td>Built-in + tiered storage</td>
      </tr>
      <tr>
          <td>Cross-region DR cost</td>
          <td>Manual + 2x infrastructure</td>
          <td>1-click + 1.5-2x billing</td>
      </tr>
      <tr>
          <td>Time to value</td>
          <td>1-3 months (HA + ops setup)</td>
          <td>1-2 weeks (cluster ready + IAM)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-3 FTE × 2-3 months</td>
      </tr>
  </tbody>
</table>
<p><strong>Break-even</strong>: for a ~200GB, medium-sized workload, Atlas's operational savings make it cheaper than self-managed once amortized over 1-2 years; for TB+ workloads self-managed may still be cheaper, but it requires an ops team.</p>
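<p>The break-even point follows from the table: the one-off migration cost divided by the monthly saving in infrastructure plus staff time. A minimal sketch of that arithmetic; <code>breakEvenMonths</code> is hypothetical, and the fully loaded FTE cost (<code>fteMonthlyUsd</code>) is an assumption that each organization has to supply.</p>

```javascript
// Months until the one-off migration cost is paid back by monthly savings.
// Returns Infinity when Atlas is not cheaper to run month over month.
function breakEvenMonths({
  selfInfraUsd, selfFte,     // self-managed: monthly infra cost and ops FTE
  atlasInfraUsd, atlasFte,   // Atlas: monthly bill and residual ops FTE
  fteMonthlyUsd,             // fully loaded cost of one FTE per month (assumption)
  migrationFte, migrationMonths,
}) {
  const selfMonthly = selfInfraUsd + selfFte * fteMonthlyUsd;
  const atlasMonthly = atlasInfraUsd + atlasFte * fteMonthlyUsd;
  const monthlySavings = selfMonthly - atlasMonthly;
  if (monthlySavings <= 0) return Infinity;
  const migrationCost = migrationFte * migrationMonths * fteMonthlyUsd;
  return migrationCost / monthlySavings;
}

// Infra figures from the table; the FTE cost of $12K / month is illustrative.
console.log(breakEvenMonths({
  selfInfraUsd: 1500, selfFte: 1,
  atlasInfraUsd: 3000, atlasFte: 0.25,
  fteMonthlyUsd: 12000,
  migrationFte: 2, migrationMonths: 2.5,
})); // 8
```

<p>The result is dominated by the FTE terms, which is why the operational saving, not the cluster price, decides the comparison.</p>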
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-postgresql--aurora-migration-對照">Comparison with <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a></h3>
<p>Both articles are Type C operational redesign hybrids; they share a template and differ in the details:</p>
<ul>
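<li>On the Aurora side RDS Proxy is the recommended approach; on the Atlas side Private Endpoint is the more standard choice</li>
<li>On the Aurora side IAM authentication is an <em>optional best practice</em>; on Atlas IAM is the <em>recommended default</em></li>
<li>Both cost models are complex, and I/O cost is the main source of surprises</li>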
</ul>
<h3 id="跟-application-端-iam-token-rotation-整合">Integration with <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/dynamic-credential/" data-link-title="HashiCorp Vault Dynamic Credential：lease 治理跟 application 整合的實作層" data-link-desc="Vault database secrets engine 怎麼配、application 怎麼 renew lease、production 五大踩雷（lease 過期 race、DB max_connections 撞牆、Vault sealed、token expire、scope 過寬）、容量規劃跟 vault-agent injector 整合">application-side IAM token rotation</a></h3>
<p>Vault dynamic credentials can issue Atlas Database User credentials whose lease lifecycle is aligned with the application; this is good practice for high-stakes workloads, but the setup is complex.</p>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Atlas Data Federation</strong>: query across Atlas clusters, S3, and regions; evaluate this feature if you go multi-region</li>
<li><strong>Atlas Online Archive</strong>: automatically archives cold data to S3 while keeping queries transparent; saves storage cost for retention-heavy workloads</li>
<li><strong>Atlas Serverless</strong>: suits bursty workloads; not cost-effective for steady ones</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a></li>
<li>Parallel migration playbook (Type C): <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a></li>
<li>Parallel migration playbooks: <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a> (Type A, schema difference) / <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a> (Type E, paradigm shift)</li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> (this article validates the standard Type C form)</li>
</ul>
]]></content:encoded></item><item><title>Self-managed Prometheus → Grafana Cloud Metrics：feature × ops × cost 對照</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/migrate-prometheus-to-cloud-metrics/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/migrate-prometheus-to-cloud-metrics/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/prometheus/" data-link-title="Prometheus" data-link-desc="Pull-based metrics 主流 OSS、PromQL 與 alerting">Prometheus&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack&lt;/a>（Grafana Cloud Metrics、Mimir-backed）。跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6 維 audit&lt;/a> 後對映 &lt;em>Operational = High → Type C operational redesign hybrid&lt;/em>。&lt;/p>&lt;/blockquote>
&lt;h2 id="feature--ops--cost-三維對照">Feature / ops / cost 三維對照&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>Self-managed Prometheus&lt;/th>
 &lt;th>Grafana Cloud Metrics&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Storage backend&lt;/td>
 &lt;td>Local disk + remote_write (optional)&lt;/td>
 &lt;td>Mimir + S3 (auto cold tier)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Retention&lt;/td>
 &lt;td>TSDB local 15 天 default&lt;/td>
 &lt;td>13 個月 default、可延長&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>HA&lt;/td>
 &lt;td>Two Prometheus + sidecar&lt;/td>
 &lt;td>Built-in multi-AZ&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cardinality limit&lt;/td>
 &lt;td>自管 limit + recording rule&lt;/td>
 &lt;td>1.5M active series / tier、scale-up 配額&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query API&lt;/td>
 &lt;td>PromQL + Prometheus HTTP API&lt;/td>
 &lt;td>完全相容&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Alert&lt;/td>
 &lt;td>Alertmanager self-managed&lt;/td>
 &lt;td>Grafana Cloud Alerting&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Dashboard&lt;/td>
 &lt;td>Grafana self-managed&lt;/td>
 &lt;td>Grafana Cloud (included)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Long-term storage&lt;/td>
 &lt;td>Thanos / Cortex / Mimir 自管&lt;/td>
 &lt;td>Mimir 內建&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cost (mid-tier)&lt;/td>
 &lt;td>$500-2000 / mo + ops FTE&lt;/td>
 &lt;td>$300-1500 / mo (按 series)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational FTE&lt;/td>
 &lt;td>0.3-0.8&lt;/td>
 &lt;td>0.05-0.15&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit&lt;/a>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>等級&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>Low（PromQL + API 完全相容）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>（HA / retention / scaling 全託管）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Paradigm&lt;/td>
 &lt;td>Low（同 Prometheus metric paradigm）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Components&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>Low（remote_write endpoint 改）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data topology&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Operational = High → Type C standard。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links <a href="/blog/backend/04-observability/vendors/prometheus/" data-link-title="Prometheus" data-link-desc="Pull-based metrics 主流 OSS、PromQL 與 alerting">Prometheus</a> and <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a> (Grafana Cloud Metrics, Mimir-backed). Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Operational = High → Type C operational redesign hybrid</em>.</p></blockquote>
<h2 id="feature--ops--cost-三維對照">Feature / ops / cost: three-dimension comparison</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed Prometheus</th>
          <th>Grafana Cloud Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage backend</td>
          <td>Local disk + remote_write (optional)</td>
          <td>Mimir + S3 (auto cold tier)</td>
      </tr>
      <tr>
          <td>Retention</td>
          <td>Local TSDB, 15 days by default</td>
          <td>13 months by default, extendable</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Two Prometheus + sidecar</td>
          <td>Built-in multi-AZ</td>
      </tr>
      <tr>
          <td>Cardinality limit</td>
          <td>Self-managed limits + recording rules</td>
          <td>1.5M active series per tier, quota can be scaled up</td>
      </tr>
      <tr>
          <td>Query API</td>
          <td>PromQL + Prometheus HTTP API</td>
          <td>Compatible (edge cases differ, see Case 3)</td>
      </tr>
      <tr>
          <td>Alert</td>
          <td>Alertmanager self-managed</td>
          <td>Grafana Cloud Alerting</td>
      </tr>
      <tr>
          <td>Dashboard</td>
          <td>Grafana self-managed</td>
          <td>Grafana Cloud (included)</td>
      </tr>
      <tr>
          <td>Long-term storage</td>
          <td>Self-managed Thanos / Cortex / Mimir</td>
          <td>Built in (Mimir)</td>
      </tr>
      <tr>
          <td>Cost (mid-tier)</td>
          <td>$500-2000 / mo + ops FTE</td>
          <td>$300-1500 / mo (billed by active series)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.3-0.8</td>
          <td>0.05-0.15</td>
      </tr>
  </tbody>
</table>
<p>Result of the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff dimension audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Low (PromQL + API compatible)</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td><strong>High</strong> (HA / retention / scaling become fully managed)</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Low (same Prometheus metric paradigm)</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Low (only the remote_write endpoint changes)</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>Operational = High → Type C standard.</p>
<h2 id="為什麼遷retention--ops--vendor-consolidation-三條-driver">Why migrate: three drivers (retention / ops / vendor consolidation)</h2>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retention</td>
          <td>Local Prometheus TSDB retention defaults to 15 days; long-term retention means self-managing Thanos / Cortex / Mimir</td>
      </tr>
      <tr>
          <td>Ops FTE</td>
          <td>Self-managing Prometheus + Alertmanager + Grafana adds up to roughly 0.5-1 FTE</td>
      </tr>
      <tr>
          <td>Vendor consolidation</td>
          <td>Already on Grafana Cloud for logs / traces; adding metrics unifies the stack</td>
      </tr>
  </tbody>
</table>
<h2 id="operational-redesign">Operational redesign</h2>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>Self-managed</th>
          <th>Grafana Cloud Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Helm chart + manual config</td>
          <td>One-click creation in the UI</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Two-Prometheus setup</td>
          <td>Built-in multi-AZ Mimir</td>
      </tr>
      <tr>
          <td>Long-term retention</td>
          <td>Self-managed Thanos / Cortex / Mimir</td>
          <td>Built-in (S3-backed)</td>
      </tr>
      <tr>
          <td>Cardinality control</td>
          <td>Manual recording rule + relabel</td>
          <td>Adaptive Metrics (aggregation) + cardinality limit</td>
      </tr>
      <tr>
          <td>Alerting</td>
          <td>Self-managed Alertmanager</td>
          <td>Grafana Cloud Alerting (integrated)</td>
      </tr>
      <tr>
          <td>Dashboard</td>
          <td>Grafana self-host</td>
          <td>Grafana Cloud (included in the free tier)</td>
      </tr>
  </tbody>
</table>
<h2 id="migration-4-phase">Migration 4-phase</h2>
<h3 id="phase-0audit">Phase 0: Audit</h3>
<ul>
<li>List every Prometheus job / scrape config</li>
<li>Count active series (the billing basis for the Mimir tier)</li>
<li>Estimate retention requirements</li>
</ul>
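<p>The active-series count in this audit can be read from Prometheus' TSDB status endpoint. The following is a minimal sketch, assuming the response shape of <code>/api/v1/status/tsdb</code> (<code>headStats.numSeries</code> and <code>seriesCountByMetricName</code>); the helper name <code>summarize_tsdb_status</code> and the sample payload are hypothetical and only illustrate the parsing step:</p>

```python
import json


def summarize_tsdb_status(payload, top_n=3):
    """Return (total head series, top metrics by series count).

    `payload` is the decoded JSON body of GET /api/v1/status/tsdb.
    """
    data = payload["data"]
    total = int(data["headStats"]["numSeries"])
    by_metric = [
        (entry["name"], int(entry["value"]))
        for entry in data.get("seriesCountByMetricName", [])
    ]
    # Largest contributors first: these are the candidates to drop or aggregate.
    by_metric.sort(key=lambda item: item[1], reverse=True)
    return total, by_metric[:top_n]


# Hypothetical sample payload, trimmed to the fields used above.
SAMPLE = json.loads("""
{"status": "success",
 "data": {"headStats": {"numSeries": 120000},
          "seriesCountByMetricName": [
            {"name": "http_request_duration_seconds_bucket", "value": 64000},
            {"name": "up", "value": 100},
            {"name": "node_cpu_seconds_total", "value": 3200}]}}
""")

total_series, top_metrics = summarize_tsdb_status(SAMPLE, top_n=2)
```

<p>In practice the payload would come from an HTTP GET against each Prometheus instance; summing <code>total_series</code> across instances gives a first estimate of the tier you need.</p>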
<h3 id="phase-1grafana-cloud-setup">Phase 1: Grafana Cloud setup</h3>
<ul>
<li>Set up the account and organization</li>
<li>Create an API key for <code>remote_write</code></li>
<li>Enable the Grafana Cloud Mimir endpoint</li>
</ul>
<h3 id="phase-2dual-write">Phase 2: Dual-write</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># prometheus.yml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">remote_write</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l">https://prometheus-prod-XX-prod-us-central-0.grafana.net/api/prom/push</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">basic_auth</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">username</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;INSTANCE_ID&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">password</span><span class="p">:</span><span class="w"> </span><span class="l">&lt;API_KEY&gt;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">write_relabel_configs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="c"># Optional: drop high-cardinality metrics before sending</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">source_labels</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">__name__]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">regex</span><span class="p">:</span><span class="w"> </span><span class="s1">&#39;high_card_metric_.*&#39;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">action</span><span class="p">:</span><span class="w"> </span><span class="l">drop</span></span></span></code></pre></div><p>Run dual-write for 4-8 weeks and confirm that query results match on both sides and that cost stays within the estimate.</p>
<h3 id="phase-3cutover">Phase 3: Cutover</h3>
<ul>
<li>Point dashboards / alerts at the Grafana Cloud endpoint</li>
<li>Stop applications and the self-managed Grafana instance from querying the self-managed Prometheus</li>
</ul>
<h3 id="phase-4cleanup">Phase 4: Cleanup</h3>
<ul>
<li>Stop scraping on the self-managed Prometheus</li>
<li>Keep historical queries available for 1-2 months (via an archived snapshot)</li>
<li>Decommission</li>
</ul>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1cardinality-爆cost-暴漲">Case 1: Cardinality explosion, cost spike</h3>
<p><strong>Symptom</strong>: in week 2 of dual-write, Grafana Cloud active series grow from the estimated 100K to 800K and cost rises 8x.</p>
<p><strong>Root cause</strong>: application-level high-cardinality labels (user_id / request_id) were not dropped and were scraped in.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Drop unbounded labels with <code>write_relabel_configs</code></li>
<li>Redesign application metrics around fixed-bucket histograms instead of unbounded labels</li>
<li>Set a Mimir cardinality limit as a guard, plus an alert</li>
</ol>
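<p>Finding which labels are unbounded is the first step of that fix. A minimal sketch, assuming the series label sets have already been fetched (for example from the Prometheus series API) into a list of dicts; the function names, the threshold, and the sample data are hypothetical:</p>

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label across a list of label dicts."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def suspect_labels(series, threshold):
    """Labels whose distinct-value count exceeds `threshold`, worst first."""
    counts = label_cardinality(series)
    flagged = [(name, count) for name, count in counts.items() if count > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample: 50 users x 2 methods for a single metric.
SERIES = [
    {"__name__": "http_requests_total", "method": method, "user_id": f"u{i}"}
    for i in range(50)
    for method in ("GET", "POST")
]

flagged = suspect_labels(SERIES, threshold=10)
```

<p>Here only <code>user_id</code> is flagged; <code>method</code> stays bounded at two values. Flagged labels are the ones to drop in <code>write_relabel_configs</code> or remove at the instrumentation level.</p>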
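<h3 id="case-2recording-rule-對應失效">Case 2: Recording rules missing on the target</h3>
<p><strong>Symptom</strong>: after cutover some Grafana dashboard panels are empty; they depend on a Prometheus-side recording rule (<code>job:request_count:rate5m</code>) that has no counterpart in Grafana Cloud.</p>
<p><strong>Root cause</strong>: recording rules are evaluated <em>server-side</em> on Prometheus, and remote_write does not carry the rule definitions over. The recorded series reach Grafana Cloud only while the self-managed Prometheus keeps evaluating and forwarding them, so Grafana Cloud needs its own copy of the rules.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Export all recording rules and import them into Grafana Cloud Mimir</li>
<li>Or switch to <em>raw queries</em> plus Grafana query templates, so dashboards do not depend on recording rules</li>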
</ol>
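<p>Before cutover it helps to list which recording rules the dashboards actually reference. A minimal sketch, assuming dashboard queries have been exported as plain PromQL strings and that rule names follow the conventional <code>level:metric:operations</code> form (a colon inside the metric name); the function names and sample queries are hypothetical, and a colon inside a quoted label value would be a false positive:</p>

```python
import re

# A metric name containing at least one colon is treated as a recording rule.
# The leading \b keeps subquery ranges such as [5m:1m] from matching.
RULE_NAME = re.compile(r"\b[a-zA-Z_][a-zA-Z0-9_]*(?::[a-zA-Z0-9_]+)+")


def referenced_rules(queries):
    """Collect every recording-rule-style name used in the given queries."""
    found = set()
    for query in queries:
        found.update(RULE_NAME.findall(query))
    return found


def missing_rules(queries, defined_on_target):
    """Rules the dashboards use that the target does not define yet."""
    return sorted(referenced_rules(queries) - set(defined_on_target))


# Hypothetical dashboard queries and target-side rule names.
QUERIES = [
    'sum(job:request_count:rate5m{job="api"})',
    "rate(http_requests_total[5m:1m])",
    "instance:node_cpu:ratio > 0.8",
]
missing = missing_rules(QUERIES, {"instance:node_cpu:ratio"})
```

<p>Every name in <code>missing</code> is a panel that would go empty once the self-managed Prometheus stops evaluating its rules.</p>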
<h3 id="case-3promql-微差行為">Case 3: Subtle PromQL behavior differences</h3>
<p><strong>Symptom</strong>: some queries that work fine on the self-managed Prometheus return slightly different results after switching to Grafana Cloud Mimir.</p>
<p><strong>Root cause</strong>: Mimir behaves slightly differently from Prometheus in some edge cases (empty result handling / staleness marker timing); most queries match, and &lt; 1% of queries are affected.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Validate with dual queries before cutover, comparing the critical dashboards</li>
<li>Rewrite affected queries with more robust PromQL patterns</li>
<li>Document a known-incompatibility list</li>
</ol>
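<p>The dual-query validation can be sketched as a comparison of two instant-query results. This assumes both backends return the Prometheus HTTP API vector format (a list of <code>{"metric": {...}, "value": [timestamp, "string"]}</code>); the function names, the tolerance, and the sample vectors are hypothetical:</p>

```python
import math


def index_vector(result):
    """Key each sample of an instant-query vector by its label set."""
    return {
        frozenset(sample["metric"].items()): float(sample["value"][1])
        for sample in result
    }


def diff_vectors(source, target, rel_tol=1e-6):
    """Report series missing on either side and series whose values differ."""
    src, dst = index_vector(source), index_vector(target)
    shared = set(src).intersection(dst)
    return {
        "missing": [dict(key) for key in set(src) - set(dst)],
        "extra": [dict(key) for key in set(dst) - set(src)],
        "changed": [
            dict(key) for key in shared
            if not math.isclose(src[key], dst[key], rel_tol=rel_tol)
        ],
    }


# Hypothetical results of the same query against both backends.
SOURCE = [
    {"metric": {"job": "api"}, "value": [1700000000, "10"]},
    {"metric": {"job": "web"}, "value": [1700000000, "5"]},
]
TARGET = [
    {"metric": {"job": "api"}, "value": [1700000000, "10.0000001"]},
    {"metric": {"job": "db"}, "value": [1700000000, "1"]},
]
report = diff_vectors(SOURCE, TARGET)
```

<p>Running this for every query behind the critical dashboards produces the raw material for the known-incompatibility list: anything in <code>missing</code>, <code>extra</code>, or <code>changed</code> needs a closer look before cutover.</p>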
<h3 id="case-4alert-routing-改變">Case 4: Alert routing changes</h3>
<p><strong>Symptom</strong>: after cutover PagerDuty / Slack stop receiving alerts; the Alertmanager-side webhooks were never switched over.</p>
<p><strong>Root cause</strong>: alert logic moves from the self-managed Alertmanager to Grafana Cloud Alerting, so routing / contact configuration has to be rebuilt from scratch.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Rebuild alerts + routing in Grafana Cloud before cutover</li>
<li>Run both alert pipelines for 1-2 weeks and confirm that Grafana Cloud alerts are delivered</li>
<li>Switch routing at cutover and run one SOC drill</li>
</ol>
<h3 id="case-5歷史資料查不到">Case 5: Historical data is missing</h3>
<p><strong>Symptom</strong>: after cutover the SOC wants to query an event from 6 months ago, but Grafana Cloud only holds 2 months of data (since dual-write started).</p>
<p><strong>Root cause</strong>: Grafana Cloud only has data from the start of dual-write; the earlier historical data in the self-managed Prometheus was never backfilled.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>During Phase 2, use <code>promtool tsdb dump</code> + <code>mimirtool</code> to load the self-managed historical data into Mimir</li>
<li>Or keep the self-managed Prometheus read-only for 6 months (for historical queries)</li>
<li>Long-term: retention counts from cutover; historical data is a <em>one-time backfill</em></li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed</th>
          <th>Grafana Cloud Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compute (100 host, 100K series)</td>
          <td>$500-1000 / mo + ops</td>
          <td>$300-800 / mo</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.3-0.8 FTE ≈ $3K-8K / mo</td>
          <td>0.05-0.15 FTE ≈ $500-1500 / mo</td>
      </tr>
      <tr>
          <td>Long-term retention</td>
          <td>Self-managed Thanos / Cortex / Mimir</td>
          <td>Built in, 13 months</td>
      </tr>
      <tr>
          <td>Total (mid-tier)</td>
          <td>$3.5K-9K / mo (incl. FTE)</td>
          <td>$0.8K-2.3K / mo (incl. FTE)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-2 FTE × 1-2 months</td>
      </tr>
  </tbody>
</table>
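<p>The Total row can be rebuilt from the Compute and FTE rows, which also gives a rough payback period for the migration cost. A minimal sketch: the $10K per FTE-month rate is an assumption inferred from the table (0.3 FTE is shown as $3K), and taking midpoints of each range is a simplification, not the author's method:</p>

```python
FTE_MONTH_USD = 10_000  # assumption: the table equates 0.3 FTE with $3K / mo

SELF_MANAGED = {"compute": (500, 1000), "fte": (0.3, 0.8)}
GRAFANA_CLOUD = {"compute": (300, 800), "fte": (0.05, 0.15)}
MIGRATION_FTE_MONTHS = (1 * 1, 2 * 2)  # 1-2 FTE x 1-2 months


def monthly_total(option):
    """(low, high) monthly cost in USD: compute plus the FTE share."""
    low = option["compute"][0] + option["fte"][0] * FTE_MONTH_USD
    high = option["compute"][1] + option["fte"][1] * FTE_MONTH_USD
    return round(low), round(high)


def midpoint(bounds):
    """Middle of a (low, high) range."""
    return (bounds[0] + bounds[1]) / 2


self_total = monthly_total(SELF_MANAGED)
cloud_total = monthly_total(GRAFANA_CLOUD)
monthly_saving = midpoint(self_total) - midpoint(cloud_total)
migration_cost = midpoint(MIGRATION_FTE_MONTHS) * FTE_MONTH_USD
payback_months = migration_cost / monthly_saving
```

<p>The row sums come out at $3.5K-9K and $0.8K-2.3K per month, which matches the Total row to within rounding. At the midpoints the saving is about $4.7K per month against a migration cost of about $25K, so payback lands a little over five months; the ends of the ranges move that figure considerably.</p>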
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-datadog--grafana-stack-migration-對位">How this relates to the <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack migration</a></h3>
<p>There are two Grafana Stack routes:</p>
<ul>
<li>Self-host (Mimir + Loki + Tempo) on K8s: open source, self-managed</li>
<li>Grafana Cloud: SaaS, operational simplification</li>
</ul>
<p>This article covers "self-managed Prometheus → Grafana Cloud" and complements that playbook; a two-stage path that passes through self-hosting first (self-host → Cloud) ends up roughly equivalent to "Datadog → Grafana Cloud".</p>
<h3 id="跟-opentelemetry-整合">OpenTelemetry integration</h3>
<p>The OTel Collector can ship to Mimir (metrics) + Loki (logs) + Tempo (traces) at the same time; adopting OTel as part of the migration avoids repeating this work at the next vendor switch.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor：<a href="/blog/backend/04-observability/vendors/prometheus/" data-link-title="Prometheus" data-link-desc="Pull-based metrics 主流 OSS、PromQL 與 alerting">Prometheus</a></li>
<li>Target vendor：<a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a></li>
<li>平行 migration playbook (Type C)：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> / <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-to-msk/" data-link-title="Self-managed Kafka → AWS MSK：把 $15K/month operational cost 拆解到 managed" data-link-desc="Kafka self-managed → MSK 是 Type C operational redesign — protocol 完全相容、operational stack（ZooKeeper / brokers / monitoring / patching）全託管；本文用 cost 拆解開頭、5 個 production 踩雷（client connection pattern / version pinning / metric pipeline / IAM auth / cross-cluster mirror）">Kafka → MSK</a> / <a href="/blog/backend/04-observability/vendors/elastic-stack/migrate-to-elastic-cloud/" data-link-title="Self-managed ELK → Elastic Cloud：5 年 ELK 集群的 lifecycle 收尾" data-link-desc="Self-managed ELK Stack → Elastic Cloud 是 Type C operational redesign — protocol drop-in、operational stack（cluster sizing / shard 治理 / upgrade / backup）全託管；本文按 5 年 ELK lifecycle (build → scale → degrade → save → migrate) 組織、5 個 production 踩雷">ELK → Elastic Cloud</a></li>
<li>平行 D-type 對位：<a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack</a></li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>RabbitMQ → AWS SQS：交出 broker 維運、把 routing 收斂進 application</title><link>https://tarrragon.github.io/blog/backend/03-message-queue/vendors/rabbitmq/migrate-to-aws-sqs/</link><pubDate>Tue, 16 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/03-message-queue/vendors/rabbitmq/migrate-to-aws-sqs/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/03-message-queue/vendors/rabbitmq/" data-link-title="RabbitMQ" data-link-desc="Classic message broker、AMQP routing 為主">RabbitMQ&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/03-message-queue/vendors/aws-sqs/" data-link-title="AWS SQS" data-link-desc="AWS managed queue、簡單可靠、無 ordering（standard）">AWS SQS&lt;/a>。對照 &lt;a href="https://tarrragon.github.io/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&amp;#39;migration&amp;#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &amp;#43; 混合架構">Kafka ↔ NATS&lt;/a> 的 paradigm shift、本篇主導差異維度是 &lt;em>operational model&lt;/em>：source 跟 target 都是任務隊列、能力大致對得上、但運維責任從「自管 broker 叢集」整批交給 AWS managed 服務。&lt;/p>&lt;/blockquote>
&lt;p>RabbitMQ → AWS SQS 的核心是把 broker 運維責任轉移給 managed 服務、同時接受 SQS 沒有 exchange routing 這個事實、把路由邏輯收斂回 application 或改用 SNS fan-out。這個遷移不是 protocol drop-in（AMQP client 不能直接連 SQS）、application 端需要改 delivery 控制機制（manual ack → visibility timeout + delete）；但它也不是 paradigm shift（兩端都是 at-least-once 任務隊列、DLQ / 重試 / 解耦的語意一致）。主導差異落在 operational 維度、所以本文走 Type C operational redesign hybrid 結構。&lt;/p>
&lt;h2 id="為什麼遷不想再養-rabbitmq-叢集">為什麼遷：不想再養 RabbitMQ 叢集&lt;/h2>
&lt;p>觸發評估 SQS 的最常見壓力是 broker 維運成本、不是功能缺口。自管 RabbitMQ 叢集要承擔的運維責任包含 Erlang cluster 拓樸維護、network partition（腦裂）處理、quorum queue 的 Raft 一致性調校、disk / memory alarm 的容量規劃、版本升級的 rolling restart。這些責任需要至少 0.5-1 FTE 的持續投入、且在 &lt;a href="https://tarrragon.github.io/blog/backend/03-message-queue/vendors/rabbitmq/" data-link-title="RabbitMQ" data-link-desc="Classic message broker、AMQP routing 為主">network partition&lt;/a> 這類事故發生時需要熟悉 Erlang runtime 的人即時介入。&lt;/p>
&lt;p>SQS 把這整層責任移除。沒有 broker 實例、沒有 cluster 拓樸、沒有 disk / memory watermark、沒有版本升級。換來的代價是 routing 能力消失（SQS 沒有 exchange）、application 要改 delivery 控制機制、以及 AWS 生態綁定。這個交換在三種情境下成立：&lt;/p>
&lt;p>第一種是 AWS 生態原生服務。若 producer / consumer 已經跑在 Lambda、ECS、EKS 上、SQS 的 event source mapping 跟 IAM 整合讓 application 不必自管連線池跟認證。RabbitMQ 在 AWS 上要嘛自管 EC2 叢集、要嘛用 Amazon MQ（仍是 broker 模型、運維責任只是部分轉移）、都不如 SQS 的 serverless 整合直接。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links <a href="/blog/backend/03-message-queue/vendors/rabbitmq/" data-link-title="RabbitMQ" data-link-desc="Classic message broker、AMQP routing 為主">RabbitMQ</a> and <a href="/blog/backend/03-message-queue/vendors/aws-sqs/" data-link-title="AWS SQS" data-link-desc="AWS managed queue、簡單可靠、無 ordering（standard）">AWS SQS</a>. In contrast to the paradigm shift in <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a>, the dominant difference here is the <em>operational model</em>: source and target are both task queues with roughly matching capabilities, but operational responsibility moves wholesale from a self-managed broker cluster to an AWS managed service.</p></blockquote>
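<p>The core of RabbitMQ → AWS SQS is handing broker operations to a managed service while accepting that SQS has no exchange routing, which means moving routing logic back into the application or onto SNS fan-out. The migration is not a protocol drop-in (an AMQP client cannot connect to SQS directly), and the application has to change its delivery control mechanism (manual ack → visibility timeout + delete). It is not a paradigm shift either: both ends are at-least-once task queues with the same DLQ / retry / decoupling semantics. The dominant difference is operational, so this article follows the Type C operational redesign hybrid structure.</p>
<h2 id="為什麼遷不想再養-rabbitmq-叢集">Why migrate: no more running a RabbitMQ cluster</h2>
<p>The most common pressure behind evaluating SQS is broker operating cost, not a feature gap. Running a self-managed RabbitMQ cluster means owning Erlang cluster topology maintenance, network partition (split-brain) handling, Raft consistency tuning for quorum queues, capacity planning around disk / memory alarms, and rolling restarts for version upgrades. That takes at least 0.5-1 FTE of sustained effort, and incidents such as a <a href="/blog/backend/03-message-queue/vendors/rabbitmq/" data-link-title="RabbitMQ" data-link-desc="Classic message broker、AMQP routing 為主">network partition</a> need someone who knows the Erlang runtime to step in immediately.</p>
<p>SQS removes this whole layer: no broker instances, no cluster topology, no disk / memory watermarks, no version upgrades. The price is losing routing (SQS has no exchange), changing the application's delivery control mechanism, and lock-in to the AWS ecosystem. The trade holds in three situations:</p>
<p>The first is services native to the AWS ecosystem. If producers / consumers already run on Lambda, ECS, or EKS, SQS event source mapping and IAM integration mean the application does not manage its own connection pool or authentication. On AWS, RabbitMQ means either a self-managed EC2 cluster or Amazon MQ (still a broker model, with operations only partly handed over); neither integrates with serverless as directly as SQS.</p>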
<p>The second is routing that was simple to begin with. If RabbitMQ is used as a direct exchange with a few fixed routing keys, or as a plain worker pool consuming a single queue, the exchange's flexibility was never used and moving to SQS loses nothing. Airbnb's Dynein distributed delayed-job system has this shape: it uses SQS at-least-once + DLQ to replace Resque, which was limited by a single Redis, reaches about 1000 QPS per scheduler instance, and scales horizontally (see <a href="/blog/backend/03-message-queue/cases/sqs-airbnb-dynein-delayed-jobs/" data-link-title="3.C48 Airbnb Dynein：SQS 分散式延遲任務排程" data-link-desc="Airbnb 用 SQS at-least-once &#43; DLQ 取代 Resque 單 Redis 限制、每 scheduler 1000 QPS、SQS wrap DynamoDB 處理 &gt; 15 分鐘 delay。">3.C48 Airbnb Dynein</a>). For job scheduling, at-least-once delivery is enough to meet the "no data loss" requirement; broker-level routing is not needed.</p>
<p>The third is a team too small to sustain broker expertise. A small team running a RabbitMQ cluster really uses "a reliable task queue + DLQ", yet pays for the whole Erlang operations learning curve. Handing that layer to SQS lets the team put its effort back into application logic.</p>
<h2 id="6-維-diff-dimension-audit">6-dimension diff audit</h2>
<p>Before migrating, run the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">diff dimension audit</a>: rate how much source and target differ on each dimension, then pick the dominant dimension and the structure that follows from it:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>RabbitMQ (self-managed)</th>
          <th>AWS SQS (managed)</th>
          <th>Difference</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>AMQP 0-9-1 protocol, exchange / queue</td>
          <td>HTTP API, SendMessage / ReceiveMessage</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Self-managed Erlang cluster: clustering / disk / upgrades</td>
          <td>Fully managed, no instances, no versions</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Task queue + retry + DLQ</td>
          <td>Task queue + retry + DLQ</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components (1 vs N)</td>
          <td>All-in-one broker (routing built in)</td>
          <td>SQS, plus SNS to supply fan-out routing</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>manual ack / nack, prefetch, AMQP client</td>
          <td>visibility timeout + delete, batch, SDK</td>
          <td>Medium-high</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Single cluster / federation topology</td>
          <td>Region-scoped queues, no topology concept</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
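<p><strong>The dominant dimension is operational (High)</strong>: both the core value and the core risk of the migration lie in handing over broker operations wholesale. Application change rates Medium-high because the delivery control mechanism changes, but that is a contained SDK-layer rewrite, not a paradigm redesign. Components rates Medium because exchange routing has no equivalent in SQS and has to be rebuilt with SNS fan-out or multiple queues. The remaining three dimensions rate Low or Medium.</p>
<p>Because the dominant dimension is operational, the main structure follows Type C: open with the operational redesign mapping, execute in phases, and focus the failure drills on the operational traps where things look equivalent but are not. The two next-highest dimensions, Application change and Components, are not forced into the main structure; each gets its own section (the "application rewrite" and "routing consolidation" sections below).</p>
<h3 id="operational-redesign-對位">Operational redesign mapping</h3>
<p>Operational is the dimension with the largest gap, so start by mapping item by item "what we used to do ourselves, who does it now, and how":</p>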
<table>
  <thead>
      <tr>
          <th>Operational responsibility</th>
          <th>RabbitMQ (do it yourself)</th>
          <th>SQS (managed / application)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>High availability</td>
          <td>quorum queue + cluster + partition handling</td>
          <td>Automatic cross-AZ redundancy by AWS, no configuration</td>
      </tr>
      <tr>
          <td>Capacity planning</td>
          <td>disk / memory watermarks, queue length limits</td>
          <td>Scales automatically, no per-instance capacity</td>
      </tr>
      <tr>
          <td>Version upgrades</td>
          <td>Rolling restarts, compatibility testing</td>
          <td>None, maintained by AWS</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Management UI + Prometheus exporter</td>
          <td>CloudWatch metrics (depth / age)</td>
      </tr>
      <tr>
          <td>Delivery control</td>
          <td>broker-side ack / nack state machine</td>
          <td>client-side visibility timeout + delete</td>
      </tr>
      <tr>
          <td>Retry / DLQ</td>
          <td>DLX + dead-letter routing key</td>
          <td>redrive policy + maxReceiveCount</td>
      </tr>
      <tr>
          <td>Routing</td>
          <td>exchange + binding (built into the broker)</td>
          <td>Application or SNS (outside the broker)</td>
      </tr>
  </tbody>
</table>
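<p>The first four rows are pure gains: the responsibility disappears and needs no equivalent implementation. The last three rows are responsibilities that move rather than disappear: delivery control moves from the broker to the client, retry moves from DLX to a redrive policy, and routing moves from the broker to the application. These three rows are where the failure drills focus, because assuming a feature is still there when its mechanism has changed is the main source of incidents in this kind of migration.</p>
<p>The monitoring row deserves a closer look. In RabbitMQ, queue depth, unacked count, and consumer count are read straight from the broker; with SQS you watch CloudWatch's <code>ApproximateNumberOfMessagesVisible</code> (queue depth) and <code>ApproximateAgeOfOldestMessage</code> (lag signal). The difference is that SQS metrics are approximate and arrive with minute-level delay, so they are unsuitable for second-level backpressure decisions. Runbooks that relied on watching queue state live in the RabbitMQ Management UI have to be rewritten to be driven by CloudWatch alarms.</p>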
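<p>A CloudWatch-alarm-driven runbook starts from an alarm definition on the lag signal. The following is a minimal sketch that only builds the keyword arguments for CloudWatch's <code>PutMetricAlarm</code> call; the helper name, queue name, threshold, evaluation window, and SNS topic ARN are all hypothetical and should be tuned to the workload:</p>

```python
def oldest_message_alarm(queue_name, max_age_seconds, sns_topic_arn):
    """Build PutMetricAlarm arguments for an SQS consumer-lag alarm.

    Fires when the oldest message stays older than `max_age_seconds`
    for five consecutive one-minute periods.
    """
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,  # SQS metrics are minute-level, so shorter is pointless
        "EvaluationPeriods": 5,
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical queue and notification topic.
alarm = oldest_message_alarm(
    queue_name="orders",
    max_age_seconds=300,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:oncall",
)
```

<p>With boto3 installed and credentials configured, the dict can be passed as <code>boto3.client("cloudwatch").put_metric_alarm(**alarm)</code>. Alarming on message age rather than queue depth keeps the signal meaningful when traffic volume changes.</p>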
<h2 id="application-改寫manual-ack--visibility-timeout--delete">Application rewrite: manual ack → visibility timeout + delete</h2>
<p>The core of the Application change dimension is that delivery control switches to a different model. RabbitMQ tracks message state broker-side, and the consumer reports the outcome with <a href="/blog/backend/knowledge-cards/ack-nack/" data-link-title="Ack / Nack" data-link-desc="說明 consumer 如何向 broker 回報訊息處理結果">ack/nack</a>; SQS works client-side with a <a href="/blog/backend/knowledge-cards/in-flight/" data-link-title="In-Flight Work" data-link-desc="目前已接收但尚未完成處理的工作量">visibility timeout</a> + explicit delete, and the broker keeps no state beyond "in flight".</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># RabbitMQ side: manual ack pattern
import pika

# Assumption: a broker reachable on localhost; adjust for your environment.
connection = pika.BlockingConnection(pika.ConnectionParameters(&#34;localhost&#34;))
channel = connection.channel()
channel.basic_qos(prefetch_count=10)  # at most 10 unacked messages in flight per consumer

def process(body):
    ...  # placeholder for application-specific handling

def callback(ch, method, properties, body):
    try:
        process(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=False: dead-lettered to the DLX if one is configured, otherwise dropped
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

channel.basic_consume(queue=&#34;orders&#34;, on_message_callback=callback)
channel.start_consuming()</code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># SQS side: visibility timeout + delete pattern
import boto3

sqs = boto3.client(&#34;sqs&#34;)
# Assumption: the queue is named like the RabbitMQ example.
queue_url = sqs.get_queue_url(QueueName=&#34;orders&#34;)[&#34;QueueUrl&#34;]

def process(body):
    ...  # placeholder for application-specific handling

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # batch size, the closest analogue to prefetch
        WaitTimeSeconds=20,            # long polling
        VisibilityTimeout=60,          # hidden from other consumers while in flight
    )
    for msg in resp.get(&#34;Messages&#34;, []):
        try:
            process(msg[&#34;Body&#34;])
            sqs.delete_message(           # explicit delete = ack
                QueueUrl=queue_url,
                ReceiptHandle=msg[&#34;ReceiptHandle&#34;],
            )
        except Exception:
            # No delete: the message becomes visible again after the timeout and is retried
            continue</code></pre></div><p>The mapping:</p>
<ul>
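<li>RabbitMQ <code>basic_ack</code> → SQS <code>delete_message</code>: a successfully processed message must be deleted explicitly, otherwise it becomes visible again after the visibility timeout. "Doing nothing" means "retry" in SQS and "stuck in unacked" in RabbitMQ.</li>
<li>RabbitMQ <code>prefetch_count</code> → SQS <code>MaxNumberOfMessages</code> (maximum 10) + visibility timeout: concurrency control changes from "the broker caps the number of unacked messages" to "the batch size of one receive call plus a hiding time window".</li>
<li>RabbitMQ <code>basic_nack(requeue=False)</code> (dead-letter to the DLX) → SQS redrive policy: a failure is no longer the application actively dead-lettering the message; instead SQS moves it to the DLQ automatically once its receive count exceeds maxReceiveCount.</li>
<li>RabbitMQ push model (the broker pushes to consumers) → SQS pull model (the consumer long-polls): the consumer loop is structured differently. SQS never pushes, so you either poll yourself or let a Lambda event source mapping poll for you.</li>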
</ul>
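<p>Application changes concentrate in three consumer stages: receive, ack, and retry. On the producer side, moving from <code>basic_publish</code> to <code>send_message</code> is comparatively simple. The total amount of change depends on how many AMQP features were in use; a typical case is a 20-40% rewrite of the consumer code.</p>
<h2 id="routing-收斂exchange-沒了靠-sns-fan-out-或多-queue">Routing consolidation: no exchange, so use SNS fan-out or multiple queues</h2>
<p>The core of the Components dimension is that SQS has no exchange, so RabbitMQ's routing capability has to be rebuilt outside the broker. A RabbitMQ <a href="/blog/backend/knowledge-cards/broker/" data-link-title="Broker" data-link-desc="說明 broker 在訊息傳遞系統中負責保存、路由與交付訊息">exchange</a> does the fan-out inside the broker: the routing key and the bindings decide which queues a message enters. SQS is a bare queue: the producer names the queue directly and there is no routing layer in between.</p>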
<table>
  <thead>
      <tr>
          <th>RabbitMQ routing mode</th>
          <th>SQS equivalent</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Direct (fixed key)</td>
          <td>Send straight to the matching queue; routing moves into producer code</td>
      </tr>
      <tr>
          <td>Fanout (broadcast)</td>
          <td>SNS topic → multiple subscribed SQS queues (SNS-to-SQS fan-out)</td>
      </tr>
      <tr>
          <td>Topic (hierarchical key matching)</td>
          <td>SNS + message filtering (subscription filter policy)</td>
      </tr>
      <tr>
          <td>Headers</td>
          <td>SNS message attribute filtering</td>
      </tr>
  </tbody>
</table>
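<p>How to read the table:</p>
<ul>
<li><strong>Direct exchange + a few fixed keys</strong>: the easiest to migrate. The routing logic was already "key X goes to queue X"; the producer now calls <code>send_message</code> on the matching queue URL. Routing moves from the broker into the application, at the cost of a few lines of if/else or a lookup map.</li>
<li><strong>Fanout (one message to several downstreams)</strong>: use SNS-to-SQS. The SNS topic is the fan-out point and each downstream subscribes its own SQS queue. Twitch EventSub has this shape (see <a href="/blog/backend/03-message-queue/cases/sqs-twitch-eventsub-fanout/" data-link-title="3.C54 Twitch EventSub：SNS&#43;SQS fan-out 給第三方" data-link-desc="Twitch Event Bus ~1660 events/sec 進 SNS、EventSub 用 SQS 接收 &#43; Dispatcher fan-out 給訂閱者。">3.C54 Twitch EventSub</a>): SNS fans out to multiple SQS queues and each consumer reads independently. This adds an SNS layer compared with a RabbitMQ fanout exchange, in exchange for managed operations.</li>
<li><strong>Topic exchange (complex hierarchical matching)</strong>: an SNS subscription filter policy can filter on attributes, but it is less expressive than the AMQP topic wildcards <code>*</code> / <code>#</code>. Complex topic routing is a "do not migrate" signal (see the next section).</li>
</ul>
<p>The key trade-off: SQS + SNS splits RabbitMQ's single broker (routing built in) into two managed services (SQS for queueing, SNS for fan-out). The upside is that each is managed; the downside is that routing changes from declarative bindings into a combination of SNS topics, subscriptions, and filter policies, and cross-service debugging gains a layer.</p>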
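<p>To make the expressiveness gap concrete, here is a small sketch that translates an AMQP topic binding pattern into an SNS filter policy where plain exact and prefix matching suffice, and gives up otherwise. It assumes the producer copies the routing key into a message attribute named <code>routing_key</code>; that attribute name and the helper itself are hypothetical.</p>

```python
def topic_pattern_to_filter_policy(pattern, attribute="routing_key"):
    """Translate an AMQP topic binding pattern into an SNS filter policy.

    Returns None when this sketch has no translation for the pattern,
    i.e. when it uses wildcards beyond a single trailing '#'.
    """
    words = pattern.split(".")
    if "*" not in words and "#" not in words:
        return {attribute: [pattern]}  # exact match on the full key

    head, last = words[:-1], words[-1]
    if last == "#" and head and "*" not in head and "#" not in head:
        base = ".".join(head)
        # '#' matches zero or more words, so accept the bare key as well
        return {attribute: [base, {"prefix": base + "."}]}

    return None  # e.g. 'order.*.paid': keep this binding on RabbitMQ


if __name__ == "__main__":
    for binding in ("order.created", "order.#", "order.*.paid"):
        print(binding, "->", topic_pattern_to_filter_policy(binding))
```

<p>Running a helper like this over the existing binding list during the audit gives a quick count of how many bindings fall into the "no translation" bucket.</p>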
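<h2 id="什麼不該遷保留-rabbitmq-的訊號">What not to migrate: signals for keeping RabbitMQ</h2>
<p>The managed simplicity of SQS has a price. Three kinds of usage lose capability or gain complexity when migrated:</p>
<p><strong>Complex topic routing</strong>. If RabbitMQ makes heavy use of the topic exchange wildcards <code>*</code> / <code>#</code> with dozens of binding rules, routing expressiveness is a core value. Attribute matching in SNS subscription filters cannot express the same thing, and forcing the migration breaks the declarative in-broker routing into imperative logic scattered across SNS filter policies and application code, which raises maintenance cost. In designs like GoCardless using a single topic exchange as a service mesh (see <a href="/blog/backend/03-message-queue/cases/rabbitmq-gocardless-hutch-service-mesh/" data-link-title="3.C26 GoCardless：Hutch &#43; 單一 topic exchange service mesh" data-link-desc="GoCardless 單一 RabbitMQ cluster 作所有 service 通訊中樞、routing key 用 service.subject.action 格式、JSON 多語言可讀。">3.C26 GoCardless Hutch</a>), routing is the architecture itself and should not be taken apart.</p>
<p><strong>Broker-level ordering</strong>. A single RabbitMQ queue is FIFO by default, and a consistent hash exchange can also provide per-key ordering (see <a href="/blog/backend/03-message-queue/cases/rabbitmq-wework-consistent-hash-ordering/" data-link-title="3.C28 WeWork：Consistent hash exchange 保證帳戶順序" data-link-desc="WeWork 固定數量 queue &#43; account ID hash 路由、每 queue 一個 worker &#43; exclusive consumer 保 partition-level ordering。">3.C28 WeWork hash ordering</a>). An SQS standard queue gives <em>no ordering guarantee</em> (best effort only); ordering requires a FIFO queue, and FIFO throughput is limited: ordering is per MessageGroupId, and at default quotas the whole queue handles about 3,000 msg/sec with batching. High-throughput FIFO mode raises that quota, but the limits depend on region and on how message groups are distributed, so check the current quotas. If the workload needs both high throughput and strict ordering, default SQS FIFO is unlikely to deliver both, and RabbitMQ is the better fit.</p>
<p><strong>RPC over messaging (request-reply)</strong>. RabbitMQ supports a synchronous RPC pattern with reply-to + correlation-id. SQS has no native request-reply; you have to assemble it from two queues plus a correlation ID, and the latency is a poor fit (SQS is a task queue, not a low-latency transport). For this usage, consider <a href="/blog/backend/03-message-queue/vendors/nats/" data-link-title="NATS" data-link-desc="Lightweight messaging、JetStream 加持久化與 streams">NATS</a> request-reply or plain HTTP.</p>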
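<h2 id="migration-結構漸進-cutover">Migration structure: incremental cutover</h2>
<p>The cutover for an operational redesign uses dual-run and proceeds queue by queue (not cluster by cluster), keeping a rollback boundary at every step:</p>
<ol>
<li><strong>Phase 0: scope inventory</strong>. List every exchange / queue / binding and annotate the routing mode (direct / fanout / topic) and ordering requirement. Decide which queues are suitable to migrate (simple routing, at-least-once is enough) and which stay (complex topic routing, broker ordering, RPC).</li>
<li><strong>Phase 1: SQS / SNS infrastructure</strong>. For each queue to migrate, create the SQS queue + DLQ (with redrive policy + maxReceiveCount); for fanout cases create the SNS topic + subscriptions. Set up IAM policies and align the visibility timeout with the consumer's maximum processing time.</li>
<li><strong>Phase 2: consumer rewrite + dual-consume</strong>. Rewrite the application consumer to the SQS pull model (or a Lambda event source). Let the new consumer <em>coexist</em> with the old RabbitMQ consumer while the producer temporarily writes to both RabbitMQ and SQS, and verify that the SQS side processes correctly.</li>
<li><strong>Phase 3: producer cutover</strong>. Queue by queue, switch the producer from RabbitMQ to SQS / SNS and stop dual-writing for that queue. This step is reversible: if something goes wrong, switch the producer back to RabbitMQ.</li>
<li><strong>Phase 4: retire the RabbitMQ queue</strong>. Only after the queue is confirmed stable on SQS and the RabbitMQ queue has drained do you remove the corresponding RabbitMQ exchange / queue. This step is irreversible and should not be done early.</li>
<li><strong>Phase 5: cluster retirement</strong>. Once every suitable queue has been cut over and RabbitMQ holds only the retained complex-routing queues (or nothing at all), downsize or retire the cluster.</li>
</ol>
<p>The key to an incremental cutover is to <em>cut over by queue, not by cluster</em>. Each queue is an independent migration unit that goes through Phases 2-4 on its own without blocking the others. Queues with complex routing can stay on RabbitMQ indefinitely, giving a long-lived hybrid of RabbitMQ + SQS.</p>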
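<p>The per-queue phases can be sketched as a small state table that the producer consults on every publish. The <code>Stage</code> names and the two sender callables are hypothetical stand-ins for the real RabbitMQ and SQS publish code. The sketch also assumes that during dual-write the SQS-side consumer is idempotent or runs in a shadow mode, which the playbook does not spell out.</p>

```python
from enum import Enum


class Stage(Enum):
    RABBITMQ_ONLY = "rabbitmq_only"  # Phase 0-1: queue not migrated yet
    DUAL_WRITE = "dual_write"        # Phase 2: producer writes to both sides
    SQS_ONLY = "sqs_only"            # Phase 3 onward: producer cut over


def publish(queue, body, stages, send_rabbitmq, send_sqs):
    """Send one message according to the queue's migration stage.

    Returns the list of targets written to, which is handy for logging.
    """
    stage = stages.get(queue, Stage.RABBITMQ_ONLY)  # unknown queues stay on RabbitMQ
    sent = []
    if stage in (Stage.RABBITMQ_ONLY, Stage.DUAL_WRITE):
        send_rabbitmq(queue, body)
        sent.append("rabbitmq")
    if stage in (Stage.DUAL_WRITE, Stage.SQS_ONLY):
        send_sqs(queue, body)
        sent.append("sqs")
    return sent


if __name__ == "__main__":
    stages = {"orders": Stage.DUAL_WRITE, "emails": Stage.SQS_ONLY}
    log = []
    for name in ("orders", "emails", "reports"):
        targets = publish(
            name, "payload", stages,
            send_rabbitmq=lambda q, b: log.append(("rabbitmq", q)),
            send_sqs=lambda q, b: log.append(("sqs", q)),
        )
        print(name, targets)
```

<p>Rolling back Phase 3 for one queue is then a single state change from <code>SQS_ONLY</code> back to <code>DUAL_WRITE</code> or <code>RABBITMQ_ONLY</code>, without touching the other queues.</p>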
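<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1dlx-改-redrive-policy重試語意不對等">Case 1: DLX replaced by redrive policy, retry semantics do not match</h3>
<p><strong>Symptom</strong>: on RabbitMQ, DLX combined with message TTL implements "delayed retry + multi-level escalation" (such as the three-level retry in <a href="/blog/backend/03-message-queue/cases/rabbitmq-indeed-delay-dlq-escalation/" data-link-title="3.C25 Indeed：Delay queue &#43; DLQ 三層 escalation" data-link-desc="Indeed 每天 35M&#43; 職缺、設計 Requeue → Delay queue → DLQ 三層 escalation 避開 head-of-line blocking。">3.C25 Indeed Delay + DLQ</a>); after migrating to SQS, the redrive policy turns out to support only "after N failures go straight to the DLQ" and cannot reproduce the delayed retry ladder.</p>
<p><strong>Root cause</strong>: RabbitMQ DLX is a routing mechanism; combined with TTL and several intermediate queues it can build any escalation topology. The SQS redrive policy is a single rule (send to the DLQ once maxReceiveCount is exceeded) with no intermediate layer. Both are called "DLQ", but RabbitMQ's is programmable routing and SQS's is a fixed counter.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Implement exponential backoff with the visibility timeout</strong>: on failure the application calls <code>ChangeMessageVisibility</code> to extend the hiding time, which produces the backoff without relying on DLX TTL.</li>
<li><strong>Chain multiple queues for multi-level escalation</strong>: if N levels are really needed, create N SQS queues, have the application send a failed message to the next level, and give each level its own redrive policy. This is more complex than DLX and is one of the "complex routing should not migrate" signals.</li>
<li><strong>Accept the simplification</strong>: most task queues only need "retry a few times, then move to the DLQ for manual inspection", which the SQS redrive policy covers directly without rebuilding the escalation ladder.</li>
</ol>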
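<p>The backoff described in the first fix can be sketched as follows. The base delay and cap are assumed values, and <code>defer_retry</code> is a hypothetical helper; it expects the message to have been received with the <code>ApproximateReceiveCount</code> system attribute requested, and <code>sqs</code> can be a boto3 SQS client or any object exposing <code>change_message_visibility</code>.</p>

```python
MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # SQS caps the visibility timeout at 12 hours


def backoff_seconds(receive_count, base=30, cap=900):
    """Exponential backoff: base, 2*base, 4*base, ... limited by cap."""
    delay = base * 2 ** max(receive_count - 1, 0)
    return min(delay, cap, MAX_VISIBILITY_SECONDS)


def defer_retry(sqs, queue_url, message, base=30, cap=900):
    """Hide a failed message for a backoff period instead of retrying at once."""
    attributes = message.get("Attributes", {})
    count = int(attributes.get("ApproximateReceiveCount", "1"))
    timeout = backoff_seconds(count, base, cap)
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=timeout,
    )
    return timeout


if __name__ == "__main__":
    print([backoff_seconds(n) for n in range(1, 8)])
```

<p>The redrive policy still applies on top of this: once the receive count passes maxReceiveCount the message goes to the DLQ, so the cap and maxReceiveCount together bound the total retry window.</p>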
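<h3 id="case-2prefetch-改-batch--visibility併發控制行為變了">Case 2: prefetch replaced by batch + visibility, concurrency control behaves differently</h3>
<p><strong>Symptom</strong>: on RabbitMQ, <code>prefetch_count=1</code> ensures a worker handles one message at a time (fair dispatch, slow tasks do not hoard messages). After migrating to SQS, the consumer receives 10 messages per <code>receive_message</code> call; one slow task holds up the rest of the batch, the visibility timeout of every message in the batch starts counting at receive time, and messages that are still waiting (or processed but not yet deleted) time out and get delivered again.</p>
<p><strong>Root cause</strong>: RabbitMQ prefetch is a per-message cap on unacked messages, controlled by the broker one message at a time. An SQS batch hands over several messages at once; each message has its <em>own</em> visibility timer, but all timers start when the batch is received. If the application processes the batch sequentially, a slow message eats into the remaining timeout of the messages behind it, which can expire before or during their processing.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Use batch size 1 for slow tasks</strong>: the equivalent of RabbitMQ <code>prefetch=1</code> is <code>MaxNumberOfMessages=1</code>, receiving one message at a time so that messages in a batch do not hold each other up.</li>
<li><strong>Set the visibility timeout slightly above the maximum processing time</strong>: Capital One's SQS + Lambda practice states this explicitly (see <a href="/blog/backend/03-message-queue/cases/sqs-capital-one-visibility-timeout/" data-link-title="3.C50 Capital One：Visibility timeout 設計與 Lambda event source" data-link-desc="Capital One tech blog 講 SQS &#43; Lambda：visibility timeout 應略高於最大處理時間、Lambda 初 5 個 long polling、可擴 60/min。">3.C50 Capital One</a>): too short causes duplicate processing, too long delays retries. For long tasks, extend the timeout during processing with <code>ChangeMessageVisibility</code>.</li>
<li><strong>Delete per message, not per batch</strong>: call <code>delete_message</code> as soon as each message is done rather than deleting the whole batch at the end, which lowers the risk of partial duplicates when the batch runs past the timeout.</li>
</ol>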
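<p>A small worked calculation shows why sequential batch processing runs into the timeout. The helper below is an illustrative model, not part of any SDK: given estimated per-message processing times, it returns which positions in a batch would finish after the visibility timeout when handled one after another.</p>

```python
def messages_at_risk(durations, visibility_timeout):
    """Return 1-based batch positions that finish after the visibility timeout.

    Models sequential processing: every timer starts at receive time, so a
    message's deadline is measured from the start of the whole batch.
    """
    at_risk = []
    elapsed = 0
    for position, seconds in enumerate(durations, start=1):
        elapsed += seconds
        if elapsed > visibility_timeout:
            at_risk.append(position)
    return at_risk


if __name__ == "__main__":
    # Ten messages at 8 s each against a 60 s timeout
    print(messages_at_risk([8] * 10, 60))
    # One slow message early in an otherwise fast batch
    print(messages_at_risk([2, 55, 2, 2, 2], 60))
```

<p>In the first example messages 8 to 10 finish at 64 s, 72 s, and 80 s, past the 60 s timeout, so they may be redelivered to another consumer while the first consumer is still working on them.</p>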
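<h3 id="case-3fanout-改-sns-to-sqs漏訂閱導致部分-downstream-收不到">Case 3: fanout replaced by SNS-to-SQS, a missing subscription leaves a downstream without messages</h3>
<p><strong>Symptom</strong>: a RabbitMQ fanout exchange broadcasts to every bound queue, and a new downstream only has to bind to start receiving. After moving to SNS-to-SQS, a new downstream's SQS queue is not subscribed to the SNS topic, or its subscription filter policy is wrong, and that downstream silently misses messages.</p>
<p><strong>Root cause</strong>: RabbitMQ fanout broadcasting is built-in broker semantics and takes effect as soon as the binding exists. SNS-to-SQS fan-out takes three steps per downstream: create the SQS queue, subscribe it to the SNS topic, and set a queue policy that allows SNS to deliver. Missing any step, or writing the filter policy wrong, loses messages silently. One more service layer means one more place for configuration to go wrong.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Manage subscriptions as IaC</strong>: declare the SNS subscriptions and SQS queue policies in Terraform / CloudFormation so nothing is missed by hand.</li>
<li><strong>Verify fan-out completeness</strong>: before cutover, publish a test message and confirm that <em>every</em> downstream queue receives it (check one by one against the RabbitMQ binding list).</li>
<li><strong>Keep filter policies permissive by default</strong>: unless filtering is explicitly required, leave the subscription without a filter policy (receive everything) to avoid "we thought it was a broadcast but the filter blocked it".</li>
</ol>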
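<p>The completeness check in the second fix can be partly automated. The sketch below compares the queue ARNs you expect (derived from the RabbitMQ binding list) with subscription records shaped like the entries returned by SNS <code>list_subscriptions_by_topic</code>. The function name and the sample ARNs are made up for illustration.</p>

```python
def missing_subscriptions(expected_queue_arns, subscriptions):
    """Return expected SQS queue ARNs that have no confirmed SNS subscription."""
    subscribed = {
        sub["Endpoint"]
        for sub in subscriptions
        if sub.get("Protocol") == "sqs"
        and sub.get("SubscriptionArn") != "PendingConfirmation"
    }
    return sorted(set(expected_queue_arns) - subscribed)


if __name__ == "__main__":
    expected = [
        "arn:aws:sqs:us-east-1:123456789012:billing",
        "arn:aws:sqs:us-east-1:123456789012:search-index",
    ]
    current = [
        {
            "Protocol": "sqs",
            "Endpoint": "arn:aws:sqs:us-east-1:123456789012:billing",
            "SubscriptionArn": "arn:aws:sns:us-east-1:123456789012:orders:example",
        },
    ]
    print(missing_subscriptions(expected, current))
```

<p>This only covers the "subscribed at all" step. A wrong filter policy or a missing queue policy still needs the end-to-end test message to catch.</p>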
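<h3 id="case-4訊息超過-256kbsqs-拒收">Case 4: message exceeds 256KB and SQS rejects it</h3>
<p><strong>Symptom</strong>: RabbitMQ has no low hard limit on single-message size (it is bounded by frame_max / memory, and MB-level payloads are common in practice). After migrating to SQS, large-payload messages that used to go through are rejected, and SendMessage reports that the message exceeds the 256KB limit.</p>
<p><strong>Root cause</strong>: the SQS single-message limit is 256KB, message attributes included. AWS has revised this quota (a 1 MiB maximum was announced in 2025), so confirm the current value in the SQS quotas documentation; the pattern below applies to whatever limit is in force. RabbitMQ has no comparably low limit, so the application may be used to putting large payloads (complete documents, large serialized objects) directly in the message body.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Claim-check pattern</strong>: store the large payload in S3 and put only a reference to the S3 object (key / presigned URL) in the message; the consumer fetches it from S3 after receiving. FINRA's large-file handling is S3 event notification → SQS (after the file is uploaded to S3, S3 pushes the notification). The result is likewise a message that carries only an S3 object reference, but the mechanism is an S3 trigger, not the producer offloading (see <a href="/blog/backend/03-message-queue/cases/sqs-finra-large-file-service/" data-link-title="3.C53 FINRA：S3 → SQS notification 大檔上傳" data-link-desc="FINRA 金融監管、broker 上傳大檔、S3 → SQS notification → LFS、KMS &#43; bucket policy &#43; queue policy 三層稽核。">3.C53 FINRA Large File</a>).</li>
<li><strong>SQS Extended Client Library</strong>: the official AWS library transparently stores payloads over the limit in S3, puts a pointer in the message, and fetches the payload on the consumer side, with almost no application code changes.</li>
<li><strong>Measure the payload size distribution</strong>: during the Phase 0 audit, measure existing message sizes; the share above the limit decides whether claim-check is needed and avoids discovering mass rejections only after cutover.</li>
</ol>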
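<p>A minimal claim-check sketch, under stated assumptions: <code>put_object</code> is a caller-supplied function (in practice a thin wrapper around an S3 upload), the key prefix and the <code>claim_check</code> field name are invented for illustration, and the size limit is a parameter so it can follow the current quota. Message attributes also count toward the limit and are not modelled here.</p>

```python
import json
import uuid

SQS_LIMIT_BYTES = 256 * 1024  # figure quoted in the text; verify the current quota


def to_sqs_body(payload, put_object, limit=SQS_LIMIT_BYTES):
    """Return a message body, offloading oversized payloads to external storage.

    put_object(key, data) is a hypothetical storage hook supplied by the caller.
    """
    body = json.dumps(payload)
    size = len(body.encode("utf-8"))  # the limit is in bytes, not characters
    if size > limit:
        key = f"claim-check/{uuid.uuid4()}.json"
        put_object(key, body)
        return json.dumps({"claim_check": key})
    return body


if __name__ == "__main__":
    store = {}
    small = to_sqs_body({"order_id": 42}, store.__setitem__)
    large = to_sqs_body({"document": "x" * 300_000}, store.__setitem__)
    print(len(small), len(large), len(store))
```

<p>The consumer does the reverse: if the parsed body contains the reference field, it loads the real payload from storage before processing.</p>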
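<h3 id="case-5ordering-從-rabbitmq-到-sqs-fifo吞吐撞天花板">Case 5: ordering moves from RabbitMQ to SQS FIFO and throughput hits the ceiling</h3>
<p><strong>Symptom</strong>: a single RabbitMQ queue delivers in order, and the system relied on that to process the events of one order in sequence. After moving to an SQS standard queue the ordering is gone; switching to an SQS FIFO queue restores it, but throughput drops from tens of thousands of msg/sec to the 3,000 msg/sec limit and the queue backs up.</p>
<p><strong>Root cause</strong>: an SQS standard queue has no ordering guarantee (a design trade-off for throughput and availability). A FIFO queue provides per-MessageGroupId ordering plus deduplication, but at default quotas its throughput is limited to 300 TPS without batching or 3,000 msg/sec with batching. Ordered consumption from a single RabbitMQ queue can run far above that. Twilio's webhook buffer documentation specifically calls out the FIFO 300 TPS limit (see <a href="/blog/backend/03-message-queue/cases/sqs-twilio-webhook-buffer/" data-link-title="3.C58 Twilio：SQS 緩衝高流量 webhook" data-link-desc="Twilio 教用 SQS 緩衝 SMS / status callback webhook、分 queue（SMS vs callback）、long polling 減 cost、FIFO 300 TPS 上限要分片。">3.C58 Twilio webhook</a>).</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Revisit the ordering granularity</strong>: use MessageGroupId to narrow ordering to the scope that really needs it (per order, per user) so different groups can be consumed in parallel. Group-level parallelism alone does not lift the queue-level quota; to go beyond it, enable high-throughput FIFO mode or shard the groups across several FIFO queues.</li>
<li><strong>Split ordered and unordered traffic</strong>: only messages that truly need ordering go through FIFO; the rest use a standard queue for high throughput. In most workloads only a small part needs strict ordering.</li>
<li><strong>Ordering is a hard "do not migrate" signal</strong>: if the whole workload needs high throughput + strict ordering, default SQS FIFO is unlikely to deliver both; keep RabbitMQ or consider <a href="/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka</a> (per-partition ordering + high throughput).</li>
</ol>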
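<p>One way to shard ordered traffic across several FIFO queues, sketched under assumptions: the ordering key is an order ID, the list of queue URLs is fixed, and a stable hash picks the shard so that every event of one order lands in the same queue and the same message group. The function name and URLs are illustrative.</p>

```python
import hashlib


def fifo_target(order_id, queue_urls):
    """Pick a FIFO queue shard and message group for one ordering key.

    A stable hash (not Python's per-process hash()) keeps the mapping
    identical across producers and restarts.
    """
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % len(queue_urls)
    return {"QueueUrl": queue_urls[shard], "MessageGroupId": order_id}


if __name__ == "__main__":
    shards = [f"https://sqs.example.invalid/orders-{i}.fifo" for i in range(4)]
    for order in ("order-1001", "order-1002", "order-1001"):
        print(order, fifo_target(order, shards)["QueueUrl"])
    # The returned dict can be passed on to an SQS send call together with
    # MessageBody and a deduplication ID.
```

<p>Changing the number of shards remaps keys, so the shard count should be fixed up front, or changed only while the queues are drained, to avoid reordering one key across two queues.</p>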
<h2 id="capacity--cost-對照">Capacity / cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>RabbitMQ (self-managed EC2)</th>
          <th>AWS SQS (managed)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster baseline</td>
          <td>3 brokers (HA) + EBS</td>
          <td>No instances</td>
      </tr>
      <tr>
          <td>Ops FTE</td>
          <td>0.5-1 FTE</td>
          <td>~0.1 FTE (IAM / alarm configuration)</td>
      </tr>
      <tr>
          <td>Billing model</td>
          <td>EC2 instance hours + EBS + traffic</td>
          <td>Per request (per million requests) + cross-region traffic</td>
      </tr>
      <tr>
          <td>Throughput ceiling</td>
          <td>Bounded by broker size / network</td>
          <td>Standard: nearly unlimited; FIFO: 3,000 msg/sec with batching at default quotas</td>
      </tr>
      <tr>
          <td>Ordering</td>
          <td>Ordered within a queue; per-key with consistent hash</td>
          <td>Standard: none; FIFO: per group</td>
      </tr>
      <tr>
          <td>Routing</td>
          <td>Exchanges built into the broker</td>
          <td>None (needs SNS / application code)</td>
      </tr>
      <tr>
          <td>Message size limit</td>
          <td>Bounded by frame_max / memory (MB-level is feasible)</td>
          <td>256KB (use S3 claim-check above that)</td>
      </tr>
      <tr>
          <td>Monitoring latency</td>
          <td>Real time (Management UI)</td>
          <td>CloudWatch, approximate, minute-level</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading</strong>: for task queues with low to medium throughput, simple routing, and an AWS-centric stack, SQS is clearly cheaper to operate (FTE drops from 0.5-1 to about 0.1). For workloads with high throughput + strict ordering, or heavy exchange routing, the per-request cost and capability limits of SQS can make RabbitMQ (or Kafka) the better fit. SQS cost is usage-driven, so at high traffic the per-request fee has to be part of the evaluation; compare with <a href="/blog/backend/00-service-selection/cost-risk-tradeoffs/" data-link-title="0.6 成本、風險與選型取捨" data-link-desc="用人力成本、雲端成本、操作成本與失敗代價判斷後端能力投入順序">0.6 cost trade-offs</a>.</p>
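<p>A rough way to put a number on "usage-driven" is sketched below. It assumes each message costs one send, one receive, and one delete, that batching amortizes those calls evenly, and that payloads are billed in 64 KB chunks. Empty receives are ignored, and the price per million requests is a parameter you take from the current pricing page; no price is assumed here.</p>

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600


def monthly_sqs_requests(messages_per_second, payload_kb=1, batch_size=1):
    """Estimate billable SQS requests per 30-day month (rough model)."""
    chunks = math.ceil(payload_kb / 64)   # payload is billed per 64 KB chunk
    messages = messages_per_second * SECONDS_PER_MONTH
    # send and receive carry the payload; delete does not
    requests_per_message = (2 * chunks + 1) / batch_size
    return math.ceil(messages * requests_per_message)


def monthly_cost(requests, price_per_million):
    """Convert a request count to cost using a caller-supplied price."""
    return requests / 1_000_000 * price_per_million


if __name__ == "__main__":
    unbatched = monthly_sqs_requests(100)
    batched = monthly_sqs_requests(100, batch_size=10)
    print(unbatched, batched)
```

<p>At a steady 100 msg/sec the model gives about 778 million requests a month unbatched and about 78 million with batches of 10, which is why batching is usually the first cost lever to check before comparing against broker instance costs.</p>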
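<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="混合架構是常見終態">A hybrid architecture is the common end state</h3>
<p>Most migrations do not empty RabbitMQ completely. Simple task queues move to SQS; complex topic routing, broker ordering, and RPC stay on RabbitMQ, and the two coexist long term:</p>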





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text">[simple task queue / fanout]            [complex topic routing / RPC / ordering]
        AWS SQS / SNS                              RabbitMQ
        │                                            │
   Lambda / ECS consumer                    self-managed cluster (downsized)</code></pre></div><p>Cutting over queue by queue naturally ends in a hybrid architecture: there is no need to force unsuitable queues across just to "migrate cleanly".</p>
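<h3 id="跟-rabbitmq--kafka-的對照">Comparison with RabbitMQ → Kafka</h3>
<p>RabbitMQ has another migration path: <a href="/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">RabbitMQ → Kafka</a> (work queue → event streaming). The two paths differ: moving to SQS <em>hands over operations and simplifies to equivalent capability</em> (it is still a task queue); moving to Kafka <em>changes the paradigm to get replay and high-throughput streaming</em> (from task queue to event log). Which one to pick depends on whether you want to get rid of operations or need streaming capability; they are not the same decision.</p>
<h3 id="跟前面-migration-playbook-的結構對照">Structural comparison with the earlier migration playbook</h3>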
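<table>
  <thead>
      <tr>
          <th>Article</th>
          <th>Dominant difference dimension</th>
          <th>Structure</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Kafka ↔ NATS</td>
          <td>Paradigm (high)</td>
          <td>partial + hybrid</td>
      </tr>
      <tr>
          <td>RabbitMQ → SQS (this article)</td>
          <td>Operational (high)</td>
          <td>Type C operational hybrid</td>
      </tr>
  </tbody>
</table>
<p><strong>Conclusion</strong>: both articles are cross-vendor message queue migrations, but their dominant difference dimensions are not the same. Kafka ↔ NATS is constrained by paradigm (different abstraction layers); RabbitMQ → SQS is constrained by operations (a transfer of operational responsibility). The structure follows from the dominant dimension; there is no universal phased playbook.</p>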
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source / target vendor: <a href="/blog/backend/03-message-queue/vendors/rabbitmq/" data-link-title="RabbitMQ" data-link-desc="Classic message broker、AMQP routing 為主">RabbitMQ</a> / <a href="/blog/backend/03-message-queue/vendors/aws-sqs/" data-link-title="AWS SQS" data-link-desc="AWS managed queue、簡單可靠、無 ordering（standard）">AWS SQS</a></li>
<li>Parallel vendors: <a href="/blog/backend/03-message-queue/vendors/google-pubsub/" data-link-title="Google Cloud Pub/Sub" data-link-desc="GCP managed pub/sub、global routing、push/pull">Google Pub/Sub</a> / <a href="/blog/backend/03-message-queue/vendors/nats/" data-link-title="NATS" data-link-desc="Lightweight messaging、JetStream 加持久化與 streams">NATS</a></li>
<li>Parallel migration playbook: <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a></li>
<li>Cited cases: <a href="/blog/backend/03-message-queue/cases/sqs-airbnb-dynein-delayed-jobs/" data-link-title="3.C48 Airbnb Dynein：SQS 分散式延遲任務排程" data-link-desc="Airbnb 用 SQS at-least-once &#43; DLQ 取代 Resque 單 Redis 限制、每 scheduler 1000 QPS、SQS wrap DynamoDB 處理 &gt; 15 分鐘 delay。">3.C48 Airbnb Dynein</a> / <a href="/blog/backend/03-message-queue/cases/sqs-capital-one-visibility-timeout/" data-link-title="3.C50 Capital One：Visibility timeout 設計與 Lambda event source" data-link-desc="Capital One tech blog 講 SQS &#43; Lambda：visibility timeout 應略高於最大處理時間、Lambda 初 5 個 long polling、可擴 60/min。">3.C50 Capital One</a> / <a href="/blog/backend/03-message-queue/cases/sqs-twitch-eventsub-fanout/" data-link-title="3.C54 Twitch EventSub：SNS&#43;SQS fan-out 給第三方" data-link-desc="Twitch Event Bus ~1660 events/sec 進 SNS、EventSub 用 SQS 接收 &#43; Dispatcher fan-out 給訂閱者。">3.C54 Twitch EventSub</a> / <a href="/blog/backend/03-message-queue/cases/sqs-twilio-webhook-buffer/" data-link-title="3.C58 Twilio：SQS 緩衝高流量 webhook" data-link-desc="Twilio 教用 SQS 緩衝 SMS / status callback webhook、分 queue（SMS vs callback）、long polling 減 cost、FIFO 300 TPS 上限要分片。">3.C58 Twilio webhook</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook writing methodology</a></li>
</ul>
]]></content:encoded></item><item><title>Self-managed ELK → Elastic Cloud: closing out the lifecycle of a 5-year ELK cluster</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/elastic-stack/migrate-to-elastic-cloud/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/elastic-stack/migrate-to-elastic-cloud/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook that cross-links &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/elastic-stack/" data-link-title="Elastic Stack" data-link-desc="ELK：Elasticsearch / Logstash / Kibana &amp;#43; Beats / APM">Elastic Stack&lt;/a> and Elastic Cloud. Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> maps it to &lt;em>Operational = High (self-managed → Elastic managed) → Type C operational redesign hybrid&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="5-年-elk-集群的-lifecycle-收尾">Closing out the lifecycle of a 5-year ELK cluster&lt;/h2>
&lt;p>Like the earlier &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &amp;#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora&lt;/a> playbook, this one is Type C. This article takes a &lt;em>lifecycle-driven&lt;/em> entry point and looks at the typical life curve of an ELK cluster over five years:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Year&lt;/th>
 &lt;th>Phase&lt;/th>
 &lt;th>Cluster state&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>0-1&lt;/td>
 &lt;td>Build&lt;/td>
 &lt;td>3 nodes, simple deployment, SOC learns Lucene queries / dashboards / alerts&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>1-2&lt;/td>
 &lt;td>Scale-out&lt;/td>
 &lt;td>5-7 nodes, shard planning, hot/warm/cold tiers, index lifecycle management&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>2-3&lt;/td>
 &lt;td>Degrade&lt;/td>
 &lt;td>10+ nodes, too many shards, query latency rising, upgrade windows start to hurt&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>3-4&lt;/td>
 &lt;td>Save&lt;/td>
 &lt;td>Add dedicated masters / cross-cluster replication, ops cost soars&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>4-5&lt;/td>
 &lt;td>Migrate decision&lt;/td>
 &lt;td>Evaluate Elastic Cloud (managed) or the next SIEM vendor&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Most mid-sized organizations hit an &lt;em>operational ceiling&lt;/em> in years 4-5 of the lifecycle: the SRE team spends 0.5-1.5 FTE on ELK ops, new feature development stalls, and cost gets compared against alternative observability vendors. Elastic Cloud takes over the whole operational stack as a managed service; the SOC stays at the &lt;em>Lucene query + dashboard + alert&lt;/em> layer and no longer manages cluster sizing.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links <a href="/blog/backend/04-observability/vendors/elastic-stack/" data-link-title="Elastic Stack" data-link-desc="ELK：Elasticsearch / Logstash / Kibana &#43; Beats / APM">Elastic Stack</a> and Elastic Cloud. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Operational = High (self-managed → Elastic managed) → Type C operational redesign hybrid</em>.</p></blockquote>
<h2 id="5-年-elk-集群的-lifecycle-收尾">Closing out the lifecycle of a 5-year ELK cluster</h2>
<p>Like the earlier <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> playbook, this one is Type C. This article takes a <em>lifecycle-driven</em> entry point and looks at the typical life curve of an ELK cluster over five years:</p>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Phase</th>
          <th>Cluster state</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0-1</td>
          <td>Build</td>
          <td>3 nodes, simple deployment, SOC learns Lucene queries / dashboards / alerts</td>
      </tr>
      <tr>
          <td>1-2</td>
          <td>Scale-out</td>
          <td>5-7 nodes, shard planning, hot/warm/cold tiers, index lifecycle management</td>
      </tr>
      <tr>
          <td>2-3</td>
          <td>Degrade</td>
          <td>10+ nodes, too many shards, query latency rising, upgrade windows start to hurt</td>
      </tr>
      <tr>
          <td>3-4</td>
          <td>Save</td>
          <td>Add dedicated masters / cross-cluster replication, ops cost soars</td>
      </tr>
      <tr>
          <td>4-5</td>
          <td>Migrate decision</td>
          <td>Evaluate Elastic Cloud (managed) or the next SIEM vendor</td>
      </tr>
  </tbody>
</table>
<p>Most mid-sized organizations hit an <em>operational ceiling</em> in years 4-5 of the lifecycle: the SRE team spends 0.5-1.5 FTE on ELK ops, new feature development stalls, and cost gets compared against alternative observability vendors. Elastic Cloud takes over the whole operational stack as a managed service; the SOC stays at the <em>Lucene query + dashboard + alert</em> layer and no longer manages cluster sizing.</p>
<h2 id="為什麼遷fte--availability--version-cadence-三條-driver">Why migrate: three drivers (FTE / availability / version cadence)</h2>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FTE</td>
          <td>Self-managed ELK takes 0.5-1.5 FTE of ops; Elastic Cloud brings that down to 0.1-0.3 FTE</td>
      </tr>
      <tr>
          <td>Availability</td>
          <td>Cross-AZ failover is too complex to self-manage; it is built into Cloud</td>
      </tr>
      <tr>
          <td>Version cadence</td>
          <td>Elasticsearch 8.x ships quarterly; self-managed upgrade windows are a pain point, Cloud upgrades automatically</td>
      </tr>
  </tbody>
</table>
<h2 id="6-維-audit">6-dimension audit</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Low (the Elasticsearch API is fully compatible)</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td><strong>High</strong> (cluster management fully managed)</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Low (same Elasticsearch + Kibana + Beats / Logstash)</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Low-Medium (connection endpoint + auth change)</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>Operational = High → standard Type C.</p>
<h2 id="operational-redesign-對位">Operational redesign 對位</h2>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>Self-managed ELK</th>
          <th>Elastic Cloud</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Manual install + config</td>
          <td>One-click deployment creation via UI / API</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Self-managed master / dedicated voting / cross-AZ</td>
          <td>Built-in multi-AZ</td>
      </tr>
      <tr>
          <td>Upgrade</td>
          <td>Manual rolling restart, 6-12 hours</td>
          <td>Automatic patch + minor version upgrades</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>Self-managed snapshots to S3</td>
          <td>Built-in snapshot lifecycle</td>
      </tr>
      <tr>
          <td>Shard management</td>
          <td>Manual ILM policy</td>
          <td>UI-driven ILM</td>
      </tr>
      <tr>
          <td>Security</td>
          <td>Self-managed X-Pack / SSL certs</td>
          <td>Built in, with automatic cert rotation</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Self-managed Metricbeat → your own cluster</td>
          <td>Built-in deployment monitoring</td>
      </tr>
  </tbody>
</table>
<h2 id="migration-4-phase">Migration 4-phase</h2>
<h3 id="phase-0pre-migration-audit">Phase 0: Pre-migration audit</h3>
<ul>
<li>List every application connection endpoint (Logstash / Beats / direct SDK)</li>
<li>List ILM policies and retention settings</li>
<li>Estimate deployment size (hot tier RAM / cold tier storage)</li>
</ul>
<h3 id="phase-1elastic-cloud-deployment-建置">Phase 1: Build the Elastic Cloud deployment</h3>
<ul>
<li>Choose region + provider (AWS / GCP / Azure)</li>
<li>Hot tier RAM × N + S3-backed cold tier × N</li>
<li>Configure the snapshot lifecycle</li>
</ul>
<h3 id="phase-2data-migration">Phase 2: Data migration</h3>
<ul>
<li><strong>Cross-cluster replication (CCR)</strong> from self-managed → Cloud (recommended, incremental)</li>
<li>Or <strong>snapshot + restore</strong> (simpler, but needs a maintenance window)</li>
</ul>
<h3 id="phase-3cutover--cleanup">Phase 3: Cutover + cleanup</h3>
<ul>
<li>Switch the endpoint on the application side</li>
<li>Keep the self-managed cluster read-only for 1-2 months</li>
<li>Decommission</li>
</ul>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1application-endpoint-hardcodecutover-失敗">Case 1: Hardcoded application endpoint, cutover fails</h3>
<p><strong>Symptom</strong>: after cutover, N applications still connect to the old endpoint and logs / metrics stop flowing.</p>
<p><strong>Root cause</strong>: the endpoint was hardcoded in a config file and was not changed together with the deploy.</p>
<p><strong>Fix</strong>: supply the endpoint through an ENV variable + service discovery, so that the cutover is a single deploy.</p>
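<p>A minimal sketch of the ENV-variable approach. The variable name <code>ELASTICSEARCH_ENDPOINT</code> and the helper <code>resolve_endpoint</code> are hypothetical, chosen only for illustration; the point is that the application fails loudly at startup instead of silently falling back to a stale hardcoded host.</p>

```python
import os

# Hypothetical variable name; align it with your own deploy tooling.
ENDPOINT_ENV_VAR = "ELASTICSEARCH_ENDPOINT"


def resolve_endpoint(environ=None):
    """Return the Elasticsearch endpoint from the environment, or fail loudly."""
    env = os.environ if environ is None else environ
    endpoint = env.get(ENDPOINT_ENV_VAR, "").strip()
    if not endpoint:
        # Failing at startup is safer than connecting to a stale default host.
        raise RuntimeError(f"{ENDPOINT_ENV_VAR} is not set")
    # Normalise a trailing slash so callers can append paths consistently.
    return endpoint.rstrip("/")
```

<p>With this shape, the cutover changes one value in the deploy environment and no application code or config file needs editing.</p>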
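<h3 id="case-2ccr-replication-lagcutover-時資料-gap">Case 2: CCR replication lag, data gap at cutover</h3>
<p><strong>Symptom</strong>: CCR has been running for a week and the 200ms lag before cutover looks fine; after the application switches to Cloud, searches show that <em>the most recent 5 minutes of data are missing</em>.</p>
<p><strong>Root cause</strong>: CCR replication does not guarantee an immediate catch-up, so lag can still occur during cutover; in addition, a follower index does not accept <em>writes</em>.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Add a <em>drain window</em> to the cutover procedure: stop application writes for 5-10 minutes and wait for CCR to catch up</li>
<li>Confirm that the follower index has been <em>promoted</em> to write-capable</li>
<li>Monitor CCR lag and cut over only when it is &lt; 100ms</li>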
</ol>
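<p>The three checks above can be combined into a single go / no-go gate. This is a sketch under stated assumptions: <code>ready_for_cutover</code> is a hypothetical helper, and how the lag samples and the two flags are collected is left out entirely.</p>

```python
def ready_for_cutover(lag_samples_ms, writes_drained, follower_promoted,
                      threshold_ms=100):
    """Return (ok, reasons) for the drain / promote / lag checks.

    lag_samples_ms: recent CCR lag readings in milliseconds, collected by
    whatever monitoring you already have (not covered by this sketch).
    """
    reasons = []
    if not writes_drained:
        reasons.append("application writes are not drained yet")
    if not follower_promoted:
        reasons.append("follower index is not promoted to write-capable")
    if not lag_samples_ms:
        reasons.append("no CCR lag samples available")
    elif max(lag_samples_ms) >= threshold_ms:
        reasons.append(f"CCR lag peaked at {max(lag_samples_ms)}ms")
    return (not reasons, reasons)
```

<p>Using the worst sample rather than the latest one is a deliberate choice: a single low reading can hide a lag spike during the drain window.</p>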
<h3 id="case-3auth-改變soc-alert-失效">Case 3: Auth changes, SOC alerts stop working</h3>
<p><strong>Symptom</strong>: after cutover the SOC dashboard shows "authentication failed" and every SIEM rule stops working.</p>
<p><strong>Root cause</strong>: self-managed used X-Pack basic auth, while Cloud uses API keys + SSO; the SOC tooling was never switched to the new auth.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Before cutover, list the auth method of every tool that connects to ELK</li>
<li>Switch to API keys, with IAM-friendly token rotation</li>
<li>Enable SSO on the Cloud side and set up service accounts</li>
</ol>
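<h3 id="case-4cost-暴漲cold-tier-設定錯">Case 4: Cost spike, cold tier misconfigured</h3>
<p><strong>Symptom</strong>: the first month's Cloud bill is 50% above the estimate; the cold tier is on <em>fast storage</em> (hot-tier-level) instead of S3-backed storage.</p>
<p><strong>Root cause</strong>: in the Cloud deployment template the hot tier defaults to fast storage and so does the cold tier (slow storage has to be requested explicitly); the team did not review the template.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Review the deployment template before cutover and confirm that cold tier = searchable snapshots to S3</li>
<li>Check the cost monitor closely during the first week</li>
<li>Keep the hot tier RAM estimate conservative</li>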
</ol>
<h3 id="case-5snapshot-跨-region-失效">Case 5: Snapshots do not work across regions</h3>
<p><strong>Symptom</strong>: the DR drill fails when switching region; Cloud's built-in snapshots are same-region and do not cross regions.</p>
<p><strong>Root cause</strong>: multi-region DR needs a <em>cross-region snapshot</em> or <em>multiple deployments</em>, neither of which is the default.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Assess the DR requirement and decide whether cross-region is needed</li>
<li>Provision an <em>additional deployment in the DR region</em> + CCR</li>
<li>Cost rises by 50-100%; treat it as a DR investment, not as cost optimization</li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed ELK</th>
          <th>Elastic Cloud</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compute cost (5 node)</td>
          <td>$1,000-2,000 / mo</td>
          <td>$1,500-3,000 / mo</td>
      </tr>
      <tr>
          <td>Storage cost</td>
          <td>EBS</td>
          <td>Included, plus S3 cold tier</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-1.5 = $5K-15K</td>
          <td>0.1-0.3 = $1K-3K</td>
      </tr>
      <tr>
          <td>Total (5 node, mid-tier)</td>
          <td>$6K-17K / mo</td>
          <td>$2.5K-6K / mo</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-2 FTE × 1-2 months</td>
      </tr>
  </tbody>
</table>
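<p>As a rough way to read the table, the one-off migration cost can be set against the monthly saving. The sketch below assumes a cost of $10K per FTE-month (implied by the "0.5-1.5 = $5K-15K" row) and pairs the table's low ends and high ends with each other; both are simplifications, and <code>payback_months</code> is a hypothetical helper.</p>

```python
# Monthly totals (USD) from the "Total (5 node, mid-tier)" row.
SELF_MANAGED_TOTAL = (6_000, 17_000)
ELASTIC_CLOUD_TOTAL = (2_500, 6_000)

# Assumption: 0.5 FTE = $5K / month in the table implies $10K per FTE-month.
FTE_MONTH_COST = 10_000


def payback_months(migration_fte, migration_months, monthly_saving):
    """Months needed for the monthly saving to repay the one-off migration cost."""
    migration_cost = migration_fte * migration_months * FTE_MONTH_COST
    return migration_cost / monthly_saving


low_saving = SELF_MANAGED_TOTAL[0] - ELASTIC_CLOUD_TOTAL[0]
high_saving = SELF_MANAGED_TOTAL[1] - ELASTIC_CLOUD_TOTAL[1]

# Cheapest migration with the largest saving, and the reverse.
best_case = payback_months(1, 1, high_saving)
worst_case = payback_months(2, 2, low_saving)
print(f"payback: {best_case:.1f} to {worst_case:.1f} months")
```

<p>Under these assumptions the payback lands between roughly one month and just under a year, so the spread is driven mostly by how much operational FTE the self-managed cluster was really consuming.</p>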
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-splunk--elastic-security-migration-對位">Mapping against <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic Security migration</a></h3>
<p>Both playbooks land in the Elastic ecosystem, but Splunk → Elastic Security is a Type A migration (high Schema difference) while this one is Type C (high Operational difference). If you run both, do Splunk → Elastic Security first and the move to Elastic Cloud second, so that you avoid changing two things at once.</p>
<h3 id="跟-application-observability-stack-整合">Integration with the application observability stack</h3>
<p>Elastic Cloud + APM + OpenTelemetry: after cutover you can take the opportunity to <em>move applications onto OTel</em>, which avoids repeating the work at the next vendor switch.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor：<a href="/blog/backend/04-observability/vendors/elastic-stack/" data-link-title="Elastic Stack" data-link-desc="ELK：Elasticsearch / Logstash / Kibana &#43; Beats / APM">Elastic Stack</a></li>
<li>平行 migration playbook (Type C)：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> / <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a> / <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-to-msk/" data-link-title="Self-managed Kafka → AWS MSK：把 $15K/month operational cost 拆解到 managed" data-link-desc="Kafka self-managed → MSK 是 Type C operational redesign — protocol 完全相容、operational stack（ZooKeeper / brokers / monitoring / patching）全託管；本文用 cost 拆解開頭、5 個 production 踩雷（client connection pattern / version pinning / metric pipeline / IAM auth / cross-cluster mirror）">Kafka → MSK</a></li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>Self-managed Kafka → AWS MSK：把 $15K/month operational cost 拆解到 managed</title><link>https://tarrragon.github.io/blog/backend/03-message-queue/vendors/kafka/migrate-to-msk/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/03-message-queue/vendors/kafka/migrate-to-msk/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link &lt;a href="https://tarrragon.github.io/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka&lt;/a> 跟 AWS MSK。跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6 維 audit&lt;/a> 後對映 &lt;em>Operational = High（self-managed → AWS managed）→ Type C operational redesign hybrid&lt;/em>。&lt;/p>&lt;/blockquote>
&lt;h2 id="15kmonth-operational-cost-拆解">$15K/month operational cost breakdown&lt;/h2>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack&lt;/a>（H cost variant）同 framing — 用 cost 拆解開頭、不是「為什麼遷」driver list：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Self-managed Kafka cost item&lt;/th>
 &lt;th>Mid-sized (3 broker + 3 ZK + monitoring) / month&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>EC2 (3× r6g.xlarge broker)&lt;/td>
 &lt;td>$660&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>EBS (3× 1TB io2)&lt;/td>
 &lt;td>$1,500&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>EC2 (3× t3.medium ZK / KRaft)&lt;/td>
 &lt;td>$90&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Monitoring (Prometheus + Grafana on EC2)&lt;/td>
 &lt;td>$200&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup S3 (1TB)&lt;/td>
 &lt;td>$25&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-AZ traffic&lt;/td>
 &lt;td>$300&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Operational FTE (0.5)&lt;/strong>&lt;/td>
 &lt;td>&lt;strong>$5,000-8,000&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Patching window cost&lt;/td>
 &lt;td>$200 (downtime opportunity)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Total infrastructure (excl. FTE)&lt;/td>
 &lt;td>$2,975&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Total with 0.5 FTE&lt;/td>
 &lt;td>$7,975-10,975&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Total with 1.0 FTE&lt;/td>
 &lt;td>&lt;strong>$12,975-18,975&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>The largest cost block is operational FTE, not infrastructure&lt;/strong>. MSK shifts 50-80% of the operational work to AWS and leaves application + cost monitoring to the SRE team.&lt;/p>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit&lt;/a>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links <a href="/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka</a> and AWS MSK. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Operational = High (self-managed → AWS managed) → Type C operational redesign hybrid</em>.</p></blockquote>
<h2 id="15kmonth-operational-cost-拆解">$15K/month operational cost breakdown</h2>
<p>Same framing as <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack</a> (H cost variant): open with a cost breakdown rather than a "why migrate" driver list:</p>
<table>
  <thead>
      <tr>
          <th>Self-managed Kafka cost item</th>
          <th>Mid-sized (3 broker + 3 ZK + monitoring) / month</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EC2 (3× r6g.xlarge broker)</td>
          <td>$660</td>
      </tr>
      <tr>
          <td>EBS (3× 1TB io2)</td>
          <td>$1,500</td>
      </tr>
      <tr>
          <td>EC2 (3× t3.medium ZK / KRaft)</td>
          <td>$90</td>
      </tr>
      <tr>
          <td>Monitoring (Prometheus + Grafana on EC2)</td>
          <td>$200</td>
      </tr>
      <tr>
          <td>Backup S3 (1TB)</td>
          <td>$25</td>
      </tr>
      <tr>
          <td>Cross-AZ traffic</td>
          <td>$300</td>
      </tr>
      <tr>
          <td><strong>Operational FTE (0.5)</strong></td>
          <td><strong>$5,000-8,000</strong></td>
      </tr>
      <tr>
          <td>Patching window cost</td>
          <td>$200 (downtime opportunity)</td>
      </tr>
      <tr>
          <td>Total infrastructure (excl. FTE)</td>
          <td>$2,975</td>
      </tr>
      <tr>
          <td>Total with 0.5 FTE</td>
          <td>$7,975-10,975</td>
      </tr>
      <tr>
          <td>Total with 1.0 FTE</td>
          <td><strong>$12,975-18,975</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>The largest cost block is operational FTE, not infrastructure</strong>. MSK shifts 50-80% of the operational work to AWS and leaves application + cost monitoring to the SRE team.</p>
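<p>As a sanity check, the non-FTE line items in the table can be summed and combined with the FTE row. The sketch assumes that FTE cost scales linearly from the 0.5 FTE figure; <code>monthly_total</code> is a hypothetical helper, not part of any tool.</p>

```python
# Monthly line items (USD) copied from the cost table.
INFRA_ITEMS = {
    "EC2 brokers (3x r6g.xlarge)": 660,
    "EBS (3x 1TB io2)": 1_500,
    "EC2 ZK / KRaft (3x t3.medium)": 90,
    "Monitoring": 200,
    "Backup S3 (1TB)": 25,
    "Cross-AZ traffic": 300,
    "Patching window": 200,
}
HALF_FTE = (5_000, 8_000)  # cost range of 0.5 operational FTE


def monthly_total(fte_count):
    """Return the (low, high) monthly total for a given operational FTE count.

    Assumption: FTE cost scales linearly from the 0.5 FTE row.
    """
    infra = sum(INFRA_ITEMS.values())
    scale = fte_count / 0.5
    return (infra + HALF_FTE[0] * scale, infra + HALF_FTE[1] * scale)


infra_only = sum(INFRA_ITEMS.values())
fte_share_low = HALF_FTE[0] / monthly_total(0.5)[0]
print(infra_only, monthly_total(0.5), monthly_total(1.0))
print(f"FTE share at 0.5 FTE (low end): {fte_share_low:.0%}")
```

<p>The non-FTE items come to $2,975, so even at 0.5 FTE the people cost is well over half of the monthly total, which is the point the paragraph above makes.</p>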
<p>Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6-dimension diff audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same Kafka protocol; client SDK unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Self-managed → AWS managed; HA / patch / backup fully managed</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Same Kafka log-based model</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Still a single Kafka cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Auth config changes (IAM / SASL); everything else unchanged</td>
          <td>Low-Medium</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Same broker + partition layout</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>Operational = High (everything else Low-Medium) → <strong>Type C operational redesign hybrid</strong>.</p>
<h2 id="為什麼遷fte--availability--consistency-三條-driver">Why migrate: three drivers (FTE / availability / consistency)</h2>
<ul>
<li><strong>Operational FTE</strong>: end-to-end ops for self-managed Kafka + ZooKeeper / KRaft + Prometheus takes 0.5-1 FTE; MSK fully manages patching / HA / backup</li>
<li><strong>Availability</strong>: MSK provides automatic multi-AZ brokers + auto-recovery; with self-managed brokers, the RTO for a broker failure is 30 minutes to 2 hours</li>
<li><strong>Consistency with cloud stack</strong>: if you are already deep on AWS (RDS / S3 / Lambda), MSK sits in the same VPC with IAM auth, which lowers cross-vendor setup cost</li>
</ul>
<p>Reverse drivers (MSK → self-managed):</p>
<ul>
<li>At large throughput / GB scale, MSK's cost across brokers flips and exceeds self-managed (cost &gt; self-managed)</li>
<li>You need Kafka customization (custom plugins / early KRaft adoption / non-AWS regions)</li>
<li>A multi-cloud / hybrid architecture that wants to avoid vendor lock-in</li>
</ul>
<h2 id="operational-redesign-對位">Operational redesign mapping</h2>
<p>Same Type C pattern as <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> / <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a>:</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>Self-managed Kafka</th>
          <th>MSK</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Manually configure brokers + ZK + server.properties</td>
          <td>One-step creation via UI / Terraform</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Self-managed replicas + ISR + broker placement</td>
          <td>Automatic multi-AZ + auto-recovery</td>
      </tr>
      <tr>
          <td>Patching</td>
          <td>Rolling restart, manual or tool-driven</td>
          <td>MSK automatic monthly maintenance window</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>Self-managed MirrorMaker / cluster snapshot</td>
          <td>MSK built-in backup (S3, automatic)</td>
      </tr>
      <tr>
          <td>Authentication</td>
          <td>Self-managed SASL/SCRAM / mTLS</td>
          <td>IAM auth (recommended) / SASL/SCRAM via Secrets Manager</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Self-built Prometheus + JMX exporter</td>
          <td>CloudWatch + open monitoring + Prometheus</td>
      </tr>
      <tr>
          <td>Sizing</td>
          <td>Manually chosen broker instance class</td>
          <td>MSK broker size (kafka.m5.large+)</td>
      </tr>
      <tr>
          <td>Configuration</td>
          <td>Full control of server.properties</td>
          <td>Configuration set (restricted set of tunable parameters)</td>
      </tr>
      <tr>
          <td>Cluster topology</td>
          <td>Self-managed placement / rack awareness</td>
          <td>MSK automatic multi-AZ + rack-aware</td>
      </tr>
      <tr>
          <td>Tiered storage</td>
          <td>Kafka 3.6+, self-managed</td>
          <td>MSK Tiered Storage (auto-tier to S3)</td>
      </tr>
  </tbody>
</table>
<p>Every operational concept row needs its own migration plan: application code stays the same, but <em>the entire body of operational knowledge is replaced</em>.</p>
<h2 id="4-phase-migrationtype-c-標準流程">4-phase migration (standard Type C flow)</h2>
<h3 id="phase-0pre-migration-audit">Phase 0: Pre-migration audit</h3>
<ul>
<li><strong>Workload sizing → MSK broker class</strong>: current throughput / partition count / topic count</li>
<li><strong>Application connection pattern audit</strong>: do client producers / consumers use SASL / mTLS / plaintext, and which applications use which?</li>
<li><strong>Topic config audit</strong>: retention / replication factor / cleanup policy</li>
<li><strong>Backup pattern audit</strong>: is there an existing MirrorMaker / cross-cluster mirror?</li>
</ul>
<h3 id="phase-1msk-cluster-建置2-3-週">Phase 1: Build the MSK cluster (2-3 weeks)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">resource</span> <span class="s2">&#34;aws_msk_cluster&#34; &#34;main&#34;</span> {
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="n">  cluster_name</span>           <span class="o">=</span> <span class="s2">&#34;production&#34;</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">  kafka_version</span>          <span class="o">=</span> <span class="s2">&#34;3.6.0&#34;</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="n">  number_of_broker_nodes</span> <span class="o">=</span> <span class="m">3</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="k">broker_node_group_info</span> {
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="n">    instance_type</span>   <span class="o">=</span> <span class="s2">&#34;kafka.m5.large&#34;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="n">    client_subnets</span>  <span class="o">=</span> <span class="k">var</span><span class="p">.</span><span class="k">private_subnets</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">    security_groups</span> <span class="o">=</span> <span class="p">[</span><span class="k">aws_security_group</span><span class="p">.</span><span class="k">msk</span><span class="p">.</span><span class="k">id</span><span class="p">]</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="k">storage_info</span> {
</span></span><span class="line"><span class="ln">11</span><span class="cl">      <span class="k">ebs_storage_info</span> {
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">        volume_size</span> <span class="o">=</span> <span class="m">1000</span>
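</span></span><span class="line"><span class="ln">13</span><span class="cl">        <span class="k">provisioned_throughput</span> { <span class="c1"># only supported on kafka.m5.4xlarge or larger brokers; drop this block for kafka.m5.large</span>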
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">          enabled</span>           <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="n">          volume_throughput</span> <span class="o">=</span> <span class="m">500</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">        }
</span></span><span class="line"><span class="ln">17</span><span class="cl">      }
</span></span><span class="line"><span class="ln">18</span><span class="cl">    }
</span></span><span class="line"><span class="ln">19</span><span class="cl">  }
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl">  <span class="k">client_authentication</span> {
</span></span><span class="line"><span class="ln">22</span><span class="cl">    <span class="k">sasl</span> {
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="n">      iam</span> <span class="o">=</span> <span class="kt">true</span><span class="c1">        # IAM auth (recommended)
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="c1"></span><span class="n">      scram</span> <span class="o">=</span> <span class="kt">false</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">    }
</span></span><span class="line"><span class="ln">26</span><span class="cl">  }
</span></span><span class="line"><span class="ln">27</span><span class="cl">
</span></span><span class="line"><span class="ln">28</span><span class="cl">  <span class="k">configuration_info</span> {
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="n">    arn</span>      <span class="o">=</span> <span class="k">aws_msk_configuration</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">arn</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="n">    revision</span> <span class="o">=</span> <span class="k">aws_msk_configuration</span><span class="p">.</span><span class="k">main</span><span class="p">.</span><span class="k">latest_revision</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl">  }
</span></span><span class="line"><span class="ln">32</span><span class="cl">
</span></span><span class="line"><span class="ln">33</span><span class="cl">  <span class="k">encryption_info</span> {
</span></span><span class="line"><span class="ln">34</span><span class="cl">    <span class="k">encryption_in_transit</span> {
</span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="n">      client_broker</span> <span class="o">=</span> <span class="s2">&#34;TLS&#34;</span>
</span></span><span class="line"><span class="ln">36</span><span class="cl">    }
</span></span><span class="line"><span class="ln">37</span><span class="cl">  }
</span></span><span class="line"><span class="ln">38</span><span class="cl">
</span></span><span class="line"><span class="ln">39</span><span class="cl">  <span class="k">logging_info</span> {
</span></span><span class="line"><span class="ln">40</span><span class="cl">    <span class="k">broker_logs</span> {
</span></span><span class="line"><span class="ln">41</span><span class="cl">      <span class="k">cloudwatch_logs</span> {
</span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="n">        enabled</span>   <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="ln">43</span><span class="cl"><span class="n">        log_group</span> <span class="o">=</span> <span class="k">aws_cloudwatch_log_group</span><span class="p">.</span><span class="k">msk</span><span class="p">.</span><span class="k">name</span>
</span></span><span class="line"><span class="ln">44</span><span class="cl">      }
</span></span><span class="line"><span class="ln">45</span><span class="cl">    }
</span></span><span class="line"><span class="ln">46</span><span class="cl">  }
</span></span><span class="line"><span class="ln">47</span><span class="cl">}</span></span></code></pre></div><h3 id="phase-2data-migrationmirrormaker-20">Phase 2：Data migration（MirrorMaker 2.0）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Self-managed Kafka ──(MM2)──→ MSK
</span></span><span class="line"><span class="ln">2</span><span class="cl">                       │
</span></span><span class="line"><span class="ln">3</span><span class="cl">                consumer offset sync
</span></span><span class="line"><span class="ln">4</span><span class="cl">                       │
</span></span><span class="line"><span class="ln">5</span><span class="cl">                topic config sync</span></span></code></pre></div><p>MM2 runs for 1-7 days, depending on topic volume + retention period; proceed to cutover once replication lag (replica.lag) has caught up.</p>
<h3 id="phase-3cutover">Phase 3: Cutover</h3>
<ul>
<li>Switch the application's bootstrap.servers from self-managed → MSK</li>
<li>Shift producers gradually (10% → 50% → 100%)</li>
<li>When consumers switch, they resume from the offsets that MM2 has synced</li>
<li>Keep the self-managed cluster as a read-only standby for 2 weeks</li>
</ul>
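<p>One way to implement the gradual producer shift is a stable hash bucket per producer, so that a producer moved at 10% stays on MSK at 50% and 100%. This is only a sketch: <code>bucket_for</code> and <code>bootstrap_for</code> are hypothetical helpers, and the two bootstrap strings are the example hosts used elsewhere in this article.</p>

```python
import hashlib

# Example bootstrap strings; substitute your own clusters.
SELF_MANAGED_BOOTSTRAP = "self-managed-kafka:9092"
MSK_BOOTSTRAP = "b-1.mycluster.kafka.us-east-1.amazonaws.com:9098"


def bucket_for(producer_id):
    """Map a producer ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(producer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def bootstrap_for(producer_id, msk_percent):
    """Pick the cluster for this producer at the current rollout percentage."""
    if msk_percent > bucket_for(producer_id):
        return MSK_BOOTSTRAP
    return SELF_MANAGED_BOOTSTRAP
```

<p>Raising <code>msk_percent</code> from 10 to 50 to 100 only ever moves producers towards MSK, never back, which keeps each rollout step easy to reason about. Hashing by producer rather than by message is an assumption here; it keeps a single producer's writes on one cluster at any given time.</p>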
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1iam-auth-沒設application-連不上">Case 1: IAM auth not set up, application cannot connect</h3>
<p><strong>Symptom</strong>: after cutover the application reports <code>SaslAuthenticationException: Access denied</code>; on the MSK side, the CloudWatch logs show that the IAM principal is not recognized.</p>
<p><strong>Root cause</strong>: MSK IAM auth requires the client to run an <em>MSK IAM auth library</em> (<code>aws-msk-iam-auth</code> for Java, <code>aws-msk-iam-sasl-signer-python</code> for Python); the application uses a standard Kafka client that does not know how to produce the IAM signature.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Python kafka-python + MSK IAM auth</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">from</span> <span class="nn">aws_msk_iam_sasl_signer</span> <span class="kn">import</span> <span class="n">MSKAuthTokenProvider</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="kn">from</span> <span class="nn">kafka</span> <span class="kn">import</span> <span class="n">KafkaProducer</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># The signer exposes generate_auth_token() as a function on MSKAuthTokenProvider,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># so the provider is a plain class instead of a subclass. Newer kafka-python</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># releases may also require subclassing kafka.sasl.oauth.AbstractTokenProvider.</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">class</span> <span class="nc">AwsMskIamProvider</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="k">def</span> <span class="nf">token</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">        <span class="c1"># generate_auth_token returns a (token, expiry_ms) tuple</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">        <span class="n">token</span><span class="p">,</span> <span class="n">_expiry_ms</span> <span class="o">=</span> <span class="n">MSKAuthTokenProvider</span><span class="o">.</span><span class="n">generate_auth_token</span><span class="p">(</span><span class="s1">&#39;us-east-1&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="k">return</span> <span class="n">token</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="n">producer</span> <span class="o">=</span> <span class="n">KafkaProducer</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="n">bootstrap_servers</span><span class="o">=</span><span class="s1">&#39;b-1.mycluster.kafka.us-east-1.amazonaws.com:9098&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="n">security_protocol</span><span class="o">=</span><span class="s1">&#39;SASL_SSL&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="n">sasl_mechanism</span><span class="o">=</span><span class="s1">&#39;OAUTHBEARER&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">    <span class="n">sasl_oauth_token_provider</span><span class="o">=</span><span class="n">AwsMskIamProvider</span><span class="p">(),</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="p">)</span></span></span></code></pre></div><p>The EKS pod must have an IAM role (IRSA) that allows the <code>kafka-cluster:Connect</code> action on the MSK cluster.</p>
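<p>The IAM role mentioned above needs a policy that grants the MSK data-plane actions. The sketch below builds such a policy document for a producer; the account ID, cluster name and UUID in the ARNs are placeholders, and <code>msk_client_policy</code> is a hypothetical helper. A consumer would typically also need read and consumer-group permissions, which are left out here.</p>

```python
import json


def msk_client_policy(cluster_arn, topic_arn):
    """Build an IAM policy document for a client that connects and produces."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": cluster_arn,
            },
            {
                "Effect": "Allow",
                "Action": [
                    "kafka-cluster:DescribeTopic",
                    "kafka-cluster:WriteData",
                ],
                "Resource": topic_arn,
            },
        ],
    }


# Placeholder ARNs: the account ID, cluster name and UUID are made up.
policy = msk_client_policy(
    "arn:aws:kafka:us-east-1:111122223333:cluster/production/EXAMPLE-UUID",
    "arn:aws:kafka:us-east-1:111122223333:topic/production/EXAMPLE-UUID/*",
)
print(json.dumps(policy, indent=2))
```

<p>Generating the document from code makes it easy to review which applications get which topic scope before cutover, instead of discovering a missing permission as an <code>Access denied</code> in production.</p>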
<h3 id="case-2version-pinning360-跟-self-managed-行為差">Case 2: Version pinning, 3.6.0 behaves differently from self-managed</h3>
<p><strong>Symptom</strong>: after cutover to MSK 3.6.0, some consumers running old clients fail; the new brokers change the default <code>inter.broker.protocol.version</code>, which the clients do not recognize.</p>
<p><strong>Root cause</strong>: broker config changes when MSK moves to a newer Kafka version, and old clients (&lt; 2.8) no longer match the new broker protocol; the self-managed cluster may have been running an older broker version, so the problem never showed up there.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-migration</strong>: upgrade every client to Kafka client library 2.8+</li>
<li><strong>Align MSK kafka_version with self-managed</strong>: first build MSK on 3.0 / 3.5 to match self-managed, then upgrade after cutover</li>
<li><strong>Phased rollout</strong>: use <em>Tiered Storage</em> + a retention policy to keep old data, while new producers / consumers use the new version</li>
</ol>
<h3 id="case-3metric-pipeline-失效soc-dashboard-無數據">Case 3: Metric pipeline breaks, SOC dashboard shows no data</h3>
<p><strong>Symptom</strong>: after cutover the Grafana dashboard shows MSK metrics at 0; the old JMX exporter cannot scrape MSK; CloudWatch has the metrics, but the SOC side does not consume CloudWatch.</p>
<p><strong>Root cause</strong>: MSK does not expose raw JMX. Metrics flow through CloudWatch or open monitoring (Prometheus + Grafana), which is not equivalent to a self-built JMX-based pipeline.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Enable open monitoring</strong>: turn on the Prometheus JMX exporter in the MSK cluster config (in Terraform, <code>open_monitoring.prometheus.jmx_exporter.enabled_in_broker = true</code>), then have Prometheus scrape the MSK brokers</li>
<li><strong>CloudWatch → Prometheus</strong>: use <code>cloudwatch-exporter</code> to pull CloudWatch metrics into Prometheus</li>
<li><strong>Dashboard refresh</strong>: rewrite the Grafana dashboards for MSK-specific metric names (<code>kafka_server_*</code> → <code>aws_kafka_*</code>, or a unified alias)</li>
</ol>
<h3 id="case-4cross-cluster-mirrormm2--msk配置複雜">Case 4: Cross-cluster mirror (MM2 → MSK) is complex to configure</h3>
<p><strong>Symptom</strong>: MM2 has run for a week, yet consumer offsets on self-managed and MSK are not in sync; after the application switches over it <em>re-reads the whole backlog</em> and processes duplicates.</p>
<p><strong>Root cause</strong>: MM2 consumer offset sync needs a <em>cross-cluster</em> mapping, because a source-side offset does not translate directly to a target-side offset, and MM2 leaves offset sync disabled by default.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-properties" data-lang="properties"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># MM2 config</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">source.bootstrap.servers</span><span class="o">=</span><span class="s">self-managed-kafka:9092</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">target.bootstrap.servers</span><span class="o">=</span><span class="s">msk-cluster:9098</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">target.security.protocol</span><span class="o">=</span><span class="s">SASL_SSL</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># Must be enabled (off by default). Keep comments on their own line:</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># a trailing # after a value becomes part of the value in .properties files.</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">sync.group.offsets.enabled</span><span class="o">=</span><span class="s">true</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="na">emit.checkpoints.enabled</span><span class="o">=</span><span class="s">true</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="na">checkpoints.topic.replication.factor</span><span class="o">=</span><span class="s">3</span></span></span></code></pre></div><p><strong>Architecture</strong>: at switchover, consumers resume from offsets translated through the <em>MM2 checkpoint</em> topic instead of reusing the source cluster's internal offsets; the application side should also be <em>idempotent</em> and use a <em>dedup key</em> to avoid duplicate processing.</p>
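<p>The idempotent + dedup key idea can be sketched as follows. This is a minimal illustration, not a production consumer: it assumes every record carries a stable business key (the <code>event_id</code> field is hypothetical), and an in-memory set stands in for the durable store a real consumer would need.</p>

```python
from typing import Callable, Iterable


class DedupProcessor:
    """Skips records whose dedup key was already processed.

    The in-memory set is an assumption for illustration only; a real
    consumer would need a durable store that survives restarts.
    """

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()

    def process(self, records: Iterable[dict]) -> int:
        """Runs the handler once per unique event_id; returns how many ran."""
        handled = 0
        for record in records:
            key = record["event_id"]  # hypothetical dedup key field
            if key in self._seen:
                continue  # replayed after cutover: already handled
            self._handler(record)
            self._seen.add(key)
            handled += 1
        return handled
```

<p>With this shape, a batch that is re-read after the switch to MSK passes through <code>process</code> again but only records with unseen keys reach the handler.</p>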
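<h3 id="case-5msk-billing-暴漲tiered-storage--cross-az-沒控">Case 5: MSK bill spikes, Tiered Storage / cross-AZ left uncontrolled</h3>
<p><strong>Symptom</strong>: the first monthly MSK bill is 50% above the estimate; the breakdown shows cross-AZ traffic (producers/consumers crossing AZs) plus hot data being offloaded to S3 by Tiered Storage.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>MSK's automatic multi-AZ deployment makes cross-AZ traffic unavoidable: a producer must write to the partition leader, which may sit in another AZ</li>
<li>Tiered Storage adds storage cost for hot data (retention &lt; 24 hours); it is only cost-effective for cold data</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>AZ-aware routing in the application</strong>: let consumers fetch from a same-AZ replica (set <code>client.rack</code> on the consumer, with rack-aware replica selection enabled on the brokers); producers always write to the partition leader, so their cross-AZ traffic cannot be removed by client config alone</li>
<li><strong>Align retention with the hot tier</strong>: keep data with &lt; 24 hours retention on broker local storage, and use Tiered Storage only beyond 24 hours</li>
<li><strong>Reserved instance</strong>: MSK has no direct reserved pricing, but EBS / data transfer can be prepaid for a 10-20% reduction</li>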
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed Kafka</th>
          <th>MSK</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster cost (3 broker)</td>
          <td>$660 EC2 + $1500 EBS = $2,160</td>
          <td>$2,500-3,500 (including storage + multi-AZ)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-1 FTE = $5K-10K</td>
          <td>0.1-0.3 FTE = $1K-3K</td>
      </tr>
      <tr>
          <td>Patch / maintenance</td>
          <td>Manual, with downtime risk</td>
          <td>Auto + maintenance window scheduled</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>Self-managed MirrorMaker</td>
          <td>Built-in (S3 archive, automatic)</td>
      </tr>
      <tr>
          <td>Metric / monitoring</td>
          <td>Prometheus + Grafana self-deploy</td>
          <td>CloudWatch + open monitoring</td>
      </tr>
      <tr>
          <td>Cross-AZ traffic</td>
          <td>Limited by VPC layout</td>
          <td>Automatic multi-AZ; watch the cross-AZ traffic cost</td>
      </tr>
      <tr>
          <td>Tiered storage</td>
          <td>Kafka 3.6+ self-managed</td>
          <td>MSK built-in tiered storage</td>
      </tr>
      <tr>
          <td>Total (3 brokers, mid-size)</td>
          <td>$7K-12K / mo (incl. FTE)</td>
          <td>$3.5K-6.5K / mo (incl. FTE)</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>1-3 FTE × 1-2 months</td>
      </tr>
  </tbody>
</table>
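<p><strong>Takeaway</strong>: for organizations with &lt; 50 brokers, MSK ROI typically breaks even in 6-12 months and saves FTE afterward; for large organizations with 50+ brokers, self-managing may actually cost less.</p>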
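<p>The totals row is just the cluster cost plus the operational FTE cost. A small sketch of that arithmetic, using only the figures from the table (the helper name <code>monthly_total</code> is made up for this example):</p>

```python
def monthly_total(infra: tuple[int, int], fte: tuple[int, int]) -> tuple[int, int]:
    """Adds (low, high) infrastructure and staffing ranges, in USD per month."""
    return infra[0] + fte[0], infra[1] + fte[1]


# Figures copied from the table (3 brokers).
self_managed = monthly_total(infra=(2160, 2160), fte=(5000, 10000))
msk = monthly_total(infra=(2500, 3500), fte=(1000, 3000))

print(self_managed)  # (7160, 12160)
print(msk)           # (3500, 6500)
```

<p>Rounded, that is roughly $7K-12K per month self-managed against $3.5K-6.5K on MSK, before the one-off migration cost.</p>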
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-kafka--nats-migration-對位">Compared with <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS migration</a></h3>
<p>Two ways forward from self-managed Kafka:</p>
<ul>
<li>MSK: operational simplification, protocol drop-in, moderate cost increase; fits organizations that <em>keep using the Kafka paradigm</em></li>
<li>NATS: a paradigm shift that forces application changes; fits use cases that need <em>plain messaging without event sourcing</em></li>
</ul>
<p>Most organizations do not need a paradigm shift, so MSK is the more reasonable choice; go to NATS only when lightweight messaging is what you actually need.</p>
<h3 id="跟-confluent-cloud-對位">Compared with <a href="https://www.confluent.io/confluent-cloud/">Confluent Cloud</a></h3>
<p>Confluent Cloud is another managed Kafka that spans clouds (AWS / GCP / Azure); MSK is AWS-only but integrates more deeply with IAM / VPC. Multi-cloud organizations lean toward Confluent, AWS-deep organizations toward MSK.</p>
<h3 id="跟-iam--secrets-manager-整合">Integration with IAM / Secrets Manager</h3>
<p>MSK + IAM auth + Secrets Manager (see <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/migrate-to-aws-secrets-manager/" data-link-title="Vault → AWS Secrets Manager：「secret」不是「secret」、identity model 才是核心差異" data-link-desc="Vault → AWS Secrets Manager migration 表面是 secret store 替換、實際核心是 identity model 對位（Vault token &#43; policy vs AWS IAM &#43; resource policy）；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 identity axis 候選 — identity 是否獨立 audit 軸；5 個 production 踩雷（IAM principal 對位 / dynamic credential 對等失敗 / lease lifecycle 模型不同 / audit log 結構差 / 計費模型反轉）">Vault → AWS Secrets Manager migration</a>) is the standard combination for an AWS-deep stack; short-lived credentials + IRSA are the production best practice.</p>
<h3 id="反向-migrationmsk--self-managed">Reverse migration (MSK → self-managed)</h3>
<p>Rare, and usually driven by a <em>cost reversal</em> (at large scale) or a <em>multi-cloud strategy</em>. The process mirrors the forward one; note that MSK Tiered Storage data cannot be exported directly, so you must <em>disable tiered storage first</em> and recall the data.</p>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>MSK Connect</strong>: managed Kafka Connect; lowers connector operations, but the plugin ecosystem is smaller than self-managed Connect</li>
<li><strong>MSK Serverless</strong>: suits bursty workloads; more expensive for steady workloads</li>
<li><strong>Cost monitoring playbook</strong>: break down the MSK bill once a month to catch unexpected egress / tiered storage cost</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka</a></li>
<li>Parallel migration playbook (Type C): <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> / <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a></li>
<li>Parallel H cost variant: <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack</a></li>
<li>Parallel paradigm shift: <a href="/blog/backend/03-message-queue/vendors/kafka/migrate-from-to-nats/" data-link-title="Kafka ↔ NATS：不是 migration、是 messaging paradigm 重設計" data-link-desc="Kafka 跟 NATS 不是同類產品（log-based event streaming vs subject-based messaging）、&#39;migration&#39; 字面上不成立；本文釐清兩家 paradigm 邊界、什麼情境真的能換、application 模式重設計的 5 個踩雷（consumer offset 觀念差 / retention model / exactly-once 假設 / schema registry 缺位 / fan-out 模式差）、跟 JetStream 對位 &#43; 混合架構">Kafka ↔ NATS</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>自管 Redis / Valkey → AWS ElastiCache：engine 不變、變的是誰運維</title><link>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-elasticache/</link><pubDate>Tue, 16 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-elasticache/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook that cross-links &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis&lt;/a> / &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/valkey/" data-link-title="Valkey" data-link-desc="Redis fork、Linux Foundation 託管、BSD 授權">Valkey&lt;/a> (source, self-managed) and &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/aws-elasticache/" data-link-title="AWS ElastiCache" data-link-desc="AWS managed Redis / Valkey / Memcached">AWS ElastiCache&lt;/a> (target, managed). Running the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit&lt;/a> gives &lt;strong>Operational model = High (self-managed → managed), everything else Low → Type C operational hybrid&lt;/strong>. ElastiCache is a managed SaaS; AWS operations follow the official documentation (not verified locally, parameters per the official docs), last checked 2026-06-16.&lt;/p>&lt;/blockquote>
&lt;h2 id="engine-不變變的是誰運維">The engine stays the same; what changes is who operates it&lt;/h2>
&lt;p>Most vendor migrations replace something fundamental: the protocol, the data model, the paradigm. Self-managed Redis/Valkey → ElastiCache replaces none of them: ElastiCache runs the Redis or Valkey engine itself, with the same RESP protocol, the same data types, the same client libraries, and the same commands. Application code barely changes.&lt;/p>
&lt;p>So what is being migrated? &lt;strong>Ownership of operations&lt;/strong>. Self-managed means deploying, patching, configuring replication, and handling failover in the middle of the night yourself. ElastiCache takes these over: AWS handles failover, patching, snapshots, and cross-AZ replication. All of the migration work concentrates on handing operations over: networking (VPC), security (IAM / Security Group), data continuity at cutover, and being clear about &lt;strong>which controls you give up along with operations&lt;/strong> (no more SSH into the host, no arbitrary config changes, only the tunables the parameter group exposes).&lt;/p>
&lt;p>This maps to Type C operational hybrid in the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration methodology&lt;/a>: operational model is the only High dimension and everything else is Low. This article walks through the actual work and the responsibility boundaries of this "same engine, operations handed over" migration.&lt;/p>
&lt;h2 id="6-維-diff-dimension-audit">6-dimension diff audit&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>Assessment&lt;/th>
 &lt;th>Level&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>Same engine (Redis/Valkey), same RESP, same commands&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Operational model&lt;/strong>&lt;/td>
 &lt;td>&lt;strong>Self-managed → AWS managed (failover/patch/snapshot)&lt;/strong>&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Abstraction / paradigm&lt;/td>
 &lt;td>Identical (same engine)&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Number of components&lt;/td>
 &lt;td>1 → 1&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>Endpoint changes, client adds reconnect / TLS, nothing else changes&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data topology&lt;/td>
 &lt;td>Cache can be rebuilt (re-warm), or RDB seed / online replication&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
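&lt;p>The only High is operational model, which maps to &lt;strong>Type C operational hybrid&lt;/strong>. Type C is structured as "operational audit up front + drop-in cutover": because the engine/API is unchanged, the cutover itself is close to drop-in (swap the endpoint), and the real work is the up-front inventory of networking, security, and responsibility boundaries.&lt;/p></description><content:encoded><![CDATA[<blockquote>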
<p>This article is a cross-vendor migration playbook that cross-links <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis</a> / <a href="/blog/backend/02-cache-redis/vendors/valkey/" data-link-title="Valkey" data-link-desc="Redis fork、Linux Foundation 託管、BSD 授權">Valkey</a> (source, self-managed) and <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/" data-link-title="AWS ElastiCache" data-link-desc="AWS managed Redis / Valkey / Memcached">AWS ElastiCache</a> (target, managed). Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> gives <strong>Operational model = High (self-managed → managed), everything else Low → Type C operational hybrid</strong>. ElastiCache is a managed SaaS; AWS operations follow the official documentation (not verified locally, parameters per the official docs), last checked 2026-06-16.</p></blockquote>
<h2 id="engine-不變變的是誰運維">The engine stays the same; what changes is who operates it</h2>
<p>Most vendor migrations replace something fundamental: the protocol, the data model, the paradigm. Self-managed Redis/Valkey → ElastiCache replaces none of them: ElastiCache runs the Redis or Valkey engine itself, with the same RESP protocol, the same data types, the same client libraries, and the same commands. Application code barely changes.</p>
<p>So what is being migrated? <strong>Ownership of operations</strong>. Self-managed means deploying, patching, configuring replication, and handling failover in the middle of the night yourself. ElastiCache takes these over: AWS handles failover, patching, snapshots, and cross-AZ replication. All of the migration work concentrates on handing operations over: networking (VPC), security (IAM / Security Group), data continuity at cutover, and being clear about <strong>which controls you give up along with operations</strong> (no more SSH into the host, no arbitrary config changes, only the tunables the parameter group exposes).</p>
<p>This maps to Type C operational hybrid in the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration methodology</a>: operational model is the only High dimension and everything else is Low. This article walks through the actual work and the responsibility boundaries of this "same engine, operations handed over" migration.</p>
<h2 id="6-維-diff-dimension-audit">6-dimension diff audit</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same engine (Redis/Valkey), same RESP, same commands</td>
          <td>Low</td>
      </tr>
      <tr>
          <td><strong>Operational model</strong></td>
          <td><strong>Self-managed → AWS managed (failover/patch/snapshot)</strong></td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Identical (same engine)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>1 → 1</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Endpoint changes, client adds reconnect / TLS, nothing else changes</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Cache can be rebuilt (re-warm), or RDB seed / online replication</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
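<p>The only High is operational model, which maps to <strong>Type C operational hybrid</strong>. Type C is structured as "operational audit up front + drop-in cutover": because the engine/API is unchanged, the cutover itself is close to drop-in (swap the endpoint), and the real work is the up-front inventory of networking, security, and responsibility boundaries.</p>
<h2 id="operational-auditcutover-前要盤點的">Operational audit: what to inventory before cutover</h2>
<p>ElastiCache takes over operations, but it also draws new boundaries. Inventory these before cutover:</p>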
<table>
  <thead>
      <tr>
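          <th>Aspect</th>
          <th>Your responsibility when self-managed</th>
          <th>After ElastiCache</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deployment / patch</td>
          <td>Install and upgrade yourself</td>
          <td>Managed by AWS (no arbitrary version control; you follow AWS's engine versions)</td>
      </tr>
      <tr>
          <td>failover</td>
          <td>Configure Sentinel / switch manually</td>
          <td>Automatic with Multi-AZ (make sure the client reconnects)</td>
      </tr>
      <tr>
          <td>config</td>
          <td>Change anything in redis.conf</td>
          <td>Only what the parameter group exposes (some directives are locked)</td>
      </tr>
      <tr>
          <td>Network access</td>
          <td>Your own network</td>
          <td>Reachable only inside the VPC; requires subnet group / Security Group</td>
      </tr>
      <tr>
          <td>Authentication</td>
          <td>AUTH password / self-managed TLS</td>
          <td>IAM auth (Redis 7+) / TLS managed by ElastiCache</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Your own Prometheus, etc.</td>
          <td>CloudWatch (metric names differ from self-managed; dashboards need changes)</td>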
      </tr>
  </tbody>
</table>
<p><strong>Key audit outputs</strong>: (1) which redis.conf directives you currently override, and whether the ElastiCache parameter group supports them; (2) whether the client has failover reconnect logic (managed failover does not reconnect for you); (3) monitoring has to move from self-managed tooling to CloudWatch. These three items are the core work of Type C. For the detailed managed responsibility boundary, see the <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/managed-responsibility-boundary/" data-link-title="AWS ElastiCache 的責任邊界：managed 接手了什麼、又默默留下什麼" data-link-desc="ElastiCache 把 failover、patching、snapshot、跨 AZ 複製接走，但 cache stampede、client 重連、key 設計、eviction policy 還是你的事。本文用 shared responsibility 拆解 managed 的真實邊界、展開 engine 選擇與 cluster mode 配置、5 個把『以為 AWS 全包』寫成事故的 production 踩坑，以及 ElastiCache 到 MemoryDB 的 durability 邊界">ElastiCache responsibility boundary deep article</a>.</p>
<h2 id="cutover資料連續性的兩條路">Cutover: two paths to data continuity</h2>
<p>Because the engine/API is unchanged, cutover is close to drop-in (swap the endpoint). There are two paths to data continuity:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Path A: re-warm (cache is rebuildable, simplest)
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  1. Create the ElastiCache cluster (empty; pick Valkey / Redis engine; set parameter group)
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  2. Application dual-writes (self-managed + ElastiCache); reads still go to self-managed
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  3. Switch reads to the ElastiCache endpoint; cache misses fall back to origin and warm up
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  4. Hit rate catches up → stop writing to self-managed → decommission self-managed
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">Path B: RDB seed (cache continuity needed, avoids warm-up load on origin)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  1. Run BGSAVE on the self-managed side to produce an RDB
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  2. Upload the RDB to S3; ElastiCache seeds the cluster from S3 (per the official restore procedure)
</span></span><span class="line"><span class="ln">10</span><span class="cl">  3. Application swaps the endpoint at cutover
</span></span><span class="line"><span class="ln">11</span><span class="cl">  (ElastiCache also offers online migration from self-managed Redis; see the official docs)</span></span></code></pre></div><p>How to choose:</p>
<ul>
<li>Pure cache that tolerates a brief warm-up → Path A (simplest, no data migration)</li>
<li>Large dataset where warm-up would overwhelm origin → Path B (RDB seed preserves continuity)</li>
<li>AWS CLI details for cluster creation and restore follow the <a href="https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/">official ElastiCache documentation</a> (not verified locally)</li>
<li>Choose the Valkey engine (AWS default, roughly 20% cheaper than Redis) unless you depend on Redis commercial modules</li>
</ul>
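<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1parameter-group-不支援自管時改的-config">Case 1: parameter group does not support a config changed while self-managed</h3>
<p><strong>Symptom</strong>: a redis.conf directive was changed while self-managed (for example a specific <code>client-output-buffer-limit</code> or some advanced parameter); after moving to ElastiCache the setting cannot be applied or behaves differently.</p>
<p><strong>Root cause</strong>: ElastiCache only allows changes to directives the parameter group exposes; some config is locked by AWS (for managed stability). The freedom to set any config narrows once managed.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Pre-migration, list every non-default config on the self-managed side and check each one against ElastiCache parameter group support</li>
<li>Assess the impact of unsupported directives: some are already handled better by AWS, others require the application to adapt</li>
<li>Do this inventory in the operational audit (before cutover), not after the migration</li>
<li>Workloads that depend heavily on special config tuning may not suit managed; keep them self-managed</li>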
</ol>
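<p>The directive-by-directive comparison in step 1 can be scripted. A minimal sketch, assuming you have already extracted your non-default directives and compiled the supported set yourself; both inputs below are hypothetical examples and say nothing authoritative about what ElastiCache actually allows:</p>

```python
def audit_config(custom: dict[str, str], supported: set[str]) -> dict[str, list[str]]:
    """Splits non-default redis.conf directives by parameter-group support."""
    result: dict[str, list[str]] = {"supported": [], "unsupported": []}
    for name in sorted(custom):
        bucket = "supported" if name in supported else "unsupported"
        result[bucket].append(name)
    return result


# Hypothetical inputs: fill `supported_params` from the parameter-group
# reference for your engine version, not from this example.
custom_conf = {
    "maxmemory-policy": "allkeys-lfu",
    "client-output-buffer-limit": "replica 512mb 128mb 60",
    "save": "900 1",
}
supported_params = {"maxmemory-policy"}

report = audit_config(custom_conf, supported_params)
print(report["unsupported"])  # ['client-output-buffer-limit', 'save']
```

<p>Everything that lands in the unsupported bucket is an item to assess before cutover rather than after.</p>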
<h3 id="case-2failover-後-client-不重連managed-不代為重連">Case 2: client does not reconnect after failover (managed does not reconnect for you)</h3>
<p><strong>Symptom</strong>: the ElastiCache Multi-AZ failover completes, but the application keeps talking to the old primary and writes fail.</p>
<p><strong>Root cause</strong>: ElastiCache takes over failover (it promotes a replica automatically), but reconnecting the client remains the application's responsibility. This is the core of the <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/managed-responsibility-boundary/" data-link-title="AWS ElastiCache 的責任邊界：managed 接手了什麼、又默默留下什麼" data-link-desc="ElastiCache 把 failover、patching、snapshot、跨 AZ 複製接走，但 cache stampede、client 重連、key 設計、eviction policy 還是你的事。本文用 shared responsibility 拆解 managed 的真實邊界、展開 engine 選擇與 cluster mode 配置、5 個把『以為 AWS 全包』寫成事故的 production 踩坑，以及 ElastiCache 到 MemoryDB 的 durability 邊界">managed responsibility boundary</a>: AWS swaps the primary, and the client has to follow on its own.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Connect the client to the primary endpoint (its DNS record follows the failover); never hard-code node IPs</li>
<li>Give the client a reasonable socket timeout plus retry, and shorten DNS caching</li>
<li>Verify the client's failover reconnect behavior before migrating (under self-managed Sentinel it may rely on a different mechanism)</li>
<li>Cross-check with <a href="/blog/backend/02-cache-redis/vendors/redis/sentinel-ha-failover/" data-link-title="Redis Sentinel 與 failover 時序：從 master 死掉到 client 重連的每一段" data-link-desc="Redis Sentinel 的 failover 不是一個瞬間動作，是 down 偵測 → quorum 確認 → 選主 → 提升 → 配置廣播 → client 重連的一條時序鏈，每一段都有自己的延遲與失敗模式。本文展開 Sentinel 的判定模型與這條時序、5 個讓 failover 卡住或丟資料的 production 踩坑，以及 Sentinel 撐不住該往 Cluster 或 managed 走的邊界">Redis Sentinel failover timing</a>: self-managed and managed failover mechanisms differ, so client handling must be re-verified</li>
</ol>
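<p>The "timeout + retry" item can be illustrated with a small wrapper. This is a sketch under assumptions: <code>op</code> is any callable that opens a fresh connection through the primary endpoint hostname on each call, and Python's built-in <code>ConnectionError</code> / <code>TimeoutError</code> stand in for whatever exception types your Redis client library actually raises.</p>

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retries op on connection-type errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

<p>The backoff gives the endpoint's DNS record time to move to the promoted replica; the wrapper only helps if each retry re-resolves the hostname instead of reusing a cached IP.</p>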
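<h3 id="case-3endpoint-只在-vpc-內cutover-後連不上">Case 3: endpoint is VPC-only, cannot connect after cutover</h3>
<p><strong>Symptom</strong>: after cutover the application cannot reach ElastiCache at all; connections time out.</p>
<p><strong>Root cause</strong>: the ElastiCache endpoint is reachable only inside the VPC and is not exposed to the public internet. If the Security Group does not open 6379, the subnet group is misconfigured, or the application is not in the same VPC and has no VPC peering, the connection fails.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Before cutover, confirm the Security Group opens 6379 to the application's source and the subnet group is correct</li>
<li>If the application is in a different VPC, set up peering / Transit Gateway</li>
<li>From an EC2 instance inside the VPC, run <code>redis-cli -h &lt;endpoint&gt; ping</code> to verify connectivity before switching the application</li>
<li>This is the most common sticking point when moving from self-managed (your own network) to managed (the AWS VPC model)</li>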
</ol>
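<p>When <code>redis-cli</code> is not available on the host, a plain TCP probe answers the same first question: does the Security Group / routing let this machine reach the port at all? A minimal sketch; it checks TCP reachability only and does not speak TLS, AUTH, or RESP, and the function name <code>can_reach</code> is made up for this example.</p>

```python
import socket


def can_reach(host: str, port: int = 6379, timeout: float = 3.0) -> bool:
    """Returns True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

<p>Run it from the same subnet the application will use; a <code>False</code> from inside the VPC usually points at the Security Group or subnet group rather than at ElastiCache itself.</p>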
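<h3 id="case-4監控斷層自管工具--cloudwatch">Case 4: monitoring gap (self-managed tooling → CloudWatch)</h3>
<p><strong>Symptom</strong>: after cutover the existing Prometheus / Grafana dashboards are all empty and alerts stop firing.</p>
<p><strong>Root cause</strong>: the self-managed setup used redis_exporter + Prometheus, while ElastiCache metrics live in CloudWatch with different metric names and dimensions. Dashboards moved over as-is will not work.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Before cutover, rebuild the key alerts in CloudWatch (<code>DatabaseMemoryUsagePercentage</code> / <code>ReplicationLag</code> / <code>CurrConnections</code>, etc.)</li>
<li>To keep Grafana, connect it through the CloudWatch data source</li>
<li>Include the monitoring migration in the operational audit, so you do not discover the gap after migrating</li>
<li>The core metrics mean the same thing (memory / hits / connections / replication lag); only the source and naming change</li>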
</ol>
<h3 id="case-5以為-managed-就不會-oom--stampede--熱-key">Case 5: assuming managed means no OOM / stampede / hot keys</h3>
<p><strong>Symptom</strong>: after moving to ElastiCache you still see OOM, cache stampedes, and hot keys overwhelming a single shard.</p>
<p><strong>Root cause</strong>: ElastiCache takes over operations (failover/patch/snapshot), not the problems caused by how the cache is used. Memory eviction, stampedes, hot keys, and key design remain the application's responsibility: managed does not mean hands-off.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Memory / eviction tuning is still required (set maxmemory-policy through the parameter group); see <a href="/blog/backend/02-cache-redis/vendors/redis/memory-eviction-tuning/" data-link-title="Redis 記憶體與淘汰調校：maxmemory-policy、LFU 與碎片化的實戰判讀" data-link-desc="Redis 的記憶體是一條會在半夜爆掉的曲線：maxmemory 設多少、policy 選 LRU 還 LFU、碎片化什麼時候開始吃掉 30% RAM、OOM 時 noeviction 怎麼讓寫入全部失敗。本文展開 Redis 記憶體會計模型、eviction policy 的選型判讀、5 個把記憶體配置寫成 production 事故的踩坑，以及單機記憶體撞牆後該往 cluster 還是 DragonflyDB 走的邊界">memory tuning</a></li>
<li>Application-side protection against stampedes / hot keys (jitter / singleflight / two-level cache) stays as before</li>
<li>Be clear about the managed responsibility boundary, with what AWS manages on one side and what the application owns on the other; see the <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/managed-responsibility-boundary/" data-link-title="AWS ElastiCache 的責任邊界：managed 接手了什麼、又默默留下什麼" data-link-desc="ElastiCache 把 failover、patching、snapshot、跨 AZ 複製接走，但 cache stampede、client 重連、key 設計、eviction policy 還是你的事。本文用 shared responsibility 拆解 managed 的真實邊界、展開 engine 選擇與 cluster mode 配置、5 個把『以為 AWS 全包』寫成事故的 production 踩坑，以及 ElastiCache 到 MemoryDB 的 durability 邊界">responsibility boundary deep article</a></li>
<li>Moving to managed reduces operations; it does not remove the need for design</li>
</ol>
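<p>Of the application-side protections listed, TTL jitter is the smallest to show. A minimal sketch, assuming a base TTL in seconds and a fractional spread; the function name and defaults are illustrative, not taken from any library:</p>

```python
import random
from typing import Optional


def jittered_ttl(base_seconds: int, spread: float = 0.1,
                 rng: Optional[random.Random] = None) -> int:
    """Returns base_seconds shifted randomly by up to +/- spread (a fraction)."""
    rng = rng or random.Random()
    factor = 1.0 + rng.uniform(-spread, spread)
    return max(1, round(base_seconds * factor))


# Keys written in the same batch no longer all expire in the same second.
ttls = [jittered_ttl(300) for _ in range(3)]
```

<p>Spreading expiry this way keeps a bulk-loaded key set from expiring together and sending every miss to the origin at once; it works the same on self-managed Redis and on ElastiCache.</p>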
<h2 id="capacity--cost-對照">Capacity / cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed Redis / Valkey</th>
          <th>ElastiCache (managed)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>engine / API</td>
          <td>Same (Redis / Valkey)</td>
          <td>Same (Redis / Valkey engine)</td>
      </tr>
      <tr>
          <td>Operational responsibility</td>
          <td>All on you</td>
          <td>failover / patch / snapshot handed to AWS</td>
      </tr>
      <tr>
          <td>Config freedom</td>
          <td>Anything in redis.conf</td>
          <td>What the parameter group exposes (some directives locked)</td>
      </tr>
      <tr>
          <td>failover</td>
          <td>Set up Sentinel / Cluster yourself</td>
          <td>Automatic with Multi-AZ (client must reconnect)</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>Machines + operations staff</td>
          <td>Node fees + managed premium (saves staff time)</td>
      </tr>
      <tr>
          <td>Control</td>
          <td>Full</td>
          <td>Limited by AWS boundaries</td>
      </tr>
      <tr>
          <td>Best for</td>
          <td>Maximum control / multi-cloud / special config</td>
          <td>AWS ecosystem / reducing operations / predictable SLA</td>
      </tr>
  </tbody>
</table>
<p><strong>Takeaway</strong>: if you are in the AWS ecosystem, want to hand off operations, and can accept narrower config freedom → migrate to ElastiCache (same engine, Type C, low risk); if you need maximum control / multi-cloud / special config → stay self-managed. Choosing the Valkey engine saves roughly 20%.</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<p>Self-managed → ElastiCache is a transfer of operations, and it is intertwined with the managed boundary and with engine tuning:</p>
<ul>
<li><strong>With the <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/managed-responsibility-boundary/" data-link-title="AWS ElastiCache 的責任邊界：managed 接手了什麼、又默默留下什麼" data-link-desc="ElastiCache 把 failover、patching、snapshot、跨 AZ 複製接走，但 cache stampede、client 重連、key 設計、eviction policy 還是你的事。本文用 shared responsibility 拆解 managed 的真實邊界、展開 engine 選擇與 cluster mode 配置、5 個把『以為 AWS 全包』寫成事故的 production 踩坑，以及 ElastiCache 到 MemoryDB 的 durability 邊界">ElastiCache responsibility boundary deep article</a></strong>: what AWS manages and what the application still owns after the move is the core consequence of this migration.</li>
<li><strong>With <a href="/blog/backend/02-cache-redis/vendors/redis/sentinel-ha-failover/" data-link-title="Redis Sentinel 與 failover 時序：從 master 死掉到 client 重連的每一段" data-link-desc="Redis Sentinel 的 failover 不是一個瞬間動作，是 down 偵測 → quorum 確認 → 選主 → 提升 → 配置廣播 → client 重連的一條時序鏈，每一段都有自己的延遲與失敗模式。本文展開 Sentinel 的判定模型與這條時序、5 個讓 failover 卡住或丟資料的 production 踩坑，以及 Sentinel 撐不住該往 Cluster 或 managed 走的邊界">Redis Sentinel failover</a></strong>: self-managed failover (Sentinel) → managed failover (Multi-AZ); client reconnect logic must be re-verified.</li>
<li><strong>With <a href="/blog/backend/02-cache-redis/vendors/valkey/" data-link-title="Valkey" data-link-desc="Redis fork、Linux Foundation 託管、BSD 授權">Valkey</a></strong>: the ElastiCache default engine is Valkey, so moving self-managed Redis to ElastiCache for Valkey does "license change + move to managed" in one step (see <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-valkey/" data-link-title="Redis → Valkey：同一份程式碼、不同授權的 drop-in 遷移" data-link-desc="Valkey 是 Redis 7.2.4 的 fork，bit-for-bit 幾乎同源、RDB/AOF 檔案相容、client 一行不改——這是技術上最容易的 cache 遷移。真正的工作不在搬資料，在授權合規驗證與 fork 後分歧（Redis 7.4&#43; 功能、Stack 商業 module）的盤點。本文走 Type B drop-in、相容性 audit 前置、5 個把『最容易的遷移』寫成事故的踩坑">Redis → Valkey migration</a>).</li>
<li><strong>With <a href="/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">capability-level buy vs build</a></strong>: the higher-level trade-off between self-managed and managed is covered in that chapter; this article is the migration execution once you have decided to buy (managed).</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis</a> / <a href="/blog/backend/02-cache-redis/vendors/valkey/" data-link-title="Valkey" data-link-desc="Redis fork、Linux Foundation 託管、BSD 授權">Valkey</a> (self-managed)</li>
<li>Target vendor: <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/" data-link-title="AWS ElastiCache" data-link-desc="AWS managed Redis / Valkey / Memcached">AWS ElastiCache</a></li>
<li>Matching deep article: <a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/managed-responsibility-boundary/" data-link-title="AWS ElastiCache 的責任邊界：managed 接手了什麼、又默默留下什麼" data-link-desc="ElastiCache 把 failover、patching、snapshot、跨 AZ 複製接走，但 cache stampede、client 重連、key 設計、eviction policy 還是你的事。本文用 shared responsibility 拆解 managed 的真實邊界、展開 engine 選擇與 cluster mode 配置、5 個把『以為 AWS 全包』寫成事故的 production 踩坑，以及 ElastiCache 到 MemoryDB 的 durability 邊界">ElastiCache responsibility boundary</a></li>
<li>Related migration: <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-valkey/" data-link-title="Redis → Valkey：同一份程式碼、不同授權的 drop-in 遷移" data-link-desc="Valkey 是 Redis 7.2.4 的 fork，bit-for-bit 幾乎同源、RDB/AOF 檔案相容、client 一行不改——這是技術上最容易的 cache 遷移。真正的工作不在搬資料，在授權合規驗證與 fork 後分歧（Redis 7.4&#43; 功能、Stack 商業 module）的盤點。本文走 Type B drop-in、相容性 audit 前置、5 個把『最容易的遷移』寫成事故的踩坑">Redis → Valkey</a> (license change, optionally combined with the move to managed)</li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> (Type C operational hybrid)</li>
</ul>
]]></content:encoded></item></channel></rss>