<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Topology on Tarragon</title><link>https://tarrragon.github.io/blog/tags/topology/</link><description>Recent content in Topology on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/topology/index.xml" rel="self" type="application/rss+xml"/><item><title>MongoDB Shard Expansion + Multi-DC: the multi-region exception to Type F's "no parallel run needed" claim</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB&lt;/a> overview. It is the third dogfood of &lt;a href="https://tarrragon.github.io/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Type F "Topology re-layout"&lt;/a>, and it specifically tests the &lt;em>multi-region rollout exception&lt;/em> to the "no parallel run needed" claim in item 3 of the self-aware limitations: this article is a concrete counterexample to that claim.&lt;/p>&lt;/blockquote>
&lt;h2 id="reviewer-d-的質疑type-f-一定不需要-parallel-run-嗎">Reviewer D's challenge: does Type F really never need a parallel run?&lt;/h2>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Self-aware limitation&lt;/a> item 3 concedes:&lt;/p>
&lt;blockquote>
&lt;p>The "no parallel run needed" claim does not fully hold: a multi-region rollout (which #128 lists as a Type F scenario) must use a parallel run. Both regions run at the same time and traffic is then cut over; the alternative is a downtime switch. The mechanism is the same as Type A phase 3.&lt;/p>&lt;/blockquote>
&lt;p>This article is &lt;em>supporting evidence&lt;/em> for that concession: taking a MongoDB sharded cluster from a single DC to more shards plus a secondary DC does require a parallel run and a traffic cutover, and is partially isomorphic to a Type A phased migration:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a> overview. It is the third dogfood of <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Type F "Topology re-layout"</a>, and it specifically tests the <em>multi-region rollout exception</em> to the "no parallel run needed" claim in item 3 of the self-aware limitations: this article is a concrete counterexample to that claim.</p></blockquote>
<h2 id="reviewer-d-的質疑type-f-一定不需要-parallel-run-嗎">Reviewer D's challenge: does Type F really never need a parallel run?</h2>
<p><a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Self-aware limitation</a> item 3 concedes:</p>
<blockquote>
<p>The "no parallel run needed" claim does not fully hold: a multi-region rollout (which #128 lists as a Type F scenario) must use a parallel run. Both regions run at the same time and traffic is then cut over; the alternative is a downtime switch. The mechanism is the same as Type A phase 3.</p></blockquote>
<p>This article is <em>supporting evidence</em> for that concession: taking a MongoDB sharded cluster from a single DC to more shards plus a secondary DC does require a parallel run and a traffic cutover, and is partially isomorphic to a Type A phased migration:</p>
<table>
  <thead>
      <tr>
          <th>Type F assumption</th>
          <th>Single-DC re-sharding (Redis case)</th>
          <th><strong>Multi-DC expansion (this article)</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Same cluster, different state</td>
          <td>yes</td>
          <td>yes (same MongoDB cluster)</td>
      </tr>
      <tr>
          <td>No schema translation needed</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>No parallel run needed</td>
          <td>yes (slot migration completes internally)</td>
          <td><strong>no: both DCs run side by side, then traffic is cut over</strong></td>
      </tr>
      <tr>
          <td>No cleanup phase needed</td>
          <td>yes</td>
          <td>partial (the old DC is demoted to standby)</td>
      </tr>
      <tr>
          <td>Step-by-step + rollback boundary</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
  </tbody>
</table>
<p>→ The Type F anatomy still applies, but "no parallel run needed" is a <em>sub-scenario condition</em>, not a universal claim.</p>
<h2 id="兩個操作合併shard-加--dc-加">Two operations combined: adding shards + adding a DC</h2>
<p>In practice, mid-sized companies often run two topology changes <em>at the same time</em>:</p>
<ol>
<li><strong>Shard expansion</strong>: grow the existing 3-shard cluster to 5 shards and let chunk migration even out the distribution</li>
<li><strong>Multi-DC</strong>: go from a single DC (us-east-1) to multiple DCs (us-east-1 + us-west-2)</li>
</ol>
<p>The <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> for the two operations:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Adding shards (alone)</th>
          <th>Multi-DC (alone)</th>
          <th>Both together</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Low</td>
          <td>Low</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Low</td>
          <td>Medium (cross-DC ops)</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Low</td>
          <td>Low</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low (adds shards to the same cluster)</td>
          <td>Low</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Low</td>
          <td>Low-Medium (cross-DC latency aware)</td>
          <td>Low-Medium</td>
      </tr>
      <tr>
          <td><strong>Data topology</strong></td>
          <td><strong>High</strong> (sharding strategy)</td>
          <td><strong>High</strong> (replication + region)</td>
          <td><strong>High</strong> (both change: compound topology)</td>
      </tr>
  </tbody>
</table>
<p>Data topology = High is the dominant dimension for both, so the combination falls under the Type F multi-axis sub-scenario.</p>
<h2 id="pre-layout-analysis當前--目標-topology">Pre-layout analysis: current + target topology</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// 1. Current shard distribution (run through mongos)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">status</span><span class="p">();</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1">// Expected: 3 shards, each holding ~33% of chunks, no migration in progress
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="nx">db</span><span class="p">.</span><span class="nx">printShardingStatus</span><span class="p">();</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1">// Look for hot shards and imbalanced chunk distribution
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1">// 2. Replication topology (run against the replica set of each shard, not mongos)
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="nx">rs</span><span class="p">.</span><span class="nx">status</span><span class="p">();</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1">// Primary/secondary health and replication lag for each replica set
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1">// 3. Cross-DC network baseline (measure before adding the DC)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1">// us-east-1 → us-west-2 RTT and bandwidth
</span></span></span></code></pre></div><p>Pre-layout stage output:</p>
<ul>
<li><strong>Current</strong>: 3 shards × 1 replica set per shard (3 members) = 9 nodes, all in us-east-1</li>
<li><strong>Target</strong>: 5 shards × 1 replica set per shard (5 members: 3 us-east + 2 us-west) = 25 nodes</li>
<li><strong>Migration scope</strong>: add 2 shards, plus 2 us-west members on every shard, for +16 nodes in total</li>
<li><strong>Chunk migration estimate</strong>: about 40% of chunks must move (each original shard drops from ~33% to 20%, and the 2 new shards end up with 20% each)</li>
</ul>
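<p>The arithmetic behind this output can be checked with a small sketch. It assumes chunks are evenly distributed before and after the change; the object shape and function names are illustrative and are not part of any MongoDB API.</p>

```javascript
// Sketch: derive node counts and chunk movement from a topology description.
// The field names (shards, membersPerShard) are hypothetical.
function nodeCount(topology) {
  return topology.shards * topology.membersPerShard;
}

// Fraction of all chunks that must move when an evenly balanced cluster
// grows from `before` to `after` shards: the new shards end up holding
// (after - before) / after of the data, and all of it is migrated in.
function chunkFractionToMove(before, after) {
  return (after - before) / after;
}

const current = { shards: 3, membersPerShard: 3 };
const target = { shards: 5, membersPerShard: 5 };

console.log(nodeCount(current));                     // 9
console.log(nodeCount(target));                      // 25
console.log(nodeCount(target) - nodeCount(current)); // 16
console.log(chunkFractionToMove(current.shards, target.shards)); // 0.4
```

<p>Going from 3 to 5 shards therefore moves two fifths of the chunks, which is the volume the balancer has to carry.</p>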
<h2 id="re-layout-機制">Re-layout mechanism</h2>
<p>The two mechanisms proceed in parallel:</p>
<h3 id="shard-expansion-mechanism">Shard expansion mechanism</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// 1. Add the new shards to the cluster
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">addShard</span><span class="p">(</span><span class="s2">&#34;rs-shard4/host10:27017,host11:27017,host12:27017&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShard</span><span class="p">(</span><span class="s2">&#34;rs-shard5/host13:27017,host14:27017,host15:27017&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1">// 2. The balancer migrates chunks automatically
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">startBalancer</span><span class="p">();</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1">// Watch progress: db.adminCommand({balancerStatus: 1})
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1">// 3. After it finishes, verify the shard distribution
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">status</span><span class="p">();</span></span></span></code></pre></div><p>Chunk migration is a <em>background</em> job throttled by the balancer. It does not block production queries for most of its run, although writes to a chunk pause briefly while the migration of that chunk commits, and CPU / network usage rises by 30-50%.</p>
<h3 id="multi-dc-expansion-mechanism">Multi-DC expansion mechanism</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// 1. For the replica set of each shard, add a us-west-2 member (priority 0)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="nx">rs</span><span class="p">.</span><span class="nx">add</span><span class="p">({</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nx">host</span><span class="o">:</span> <span class="s2">&#34;us-west-2-host:27017&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="nx">priority</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>           <span class="c1">// cannot become primary
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span>  <span class="nx">votes</span><span class="o">:</span> <span class="mi">1</span><span class="p">,</span>              <span class="c1">// takes part in elections
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span>  <span class="nx">hidden</span><span class="o">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  <span class="nx">tags</span><span class="o">:</span> <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-west-2&#34;</span> <span class="p">}</span>  <span class="c1">// matched by readPreferenceTags in step 4
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">});</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1">// 2. Wait for initial sync to finish (1 hour to 1 day, depending on data size)
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="nx">rs</span><span class="p">.</span><span class="nx">printReplicationInfo</span><span class="p">();</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1">// 3. Once the secondary is healthy, raise its priority or votes
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1">// Do not set priority 1 right away, to avoid an unintended failover
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1">// 4. Cross-DC routing is set in the application through readPreference
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">  <span class="nx">readPreference</span><span class="o">:</span> <span class="s1">&#39;secondaryPreferred&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">  <span class="nx">readPreferenceTags</span><span class="o">:</span> <span class="p">[{</span> <span class="nx">region</span><span class="o">:</span> <span class="s1">&#39;us-west-2&#39;</span> <span class="p">},</span> <span class="p">{}],</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="p">});</span></span></span></code></pre></div><p>Key point: going multi-DC means <em>adding members incrementally</em>, not an atomic switch. Each shard is extended independently, so when the shards are processed one at a time the total duration is roughly the number of shards × the initial sync time.</p>
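<p>The duration estimate can be sketched as a function of how many shards are synced at once. This is a rough model under stated assumptions: every shard takes the same time, the members added to one shard sync concurrently, and <code>concurrency</code> is a hypothetical operator choice rather than a MongoDB setting.</p>

```javascript
// Sketch: total wall-clock hours to initial-sync all shards when
// `concurrency` shards are processed at the same time.
function totalSyncHours(shardCount, hoursPerSync, concurrency) {
  const waves = Math.ceil(shardCount / concurrency);
  return waves * hoursPerSync;
}

console.log(totalSyncHours(5, 4, 1)); // 20 (one shard at a time)
console.log(totalSyncHours(5, 4, 2)); // 12 (two shards per wave)
console.log(totalSyncHours(5, 4, 5)); // 4  (all shards at once)
```

<p>Higher concurrency shortens the calendar time but stacks the cross-DC bandwidth and sync-source load of several initial syncs on top of each other.</p>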
<h2 id="execution-flow含-parallel-run--流量切換">Execution flow (with parallel run + traffic cutover)</h2>
<p>Eight steps, including a <em>parallel run + traffic cutover</em> segment, which confirms item 3 of the <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation</a>:</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Action</th>
          <th>Parallel run?</th>
          <th>Rollback boundary</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 Pre-check</td>
          <td>Quantify the current topology and confirm the cluster is healthy</td>
          <td>no</td>
          <td>-</td>
      </tr>
      <tr>
          <td>2 Add us-east shards</td>
          <td>sh.addShard, balancer migrates chunks</td>
          <td>no (inside the cluster)</td>
          <td>removeShard, chunks migrate back</td>
      </tr>
      <tr>
          <td>3 Add us-west members</td>
          <td>rs.add a cross-DC member to every shard</td>
          <td>no</td>
          <td>rs.remove; the initial sync work is discarded</td>
      </tr>
      <tr>
          <td>4 <strong>Initial sync wait</strong></td>
          <td>Wait for every us-west member to catch up</td>
          <td><strong>parallel run starts</strong>: both DCs serve once the us-west members reach SECONDARY</td>
          <td>-</td>
      </tr>
      <tr>
          <td>5 <strong>Cross-DC dual-serve</strong></td>
          <td>Both DCs serve read traffic (writes are not moved)</td>
          <td><strong>yes, parallel run</strong>: the app reads with secondaryPreferred tagged us-west</td>
          <td>Switch readPref back to the us-east primary</td>
      </tr>
      <tr>
          <td>6 <strong>Traffic cutover</strong></td>
          <td>Application traffic originating in us-west reads from us-west</td>
          <td><strong>yes</strong></td>
          <td>Switch DNS / readPref back</td>
      </tr>
      <tr>
          <td>7 Promote us-west (optional)</td>
          <td>Raise the us-west member of one shard to priority 1</td>
          <td>post-cutover</td>
          <td>Demote priority back to 0</td>
      </tr>
      <tr>
          <td>8 Cleanup</td>
          <td>Verify, archive logs, document the new topology</td>
          <td>no</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
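<p>Steps 4-6 are the <em>parallel run + traffic cutover</em>: <strong>Type F has this exception, and it is isomorphic to the Type A phase 3 mechanism</strong>. The "Execution flow per-step" section of the anatomy must therefore include a parallel-run subsection.</p>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1balancer-跑-chunk-migration-撞-production-peak">Case 1: Balancer chunk migration collides with the production peak</h3>
<p><strong>Symptom</strong>: after the shards are added, the balancer starts migrating chunks and production write latency p99 jumps from 10ms to 100ms; the application sees a large number of timeouts.</p>
<p><strong>Root cause</strong>: by default the MongoDB balancer runs 24×7, and each chunk migration has a <em>blocking</em> step (writes to the chunk are held while the migration takes its lock and commits). The balancer does not pause on its own during production peak hours.</p>
<p><strong>Fix</strong>:</p>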
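<p>The per-step rollback boundary in the table can be sketched as a small runner: when a step fails, only that step is undone, and the flow stops at the last completed step. The step names follow the table; the <code>run</code> and <code>rollback</code> bodies are placeholders that only record what would be executed.</p>

```javascript
// Sketch: run steps in order; on failure, apply the failed step's own
// rollback and report the last completed step as the safe boundary.
function executeFlow(steps) {
  let lastCompleted = null;
  for (const step of steps) {
    try {
      step.run();
      lastCompleted = step.name;
    } catch (err) {
      if (step.rollback) step.rollback();
      return { ok: false, failedAt: step.name, resumeFrom: lastCompleted, reason: err.message };
    }
  }
  return { ok: true, failedAt: null, resumeFrom: lastCompleted, reason: null };
}

const log = [];
const steps = [
  { name: "3 add us-west members",
    run: () => log.push("rs.add"),
    rollback: () => log.push("rs.remove") },
  { name: "5 cross-DC dual-serve",
    run: () => { throw new Error("replication lag too high"); },
    rollback: () => log.push("readPref -> us-east primary") },
  { name: "6 traffic cutover",
    run: () => log.push("cutover"),
    rollback: () => log.push("DNS / readPref back") },
];

const result = executeFlow(steps);
console.log(result.failedAt);   // 5 cross-DC dual-serve
console.log(result.resumeFrom); // 3 add us-west members
console.log(log);               // [ 'rs.add', 'readPref -> us-east primary' ]
```

<p>A failure during dual-serve switches reads back to the primary but keeps the us-west members in place, so the parallel run can be retried without repeating the initial sync.</p>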





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">// Restrict the balancer to a low-traffic window (the setting lives in the config database)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">setBalancerState</span><span class="p">(</span><span class="kc">true</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nx">db</span><span class="p">.</span><span class="nx">getSiblingDB</span><span class="p">(</span><span class="s2">&#34;config&#34;</span><span class="p">).</span><span class="nx">settings</span><span class="p">.</span><span class="nx">updateOne</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  <span class="p">{</span> <span class="nx">_id</span><span class="o">:</span> <span class="s2">&#34;balancer&#34;</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">  <span class="p">{</span> <span class="nx">$set</span><span class="o">:</span> <span class="p">{</span> <span class="nx">activeWindow</span><span class="o">:</span> <span class="p">{</span> <span class="nx">start</span><span class="o">:</span> <span class="s2">&#34;02:00&#34;</span><span class="p">,</span> <span class="nx">stop</span><span class="o">:</span> <span class="s2">&#34;06:00&#34;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  <span class="p">{</span> <span class="nx">upsert</span><span class="o">:</span> <span class="kc">true</span> <span class="p">}</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="p">);</span></span></span></code></pre></div><p>Also set a smaller <code>chunkSize</code> (128MB → 64MB) so each migration step is finer-grained and each lock is held for less time.</p>
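<p>Before relying on a window like this, it helps to check which clock times fall inside it, especially for windows that wrap past midnight. The helper below is a minimal sketch for planning purposes only; it treats the start as inclusive and the stop as exclusive, which is an assumption and not a statement about how the balancer evaluates the boundary.</p>

```javascript
// Sketch: does a "HH:MM" time fall inside an activeWindow-style range?
function toMinutes(hhmm) {
  const [hours, minutes] = hhmm.split(":").map(Number);
  return hours * 60 + minutes;
}

function inActiveWindow(now, start, stop) {
  const n = toMinutes(now);
  const s = toMinutes(start);
  const e = toMinutes(stop);
  if (s < e) {
    return n >= s && n < e; // window within one day, e.g. 02:00-06:00
  }
  return n >= s || n < e;   // window wraps past midnight, e.g. 22:00-04:00
}

console.log(inActiveWindow("03:30", "02:00", "06:00")); // true
console.log(inActiveWindow("12:00", "02:00", "06:00")); // false
console.log(inActiveWindow("23:30", "22:00", "04:00")); // true
console.log(inActiveWindow("05:00", "22:00", "04:00")); // false
```

<p>Comparing the result against the traffic peak hours of the application shows whether the chosen window really avoids them.</p>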
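<h3 id="case-2cross-dc-initial-sync-期間-oplog-跑出窗口">Case 2: The oplog window is overrun during cross-DC initial sync</h3>
<p><strong>Symptom</strong>: after a us-west member is added, initial sync runs for 4 hours; when it ends the member reports "too stale to catch up" and needs a full re-sync.</p>
<p><strong>Root cause</strong>: the MongoDB oplog is a capped collection whose default size is 5% of free disk space. During the 4-hour initial sync the primary wrote more than the oplog could retain, so the oplog start point the member needed had already been overwritten.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Enlarge the oplog in advance</strong>: <code>db.adminCommand({replSetResizeOplog: 1, size: 51200})</code> (the size is in MB) raises it to 50GB so that it covers the sync window</li>
<li><strong>Off-peak initial sync</strong>: run it during low-traffic hours, when the oplog fills more slowly</li>
<li><strong>Manual initial sync via snapshot</strong>: seed the new member from a filesystem snapshot of the data files of an existing member rather than from <code>mongodump</code> (a dump cannot be used to seed a replica set member); the member then only has to replay the oplog written since the snapshot</li>
</ol>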
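<p>Whether a given oplog size covers the sync window is simple arithmetic once the oplog write rate is known. The sketch below assumes a hypothetical rate of 5000 MB of oplog per hour and a safety factor of 2; both numbers are placeholders to be replaced with measured values.</p>

```javascript
// Sketch: size the oplog so its retention window outlasts an initial sync.
function requiredOplogMB(oplogMBPerHour, syncHours, safetyFactor = 2) {
  return Math.ceil(oplogMBPerHour * syncHours * safetyFactor);
}

// How many hours of writes a given oplog size can retain.
function oplogWindowHours(oplogSizeMB, oplogMBPerHour) {
  return oplogSizeMB / oplogMBPerHour;
}

console.log(requiredOplogMB(5000, 4));      // 40000
console.log(oplogWindowHours(51200, 5000)); // 10.24
```

<p>Under these assumed numbers a 50GB oplog retains a little over 10 hours of writes, which leaves headroom over a 4-hour initial sync.</p>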
<h3 id="case-3跨-dc-read-路由錯誤stale-data-影響業務">Case 3: Misrouted cross-DC reads, stale data hurts the business</h3>
<p><strong>Symptom</strong>: after traffic is cut over to us-west, the application occasionally reads data that is 5-30 seconds stale; customers report "I just changed a setting, and after a refresh it changed back".</p>
<p><strong>Root cause</strong>: the us-west members are secondaries with 5-30 seconds of replication lag; the application sets readPreference to <code>secondaryPreferred</code> without <code>maxStalenessSeconds</code>, so it may read from a badly stale member.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  <span class="nx">readPreference</span><span class="o">:</span> <span class="s1">&#39;secondaryPreferred&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nx">readPreferenceTags</span><span class="o">:</span> <span class="p">[{</span> <span class="nx">region</span><span class="o">:</span> <span class="s1">&#39;us-west-2&#39;</span> <span class="p">},</span> <span class="p">{}],</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="nx">maxStalenessSeconds</span><span class="o">:</span> <span class="mi">90</span><span class="p">,</span>  <span class="c1">// cap staleness at 90 seconds (the minimum allowed value)
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="p">});</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1">// Force primary reads where strict consistency is required
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="nx">client_strict</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="nx">readPreference</span><span class="o">:</span> <span class="s1">&#39;primary&#39;</span><span class="p">,</span>  <span class="c1">// always read from the us-east primary
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="p">});</span></span></span></code></pre></div><p>Because 90 seconds is the smallest value <code>maxStalenessSeconds</code> accepts, it only filters out badly lagging members; reads that cannot tolerate the 5-30 second lag must go to the primary. The application-level read pattern must therefore distinguish "accept stale read" from "require fresh read" rather than rely on one cluster-level setting.</p>
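<p>One way to keep that distinction explicit in application code is a small helper that returns read options per requirement. This is a sketch: the function name and the <code>"fresh"</code> / <code>"stale-ok"</code> labels are hypothetical, while the option names are the ones used in the snippets above.</p>

```javascript
// Sketch: choose MongoClient read options from the freshness requirement.
const MIN_MAX_STALENESS_SECONDS = 90; // smallest value MongoDB accepts

function readOptionsFor(requirement, localRegion, maxStalenessSeconds = 90) {
  if (requirement === "fresh") {
    // Read-your-own-writes paths (e.g. a settings page) go to the primary.
    return { readPreference: "primary" };
  }
  if (maxStalenessSeconds < MIN_MAX_STALENESS_SECONDS) {
    throw new RangeError("maxStalenessSeconds must be at least 90");
  }
  return {
    readPreference: "secondaryPreferred",
    // Prefer members tagged with the local region, then fall back to any member.
    readPreferenceTags: [{ region: localRegion }, {}],
    maxStalenessSeconds,
  };
}

console.log(readOptionsFor("fresh", "us-west-2"));
console.log(readOptionsFor("stale-ok", "us-west-2"));
```

<p>The returned object can be spread into the options passed to <code>new MongoClient(uri, options)</code>, with one client per read pattern.</p>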
<h3 id="case-4shard-tag-aware-routing-沒設cross-dc-traffic-爆-cost">Case 4: Shard tag-aware routing not configured, cross-DC traffic blows up cost</h3>
<p><strong>Symptom</strong>: after 1 month of multi-DC operation, the AWS egress cost rose from $500 / month to $8000 / month; 99% of traffic was still crossing DCs, us-east → us-west.</p>
<p><strong>Root cause</strong>: the sharded cluster has no <em>zone sharding</em>, so the application has no way of knowing which chunks live in which DC; every query defaults to the us-east primary and cross-DC bandwidth explodes.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// Note: zone API, available since MongoDB 3.4; sh.addShardTag / sh.addTagRange are the legacy names
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1">// Use sh.addShardToZone / sh.updateZoneKeyRange instead
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1">// 1. Assign each shard to a zone
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard1&#34;</span><span class="p">,</span> <span class="s2">&#34;us-east&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard2&#34;</span><span class="p">,</span> <span class="s2">&#34;us-east&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard3&#34;</span><span class="p">,</span> <span class="s2">&#34;us-east&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard4&#34;</span><span class="p">,</span> <span class="s2">&#34;us-west&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard5&#34;</span><span class="p">,</span> <span class="s2">&#34;us-west&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1">// 2. Add zone ranges on the collection (assumes the shard key is { region: 1, _id: 1 })
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">updateZoneKeyRange</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">  <span class="s2">&#34;myapp.events&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-east&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MinKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-east&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MaxKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="s2">&#34;us-east&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="p">);</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">updateZoneKeyRange</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">  <span class="s2">&#34;myapp.events&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-west&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MinKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-west&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MaxKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">  <span class="s2">&#34;us-west&#34;</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="p">);</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1">// 3. The balancer redistributes chunks to the matching zones
</span></span></span></code></pre></div><p>Zone sharding is a required part of a multi-DC design; without it you pay egress cost for nothing.</p>
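<h3 id="case-5failover-後跨-dc-primary-切換application-連線中斷">Case 5: Cross-DC primary switch after failover, application connections drop</h3>
<p><strong>Symptom</strong>: after 6 months in production, a us-east-1 outage moves the primary of one shard to a us-west member; the application sees a burst of connection errors for 5-10 seconds.</p>
<p><strong>Root cause</strong>: the default election timeout of a replica set (<code>electionTimeoutMillis</code>) is 10 seconds, and the application had no server selection retry configured, so clients did not reconnect while the primary was changing.</p>
<p><strong>Fix</strong>:</p>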





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln">1</span><span class="cl"><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nx">serverSelectionTimeoutMS</span><span class="o">:</span> <span class="mi">30000</span><span class="p">,</span>    <span class="c1">// wait up to 30 s so an election can finish (also the driver default)
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span>  <span class="nx">retryWrites</span><span class="o">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  <span class="nx">retryReads</span><span class="o">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">  <span class="nx">heartbeatFrequencyMS</span><span class="o">:</span> <span class="mi">5000</span><span class="p">,</span>         <span class="c1">// detect topology changes sooner (driver default is 10000)
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="p">});</span></span></span></code></pre></div><p>In addition, a multi-DC replica set should use <em>priority asymmetry</em>: give us-east members priority 2 and us-west members priority 1, so the primary stays in us-east in normal operation and moves to us-west automatically only during a disaster.</p>
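<p>The priority asymmetry above can be sketched as a small helper that derives each member's <code>priority</code> from a region tag. This is a minimal sketch, not the author's tooling: the helper name <code>withRegionPriorities</code>, the host names, and the <code>tags.region</code> values are all assumptions made for illustration.</p>

```javascript
// Sketch: derive replica set member priorities from a region tag.
// Host names and tag values below are hypothetical.
function withRegionPriorities(members, preferredRegion) {
  return members.map((member) => ({
    ...member,
    // A higher priority makes a member more likely to be elected primary.
    priority: member.tags.region === preferredRegion ? 2 : 1,
  }));
}

const shardMembers = [
  { _id: 0, host: "east-a.example.net:27017", tags: { region: "us-east" } },
  { _id: 1, host: "east-b.example.net:27017", tags: { region: "us-east" } },
  { _id: 2, host: "east-c.example.net:27017", tags: { region: "us-east" } },
  { _id: 3, host: "west-a.example.net:27017", tags: { region: "us-west" } },
  { _id: 4, host: "west-b.example.net:27017", tags: { region: "us-west" } },
];

const prioritizedMembers = withRegionPriorities(shardMembers, "us-east");
console.log(prioritizedMembers.map((m) => `${m.host} priority=${m.priority}`));
```

<p>The resulting array has the shape of the <code>members</code> field of a replica set configuration, so it could be reviewed before being applied through a reconfiguration; the input objects are left unmodified.</p>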
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Single-DC 3-shard</th>
          <th>Multi-DC 5-shard</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Node count</td>
          <td>9</td>
          <td>25</td>
          <td>~3x infrastructure cost</td>
      </tr>
      <tr>
          <td>Storage redundancy</td>
          <td>3 replica</td>
          <td>5 replica (3 east + 2 west)</td>
          <td>+2 copies, storage cost about +67%</td>
      </tr>
      <tr>
          <td>Network egress</td>
          <td>Intra-VPC, low</td>
          <td>Cross-DC, high (needs zone sharding)</td>
          <td>$500 → $8000 / month if no zone sharding</td>
      </tr>
      <tr>
          <td>Latency p99 (write)</td>
          <td>5-10ms</td>
          <td>5-15ms (primary stays in us-east)</td>
          <td>Slightly higher</td>
      </tr>
      <tr>
          <td>Latency p99 (read)</td>
          <td>5-10ms</td>
          <td>2-5ms (local DC)</td>
          <td>Local-region reads get faster under multi-DC</td>
      </tr>
      <tr>
          <td>Disaster recovery</td>
          <td>RTO 30 minutes (rebuild)</td>
          <td>RTO &lt; 1 minute (auto failover)</td>
          <td>Major improvement</td>
      </tr>
      <tr>
          <td>Operational complexity</td>
          <td>Low</td>
          <td>High (zone sharding / DR drill)</td>
          <td>+1 SRE FTE for maintenance</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading</strong>: multi-DC is a <em>DR investment</em>, not a cost optimization; it pays off only when the <em>availability SLA exceeds 99.9% or compliance requires it</em>.</p>
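<p>The node-count and storage rows of the table can be re-derived with a few lines of arithmetic. This is a rough sketch under two assumptions that the table implies but does not state: every shard has the same number of members, and every member holds a full copy of its shard. The function name <code>compareTopologies</code> is hypothetical.</p>

```javascript
// Sketch: compare two sharded topologies by node count and copies per shard.
// Assumes uniform shards and one full data copy per replica set member.
function compareTopologies(before, after) {
  const nodeCount = (t) => t.shards * t.membersPerShard;
  return {
    nodesBefore: nodeCount(before),
    nodesAfter: nodeCount(after),
    nodeRatio: nodeCount(after) / nodeCount(before),
    extraCopiesPct:
      ((after.membersPerShard - before.membersPerShard) / before.membersPerShard) * 100,
  };
}

const topologyDelta = compareTopologies(
  { shards: 3, membersPerShard: 3 }, // single-DC, 3 shards
  { shards: 5, membersPerShard: 5 }  // multi-DC, 5 shards (3 east + 2 west members each)
);
console.log(topologyDelta);
```

<p>With these inputs the node count goes from 9 to 25 (a ratio just under 3) and each shard keeps two extra copies, roughly two thirds more storage per shard.</p>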
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-mongodb--atlas-migration-對位">Relation to <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas migration</a></h3>
<p>Self-managed multi-DC is operationally complex, while Atlas reduces multi-cluster plus cross-region setup to UI configuration; if you are going multi-DC anyway, consider migrating straight to Atlas.</p>
<h3 id="跟-application-read-pattern-整合">Integration with application read patterns</h3>
<p>Zone sharding and readPreference are tightly coupled with application logic. They cannot be bolted on afterwards, so region-aware routing on the application side should be designed during the multi-DC design phase.</p>
<h3 id="跟-cassandra-keyspace-re-balance-對比">Comparison with <a href="https://cassandra.apache.org/">Cassandra keyspace re-balance</a></h3>
<p>Cassandra is another typical Type F multi-DC case. It uses <em>NetworkTopologyStrategy with a replication factor per DC</em>, which is conceptually equivalent to MongoDB zone sharding but works through a completely different mechanism. Reviewer D listed Cassandra as the Type F counterexample; this article verifies the same point with MongoDB instead.</p>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Cross-region active-active</strong>: MongoDB does not support multi-primary, so cross-region active-active needs application-level conflict resolution</li>
<li><strong>PostgreSQL Citus / CockroachDB multi-region</strong> comparison: distributed SQL takes a different design approach to multi-region</li>
<li><strong>Cost optimization</strong>: cross-DC egress is a long-term concern and still needs a quarterly review after zone sharding is in place</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a></li>
<li>Parallel migration playbook: <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a></li>
<li>Parallel Type F dogfood: <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis Cluster Re-sharding</a> (dogfood #1) / <a href="/blog/backend/01-database/vendors/postgresql/partition-redesign/" data-link-title="PostgreSQL Partition Redesign：當 monthly partition 越跑越慢" data-link-desc="PostgreSQL partition redesign 是 Type F「topology re-layout」第 2 個 dogfood — 從 monthly partition 改 daily / 從 range 改 list / 從單軸改 sub-partition；6 維 audit 皆 Low &#43; topology 軸 High；涵蓋 partition 不平衡偵測、ATTACH/DETACH 線上重劃、5 個 production 踩雷、跟 partition_pruning &#43; autovacuum 整合">PostgreSQL Partition Redesign</a> (dogfood #2)</li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Data topology is the 6th audit dimension</a> (this article verifies point 3 of its self-aware limitations)</li>
</ul>
]]></content:encoded></item><item><title>Redis Cluster Re-sharding: source = target, but the topology is redrawn, in a 5-stage flow</title><link>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/</guid><description>&lt;blockquote>
&lt;p>This is an implementation-layer deep article under the &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis&lt;/a> overview. It is the third piece of evidence (capacity re-planning / re-sharding) for the "when not to apply it" section of the &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a>: source and target are the same vendor and the same cluster, but the &lt;em>data topology is redrawn&lt;/em>, which falls outside the 5 types.&lt;/p>&lt;/blockquote>
&lt;h2 id="source--target但-topology-重劃">Source = Target, but the topology is redrawn&lt;/h2>
&lt;p>Migration usually assumes that &lt;em>source and target are different clusters or vendors&lt;/em>; re-sharding is &lt;em>slot redistribution inside one cluster&lt;/em>, where source and target are &lt;em>different states of the same Redis Cluster&lt;/em>:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Before re-shard:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> Cluster A: [node1: slots 0-5460] [node2: slots 5461-10921] [node3: slots 10922-16383]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> ~ 33% load ~ 50% load ~ 17% load (heavy imbalance)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">After re-shard:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl"> Cluster A: [node1: slots 0-4095] [node2: slots 4096-8191] [node3: slots 8192-12287] [node4: slots 12288-16383]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl"> ~ 25% load ~ 25% load ~ 25% load ~ 25% load&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Source and target are the &lt;em>same cluster&lt;/em>; the only difference is the &lt;em>slot-to-node mapping&lt;/em>. The application connection string, the cluster API, and the data model all stay the same. But application behavior &lt;em>during slot migration&lt;/em> differs a lot from &lt;em>normal operation&lt;/em>, and managing that is the main work of re-sharding.&lt;/p>
&lt;p>Running the &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> against Redis cluster re-sharding:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is an implementation-layer deep article under the <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis</a> overview. It is the third piece of evidence (capacity re-planning / re-sharding) for the "when not to apply it" section of the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a>: source and target are the same vendor and the same cluster, but the <em>data topology is redrawn</em>, which falls outside the 5 types.</p></blockquote>
<h2 id="source--target但-topology-重劃">Source = Target, but the topology is redrawn</h2>
<p>Migration usually assumes that <em>source and target are different clusters or vendors</em>; re-sharding is <em>slot redistribution inside one cluster</em>, where source and target are <em>different states of the same Redis Cluster</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Before re-shard:
</span></span><span class="line"><span class="ln">2</span><span class="cl">  Cluster A: [node1: slots 0-5460] [node2: slots 5461-10921] [node3: slots 10922-16383]
</span></span><span class="line"><span class="ln">3</span><span class="cl">              ~ 33% load           ~ 50% load              ~ 17% load (heavy imbalance)
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl">After re-shard:
</span></span><span class="line"><span class="ln">6</span><span class="cl">  Cluster A: [node1: slots 0-4095] [node2: slots 4096-8191] [node3: slots 8192-12287] [node4: slots 12288-16383]
</span></span><span class="line"><span class="ln">7</span><span class="cl">              ~ 25% load           ~ 25% load              ~ 25% load              ~ 25% load</span></span></code></pre></div><p>Source and target are the <em>same cluster</em>; the only difference is the <em>slot-to-node mapping</em>. The application connection string, the cluster API, and the data model all stay the same. But application behavior <em>during slot migration</em> differs a lot from <em>normal operation</em>, and managing that is the main work of re-sharding.</p>
<p>Running the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> against Redis cluster re-sharding:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same Redis, unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Same Redis Cluster, operations unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Same Redis Cluster, no paradigm difference</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>Still one (the cluster)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Mostly unchanged; cluster-mode clients handle it themselves</td>
          <td>Low</td>
      </tr>
      <tr>
          <td><strong>Data topology</strong></td>
          <td><strong>Redrawn</strong>: slot mapping and node count</td>
          <td><strong>New axis</strong></td>
      </tr>
  </tbody>
</table>
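<p>All five dimensions are Low, which would map to Type B drop-in; but <em>data topology</em> is a <em>sixth dimension</em> that the 5 types do not cover. This article therefore uses a <em>re-sharding-specific structure</em> rather than any of the 5 types.</p>
<h2 id="4-種-re-sharding-driver">4 re-sharding drivers</h2>
<p>Different drivers call for different re-sharding strategies:</p>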
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger scenario</th>
          <th>Re-sharding operation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Slot imbalance</td>
          <td>Business hot spots hit a subset of slots; a single node runs at 80%+ CPU / memory</td>
          <td>Rebalance (redistribute slots, no new node)</td>
      </tr>
      <tr>
          <td>Capacity expansion</td>
          <td>Whole-cluster memory / throughput is close to its limit; nodes must be added</td>
          <td>Add node + slot migration (move some slots from existing nodes to the new one)</td>
      </tr>
      <tr>
          <td>Node decommission</td>
          <td>Old node hardware retired / cloud instance generation change</td>
          <td>Drain (move all of the node's slots away) + remove</td>
      </tr>
      <tr>
          <td>Hash tag refactor</td>
          <td>Business access pattern changed; co-located key groups need regrouping</td>
          <td>Application-side migration (not cluster-level)</td>
      </tr>
  </tbody>
</table>
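<p>The first three are cluster-internal and are done with the <code>redis-cli --cluster</code> tool; the fourth needs application-side dual-write plus migration and is not covered here.</p>
<h2 id="slot-migration-機制">Slot migration mechanism</h2>
<p>Redis Cluster has 16384 slots, and every key maps to a slot through <code>CRC16(key) % 16384</code>; when the key contains a hash tag such as <code>{user:123}</code>, only the tag is hashed. Slot migration proceeds as follows:</p>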





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Source node:     [slot N: MIGRATING to dest]
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">Dest node:       [slot N: IMPORTING from source]
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">                 ↓
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">Source node:     CLUSTER GETKEYSINSLOT N → for each key:
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                 1. DUMP key (serialize value)
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">                 2. send to dest via MIGRATE command
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">                 3. dest RESTORE key
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">                 4. source DEL key
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">                 ↓
</span></span><span class="line"><span class="ln">10</span><span class="cl">Source node:     [slot N: OWNED by dest]
</span></span><span class="line"><span class="ln">11</span><span class="cl">Dest node:       [slot N: OWNED]
</span></span><span class="line"><span class="ln">12</span><span class="cl">                 ↓
</span></span><span class="line"><span class="ln">13</span><span class="cl">Cluster-wide broadcast: slot N belongs to dest</span></span></code></pre></div><p>Client behavior during the migration:</p>
<ul>
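<li>Key still on the source (not yet migrated): the source serves it directly</li>
<li>Key already on dest (migrated): the source replies with an <code>-ASK</code> redirect and the client resends the command to dest</li>
<li>New key written to a MIGRATING slot: the key does not exist on the source, so the source also replies with <code>-ASK</code> and the key is created on dest; the source takes no new keys for that slot</li>
<li>No application code change is needed; cluster-aware clients handle the <code>-ASK</code> redirect automatically</li>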
</ul>
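<p>The key-to-slot mapping described above can be reproduced on the client side. The sketch below assumes the CRC16 variant is XMODEM (polynomial 0x1021, initial value 0), which is the variant the Redis Cluster specification documents, and applies the hash tag rule before hashing. The function names are illustrative, not part of any client library.</p>

```javascript
// CRC16/XMODEM: polynomial 0x1021, initial value 0, no bit reflection.
function crc16(bytes) {
  let crc = 0;
  for (const byte of bytes) {
    crc ^= byte << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) : (crc << 1);
      crc &= 0xffff; // keep the register at 16 bits
    }
  }
  return crc;
}

// Hash tag rule: if the key has a non-empty {...} section, hash only that part.
function hashedPart(key) {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}

function keySlot(key) {
  return crc16(new TextEncoder().encode(hashedPart(key))) % 16384;
}

console.log(keySlot("foo"));
console.log(keySlot("{user:123}:cart"), keySlot("{user:123}:profile"));
```

<p>Two keys that share a hash tag always land in the same slot, which is what makes multi-key operations on them possible in cluster mode.</p>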
<h2 id="redis-cli-cluster-工具">The redis-cli --cluster tool</h2>
<p>In production, use the official tool; do not hand-write slot migration:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. Rebalance (redistribute slots; suited to imbalance)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">redis-cli --cluster rebalance 10.0.0.1:6379 <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  --cluster-use-empty-masters <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  --cluster-threshold <span class="m">5</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 2. Reshard (explicit source → target; suited to capacity expansion)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">redis-cli --cluster reshard 10.0.0.1:6379 <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --cluster-from &lt;source-node-id&gt; <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  --cluster-to &lt;dest-node-id&gt; <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --cluster-slots <span class="m">4096</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --cluster-yes
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># 3. Add-node (join a new, empty master; pass --cluster-slave --cluster-master-id &lt;id&gt; instead to add a replica)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">redis-cli --cluster add-node 10.0.0.4:6379 10.0.0.1:6379
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"># 4. Del-node (remove a node; drain its slots first)</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">redis-cli --cluster del-node 10.0.0.1:6379 &lt;node-to-remove&gt;</span></span></code></pre></div><p>Key points:</p>
<ul>
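<li><code>--cluster-threshold 5</code>: rebalance only when a node's slot count deviates from its expected share by more than 5%, which avoids repeated triggering; the threshold is on slot counts, not traffic load</li>
<li><code>--cluster-slots</code>: how many slots to move in one reshard run; too many keeps the cluster in migration for a long time, too few means many more runs</li>
<li>The cluster keeps serving traffic during rebalance / reshard, but <em>latency rises</em> (migration overhead)</li>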
</ul>
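<p>To make the threshold idea concrete, here is a minimal sketch of a slot-count deviation check. It assumes all masters carry equal weight and is only an approximation of the idea behind <code>--cluster-threshold</code>; the exact calculation inside redis-cli may differ. The node names and the function name are hypothetical.</p>

```javascript
const TOTAL_SLOTS = 16384;

// Sketch: list nodes whose slot count deviates from an equal share
// by more than thresholdPct percent. Assumes equal node weights.
function nodesOverThreshold(slotCounts, thresholdPct) {
  const expected = TOTAL_SLOTS / Object.keys(slotCounts).length;
  return Object.entries(slotCounts)
    .map(([node, slots]) => ({
      node,
      slots,
      deviationPct: ((slots - expected) / expected) * 100,
    }))
    .filter((entry) => Math.abs(entry.deviationPct) > thresholdPct);
}

// Three balanced masters: nothing to do.
const balanced = nodesOverThreshold({ node1: 5461, node2: 5461, node3: 5462 }, 5);
// Same cluster right after an empty node4 joined: every node is off target.
const afterAddNode = nodesOverThreshold(
  { node1: 5461, node2: 5461, node3: 5462, node4: 0 },
  5
);
console.log(balanced.length, afterAddNode.map((n) => n.node));
```

<p>The balanced case is worth noticing: a cluster can be even in slot counts and still uneven in load, so a slot-based threshold does not detect hot slots.</p>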
<h2 id="5-段執行流程">5-stage execution flow</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">1. Pre-resharding analysis
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   - Current slot distribution and load
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">   - Key count per slot (CLUSTER COUNTKEYSINSLOT) and hot key identification
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">   - Estimate migration time
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">2. Backup checkpoint
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   - BGSAVE on all master
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">   - Confirm replicas are keeping up (replication offset diff &lt; 10MB)
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">3. Execute re-sharding
</span></span><span class="line"><span class="ln">11</span><span class="cl">   - Use the redis-cli --cluster tool
</span></span><span class="line"><span class="ln">12</span><span class="cl">   - Monitor cluster health (CLUSTER INFO + CLUSTER NODES)
</span></span><span class="line"><span class="ln">13</span><span class="cl">   - Compare application-side latency against the baseline during migration
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl">4. Verify
</span></span><span class="line"><span class="ln">16</span><span class="cl">   - Slot distribution matches the expected mapping
</span></span><span class="line"><span class="ln">17</span><span class="cl">   - Application traffic pattern matches the baseline
</span></span><span class="line"><span class="ln">18</span><span class="cl">   - Run a cross-node sanity check
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">5. Cleanup
</span></span><span class="line"><span class="ln">21</span><span class="cl">   - Reset / release old nodes (if decommissioning)
</span></span><span class="line"><span class="ln">22</span><span class="cl">   - Update monitoring dashboards (Prometheus target / Grafana panel)
</span></span><span class="line"><span class="ln">23</span><span class="cl">   - Document new topology</span></span></code></pre></div><p>End to end this takes 1-7 days; the migration step itself scales with cluster size (about 1 hour for 10GB, 1-3 days at TB scale).</p>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1cluster-busy-期間-application-timeout">Case 1: Application timeouts while the cluster is busy</h3>
<p><strong>Symptom</strong>: partway through re-sharding, the application starts seeing many <code>CLUSTER BUSY</code> errors and <code>OOM</code> warnings, and latency p99 jumps from 5ms to 200-2000ms; some batch operations fail outright.</p>
<p><strong>Root cause</strong>: the MIGRATE command is <em>blocking</em> for each key (DUMP + send + RESTORE + DEL run atomically). Migrating a large value (a HASH / SORTED SET / LIST with 100K+ entries) can block the node for seconds, and every other query stalls in the meantime.</p>
<p><strong>Fix</strong>:</p>
<ol>
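<li><strong>Pre-resharding audit</strong>: run <code>MEMORY USAGE</code> on sampled keys, find <em>fat keys</em> over 1MB, and list them for separate handling</li>
<li><strong>Tune the MIGRATE timeout</strong>: pass <code>--cluster-timeout 10000</code> (10s) to <code>redis-cli --cluster reshard</code> / <code>rebalance</code> so that a single stuck key migration fails fast instead of stalling the cluster</li>
<li><strong>Lower the batch size</strong>: <code>--cluster-pipeline 1</code> moves one key per MIGRATE call (default 10), which reduces CPU pressure and the blocking time of each call</li>
<li><strong>Fat key refactor</strong>: production should not carry collections with 1M+ entries; split them</li>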
</ol>
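<p>The audit step can be sketched as a filter over sampled key sizes. This assumes the sizes were already collected elsewhere (for example by running <code>MEMORY USAGE</code> on a sample of keys); the sample data, key names, and function name here are made up for illustration.</p>

```javascript
const ONE_MB = 1024 * 1024;

// Sketch: pick out "fat keys" from pre-collected samples, largest first.
// Each sample is { key, bytes }, where bytes is a MEMORY USAGE result.
function findFatKeys(samples, limitBytes = ONE_MB) {
  return samples
    .filter((sample) => sample.bytes > limitBytes)
    .sort((a, b) => b.bytes - a.bytes);
}

// Hypothetical sample data.
const sampledKeys = [
  { key: "session:1001", bytes: 512 },
  { key: "leaderboard:global", bytes: 48 * ONE_MB },
  { key: "cart:{user:123}", bytes: 20 * 1024 },
  { key: "feed:timeline:all", bytes: 3 * ONE_MB },
];

const fatKeys = findFatKeys(sampledKeys);
console.log(fatKeys.map((k) => `${k.key} ${(k.bytes / ONE_MB).toFixed(1)}MB`));
```

<p>Sorting largest first puts the keys most likely to block a node during MIGRATE at the top of the handling list.</p>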
<h3 id="case-2replica-lag-during-re-sharding">Case 2: Replica lag during re-sharding</h3>
<p><strong>Symptom</strong>: after the reshard completes, replicas show stale data for several minutes, and application reads from replicas return old values.</p>
<p><strong>Root cause</strong>: slot migration on the masters generates a large volume of <code>DEL</code> and <code>RESTORE</code> commands; the replication stream balloons, replicas fall behind, and lag accumulates.</p>
<p><strong>Fix</strong>:</p>
<ol>
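<li><strong>Confirm replica lag &lt; 5MB before re-sharding</strong>; otherwise fix the replica issue first</li>
<li><strong>Throttle migration</strong>: lower <code>--cluster-pipeline</code> to slow the write rate on the masters; note that <code>--cluster-replace</code> only makes MIGRATE overwrite keys that already exist on the target and does not throttle anything</li>
<li><strong>Application-side read-write split policy</strong>: force reads to the master during the reshard and temporarily give up replica reads</li>
<li><strong>Contingency plan</strong>: if lag stays above 30s for 5+ minutes, consider pausing the reshard and waiting for replicas to catch up</li>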
</ol>
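<p>The contingency rule (lag above 30s sustained for 5 minutes or more) can be written down as a small decision function. This is a sketch under the assumption that lag is sampled periodically by some monitoring job; the sample shape <code>{ atSec, lagSec }</code> and the function name are hypothetical.</p>

```javascript
// Sketch: decide whether to pause a reshard based on replica lag samples.
// lagSamples must be ordered by time; atSec is seconds since monitoring began.
function shouldPauseReshard(lagSamples, { maxLagSec = 30, sustainedSec = 300 } = {}) {
  let breachStart = null;
  for (const { atSec, lagSec } of lagSamples) {
    if (lagSec > maxLagSec) {
      if (breachStart === null) breachStart = atSec;
      if (atSec - breachStart >= sustainedSec) return true;
    } else {
      breachStart = null; // lag recovered, restart the clock
    }
  }
  return false;
}

// Lag stays at 45s for a full five minutes: pause.
const sustainedLag = [0, 60, 120, 180, 240, 300].map((atSec) => ({ atSec, lagSec: 45 }));
// Lag dips back under the limit at 180s: keep going.
const recoveredLag = sustainedLag.map((s) => (s.atSec === 180 ? { ...s, lagSec: 10 } : s));

console.log(shouldPauseReshard(sustainedLag), shouldPauseReshard(recoveredLag));
```

<p>Resetting the clock on recovery keeps short lag spikes from pausing a migration that is otherwise healthy.</p>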
<h3 id="case-3client-side-topology-cache-stale">Case 3: Stale client-side topology cache</h3>
<p><strong>Symptom</strong>: after the reshard, the application keeps reporting <code>MOVED &lt;slot&gt; &lt;new-node&gt;</code> redirects and is redirected again about every 30s; some clients get connection refused outright because they are still connecting to a decommissioned node.</p>
<p><strong>Root cause</strong>: cluster-aware clients (lettuce / Jedis in cluster mode) keep a <em>topology cache</em> and do not proactively refresh it after a reshard; they refresh once on a MOVED, but may keep using the old mapping within the cache TTL.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Client config</strong>: in lettuce, set a shorter refresh interval (60s) through <code>clusterTopologyRefreshOptions(...)</code> together with <code>enablePeriodicRefresh()</code></li>
<li><strong>Trigger a refresh after the reshard</strong>: the application can issue <code>CLUSTER NODES</code> itself to fetch the latest topology instead of relying on the client library's automatic refresh</li>
<li><strong>Graceful client shutdown / restart</strong>: for latency-sensitive services, do a rolling restart of application pods after the reshard to avoid stale caches</li>
<li><strong>Keep decommissioned nodes up for 5 minutes</strong>: do not stop the node immediately; give stale clients a chance to retry naturally</li>
</ol>
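<p>What a cluster-aware client does with a redirect can be sketched in a few lines. This is an illustration of the idea only, not the code of lettuce or Jedis: it assumes redirect errors arrive as plain strings of the form <code>MOVED &lt;slot&gt; &lt;host&gt;:&lt;port&gt;</code> or <code>ASK &lt;slot&gt; &lt;host&gt;:&lt;port&gt;</code>, and the cache is a plain <code>Map</code> from slot to address.</p>

```javascript
// Sketch: parse a Redis Cluster redirect error message.
function parseRedirect(message) {
  const match = /^(MOVED|ASK) (\d+) (\S+):(\d+)$/.exec(message.trim());
  if (!match) return null;
  const [, kind, slot, host, port] = match;
  return { kind, slot: Number(slot), host, port: Number(port) };
}

// MOVED means the slot has a new owner, so the cached mapping is updated.
// ASK applies to one request only, so the cache is left untouched.
function applyRedirect(slotCache, redirect) {
  if (redirect && redirect.kind === "MOVED") {
    slotCache.set(redirect.slot, `${redirect.host}:${redirect.port}`);
  }
  return slotCache;
}

const slotCache = new Map([[3999, "10.0.0.1:6379"]]);
applyRedirect(slotCache, parseRedirect("ASK 3999 10.0.0.4:6379"));
const afterAsk = slotCache.get(3999);
applyRedirect(slotCache, parseRedirect("MOVED 3999 10.0.0.4:6379"));
const afterMoved = slotCache.get(3999);
console.log(afterAsk, afterMoved);
```

<p>A client that only patches single slots on MOVED, as this sketch does, keeps hitting redirects for every other moved slot, which is one reason a full topology refresh after a reshard is the safer choice.</p>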
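<h3 id="case-4cross-slot-transaction-失敗">Case 4: Cross-slot transaction failures</h3>
<p><strong>Symptom</strong>: the application uses <code>MULTI/EXEC</code> across multiple keys; during the reshard some transactions report a <code>MOVED</code> error, the whole transaction fails, and business logic becomes inconsistent.</p>
<p><strong>Root cause</strong>: a Redis Cluster transaction requires <em>all keys to be in the same slot</em> (using a hash tag such as <code>{user:123}</code>). During a reshard, if some key in the transaction has already migrated to dest, the keys are temporarily split across two nodes and the transaction is rejected; multi-key commands in this state typically receive <code>TRYAGAIN</code>.</p>
<p><strong>Fix</strong>:</p>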
<ol>
<li><strong>Pre-resharding audit</strong>: grep the application code for MULTI / pipeline usage and confirm that every case co-locates its keys with a hash tag</li>
<li><strong>Add application-side retry during the reshard</strong>: back off and retry after a transaction failure; it succeeds once the cluster stabilizes</li>
<li><strong>Architecture</strong>: for transaction-heavy workloads, consider not using Redis Cluster and running a single master under Redis Sentinel (no slot concept)</li>
</ol>
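<p>The hash tag audit can be partly automated with a check like the one below. It is a conservative sketch: it only accepts key sets where every key carries the same non-empty hash tag, and it rejects untagged keys even if they would happen to hash to the same slot. Function and key names are illustrative.</p>

```javascript
// Sketch: extract the hash tag of a key, or null when it has none.
function hashTag(key) {
  const open = key.indexOf("{");
  if (open === -1) return null;
  const close = key.indexOf("}", open + 1);
  if (close === -1 || close === open + 1) return null;
  return key.slice(open + 1, close);
}

// True only when every key carries the same non-empty hash tag,
// which guarantees that all keys map to one slot.
function sharesHashTag(keys) {
  if (keys.length === 0) return false;
  const tags = keys.map(hashTag);
  return tags.every((tag) => tag !== null && tag === tags[0]);
}

const safeTransactionKeys = ["{user:123}:cart", "{user:123}:balance"];
const riskyTransactionKeys = ["{user:123}:cart", "inventory:sku:42"];
console.log(sharesHashTag(safeTransactionKeys), sharesHashTag(riskyTransactionKeys));
```

<p>Running such a check in a lint step or a unit test over the key patterns used in MULTI blocks catches cross-slot transactions before a reshard exposes them.</p>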
<h3 id="case-5monitor-visibility-gap-during-reshard">Case 5: Monitor visibility gap during reshard</h3>
<p><strong>Symptom</strong>: during the reshard, the Prometheus dashboard shows <em>mismatched</em> metrics for a node, for example load = 95% while it holds only 6% of the slots; the SOC cannot tell whether the node is healthy.</p>
<p><strong>Root cause</strong>: the Prometheus exporter computes <em>slot count</em> and <em>traffic load</em> separately. During a reshard the slots have already moved but traffic still hits the source node (stale client caches), so the metrics look contradictory.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Silence alerts during the reshard</strong>: treat it as a known maintenance window and silence the Prometheus alerts</li>
<li><strong>Add a reshard-aware metric</strong>: quantify in-flight migration with a metric such as <code>redis_cluster_migration_slots</code></li>
<li><strong>Annotate the dashboard</strong>: so the SOC knows during a reshard that what it sees is a <em>normal anomaly</em></li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Estimate</th>
          <th>Watch-out</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Slot migration speed</td>
          <td>1-10K keys / sec (depends on key size and network)</td>
          <td>TB scale at 10K keys / sec: about 1 day</td>
      </tr>
      <tr>
          <td>Application latency impact</td>
          <td>p99 +50-200% during migration</td>
          <td>Set a latency budget; pause when it is exceeded</td>
      </tr>
      <tr>
          <td>Memory / node</td>
          <td>Unchanged overall, but +5-15% temporarily during the dual-write period</td>
          <td>Do not reshard at 90%+ memory</td>
      </tr>
      <tr>
          <td>Network bandwidth</td>
          <td>Heavy node-to-node traffic, ~100-500 Mbps per migration stream</td>
          <td>Watch egress cost for cross-AZ reshards</td>
      </tr>
      <tr>
          <td>Recovery time</td>
          <td>Rolling back a failed reshard = a reverse reshard (same duration)</td>
          <td>Do not reshard during an incident</td>
      </tr>
  </tbody>
</table>
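<p>Practical defaults:</p>
<ul>
<li>Run during <em>low-traffic windows</em> (nights / weekends)</li>
<li>Reshard only while throughput utilization is below 50%; do not operate at 80%+</li>
<li>Reserve a <em>rollback window</em>, so that a stuck reshard can be aborted and the original state restored</li>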
</ul>
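<p>The migration-speed row translates into a simple duration estimate. The sketch below is back-of-envelope arithmetic only; the key count of 864 million is a made-up figure chosen to match the "about 1 day at 10K keys / sec" case, and real runs also depend on value sizes and network conditions.</p>

```javascript
// Sketch: estimate slot migration duration from key count and throughput.
function estimateMigrationHours(keyCount, keysPerSec) {
  return keyCount / keysPerSec / 3600;
}

// Hypothetical cluster with 864 million keys to move.
const keysToMove = 864_000_000;
const fastCaseHours = estimateMigrationHours(keysToMove, 10_000);
const slowCaseHours = estimateMigrationHours(keysToMove, 1_000);
console.log(`${fastCaseHours}h at 10K keys/sec, ${slowCaseHours}h at 1K keys/sec`);
```

<p>The tenfold spread between the two ends of the 1-10K keys / sec range is why the rollback window needs to be sized from the slow case: a reverse reshard takes as long as the forward one.</p>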
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-redis--dragonflydb-migration-對位">Relation to <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB migration</a></h3>
<p>DragonflyDB is designed so that <em>single-node performance replaces a cluster</em>, which makes the re-sharding problem go away; if cluster re-sharding is triggered frequently, evaluate whether migrating straight to DragonflyDB would be cheaper.</p>
<h3 id="跟-sentinel-ha-對比">Comparison with <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Sentinel HA</a></h3>
<p>Sentinel mode has no slot concept, so re-sharding does not apply; but setups with <em>manual sharding by application</em> may still need a similar topology re-layout, and the application has to handle it itself.</p>
<h3 id="跟-redis-7-function--cluster-v2">Relation to Redis 7+ Functions / Cluster v2</h3>
<p>Redis 7 pushes Cluster v2 and Functions, with partial upgrades to the slot migration mechanism; keyspace migration is still the core issue, but the API and monitoring have improved.</p>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Auto-rebalance via operator</strong>: managed Redis offerings such as Redis Enterprise / Aiven provide automatic rebalancing with no manual trigger</li>
<li><strong>Cross-DC slot migration</strong>: slot migration in a cross-region cluster has a large latency / cost impact; <em>application-level sharding</em> usually replaces cluster-level sharding there</li>
<li><strong>Hash tag governance</strong>: enforce hash tags through application code grep / lint to avoid the cross-slot transaction anti-pattern</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/02-cache-redis/vendors/redis/" data-link-title="Redis" data-link-desc="OSS in-memory data structure store、cache 主流">Redis</a></li>
<li>Parallel migration playbook: <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a></li>
<li>Counterpart deep article: <a href="/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">PostgreSQL major version upgrade</a> (another verification of a gap in the 5 types)</li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep technical articles</a> / <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> (this article verifies the <em>capacity re-layout gap</em>)</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Partition Redesign：當 monthly partition 越跑越慢</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</guid><description>&lt;blockquote>
&lt;p>This is an implementation-layer deep article under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. It is the second dogfood of &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F "Topology re-layout"&lt;/a> (the first was &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding&lt;/a>) and checks whether the Type F anatomy generalizes across vendors.&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼-monthly-partition-越跑越慢">Why monthly partitions keep getting slower&lt;/h2>
&lt;p>At launch, monthly range partitioning was a reasonable design: one partition per month, 12 per year, and partition pruning kept a query such as &lt;code>WHERE event_time &amp;gt;= '2026-05-01'&lt;/code> on a single partition, so queries were fast. After 18 months in production:&lt;/p>
&lt;ul>
&lt;li>Each monthly partition grew from 50GB to 500GB (10x traffic)&lt;/li>
&lt;li>A query inside one month, &lt;code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'&lt;/code>, still scans the full 500GB month (pruning granularity stops at the month)&lt;/li>
&lt;li>Vacuuming one monthly partition takes 6-8 hours and no longer fits in the maintenance window&lt;/li>
&lt;li>Dropping old partitions to free storage happens on a monthly cadence, but the retention policy requires daily granularity&lt;/li>
&lt;/ul>
&lt;p>The partition design needs a &lt;em>redesign&lt;/em>, not an "optimization": moving from monthly to daily range partitions takes the partition count from 36 (3-year retention) to 1095.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is an implementation-layer deep article under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. It is the second dogfood of <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F "Topology re-layout"</a> (the first was <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding</a>) and checks whether the Type F anatomy generalizes across vendors.</p></blockquote>
<h2 id="為什麼-monthly-partition-越跑越慢">Why monthly partitions keep getting slower</h2>
<p>At launch, monthly range partitioning was a reasonable design: one partition per month, 12 per year, and partition pruning kept a query such as <code>WHERE event_time &gt;= '2026-05-01'</code> on a single partition, so queries were fast. After 18 months in production:</p>
<ul>
<li>Each monthly partition grew from 50GB to 500GB (10x traffic)</li>
<li>A query inside one month, <code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'</code>, still scans the full 500GB month (pruning granularity stops at the month)</li>
<li>Vacuuming one monthly partition takes 6-8 hours and no longer fits in the maintenance window</li>
<li>Dropping old partitions to free storage happens on a monthly cadence, but the retention policy requires daily granularity</li>
</ul>
<p>The partition design needs a <em>redesign</em>, not an "optimization": moving from monthly to daily range partitions takes the partition count from 36 (3-year retention) to 1095.</p>
<p>Result of the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same PostgreSQL, same table definition, partition key unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Same PostgreSQL operational stack</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Same OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Same single DB</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>None (partition pruning is transparent)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td><strong>Data topology</strong></td>
          <td><strong>Partition strategy changes from monthly to daily</strong></td>
          <td><strong>High</strong></td>
      </tr>
  </tbody>
</table>
<p>The other 5 dimensions are Low and data topology is High, which makes this <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">Type F "Topology re-layout"</a>.</p>
<h2 id="pre-layout-analysispartition-不平衡偵測">Pre-layout analysis: detecting partition imbalance</h2>
<p>Before running the redesign, quantify the current topology:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. Size and estimated row count per partition
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">partition_name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">size</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">reltuples</span><span class="p">::</span><span class="nb">bigint</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">estimated_rows</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="n">pg_stat_get_last_vacuum_time</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">last_vacuum</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_inherits</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">parent</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhparent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">child</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhrelid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;events&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 2. Check how well partition pruning narrows the scan
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2026-05-15&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="c1">-- Expected: 1 monthly partition scanned (current design) vs. 15 small daily partitions (target design)
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1">-- Observed: under the monthly design the planner still scans the whole month partition (~500GB), even though the query spans only 15 days
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="c1">-- 3. Find partition imbalance
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">  </span><span class="n">to_char</span><span class="p">(</span><span class="n">event_time</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY-MM&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">row_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="c1">-- Identify hot and cold months to estimate the distribution after the redesign</span></span></span></code></pre></div><p>Output of the pre-layout phase:</p>
<ul>
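<li><strong>Current topology, quantified</strong>: 36 monthly partitions, 1.8TB total, largest partition 500GB, smallest 50GB</li>
<li><strong>Hot key distribution</strong>: 80% of traffic hits the most recent 3 months</li>
<li><strong>Redesign target</strong>: tiered granularity, with daily partitions for the hot most recent 3 months, weekly partitions for cold data older than 3 months, and monthly partitions for data older than 1 year (sub-partition strategy)</li>
<li><strong>Migration scope</strong>: do not create all 1095 partitions up front; build them in stages according to the retention policy</li>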
</ul>
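<p>The hot-traffic share in the list above could be estimated with a query along these lines. This is a minimal sketch: it assumes row share is an acceptable proxy for traffic share (an assumption, not something the measurements above state) and that a full scan of <code>events</code> is affordable, for example on a replica.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Share of rows that fall in the most recent 3 months.
-- Assumption: row count approximates traffic; adjust if reads dominate.
SELECT
  count(*) AS total_rows,
  count(*) FILTER (WHERE event_time &gt;= now() - interval '3 months') AS recent_rows,
  round(
    100.0 * count(*) FILTER (WHERE event_time &gt;= now() - interval '3 months')
      / nullif(count(*), 0),
    1
  ) AS pct_recent
FROM events;
</code></pre></div>
<p>On a table of this size, summing the per-partition <code>reltuples</code> estimates from query 1 is a cheaper approximation of the same ratio.</p>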
<h2 id="re-layout-機制attach--detach-線上重劃">Re-layout mechanism: online re-layout with ATTACH / DETACH</h2>
<p>PostgreSQL cannot change the partition strategy of an existing table in place; the path is a <em>new partition tree plus data migration</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. Create the new daily-partitioned table (parallel to events)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">event_time</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
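</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- 2. Pre-create daily partitions for the next 90 days (this SELECT only generates the DDL text; in psql, replace the trailing ; with \gexec to run each generated statement)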
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="n">format</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="s1">&#39;CREATE TABLE events_daily_%s PARTITION OF events_daily FOR VALUES FROM (%L) TO (%L)&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">to_char</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY_MM_DD&#39;</span><span class="p">),</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">generate_series</span><span class="p">(</span><span class="k">current_date</span><span class="p">,</span><span class="w"> </span><span class="k">current_date</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;90 days&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">d</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- 3. Dual-write phase: every insert lands in both events and events_daily
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1">-- (via a trigger, as shown here, or application-side)
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">()</span><span class="w"> </span><span class="k">RETURNS</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="err">$$</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">BEGIN</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="k">NEW</span><span class="p">.</span><span class="o">*</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">  </span><span class="k">RETURN</span><span class="w"> </span><span class="k">NEW</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"></span><span class="k">END</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"></span><span class="err">$$</span><span class="w"> </span><span class="k">LANGUAGE</span><span class="w"> </span><span class="n">plpgsql</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">AFTER</span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="k">FOR</span><span class="w"> </span><span class="k">EACH</span><span class="w"> </span><span class="k">ROW</span><span class="w"> </span><span class="k">EXECUTE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="w"></span><span class="c1">-- 4. backfill historical data per partition
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-05-02&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w"></span><span class="c1">-- ... one day partition per run, to avoid long transactions
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w"></span><span class="c1">-- 5. cutover: rename swap
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="w"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="w"></span><span class="k">COMMIT</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="w"></span><span class="c1">-- 6. Observe for 1-2 weeks, then DROP events_old</span></span></span></code></pre></div><p>Key point: the rename swap runs in a <em>single transaction</em>, so the cutover takes effect atomically at commit. Application connections do not need to reconnect, but prepared statement caches may need to be refreshed.</p>
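<p><code>ALTER TABLE ... RENAME</code> needs an <code>ACCESS EXCLUSIVE</code> lock, so a long-running query on <code>events</code> can make the swap wait and queue other traffic behind it. One way to bound that wait is sketched below; the 3-second timeout is an illustrative assumption, not a value taken from the steps above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;
-- Give up instead of waiting indefinitely behind a long-running query.
-- Assumption: 3s is tolerable for writers; tune per workload.
SET LOCAL lock_timeout = '3s';
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_daily RENAME TO events;
DROP TRIGGER events_dual_write ON events_old;
COMMIT;
-- If lock_timeout fires, the transaction aborts and nothing is renamed;
-- dual-write is still active, so the swap can simply be retried later.
</code></pre></div>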
<h2 id="execution-flow-per-step">Execution flow per-step</h2>
<p>5 steps, each with its own rollback boundary:</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Action</th>
          <th>Rollback boundary</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 Pre-create partitions</td>
          <td>Create events_daily plus 90 days of partitions; no production impact</td>
          <td>DROP events_daily; no impact</td>
      </tr>
      <tr>
          <td>2 Dual-write</td>
          <td>Add the trigger so writes land on both sides; observe diffs</td>
          <td>DROP the trigger; events_daily is left for cleanup</td>
      </tr>
      <tr>
          <td>3 Backfill</td>
          <td>Backfill historical data one day at a time; use CHECK constraints to guard integrity</td>
          <td>DROP the backfilled partitions; the source events table is unaffected</td>
      </tr>
      <tr>
          <td>4 Verify</td>
          <td>Run sample queries against events and events_daily; confirm row counts match</td>
          <td>Still in dual-write; if a diff shows up, the cutover can be postponed</td>
      </tr>
      <tr>
          <td>5 Cutover</td>
          <td>Rename swap</td>
          <td><strong>Not cleanly reversible</strong>: backing out requires a reverse rename plus restarting dual-write</td>
      </tr>
  </tbody>
</table>
<p>Step 5 is the point of no clean return: schedule it in a <em>low-traffic maintenance window</em>, and take a backup checkpoint before the cutover.</p>
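<p>Step 4's row-count comparison could look like the following. This is a sketch only: the one-week window is an illustrative assumption, and the query compares counts per day rather than row contents, so it catches missing or duplicated rows but not payload differences.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Per-day row counts on both sides; returns only the days that disagree.
WITH old_counts AS (
  SELECT date_trunc('day', event_time) AS day, count(*) AS n
  FROM events
  WHERE event_time &gt;= '2026-05-01' AND event_time &lt; '2026-05-08'
  GROUP BY 1
),
new_counts AS (
  SELECT date_trunc('day', event_time) AS day, count(*) AS n
  FROM events_daily
  WHERE event_time &gt;= '2026-05-01' AND event_time &lt; '2026-05-08'
  GROUP BY 1
)
SELECT
  coalesce(o.day, n.day) AS day,
  coalesce(o.n, 0)       AS rows_in_events,
  coalesce(n.n, 0)       AS rows_in_events_daily
FROM old_counts o
FULL JOIN new_counts n ON o.day = n.day
WHERE coalesce(o.n, 0) &lt;&gt; coalesce(n.n, 0)
ORDER BY 1;
</code></pre></div>
<p>An empty result for every verified window is the signal to proceed to Step 5; any returned day points at a backfill gap or a dual-write overlap to resolve first.</p>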
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1backfill-期間-long-transaction-阻塞-vacuum">Case 1: a long transaction during backfill blocks vacuum</h3>
<p><strong>Symptom</strong>: a backfill <code>INSERT INTO events_daily SELECT * FROM events WHERE ...</code> runs for 6 hours; during that time autovacuum on the events table makes no progress, dead tuples pile up, and production queries slow down.</p>
<p><strong>Root cause</strong>: an open PostgreSQL transaction <em>pins the xmin horizon</em>. Vacuum can only reclaim dead tuples that no active transaction could still see, so a long backfill (one long open transaction) leaves vacuum ineffective.</p>
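<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Split the INSERT into batches</strong>: break each day's backfill into small batches (100,000 rows per transaction); every commit releases the xmin</li>
<li><strong>Use COPY instead of INSERT</strong>: <code>COPY ... FROM</code> reads from a file, a program, or STDIN, not from a subquery, so export with <code>COPY (SELECT * FROM events WHERE ...) TO STDOUT</code> and pipe the stream into <code>COPY events_daily FROM STDIN</code>. COPY is usually the fastest bulk-load path in PostgreSQL, and with small per-batch ranges each transaction stays short enough to limit the impact on vacuum</li>
<li><strong>Run the backfill from a standby</strong>: pull the data from a standby via logical replication so the long transaction does not run on the primary</li>
</ol>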
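<p>The batching idea in fix 1 could be sketched as a procedure that commits after every slice. Note the substitutions: this version slices by time window rather than by a fixed 100,000 rows, the procedure name <code>backfill_events_day</code> and the one-hour default are hypothetical, and transaction control inside procedures requires PostgreSQL 11 or later.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical helper: backfill one day in short transactions.
CREATE OR REPLACE PROCEDURE backfill_events_day(
  p_day  date,
  p_step interval DEFAULT interval '1 hour'  -- assumption: tune to row volume
)
LANGUAGE plpgsql AS $$
DECLARE
  t_start timestamptz := p_day;       -- midnight at the start of the day
  t_end   timestamptz := p_day + 1;   -- midnight at the start of the next day
BEGIN
  WHILE t_start &lt; t_end LOOP
    INSERT INTO events_daily
    SELECT * FROM events
    WHERE event_time &gt;= t_start
      AND event_time &lt; least(t_start + p_step, t_end);
    COMMIT;  -- end the transaction so the xmin horizon can advance
    t_start := t_start + p_step;
  END LOOP;
END;
$$;

-- Must be called outside an explicit transaction block,
-- otherwise the COMMIT inside the loop raises an error.
CALL backfill_events_day(date '2026-05-01');
</code></pre></div>
<p>Only backfill days that precede the moment dual-write was enabled; rows written after that point are already in <code>events_daily</code> and would be duplicated.</p>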
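<h3 id="case-2trigger-dual-write-對-application-造成-latency">Case 2: trigger-based dual-write adds latency for the application</h3>
<p><strong>Symptom</strong>: after the trigger is added, application write latency p99 rises from 5ms to 25-50ms; high-throughput batch jobs time out outright.</p>
<p><strong>Root cause</strong>: every INSERT fires the trigger function, which runs another INSERT into events_daily. IO doubles, and so does index maintenance.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Switch to application-side dual-write</strong>: application code writes both sides explicitly and batches through the connection pool to spread out the IO</li>
<li><strong>Use a logical replication slot</strong>: feed events_daily from events through logical replication instead of a trigger, reducing the IO impact</li>
<li><strong>Minimize the dual-write window</strong>: enable the trigger only for the backfill and verify phases, and drop it in the cutover transaction so that no write falls between the trigger going away and the rename</li>
</ol>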
<h3 id="case-3partition_pruning-沒命中planner-仍掃所有-partition">Case 3: partition pruning misses and the planner still scans every partition</h3>
<p><strong>Symptom</strong>: after the cutover, some application queries jump from 200ms to 5000ms; EXPLAIN shows all 1095 partitions being scanned under <code>Append</code>.</p>
<p><strong>Root cause</strong>: with 1000+ partitions, planning time grows for some queries (including prepared statements whose partition key bounds are not known at plan time); or the query uses a predicate such as <code>WHERE event_time = some_function(now())</code>, which plan-time pruning cannot evaluate.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong><code>enable_partition_pruning = on</code></strong> is the default; confirm it has not been disabled</li>
<li><strong>Runtime pruning (PG 11+)</strong>: when a prepared statement runs with a generic plan, execution-time pruning fills the gap</li>
<li><strong>Sub-partition strategy</strong>: 1095 daily partitions is too many; switch to a mixed strategy of <em>daily for the most recent 90 days, monthly before that</em> to cut the partition count</li>
<li><strong>Planner statistics</strong>: run <code>ANALYZE</code> to rebuild statistics; with a large partition tree the planner needs fresh stats</li>
</ol>
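<p>Fixes 1 and 2 can be checked directly from a psql session. A hedged sketch: the statement name <code>recent_events</code> is hypothetical, and <code>plan_cache_mode</code> is only available on PostgreSQL 12 or later.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 1. Confirm pruning has not been turned off for this session or role.
SHOW enable_partition_pruning;

-- 2. Force a generic plan to see what execution-time pruning does
--    when the partition key bounds arrive as parameters.
PREPARE recent_events(timestamptz, timestamptz) AS
SELECT count(*) FROM events
WHERE event_time &gt;= $1 AND event_time &lt; $2;

SET plan_cache_mode = force_generic_plan;
EXPLAIN (ANALYZE) EXECUTE recent_events('2026-05-01', '2026-05-02');
-- When execution-time pruning works, the Append node reports
-- "Subplans Removed: N" instead of listing every partition.

RESET plan_cache_mode;
DEALLOCATE recent_events;
</code></pre></div>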
<h3 id="case-4constraint-exclusion-失敗跨-partition-unique-不-enforce">Case 4: the unique scope changes and cross-partition uniqueness is not enforced</h3>
<p><strong>Symptom</strong>: after the cutover, the same event for one user shows up in multiple partitions; the unique constraint <code>(user_id, event_id)</code> is not being enforced, and a data audit catches duplicates.</p>
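<p><strong>Root cause</strong>: a <code>UNIQUE</code> constraint on a PostgreSQL partitioned table <em>must include the partition key</em>, so the parent can only declare <code>UNIQUE (user_id, event_id, event_time)</code>, which guarantees uniqueness of the exact triple and nothing more. Uniqueness of <code>(user_id, event_id)</code> alone can only be enforced per partition, for example through a unique index created on each child. Under monthly partitions that per-partition scope meant "unique per user and event_id within the same month"; under daily partitions it becomes "unique within the same day". The unique scope shrinks from month to day, and the cross-day dedup that used to hold inside a month is lost.</p>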
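<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-redesign</strong>: state the <em>time scope</em> of each unique constraint explicitly and decide whether a narrower scope after the redesign is acceptable</li>
<li><strong>Application-side dedup</strong>: enforce cross-partition uniqueness with an application-layer lookup (for example a Redis <code>SET ... NX EX</code> key with a TTL, so that check-and-set is atomic)</li>
<li><strong>Fall back to a non-partitioned dedup table</strong>: create a standalone user_events_dedup table and have the application look it up before writing</li>
</ol>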
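<p>Fix 3 could be sketched as follows. The table name <code>user_events_dedup</code> and the <code>(user_id, event_id)</code> key come from the case above; the <code>bigint</code> column types, the <code>first_seen</code> column, and the sample values are assumptions for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Non-partitioned, so the primary key is enforced across all dates.
CREATE TABLE user_events_dedup (
  user_id    bigint NOT NULL,              -- assumed type
  event_id   bigint NOT NULL,              -- assumed type
  first_seen timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, event_id)
);

-- Claim the key before writing the event itself.
-- A returned row means this is the first occurrence; no row means duplicate.
INSERT INTO user_events_dedup (user_id, event_id)
VALUES (42, 1001)
ON CONFLICT (user_id, event_id) DO NOTHING
RETURNING user_id;
</code></pre></div>
<p>Doing the claim and the event insert in the same transaction keeps the two tables consistent if the event write fails. The dedup table grows without bound unless it is pruned on its own schedule.</p>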
<h3 id="case-5drop-老-partition-太頻繁shared_buffers-cache-miss-爆">Case 5: dropping old partitions too often causes a burst of shared_buffers cache misses</h3>
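<p><strong>Symptom</strong>: once daily partitions are live, a nightly cron DROPs <code>events_daily_2026_02_18</code> (the partition from 90 days earlier); after the DROP a large part of shared_buffers is invalidated, and application query latency p99 jumps from 10ms to 100-200ms for about 30 minutes.</p>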
<p><strong>根因</strong>：PostgreSQL shared_buffers cache 對被 DROP 表的 page 全部 invalidate；DROP 大 partition（10GB+）後 cache hit rate 從 99% 掉到 60%、application 等 disk IO。</p>
<p><strong>修法</strong>：</p>
<ol>
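<li><strong>Run DROP off-peak</strong>: schedule the cron job for 3-4 a.m., away from business peak hours</li>
<li><strong>Prewarm the next partition</strong>: before the DROP, use <code>pg_prewarm</code> to proactively load hot partitions into the cache</li>
<li><strong>Switch to DETACH plus a delayed DROP TABLE</strong>: DETACH is fast; DROP TABLE moves to a weekly batch, which lowers the frequency</li>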
</ol>
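<p>The DETACH-now, DROP-later split in fix 3 could be driven by a small statement generator like the one below. It is a sketch under assumptions: the parent table name <code>events</code> and the helper names are hypothetical, the 90-day retention and the <code>events_YYYY_MM_DD</code> naming follow the example in the symptom, and the script only builds SQL strings without executing them.</p>

```python
from datetime import date, timedelta


def partition_name(day: date, parent: str = "events") -> str:
    """Build a daily partition name such as events_2025_05_18."""
    return f"{parent}_{day:%Y_%m_%d}"


def nightly_detach_sql(today: date, retention_days: int = 90,
                       parent: str = "events") -> str:
    """Statement for the nightly job: detach the partition that aged out."""
    expired = today - timedelta(days=retention_days)
    return (
        f"ALTER TABLE {parent} "
        f"DETACH PARTITION {partition_name(expired, parent)};"
    )


def weekly_drop_sql(detached_days: list[date],
                    parent: str = "events") -> list[str]:
    """Statements for the weekly batch: drop partitions detached earlier."""
    return [
        f"DROP TABLE IF EXISTS {partition_name(day, parent)};"
        for day in sorted(detached_days)
    ]


if __name__ == "__main__":
    print(nightly_detach_sql(date(2025, 8, 16)))
    for stmt in weekly_drop_sql([date(2025, 5, 18), date(2025, 5, 17)]):
        print(stmt)
```

<p>On PostgreSQL 14 and later, <code>DETACH PARTITION ... CONCURRENTLY</code> is also available when the lock taken on the parent by a plain DETACH is a concern.</p>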
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Monthly partition (current)</th>
          <th>Daily partition (target)</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Partition count</td>
          <td>36 (3-year retention)</td>
          <td>1095 (3-year retention)</td>
          <td>30x partition count, slightly higher planner cost</td>
      </tr>
      <tr>
          <td>Single partition size</td>
          <td>50-500GB</td>
          <td>1-20GB</td>
          <td>Daily partitions are easier to vacuum</td>
      </tr>
      <tr>
          <td>DROP old data</td>
          <td>Monthly cadence</td>
          <td>Daily cadence</td>
          <td>Finer-grained retention control</td>
      </tr>
      <tr>
          <td>Query latency</td>
          <td>50-200ms when a query spans many partitions</td>
          <td>5-50ms when a query spans few partitions</td>
          <td>Most queries are faster on daily partitions</td>
      </tr>
      <tr>
          <td>Planning time</td>
          <td>5-10ms</td>
          <td>50-100ms (with a generic plan)</td>
          <td>Planning overhead up by one order of magnitude</td>
      </tr>
      <tr>
          <td>Maintenance window</td>
          <td>6 hours to vacuum one partition</td>
          <td>5-30 minutes to vacuum one partition</td>
          <td>Smaller maintenance window; can run daily</td>
      </tr>
  </tbody>
</table>
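<p><strong>Reading the table</strong>: daily partitions suit workloads that combine <em>high traffic, many cross-day queries, and fine-grained retention</em>; very large partitions (TB-scale for a single day) still need to be split further with sub-partitioning.</p>
<h2 id="整合--下一步">Integration / next steps</h2>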
<h3 id="跟-autovacuum-tuning-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a></h3>
<p>Autovacuum behavior after the move to daily partitions:</p>
<ul>
<li>Each daily partition is autovacuumed independently; tune scale_factor and threshold per partition</li>
<li><code>autovacuum_max_workers</code> needs to go from 3 to 6-10 (the partition count balloons)</li>
<li>Cold partitions (&gt; 30 days): set <code>autovacuum_enabled = false</code> so no CPU is wasted on them</li>
</ul>
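<p>Turning autovacuum off for cold partitions is a per-table storage parameter, so it has to be applied partition by partition. A minimal sketch of a generator for those statements follows; the parent name <code>events</code>, the function name, and the idea of passing in the list of existing partition days are assumptions for illustration, and the 30-day threshold comes from the list above.</p>

```python
from datetime import date, timedelta


def cold_partition_autovacuum_sql(today: date,
                                  partition_days: list[date],
                                  cold_after_days: int = 30,
                                  parent: str = "events") -> list[str]:
    """Build ALTER TABLE statements that disable autovacuum on cold partitions.

    A partition counts as cold when its day is more than cold_after_days
    before today.
    """
    cutoff = today - timedelta(days=cold_after_days)
    return [
        f"ALTER TABLE {parent}_{day:%Y_%m_%d} SET (autovacuum_enabled = false);"
        for day in sorted(partition_days)
        if cutoff > day
    ]


if __name__ == "__main__":
    today = date(2026, 5, 19)
    days = [today - timedelta(days=n) for n in (0, 30, 31)]
    for stmt in cold_partition_autovacuum_sql(today, days):
        print(stmt)
```

<p>Note that <code>autovacuum_enabled = false</code> does not stop the anti-wraparound vacuum; PostgreSQL still runs it on these partitions when transaction ID age requires it.</p>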
<h3 id="跟-patroni-ha-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>Partition migration must not run during a failover; it has to run while the cluster is in a stable state. After Patroni promotes a new leader, re-evaluate partition health before resuming.</p>
<h3 id="跟-logical-replication--debezium-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></h3>
<p><code>publish_via_partition_root = true</code> makes the publication report changes under the parent table rather than under each leaf partition, so CDC consumers do not need a separate subscription per partition.</p>
<h3 id="下一步議題">Topics for next steps</h3>
<ul>
<li><strong>Archive strategy across daily partitions</strong>: archive to S3 cold storage; daily granularity gives finer retention control</li>
<li><strong>pg_partman extension</strong>: creates daily partitions automatically, with no cron job; confirm Aurora / RDS support first</li>
<li><strong>Sub-partitioning</strong>: if traffic explodes later, use a two-axis layout of "daily by time + list by tenant"</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a> (partition basics) / <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a></li>
<li>Parallel Type F dogfood: <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis Cluster Re-sharding</a> (dogfood #1) / <a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">MongoDB Shard + Multi-DC</a> (dogfood #3, F-multi-region sub-type)</li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content structure is determined by the largest diff dimension</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Data topology is the 6th audit dimension</a></li>
</ul>
]]></content:encoded></item></channel></rss>