<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Sharded on Tarragon</title><link>https://tarrragon.github.io/blog/tags/sharded/</link><description>Recent content in Sharded on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/sharded/index.xml" rel="self" type="application/rss+xml"/><item><title>MongoDB Shard Expansion + Multi-DC：Type F「不需要 parallel run」的 multi-region 例外</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB&lt;/a> overview 的 implementation-layer deep article。對應 &lt;a href="https://tarrragon.github.io/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Type F「Topology re-layout」&lt;/a> 第 3 個 dogfood、特別驗證 self-aware limitation 第 3 點「不需要 parallel run」claim 的 &lt;em>multi-region rollout 例外&lt;/em> — 本文是反例的具體實證。&lt;/p>&lt;/blockquote>
&lt;h2 id="reviewer-d-的質疑type-f-一定不需要-parallel-run-嗎">Reviewer D 的質疑：Type F 一定不需要 parallel run 嗎&lt;/h2>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Self-aware limitation&lt;/a> 第 3 點承認：&lt;/p>
&lt;blockquote>
&lt;p>「不需要 parallel run」claim 部分不成立：multi-region rollout（#128 列為 Type F 情境）必須 parallel run — 兩 region 同時跑然後切流量、不然就是停機切換、跟 Type A phase 3 機制相同。&lt;/p>&lt;/blockquote>
&lt;p>本文是該 claim 的 &lt;em>正面實證&lt;/em> — MongoDB sharded cluster 從 single-DC 加 shard + 加 secondary DC、確實需要 parallel run + 流量切換、跟 Type A phased migration 局部同構：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a> overview 的 implementation-layer deep article。對應 <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Type F「Topology re-layout」</a> 第 3 個 dogfood、特別驗證 self-aware limitation 第 3 點「不需要 parallel run」claim 的 <em>multi-region rollout 例外</em> — 本文是反例的具體實證。</p></blockquote>
<h2 id="reviewer-d-的質疑type-f-一定不需要-parallel-run-嗎">Reviewer D 的質疑：Type F 一定不需要 parallel run 嗎</h2>
<p><a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Self-aware limitation</a> 第 3 點承認：</p>
<blockquote>
<p>「不需要 parallel run」claim 部分不成立：multi-region rollout（#128 列為 Type F 情境）必須 parallel run — 兩 region 同時跑然後切流量、不然就是停機切換、跟 Type A phase 3 機制相同。</p></blockquote>
<p>本文是該 claim 的 <em>正面實證</em> — MongoDB sharded cluster 從 single-DC 加 shard + 加 secondary DC、確實需要 parallel run + 流量切換、跟 Type A phased migration 局部同構：</p>
<table>
  <thead>
      <tr>
          <th>Type F 假設</th>
          <th>Single-DC re-sharding（Redis case）</th>
          <th><strong>Multi-DC expansion（本文）</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>同 cluster 不同 state</td>
          <td>yes</td>
          <td>yes（同 MongoDB cluster）</td>
      </tr>
      <tr>
          <td>不需 schema translation</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>不需 parallel run</td>
          <td>yes（slot migration 內部完成）</td>
          <td><strong>no — 兩 DC 同跑後切流量</strong></td>
      </tr>
      <tr>
          <td>不需 cleanup phase</td>
          <td>yes</td>
          <td>partial（舊 DC 角色降為 standby）</td>
      </tr>
      <tr>
          <td>Step-by-step + rollback boundary</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
  </tbody>
</table>
<p>→ Type F anatomy 仍適用、但「不需 parallel run」是 <em>子情境條件</em>、不是 universal claim。</p>
<h2 id="兩個操作合併shard-加--dc-加">兩個操作合併：shard 加 + DC 加</h2>
<p>實務上中型公司常 <em>同時</em> 跑兩個 topology 變動：</p>
<ol>
<li><strong>Shard expansion</strong>：現有 3-shard cluster 加到 5-shard、chunk migration 平均分佈</li>
<li><strong>Multi-DC</strong>：從 single-DC（us-east-1）加到 multi-DC（us-east-1 + us-west-2）</li>
</ol>
<p>兩個操作的 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Shard 加（單獨）</th>
          <th>Multi-DC（單獨）</th>
          <th>兩者同跑</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Low</td>
          <td>Low</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Low</td>
          <td>Medium（跨 DC ops）</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Low</td>
          <td>Low</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low（加 shard、同 cluster）</td>
          <td>Low</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Low</td>
          <td>Low-Medium（cross-DC latency aware）</td>
          <td>Low-Medium</td>
      </tr>
      <tr>
          <td><strong>Data topology</strong></td>
          <td><strong>High</strong>（sharding strategy）</td>
          <td><strong>High</strong>（replication + region）</td>
          <td><strong>High</strong>（雙變、複合 topology）</td>
      </tr>
  </tbody>
</table>
<p>兩者主導維度都是 topology = High、組合走 Type F multi-axis 子情境。</p>
<h2 id="pre-layout-analysis當前--目標-topology">Pre-layout analysis：當前 + 目標 topology</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// 1. 當前 shard 分佈
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">status</span><span class="p">({</span><span class="nx">verbose</span><span class="o">:</span> <span class="kc">false</span><span class="p">});</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1">// 期望輸出: 3 shard、每個 ~33% chunks、no migration in progress
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="nx">db</span><span class="p">.</span><span class="nx">printShardingStatus</span><span class="p">({</span><span class="nx">verbose</span><span class="o">:</span> <span class="kc">false</span><span class="p">});</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1">// 找 hot shard、imbalanced chunk distribution
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1">// 2. Replication topology
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="nx">rs</span><span class="p">.</span><span class="nx">status</span><span class="p">();</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1">// 各 replica set primary/secondary 健康度、replication lag
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1">// 3. Cross-DC network baseline (在 add DC 前測)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1">// us-east-1 → us-west-2 RTT、bandwidth
</span></span></span></code></pre></div><p>Pre-layout 階段 output：</p>
<ul>
<li><strong>當前</strong>：3 shard × 1 replica set per shard (3 member) = 9 node、全在 us-east-1</li>
<li><strong>目標</strong>：5 shard × 1 replica set per shard (5 member: 3 us-east + 2 us-west) = 25 node</li>
<li><strong>Migration scope</strong>：加 2 shard + 加 2 DC member 每 shard、共 +16 node</li>
<li><strong>Chunk migration estimate</strong>：30% chunk 需重分（從 33% × 3 變 20% × 5）</li>
</ul>
<h2 id="re-layout-機制">Re-layout 機制</h2>
<p>兩個 mechanism 平行進行：</p>
<h3 id="shard-expansion-mechanism">Shard expansion mechanism</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// 1. 新增 shard 到 cluster
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">addShard</span><span class="p">(</span><span class="s2">&#34;rs-shard4/host10:27017,host11:27017,host12:27017&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShard</span><span class="p">(</span><span class="s2">&#34;rs-shard5/host13:27017,host14:27017,host15:27017&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1">// 2. balancer 自動 chunk migration
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">startBalancer</span><span class="p">();</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1">// 觀察 progress: db.adminCommand({balancerStatus: 1})
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1">// 3. 完成後 verify shard distribution
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">status</span><span class="p">();</span></span></span></code></pre></div><p>Chunk migration 是 <em>background</em> job、balancer 控制 throttle；不阻塞 production query、但 CPU / network 上升 30-50%。</p>
<h3 id="multi-dc-expansion-mechanism">Multi-DC expansion mechanism</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// 1. 對每 shard 的 replica set 加 us-west-2 member (priority 0)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="nx">rs</span><span class="p">.</span><span class="nx">add</span><span class="p">({</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nx">host</span><span class="o">:</span> <span class="s2">&#34;us-west-2-host:27017&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="nx">priority</span><span class="o">:</span> <span class="mi">0</span><span class="p">,</span>           <span class="c1">// 不能當 primary
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span>  <span class="nx">votes</span><span class="o">:</span> <span class="mi">1</span><span class="p">,</span>              <span class="c1">// 參與投票
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span>  <span class="nx">hidden</span><span class="o">:</span> <span class="kc">false</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="p">});</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1">// 2. 等 initial sync 完成（依資料量 1 小時 - 1 天）
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="nx">rs</span><span class="p">.</span><span class="nx">printReplicationInfo</span><span class="p">();</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1">// 3. 確認 secondary 健康後、提升 priority 或 votes
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1">// 不要立刻設 priority 1、避免 unintended failover
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1">// 4. Cross-DC routing 透過 readPreference 在 application 設
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">  <span class="nx">readPreference</span><span class="o">:</span> <span class="s1">&#39;secondaryPreferred&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">  <span class="nx">readPreferenceTags</span><span class="o">:</span> <span class="p">[{</span> <span class="nx">region</span><span class="o">:</span> <span class="s1">&#39;us-west-2&#39;</span> <span class="p">},</span> <span class="p">{}],</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="p">});</span></span></span></code></pre></div><p>關鍵：multi-DC 是 <em>漸進加 member</em>、不是 atomic switch；每 shard 獨立加、整體耗時 = shard 數 × initial sync time。</p>
<h2 id="execution-flow含-parallel-run--流量切換">Execution flow（含 parallel run + 流量切換）</h2>
<p>8 step、包含 <em>parallel run + 切流量</em> 段——驗證 <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation</a> 第 3 點：</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>動作</th>
          <th>Parallel run?</th>
          <th>Rollback boundary</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 Pre-check</td>
          <td>量化當前 topology、確認 cluster 健康</td>
          <td>no</td>
          <td>-</td>
      </tr>
      <tr>
          <td>2 加 us-east shard</td>
          <td>sh.addShard、balancer migrate chunk</td>
          <td>no（cluster 內）</td>
          <td>removeShard、chunk migrate 回</td>
      </tr>
      <tr>
          <td>3 加 us-west member</td>
          <td>對每 shard rs.add 跨 DC member</td>
          <td>no</td>
          <td>rs.remove、initial sync 投入廢棄</td>
      </tr>
      <tr>
          <td>4 <strong>Initial sync wait</strong></td>
          <td>等所有 us-west member catch up</td>
          <td><strong>parallel run starts</strong>：兩 DC 同時 serve</td>
          <td>-</td>
      </tr>
      <tr>
          <td>5 <strong>Cross-DC dual-serve</strong></td>
          <td>兩 DC 都跑 read traffic（不切 write）</td>
          <td><strong>yes、parallel run</strong>：app 用 secondary preferred us-west</td>
          <td>readPref 切回 us-east primary</td>
      </tr>
      <tr>
          <td>6 <strong>流量切換</strong></td>
          <td>application us-west traffic 走 us-west read</td>
          <td><strong>yes</strong></td>
          <td>DNS / readPref 切回</td>
      </tr>
      <tr>
          <td>7 Promote us-west（optional）</td>
          <td>一個 shard 的 us-west member priority 提到 1</td>
          <td>post-cutover</td>
          <td>demote priority 回 0</td>
      </tr>
      <tr>
          <td>8 Cleanup</td>
          <td>Verify、archive log、document new topology</td>
          <td>no</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Step 4-6 是 <em>parallel run + 切流量</em> — <strong>Type F 有此例外、跟 Type A phase 3 機制同構</strong>；anatomy 中「Execution flow per-step」段必須含 parallel run 子段。</p>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1balancer-跑-chunk-migration-撞-production-peak">Case 1：Balancer 跑 chunk migration 撞 production peak</h3>
<p><strong>徵兆</strong>：加 shard 後 balancer 開始 migrate chunk、production write latency p99 從 10ms 跳到 100ms；application 端 timeout 大量。</p>
<p><strong>根因</strong>：MongoDB balancer 預設 24×7 跑、chunk migrate 是 <em>blocking</em> 操作（migration lock 期間阻塞 write 到該 chunk）；產線高峰時間 balancer 不會自動暫停。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">// 限 balancer 跑在 low-traffic window
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">setBalancerState</span><span class="p">(</span><span class="kc">true</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="nx">db</span><span class="p">.</span><span class="nx">settings</span><span class="p">.</span><span class="nx">update</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  <span class="p">{</span> <span class="nx">_id</span><span class="o">:</span> <span class="s2">&#34;balancer&#34;</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">  <span class="p">{</span> <span class="nx">$set</span><span class="o">:</span> <span class="p">{</span> <span class="nx">activeWindow</span><span class="o">:</span> <span class="p">{</span> <span class="nx">start</span><span class="o">:</span> <span class="s2">&#34;02:00&#34;</span><span class="p">,</span> <span class="nx">stop</span><span class="o">:</span> <span class="s2">&#34;06:00&#34;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  <span class="p">{</span> <span class="nx">upsert</span><span class="o">:</span> <span class="kc">true</span> <span class="p">}</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="p">);</span></span></span></code></pre></div><p>且設 <code>chunkSize</code> 較小（128MB → 64MB）讓 migration 步驟細、單次 lock 時間短。</p>
<h3 id="case-2cross-dc-initial-sync-期間-oplog-跑出窗口">Case 2：Cross-DC initial sync 期間 oplog 跑出窗口</h3>
<p><strong>徵兆</strong>：加 us-west member 後、initial sync 跑 4 小時、結束時 member 顯示「too stale to catch up」、需要 full re-sync。</p>
<p><strong>根因</strong>：MongoDB oplog 是 capped collection、預設 size 5% disk；4 小時 initial sync 期間 primary 寫入量超出 oplog 保留範圍、member 拿到的 oplog start point 已被覆蓋。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>預先擴 oplog size</strong>：<code>db.adminCommand({replSetResizeOplog: 1, size: 51200})</code> 加到 50GB、覆蓋 sync window</li>
<li><strong>Off-peak initial sync</strong>：跑在低流量時間、oplog 寫入較慢</li>
<li><strong>Manual initial sync via snapshot</strong>：用 <code>mongodump</code> 從 primary snapshot、restore 到 new member、跳過 oplog tail catch-up</li>
</ol>
<h3 id="case-3跨-dc-read-路由錯誤stale-data-影響業務">Case 3：跨 DC read 路由錯誤、stale data 影響業務</h3>
<p><strong>徵兆</strong>：切流量到 us-west 後、application 偶爾抓到 5-30 秒前的 stale data；customer 報告「明明剛改了 setting、refresh 又變回去」。</p>
<p><strong>根因</strong>：us-west member 是 secondary、replication lag 5-30 秒；application readPreference 設 <code>secondaryPreferred</code> 但沒 <code>maxStalenessSeconds</code>、可能讀到嚴重 stale member。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  <span class="nx">readPreference</span><span class="o">:</span> <span class="s1">&#39;secondaryPreferred&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nx">readPreferenceTags</span><span class="o">:</span> <span class="p">[{</span> <span class="nx">region</span><span class="o">:</span> <span class="s1">&#39;us-west-2&#39;</span> <span class="p">},</span> <span class="p">{}],</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="nx">maxStalenessSeconds</span><span class="o">:</span> <span class="mi">90</span><span class="p">,</span>  <span class="c1">// 限 stale 不超過 90 秒
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="p">});</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1">// 對 strict consistency 場景強制 primary
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="kr">const</span> <span class="nx">client_strict</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="nx">readPreference</span><span class="o">:</span> <span class="s1">&#39;primary&#39;</span><span class="p">,</span>  <span class="c1">// 強制讀 us-east primary
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="p">});</span></span></span></code></pre></div><p>Application-level read pattern 必須區分「accept stale read」vs「require fresh read」、不是 cluster-level 統一配置。</p>
<h3 id="case-4shard-tag-aware-routing-沒設cross-dc-traffic-爆-cost">Case 4：Shard tag-aware routing 沒設、cross-DC traffic 爆 cost</h3>
<p><strong>徵兆</strong>：multi-DC 跑了 1 個月、AWS egress cost 從 $500 / month 漲到 $8000 / month；99% 流量還是 us-east → us-west 跨 DC。</p>
<p><strong>根因</strong>：sharded cluster 沒設 <em>zone sharding</em>、application 不知道哪些 chunk 在哪個 DC、所有 query 預設打 us-east primary、跨 DC bandwidth 爆。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// 注意: MongoDB 4.2+ API、舊版 sh.addShardTag / sh.addTagRange 已 deprecated
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1">// 對應改 sh.addShardToZone / sh.updateZoneKeyRange
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1">// 1. 給 shard 加 zone (MongoDB 4.2+)
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard1&#34;</span><span class="p">,</span> <span class="s2">&#34;us-east&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard2&#34;</span><span class="p">,</span> <span class="s2">&#34;us-east&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard3&#34;</span><span class="p">,</span> <span class="s2">&#34;us-east&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard4&#34;</span><span class="p">,</span> <span class="s2">&#34;us-west&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">addShardToZone</span><span class="p">(</span><span class="s2">&#34;rs-shard5&#34;</span><span class="p">,</span> <span class="s2">&#34;us-west&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1">// 2. 對 collection 加 zone range
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="nx">sh</span><span class="p">.</span><span class="nx">updateZoneKeyRange</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">  <span class="s2">&#34;myapp.events&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-east&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MinKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-east&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MaxKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">  <span class="s2">&#34;us-east&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="p">);</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="nx">sh</span><span class="p">.</span><span class="nx">updateZoneKeyRange</span><span class="p">(</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">  <span class="s2">&#34;myapp.events&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-west&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MinKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">  <span class="p">{</span> <span class="nx">region</span><span class="o">:</span> <span class="s2">&#34;us-west&#34;</span><span class="p">,</span> <span class="nx">_id</span><span class="o">:</span> <span class="nx">MaxKey</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">  <span class="s2">&#34;us-west&#34;</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="p">);</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1">// 3. balancer 重新分配 chunk 到對應 zone
</span></span></span></code></pre></div><p>Zone sharding 是 multi-DC 必要設計、不設等於白付 egress cost。</p>
<h3 id="case-5failover-後跨-dc-primary-切換application-連線中斷">Case 5：Failover 後跨 DC primary 切換、application 連線中斷</h3>
<p><strong>徵兆</strong>：production 跑 6 個月後、us-east-1 outage、某 shard primary 切到 us-west member；application 5-10 秒內大量 connection error。</p>
<p><strong>根因</strong>：MongoDB driver 預設 election timeout 10 秒、application 沒設 server selection retry；primary 切換期間 client 沒重連。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln">1</span><span class="cl"><span class="kr">const</span> <span class="nx">client</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">MongoClient</span><span class="p">(</span><span class="nx">uri</span><span class="p">,</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nx">serverSelectionTimeoutMS</span><span class="o">:</span> <span class="mi">30000</span><span class="p">,</span>    <span class="c1">// 等 30 秒給 election
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span>  <span class="nx">retryWrites</span><span class="o">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  <span class="nx">retryReads</span><span class="o">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">  <span class="nx">heartbeatFrequencyMS</span><span class="o">:</span> <span class="mi">5000</span><span class="p">,</span>         <span class="c1">// 更頻繁 detect topology 變動
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="p">});</span></span></span></code></pre></div><p>且 multi-DC primary 應該設 <em>priority asymmetry</em>：us-east member priority 2、us-west priority 1；正常情況不切換、災難時自動切。</p>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Single-DC 3-shard</th>
          <th>Multi-DC 5-shard</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Node count</td>
          <td>9</td>
          <td>25</td>
          <td>~3x infrastructure cost</td>
      </tr>
      <tr>
          <td>Storage redundancy</td>
          <td>3 replica</td>
          <td>5 replica (3 east + 2 west)</td>
          <td>+2 copy、storage cost +66%</td>
      </tr>
      <tr>
          <td>Network egress</td>
          <td>內部 VPC、低</td>
          <td>Cross-DC、高（需 zone sharding）</td>
          <td>$500 → $8000 / month if no zone sharding</td>
      </tr>
      <tr>
          <td>Latency p99 (write)</td>
          <td>5-10ms</td>
          <td>5-15ms（primary 仍 us-east）</td>
          <td>略升</td>
      </tr>
      <tr>
          <td>Latency p99 (read)</td>
          <td>5-10ms</td>
          <td>2-5ms (local DC)</td>
          <td>Multi-DC 區域 read 加快</td>
      </tr>
      <tr>
          <td>Disaster recovery</td>
          <td>RTO 30 分鐘（rebuild）</td>
          <td>RTO &lt; 1 分鐘（auto failover）</td>
          <td>顯著改善</td>
      </tr>
      <tr>
          <td>Operational complexity</td>
          <td>低</td>
          <td>高（zone sharding / DR drill）</td>
          <td>+1 SRE FTE 維護</td>
      </tr>
  </tbody>
</table>
<p><strong>判讀</strong>：multi-DC 是 <em>DR 投資</em>、不是 cost optimization；只在 <em>availability SLA &gt; 99.9% 或合規要求</em> 場景值得。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-mongodb--atlas-migration-對位">跟 <a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas migration</a> 對位</h3>
<p>Self-managed multi-DC 複雜度高、Atlas 把 multi-cluster + cross-region 簡化成 UI 配置；如果走 multi-DC、考慮直接遷 Atlas。</p>
<h3 id="跟-application-read-pattern-整合">跟 Application read pattern 整合</h3>
<p>zone sharding + readPreference 跟 application logic 緊密耦合；不能事後補、應在 multi-DC 設計階段就設計 application 端的 region-aware routing。</p>
<h3 id="跟-cassandra-keyspace-re-balance-對比">跟 <a href="https://cassandra.apache.org/">Cassandra keyspace re-balance</a> 對比</h3>
<p>Cassandra 是另一個 Type F multi-DC 典型 case；用 <em>NetworkTopologyStrategy + replication factor per DC</em>、跟 MongoDB zone sharding 概念對等但 mechanism 完全不同。Reviewer D 把 Cassandra 列為 Type F 反例 — 本文以 MongoDB 替代驗證。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Cross-region active-active</strong>：MongoDB 不支援 multi-primary、cross-region active-active 需要 application-level conflict resolution</li>
<li><strong>PostgreSQL Citus / CockroachDB multi-region</strong> 對比：distributed SQL 對 multi-region 有不同設計</li>
<li><strong>Cost optimization</strong>：跨 DC egress 是 long-term concern、zone sharding 設好後仍要 quarterly review</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a></li>
<li>平行 migration playbook：<a href="/blog/backend/01-database/vendors/mongodb/migrate-to-atlas/" data-link-title="MongoDB → Atlas：Atlas 不是 MongoDB &#43; managed、是另一個 product" data-link-desc="Atlas 號稱「MongoDB managed」但 operational model 完全不同（auto-scaling / VPC peering / IAM-driven access / 內建 backup / billing 模型）；本文採用 Type C operational redesign hybrid 結構、4-phase operational migration &#43; drop-in cutover、5 個 production 踩雷（連線數限制 / IP whitelist / backup retention / IAM token 過期 / billing 暴漲）">MongoDB → Atlas</a></li>
<li>平行 Type F dogfood：<a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis Cluster Re-sharding</a>（dogfood #1）/ <a href="/blog/backend/01-database/vendors/postgresql/partition-redesign/" data-link-title="PostgreSQL Partition Redesign：當 monthly partition 越跑越慢" data-link-desc="PostgreSQL partition redesign 是 Type F「topology re-layout」第 2 個 dogfood — 從 monthly partition 改 daily / 從 range 改 list / 從單軸改 sub-partition；6 維 audit 皆 Low &#43; topology 軸 High；涵蓋 partition 不平衡偵測、ATTACH/DETACH 線上重劃、5 個 production 踩雷、跟 partition_pruning &#43; autovacuum 整合">PostgreSQL Partition Redesign</a>（dogfood #2）</li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Data topology 是第 6 audit 維度</a>（本文驗證 self-aware limitation 第 3 點）</li>
</ul>
]]></content:encoded></item></channel></rss>