<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Active-Active on Tarragon</title><link>https://tarrragon.github.io/blog/tags/active-active/</link><description>Recent content in Active-Active on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 16 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/active-active/index.xml" rel="self" type="application/rss+xml"/><item><title>KeyDB active-active 多主複製：last-write-wins 會默默吃掉哪一筆寫入</title><link>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/keydb/active-active-replication/</link><pubDate>Tue, 16 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/keydb/active-active-replication/</guid><description>&lt;blockquote>
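&lt;p>This article is the implementation-layer deep dive for the &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/keydb/" data-link-title="KeyDB" data-link-desc="Redis multi-threaded fork、active-replication、Snap 採用">KeyDB&lt;/a> overview. For selection (KeyDB vs Redis / DragonflyDB / Valkey, and why pick a fork), see the overview; this article covers only how to reason about conflicts and consistency once you have decided to use KeyDB active-active. Commands were verified on the eqalpha/keydb image, last checked 2026-06-16; for the replication mechanism, the &lt;a href="https://docs.keydb.dev/docs/active-rep/">KeyDB active-replication documentation&lt;/a> is authoritative.&lt;/p>&lt;/blockquote>
&lt;h2 id="兩邊都能寫聽起來太美好">Both sides can write: it sounds too good to be true&lt;/h2>
&lt;p>Redis replication is one-way: one master takes writes and replicas are read-only. Redis by itself cannot let two regions each write locally (you need application-level partitioning or external tooling). KeyDB active-active removes that limit: two or more KeyDB nodes all act as masters, all accept writes, and replicate their writes to each other. For a scenario where two regions both need low-latency writes to the same cache, this sounds like it solves everything.&lt;/p>
&lt;p>The problem hides in the moment both sides write the same key at once. Active-active has no global coordinator to arbitrate; it uses last-write-wins (LWW): compare the timestamps of the two writes, keep the later one, and silently drop the earlier one. Most of the time nothing goes wrong, but when two regions each update the same key within a few milliseconds, one of the writes vanishes silently: no error, no log entry, and the application believes its write succeeded.&lt;/p>
&lt;p>Understanding KeyDB active-active means understanding this trade-off: LWW buys the availability of writing on both sides, at the cost of giving up strong consistency and the guarantee that no write is lost. This article covers the replication mechanism, the conflict semantics, and which data fits this model versus which data becomes a bug the moment you put it there.&lt;/p>
&lt;h2 id="核心概念active-active-的複製與衝突語意">Core concepts: replication and conflict semantics in active-active&lt;/h2>
&lt;p>Active-active is not a distributed transaction; it is bidirectional asynchronous replication plus LWW conflict resolution. Four points are key:&lt;/p>
&lt;p>&lt;strong>Every node is an active-replica&lt;/strong>. An ordinary Redis replica is read-only; a KeyDB active-replica both accepts local writes and receives the peer's replication stream. The two nodes each configure the other as their master, forming a bidirectional replication loop. The role observed on a live node is &lt;code>active-replica&lt;/code> (not master / slave).&lt;/p>
&lt;p>&lt;strong>Replication is asynchronous&lt;/strong>. A local write returns OK to the client immediately and is sent to the peer node afterwards. There is therefore always a replication-lag window between the two nodes, and within it the two sides may see different data. This is the fundamental reason active-active is AP (availability + partition tolerance) rather than CP.&lt;/p>
&lt;p>&lt;strong>Conflicts are resolved by last-write-wins&lt;/strong>. When the same key is modified concurrently on two nodes, KeyDB compares versions, keeps the later write, and discards the earlier one. There is no merge, no vector clock, and no application callback: the later write simply wins. KeyDB orders writes with a hybrid logical clock (HLC) rather than pure wall-clock time, but an HLC is still tied to each node's physical clock, so clock skew directly affects which write is judged "later". What is synchronized is the key's value, not the operation, which is why concurrent INCRs overwrite each other instead of accumulating (see Case 1 in the failure drills).&lt;/p>
&lt;p>&lt;strong>Every write carries an origin marker to prevent infinite loops&lt;/strong>. After A's write is replicated to B, B does not send it back to A as a new write (otherwise it would loop forever). KeyDB handles this with origin markers, but a badly designed replication topology (for example, a multi-node ring) can still amplify traffic.&lt;/p>
&lt;h2 id="配置兩節點-active-active-的設定路徑">Configuration: setting up two-node active-active&lt;/h2>
&lt;p>The minimal dual-master setup verified on a live instance (two nodes replicating to each other):&lt;/p>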





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># Enable active-replica + multi-master on both node A and node B (assumes the kdbnet Docker network already exists)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">docker run -d --name kdb-a --network kdbnet -p 6401:6379 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> eqalpha/keydb keydb-server --active-replica yes --multi-master yes
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">docker run -d --name kdb-b --network kdbnet -p 6402:6379 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> eqalpha/keydb keydb-server --active-replica yes --multi-master yes
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="c1"># Point each node at the other (forms bidirectional replication)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">keydb-cli -p &lt;span class="m">6401&lt;/span> replicaof kdb-b &lt;span class="m">6379&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">9&lt;/span>&lt;span class="cl">keydb-cli -p &lt;span class="m">6402&lt;/span> replicaof kdb-a &lt;span class="m">6379&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verifying bidirectional sync on a live instance (last checked 2026-06-16):&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1"># Write to A, read from B&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">keydb-cli -p &lt;span class="m">6401&lt;/span> SET fromA hello &lt;span class="c1"># → OK&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">keydb-cli -p &lt;span class="m">6402&lt;/span> GET fromA &lt;span class="c1"># → hello (A's write replicated to B)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1"># Write to B, read from A (bidirectional)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">keydb-cli -p &lt;span class="m">6402&lt;/span> SET fromB world &lt;span class="c1"># → OK&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">keydb-cli -p &lt;span class="m">6401&lt;/span> GET fromB &lt;span class="c1"># → world (B's write replicated to A)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="c1"># Confirm role and replication link&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">keydb-cli -p &lt;span class="m">6401&lt;/span> INFO replication &lt;span class="p">|&lt;/span> grep -E &lt;span class="s2">&amp;#34;role|master_link_status|connected_slaves&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1"># role:active-replica&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="c1"># master_link_status:up&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="c1"># connected_slaves:1&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Both nodes report &lt;code>role:active-replica&lt;/code> (not the traditional master / slave), and &lt;code>master_link_status:up&lt;/code> confirms the replication link is healthy. A write to either node is readable from the other; that is the core behavior of active-active.&lt;/p></description><content:encoded><![CDATA[<blockquote>
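<p>This article is the implementation-layer deep dive for the <a href="/blog/backend/02-cache-redis/vendors/keydb/" data-link-title="KeyDB" data-link-desc="Redis multi-threaded fork、active-replication、Snap 採用">KeyDB</a> overview. For selection (KeyDB vs Redis / DragonflyDB / Valkey, and why pick a fork), see the overview; this article covers only how to reason about conflicts and consistency once you have decided to use KeyDB active-active. Commands were verified on the eqalpha/keydb image, last checked 2026-06-16; for the replication mechanism, the <a href="https://docs.keydb.dev/docs/active-rep/">KeyDB active-replication documentation</a> is authoritative.</p></blockquote>
<h2 id="兩邊都能寫聽起來太美好">Both sides can write: it sounds too good to be true</h2>
<p>Redis replication is one-way: one master takes writes and replicas are read-only. Redis by itself cannot let two regions each write locally (you need application-level partitioning or external tooling). KeyDB active-active removes that limit: two or more KeyDB nodes all act as masters, all accept writes, and replicate their writes to each other. For a scenario where two regions both need low-latency writes to the same cache, this sounds like it solves everything.</p>
<p>The problem hides in the moment both sides write the same key at once. Active-active has no global coordinator to arbitrate; it uses last-write-wins (LWW): compare the timestamps of the two writes, keep the later one, and silently drop the earlier one. Most of the time nothing goes wrong, but when two regions each update the same key within a few milliseconds, one of the writes vanishes silently: no error, no log entry, and the application believes its write succeeded.</p>
<p>Understanding KeyDB active-active means understanding this trade-off: LWW buys the availability of writing on both sides, at the cost of giving up strong consistency and the guarantee that no write is lost. This article covers the replication mechanism, the conflict semantics, and which data fits this model versus which data becomes a bug the moment you put it there.</p>
<h2 id="核心概念active-active-的複製與衝突語意">Core concepts: replication and conflict semantics in active-active</h2>
<p>Active-active is not a distributed transaction; it is bidirectional asynchronous replication plus LWW conflict resolution. Four points are key:</p>
<p><strong>Every node is an active-replica</strong>. An ordinary Redis replica is read-only; a KeyDB active-replica both accepts local writes and receives the peer's replication stream. The two nodes each configure the other as their master, forming a bidirectional replication loop. The role observed on a live node is <code>active-replica</code> (not master / slave).</p>
<p><strong>Replication is asynchronous</strong>. A local write returns OK to the client immediately and is sent to the peer node afterwards. There is therefore always a replication-lag window between the two nodes, and within it the two sides may see different data. This is the fundamental reason active-active is AP (availability + partition tolerance) rather than CP.</p>
<p><strong>Conflicts are resolved by last-write-wins</strong>. When the same key is modified concurrently on two nodes, KeyDB compares versions, keeps the later write, and discards the earlier one. There is no merge, no vector clock, and no application callback: the later write simply wins. KeyDB orders writes with a hybrid logical clock (HLC) rather than pure wall-clock time, but an HLC is still tied to each node's physical clock, so clock skew directly affects which write is judged "later". What is synchronized is the key's value, not the operation, which is why concurrent INCRs overwrite each other instead of accumulating (see Case 1 in the failure drills).</p>
<p><strong>Every write carries an origin marker to prevent infinite loops</strong>. After A's write is replicated to B, B does not send it back to A as a new write (otherwise it would loop forever). KeyDB handles this with origin markers, but a badly designed replication topology (for example, a multi-node ring) can still amplify traffic.</p>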
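<p>The value-versus-operation point can be made concrete with a toy model. This is a minimal sketch and not KeyDB's implementation: the <code>Node</code> class, the integer timestamps, and the exchange order are assumptions chosen for illustration. Two replicas each apply one local increment, then exchange the resulting values under last-write-wins.</p>

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical replica holding a single key as (value, timestamp)."""
    value: int = 0
    ts: int = 0

    def local_incr(self, now: int) -> None:
        # The increment is applied locally; the resulting VALUE is what replicates.
        self.value += 1
        self.ts = now

    def apply_remote(self, value: int, ts: int) -> None:
        # Last-write-wins: accept the incoming value only if its timestamp is newer.
        if ts > self.ts:
            self.value, self.ts = value, ts


a, b = Node(), Node()
a.local_incr(now=100)  # region A increments at t=100
b.local_incr(now=101)  # region B increments at t=101, before seeing A's write

# Both local writes already returned OK; replication now exchanges the values.
a_state, b_state = (a.value, a.ts), (b.value, b.ts)
a.apply_remote(*b_state)
b.apply_remote(*a_state)

print(a.value, b.value)
```

<p>Two increments were acknowledged, yet both replicas converge on 1 in this model: the later value replaces the earlier one instead of being added to it.</p>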
<h2 id="配置兩節點-active-active-的設定路徑">Configuration: setting up two-node active-active</h2>
<p>The minimal dual-master setup verified on a live instance (two nodes replicating to each other):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Enable active-replica + multi-master on both node A and node B (assumes the kdbnet Docker network already exists)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">docker run -d --name kdb-a --network kdbnet -p 6401:6379 <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  eqalpha/keydb keydb-server --active-replica yes --multi-master yes
</span></span><span class="line"><span class="ln">4</span><span class="cl">docker run -d --name kdb-b --network kdbnet -p 6402:6379 <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  eqalpha/keydb keydb-server --active-replica yes --multi-master yes
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># Point each node at the other (forms bidirectional replication)</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">keydb-cli -p <span class="m">6401</span> replicaof kdb-b <span class="m">6379</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl">keydb-cli -p <span class="m">6402</span> replicaof kdb-a <span class="m">6379</span></span></span></code></pre></div><p>Verifying bidirectional sync on a live instance (last checked 2026-06-16):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Write to A, read from B</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">keydb-cli -p <span class="m">6401</span> SET fromA hello   <span class="c1"># → OK</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">keydb-cli -p <span class="m">6402</span> GET fromA         <span class="c1"># → hello   (A's write replicated to B)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># Write to B, read from A (bidirectional)</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">keydb-cli -p <span class="m">6402</span> SET fromB world   <span class="c1"># → OK</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">keydb-cli -p <span class="m">6401</span> GET fromB         <span class="c1"># → world   (B's write replicated to A)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Confirm role and replication link</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">keydb-cli -p <span class="m">6401</span> INFO replication <span class="p">|</span> grep -E <span class="s2">&#34;role|master_link_status|connected_slaves&#34;</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># role:active-replica</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># master_link_status:up</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># connected_slaves:1</span></span></span></code></pre></div><p>Both nodes report <code>role:active-replica</code> (not the traditional master / slave), and <code>master_link_status:up</code> confirms the replication link is healthy. A write to either node is readable from the other; that is the core behavior of active-active.</p>
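<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1並發寫同一-key一筆寫入無聲消失">Case 1: concurrent writes to the same key, one write silently vanishes</h3>
<p><strong>Symptom</strong>: applications in two regions each update the same user's cache entry (for example, a profile), and later one region's update turns out to have "not taken effect", even though the application received OK at write time with no error.</p>
<p><strong>Root cause</strong>: active-active LWW. The two writes happen concurrently inside the replication-lag window; KeyDB compares timestamps, keeps the later one, and silently discards the earlier one. Both applications believe their write succeeded (locally it did return OK), but only one write survives after sync.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Do not let multiple regions write the same key concurrently: partition by key (writes for user X always route to region A), degrading multi-master to "read nearby + single write point"</li>
<li>For counter-like data that truly needs multi-point writes, use a structure with CRDT semantics (KeyDB's LWW is unsuitable for counters; concurrent INCRs overwrite each other instead of accumulating)</li>
<li>Accept LWW as a cache trade-off: a rebuildable cache copy that loses a write can be recomputed from the source; data that cannot be rebuilt does not belong in active-active</li>
<li>Silent conflicts are the most dangerous part: add application-level write auditing (do not rely on KeyDB to alert)</li>
</ol>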
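<p>Partitioning by key, the first fix in the list, can be sketched as a stable hash from key to a single owning region. This is one possible approach under stated assumptions: the region names, <code>home_region</code>, and <code>route_write</code> are hypothetical and not part of KeyDB; a real deployment also has to handle region failure and rebalancing, which this sketch ignores.</p>

```python
import hashlib

REGIONS = ["region-a", "region-b"]  # hypothetical region names


def home_region(key: str, regions=REGIONS) -> str:
    """Map a key to exactly one writer region using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def route_write(key: str, local_region: str) -> str:
    """Return 'local' when this region owns the key, else the owner to forward to."""
    owner = home_region(key)
    return "local" if owner == local_region else owner


owner = home_region("user:42")
print(route_write("user:42", owner))
```

<p>Because every region computes the same owner for a given key, two regions never accept concurrent writes for it, so LWW has nothing to arbitrate; reads can still be served from the nearest node.</p>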
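<h3 id="case-2clock-skew-讓較晚的判定錯亂">Case 2: clock skew scrambles the "later" judgment</h3>
<p><strong>Symptom</strong>: region B clearly wrote its value later, yet the surviving value is the one region A wrote earlier; LWW's "later writer wins" fails.</p>
<p><strong>Root cause</strong>: LWW compares timestamps, but if the two nodes' system clocks are not synchronized (clock skew), the "later" judgment is wrong. If B's clock is 200 ms slow and B writes less than 200 ms after A, B's value carries a timestamp earlier than A's and is discarded as "older".</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Enforce NTP clock synchronization on all KeyDB nodes and keep skew at the millisecond level</li>
<li>Monitor clock offset between nodes; once skew exceeds replication lag, LWW misjudgment becomes a risk</li>
<li>For time-sensitive conflicts LWW is inherently unreliable: clocks can never be perfectly synchronized, which is an intrinsic weakness of the LWW model</li>
<li>Where correct conflict resolution is required, do not use LWW-based active-active; switch to strongly consistent storage</li>
</ol>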
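<p>The 200 ms example from the root cause can be replayed with plain arithmetic. This is a simplified sketch that models timestamps as real time plus a per-node clock offset; <code>stamp</code> and <code>lww_winner</code> are hypothetical helpers, and KeyDB's actual versioning is more involved than a bare wall-clock comparison.</p>

```python
def stamp(true_time_ms: int, clock_offset_ms: int) -> int:
    """Timestamp a node attaches: real time plus that node's clock error."""
    return true_time_ms + clock_offset_ms


def lww_winner(write_a, write_b):
    """Each write is (value, timestamp); the larger timestamp survives."""
    return max(write_a, write_b, key=lambda w: w[1])[0]


# Region A writes first, at real time 1000 ms, with an accurate clock.
write_a = ("from-A", stamp(1000, clock_offset_ms=0))
# Region B writes 50 ms LATER in real time, but its clock runs 200 ms slow.
write_b = ("from-B", stamp(1050, clock_offset_ms=-200))

print(lww_winner(write_a, write_b))
```

<p>With the skew in place the earlier write from A survives; set B's offset to 0 and the same two writes resolve in favor of B, which is what an observer would expect.</p>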
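<h3 id="case-3複製延遲下的-stale-read">Case 3: stale reads under replication lag</h3>
<p><strong>Symptom</strong>: right after region A writes, a request hits region B for the same key and reads the old value; a few hundred milliseconds later it reads the new value.</p>
<p><strong>Root cause</strong>: active-active replicates asynchronously, so A's write becomes visible on B only after crossing the network. Inside that replication-lag window B serves a stale value. The window is far larger across regions than within one AZ.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>For paths that need an immediately consistent read after a write, read from the node that took the write (pin read-your-writes to the write region)</li>
<li>Monitor inter-node replication lag; cross-region network latency is the lower bound of the stale window</li>
<li>Accept eventual consistency: it is the nature of active-active, and most cache scenarios tolerate brief staleness</li>
<li>Data that cannot tolerate staleness is a poor fit for active-active; use a single write point plus cross-region read-only replicas</li>
</ol>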
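<p>Pinning read-your-writes to the write region can be sketched as a small router that remembers where a key was last written for a bounded time. Everything here is an assumption for illustration: <code>StickyRouter</code>, its in-process dictionary, and the pin duration are hypothetical; in practice the pin usually travels with the user session, and the duration should exceed the observed replication lag.</p>

```python
import time


class StickyRouter:
    """Pin reads of a recently written key to the region that took the write."""

    def __init__(self, pin_seconds: float, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._pins = {}  # key -> (region, expiry on the injected clock)

    def record_write(self, key: str, region: str) -> None:
        self._pins[key] = (region, self.clock() + self.pin_seconds)

    def region_for_read(self, key: str, nearest: str) -> str:
        pin = self._pins.get(key)
        if pin is not None and pin[1] > self.clock():
            return pin[0]          # still inside the assumed stale window
        self._pins.pop(key, None)  # pin expired; any region may serve the read
        return nearest


fake_now = [0.0]
router = StickyRouter(pin_seconds=0.5, clock=lambda: fake_now[0])
router.record_write("user:1", "region-a")
print(router.region_for_read("user:1", nearest="region-b"))
```

<p>The injected clock keeps the sketch deterministic; once the pin expires the router falls back to the nearest region, accepting eventual consistency for that key.</p>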
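<h3 id="case-4複製拓樸設計錯流量放大或迴圈">Case 4: a badly designed replication topology amplifies traffic or loops</h3>
<p><strong>Symptom</strong>: after adding a third active node to form a ring, inter-node traffic grows abnormally, CPU rises, and the same write may even be relayed repeatedly.</p>
<p><strong>Root cause</strong>: multi-node (&gt; 2) active-active topologies need careful design. In a full mesh every write must be sent to all other nodes, so when every node accepts writes the total replication traffic grows roughly with the square of the node count; in a ring, mishandled origin markers can amplify relaying.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>For multi-node active-active prefer a full mesh but cap the node count (active-active does not suit large node counts)</li>
<li>Monitor inter-node replication traffic; abnormal amplification points to a topology or origin-marker problem</li>
<li>For large multi-region deployments prefer "one write point per region + cross-region read-only" over full active-active</li>
<li>The sweet spot for active-active is bidirectional nearby writes across 2-3 regions, not a large mesh</li>
</ol>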
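<p>The quadratic growth of a full mesh follows from simple counting. The sketch below is arithmetic only, with hypothetical function names: each node feeds every other node, so the number of directed replication links is n(n-1), while a single write is copied n-1 times.</p>

```python
def replication_links(nodes: int) -> int:
    """Directed replication links in a full mesh: every node feeds every other."""
    return nodes * (nodes - 1)


def copies_per_write(nodes: int) -> int:
    """Each accepted write must reach every other node once."""
    return nodes - 1


for n in (2, 3, 5, 8):
    print(n, replication_links(n), copies_per_write(n))
```

<p>Going from 2 to 8 nodes raises the link count from 2 to 56, which is why the node count is worth capping even though the per-write fan-out grows only linearly.</p>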
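<h3 id="case-5節點重連後的全量重同步衝擊">Case 5: full-resync impact after a node reconnects</h3>
<p><strong>Symptom</strong>: a node reconnects after a brief disconnect; at the moment of reconnection CPU and network spike, and latency rises for the duration.</p>
<p><strong>Root cause</strong>: the node was disconnected long enough to exceed what the replication backlog covers, so reconnecting requires a full resync: the peer node must produce a snapshot (fork; see <a href="/blog/backend/02-cache-redis/vendors/redis/persistence-fork-latency/" data-link-title="Redis 持久化與 fork latency：AOF、RDB 與那一次卡住整個 cluster 的 fork" data-link-desc="Redis 的 RDB save 與 AOF rewrite 都靠一次 fork()，而 fork 在大記憶體實例上會凍結主執行緒數百毫秒、複製分頁讓記憶體逼近翻倍。本文展開 AOF / RDB 的機制與 fsync 取捨、copy-on-write 的記憶體放大、5 個把持久化寫成延遲尖峰與資料遺失的 production 踩坑，以及 cache 場景到底要不要持久化的邊界">the fork cost of Redis persistence</a>; KeyDB inherits Redis's fork mechanism) and transfer the entire dataset.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Set <code>repl-backlog-size</code> large enough that a brief disconnect uses partial resync instead of a full one</li>
<li>The fork cost of a resync depends on memory headroom; nodes must reserve room for fork</li>
<li>Monitor <code>master_link_status</code>; frequent down / up flapping means an unstable network, which must be fixed first</li>
<li>Cross-region active-active is sensitive to network stability; an unstable link triggers resyncs frequently</li>
</ol>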
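<p>A rough way to size <code>repl-backlog-size</code> is to multiply the replicated write rate by the longest disconnect you want to survive without a full resync. The sketch below is a back-of-the-envelope estimate: the function name, the 2x safety factor, and the traffic numbers are all assumptions, and the real write rate should come from your own measurements.</p>

```python
def min_backlog_bytes(write_bytes_per_sec: float, outage_seconds: float,
                      safety_factor: float = 2.0) -> int:
    """Backlog needed so a disconnect of `outage_seconds` can still partial-resync."""
    return int(write_bytes_per_sec * outage_seconds * safety_factor)


# Hypothetical numbers: 2 MiB/s of replicated writes, tolerate a 60 s disconnect.
size = min_backlog_bytes(2 * 1024 * 1024, 60)
print(size // (1024 * 1024), "MiB")
```

<p>With these example inputs the estimate comes out to 240 MiB; a longer tolerated outage or a burstier write pattern pushes the figure up proportionally.</p>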
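<h2 id="capacity--cost-邊界">Capacity / cost boundaries</h2>
<p>Capacity assessment for active-active centers on conflict rate and replication health:</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Healthy range</th>
          <th>Warning and action</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Concurrent cross-node write rate on the same key</td>
          <td>Near 0 (keys partitioned by region)</td>
          <td>High → risk of LWW dropping writes; partition keys</td>
      </tr>
      <tr>
          <td>Inter-node clock skew</td>
          <td>&lt; replication lag (millisecond level)</td>
          <td>Large → LWW misjudgment; enforce NTP</td>
      </tr>
      <tr>
          <td>Inter-node replication lag</td>
          <td>Within the stale window acceptable across regions</td>
          <td>Too large → severe stale reads; check the network</td>
      </tr>
      <tr>
          <td><code>master_link_status</code></td>
          <td><code>up</code></td>
          <td>Frequent down → unstable network, triggers resyncs</td>
      </tr>
      <tr>
          <td>Active node count</td>
          <td>2-3 (bidirectional nearby writes)</td>
          <td>Too many → mesh traffic grows quadratically; switch to a single-write-point topology</td>
      </tr>
  </tbody>
</table>
<p>Where to go once you hit the wall:</p>
<ul>
<li><strong>Need correct conflict resolution / cannot lose writes</strong>: LWW does not guarantee this; use strongly consistent storage (the multi-region consistency options in the <a href="/blog/backend/01-database/" data-link-title="模組一：資料庫與持久化" data-link-desc="整理 SQL、transaction、migration 與 repository adapter 的後端實務">database module</a>) or a single-write-point architecture.</li>
<li><strong>Need counter / accumulation semantics with multi-point writes</strong>: LWW makes concurrent INCRs overwrite each other, so KeyDB active-active does not fit; switch to a CRDT or a single-point counter.</li>
<li><strong>Cross-region, but a single write point is acceptable</strong>: use Redis / Valkey one-way replication (one region writes, the others are read-only), which is simpler than active-active and conflict-free.</li>
<li><strong>Large-scale multi-region</strong>: the sweet spot for active-active is 2-3 regions; beyond that use a managed cross-region option (<a href="/blog/backend/02-cache-redis/vendors/aws-elasticache/" data-link-title="AWS ElastiCache" data-link-desc="AWS managed Redis / Valkey / Memcached">ElastiCache Global Datastore</a>, which is active-passive).</li>
</ul>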
<h2 id="整合--下一步">Integration / next steps</h2>
<p>Active-active is one of the core capabilities that sets KeyDB apart from Redis, but its trade-offs span several subsystems:</p>
<ul>
<li><strong>With <a href="/blog/backend/02-cache-redis/vendors/keydb/" data-link-title="KeyDB" data-link-desc="Redis multi-threaded fork、active-replication、Snap 採用">KeyDB overview</a></strong>: the overview notes that active-active is last-write-wins; this article spells out when it silently drops data.</li>
<li><strong>With <a href="/blog/backend/02-cache-redis/vendors/redis/persistence-fork-latency/" data-link-title="Redis 持久化與 fork latency：AOF、RDB 與那一次卡住整個 cluster 的 fork" data-link-desc="Redis 的 RDB save 與 AOF rewrite 都靠一次 fork()，而 fork 在大記憶體實例上會凍結主執行緒數百毫秒、複製分頁讓記憶體逼近翻倍。本文展開 AOF / RDB 的機制與 fsync 取捨、copy-on-write 的記憶體放大、5 個把持久化寫成延遲尖峰與資料遺失的 production 踩坑，以及 cache 場景到底要不要持久化的邊界">Redis persistence / fork latency</a></strong>: KeyDB inherits Redis's fork mechanism, so the full resync after a node reconnects pays the fork cost.</li>
<li><strong>With <a href="/blog/backend/02-cache-redis/cache-copy-freshness-boundary/" data-link-title="2.7 Cache Copy Boundary 與 Freshness" data-link-desc="說明快取何時只是可重建副本，何時會影響交易、權限或配額正確性。">cache copy boundary</a></strong>: the stale window and LWW write loss in active-active are, in essence, the multi-master version of the "freshness and consistency boundary of a cache copy" question.</li>
<li><strong>With <a href="/blog/backend/09-performance-capacity/cases/snap-gcp-keydb-cross-cloud/" data-link-title="9.C35 Snap：GCP &#43; KeyDB 在 multi-cloud 架構下的低延遲快取" data-link-desc="Snap 用 GCP 上的 KeyDB cluster 減少跨 cloud cache 延遲、用 TPU 訓練廣告推薦模型">Snap KeyDB cross-cloud case</a></strong>: Snap's main reason for using KeyDB is cross-cloud latency control (co-locating cache and application); bidirectional nearby writes via active-active are a tool for such multi-cloud scenarios, but keys must be partitioned to avoid LWW conflicts.</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/02-cache-redis/vendors/keydb/" data-link-title="KeyDB" data-link-desc="Redis multi-threaded fork、active-replication、Snap 採用">KeyDB</a></li>
<li>Comparison vendors: <a href="/blog/backend/02-cache-redis/vendors/dragonflydb/shared-nothing-multicore-architecture/" data-link-title="DragonflyDB shared-nothing 多核架構：用 scale-up 取代 Redis Cluster" data-link-desc="Redis 要靠 Cluster 分片才能用滿一台多核機器，DragonflyDB 賭的是相反方向——單一進程 thread-per-core、shared-nothing、把單機推到 Redis 要好幾個 shard 才達到的規模。本文展開 thread-per-core 與 dashtable 的架構、fork-less snapshot、5 個把架構假設寫成 production 事故的踩坑，以及 scale-up 撞牆該回 Cluster 的邊界">DragonflyDB multi-core architecture</a>, <a href="/blog/backend/02-cache-redis/vendors/redis/sentinel-ha-failover/" data-link-title="Redis Sentinel 與 failover 時序：從 master 死掉到 client 重連的每一段" data-link-desc="Redis Sentinel 的 failover 不是一個瞬間動作，是 down 偵測 → quorum 確認 → 選主 → 提升 → 配置廣播 → client 重連的一條時序鏈，每一段都有自己的延遲與失敗模式。本文展開 Sentinel 的判定模型與這條時序、5 個讓 failover 卡住或丟資料的 production 踩坑，以及 Sentinel 撐不住該往 Cluster 或 managed 走的邊界">Redis Sentinel failover</a> (HA for one-way replication)</li>
<li>Upstream concept: <a href="/blog/backend/02-cache-redis/cache-copy-freshness-boundary/" data-link-title="2.7 Cache Copy Boundary 與 Freshness" data-link-desc="說明快取何時只是可重建副本，何時會影響交易、權限或配額正確性。">2.7 cache copy boundary</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor deep-dive technical article writing methodology</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/bdr-multi-master/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/bdr-multi-master/</guid><description>&lt;blockquote>
&lt;p>This article is the implementation-layer deep dive for the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already explains where PG sits in the OLTP landscape; this article focuses on &lt;em>multi-master / active-active replication&lt;/em>, which is not a PG default and requires an extension.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-預設沒-multi-master得用-extension">PG has no multi-master by default; you need an extension&lt;/h2>
&lt;p>PG core is &lt;em>single-primary streaming replication&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>Writes can go only to the primary&lt;/li>
&lt;li>Standbys accept reads (hot_standby) but reject writes&lt;/li>
&lt;li>After failover a new primary takes over; there are never multiple write entry points&lt;/li>
&lt;/ul>
&lt;p>For scenarios that need &lt;em>active-active&lt;/em> (each region accepting local writes), the PG ecosystem offers three extension paths:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Option&lt;/th>
 &lt;th>Source&lt;/th>
 &lt;th>Mechanism&lt;/th>
 &lt;th>License&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>BDR&lt;/strong>&lt;/td>
 &lt;td>EDB (Enterprise)&lt;/td>
 &lt;td>Logical replication-based, bidirectional&lt;/td>
 &lt;td>Commercial (EDB subscription)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>pgEdge&lt;/strong>&lt;/td>
 &lt;td>pgEdge Inc.&lt;/td>
 &lt;td>Open source, built on the Spock extension (derived from BDR)&lt;/td>
 &lt;td>Open source (Spock)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Bucardo&lt;/strong>&lt;/td>
 &lt;td>community&lt;/td>
 &lt;td>Trigger-based, async, written in Perl&lt;/td>
 &lt;td>Open source (BSD)&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Each path has different trade-offs. For the vast majority of PG production cases &lt;em>multi-master is not needed&lt;/em>: single-primary streaming replication plus read-replica scaling is enough. Multi-master is for &lt;em>special requirements&lt;/em> only (cross-region active-active writes / maintenance that cannot interrupt service).&lt;/p>
&lt;p>Compared with &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &amp;#43; certification* 整個機制不同。本文走 GR 機制（GCE &amp;#43; certification &amp;#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication&lt;/a>: MySQL GR is &lt;em>built in officially&lt;/em> (5.7+), while PG has no built-in equivalent. MySQL users can adopt GR / InnoDB Cluster directly; PG users must choose an extension and weigh the license trade-off.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is the implementation-layer deep dive for the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already explains where PG sits in the OLTP landscape; this article focuses on <em>multi-master / active-active replication</em>, which is not a PG default and requires an extension.</p></blockquote>
<hr>
<h2 id="pg-預設沒-multi-master得用-extension">PG has no multi-master by default; you need an extension</h2>
<p>PG core is <em>single-primary streaming replication</em>:</p>
<ul>
<li>Writes can go only to the primary</li>
<li>Standbys accept reads (hot_standby) but reject writes</li>
<li>After failover a new primary takes over; there are never multiple write entry points</li>
</ul>
<p>For scenarios that need <em>active-active</em> (each region accepting local writes), the PG ecosystem offers three extension paths:</p>
<table>
  <thead>
      <tr>
          <th>Option</th>
          <th>Source</th>
          <th>Mechanism</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BDR</strong></td>
          <td>EDB (Enterprise)</td>
          <td>Logical replication-based, bidirectional</td>
          <td>Commercial (EDB subscription)</td>
      </tr>
      <tr>
          <td><strong>pgEdge</strong></td>
          <td>pgEdge Inc.</td>
          <td>Open source, built on the Spock extension (derived from BDR)</td>
          <td>Open source (Spock)</td>
      </tr>
      <tr>
          <td><strong>Bucardo</strong></td>
          <td>community</td>
          <td>Trigger-based, async, written in Perl</td>
          <td>Open source (BSD)</td>
      </tr>
  </tbody>
</table>
<p>Each path has different trade-offs. For the vast majority of PG production cases <em>multi-master is not needed</em>: single-primary streaming replication plus read-replica scaling is enough. Multi-master is for <em>special requirements</em> only (cross-region active-active writes / maintenance that cannot interrupt service).</p>
<p>Compared with <a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a>: MySQL GR is <em>built in officially</em> (5.7+), while PG has no built-in equivalent. MySQL users can adopt GR / InnoDB Cluster directly; PG users must choose an extension and weigh the license trade-off.</p>
<h2 id="multi-master-三方案對比">Multi-master: comparing the three options</h2>
<h3 id="方案-1bdr-edb-postgres-distributed">Option 1: BDR (EDB Postgres Distributed)</h3>
<p>EDB's commercial distributed offering, running on EDB Postgres Advanced Server or community PG.</p>
<p><strong>Characteristics</strong>:</p>
<ul>
<li>Bidirectional logical replication, N-way active-active</li>
<li>Built-in conflict detection + resolution (LWW / column-level / user-defined)</li>
<li>Two modes: eager (sync) and async</li>
<li>Tightly integrated with EDB tooling</li>
</ul>
<p><strong>Trade-off</strong>:</p>
<ul>
<li>Commercial license, EDB subscription</li>
<li>Mature for cross-region multi-master (widely used by North American enterprises)</li>
<li>Typically lags <em>new PG versions</em> by a few months</li>
</ul>
<h3 id="方案-2pgedge基於-spock-extension">Option 2: pgEdge (based on the Spock extension)</h3>
<p>pgEdge is open-source multi-master based on the <em>Spock</em> extension (derived from BDR):</p>
<p><strong>Characteristics</strong>:</p>
<ul>
<li>Open source, self-manageable</li>
<li>Architecture close to BDR, no license fee</li>
<li>Conflict resolution uses LWW + column-level</li>
<li>Designed for <em>edge / geographically distributed</em> scenarios</li>
</ul>
<p><strong>Trade-off</strong>:</p>
<ul>
<li>Newer (2023+), less community validation than BDR</li>
<li>Conflict resolution policies are simpler than BDR's</li>
<li>Some EDB commercial features have no equivalent</li>
</ul>
<h3 id="方案-3bucardo">Option 3: Bucardo</h3>
<p>Community async multi-master for PG, written in Perl, trigger-based:</p>
<p><strong>Characteristics</strong>:</p>
<ul>
<li>Fully open source</li>
<li>Trigger-based (does not depend on logical replication)</li>
<li>Supports multi-source replication (fan-in / fan-out)</li>
</ul>
<p><strong>Trade-off</strong>:</p>
<ul>
<li>Async only: conflicts surface later, after a longer replication delay</li>
<li>Trigger overhead (reduces write throughput on the primary)</li>
<li>Requires maintaining Perl and its toolchain, which is not a widespread skill</li>
<li>Not suitable when <em>synchronous consistency</em> is required</li>
</ul>
<h2 id="multi-master-conflict-model">Multi-Master Conflict Model</h2>
<p>Every multi-master option must resolve the conflict of <em>the same row being changed in two places at once</em>:</p>
<h3 id="conflict-來源">Conflict sources</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Region A (primary 1)          Region B (primary 2)
</span></span><span class="line"><span class="ln">2</span><span class="cl">UPDATE orders                 UPDATE orders
</span></span><span class="line"><span class="ln">3</span><span class="cl">SET status=&#39;shipped&#39;          SET status=&#39;cancelled&#39;
</span></span><span class="line"><span class="ln">4</span><span class="cl">WHERE id=100                  WHERE id=100
</span></span><span class="line"><span class="ln">5</span><span class="cl">     ↓                              ↓
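</span></span><span class="line"><span class="ln">6</span><span class="cl">   Merge? Which one wins?</span></span></code></pre></div><p>The two regions commit independently and the conflict is discovered only during replication lag, after both commits have already returned, so it must be <em>resolved automatically</em> (it cannot be handed back to the application).</p>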
<h3 id="conflict-resolution-strategies">Conflict Resolution Strategies</h3>
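<p><strong>1. Last-Write-Wins (LWW)</strong>, the most common:</p>
<ul>
<li>Compare transaction commit timestamps; the later one wins</li>
<li>Simple, but causes <em>data loss</em> (the earlier commit's changes are overwritten)</li>
<li>Requires <em>clock synchronization</em> (NTP); clock skew makes the outcome unpredictable</li>
</ul>
<p><strong>2. Column-level conflict resolution</strong>:</p>
<ul>
<li>Each column is resolved by LWW independently (the status column and the amount column are resolved separately)</li>
<li>Finer-grained than row-level LWW, but requires cooperation from application semantics</li>
</ul>
<p><strong>3. User-defined trigger</strong>:</p>
<ul>
<li>Write a PG function to resolve the conflict</li>
<li>Useful for <em>special business logic</em> (for example, summing amounts instead of overwriting)</li>
<li>High maintenance cost</li>
</ul>
<p><strong>4. Manual reconciliation</strong>:</p>
<ul>
<li>Conflicts are written to a log table and handled manually by the application / DBA</li>
<li>For scenarios that <em>cannot be resolved automatically</em> (for example, finance)</li>
<li>High ops cost</li>
</ul>
<p>Most deployments use LWW, accept a small amount of data loss, and design application operations to be <em>idempotent / commutative</em> to avoid conflicts.</p>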
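<p>The difference between row-level and column-level LWW is easiest to see on one conflicting row. The sketch below is a simplified model, not the algorithm of BDR or Spock: the dictionary layout, <code>row_lww</code>, and <code>column_lww</code> are hypothetical, and a real implementation tracks per-column change metadata instead of diffing against a base row.</p>

```python
def row_lww(a: dict, b: dict) -> dict:
    """Row-level LWW: the row with the later commit timestamp replaces the other."""
    winner = a if a["ts"] >= b["ts"] else b
    return dict(winner["cols"])


def column_lww(base: dict, a: dict, b: dict) -> dict:
    """Column-level LWW: resolve each changed column independently."""
    merged = dict(base)
    # Apply the older change set first so the newer one wins on overlapping columns.
    for change in sorted((a, b), key=lambda c: c["ts"]):
        for col, val in change["cols"].items():
            if val != base.get(col):
                merged[col] = val
    return merged


base = {"status": "paid", "amount": 100}
region_a = {"ts": 10, "cols": {"status": "shipped", "amount": 100}}  # changed status
region_b = {"ts": 11, "cols": {"status": "paid", "amount": 120}}     # changed amount

print(row_lww(region_a, region_b))
print(column_lww(base, region_a, region_b))
```

<p>Row-level LWW keeps region B's whole row and loses the status change from region A; column-level resolution keeps both changes because they touched different columns. When both regions change the same column, the two strategies lose data in the same way.</p>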
<h3 id="conflict-機率取決於-application-pattern">Conflict probability depends on the application pattern</h3>
<ul>
<li><em>Tenant-isolated</em> applications (each user_id writes only its own rows): essentially no conflicts</li>
<li><em>Shared counter / inventory</em> applications: high conflict rate, multi-master is a poor fit</li>
<li><em>Append-only event log</em>: low conflict rate, a good fit for multi-master</li>
</ul>
<h2 id="配置-step-by-steppgedge-為主">Configuration step by step (primarily pgEdge)</h2>
<p>pgEdge is open source and the most common self-hosted choice.</p>
<h3 id="step-1在每個-region-node-裝-pgedge">Step 1: install pgEdge on each region's node</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Install pgEdge CLI</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">curl -fsSL https://pgedge-upstream.s3.amazonaws.com/REPO/install.py <span class="p">|</span> python3
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Setup PG + Spock + pgEdge</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">./pgedge install pg16
</span></span><span class="line"><span class="ln">6</span><span class="cl">./pgedge install spock</span></span></code></pre></div><h3 id="step-2配置每個-node">Step 2: configure each node</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Run on node1 (us-east)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">node_create</span><span class="p">(</span><span class="n">node_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;node1&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node1.example.com port=5432 dbname=production&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Run on node2 (eu-west)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">node_create</span><span class="p">(</span><span class="n">node_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;node2&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node2.example.com port=5432 dbname=production&#39;</span><span class="p">);</span></span></span></code></pre></div><h3 id="step-3建-replication-set--subscribe">Step 3: Create the replication set and subscribe</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- On node1: add all tables in the public schema to the default replication set
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">repset_add_all_tables</span><span class="p">(</span><span class="s1">&#39;default&#39;</span><span class="p">,</span><span class="w"> </span><span class="k">ARRAY</span><span class="p">[</span><span class="s1">&#39;public&#39;</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- On node1: subscribe to node2
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">sub_create</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">subscription_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sub_n1_n2&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">provider_dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node2.example.com port=5432 dbname=production&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- On node2: subscribe to node1 (bidirectional)
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">sub_create</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">subscription_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sub_n2_n1&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="n">provider_dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node1.example.com port=5432 dbname=production&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><h3 id="step-4設-conflict-resolution">Step 4: Set conflict resolution</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- LWW (the default) is the last_update_wins resolver, which compares commit timestamps.
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1">-- Assumption: your Spock version exposes it as the spock.conflict_resolution setting.
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">track_commit_timestamp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">on</span><span class="p">;</span><span class="w">  </span><span class="c1">-- takes effect after a restart
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">conflict_resolution</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;last_update_wins&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_reload_conf</span><span class="p">();</span></span></span></code></pre></div><h3 id="step-5驗證">Step 5: Verify</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Check subscription status
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">subscription</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Check replication lag
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_replication</span><span class="p">;</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 production pitfalls</h2>
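<h3 id="1-lww-data-loss--application-沒設計-commutative">1. LWW data loss: the application was not designed to be commutative</h3>
<p>With LWW as the default, when two regions UPDATE the same row at the same time, the later commit wins and the earlier one is lost. The application never sees that its write disappeared, which makes this hard to debug.</p>
<p>Fixes:</p>
<ul>
<li>Design the application schema to be <em>tenant-isolated</em> (each user_id writes only its own rows)</li>
<li>For <em>shared counters / inventory</em>, use <em>commutative operations</em> (INCREMENT, not SET)</li>
<li>Add an <em>audit log</em> for important writes: conflicts are still recorded there, so the application can check the audit log to learn that one occurred</li>
<li>If you really need strict consistency, do not use multi-master; use single-primary + readers or distributed SQL</li>
</ul>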
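<p>One way to make a counter commutative under row-level LWW is to give every node its own row and sum at read time. The sketch below assumes that approach; the table <code>page_view_counter</code> and the node labels are hypothetical. It is worth noting that logical replication ships the resulting row image rather than the expression, so <code>views = views + 1</code> on a single shared row is still overwritten by LWW unless the extension offers a delta or CRDT column type.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical per-node counter: each node only ever writes its own row.
CREATE TABLE page_view_counter (
    page_id   bigint NOT NULL,
    node_name text   NOT NULL,          -- the node that owns this row
    views     bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (page_id, node_name)
);

-- On node1: increment node1&#39;s row (node2 runs the same statement with &#39;node2&#39;).
INSERT INTO page_view_counter (page_id, node_name, views)
VALUES (42, &#39;node1&#39;, 1)
ON CONFLICT (page_id, node_name)
DO UPDATE SET views = page_view_counter.views + 1;

-- On any node: the total is the sum over all nodes&#39; rows.
SELECT page_id, sum(views) AS total_views
FROM page_view_counter
WHERE page_id = 42
GROUP BY page_id;
</code></pre></div>
<p>The trade-off is one row per node per counter and an aggregate on every read, in exchange for increments that never conflict.</p>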
<h3 id="2-sequence-collision--two-region-各自-next-同號">2. Sequence collision: two regions each draw the same next value</h3>
<p><code>SERIAL</code> / <code>IDENTITY</code> columns are backed by sequences, and sequence state is local to each node. Two regions calling nextval independently can get the same number, and the INSERTs then conflict (duplicate PK).</p>
<p>Fixes:</p>
<ul>
<li>Use <em>staggered sequence ranges</em>: node1 uses 1 to 1M, node2 uses 1M+1 to 2M (set with <code>setval</code>)</li>
<li>Or use <em>UUIDs</em> (v4 / v7) as the PK, which do not collide across nodes</li>
<li>Or use a <em>per-node sequence namespace</em>: <code>CREATE SEQUENCE orders_id_node1 START 1 INCREMENT 2</code> (odd vs even)</li>
</ul>
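<p>The odd/even variant can be sketched as follows. The sequence and table names (<code>orders_id_seq</code>, <code>orders</code>, <code>orders_by_uuid</code>) are hypothetical, each <code>CREATE SEQUENCE</code> is meant to run on one node only, and the UUID alternative assumes PostgreSQL 13 or later for <code>gen_random_uuid()</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on node1 only: odd ids (1, 3, 5, ...)
CREATE SEQUENCE orders_id_seq START WITH 1 INCREMENT BY 2;

-- Run on node2 only: even ids (2, 4, 6, ...)
-- CREATE SEQUENCE orders_id_seq START WITH 2 INCREMENT BY 2;

-- Run on both nodes: the table draws from the node-local sequence.
CREATE TABLE orders (
    id          bigint PRIMARY KEY DEFAULT nextval(&#39;orders_id_seq&#39;),
    customer_id bigint NOT NULL
);

-- Alternative without sequences: a UUID primary key.
CREATE TABLE orders_by_uuid (
    id          uuid   PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id bigint NOT NULL
);
</code></pre></div>
<p>With more than two nodes the increment becomes the node count and each node gets its own start offset, which is why adding a node later is easier with UUIDs.</p>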
<h3 id="3-ddl-replication-不自動">3. DDL is not replicated automatically</h3>
<p>PG logical replication (the foundation of pgEdge / BDR) <em>does not replicate DDL automatically</em>. <code>CREATE TABLE</code> / <code>ALTER TABLE</code> must be <em>run separately</em> on every node.</p>
<p>Fixes:</p>
<ul>
<li>Use <em>deployment automation</em> (Ansible / Terraform) to run DDL against all nodes at the same time</li>
<li>pgEdge provides <code>spock.replicate_ddl(...)</code>, which turns DDL into a replicable event</li>
<li>BDR Enterprise has <em>DDL replication</em> (a commercial feature)</li>
<li>Before a DDL change, confirm that <em>all nodes are healthy</em> to reduce the chance of partial state</li>
</ul>
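<p>A sketch of the <code>spock.replicate_ddl(...)</code> route, assuming the function accepts the DDL statement as a single text argument (check the signature for your Spock version). The table <code>public.orders</code> and the column <code>note</code> are hypothetical; the object is schema-qualified so the statement does not depend on <code>search_path</code> on the receiving node.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on one node: apply the DDL locally and queue it for the other nodes.
SELECT spock.replicate_ddl(&#39;ALTER TABLE public.orders ADD COLUMN note text&#39;);

-- Run on every node afterwards: confirm the column arrived.
SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = &#39;public&#39;
  AND table_name   = &#39;orders&#39;
  AND column_name  = &#39;note&#39;;
</code></pre></div>
<p>If the check returns no row on some node, that node is in the partial state the last bullet warns about and should be fixed before more writes depend on the new column.</p>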
<h3 id="4-conflict-log-治理--log-table-爆滿">4. Conflict log governance: the log table fills up</h3>
<p>Every conflict is written to a table such as <code>spock.resolutions</code> / <code>bdr.conflict_history</code>, and the accumulated log can fill the disk.</p>
<p>Fixes:</p>
<ul>
<li>Set <em>log retention</em>: a cron job periodically archives and deletes old conflict log rows</li>
<li>Monitor the conflict rate: a high conflict rate is an application design problem, not an ops problem</li>
<li>For <em>business-critical</em> conflicts, also write to an application-level audit table rather than relying on the system log alone</li>
</ul>
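<p>The retention and monitoring steps might look like the following for an application-level audit table. Everything here is an assumption for illustration: the tables <code>app_conflict_audit</code> and <code>app_conflict_audit_archive</code>, their columns, and the 30-day window. The extension's own conflict table has its own column names, so adapt before pointing this at it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical application-level audit table (UUID key, so inserts never collide).
CREATE TABLE app_conflict_audit (
    id         uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    table_name text        NOT NULL,
    row_key    text        NOT NULL,
    detail     jsonb       NOT NULL,
    logged_at  timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE app_conflict_audit_archive (LIKE app_conflict_audit INCLUDING DEFAULTS);

-- From cron: move rows older than 30 days into the archive in one statement.
WITH moved AS (
    DELETE FROM app_conflict_audit
    WHERE logged_at &lt; now() - interval &#39;30 days&#39;
    RETURNING *
)
INSERT INTO app_conflict_audit_archive
SELECT * FROM moved;

-- For alerting: conflicts per hour over the last day.
SELECT date_trunc(&#39;hour&#39;, logged_at) AS hour, count(*) AS conflicts
FROM app_conflict_audit
WHERE logged_at &gt;= now() - interval &#39;1 day&#39;
GROUP BY 1
ORDER BY 1;
</code></pre></div>
<p>Doing the delete and the archive insert in a single statement keeps them atomic, so a failed run does not lose rows.</p>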
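<h3 id="5-failover-後-timeline-分歧">5. Timeline divergence after failover</h3>
<p>By design, <em>every region is a primary</em> in multi-master. When Region A goes down, Region B takes over, but when Region A comes back it <em>still considers itself a primary</em>. If Region A had writes that were not replicated before it went down, they are replayed on recovery and resolved by LWW against whatever Region B accepted in the meantime, so either side can be silently overwritten.</p>
<p>Fixes:</p>
<ul>
<li><em>Fence Region A when it recovers</em>: a physical fence (network firewall) plus a manual unfence procedure</li>
<li>Integrate <em>etcd / Consul</em> with BDR / Spock for leader election (to avoid split-brain)</li>
<li>Cross-region multi-master must have a <em>runbook</em> for region recovery rather than relying on automation</li>
</ul>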
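<p>One check such a runbook could include, before unfencing the recovered region, is how much locally written WAL its peers have not yet confirmed. The query below is a sketch using the standard <code>pg_replication_slots</code> view; which slots belong to Spock or BDR on a given node is an assumption to verify locally.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on the recovered node (Region A) while it is still fenced.
-- A non-zero gap means local writes that the peer has not confirmed yet.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS not_yet_confirmed
FROM pg_replication_slots
WHERE slot_type = &#39;logical&#39;;
</code></pre></div>
<p>A large gap is a signal to review those pending writes by hand before letting LWW decide their fate.</p>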
<h2 id="何時用-multi-master-vs-不用">When to use multi-master and when not to</h2>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Recommendation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>A genuine need for cross-region active-active writes</td>
          <td>BDR / pgEdge</td>
      </tr>
      <tr>
          <td>Maintenance that cannot interrupt service (zero-downtime upgrade)</td>
          <td>BDR / pgEdge</td>
      </tr>
      <tr>
          <td>High conflict rate (shared counter / inventory)</td>
          <td>Avoid multi-master; use distributed SQL</td>
      </tr>
      <tr>
          <td>Mainly read scaling, stale reads acceptable</td>
          <td>Streaming replication + read replicas (simpler)</td>
      </tr>
      <tr>
          <td>Strict consistency required</td>
          <td>Single-primary + sync replication, or Aurora DSQL / Spanner</td>
      </tr>
      <tr>
          <td>Budget-sensitive and unwilling to run BDR / pgEdge ops</td>
          <td>Avoid multi-master; use managed distributed SQL</td>
      </tr>
  </tbody>
</table>
<h2 id="跟-mysql-group-replication-對比">Comparison with MySQL Group Replication</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PG Multi-Master</th>
          <th>MySQL Group Replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Built in?</td>
          <td>No, requires an extension</td>
          <td>Yes, built in since 5.7</td>
      </tr>
      <tr>
          <td>Commercial vs open source</td>
          <td>BDR commercial / pgEdge open source</td>
          <td>Available in both Oracle commercial and community editions</td>
      </tr>
      <tr>
          <td>Sync mode</td>
          <td>Optional (BDR eager)</td>
          <td>Yes (certification-based)</td>
      </tr>
      <tr>
          <td>Conflict resolution</td>
          <td>LWW / column / user-defined</td>
          <td>Certification-based (distributed transaction)</td>
      </tr>
      <tr>
          <td>Production maturity</td>
          <td>BDR high, pgEdge medium</td>
          <td>High (promoted by Oracle)</td>
      </tr>
      <tr>
          <td>Share of use cases</td>
          <td>Small (PG mostly runs single-primary)</td>
          <td>Larger (MySQL promotes InnoDB Cluster)</td>
      </tr>
  </tbody>
</table>
<p>MySQL GR is built in and promoted by Oracle, while PG has no built-in equivalent. For an org with heavy multi-master requirements, the MySQL GR path is more direct.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-replication-topology">With Replication Topology</h3>
<p>Multi-master is <em>bidirectional logical replication running alongside streaming replication</em>; it does not replace streaming. Streaming still serves standby / failover, while multi-master serves active-active writes. See <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>.</p>
<h3 id="跟-logical-replication">With Logical Replication</h3>
<p>pgEdge / BDR are both built on logical replication slots and share PG's logical decoding infrastructure with <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>, but their <em>configuration + tooling</em> differ.</p>
<h3 id="跟-mvcc">With MVCC</h3>
<p>Multi-master conflicts are detected <em>after commit</em> (async), not inside the transaction. This is a different layer from single-node MVCC (transaction snapshots within one cluster). See <a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a> (streaming and multi-master coexisting)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a> (logical decoding basics)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a> (multi-master conflict vs single-node MVCC)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PG Patroni HA</a> (the single-primary HA alternative)</li>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 Globally Distributed OLTP</a> (multi-master vs distributed SQL)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a> (a sibling with a different implementation)</li>
<li>Official: <a href="https://www.enterprisedb.com/products/edb-postgres-distributed-bdr">EDB BDR</a> / <a href="https://www.pgedge.com/">pgEdge</a> / <a href="https://github.com/pgEdge/spock">Spock GitHub</a> / <a href="https://bucardo.org/">Bucardo</a></li>
</ul>
]]></content:encoded></item><item><title>Cosmos DB Multi-Region Write：active-active、LWW、custom merge、Strong + multi-region 互斥的 AP 取捨</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/multi-region-write-conflict/</link><pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/multi-region-write-conflict/</guid><description>&lt;p>Cosmos DB 是 &lt;em>AP 系統&lt;/em>（&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/cap/" data-link-title="CAP Theorem" data-link-desc="分散式系統在網路分區時一致性與可用性的取捨框架">CAP&lt;/a> 三選二、放棄跨 region linearizability 換取 multi-region write 可用性）。跨 region 寫同一筆 document 必然有 conflict、Cosmos DB 提供三種 resolution policy 處理：LWW（Last-Writer-Wins）、custom merge stored procedure、conflict feed manual reconciliation。本文先講 AP 取捨的硬約束（為什麼 Strong consistency 跟 multi-region write 互斥）、再進三種 resolution 機制、再進廣告 SLA vs 實測可用性的鏈路拆解（DB 端 SLA 不等於使用者體驗）。&lt;/p>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor 頁&lt;/a> 的深度展開、也是 &lt;em>Strong + multi-region 互斥&lt;/em> 議題的 SSoT 主寫位置（&lt;a href="../consistency-levels-engineering/">consistency-levels-engineering&lt;/a> cross-link 過來、不展開）。Case anchor 是 &lt;a href="https://tarrragon.github.io/blog/backend/09-performance-capacity/cases/minecraft-earth-cosmos-db-global/" data-link-title="9.C11 Minecraft Earth：Azure Cosmos DB 上的全球分散式 AR 遊戲" data-link-desc="Minecraft Earth 用 Cosmos DB 跨地區分散、測試到 100 萬 RU/s 仍維持承諾延遲">9.C11 Minecraft Earth&lt;/a>（AR 遊戲跨 region 寫入、5 consistency level + multi-region SLA）+ &lt;a href="https://tarrragon.github.io/blog/backend/09-performance-capacity/cases/asos-cosmos-db-black-friday/" data-link-title="9.C21 ASOS：Cosmos DB 在 Black Friday 撐 1.67 億請求" data-link-desc="ASOS 在 2016 Black Friday 用 Azure Cosmos DB 撐 24 小時 1.67 億請求、3500 req/sec、48ms 平均延遲">9.C21 ASOS&lt;/a>（Black Friday 全球零售）+ &lt;a href="https://tarrragon.github.io/blog/backend/09-performance-capacity/cases/toyota-connected-mongodb-telematics-iot/" data-link-title="9.C38 Toyota Connected：MongoDB Atlas 撐 900 萬車輛 telematics、月 180 億 transaction" data-link-desc="Toyota Connected 用 MongoDB Atlas 撐 Safety Connect 900 萬車、月 180 億 transaction、緊急訊號 3 秒內到 agent">9.C38 Toyota Connected&lt;/a>（鏈路 SLA 拆解、跨 vendor 適用做 frame anchor）。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Cosmos DB 適用度前置判讀&lt;/strong>：本篇假設 workload 已通過 Cosmos DB 適用度四層 framing（API model 三型遷移路徑 / RU 思維轉換成本 / multi-model 差異化是否真用上 / 跨雲 hedging vs 單雲 lock-in）— 詳見 &lt;a href="../mongodb-api-vs-sql-api/#%e5%9b%9b%e5%b1%a4-framingvendor-selection-%e7%9a%84%e7%9c%9f%e5%af%a6%e6%b1%ba%e7%ad%96%e8%bb%b8">mongodb-api-vs-sql-api 開頭四層 framing&lt;/a>、本篇不重複展開。Multi-region write + conflict resolution 是 &lt;em>已選 Cosmos DB 後&lt;/em> 的拓樸決策；strong global consistency 必要的 workload 應走 Spanner 或 Cosmos DB Strong（單一 write region）、不是用 LWW 補。&lt;/p></description><content:encoded><![CDATA[<p>Cosmos DB is an <em>AP system</em> (in <a href="/blog/backend/knowledge-cards/cap/" data-link-title="CAP Theorem" data-link-desc="分散式系統在網路分區時一致性與可用性的取捨框架">CAP</a> terms it picks two of three, giving up cross-region linearizability in exchange for multi-region write availability). Writing the same document from multiple regions inevitably produces conflicts, and Cosmos DB offers three resolution policies to handle them: LWW (Last-Writer-Wins), custom merge stored procedure, and conflict feed manual reconciliation. This article first covers the hard constraint of the AP trade-off (why Strong consistency and multi-region write are mutually exclusive), then the three resolution mechanisms, then a link-by-link breakdown of advertised SLA vs measured availability (the DB-side SLA is not the same as the user experience).</p>
<p>This article is a deep dive from the <a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor page</a> and is also the SSoT (primary location) for the <em>Strong + multi-region mutual exclusion</em> topic (<a href="../consistency-levels-engineering/">consistency-levels-engineering</a> cross-links here and does not expand on it). The case anchors are <a href="/blog/backend/09-performance-capacity/cases/minecraft-earth-cosmos-db-global/" data-link-title="9.C11 Minecraft Earth：Azure Cosmos DB 上的全球分散式 AR 遊戲" data-link-desc="Minecraft Earth 用 Cosmos DB 跨地區分散、測試到 100 萬 RU/s 仍維持承諾延遲">9.C11 Minecraft Earth</a> (an AR game with cross-region writes, 5 consistency levels + multi-region SLA), <a href="/blog/backend/09-performance-capacity/cases/asos-cosmos-db-black-friday/" data-link-title="9.C21 ASOS：Cosmos DB 在 Black Friday 撐 1.67 億請求" data-link-desc="ASOS 在 2016 Black Friday 用 Azure Cosmos DB 撐 24 小時 1.67 億請求、3500 req/sec、48ms 平均延遲">9.C21 ASOS</a> (Black Friday global retail), and <a href="/blog/backend/09-performance-capacity/cases/toyota-connected-mongodb-telematics-iot/" data-link-title="9.C38 Toyota Connected：MongoDB Atlas 撐 900 萬車輛 telematics、月 180 億 transaction" data-link-desc="Toyota Connected 用 MongoDB Atlas 撐 Safety Connect 900 萬車、月 180 億 transaction、緊急訊號 3 秒內到 agent">9.C38 Toyota Connected</a> (a link-by-link SLA breakdown that applies across vendors, used as the frame anchor).</p>
<blockquote>
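<p><strong>Cosmos DB fit pre-check</strong>: this article assumes the workload has already passed the four-layer Cosmos DB fit framing (three migration paths by API model / the cost of switching to RU thinking / whether multi-model differentiation is actually used / cross-cloud hedging vs single-cloud lock-in). See <a href="../mongodb-api-vs-sql-api/#%e5%9b%9b%e5%b1%a4-framingvendor-selection-%e7%9a%84%e7%9c%9f%e5%af%a6%e6%b1%ba%e7%ad%96%e8%bb%b8">the four-layer framing at the start of mongodb-api-vs-sql-api</a>; it is not repeated here. Multi-region write + conflict resolution is a topology decision made <em>after Cosmos DB has been chosen</em>; workloads that require strong global consistency should go to Spanner or Cosmos DB Strong (single write region) rather than patching the gap with LWW.</p></blockquote>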
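<h2 id="問題情境active-active-的-conflict-是必然代價">Problem context: conflicts are the unavoidable cost of active-active</h2>
<p>Typical trigger: the product wants global active-active (every region can write, with low latency). Cosmos DB is an AP system and, unlike Spanner, does not use a quorum for strong consistency. Writing the same document from multiple regions inevitably produces conflicts, and the team does not know "when a conflict really happens, who wins, how it is handled, and whether business semantics survive".</p>
<p>Symptoms readers report:</p>
<ul>
<li>"With multi-region write on, a user writes 'add to cart' in region A and 'remove from cart' in region B. Which one wins in the end?"</li>
<li>"LWW decides by timestamp. Doesn't client clock skew break that?"</li>
<li>"What is the conflict feed, and do we need to consume it?"</li>
<li>"Once multi-region write is on, can the consistency level still be set to Strong?"</li>
<li>"The marketing says 99.999%. Why do we measure only 99%?"</li>
</ul>
<p>Real-world pressure: shopping cart writes lost across regions, game player state rolled back by cross-region conflicts, IoT device telemetry that vanishes after cross-region writes. The root cause of these incidents is not a bug but the <em>design trade-off</em> of multi-region write, so the conflict resolution policy needs to be decided at the selection stage.</p>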
<h2 id="核心機制">Core mechanisms</h2>
<h3 id="ap-取捨的硬約束為什麼-strong--multi-region-write-互斥">The hard constraint of the AP trade-off: why Strong and multi-region write are mutually exclusive</h3>
<p>Cosmos DB is an AP system (under a partition it chooses availability and partition tolerance, giving up cross-region linearizability). Multi-region write has two prerequisites:</p>
<ul>
<li>The account has <code>enableMultipleWriteLocations = true</code></li>
<li>The consistency level <em>cannot be Strong</em> (multi-region write and Strong are mutually exclusive; this is a time-sensitive claim, so check the <a href="https://learn.microsoft.com/azure/cosmos-db/consistency-levels">latest documentation</a>)</li>
</ul>
<p>Why they are mutually exclusive (the hard pick-two constraint of CAP):</p>
<ul>
<li><strong>Strong consistency</strong> in Cosmos DB is implemented as quorum-based linearizable reads: guaranteeing that a read sees the latest commit requires a <em>single write region</em> to fix the write order</li>
<li><strong>Multi-region write</strong> is active-active, so every region can write: there is no "single write region", and writes are LWW-based eventual consistency</li>
<li>The two <em>cannot hold at the same time</em>: this is not a Microsoft engineering choice but a basic limit of distributed systems (a different design path from Spanner's Paxos quorum + TrueTime)</li>
</ul>
<p>What this means for selection: if the product needs "writable everywhere", accept eventual consistency; if it needs "globally linearizable", move to Spanner / Aurora DSQL, for which Cosmos DB is not a substitute. Treating Cosmos DB Strong as equivalent to Spanner external consistency is a <em>common selection mistake</em>.</p>
<p>The Strong section of <a href="../consistency-levels-engineering/">consistency-levels-engineering</a> only cross-links here and does not expand on conflict resolution details; this article is the SSoT.</p>
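<h3 id="conflict-偵測">Conflict detection</h3>
<p>When the same document (partition key + id) is written concurrently in multiple regions, Cosmos DB detects a conflict. Detection is based on the LSN (log sequence number), not the timestamp: when two regions write the same document, replication compares LSNs, finds the divergence, and enters resolution.</p>
<h3 id="三種-conflict-resolution-policy">Three conflict resolution policies</h3>
<h4 id="lwwlast-writer-wins預設">LWW (Last-Writer-Wins, default)</h4>
<ul>
<li>Mechanism: uses <code>_ts</code> (system timestamp) or a custom numeric property; the larger value wins</li>
<li>Side effect: millisecond-level clock skew is enough for the earlier write to win instead, which opens a hole in business logic</li>
<li>Good for: pure overwrite scenarios (latest player position, latest IoT reading), where write order does not affect business semantics</li>
</ul>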





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="s2">&#34;conflictResolutionPolicy&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nt">&#34;mode&#34;</span><span class="p">:</span> <span class="s2">&#34;LastWriterWins&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nt">&#34;conflictResolutionPath&#34;</span><span class="p">:</span> <span class="s2">&#34;/customTimestamp&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><h4 id="custom-merge-stored-procedure">Custom merge stored procedure</h4>
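<ul>
<li>Mechanism: write a JavaScript stored proc; on conflict Cosmos DB calls it, and the proc returns the merged result</li>
<li>Good for: scenarios that must preserve business semantics (cart merge = union of both sides' items, counter merge = sum, state machine merge = state graph rules)</li>
<li>Risk: the stored proc runs in the Cosmos DB JavaScript runtime with timeout / RU limits; complex merge logic is hard to debug</li>
</ul>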





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="s2">&#34;conflictResolutionPolicy&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nt">&#34;mode&#34;</span><span class="p">:</span> <span class="s2">&#34;Custom&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nt">&#34;conflictResolutionProcedure&#34;</span><span class="p">:</span> <span class="s2">&#34;dbs/mydb/colls/mycoll/sprocs/resolveCart&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><h4 id="conflict-feed-manual-reconciliation">Conflict feed manual reconciliation</h4>
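<ul>
<li>Mechanism: Cosmos DB writes the conflict to the conflict feed without resolving it; the app consumes the feed and reconciles</li>
<li>Good for: conflicts that need human or business-process judgment and cannot be auto-resolved (financial transactions, compliance)</li>
<li>Risk: an unconsumed feed accumulates and skews later analysis; the app must implement the reconcile flow</li>
</ul>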





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="s2">&#34;conflictResolutionPolicy&#34;</span><span class="err">:</span> <span class="p">{</span> <span class="nt">&#34;mode&#34;</span><span class="p">:</span> <span class="s2">&#34;Custom&#34;</span> <span class="p">}</span></span></span></code></pre></div><p>(With no procedure specified, all conflicts go to the feed, and the app consumes them through the SDK <code>ReadConflictsAsync()</code> / a Change Feed Processor pattern.)</p>
<h3 id="跟其他-vendor-對比">Comparison with other vendors</h3>
<ul>
<li><strong>DynamoDB Global Tables</strong>: also LWW, with <em>no</em> custom merge and <em>no</em> conflict feed; simpler behavior than Cosmos DB but less flexible</li>
<li><strong>Spanner</strong>: uses Paxos quorum, so there are <em>no conflicts</em> (a CP system that trades availability for consistency); cross-region writes need a quorum, with latency of 100-200ms</li>
<li><strong>Aurora Global Database</strong>: single-primary (one region writes, the others read); not true multi-region write, so no conflicts</li>
</ul>
<p>Related knowledge cards: <a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">stale-read</a>, <a href="/blog/backend/knowledge-cards/rpo/" data-link-title="RPO" data-link-desc="說明恢復點目標如何定義可接受資料損失範圍">rpo</a>, <a href="/blog/backend/knowledge-cards/rto/" data-link-title="RTO" data-link-desc="說明恢復時間目標如何約束事故回復策略">rto</a>.</p>
<h2 id="操作流程">Operational workflow</h2>
<h3 id="開啟-multi-region-write">Enable multi-region write</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">az cosmosdb update --name mycosmos --resource-group myrg <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --enable-multiple-write-locations <span class="nb">true</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --locations <span class="nv">regionName</span><span class="o">=</span>eastus <span class="nv">failoverPriority</span><span class="o">=</span><span class="m">0</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --locations <span class="nv">regionName</span><span class="o">=</span>westeurope <span class="nv">failoverPriority</span><span class="o">=</span><span class="m">1</span></span></span></code></pre></div><p>Once enabled, it <em>cannot simply be switched back</em>: you have to disable it, change the region configuration, and re-enable it, which involves a downtime window.</p>
<h3 id="設定-lww-policycontainer-層">Set the LWW policy (container level)</h3>
<p>Specify it when creating the container. It can be changed later, but conflict behavior follows the new policy (existing conflicts are not re-resolved). The default compares <code>_ts</code>; if you switch to customTimestamp, the application must write timestamps from a <em>monotonically increasing</em> source (not the client clock).</p>
<h3 id="設定-custom-merge">Set up custom merge</h3>
<p>Create the stored proc:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="ln">1</span><span class="cl"><span class="kd">function</span> <span class="nx">resolveCart</span><span class="p">(</span><span class="nx">incomingItem</span><span class="p">,</span> <span class="nx">existingItem</span><span class="p">,</span> <span class="nx">isTombstone</span><span class="p">,</span> <span class="nx">conflictingItems</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="c1">// Example: merge the cart items (take the union); mergeArrays is a helper you supply
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span>  <span class="kd">var</span> <span class="nx">merged</span> <span class="o">=</span> <span class="nx">existingItem</span><span class="p">;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  <span class="nx">merged</span><span class="p">.</span><span class="nx">items</span> <span class="o">=</span> <span class="nx">mergeArrays</span><span class="p">(</span><span class="nx">existingItem</span><span class="p">.</span><span class="nx">items</span><span class="p">,</span> <span class="nx">incomingItem</span><span class="p">.</span><span class="nx">items</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">  <span class="nx">merged</span><span class="p">.</span><span class="nx">_ts</span> <span class="o">=</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">max</span><span class="p">(</span><span class="nx">existingItem</span><span class="p">.</span><span class="nx">_ts</span><span class="p">,</span> <span class="nx">incomingItem</span><span class="p">.</span><span class="nx">_ts</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">  <span class="nx">__</span><span class="p">.</span><span class="nx">response</span><span class="p">.</span><span class="nx">setBody</span><span class="p">(</span><span class="nx">merged</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="p">}</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln">1</span><span class="cl"><span class="s2">&#34;conflictResolutionPolicy&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nt">&#34;mode&#34;</span><span class="p">:</span> <span class="s2">&#34;Custom&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nt">&#34;conflictResolutionProcedure&#34;</span><span class="p">:</span> <span class="s2">&#34;dbs/mydb/colls/mycoll/sprocs/resolveCart&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>Verify: handle timeouts and exceptions inside the procedure; test edge cases (empty array, null, concurrent writes from 3+ regions).</p>
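<p>The stored procedure above calls <code>mergeArrays</code> without defining it. A minimal sketch of such a helper follows, assuming each cart item carries a <code>sku</code> and a <code>qty</code> field and that the larger quantity should win when both sides hold the same SKU. Both the field names and the tie-break rule are illustrative assumptions, not something the original procedure specifies.</p>

```javascript
// Hypothetical helper: union two cart item arrays keyed by `sku` (assumed field).
// If both sides contain the same sku, keep the entry with the larger `qty`
// (an assumed business rule; adjust to your own semantics).
function mergeArrays(existingItems, incomingItems) {
  const bySku = new Map();
  const all = [...(existingItems || []), ...(incomingItems || [])];
  for (const item of all) {
    if (!item || item.sku == null) continue; // skip malformed entries
    const seen = bySku.get(item.sku);
    if (!seen || (item.qty || 0) > (seen.qty || 0)) {
      bySku.set(item.sku, item);
    }
  }
  return Array.from(bySku.values());
}
```

<p>Null or missing arrays are treated as empty, so the edge cases listed above (empty array, null) return an empty or one-sided result instead of throwing.</p>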
<h3 id="消費-conflict-feed">Consuming the conflict feed</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-csharp" data-lang="csharp"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">// .NET SDK</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="kt">var</span> <span class="n">iterator</span> <span class="p">=</span> <span class="n">container</span><span class="p">.</span><span class="n">Conflicts</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">    <span class="p">.</span><span class="n">GetConflictQueryIterator</span><span class="p">&lt;</span><span class="n">ConflictProperties</span><span class="p">&gt;();</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="k">while</span> <span class="p">(</span><span class="n">iterator</span><span class="p">.</span><span class="n">HasMoreResults</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">    <span class="kt">var</span> <span class="n">response</span> <span class="p">=</span> <span class="k">await</span> <span class="n">iterator</span><span class="p">.</span><span class="n">ReadNextAsync</span><span class="p">();</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">    <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">conflict</span> <span class="k">in</span> <span class="n">response</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">        <span class="k">await</span> <span class="n">ProcessConflict</span><span class="p">(</span><span class="n">conflict</span><span class="p">);</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>Following the Change Feed Processor pattern, consume the conflict feed as a stream, write each conflict to a reconcile queue, and let the business workflow handle it.</p>
<h3 id="驗證點">Verification points</h3>
<ul>
<li>Cross-region concurrent write test (synthetic load); observe conflict count and resolution results</li>
<li>Custom merge stored procedure exercised on edge cases (exception / null / 3+ concurrent writers)</li>
<li>Conflict feed does not back up (lag &lt; 5 min)</li>
<li>The application can still write during a region outage (active-active design, no manual failover needed)</li>
</ul>
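<h2 id="失敗模式">Failure modes</h2>
<h3 id="failure-1全用-lww--用-server-timestamp">Failure 1: LWW everywhere, driven by the server timestamp</h3>
<p>Millisecond-level clock skew can let the earlier write win, which breaks business logic. Common symptom: users report "I clearly confirmed first, yet my later change is the one that ended up stale", and debugging eventually traces it to cross-region clock skew.</p>
<p>Fix:</p>
<ul>
<li>Feed <code>customTimestamp</code> from a monotonic source on the application side (for example Snowflake ID, HLC, or Lamport clock)</li>
<li>Or switch to a custom merge stored procedure and let business logic, not a timestamp, decide the winner</li>
<li>Or split collections: high-conflict collections use a stored procedure, low-conflict ones use LWW</li>
</ul>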
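<p>To make the first fix concrete, here is a minimal hybrid logical clock (HLC) sketch. The names <code>createHlc</code>, <code>tick</code>, <code>observe</code> and <code>toNumber</code> are hypothetical, and packing the result into one number assumes the conflict-resolution property is numeric and that the logical counter stays below 1000. Treat it as an illustration of the idea, not as a production clock.</p>

```javascript
// Minimal HLC sketch (hypothetical names). `nowMs` is injected so the clock
// source can be swapped or faked in tests.
function createHlc(nowMs) {
  let physical = 0; // highest wall-clock millisecond seen so far
  let logical = 0;  // tie-breaker when the wall clock stalls or goes backwards

  return {
    // Call before each local write.
    tick() {
      const wall = nowMs();
      if (wall > physical) {
        physical = wall;
        logical = 0;
      } else {
        logical += 1; // wall clock did not advance: bump the counter instead
      }
      return { physical, logical };
    },
    // Call when a timestamp written by another node is observed.
    observe(remote) {
      const wall = nowMs();
      const maxPhysical = Math.max(wall, physical, remote.physical);
      if (maxPhysical === physical && maxPhysical === remote.physical) {
        logical = Math.max(logical, remote.logical) + 1;
      } else if (maxPhysical === physical) {
        logical += 1;
      } else if (maxPhysical === remote.physical) {
        logical = remote.logical + 1;
      } else {
        logical = 0;
      }
      physical = maxPhysical;
      return { physical, logical };
    },
  };
}

// Pack into one number for a numeric timestamp property.
// Assumption: logical < 1000, which keeps the value inside the safe integer range.
function toNumber(ts) {
  return ts.physical * 1000 + ts.logical;
}
```

<p>The value stays monotonic on each node even when the local clock steps backwards, and it moves past any remote timestamp the node has observed. Two writes that never observe each other can still be ordered arbitrarily, so this narrows the skew problem rather than removing it.</p>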
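<h3 id="failure-2業務語義不適合-lww">Failure 2: business semantics that do not fit LWW</h3>
<p>Putting shopping carts (which need a union), counters (which need a sum), and status machines (which need a state graph) all on LWW means <em>data loss</em>. LWW assumes the latest write is the correct answer, but many business semantics are not overwrite relationships.</p>
<p>Fix: inventory the business semantics of each collection and choose the matching resolution policy:</p>
<ul>
<li>Overwrite relationship → LWW</li>
<li>Accumulative relationship → custom merge stored procedure (union / sum / set merge)</li>
<li>State machine → custom merge stored procedure (resolve according to the state-graph rules)</li>
<li>Needs human adjudication → conflict feed</li>
</ul>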
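<p>For the state-machine case, a resolver can pick whichever version is further along the state graph instead of whichever is newer. The sketch below assumes a linear, hypothetical order lifecycle (<code>created → paid → shipped → delivered</code>) and a <code>status</code> field; a real state graph with branches or cancellations would need explicit transition rules.</p>

```javascript
// Hypothetical linear lifecycle: a higher rank means further along.
const STATUS_RANK = { created: 0, paid: 1, shipped: 2, delivered: 3 };

// Rank of a document's status; unknown or missing statuses rank below everything.
function rankOf(doc) {
  const known = doc && Object.prototype.hasOwnProperty.call(STATUS_RANK, doc.status);
  return known ? STATUS_RANK[doc.status] : -1;
}

// Keep the version that is further along; on a tie keep the existing one.
function resolveStatus(existingItem, incomingItem) {
  return rankOf(incomingItem) > rankOf(existingItem) ? incomingItem : existingItem;
}
```

<p>With this rule a late-arriving <code>created</code> write can no longer overwrite a <code>paid</code> record, which is exactly the regression plain LWW allows when clocks disagree.</p>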
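<h3 id="failure-3custom-merge-stored-proc-沒測-edge-case">Failure 3: custom merge stored procedure not tested on edge cases</h3>
<p>When the procedure throws an exception, Cosmos DB leaves the conflict in the feed and does not retry automatically. The team assumes that once the procedure has run everything is fine, while conflicts actually pile up in the feed and skew later analysis.</p>
<p>Fix: wrap the procedure body in try-catch, log exceptions, and make sure <em>every input returns a reasonable result</em> (even if that means falling back to LWW); scan the conflict feed periodically for backlog.</p>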
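<p>The "always return something" rule can be sketched as a wrapper around any merge function. <code>safeResolve</code> and its <code>log</code> callback are hypothetical names. The fallback compares <code>_ts</code>, treating a missing value as 0, which is an assumption made for this illustration.</p>

```javascript
// Hypothetical wrapper: run a merge function, and fall back to
// last-write-wins on `_ts` if it throws or returns nothing.
function safeResolve(mergeFn, existingItem, incomingItem, log) {
  try {
    const merged = mergeFn(existingItem, incomingItem);
    if (merged != null) return merged;
    log("merge returned nothing; falling back to LWW");
  } catch (err) {
    log("merge threw: " + err.message);
  }
  // Fallback path: prefer whichever side exists, then the larger _ts.
  if (!existingItem) return incomingItem;
  if (!incomingItem) return existingItem;
  return (incomingItem._ts || 0) > (existingItem._ts || 0) ? incomingItem : existingItem;
}
```

<p>Logging on the fallback path matters as much as the fallback itself: without it a merge that fails on every call would silently degrade the collection to plain LWW.</p>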
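<h3 id="failure-4不消費-conflict-feed">Failure 4: not consuming the conflict feed</h3>
<p>After choosing manual mode, the team forgets to implement a feed consumer, conflicts pile up, and later analysis is skewed. Common symptoms: a feed lag metric alert, or the business reporting that "the data does not add up", followed by the discovery of a pile of unprocessed conflicts sitting in the conflict feed.</p>
<p>Fix: implement the consumer pipeline (an Azure Function triggered on Change Feed, or a self-built worker) before choosing conflict feed mode; set an alert that fires when feed lag &gt; 5 min.</p>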
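<h3 id="failure-5期待-multi-region-write-還有-strong-consistency">Failure 5: expecting Strong consistency on top of multi-region write</h3>
<p>The two are mutually exclusive: once multi-region write is enabled, Strong is automatically downgraded (or the setting is rejected; this is time-sensitive, so check the latest documentation). The team assumes "multi-region + Strong = globally linearizable", but the incompatibility is built into the underlying design.</p>
<p>Fix: decide at the selection stage whether you want active-active writes or Strong; you can only pick one. For global linearizability move to Spanner / Aurora DSQL; for active-active accept eventual / session / bounded staleness.</p>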
<h3 id="failure-6跨-region-寫入後立即同-session-read-看不到">Failure 6: a same-session read right after a cross-region write does not see the write</h3>
<p>The session token was not passed across regions, so what looks like inconsistency is really a misaligned session. Typical anti-pattern: service A writes in region 1 and uses region 1's session token; service B reads in region 2 without A's token and does not see A's write.</p>
<p>Fix: pass the session token along with the request (usually in an HTTP header); or switch the account to Bounded staleness (which provides a K/T bound across sessions); see the session token management section of <a href="../consistency-levels-engineering/">consistency-levels-engineering</a>.</p>
<h3 id="failure-7region-故障時的-failover-邏輯誤判">Failure 7: misreading the failover logic during a region outage</h3>
<p>Multi-region write is already active-active and <em>does not need manual failover</em>: when one region goes down, the other regions take over writes automatically. If you have configured <code>failoverPriority</code>, though, the failover logic still needs review: the priority decides <em>which region becomes primary in a multi-region read setup</em>, not how active-active traffic is routed.</p>
<p>Fix: in multi-region write scenarios do not rely on failoverPriority; use Traffic Manager / Front Door for region routing, and configure <code>PreferredLocations</code> in the application-side SDK so the SDK picks the nearest region itself.</p>
<h2 id="容量與觀測">Capacity and observability</h2>
<ul>
<li>Must-watch metrics: <code>ConflictCount</code>, <code>ReplicationLatency</code> per region pair, conflict feed lag</li>
<li>Conflict rate monitoring: normally &lt; 0.01%; a spike points to a hot key or a region sync anomaly</li>
<li>Cost impact: with multi-region write enabled, write cost is multiplied by the region count (every region replicates), so 3-region active-active = 3x write <a href="/blog/backend/knowledge-cards/request-unit/" data-link-title="Request Unit" data-link-desc="Cosmos DB 的容量抽象單位、1 RU = 1KB document strong-consistent read 的 CPU &#43; memory &#43; IOPS 綜合 cost、寫 ~5 RU、複雜 query 數百 RU">Request Unit</a> cost</li>
<li>Maps to <a href="/blog/backend/09-performance-capacity/capacity-planning/" data-link-title="9.6 容量規劃模型" data-link-desc="peak forecast、headroom budget、growth curve、autoscaling sizing">9.6 Capacity Planning Model</a>: the multi-region write multiplier goes into sizing</li>
<li>Maps to <a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a>: conflict rate serves as reliability evidence</li>
<li>Alerts: conflict rate &gt; 0.1%, conflict feed lag &gt; 5 min, cross-region replication lag &gt; SLA</li>
</ul>
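<p>The cost multiplier and the alert thresholds in the list above can be turned into a small calculation. The function names are hypothetical, and the figure of roughly 5 RU per write is a rule-of-thumb assumption; measure your own documents before sizing.</p>

```javascript
// Write throughput to provision: every write is replicated to every region.
function writeRuPerSecond(writesPerSecond, ruPerWrite, regionCount) {
  return writesPerSecond * ruPerWrite * regionCount;
}

// Evaluate the alert thresholds from the list above:
// conflict rate > 0.1% and conflict feed lag > 5 min.
function conflictAlerts({ conflicts, writes, feedLagMinutes }) {
  const rate = writes > 0 ? conflicts / writes : 0;
  const alerts = [];
  if (rate > 0.001) alerts.push("conflict-rate");
  if (feedLagMinutes > 5) alerts.push("feed-lag");
  return { rate, alerts };
}

// Example: 1000 writes/s at an assumed 5 RU each across 3 regions.
const sizing = writeRuPerSecond(1000, 5, 3); // 15000 RU/s
```

<p>Feeding the multiplied figure into capacity planning up front avoids discovering the 3x write bill only after the third region goes live.</p>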
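<h3 id="廣告-sla-vs-實測可用性鏈路拆解本章合成-frame">Advertised SLA vs measured availability: breaking down the chain (a frame synthesized in this chapter)</h3>
<p>The Cosmos DB SLA as disclosed in the 9.C11 Minecraft Earth platform case:</p>
<ul>
<li>single-region 99.99%</li>
<li>multi-region 99.999%</li>
</ul>
<p>This is the <em>DB-side SLA</em>, not the <em>end-to-end system SLA</em>. The availability of a real production system is the product of every link in the chain:</p>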





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Measured availability = DB SLA × network SLA × application-layer SLA × client reachability</span></span></code></pre></div><p><a href="/blog/backend/09-performance-capacity/cases/toyota-connected-mongodb-telematics-iot/" data-link-title="9.C38 Toyota Connected：MongoDB Atlas 撐 900 萬車輛 telematics、月 180 億 transaction" data-link-desc="Toyota Connected 用 MongoDB Atlas 撐 Safety Connect 900 萬車、月 180 億 transaction、緊急訊號 3 秒內到 agent">9.C38 Toyota Connected</a> makes this observation in its "99.99% target vs 99% measured" section: the two-nines gap is <em>not</em> a problem in MongoDB / Atlas itself but in the end-to-end chain (vehicle wireless network / cellular tower / cloud network / event bus / microservice / DB cluster; any one link failing takes availability down). Cosmos DB multi-region write follows the same model:</p>
<ul>
<li>Multi-region active-active addresses <em>DB-side availability</em>, but if the network or application layer fails, measured availability is still &lt; 99.99%</li>
<li>The advertised 99.999% is at the multi-region availability zone level, <em>not</em> the "user request success rate"</li>
</ul>
<p>When citing these figures, state explicitly that the advertised 99.999% for Cosmos DB multi-region is DB-side, and that computing measured availability requires multiplying in the network and application-layer SLAs. The "99% measured" in the Toyota case exposes exactly this chain problem, and it applies across vendors.</p>
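<p>The chain product can be worked through numerically. The figures below are hypothetical placeholders (only the 99.999% DB number comes from the text), and multiplying them assumes the links fail independently and all sit on the request path.</p>

```javascript
// End-to-end availability as the product of each link's availability.
// Assumes independent failures and a strictly serial dependency chain.
function endToEndAvailability(links) {
  return links.reduce((product, availability) => product * availability, 1);
}

// Hypothetical chain: DB 99.999%, network 99.9%, application 99.9%, client reachability 99.5%.
const example = endToEndAvailability([0.99999, 0.999, 0.999, 0.995]);
// example is roughly 0.9930: the weakest links dominate, not the five-nines DB.
```

<p>Even with a five-nines database, this assumed chain lands near 99.3%, which is why the DB-side SLA alone says little about the request success rate a user actually sees.</p>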
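<p>Relation to conflict resolution: the price <em>paid</em> for multi-region high availability is conflicts, and the conflict rate is a hidden tax on reliability; the advertised SLA does not count the cost of handling conflicts. Production design should add "the engineering cost of conflict resolution" to the ROI evaluation of multi-region write.</p>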
<h2 id="邊界與整合">Boundaries and integration</h2>
<ul>
<li>Sibling deep articles: <a href="../consistency-levels-engineering/">consistency-levels-engineering</a> (source of the cross-link on multi-region write being mutually exclusive with Strong), <a href="../partition-key-design/">partition-key-design</a> (hot partitions amplify conflicts), <a href="../ru-cost-model-sizing/">ru-cost-model-sizing</a> (multi-region cost × region count)</li>
<li>Compared with <a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Spanner vendor</a>: CP vs AP, no conflicts vs LWW / custom</li>
<li>Compared with DynamoDB Global Tables: both use LWW; Cosmos DB adds custom merge and the conflict feed</li>
<li>Relation to the 1.x chapters: <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 Globally Distributed OLTP</a> presents the multi-region write models side by side</li>
<li>Knowledge cards: <a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">stale-read</a> / <a href="/blog/backend/knowledge-cards/rpo/" data-link-title="RPO" data-link-desc="說明恢復點目標如何定義可接受資料損失範圍">rpo</a> / <a href="/blog/backend/knowledge-cards/rto/" data-link-title="RTO" data-link-desc="說明恢復時間目標如何約束事故回復策略">rto</a></li>
<li>Anti-recommendation: single-region write + cross-region read replica is cheaper and easier to reason about in most cases; move up to multi-region write only when <em>write residency</em> is a product contract (compliance / latency / business requirement)</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor overview</a>: this article is the in-depth expansion of the multi-region write + conflict resolution backlog item at the end of that page</li>
<li><a href="/blog/backend/09-performance-capacity/cases/minecraft-earth-cosmos-db-global/" data-link-title="9.C11 Minecraft Earth：Azure Cosmos DB 上的全球分散式 AR 遊戲" data-link-desc="Minecraft Earth 用 Cosmos DB 跨地區分散、測試到 100 萬 RU/s 仍維持承諾延遲">9.C11 Minecraft Earth case</a>: source of the multi-region 99.999% / single-region 99.99% SLA figures</li>
<li><a href="/blog/backend/09-performance-capacity/cases/asos-cosmos-db-black-friday/" data-link-title="9.C21 ASOS：Cosmos DB 在 Black Friday 撐 1.67 億請求" data-link-desc="ASOS 在 2016 Black Friday 用 Azure Cosmos DB 撐 24 小時 1.67 億請求、3500 req/sec、48ms 平均延遲">9.C21 ASOS case</a>: supplementary global-retail multi-region case</li>
<li><a href="/blog/backend/09-performance-capacity/cases/toyota-connected-mongodb-telematics-iot/" data-link-title="9.C38 Toyota Connected：MongoDB Atlas 撐 900 萬車輛 telematics、月 180 億 transaction" data-link-desc="Toyota Connected 用 MongoDB Atlas 撐 Safety Connect 900 萬車、月 180 億 transaction、緊急訊號 3 秒內到 agent">9.C38 Toyota Connected case</a>: anchor for the chain SLA breakdown frame (applies across vendors)</li>
<li><a href="../consistency-levels-engineering/">consistency-levels-engineering</a>: destination of the cross-link on Strong + multi-region being mutually exclusive</li>
<li><a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">Stale Read card</a> / <a href="/blog/backend/knowledge-cards/rpo/" data-link-title="RPO" data-link-desc="說明恢復點目標如何定義可接受資料損失範圍">RPO card</a> / <a href="/blog/backend/knowledge-cards/rto/" data-link-title="RTO" data-link-desc="說明恢復時間目標如何約束事故回復策略">RTO card</a>: conceptual foundations</li>
<li>Official: <a href="https://learn.microsoft.com/azure/cosmos-db/conflict-resolution-policies">Cosmos DB conflict resolution</a> / <a href="https://learn.microsoft.com/azure/cosmos-db/how-to-multi-master">Multi-region writes</a></li>
</ul>
]]></content:encoded></item></channel></rss>