<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Dr on Tarragon</title><link>https://tarrragon.github.io/blog/tags/dr/</link><description>Recent content in Dr on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Wed, 27 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/dr/index.xml" rel="self" type="application/rss+xml"/><item><title>Aurora Global Database：跨 region async replication、&lt; 1 秒 lag 與合規 anti-recommendation</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/global-database-multi-region/</link><pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/global-database-multi-region/</guid><description>&lt;p>Aurora Global Database 是 &lt;em>跨 region async replication&lt;/em>、&amp;lt; 1 秒 typical lag、最多 5 個 secondary region — 看起來是 multi-region OLTP 的標準解、但 &lt;a href="https://tarrragon.github.io/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered&lt;/a> 揭露一個受監管產業的 anti-recommendation：合規禁止跨境複製場景下、Global Database &lt;em>違反合規&lt;/em>、要改用每市場獨立 cluster + 應用層市場切換。本文展開 Global Database 適用條件、跟 cross-AZ failover 的 RTO 數量級差、合規邊界、跟 Aurora DSQL / Spanner / CockroachDB 的決策樹。&lt;/p>
&lt;p>本文不是 Aurora overview（請看 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&amp;#43;75% 效能改善的 production 證據">Aurora vendor 頁&lt;/a>）— 而是 Global Database 的實作層教學。前置閱讀建議 &lt;a href="../storage-architecture/">Aurora storage architecture&lt;/a>（理解 storage-level replication）、&lt;a href="../cross-az-failover-rto/">Aurora cross-AZ failover RTO&lt;/a>（對照單 region failover）。&lt;/p>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>典型觸發場景：global SaaS / 跨地理金融服務、需要 region-level DR（us-east-1 整 region 失效時 &amp;lt; 5 分鐘恢復寫入）、或跨地理 read（歐洲用戶查美國 primary 延遲 100ms+ 不可接受）、但又不到「multi-region active-active write」需求。&lt;/p>
&lt;p>讀者常見的具體疑問：&lt;/p>
&lt;ul>
&lt;li>「Global Database 是 sync 還是 async？lag 多少？」&lt;/li>
&lt;li>「Secondary region 可以寫嗎？」&lt;/li>
&lt;li>「Region failover 流程跟 cross-AZ 一樣嗎？」&lt;/li>
&lt;li>「跟 Aurora DSQL / Spanner / CockroachDB 怎麼選？」&lt;/li>
&lt;li>「合規場景一定要用 Global Database 嗎？」&lt;/li>
&lt;/ul>
&lt;p>進一步問題：Global Database 對一般 SaaS 是合理的 DR + 跨地理 read 工具、但對 &lt;em>受監管產業&lt;/em> 是反指標。&lt;a href="https://tarrragon.github.io/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered&lt;/a> 7 個受監管市場、各自獨立 Aurora cluster、不用 Global Database — 不是技術不夠、是合規要求「資料不能跨境複製」。讀者規劃 multi-region 架構時、合規維度要在技術維度之前判斷。&lt;/p>
&lt;h2 id="核心機制跨-region-async-storage-replication">核心機制：跨 region async storage replication&lt;/h2>
&lt;p>Aurora Global Database 的 first-class concept 是 &lt;em>跨 region storage-level async replication&lt;/em>。跟 logical replication / streaming replication 不同、Global Database 在 storage layer 複製、lag 上限相對穩定。&lt;/p></description><content:encoded><![CDATA[<p>Aurora Global Database 是 <em>跨 region async replication</em>、&lt; 1 秒 typical lag、最多 5 個 secondary region — 看起來是 multi-region OLTP 的標準解、但 <a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered</a> 揭露一個受監管產業的 anti-recommendation：合規禁止跨境複製場景下、Global Database <em>違反合規</em>、要改用每市場獨立 cluster + 應用層市場切換。本文展開 Global Database 適用條件、跟 cross-AZ failover 的 RTO 數量級差、合規邊界、跟 Aurora DSQL / Spanner / CockroachDB 的決策樹。</p>
<p>本文不是 Aurora overview（請看 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor 頁</a>）— 而是 Global Database 的實作層教學。前置閱讀建議 <a href="../storage-architecture/">Aurora storage architecture</a>（理解 storage-level replication）、<a href="../cross-az-failover-rto/">Aurora cross-AZ failover RTO</a>（對照單 region failover）。</p>
<h2 id="問題情境">問題情境</h2>
<p>典型觸發場景：global SaaS / 跨地理金融服務、需要 region-level DR（us-east-1 整 region 失效時 &lt; 5 分鐘恢復寫入）、或跨地理 read（歐洲用戶查美國 primary 延遲 100ms+ 不可接受）、但又不到「multi-region active-active write」需求。</p>
<p>讀者常見的具體疑問：</p>
<ul>
<li>「Global Database 是 sync 還是 async？lag 多少？」</li>
<li>「Secondary region 可以寫嗎？」</li>
<li>「Region failover 流程跟 cross-AZ 一樣嗎？」</li>
<li>「跟 Aurora DSQL / Spanner / CockroachDB 怎麼選？」</li>
<li>「合規場景一定要用 Global Database 嗎？」</li>
</ul>
<p>進一步問題：Global Database 對一般 SaaS 是合理的 DR + 跨地理 read 工具、但對 <em>受監管產業</em> 是反指標。<a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered</a> 7 個受監管市場、各自獨立 Aurora cluster、不用 Global Database — 不是技術不夠、是合規要求「資料不能跨境複製」。讀者規劃 multi-region 架構時、合規維度要在技術維度之前判斷。</p>
<h2 id="核心機制跨-region-async-storage-replication">核心機制：跨 region async storage replication</h2>
<p>Aurora Global Database 的 first-class concept 是 <em>跨 region storage-level async replication</em>。跟 logical replication / streaming replication 不同、Global Database 在 storage layer 複製、lag 上限相對穩定。</p>
<p><strong>Architecture</strong>：</p>
<ul>
<li>Primary region：1 個 writer cluster + N read replica</li>
<li>Secondary region：最多 5 個 secondary region、每 region N 個 reader-only cluster（最多 16 個 reader 含 1 個 headless）</li>
<li>Storage replication：primary region 寫 storage 後 <em>async</em> push 到 secondary region storage、不等 ack</li>
</ul>
<p><strong>Write path</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Application
</span></span><span class="line"><span class="ln">2</span><span class="cl">    ↓ writer endpoint (primary region only)
</span></span><span class="line"><span class="ln">3</span><span class="cl">Primary region compute
</span></span><span class="line"><span class="ln">4</span><span class="cl">    ↓ redo log
</span></span><span class="line"><span class="ln">5</span><span class="cl">Primary region storage (4-of-6 quorum)
</span></span><span class="line"><span class="ln">6</span><span class="cl">    ↓ async replication (typical &lt; 1 秒)
</span></span><span class="line"><span class="ln">7</span><span class="cl">Secondary region storage</span></span></code></pre></div><p><strong>Read path</strong>：</p>
<ul>
<li>Secondary region 直接從 local storage 讀、不需要跨 region 拉</li>
<li>Read latency 是 secondary region local latency、不是跨 region</li>
</ul>
<p><strong>DR 切換 RTO 跟 cross-AZ 對比</strong>：</p>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>RTO</th>
          <th>機制</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cross-AZ failover</td>
          <td>&lt; 30 秒</td>
          <td>storage 跨 AZ 共享、replica 升 primary 即可</td>
      </tr>
      <tr>
          <td>Planned failover</td>
          <td>&lt; 2 分鐘</td>
          <td>managed graceful failover、無資料丟失</td>
      </tr>
      <tr>
          <td>Unplanned failover</td>
          <td>5-15 分鐘</td>
          <td>整 region 失效、手動 promote secondary</td>
      </tr>
  </tbody>
</table>
<p>數量級不同 — cross-AZ 是 <em>seconds</em>、cross-region planned 是 <em>minutes</em>、unplanned 是 <em>tens of minutes</em>。</p>
<p><strong>對應 knowledge card</strong>：<a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">stale-read</a>、<a href="/blog/backend/knowledge-cards/rpo/" data-link-title="RPO" data-link-desc="說明恢復點目標如何定義可接受資料損失範圍">rpo</a>、<a href="/blog/backend/knowledge-cards/rto/" data-link-title="RTO" data-link-desc="說明恢復時間目標如何約束事故回復策略">rto</a>。</p>
<p><strong>跟通用 cross-region replication 差在哪</strong>：Aurora 在 storage layer 複製、lag 上限更穩定；vs PostgreSQL logical replication lag 受寫速度影響大、heavy write 期間可能秒級到分鐘級。</p>
<h2 id="step-by-step-配置">Step-by-step 配置</h2>
<p><strong>建 global cluster</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Step 1：在 primary region 建 global cluster</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">aws rds create-global-cluster <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  --global-cluster-identifier myglobal <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  --source-db-cluster-identifier arn:aws:rds:us-east-1:123:cluster:primary-cluster <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>  --region us-east-1
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># Step 2：在 secondary region 加 reader cluster</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">aws rds create-db-cluster <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  --db-cluster-identifier secondary-cluster <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --global-cluster-identifier myglobal <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --engine aurora-postgresql <span class="se">\
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="se"></span>  --source-region us-east-1 <span class="se">\
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="se"></span>  --region eu-west-1
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"># Step 3：在 secondary region 建 db instance</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">aws rds create-db-instance <span class="se">\
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="se"></span>  --db-cluster-identifier secondary-cluster <span class="se">\
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="se"></span>  --db-instance-identifier secondary-reader-01 <span class="se">\
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="se"></span>  --db-instance-class db.r6g.4xlarge <span class="se">\
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="se"></span>  --engine aurora-postgresql <span class="se">\
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="se"></span>  --region eu-west-1</span></span></code></pre></div><p><strong>Application routing</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="c"># 寫永遠去 primary region writer endpoint</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="nt">primary</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l">jdbc:postgresql://primary-cluster.cluster-xxx.us-east-1.rds.amazonaws.com/mydb</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c"># read 可走 secondary region reader endpoint（靠近用戶的 region）</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="nt">secondary-eu</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l">jdbc:postgresql://secondary-cluster.cluster-ro-xxx.eu-west-1.rds.amazonaws.com/mydb</span></span></span></code></pre></div><p><strong>DR 切換（planned failover）</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">aws rds failover-global-cluster <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --global-cluster-identifier myglobal <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --target-db-cluster-identifier arn:aws:rds:eu-west-1:123:cluster:secondary-cluster</span></span></code></pre></div><p>切換後 application 端要 <em>reconfigure connection string</em> — DNS 不自動切跨 region（vs cross-AZ failover writer endpoint 自動跟）。</p>
<p><strong>Application reconfiguration 模式</strong>：</p>
<ul>
<li>Connection string 用 service discovery（Consul / Route53 health check）動態解析</li>
<li>或在 application config 加入 region-aware logic、failover 後切換 active region</li>
<li>不能假設 application 自動 reconnect 到新 primary region</li>
</ul>
<p><strong>驗證點</strong>：</p>
<ul>
<li><code>AuroraGlobalDBReplicationLag</code> &lt; 1 秒</li>
<li>Planned failover RTO 量測（手動 trigger + heartbeat timestamp diff）</li>
<li>Application 跨 region read 路徑 latency 符合預期</li>
</ul>
<p><strong>Rollback boundary</strong>：promote secondary 後原 primary 變 secondary、不會自動 fallback；rollback 要再做一次 failover。</p>
<h2 id="故障模式--邊界-case">故障模式 / 邊界 case</h2>
<h3 id="case-1期待-multi-region-active-active-write">Case 1：期待 multi-region active-active write</h3>
<p>徵兆：team 在 secondary region application 直連 secondary cluster 寫資料、收到 <code>cannot execute INSERT in a read-only transaction</code> 錯誤。</p>
<p>原因：Global Database secondary 是 <em>reader-only</em>、寫只能去 primary region。要 active-active write 必須改用其他服務（Aurora DSQL / Spanner / CockroachDB）。</p>
<p>修：</p>
<ul>
<li>Application 設計時明確區分 read region vs write region</li>
<li>寫操作永遠路由到 primary region、容忍跨 region write latency</li>
<li>真的需要 active-active write 才考慮 Aurora DSQL（2024-12 preview / 2025-05 GA）</li>
</ul>
<h3 id="case-2dns-不跨-region-自動切">Case 2：DNS 不跨 region 自動切</h3>
<p>徵兆：手動 failover trigger 後、application 端 connection string 仍指向舊 primary region、寫操作全失敗。</p>
<p>原因：cross-AZ failover writer endpoint DNS 自動跟、cross-region 不會 — Global Database 切換要 application 端管 region-specific connection string。</p>
<p>修：</p>
<ul>
<li>Application 用 service discovery（Route53 / Consul / etcd）解析 active primary region</li>
<li>部署 region-aware DNS（Route53 latency-based routing + health check）</li>
<li>Failover 演練要包含 application reconfiguration step、不只是 DB layer</li>
</ul>
<h3 id="case-3跨-region-read-假設-strong-consistency">Case 3：跨 region read 假設 strong consistency</h3>
<p>徵兆：用戶在 primary region 寫資料、隨即在 secondary region read、看到舊資料、客訴 inconsistency。</p>
<p>原因：Global Database 是 async replication、&lt; 1 秒 lag 不是 zero、read-after-write 場景仍會看到 stale data。</p>
<p>修：</p>
<ul>
<li>用戶寫操作後短期內 read 走 primary region（read-after-write window）</li>
<li>接受最終一致性、application 端做 versioning / timestamp 比對</li>
<li>強一致性需求改 Aurora DSQL / Spanner</li>
</ul>
<h3 id="case-4lag-spike-during-bulk-operation">Case 4：Lag spike during bulk operation</h3>
<p>徵兆：DDL 或 bulk insert 期間 cross-region lag 從 &lt; 1 秒跳到秒級到分鐘級、secondary region read 大量 stale。</p>
<p>原因：Global Database 「&lt; 1 秒」是 typical、heavy write 期間 lag 拉大。Storage-level replication 比 logical 穩定、但 <em>不是 zero variance</em>。</p>
<p>修：</p>
<ul>
<li>DDL 跟 bulk insert 在低峰期跑、避開跨 region read traffic</li>
<li>監測 <code>AuroraGlobalDBReplicationLag</code>、spike 超過閾值 trigger application 端 fallback（read 切回 primary region）</li>
<li>重要 DDL 用 <a href="https://github.com/reorg/pg_repack">pg_repack</a> 避免長時間 lag</li>
</ul>
<h3 id="case-5合規邊界誤用-global-database--standard-chartered-anti-pattern">Case 5：合規邊界誤用 Global Database — Standard Chartered anti-pattern</h3>
<p>徵兆：team 以為 Global Database 是受監管金融的標準 DR 解、配置完才發現監管機構不接受跨境資料複製、被迫拆掉 Global Database 重建獨立 cluster。</p>
<p><a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered case</a> 「判讀」段第 1 點原文：「7 個受監管市場代表 7 個獨立 cluster（資料不能跨境）、容量規劃變成『7 個獨立規劃 × 各自合規門檻』」。</p>
<p>原因：受監管市場資料 <em>不能跨境複製</em>（<a href="/blog/backend/knowledge-cards/data-residency/" data-link-title="Data Residency" data-link-desc="合規要求資料留在特定地理邊界內、跨境複製違反合規、推動 fleet 拓樸決策">Data Residency</a> 硬約束）、Global Database 本質上就是跨 region storage replication、配置了就違反合規。Standard Chartered 的選擇是 <em>每市場獨立 cluster</em>、跨市場 DR 走應用層市場切換、不靠 Global Database。</p>
<p>修：</p>
<ul>
<li>規劃 multi-region 前先確認合規要求（資料駐留、跨境複製禁令、稽核要求）</li>
<li>合規禁止跨境複製場景：每市場獨立 cluster + cross-AZ failover 吸收 RTO（見 <a href="../cross-az-failover-rto/">cross-az-failover-rto</a>）</li>
<li>跨市場 DR 設計成 <em>市場切換</em>（用戶從 A 市場切到 B 市場）、不是 <em>資料切換</em></li>
<li>Fleet 拓樸（多市場 → 多 cluster）詳見 <a href="../read-replica-scaling/">Aurora read replica scaling</a> fleet 治理 SSoT</li>
</ul>
<p><strong>scope warning（必明示）</strong>：Standard Chartered case 未公開是 PostgreSQL 還是 MySQL、未公開具體 cost 數字、屬「相關 case study」匿名對照。引用時不能擴寫具體 engine。</p>
<h3 id="case-6cost-trap--cross-region-data-transfer">Case 6：Cost trap — cross-region data transfer</h3>
<p>徵兆：開了 Global Database 後月帳變高 50%、發現 cross-region data transfer 是主要費用、不是 instance。</p>
<p>原因：Aurora 跨 region replication 走 AWS 內部網路、但 <em>cross-region data transfer 仍計費</em>。Heavy write workload 月費可能 doubled。</p>
<p>修：</p>
<ul>
<li>用 <code>AuroraGlobalDBReplicatedWriteIO</code> × per-region transfer rate 估月費</li>
<li>Write-heavy workload 評估 Global Database ROI（保險、低費用版本是用 cross-region snapshot 做冷備）</li>
<li>Cost 跟 RTO 一起看 — 如果接受 hours RTO、cross-region snapshot 更便宜</li>
</ul>
<h3 id="case-7fanduel-雙峰-case-對照避免-over-extrapolate">Case 7：FanDuel 雙峰 case 對照（避免 over-extrapolate）</h3>
<p>如果 team 引用 <a href="/blog/backend/09-performance-capacity/cases/fanduel-dual-peak-betting-streaming/" data-link-title="9.C28 FanDuel：體育直播 &#43; 投注的雙重峰值" data-link-desc="FanDuel 3.5M MAU、Super Bowl 期間擴容 5-10 倍、用 AWS Local Zones &#43; Wavelength &#43; Outposts 處理 20&#43; 州的雙重峰值">9.C28 FanDuel</a> 規劃 multi-region 部署、要明示 scope warning。</p>
<p><strong>case「判讀」段第 1 點原文</strong>：「直播跟投注是兩種完全不同 SLO：直播容忍秒級延遲（用 CDN + ABR 串流）、投注必須毫秒級成交。兩個服務必須各自獨立擴容、各自獨立 SLO」。</p>
<p><strong>scope warning（必明示）</strong>：</p>
<ul>
<li>FanDuel 5-10x 是 <em>betting 服務的 Aurora 擴容倍數</em>、不是 streaming（streaming 走 CDN、不走 Aurora）</li>
<li>不能壓成「Aurora 撐 5-10x」單一數字</li>
<li>案例自承：betting transaction TPS 跟 concurrent streams 未公開、不能 over-extrapolate</li>
</ul>
<p>引用 FanDuel 規劃自家 multi-region betting workload 時、看 <em>策略</em>（事件型分級 + 雙 SLO 拆分 + 多層 edge）、不套用 <em>具體數字</em>。</p>
<h2 id="跟-aurora-dsql--spanner--cockroachdb-的決策樹">跟 Aurora DSQL / Spanner / CockroachDB 的決策樹</h2>
<p>Global Database 是 <em>async + reader-only secondary</em>、不是 multi-region active-active。當 active-active write 是核心需求時、要看 distributed SQL 方案。</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Aurora Global Database</th>
          <th>Aurora DSQL</th>
          <th>Spanner</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Replication</td>
          <td>Async storage-level</td>
          <td>Sync distributed</td>
          <td>Sync TrueTime</td>
          <td>Sync Raft consensus</td>
      </tr>
      <tr>
          <td>Secondary</td>
          <td>Reader-only</td>
          <td>Active-active</td>
          <td>Active-active</td>
          <td>Active-active</td>
      </tr>
      <tr>
          <td>Lag</td>
          <td>&lt; 1 秒 typical</td>
          <td>None (sync)</td>
          <td>None (sync)</td>
          <td>None (sync)</td>
      </tr>
      <tr>
          <td>Write</td>
          <td>Primary region only</td>
          <td>Multi-region</td>
          <td>Multi-region</td>
          <td>Multi-region</td>
      </tr>
      <tr>
          <td>Strong consistency cross-region</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>適用</td>
          <td>DR + 跨地理 read</td>
          <td>Multi-region OLTP</td>
          <td>Global scale OLTP</td>
          <td>Cross-cloud OLTP</td>
      </tr>
      <tr>
          <td>邊界</td>
          <td>active-active 不支援、合規 反指標</td>
          <td>AWS-only、新服務</td>
          <td>GCP-only、學習曲線</td>
          <td>跨雲、operational 複雜</td>
      </tr>
  </tbody>
</table>
<p><strong>何時選 Global Database</strong>：</p>
<ul>
<li>DR + 跨地理 read 是主要需求</li>
<li>寫流量集中在一個 region（單 region write 撐得住）</li>
<li>合規允許跨境複製（一般 SaaS、非受監管）</li>
<li>從 single-region Aurora 升級、不想換 engine</li>
</ul>
<p><strong>何時改 Aurora DSQL / Spanner / CockroachDB</strong>：</p>
<ul>
<li>Multi-region active-active write</li>
<li>跨 region strong consistency 是業務需求</li>
<li>跨雲 / on-prem 需求（CockroachDB）</li>
</ul>
<p><strong>何時不用 Global Database</strong>：</p>
<ul>
<li>合規禁止跨境複製（Standard Chartered case）→ 每市場獨立 cluster</li>
<li>Single-region 已滿足 DR / read 需求</li>
<li>跨 region cost 不划算（write-heavy workload）</li>
</ul>
<h2 id="容量與觀測">容量與觀測</h2>
<p><strong>核心 metric</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">AuroraGlobalDBReplicationLag       # secondary lag、&lt; 1 秒 typical
</span></span><span class="line"><span class="ln">2</span><span class="cl">AuroraGlobalDBReplicatedWriteIO    # cross-region data transfer 量
</span></span><span class="line"><span class="ln">3</span><span class="cl">AuroraGlobalDBProgressLag          # storage replication progress</span></span></code></pre></div><p><strong>容量上限</strong>：</p>
<ul>
<li>1 primary region + 5 secondary region</li>
<li>每 secondary region 16 個 reader 含 1 個 headless（可升 writer）</li>
</ul>
<p><strong>Cost signal</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">月費 ≈ AuroraGlobalDBReplicatedWriteIO × per-region transfer rate
</span></span><span class="line"><span class="ln">2</span><span class="cl">     + secondary region instance + storage
</span></span><span class="line"><span class="ln">3</span><span class="cl">     + cross-region snapshot (optional)</span></span></code></pre></div><p>Write 量大的 workload 月費可能 doubled（primary region + secondary region 都計費）、要在規劃時估準。</p>
<p><strong>驗證 DR</strong>：</p>
<ul>
<li>Planned failover drill 每季一次、量測 RTO / RPO</li>
<li>受監管產業：每月一次、有合規 sign-off 記錄</li>
<li>重大版本升級前必跑一次</li>
</ul>
<p><strong>回路徑</strong>：<a href="/blog/backend/09-performance-capacity/capacity-planning/" data-link-title="9.6 容量規劃模型" data-link-desc="peak forecast、headroom budget、growth curve、autoscaling sizing">9.6 容量規劃模型</a> cross-region cost、<a href="/blog/backend/08-incident-response/" data-link-title="模組八：事故處理與復盤" data-link-desc="用 IR 領域詞彙建問題節點、以服務級案例庫累積事故脈絡，先建概念與案例庫再進實作交接">8.x DR playbook</a> region-level failover decision。</p>
<h2 id="邊界與整合--下一步">邊界與整合 / 下一步</h2>
<p><strong>Sibling deep articles</strong>：</p>
<ul>
<li><a href="../storage-architecture/">Aurora storage architecture</a> — cross-region replication 是 storage-level 延伸</li>
<li><a href="../cross-az-failover-rto/">Aurora cross-AZ failover RTO</a> — cross-AZ 跟 cross-region failover RTO 數量級對比</li>
<li><a href="../read-replica-scaling/">Aurora read replica scaling</a> — fleet 治理 SSoT、合規驅動 fleet 拓樸的展開</li>
</ul>
<p><strong>Migration playbook</strong>：</p>
<ul>
<li><a href="../migrate-from-self-managed-pg-mysql/">PostgreSQL / MySQL → Aurora</a> — 從 PostgreSQL streaming replication 跨 region 升級的差異</li>
</ul>
<p><strong>1.x 章節互引</strong>：</p>
<ul>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a> — Global Database vs distributed SQL 對比</li>
</ul>
<p><strong>何時不用本文</strong>：single-region OLTP、無跨 region DR / read 需求時可跳過、看 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor overview</a> 即可。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor overview</a> — 服務定位、適用 / 不適用場景</li>
<li><a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">Stale Read 卡片</a> — read-after-write 容忍度</li>
<li><a href="/blog/backend/knowledge-cards/rpo/" data-link-title="RPO" data-link-desc="說明恢復點目標如何定義可接受資料損失範圍">RPO 卡片</a> — DR RPO 判讀</li>
<li><a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered</a> — 合規驅動的 Global Database anti-pattern</li>
<li><a href="/blog/backend/09-performance-capacity/cases/fanduel-dual-peak-betting-streaming/" data-link-title="9.C28 FanDuel：體育直播 &#43; 投注的雙重峰值" data-link-desc="FanDuel 3.5M MAU、Super Bowl 期間擴容 5-10 倍、用 AWS Local Zones &#43; Wavelength &#43; Outposts 處理 20&#43; 州的雙重峰值">9.C28 FanDuel</a> — 雙 SLO 並行的 multi-region 策略對照</li>
<li><a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章方法論</a> — 本文遵循的 6 規格面寫作模板</li>
<li>官方：<a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html">Aurora Global Database</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Cross-region DR</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/cross-region-dr/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/cross-region-dr/</guid><description>&lt;p>PostgreSQL cross-region DR 的核心責任是把區域性事故下的資料恢復、服務切換與資料一致性風險寫成可演練流程。跨區 DR 通常由法規、業務連續性、雲區故障、區域隔離或高可用承諾觸發。&lt;/p>
&lt;p>本文的判讀錨點是：cross-region DR 是恢復策略，而非自動等同 multi-region active-active。PostgreSQL 可以透過 backup / WAL archive、physical standby、logical replication、managed service replica 或 application-level replication 支援不同 RPO / RTO；每種路線都有資料延遲、切換與回切成本。&lt;/p>
&lt;h2 id="dr-strategy">DR Strategy&lt;/h2>
&lt;p>DR strategy 的核心責任是把恢復目標和技術路線對齊。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>策略&lt;/th>
 &lt;th>RPO / RTO 型態&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Backup + WAL archive&lt;/td>
 &lt;td>RPO 依 WAL archive，RTO 依 restore&lt;/td>
 &lt;td>成本敏感、低頻災難復原&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-region standby&lt;/td>
 &lt;td>RPO 接近 replication lag，RTO 較短&lt;/td>
 &lt;td>需要較快啟動 read / promote&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Logical replication&lt;/td>
 &lt;td>table-level / selective DR&lt;/td>
 &lt;td>跨版本、跨 schema、局部資料同步&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Managed global DB&lt;/td>
 &lt;td>雲平台提供跨區 replica&lt;/td>
 &lt;td>希望降低自管複製與 promote 維運&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application replay&lt;/td>
 &lt;td>event / queue 重建狀態&lt;/td>
 &lt;td>domain event 已是 source of truth&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>RPO 要由業務定義。若付款、訂單、庫存只允許秒級遺失，backup-only 路線通常成本不足；若是內部報表或可重建資料，backup + WAL archive 可能足夠。&lt;/p>
&lt;h2 id="physical-vs-logical">Physical vs Logical&lt;/h2>
&lt;p>Physical vs logical 的核心責任是區分 byte-level recovery 與 row-level replication。Physical replica 保留 PostgreSQL cluster 層級狀態；&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/logical-replication/" data-link-title="Logical Replication" data-link-desc="說明以表為粒度解碼 row-level 變更的複製方式，對照 byte-level 的實體複製">logical replication&lt;/a> 提供 table / publication 層級彈性。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>Physical standby&lt;/th>
 &lt;th>Logical replication&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>粒度&lt;/td>
 &lt;td>cluster / database&lt;/td>
 &lt;td>table / publication&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>版本彈性&lt;/td>
 &lt;td>通常要求版本與系統相容&lt;/td>
 &lt;td>可支援跨版本 / selective migration&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DDL&lt;/td>
 &lt;td>跟隨 WAL / 需相容&lt;/td>
 &lt;td>需要 schema coordination&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failover&lt;/td>
 &lt;td>promote standby&lt;/td>
 &lt;td>application / target DB 切換&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>風險&lt;/td>
 &lt;td>replication lag、timeline&lt;/td>
 &lt;td>slot lag、schema drift、missing key&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Physical standby 適合整體 DR。它的 runbook 要處理 WAL archive、replication lag、promotion、timeline、DNS / connection string 切換與回切。&lt;/p>
&lt;p>Logical replication 適合局部資料或跨版本轉換。它的 runbook 要處理 publication、subscription、replication slot、schema migration ordering 與資料 diff。&lt;/p>
&lt;h2 id="failover-runbook">Failover Runbook&lt;/h2>
&lt;p>Failover runbook 的核心責任是把災難切換變成可演練步驟。最小流程包含 incident declare、source freeze、replica health check、promote、traffic switch、data validation 與 rollback / rebuild。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL cross-region DR 的核心責任是把區域性事故下的資料恢復、服務切換與資料一致性風險寫成可演練流程。跨區 DR 通常由法規、業務連續性、雲區故障、區域隔離或高可用承諾觸發。</p>
<p>本文的判讀錨點是：cross-region DR 是恢復策略，而非自動等同 multi-region active-active。PostgreSQL 可以透過 backup / WAL archive、physical standby、logical replication、managed service replica 或 application-level replication 支援不同 RPO / RTO；每種路線都有資料延遲、切換與回切成本。</p>
<h2 id="dr-strategy">DR Strategy</h2>
<p>DR strategy 的核心責任是把恢復目標和技術路線對齊。</p>
<table>
  <thead>
      <tr>
          <th>策略</th>
          <th>RPO / RTO 型態</th>
          <th>適合情境</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Backup + WAL archive</td>
          <td>RPO 依 WAL archive，RTO 依 restore</td>
          <td>成本敏感、低頻災難復原</td>
      </tr>
      <tr>
          <td>Cross-region standby</td>
          <td>RPO 接近 replication lag，RTO 較短</td>
          <td>需要較快啟動 read / promote</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>table-level / selective DR</td>
          <td>跨版本、跨 schema、局部資料同步</td>
      </tr>
      <tr>
          <td>Managed global DB</td>
          <td>雲平台提供跨區 replica</td>
          <td>希望降低自管複製與 promote 維運</td>
      </tr>
      <tr>
          <td>Application replay</td>
          <td>event / queue 重建狀態</td>
          <td>domain event 已是 source of truth</td>
      </tr>
  </tbody>
</table>
<p>RPO 要由業務定義。若付款、訂單、庫存只允許秒級遺失，backup-only 路線通常成本不足；若是內部報表或可重建資料，backup + WAL archive 可能足夠。</p>
<h2 id="physical-vs-logical">Physical vs Logical</h2>
<p>Physical vs logical 的核心責任是區分 byte-level recovery 與 row-level replication。Physical replica 保留 PostgreSQL cluster 層級狀態；<a href="/blog/backend/knowledge-cards/logical-replication/" data-link-title="Logical Replication" data-link-desc="說明以表為粒度解碼 row-level 變更的複製方式，對照 byte-level 的實體複製">logical replication</a> 提供 table / publication 層級彈性。</p>
<table>
  <thead>
      <tr>
          <th>面向</th>
          <th>Physical standby</th>
          <th>Logical replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>粒度</td>
          <td>cluster / database</td>
          <td>table / publication</td>
      </tr>
      <tr>
          <td>版本彈性</td>
          <td>通常要求版本與系統相容</td>
          <td>可支援跨版本 / selective migration</td>
      </tr>
      <tr>
          <td>DDL</td>
          <td>跟隨 WAL / 需相容</td>
          <td>需要 schema coordination</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>promote standby</td>
          <td>application / target DB 切換</td>
      </tr>
      <tr>
          <td>風險</td>
          <td>replication lag、timeline</td>
          <td>slot lag、schema drift、missing key</td>
      </tr>
  </tbody>
</table>
<p>Physical standby 適合整體 DR。它的 runbook 要處理 WAL archive、replication lag、promotion、timeline、DNS / connection string 切換與回切。</p>
<p>Logical replication 適合局部資料或跨版本轉換。它的 runbook 要處理 publication、subscription、replication slot、schema migration ordering 與資料 diff。</p>
<h2 id="failover-runbook">Failover Runbook</h2>
<p>Failover runbook 的核心責任是把災難切換變成可演練步驟。最小流程包含 incident declare、source freeze、replica health check、promote、traffic switch、data validation 與 rollback / rebuild。</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>操作</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Declare incident</td>
          <td>確認 primary region 事故範圍</td>
          <td>incident decision log</td>
      </tr>
      <tr>
          <td>Freeze source</td>
          <td>停止寫入或確認 source 已不可用</td>
          <td>last known LSN / timestamp</td>
      </tr>
      <tr>
          <td>Check replica</td>
          <td>lag、WAL received、read health</td>
          <td>replica status snapshot</td>
      </tr>
      <tr>
          <td>Promote</td>
          <td>promote standby 或啟用 target</td>
          <td>new timeline / role</td>
      </tr>
      <tr>
          <td>Switch traffic</td>
          <td>DNS、secret、connection string</td>
          <td>app smoke test</td>
      </tr>
      <tr>
          <td>Validate</td>
          <td>row count、critical invariant</td>
          <td>validation report</td>
      </tr>
      <tr>
          <td>Rebuild</td>
          <td>重建舊 primary 或新 standby</td>
          <td>follow-up runbook</td>
      </tr>
  </tbody>
</table>
<p>Failover 決策要有 owner。自動化可以執行步驟，但是否接受資料遺失、是否凍結寫入、是否 promote，仍需要明確責任人與 tripwire。</p>
<h2 id="data-reconciliation">Data Reconciliation</h2>
<p>Data reconciliation 的核心責任是處理 cross-region 切換後的資料差異。只要 replication lag 存在，failover 後就可能有未套用交易。</p>
<table>
  <thead>
      <tr>
          <th>差異類型</th>
          <th>處理方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>已提交但未複製</td>
          <td>從 source WAL / app log / event 補償</td>
      </tr>
      <tr>
          <td>client retry 重複寫入</td>
          <td>idempotency key / natural key 去重</td>
      </tr>
      <tr>
          <td>sequence / identity</td>
          <td>target sequence reset / collision check</td>
      </tr>
      <tr>
          <td>external side effect</td>
          <td>payment、email、queue 需對帳</td>
      </tr>
  </tbody>
</table>
<p>Reconciliation 要先定義 critical table。所有表都做 full diff 成本高；付款、訂單、權限、ledger、mutation log 等高風險資料要有專用 validation query。</p>
<h2 id="drill-design">Drill Design</h2>
<p>Drill design 的核心責任是定期驗證 RPO / RTO。DR 文件只有在演練後才可信。</p>
<p>演練至少包含：</p>
<ol>
<li>從 backup + WAL 還原到指定時間。</li>
<li>Promote standby 到 isolated environment。</li>
<li>Application 使用 DR endpoint 跑 smoke test。</li>
<li>計算實際 RPO / RTO。</li>
<li>記錄失敗點、人工步驟與下一次修正。</li>
</ol>
<p>演練應避開 production destructive action。使用 isolated VPC、staging app、read-only validation 與 mock external side effect。</p>
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go conditions 的核心責任是指出 PostgreSQL cross-region DR 的邊界。</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>建議路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>多區同時交易寫入是核心需求</td>
          <td>CockroachDB / Spanner / YugabyteDB 類 distributed SQL</td>
      </tr>
      <tr>
          <td>RPO 接近零且跨區距離大</td>
          <td>synchronous replication latency 成本評估</td>
      </tr>
      <tr>
          <td>Team 缺少 DR 演練能力</td>
          <td>managed service + vendor runbook</td>
      </tr>
      <tr>
          <td>數據 residency 限制跨區複製</td>
          <td>regional shard / policy-driven replication</td>
      </tr>
  </tbody>
</table>
<p>Cross-region DR 要誠實面對延遲。把每個 region 都變成 writer 需要 distributed transaction 模型；PostgreSQL DR 路線主要提供恢復與切換。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Cross-region DR 完成後，恢復實作讀 <a href="../pitr-wal-archiving/">PITR / WAL Archiving</a>；replication 架構讀 <a href="../replication-topology/">Replication Topology</a>；跨區 rollout 的資料政策讀 <a href="../multi-region-gdpr-rollout/">Multi-region GDPR Rollout</a>。</p>
]]></content:encoded></item></channel></rss>