<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Postgresql on Tarragon</title><link>https://tarrragon.github.io/blog/tags/postgresql/</link><description>Recent content in Postgresql on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 26 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/postgresql/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/</link><pubDate>Wed, 13 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/</guid><description>&lt;p>PostgreSQL 是 backend 預設關聯式資料庫的安全選擇。生態完整、SQL 功能豐富、MVCC 跟 transaction 模型穩定、新版本仍積極演進（pg17 加入 JSON_TABLE、平行 vacuum；pg18 加入 io_uring async）。Aurora（AWS managed）、CockroachDB、Aurora DSQL（2024-12 preview / 2025-05 GA）、Spanner（2024 PostgreSQL dialect）都把 PostgreSQL wire protocol 當作相容標的 — 它是 SQL DB 世界的 lingua franca。&lt;/p>
&lt;h2 id="教學路線sql-baseline-與交易演進">教學路線：SQL baseline 與交易演進&lt;/h2>
&lt;p>PostgreSQL 服務頁的教學目標是建立 SQL baseline。讀者讀完後要能用 PostgreSQL 理解 transaction、schema evolution、query boundary、connection pressure 與 managed / distributed SQL 的比較基準。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>學習段&lt;/th>
 &lt;th>核心問題&lt;/th>
 &lt;th>對應段落&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>SQL baseline&lt;/td>
 &lt;td>PostgreSQL 為什麼常作為 OLTP 預設比較基準&lt;/td>
 &lt;td>定位、適用場景&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>容量邊界&lt;/td>
 &lt;td>connection、write throughput、replica、storage 如何限制服務&lt;/td>
 &lt;td>容量特性、容量規劃要點&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>交易與查詢&lt;/td>
 &lt;td>複雜 SQL、JSONB、GIS、全文檢索如何影響資料模型&lt;/td>
 &lt;td>適用場景、跟其他 vendor 的取捨&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>演進與維護&lt;/td>
 &lt;td>vacuum、partition、index、replication 如何成為長期責任&lt;/td>
 &lt;td>容量規劃要點、常見陷阱&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>替代路由&lt;/td>
 &lt;td>何時轉 Aurora、CockroachDB、Spanner、DynamoDB 或 OLAP&lt;/td>
 &lt;td>不適用場景、跟其他 vendor 的取捨&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="定位oltp-預設sql-工程深度">定位：OLTP 預設、SQL 工程深度&lt;/h2>
&lt;p>PostgreSQL 跟 MySQL 是兩大 SQL OLTP 主流、但設計取捨明顯不同：&lt;/p>
&lt;ul>
&lt;li>PostgreSQL 偏 &lt;em>特性深度&lt;/em> — JSON、GIS、full-text search、partial index、CTE、window function 都成熟&lt;/li>
&lt;li>MySQL 偏 &lt;em>簡單 query 效能 + 分片生態&lt;/em> — Vitess / PlanetScale 提供超大規模 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">database sharding&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>選 PostgreSQL 的核心訴求：需要進階 SQL 特性、需要長期 schema evolution 彈性、信任 community-driven 演進、想避免單一 vendor lock-in（PostgreSQL 是 open source、可跨雲 / on-prem）。&lt;/p>
&lt;h2 id="容量特性">容量特性&lt;/h2>
&lt;p>PostgreSQL 沒有「vendor 給的容量數字」、要靠 instance 配置 + tuning 推估。但有幾個工程上限要知道：&lt;/p>
&lt;p>&lt;strong>單一 primary 寫吞吐&lt;/strong>：&lt;/p>
&lt;ul>
&lt;li>一般 m5.4xlarge 級 instance：5K-10K WPS（依 schema、index、commit fsync）&lt;/li>
&lt;li>高階 r6i.16xlarge + io2 storage：30K-50K WPS&lt;/li>
&lt;li>超過這個級別 → 應用層 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">database sharding&lt;/a> 或換 Aurora / Spanner&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Connection 上限&lt;/strong>：&lt;/p>
&lt;ul>
&lt;li>預設 100 connection、每個 connection ~10MB RAM&lt;/li>
&lt;li>1000+ connection 必須 pgBouncer / PgCat 共享 pool&lt;/li>
&lt;li>對應 &lt;a href="https://tarrragon.github.io/blog/backend/09-performance-capacity/cases/ntt-docomo-lemino-japanese-streaming/" data-link-title="9.C29 NTT DOCOMO Lemino：3 個月達 500 萬 MAU 的串流後端" data-link-desc="Lemino 用 DynamoDB &amp;#43; AWS Media Services 撐 30 channels live &amp;#43; 5M MAU、工程工時下降 90%">9.C29 Lemino case&lt;/a> — RDB connection limit 是 surge 場景的隱性 bottleneck&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Read replica&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL 是 backend 預設關聯式資料庫的安全選擇。生態完整、SQL 功能豐富、MVCC 跟 transaction 模型穩定、新版本仍積極演進（pg17 加入 JSON_TABLE、平行 vacuum；pg18 加入 io_uring async）。Aurora（AWS managed）、CockroachDB、Aurora DSQL（2024-12 preview / 2025-05 GA）、Spanner（2024 PostgreSQL dialect）都把 PostgreSQL wire protocol 當作相容標的 — 它是 SQL DB 世界的 lingua franca。</p>
<h2 id="教學路線sql-baseline-與交易演進">教學路線：SQL baseline 與交易演進</h2>
<p>PostgreSQL 服務頁的教學目標是建立 SQL baseline。讀者讀完後要能用 PostgreSQL 理解 transaction、schema evolution、query boundary、connection pressure 與 managed / distributed SQL 的比較基準。</p>
<table>
  <thead>
      <tr>
          <th>學習段</th>
          <th>核心問題</th>
          <th>對應段落</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SQL baseline</td>
          <td>PostgreSQL 為什麼常作為 OLTP 預設比較基準</td>
          <td>定位、適用場景</td>
      </tr>
      <tr>
          <td>容量邊界</td>
          <td>connection、write throughput、replica、storage 如何限制服務</td>
          <td>容量特性、容量規劃要點</td>
      </tr>
      <tr>
          <td>交易與查詢</td>
          <td>複雜 SQL、JSONB、GIS、全文檢索如何影響資料模型</td>
          <td>適用場景、跟其他 vendor 的取捨</td>
      </tr>
      <tr>
          <td>演進與維護</td>
          <td>vacuum、partition、index、replication 如何成為長期責任</td>
          <td>容量規劃要點、常見陷阱</td>
      </tr>
      <tr>
          <td>替代路由</td>
          <td>何時轉 Aurora、CockroachDB、Spanner、DynamoDB 或 OLAP</td>
          <td>不適用場景、跟其他 vendor 的取捨</td>
      </tr>
  </tbody>
</table>
<h2 id="定位oltp-預設sql-工程深度">定位：OLTP 預設、SQL 工程深度</h2>
<p>PostgreSQL 跟 MySQL 是兩大 SQL OLTP 主流、但設計取捨明顯不同：</p>
<ul>
<li>PostgreSQL 偏 <em>特性深度</em> — JSON、GIS、full-text search、partial index、CTE、window function 都成熟</li>
<li>MySQL 偏 <em>簡單 query 效能 + 分片生態</em> — Vitess / PlanetScale 提供超大規模 <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">database sharding</a></li>
</ul>
<p>選 PostgreSQL 的核心訴求：需要進階 SQL 特性、需要長期 schema evolution 彈性、信任 community-driven 演進、想避免單一 vendor lock-in（PostgreSQL 是 open source、可跨雲 / on-prem）。</p>
<h2 id="容量特性">容量特性</h2>
<p>PostgreSQL 沒有「vendor 給的容量數字」、要靠 instance 配置 + tuning 推估。但有幾個工程上限要知道：</p>
<p><strong>單一 primary 寫吞吐</strong>：</p>
<ul>
<li>一般 m5.4xlarge 級 instance：5K-10K WPS（依 schema、index、commit fsync）</li>
<li>高階 r6i.16xlarge + io2 storage：30K-50K WPS</li>
<li>超過這個級別 → 應用層 <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">database sharding</a> 或換 Aurora / Spanner</li>
</ul>
<p><strong>Connection 上限</strong>：</p>
<ul>
<li>預設 100 connection、每個 connection ~10MB RAM</li>
<li>1000+ connection 必須 pgBouncer / PgCat 共享 pool</li>
<li>對應 <a href="/blog/backend/09-performance-capacity/cases/ntt-docomo-lemino-japanese-streaming/" data-link-title="9.C29 NTT DOCOMO Lemino：3 個月達 500 萬 MAU 的串流後端" data-link-desc="Lemino 用 DynamoDB &#43; AWS Media Services 撐 30 channels live &#43; 5M MAU、工程工時下降 90%">9.C29 Lemino case</a> — RDB connection limit 是 surge 場景的隱性 bottleneck</li>
</ul>
<p><strong>Read replica</strong>：</p>
<ul>
<li>streaming replication：1 個 primary + 多個 standby（async / sync）</li>
<li>跨 AZ replication lag 通常 &lt; 100ms、跨 region 可能秒級</li>
<li>跟 Aurora 比、自管 PostgreSQL replication lag 較大</li>
</ul>
<p><strong>Storage 上限</strong>：</p>
<ul>
<li>單一 table 32 TB（PostgreSQL 設計上限）</li>
<li>實務上單表超過 1 TB 開始有 vacuum / index 問題、建議 partition</li>
</ul>
<h2 id="適用場景">適用場景</h2>
<p><strong>1. 多用途 OLTP、複雜查詢</strong>：</p>
<ul>
<li>複雜 JOIN、CTE、window function、subquery</li>
<li>訂單系統、會員系統、訂閱方案、權限 RBAC</li>
<li>需要 strong consistency + ACID transaction</li>
</ul>
<p><strong>2. JSON / 半結構化資料</strong>：</p>
<ul>
<li>JSONB column 支援 indexing、partial query</li>
<li>比 MongoDB 適合 <em>主要結構化 + 部分 JSON</em> workload</li>
<li>不適合主要 document workload（用 MongoDB / Cosmos DB）</li>
</ul>
<p><strong>3. 地理 / 全文檢索</strong>：</p>
<ul>
<li>PostGIS 是業界標準 GIS extension</li>
<li>全文檢索（ts_vector）對中等規模夠用、超大規模用 Elasticsearch</li>
</ul>
<p><strong>4. 進階特性需求</strong>：</p>
<ul>
<li>partial index（WHERE 條件下才建 index）</li>
<li>exclusion constraints（避免 booking 重疊）</li>
<li>range types（時間 / 數字範圍）</li>
<li>logical decoding / CDC（Debezium、pgcapture）</li>
<li>foreign data wrapper（query 跨 DB）</li>
</ul>
<p><strong>5. 跨雲 / on-prem 部署</strong>：</p>
<ul>
<li>不想 vendor lock-in</li>
<li>可用 Patroni / Stolon / pg_auto_failover 做 HA</li>
<li>對應 <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a> 的 CockroachDB / Aurora DSQL 比較段</li>
</ul>
<p><strong>6. 中小規模高峰場景</strong>：</p>
<ul>
<li>流量 &lt; 10K WPS 級別、PostgreSQL 自管或 RDS 通常夠</li>
<li>流量更高、考慮 Aurora（同 wire protocol、storage 升級）</li>
</ul>
<h2 id="不適用場景">不適用場景</h2>
<p><strong>1. 極高寫入吞吐（單機 &gt; 50K WPS）</strong>：</p>
<ul>
<li>必須進入 <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">database sharding</a> 或分散式 SQL</li>
<li>替代：CockroachDB、TiDB、Spanner、應用層 sharding</li>
</ul>
<p><strong>2. 全球 multi-region active-active write</strong>：</p>
<ul>
<li>PostgreSQL 是 single primary、不支援 multi-region active-active</li>
<li>替代：Aurora DSQL、Spanner、CockroachDB multi-region</li>
</ul>
<p><strong>3. KV 簡單查詢 + sub-10ms p99</strong>：</p>
<ul>
<li>PostgreSQL connection 開銷 + parsing + planning 已經 1-3ms</li>
<li>KV-pattern workload 用 DynamoDB / Redis / Cosmos DB 更便宜更快</li>
</ul>
<p><strong>4. 大規模 OLAP</strong>：</p>
<ul>
<li>PostgreSQL 定位在 OLTP，analytics workload 交給 OLAP 系統</li>
<li>大數據分析用 ClickHouse / BigQuery / Snowflake / Redshift / Synapse</li>
</ul>
<p><strong>5. 連線量極大 SaaS（每個用戶一個 connection）</strong>：</p>
<ul>
<li>即使有 pgBouncer、超大連線量仍是 PostgreSQL 結構性限制</li>
<li>對應 <a href="/blog/backend/09-performance-capacity/cases/ntt-docomo-lemino-japanese-streaming/" data-link-title="9.C29 NTT DOCOMO Lemino：3 個月達 500 萬 MAU 的串流後端" data-link-desc="Lemino 用 DynamoDB &#43; AWS Media Services 撐 30 channels live &#43; 5M MAU、工程工時下降 90%">9.C29 Lemino</a> 案例 — 流量上升 connection 爆是換 DynamoDB 的主因</li>
</ul>
<h2 id="跟其他-vendor-的取捨">跟其他 vendor 的取捨</h2>
<p><strong>vs MySQL</strong>：</p>
<ul>
<li>PostgreSQL：SQL 特性深、JSON / GIS / window 完整、replication 較簡單但 lag 較大</li>
<li>MySQL：簡單 query 效能好、replication 機制成熟、Vitess 分片生態強</li>
<li>選 PostgreSQL：需要進階 SQL、複雜 query、JSON workload</li>
<li>選 MySQL：高併發簡單 query、需要 sharding、已用 MySQL 生態</li>
</ul>
<p><strong>vs Aurora（同 PostgreSQL wire protocol）</strong>：</p>
<ul>
<li>PostgreSQL：自管 / RDS、特性接近 upstream、跨雲可用</li>
<li>Aurora：AWS managed、storage / compute 分離、更多 read replica</li>
<li>選 PostgreSQL：跨雲、想最新特性、預算敏感</li>
<li>選 Aurora：AWS 生態、需要更快 failover + 更多 read replica</li>
<li>詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a></li>
</ul>
<p><strong>vs CockroachDB（PostgreSQL wire protocol 相容）</strong>：</p>
<ul>
<li>PostgreSQL：single-primary OLTP、SQL 特性完整</li>
<li>CockroachDB：multi-region 強一致 SQL、PostgreSQL wire 相容但部分特性缺</li>
<li>選 PostgreSQL：single-region 或 read replica 跨 region 夠</li>
<li>選 CockroachDB：必須 multi-region active-active write</li>
<li>詳見 <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a></li>
</ul>
<p><strong>vs Spanner / Aurora DSQL（全球分散式 SQL）</strong>：</p>
<ul>
<li>PostgreSQL：傳統設計、跨 region 是 async replication</li>
<li>Spanner / Aurora DSQL：全球線性化、跨 region 強一致</li>
<li>選 PostgreSQL：90% 場景夠用、便宜、容易</li>
<li>選 Spanner / Aurora DSQL：金融交易、ticketing inventory、必須全球強一致</li>
</ul>
<p><strong>vs DynamoDB</strong>：</p>
<ul>
<li>詳見 <a href="/blog/backend/01-database/kv-document-capacity-planning/" data-link-title="1.10 KV / Document DB 容量規劃" data-link-desc="DynamoDB / Cosmos DB / Bigtable / MongoDB 等 KV / Document DB 的容量設計、partition key 取捨、capacity mode 選擇">1.10 KV / Document DB 容量規劃</a> 的 connection model 對比段</li>
</ul>
<p><strong>vs Neon（PostgreSQL serverless）</strong>：</p>
<ul>
<li>PostgreSQL：standard、自管或 RDS</li>
<li>Neon：branch-based、scale-to-zero、適合 dev / preview environment</li>
<li>選 Neon：dev / preview、稀疏 workload、CI 用</li>
<li>選 PostgreSQL：production sustained workload</li>
</ul>
<h2 id="容量規劃要點">容量規劃要點</h2>
<p><strong>1. Connection pool 必須有</strong>：</p>
<ul>
<li>直接連 1000+ connection 會壓垮 PostgreSQL</li>
<li>pgBouncer（最簡單、transaction pooling）</li>
<li>PgCat（rust 寫的進階替代、支援 sharding）</li>
<li>application 層 pool（HikariCP、SQLAlchemy pool）</li>
<li>通常組合使用：application pool 30-50 connection × 多 instance → pgBouncer 共享 → PostgreSQL 200 connection</li>
<li>對應 <a href="/blog/backend/knowledge-cards/connection-pool/" data-link-title="Connection Pool" data-link-desc="說明連線池如何限制下游資源並影響服務容量">Connection Pool 卡片</a></li>
</ul>
<p><strong>2. Replication 配置</strong>：</p>
<ul>
<li>streaming replication：async / sync / <a href="/blog/backend/knowledge-cards/quorum/" data-link-title="Quorum" data-link-desc="分散式系統以多數節點同意作為提交或讀取有效性的門檻">quorum</a></li>
<li>跨 AZ async：lag 通常 &lt; 100ms、failover 1-2 分鐘</li>
<li>跨 AZ sync：lag 接近 0、但寫入要等 standby ack、會降寫吞吐</li>
<li>跨 region 通常 async</li>
<li>HA 工具：Patroni（最常見）、pg_auto_failover、Stolon</li>
</ul>
<p><strong>3. Vacuum 跟 bloat 治理</strong>：</p>
<ul>
<li>PostgreSQL MVCC 會留下 dead tuples、必須 vacuum</li>
<li>autovacuum 配置：throttle 大表、避免在 peak 跑</li>
<li>bloat 監控：pg_stat_user_tables 看 dead_tup ratio</li>
<li>大表 vacuum 可能要 hours、影響 maintenance window</li>
</ul>
<p><strong>4. 大表 partitioning</strong>：</p>
<ul>
<li>單表 &gt; 1 TB 建議 partition（按時間、按 tenant）</li>
<li>partition pruning 讓 query 只掃需要的 partition</li>
<li>partition 限制：cross-partition unique constraint、跨 partition join 較慢</li>
</ul>
<p><strong>5. Index 策略</strong>：</p>
<ul>
<li>預設 B-tree、適合大多數 query</li>
<li>partial index 對 boolean / status column 特別有用</li>
<li>GIN / GiST 對 JSON / full-text / GIS</li>
<li>index 太多會拖累寫入、定期 review 未用 index（pg_stat_user_indexes）</li>
</ul>
<h2 id="安全dr-與角色分工">安全、DR 與角色分工</h2>
<p>PostgreSQL 的 production 完整性不只來自 SQL 特性，也來自資料存取、備份復原、升級責任與事故證據的分工。這一段補上 PG baseline 原本留在 limitation 的三個缺口：Security / RLS / audit logging、cross-region DR、application developer vs DBA / SRE 視角。</p>
<table>
  <thead>
      <tr>
          <th>責任面</th>
          <th>PostgreSQL 要回答的問題</th>
          <th>主要引用路徑</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Access control / RLS</td>
          <td>table、row、function、extension 與 service account 權限如何切</td>
          <td><a href="security-rls-audit-logging/">Security / RLS / Audit Logging</a>、<a href="/blog/backend/07-security-data-protection/data-protection-and-masking-governance/" data-link-title="7.4 資料保護與遮罩治理" data-link-desc="以問題驅動方式整理資料分級、遮罩、匯出與備份治理">7.4 Data Protection</a>、<a href="/blog/backend/knowledge-cards/audit-log/" data-link-title="Audit Log" data-link-desc="說明高風險操作如何留下可追溯、可稽核的紀錄">Audit Log</a></td>
      </tr>
      <tr>
          <td>TLS / credential</td>
          <td>application 連線、DB user、憑證與 secret rotation 如何治理</td>
          <td><a href="/blog/backend/knowledge-cards/tls-mtls/" data-link-title="TLS / mTLS" data-link-desc="說明傳輸加密與雙向憑證驗證如何保護跨邊界資料流">TLS / mTLS</a>、<a href="/blog/backend/knowledge-cards/credential/" data-link-title="Credential" data-link-desc="整理身分驗證與系統存取用秘密資料">Credential</a>、<a href="/blog/backend/knowledge-cards/secret-management/" data-link-title="Secret Management" data-link-desc="說明 token、key、password 與憑證如何保存、輪替與撤銷">Secret Management</a></td>
      </tr>
      <tr>
          <td>Cross-region DR</td>
          <td>region 失效時要 async replica、PITR、Aurora Global Database 還是 distributed SQL</td>
          <td><a href="cross-region-dr/">Cross-region DR</a>、<a href="/blog/backend/knowledge-cards/rpo/" data-link-title="RPO" data-link-desc="說明恢復點目標如何定義可接受資料損失範圍">RPO</a>、<a href="/blog/backend/knowledge-cards/rto/" data-link-title="RTO" data-link-desc="說明恢復時間目標如何約束事故回復策略">RTO</a>、<a href="/blog/backend/knowledge-cards/failover/" data-link-title="Failover" data-link-desc="說明主要服務或節點失效時如何切換到備援能力">Failover</a>、<a href="pitr-wal-archiving/">PITR + WAL Archiving</a></td>
      </tr>
      <tr>
          <td>Developer / DBA split</td>
          <td>application schema、migration、query、index 與 rollback 誰負責</td>
          <td><a href="developer-dba-responsibility-split/">Developer / DBA Responsibility Split</a>、<a href="/blog/backend/01-database/schema-design/" data-link-title="1.2 Schema Design 與資料建模" data-link-desc="整理 table、index、key、partition、denormalization 與命名規則">1.2 Schema Design</a>、<a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">1.6 Migration Playbook</a></td>
      </tr>
      <tr>
          <td>Incident evidence</td>
          <td>資料事故中要留下哪些 query、timeline、restore 與 decision evidence</td>
          <td><a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a>、<a href="/blog/backend/08-incident-response/incident-decision-log/" data-link-title="8.19 Incident Decision Log" data-link-desc="把事中假設、決策、證據、回退條件與責任人留下可復盤紀錄">8.19 Incident Decision Log</a></td>
      </tr>
  </tbody>
</table>
<p>Access control / RLS 的判讀重點是把資料責任放在資料層與 application 層之間分工。PostgreSQL 支援 role、grant、schema、function security 與 row-level security；但 RLS 會把授權邏輯拉進 database，適合 multi-tenant row isolation、資料平台或共享 reporting schema，日常 OLTP 仍要保留 application authorization 與 audit trail。</p>
<p>TLS / credential 的判讀重點是連線安全與憑證生命週期。Self-managed PostgreSQL 要處理 server cert、client cert、DB user rotation 與 <a href="/blog/backend/knowledge-cards/connection-pool/" data-link-title="Connection Pool" data-link-desc="說明連線池如何限制下游資源並影響服務容量">connection pool</a> 重連；managed PostgreSQL 常把 certificate、IAM auth 或 secret integration 交給平台，但 application pool、migration tool 與 read replica 仍要一起更新。</p>
<p>Cross-region DR 的判讀重點是 <a href="/blog/backend/knowledge-cards/rpo/" data-link-title="RPO" data-link-desc="說明恢復點目標如何定義可接受資料損失範圍">RPO</a> / <a href="/blog/backend/knowledge-cards/rto/" data-link-title="RTO" data-link-desc="說明恢復時間目標如何約束事故回復策略">RTO</a> 與資料一致性。自管 PostgreSQL 可用 streaming replication、WAL archiving、PITR 與 Patroni 做 region <a href="/blog/backend/knowledge-cards/failover/" data-link-title="Failover" data-link-desc="說明主要服務或節點失效時如何切換到備援能力">failover</a>；Aurora 把 backup、PITR 與 Global Database 交給 AWS；真正 active-active 或 global strong consistency 需求要回到 CockroachDB、Spanner 或 Aurora DSQL，single-primary PostgreSQL 保留為 region failover 與 async DR 路線。</p>
<p>Developer / DBA split 的判讀重點是把日常責任寫進流程。Application developer 擁有 query shape、transaction boundary、repository adapter 與 migration contract；DBA / SRE 擁有 backup、replication、pooler、extension、vacuum、index maintenance 與 DR drill；release gate 需要把兩邊 evidence 合在同一份 decision log。</p>
<h2 id="managed-pg-與相容變體路由">Managed PG 與相容變體路由</h2>
<p>PostgreSQL wire protocol 已成為 managed SQL 與 distributed SQL 的相容目標。選型時要區分「PostgreSQL 本體」、「managed PostgreSQL」、「PostgreSQL-compatible distributed SQL」與「PostgreSQL extension ecosystem」四種不同責任。</p>
<table>
  <thead>
      <tr>
          <th>變體</th>
          <th>適合情境</th>
          <th>主要代價 / 檢查點</th>
          <th>下一步路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RDS / self-managed PG</td>
          <td>想接近 upstream、保留跨雲與 extension 彈性</td>
          <td>團隊承擔 HA、backup、upgrade、vacuum 與 pooler</td>
          <td><a href="patroni-ha/">Patroni HA</a>、<a href="pitr-wal-archiving/">PITR + WAL Archiving</a></td>
      </tr>
      <tr>
          <td>Aurora PostgreSQL</td>
          <td>AWS 內 production OLTP、想轉移 HA / storage ops</td>
          <td>extension whitelist、cost model、cluster endpoint</td>
          <td><a href="migrate-to-aurora/">→ Aurora</a>、<a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor</a></td>
      </tr>
      <tr>
          <td>Cloud SQL / AlloyDB</td>
          <td>GCP 內 managed PostgreSQL 與 Google operation model</td>
          <td>extension / version matrix、IAM / backup / cost model</td>
          <td><a href="managed-pg-comparison/">Managed PG Comparison</a></td>
      </tr>
      <tr>
          <td>Azure Cosmos DB for PostgreSQL</td>
          <td>Citus-based distributed PostgreSQL、tenant / shard workload</td>
          <td>coordinator / worker topology、Citus 語意</td>
          <td><a href="citus-distributed/">Citus distributed</a>、<a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding</a>、<a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor</a></td>
      </tr>
      <tr>
          <td>Neon / serverless PG</td>
          <td>preview、branch、稀疏 workload、dev environment</td>
          <td>cold start、connection、production sustained workload</td>
          <td>本頁 vs Neon 段、後續 serverless PG comparison</td>
      </tr>
      <tr>
          <td>Aurora DSQL / CockroachDB</td>
          <td>global write、distributed SQL、region resiliency</td>
          <td>transaction retry、extension gap、latency / cost</td>
          <td><a href="migrate-to-aurora-dsql/">→ Aurora DSQL</a>、<a href="migrate-to-cockroachdb/">→ CockroachDB</a></td>
      </tr>
  </tbody>
</table>
<p>Managed PG 變體的引用規則是先查 compatibility，再談 migration。Extension whitelist、backup / restore API、logical replication 支援、connection endpoint 行為與 pricing 都是時間敏感 claim；實作前要回到官方文件確認版本，並把確認日期留在 migration plan 或 decision log。</p>
<h2 id="deep-article--migration-playbook已完成">Deep article + Migration playbook（已完成）</h2>
<table>
  <thead>
      <tr>
          <th>主題</th>
          <th>文章</th>
          <th>類型</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Streaming replication topology + LSN + slot</td>
          <td><a href="replication-topology/">replication-topology</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>pg_repack / pg-osc 跟 PG 內建 ALTER 行為</td>
          <td><a href="online-schema-change/">online-schema-change</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Process-per-connection model + pooler 必要性</td>
          <td><a href="connection-scaling/">connection-scaling</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>pgBouncer + PgCat connection pool</td>
          <td><a href="pgbouncer-config/">pgbouncer-config</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Patroni HA + DCS-based failover</td>
          <td><a href="patroni-ha/">patroni-ha</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Autovacuum tuning + bloat 治理</td>
          <td><a href="autovacuum-tuning/">autovacuum-tuning</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Logical replication + Debezium CDC</td>
          <td><a href="logical-replication-debezium/">logical-replication-debezium</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Citus distributed extension</td>
          <td><a href="citus-distributed/">citus-distributed</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>BDR / pgEdge / Bucardo multi-master</td>
          <td><a href="bdr-multi-master/">bdr-multi-master</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>MVCC + lock model（PG 並行控制核心）</td>
          <td><a href="mvcc-lock-model/">mvcc-lock-model</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>EXPLAIN / auto_explain / pg_hint_plan</td>
          <td><a href="query-optimization/">query-optimization</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Index method 選型決策樹（B-tree / GIN / GiST / BRIN）</td>
          <td><a href="index-selection/">index-selection</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Declarative partitioning + pg_partman</td>
          <td><a href="declarative-partitioning/">declarative-partitioning</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>JSONB binary storage + GIN index</td>
          <td><a href="jsonb-deep-dive/">jsonb-deep-dive</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Full-text search（tsvector + pg_trgm）</td>
          <td><a href="full-text-search/">full-text-search</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Extension ecosystem（pgvector / TimescaleDB 等）</td>
          <td><a href="extension-ecosystem/">extension-ecosystem</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>TimescaleDB hypertable + CAGG + compression</td>
          <td><a href="timescaledb-deep-dive/">timescaledb-deep-dive</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>pgvector HNSW / IVFFlat ANN search</td>
          <td><a href="pgvector-deep-dive/">pgvector-deep-dive</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>PostGIS geometry / geography + GiST</td>
          <td><a href="postgis-deep-dive/">postgis-deep-dive</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>PITR + WAL archiving</td>
          <td><a href="pitr-wal-archiving/">pitr-wal-archiving</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Replication slot management（含 PG 17 failover slot）</td>
          <td><a href="replication-slot-management/">replication-slot-management</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>SQL features baseline + MySQL 對比</td>
          <td><a href="sql-features-baseline/">sql-features-baseline</a></td>
          <td>Deep article</td>
      </tr>
      <tr>
          <td>Hands-on 操作路線</td>
          <td><a href="hands-on/">hands-on</a></td>
          <td>操作型章節群</td>
      </tr>
      <tr>
          <td>Major version upgrade（N → N+1 pg_upgrade）</td>
          <td><a href="major-version-upgrade/">major-version-upgrade</a></td>
          <td>Migration playbook（5-type 漏類 / 接近 Type B 但需 upgrade-specific audit）</td>
      </tr>
      <tr>
          <td>→ Aurora PostgreSQL</td>
          <td><a href="migrate-to-aurora/">migrate-to-aurora</a></td>
          <td>Migration playbook（Type C）</td>
      </tr>
      <tr>
          <td>→ Aurora DSQL（PG wire-compat distributed）</td>
          <td><a href="migrate-to-aurora-dsql/">migrate-to-aurora-dsql</a></td>
          <td>Migration playbook（Type E）</td>
      </tr>
      <tr>
          <td>→ CockroachDB</td>
          <td><a href="migrate-to-cockroachdb/">migrate-to-cockroachdb</a></td>
          <td>Migration playbook（Type E）</td>
      </tr>
      <tr>
          <td>Multi-region + GDPR rollout</td>
          <td><a href="multi-region-gdpr-rollout/">multi-region-gdpr-rollout</a></td>
          <td>Migration playbook（Type F）</td>
      </tr>
      <tr>
          <td>Partition redesign</td>
          <td><a href="partition-redesign/">partition-redesign</a></td>
          <td>Migration playbook（Type F）</td>
      </tr>
  </tbody>
</table>
<h2 id="補充正文路由">補充正文路由</h2>
<p>當前 deep article、migration playbook、補充正文與 hands-on 已 cover replication / HA / OSC / connection / CDC / sharding / multi-master / MVCC / query opt / index / partitioning / JSONB / FTS / extension（含 TimescaleDB / pgvector / PostGIS）/ backup / slot / SQL features / upgrade / migration / security / DR / managed variant 等維度。下列補充正文用來承接 overview 中提到的延伸議題：</p>
<ul>
<li><strong><a href="logical-decoding-plugins/">Logical decoding plugins deep dive</a></strong>：wal2json / pgoutput / decoderbufs 對位、CDC pipeline 整合</li>
<li><strong><a href="pg-partman-advanced/">pg_partman advanced</a></strong>：retention 跟 child partition 自動 management</li>
<li><strong><a href="connection-pooler-comparison/">Connection pooler comparison</a></strong>：PgBouncer vs Pgcat vs Odyssey 細部對比</li>
<li><strong><a href="aurora-io-optimized-cost/">Aurora I/O-Optimized vs standard</a></strong>：cost model 取捨</li>
<li><strong><a href="managed-pg-comparison/">AlloyDB / Cloud SQL 比較</a></strong>：GCP managed PG 選型</li>
</ul>
<p>上述補充篇已完成正文，並保留既有引用路徑。Logical decoding 接 <a href="logical-replication-debezium/">Logical Replication + Debezium</a> 與 <a href="replication-slot-management/">Replication Slot Management</a>；pg_partman advanced 接 <a href="declarative-partitioning/">Declarative Partitioning</a>；pooler comparison 接 <a href="connection-scaling/">Connection Scaling</a> 與 <a href="pgbouncer-config/">pgBouncer Config</a>；Aurora cost 接 <a href="migrate-to-aurora/">→ Aurora</a>；AlloyDB / Cloud SQL 接 <a href="managed-pg-comparison/">Managed PG Comparison</a>。</p>
<h2 id="案例對照">案例對照</h2>
<p>PostgreSQL 沒有直接的 09 case（多數 09 case 用 managed vendor）、但作為 <em>baseline 跟遷移源頭</em> 在許多 case 出現：</p>
<table>
  <thead>
      <tr>
          <th>案例</th>
          <th>跟 PostgreSQL 的關係</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">9.C23 Netflix Aurora consolidation</a></td>
          <td>從多套 RDBMS（含 PostgreSQL）統一到 Aurora</td>
      </tr>
      <tr>
          <td><a href="/blog/backend/09-performance-capacity/cases/clearent-azure-sql-hyperscale-payments/" data-link-title="9.C32 Clearent：Azure SQL Hyperscale 撐每年 5 億筆支付交易" data-link-desc="Clearent 在 Azure SQL Hyperscale 上處理每年 5 億筆支付交易、autoscale &#43; 微服務架構">9.C32 Clearent Azure SQL Hyperscale</a></td>
          <td>Azure 生態替代 PostgreSQL 的選擇</td>
      </tr>
      <tr>
          <td><a href="/blog/backend/09-performance-capacity/cases/ntt-docomo-lemino-japanese-streaming/" data-link-title="9.C29 NTT DOCOMO Lemino：3 個月達 500 萬 MAU 的串流後端" data-link-desc="Lemino 用 DynamoDB &#43; AWS Media Services 撐 30 channels live &#43; 5M MAU、工程工時下降 90%">9.C29 Lemino RDB connection limit</a></td>
          <td>PostgreSQL/MySQL 都有的 connection 限制</td>
      </tr>
  </tbody>
</table>
<h2 id="已知-limitation-與-audit-紀錄">已知 Limitation 與 Audit 紀錄</h2>
<p>本 vendor 頁的 22 篇 deep article + 6 篇 migration playbook 經過 4-reviewer audit（A 寫作規範 / B 跨檔一致性 / C 技術準確性 / D 框架偏誤）、Phase 1-3 修法完成。承認以下 limitation：</p>
<ul>
<li><strong>PG narrative bias</strong>：pgvector / TimescaleDB / extension-ecosystem / Citus 四篇對「PG 取代專業 DB」描述偏 PG-favoring；對手 vendor（Pinecone / InfluxDB / Vitess）的優勢段相對簡短。讀者選型時、請以 cost / ops / scale 三軸綜合判斷、不依本 vendor 頁單一視角。</li>
<li><strong>Anti-recommendation 深度不一</strong>：bdr-multi-master / extension-ecosystem 有「99% 不需要」明確邊界、其他篇章邊界較柔（如「Vector 量 &gt; 5-20M」是粗略門檻）。實際 production 決策請參考多 vendor 對照 + 自家 workload 量測。</li>
<li><strong>Sibling cross-link 狀態</strong>：MySQL ↔ PG sibling、PG 既有 ↔ 新章節 cross-link 已補（refer <a href="/blog/report/sibling-vendor-cross-link-bidirectionality-audit/" data-link-title="Sibling Vendor Cross-Link 雙向性 Audit：寫 Vendor Batch 結束必跑" data-link-desc="當寫 sibling vendor batch（A vs B）、cross-link 容易單向 — A 提 B 多次、B 沒回提 A、形成 navigation asymmetry。Case：MySQL 18 篇對 PG sibling cross-link 9 條、PG 對 MySQL cross-link 0 條。機制：寫第二個 batch 時 reference 第一個 batch 是自然行為、但 reverse direction 必須主動補。修法：vendor batch 結束跑 bidirectional link audit、`A → B` 跟 `B → A` 對比、缺一邊就補。">#136 卡</a>）；本輪同步補 Aurora / CockroachDB / Spanner / Cosmos DB / DynamoDB vendor 頁的反向 sibling 路由，剩餘精修可在各 migration playbook 補更細的 step-by-step 對照。</li>
<li><strong>時間敏感 vendor claim</strong>：Aurora DSQL（2024-12 preview / 2025-05 GA）/ pgvector（0.8 iterative scan）/ TimescaleDB version matrix / DSQL extension 支援範圍持續演進、本 vendor 頁以 2025-2026 公開狀態為準、實作前請以 vendor 官方 docs 為準（refer <a href="/blog/report/vendor-feature-time-sensitivity-claim-verification/" data-link-title="Vendor Feature 時間敏感性：Claim Verification 必跑、寫作日期必標" data-link-desc="寫 vendor article 時、feature limitation claim（『不支援 X』『最多 Y』『預設 Z』）有時間敏感性 — vendor 持續演進、寫作後 N 個月可能 invalidate 整段 audit 邏輯。Case：PlanetScale FK 不支援是 2022 年的事實、2023 末 Vitess 18 加 FK 支援、寫作時若不 verify、Phase 1 audit「FK audit &#43; 全 drop」整段過時。機制：LLM training cutoff vs vendor changelog 速度差、且 LLM 預設不標 claim 的時間性。修法：每篇 vendor article 標 *Last verified* date、limitation claim 必要時加 *as of N* 註、claim 反轉 invalidates 整段 audit 時必須重寫不是修補。">#137 卡</a>）。</li>
<li><strong>補充維度已正文化</strong>：<a href="security-rls-audit-logging/">Security / RLS / audit logging</a>、<a href="cross-region-dr/">cross-region DR</a>、<a href="developer-dba-responsibility-split/">application developer vs DBA 視角分工</a>、<a href="migrate-to-yugabytedb-tidb/">YugabyteDB / TiDB migration playbook</a>、<a href="specialized-pg-variants/">specialized PG variants</a> 已補成正文。本輪也補上跨 vendor 反向連結與時間敏感 claim 路由；下一輪可集中在 migration playbook 的操作步驟與 lab 化。</li>
</ul>
<p>詳細 audit findings 跟修法見 <a href="/blog/report/sibling-vendor-cross-link-bidirectionality-audit/" data-link-title="Sibling Vendor Cross-Link 雙向性 Audit：寫 Vendor Batch 結束必跑" data-link-desc="當寫 sibling vendor batch（A vs B）、cross-link 容易單向 — A 提 B 多次、B 沒回提 A、形成 navigation asymmetry。Case：MySQL 18 篇對 PG sibling cross-link 9 條、PG 對 MySQL cross-link 0 條。機制：寫第二個 batch 時 reference 第一個 batch 是自然行為、但 reverse direction 必須主動補。修法：vendor batch 結束跑 bidirectional link audit、`A → B` 跟 `B → A` 對比、缺一邊就補。">#136 Sibling Vendor Cross-Link Bidirectionality</a> / <a href="/blog/report/vendor-feature-time-sensitivity-claim-verification/" data-link-title="Vendor Feature 時間敏感性：Claim Verification 必跑、寫作日期必標" data-link-desc="寫 vendor article 時、feature limitation claim（『不支援 X』『最多 Y』『預設 Z』）有時間敏感性 — vendor 持續演進、寫作後 N 個月可能 invalidate 整段 audit 邏輯。Case：PlanetScale FK 不支援是 2022 年的事實、2023 末 Vitess 18 加 FK 支援、寫作時若不 verify、Phase 1 audit「FK audit &#43; 全 drop」整段過時。機制：LLM training cutoff vs vendor changelog 速度差、且 LLM 預設不標 claim 的時間性。修法：每篇 vendor article 標 *Last verified* date、limitation claim 必要時加 *as of N* 註、claim 反轉 invalidates 整段 audit 時必須重寫不是修補。">#137 Vendor Feature 時間敏感性</a> / <a href="/blog/report/cross-reviewer-convergence-priority-weighting/" data-link-title="Cross-Reviewer Convergence：多 Reviewer 收斂的 finding 比單 Reviewer flag 信號強" data-link-desc="Multi-reviewer audit（4-reviewer / N-reviewer parallel）後、finding priority 不該是 *N 個 reviewer 報告平均合併*、應該按 *跨 reviewer convergence* 加權 — 兩個獨立 reviewer 從不同 axis 各自發現同一 finding 是 *信號收斂*、比單 reviewer flag 信號強 5-10x。Case：MySQL 17 篇 4-reviewer audit、Reviewer A（寫作規範）跟 Reviewer B（跨檔一致性）獨立 flag 同一 finding『4 篇 migration playbook 缺 weight &#43; banner』、是跨軸 convergence、是最 high-priority fix。機制：N 個獨立 axis 隨機 hit 同一 finding 的機率隨 N 增加而 exponential decline、convergence 排除噪音、是 signal-to-noise 的最高比訊號。修法：multi-reviewer audit 後做 *cross-reviewer matrix*、convergence column 自動標 priority bump。">#138 Cross-Reviewer Convergence</a>。</p>
<h2 id="常見陷阱">常見陷阱</h2>
<ul>
<li><strong>connection 沒 pool 直接連</strong>：1000 application instance × 30 connection = 30K connection、PostgreSQL 撐不住</li>
<li><strong>沒 vacuum 治理</strong>：dead tuple 累積、table bloat、query 變慢</li>
<li><strong>大表沒 partition</strong>：&gt; 1 TB 單表的 vacuum / index rebuild 變成事故</li>
<li><strong>index 不 review</strong>：寫吞吐被舊 index 拖垮</li>
<li><strong>跨 AZ sync replication 給寫入吞吐高的 workload</strong>：每次 commit 等 standby ack、寫吞吐減半</li>
<li><strong>logical replication 拖太多 publication</strong>：可能造成 primary WAL 堆積、disk 爆</li>
</ul>
<h2 id="下一步路由">下一步路由</h2>
<ul>
<li>完整 T1 對照：<a href="/blog/backend/01-database/vendors/" data-link-title="資料庫 Vendor 清單" data-link-desc="規劃 SQL、managed SQL、document、KV 與 distributed SQL 的服務頁撰寫順序與教學大綱">01-database vendors index</a></li>
<li>平行：<a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor</a>、<a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor</a>（managed PostgreSQL）</li>
<li>操作：<a href="/blog/backend/01-database/vendors/postgresql/hands-on/" data-link-title="PostgreSQL Hands-on 操作路線" data-link-desc="PostgreSQL local lab、connection pool、PITR restore drill、schema migration evidence 與 HA failover 的操作型章節設計">PostgreSQL Hands-on</a>（local lab、pool、PITR、migration evidence、HA drill）</li>
<li>上游：<a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">1.1 高併發資料存取</a>、<a href="/blog/backend/01-database/transaction-boundary/" data-link-title="1.3 Transaction 與一致性邊界" data-link-desc="交易邊界、isolation level、retry 策略、distributed transaction（2PC、Saga）與跨 region 強一致取捨">1.3 Transaction Boundary</a></li>
<li>下游：<a href="/blog/backend/01-database/kv-document-capacity-planning/" data-link-title="1.10 KV / Document DB 容量規劃" data-link-desc="DynamoDB / Cosmos DB / Bigtable / MongoDB 等 KV / Document DB 的容量設計、partition key 取捨、capacity mode 選擇">1.10 KV / Document DB 容量規劃</a>（PostgreSQL 不適用時的替代）/ <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>（PostgreSQL 不夠用時的升級路徑）</li>
<li>跨模組：<a href="/blog/backend/09-performance-capacity/bottleneck-localization/" data-link-title="9.5 瓶頸定位流程" data-link-desc="從 app 到 DB / cache / broker / 第三方 quota 的逐層瓶頸定位">9.5 瓶頸定位流程</a> — connection / replication lag / vacuum 都是 PostgreSQL 常見 bottleneck 源</li>
<li>官方：<a href="https://www.postgresql.org/docs/">PostgreSQL Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>資料庫大版本升級</title><link>https://tarrragon.github.io/blog/infra/upgrade/database-major-upgrade/</link><pubDate>Fri, 26 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/infra/upgrade/database-major-upgrade/</guid><description>&lt;p>資料庫大版本升級是所有升級類型中風險最高的一種，因為資料庫承載的是不可重建的狀態。Runtime 升級（PHP 5.6→8.x）改壞了可以切回舊版本重新部署（切換 PHP 版本即可回退）；平台遷移（共享主機→雲端）改壞了可以把 DNS 切回去（TTL 期間內生效）。資料庫升級改壞了，回退手段是從備份還原——而還原需要時間，還原期間服務不可用，且還原點之後的寫入會遺失。這個不對稱決定了資料庫升級的操作模式：每一步都需要驗證通過才進下一步，且每一步都有明確的回退路徑。&lt;/p>
&lt;h2 id="升級前的相容性評估">升級前的相容性評估&lt;/h2>
&lt;p>大版本升級不只是換一個二進位檔——新版本可能改變 SQL 行為、儲存格式、認證方式與預設值。在動任何生產資源之前，先在本地或測試環境把相容性問題找出來。&lt;/p>
&lt;h3 id="mysql-57--80-的常見破壞性變更">MySQL 5.7 → 8.0 的常見破壞性變更&lt;/h3>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>變更項&lt;/th>
 &lt;th>影響&lt;/th>
 &lt;th>檢查方式&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>GROUP BY&lt;/code> 隱式排序移除&lt;/td>
 &lt;td>依賴 &lt;code>GROUP BY&lt;/code> 順序的查詢結果可能改變&lt;/td>
 &lt;td>搜尋沒有 &lt;code>ORDER BY&lt;/code> 的 &lt;code>GROUP BY&lt;/code> 查詢&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>預設字元集 utf8 → utf8mb4&lt;/td>
 &lt;td>欄位長度與索引大小計算改變，索引可能超過限制&lt;/td>
 &lt;td>檢查 &lt;code>VARCHAR(255)&lt;/code> + 唯一索引的欄位&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>認證方式改為 caching_sha2&lt;/td>
 &lt;td>舊版 client / driver 可能無法連線&lt;/td>
 &lt;td>確認應用程式的 MySQL driver 版本支援 caching_sha2_password&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>保留字新增（RANK、ROW_NUMBER）&lt;/td>
 &lt;td>用這些字當欄位名或別名的查詢會報語法錯&lt;/td>
 &lt;td>&lt;code>grep -rn &amp;quot;RANK|ROW_NUMBER|GROUPS|CUME_DIST&amp;quot; --include=&amp;quot;*.sql&amp;quot;&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>JSON 函式行為變更&lt;/td>
 &lt;td>&lt;code>JSON_MERGE&lt;/code> 改名為 &lt;code>JSON_MERGE_PRESERVE&lt;/code>、行為語意不同&lt;/td>
 &lt;td>搜尋 &lt;code>JSON_MERGE&lt;/code> 呼叫&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h3 id="postgresql-大版本升級的檢查點">PostgreSQL 大版本升級的檢查點&lt;/h3>
&lt;p>PostgreSQL 的大版本升級相對穩定，但仍有需要確認的項目：extension 版本是否跟新 PostgreSQL 版本相容（特別是 PostGIS、pg_partman、timescaledb 這類複雜 extension）、&lt;code>pg_upgrade&lt;/code> 的 &lt;code>--check&lt;/code> 模式可以在不實際升級的前提下驗證相容性。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># PostgreSQL: 升級前 dry-run 檢查&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">pg_upgrade --old-datadir /var/lib/postgresql/13/main &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --new-datadir /var/lib/postgresql/16/main &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --old-bindir /usr/lib/postgresql/13/bin &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --new-bindir /usr/lib/postgresql/16/bin &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --check&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="應用程式層的查詢相容性">應用程式層的查詢相容性&lt;/h3>
&lt;p>把應用程式的所有 SQL 查詢（ORM 產生的也算）對新版本跑一遍。重點是行為變更而非語法錯誤——語法錯誤會立刻報錯、容易抓；行為變更（排序結果不同、型別轉換規則不同）不會報錯、但結果錯誤。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># MySQL 升級前檢查工具&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">mysqlcheck --all-databases --check-upgrade
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">mysql_upgrade --upgrade-system-tables --dry-run&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>ORM 和 database driver 也要確認版本支援。PHP 的 &lt;code>mysqli&lt;/code> 在 PHP 7.4+ 預設支援 caching_sha2_password、但舊版不支援。Node.js 的 &lt;code>mysql2&lt;/code> 原生支援、但 &lt;code>mysql&lt;/code>（舊套件）不支援。Python 的 &lt;code>mysqlclient&lt;/code> 1.4+ 支援。&lt;/p>
&lt;h2 id="備份升級前的保險">備份：升級前的保險&lt;/h2>
&lt;p>升級前的備份不是日常備份——它是一份明確的、經過驗證的、標記為「升級前保險點」的快照。&lt;/p>
&lt;h3 id="備份操作">備份操作&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># MySQL: 完整 dump（InnoDB 用 --single-transaction 避免鎖表）&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">mysqldump --all-databases --single-transaction --routines --triggers &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --set-gtid-purged&lt;span class="o">=&lt;/span>OFF &amp;gt; pre-upgrade-&lt;span class="k">$(&lt;/span>date +%Y%m%d-%H%M&lt;span class="k">)&lt;/span>.sql
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="c1"># PostgreSQL: 完整 dump&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">pg_dumpall &amp;gt; pre-upgrade-&lt;span class="k">$(&lt;/span>date +%Y%m%d-%H%M&lt;span class="k">)&lt;/span>.sql&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>RDS 環境：在升級操作前手動建立 snapshot，而非依賴自動備份。自動備份在升級過程中可能被新的快照覆蓋，手動 snapshot 不會被自動清除。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">aws rds create-db-snapshot &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --db-instance-identifier mydb-prod &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --db-snapshot-identifier pre-upgrade-&lt;span class="k">$(&lt;/span>date +%Y%m%d&lt;span class="k">)&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="備份驗證">備份驗證&lt;/h3>
&lt;p>備份存在不等於備份可用。驗證方式是把備份還原到一台獨立的測試實例、確認資料完整：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># 還原到測試實例&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">mysql -h test-instance -u admin -p &amp;lt; pre-upgrade-20260626-1400.sql
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="c1"># 驗證關鍵表的 row count&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">mysql -h test-instance -e &lt;span class="s2">&amp;#34;SELECT COUNT(*) FROM orders; SELECT COUNT(*) FROM users;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>記錄還原時間：「從這份備份還原到可服務狀態需要 N 分鐘/小時」。這個數字是升級失敗時的停機時間下限——管理層需要這個數字來評估升級的風險。&lt;/p></description><content:encoded><![CDATA[<p>資料庫大版本升級是所有升級類型中風險最高的一種，因為資料庫承載的是不可重建的狀態。Runtime 升級（PHP 5.6→8.x）改壞了可以切回舊版本重新部署（切換 PHP 版本即可回退）；平台遷移（共享主機→雲端）改壞了可以把 DNS 切回去（TTL 期間內生效）。資料庫升級改壞了，回退手段是從備份還原——而還原需要時間，還原期間服務不可用，且還原點之後的寫入會遺失。這個不對稱決定了資料庫升級的操作模式：每一步都需要驗證通過才進下一步，且每一步都有明確的回退路徑。</p>
<h2 id="升級前的相容性評估">升級前的相容性評估</h2>
<p>大版本升級不只是換一個二進位檔——新版本可能改變 SQL 行為、儲存格式、認證方式與預設值。在動任何生產資源之前，先在本地或測試環境把相容性問題找出來。</p>
<h3 id="mysql-57--80-的常見破壞性變更">MySQL 5.7 → 8.0 的常見破壞性變更</h3>
<table>
  <thead>
      <tr>
          <th>變更項</th>
          <th>影響</th>
          <th>檢查方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>GROUP BY</code> 隱式排序移除</td>
          <td>依賴 <code>GROUP BY</code> 順序的查詢結果可能改變</td>
          <td>搜尋沒有 <code>ORDER BY</code> 的 <code>GROUP BY</code> 查詢</td>
      </tr>
      <tr>
          <td>預設字元集 utf8 → utf8mb4</td>
          <td>欄位長度與索引大小計算改變，索引可能超過限制</td>
          <td>檢查 <code>VARCHAR(255)</code> + 唯一索引的欄位</td>
      </tr>
      <tr>
          <td>認證方式改為 caching_sha2</td>
          <td>舊版 client / driver 可能無法連線</td>
          <td>確認應用程式的 MySQL driver 版本支援 caching_sha2_password</td>
      </tr>
      <tr>
          <td>保留字新增（RANK、ROW_NUMBER）</td>
          <td>用這些字當欄位名或別名的查詢會報語法錯</td>
          <td><code>grep -rn &quot;RANK|ROW_NUMBER|GROUPS|CUME_DIST&quot; --include=&quot;*.sql&quot;</code></td>
      </tr>
      <tr>
          <td>JSON 函式行為變更</td>
          <td><code>JSON_MERGE</code> 改名為 <code>JSON_MERGE_PRESERVE</code>、行為語意不同</td>
          <td>搜尋 <code>JSON_MERGE</code> 呼叫</td>
      </tr>
  </tbody>
</table>
<h3 id="postgresql-大版本升級的檢查點">PostgreSQL 大版本升級的檢查點</h3>
<p>PostgreSQL 的大版本升級相對穩定，但仍有需要確認的項目：extension 版本是否跟新 PostgreSQL 版本相容（特別是 PostGIS、pg_partman、timescaledb 這類複雜 extension）、<code>pg_upgrade</code> 的 <code>--check</code> 模式可以在不實際升級的前提下驗證相容性。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># PostgreSQL: 升級前 dry-run 檢查</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pg_upgrade --old-datadir /var/lib/postgresql/13/main <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>           --new-datadir /var/lib/postgresql/16/main <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>           --old-bindir /usr/lib/postgresql/13/bin <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>           --new-bindir /usr/lib/postgresql/16/bin <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>           --check</span></span></code></pre></div><h3 id="應用程式層的查詢相容性">應用程式層的查詢相容性</h3>
<p>把應用程式的所有 SQL 查詢（ORM 產生的也算）對新版本跑一遍。重點是行為變更而非語法錯誤——語法錯誤會立刻報錯、容易抓；行為變更（排序結果不同、型別轉換規則不同）不會報錯、但結果錯誤。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># MySQL 升級前檢查工具</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">mysqlcheck --all-databases --check-upgrade
</span></span><span class="line"><span class="ln">3</span><span class="cl">mysql_upgrade --upgrade-system-tables --dry-run</span></span></code></pre></div><p>ORM 和 database driver 也要確認版本支援。PHP 的 <code>mysqli</code> 在 PHP 7.4+ 預設支援 caching_sha2_password、但舊版不支援。Node.js 的 <code>mysql2</code> 原生支援、但 <code>mysql</code>（舊套件）不支援。Python 的 <code>mysqlclient</code> 1.4+ 支援。</p>
<h2 id="備份升級前的保險">備份：升級前的保險</h2>
<p>升級前的備份不是日常備份——它是一份明確的、經過驗證的、標記為「升級前保險點」的快照。</p>
<h3 id="備份操作">備份操作</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># MySQL: 完整 dump（InnoDB 用 --single-transaction 避免鎖表）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">mysqldump --all-databases --single-transaction --routines --triggers <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --set-gtid-purged<span class="o">=</span>OFF &gt; pre-upgrade-<span class="k">$(</span>date +%Y%m%d-%H%M<span class="k">)</span>.sql
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># PostgreSQL: 完整 dump</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">pg_dumpall &gt; pre-upgrade-<span class="k">$(</span>date +%Y%m%d-%H%M<span class="k">)</span>.sql</span></span></code></pre></div><p>RDS 環境：在升級操作前手動建立 snapshot，而非依賴自動備份。自動備份在升級過程中可能被新的快照覆蓋，手動 snapshot 不會被自動清除。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">aws rds create-db-snapshot <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --db-instance-identifier mydb-prod <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --db-snapshot-identifier pre-upgrade-<span class="k">$(</span>date +%Y%m%d<span class="k">)</span></span></span></code></pre></div><h3 id="備份驗證">備份驗證</h3>
<p>備份存在不等於備份可用。驗證方式是把備份還原到一台獨立的測試實例、確認資料完整：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 還原到測試實例</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">mysql -h test-instance -u admin -p &lt; pre-upgrade-20260626-1400.sql
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 驗證關鍵表的 row count</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">mysql -h test-instance -e <span class="s2">&#34;SELECT COUNT(*) FROM orders; SELECT COUNT(*) FROM users;&#34;</span></span></span></code></pre></div><p>記錄還原時間：「從這份備份還原到可服務狀態需要 N 分鐘/小時」。這個數字是升級失敗時的停機時間下限——管理層需要這個數字來評估升級的風險。</p>
<h2 id="平行驗證策略">平行驗證策略</h2>
<p>在生產環境切換之前，先在新版本的平行環境上跑完所有驗證。平行驗證的目標是讓切換那一刻的風險降到最低——切換時已經知道新版本在相同資料和相同負載下的行為。</p>
<h3 id="建立平行環境">建立平行環境</h3>
<table>
  <thead>
      <tr>
          <th>方式</th>
          <th>適用情境</th>
          <th>資料同步方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Read replica + 版本升級</td>
          <td>RDS 環境、支援跨版本 replica</td>
          <td>RDS 原生複寫</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>需要跨大版本</td>
          <td>pg_logical / binlog → 新實例</td>
      </tr>
      <tr>
          <td>Dump / restore</td>
          <td>任何環境、資料量可控</td>
          <td>一次性 dump + 增量 binlog 回放</td>
      </tr>
  </tbody>
</table>
<h3 id="驗證項目">驗證項目</h3>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>方法</th>
          <th>通過標準</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>應用程式測試套件</td>
          <td>對新版本實例跑完整測試</td>
          <td>0 failure</td>
      </tr>
      <tr>
          <td>查詢效能</td>
          <td>對比兩個版本的 slow query log</td>
          <td>p99 延遲無顯著退化（&lt;10% 差異）</td>
      </tr>
      <tr>
          <td>資料一致性</td>
          <td>關鍵表 row count + checksum</td>
          <td>完全一致</td>
      </tr>
      <tr>
          <td>連線行為</td>
          <td>應用程式連新版本、觀察連線池</td>
          <td>無 authentication failure</td>
      </tr>
      <tr>
          <td>備份還原</td>
          <td>從新版本做一次 dump + restore</td>
          <td>還原成功、資料完整</td>
      </tr>
  </tbody>
</table>
<p>平行驗證至少跑一週。時間越長、覆蓋到的邊界情境越多——月結批次、週期性報表、低頻排程任務都可能觸發只在特定條件下才出現的相容性問題。</p>
<h2 id="切換策略">切換策略</h2>
<p>切換策略的選擇取決於三個變數的取捨：操作複雜度、停機時間、回退速度。</p>
<h3 id="in-place-升級">In-place 升級</h3>
<p>直接在原實例上升級版本。RDS 的操作是修改 engine version、等待升級完成。</p>
<ul>
<li><strong>停機</strong>：升級期間實例不可用（MySQL 5.7→8.0 在 RDS 上約 10-30 分鐘，視資料量而定）</li>
<li><strong>回退</strong>：從 pre-upgrade snapshot 還原，需要 snapshot restore 時間（分鐘到小時級）</li>
<li><strong>適用</strong>：可接受計畫性停機的環境、資料量不大</li>
</ul>
<h3 id="blue-green-切換">Blue-green 切換</h3>
<p>在新版本上建立獨立實例、透過 replication 同步資料、切換應用程式的連線端點。</p>
<ul>
<li><strong>停機</strong>：接近零（DNS TTL 或 endpoint 切換的傳播時間）</li>
<li><strong>回退</strong>：把連線端點切回舊實例，舊實例持續運行</li>
<li><strong>複雜度</strong>：需要維護兩個實例的同步、切換時要處理複寫延遲</li>
<li><strong>適用</strong>：不能接受停機的 production 環境</li>
</ul>
<p>RDS 從 2022 年開始提供原生的 Blue/Green Deployments 功能，簡化了同步與切換的操作：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">aws rds create-blue-green-deployment <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --blue-green-deployment-name mydb-upgrade <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --source arn:aws:rds:ap-northeast-1:123456789012:db:mydb-prod <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --target-engine-version 8.0.35</span></span></code></pre></div><h3 id="read-replica-升級後提升">Read replica 升級後提升</h3>
<p>建立指定新版本的 read replica，replica 同步完成後提升為獨立實例，應用程式切換連線。</p>
<ul>
<li><strong>停機</strong>：提升 replica 的幾秒 + 連線切換</li>
<li><strong>回退</strong>：舊 primary 仍在，切回即可</li>
<li><strong>限制</strong>：不是所有版本組合都支援跨版本 replica</li>
</ul>
<h3 id="選型判準">選型判準</h3>
<table>
  <thead>
      <tr>
          <th>考量</th>
          <th>In-place</th>
          <th>Blue-green</th>
          <th>Replica 提升</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>操作複雜度</td>
          <td>低</td>
          <td>中</td>
          <td>中</td>
      </tr>
      <tr>
          <td>停機時間</td>
          <td>10-30 分鐘</td>
          <td>接近零</td>
          <td>幾秒</td>
      </tr>
      <tr>
          <td>回退速度</td>
          <td>慢（snapshot restore）</td>
          <td>快（切回舊端點）</td>
          <td>快（切回舊 primary）</td>
      </tr>
      <tr>
          <td>成本</td>
          <td>最低</td>
          <td>升級期間雙倍</td>
          <td>升級期間雙倍</td>
      </tr>
  </tbody>
</table>
<h2 id="升級後的驗證與監控">升級後的驗證與監控</h2>
<p>切換完成後的 48-72 小時是觀察期。這段時間舊實例保持可用狀態，直到確認新版本穩定才退役。</p>
<h3 id="切換後立即驗證">切換後立即驗證</h3>
<ol>
<li>應用程式的所有關鍵路徑可正常操作（登入、查詢、寫入、交易）</li>
<li>連線池行為正常（沒有持續的 authentication failure 或 connection reset）</li>
<li>排程任務（cron job、背景 worker）正常連線並執行</li>
</ol>
<h3 id="效能監控">效能監控</h3>
<p>比較升級前後的關鍵指標：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 觀察升級後的 slow query 數量</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">mysql -e <span class="s2">&#34;SHOW GLOBAL STATUS LIKE &#39;Slow_queries&#39;;&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 比較 p99 延遲（需要 application-level metrics）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># CloudWatch: DBInstanceIdentifier → ReadLatency, WriteLatency</span></span></span></code></pre></div><p>升級後效能退化的常見原因：optimizer 行為改變（新版本選了不同的執行計畫）、buffer pool 冷啟動（升級後快取是空的、前幾小時延遲偏高是正常的）。如果 48 小時後延遲仍未回到基線，檢查 slow query log 找出退化的具體查詢。</p>
<h3 id="舊實例退役">舊實例退役</h3>
<p>觀察期結束、新版本確認穩定後：</p>
<ol>
<li>停止舊實例的 replication（如果仍在同步）</li>
<li>保留舊實例的 final snapshot</li>
<li>刪除舊實例（先確認 deletion protection 關閉是刻意的、不是誤操作）</li>
<li>更新文件：記錄升級日期、版本號、升級過程中遇到的問題</li>
</ol>
<h2 id="時程與管理層溝通">時程與管理層溝通</h2>
<table>
  <thead>
      <tr>
          <th>升級類型</th>
          <th>典型時程</th>
          <th>停機窗口</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Minor version（5.7.x → 5.7.y）</td>
          <td>2-4 小時計畫維護</td>
          <td>10-15 分鐘</td>
      </tr>
      <tr>
          <td>Major version（5.7 → 8.0）in-place</td>
          <td>1-2 週（評估 + 驗證 + 切換 + 監控）</td>
          <td>10-30 分鐘</td>
      </tr>
      <tr>
          <td>Major version blue-green</td>
          <td>2-3 週（含平行運行期）</td>
          <td>接近零</td>
      </tr>
  </tbody>
</table>
<p>向管理層說明時的關鍵框架：資料是不可重建的，升級策略是「在旁邊建一個新版本的資料庫、驗證它在相同資料和相同負載下行為正確、然後切過去」。多出來的時間買的是「切換那一刻的信心」和「出問題時能快速回退」——兩者對生產服務都是必要的保險。</p>
<h2 id="跨分類引用">跨分類引用</h2>
<ul>
<li>→ <a href="/blog/infra/upgrade/upgrade-framework/" data-link-title="升級的共通操作框架" data-link-desc="任何環境或系統升級的四階段模型：差異評估、平行環境驗證、分批切換、退役舊環境，以及貫穿全程的升級紀律">升級的共通操作框架</a>：四階段模型的通用說明</li>
<li>→ <a href="/blog/infra/05-core-services/stateful-protection-dependency/" data-link-title="Stateful 資源保護與跨服務依賴表達" data-link-desc="stateful 資源的保護策略（multi-AZ、備份、刪除保護）、stateful 與 stateless 的操作差異，以及用 output 與 data source 表達服務間依賴">Stateful 資源保護與依賴表達</a>：multi-AZ、備份、deletion protection 的 IaC 描述</li>
<li>→ <a href="/blog/infra/takeover/legacy-database-backup-migration/" data-link-title="無 SSH 環境的資料庫備份與變更管理" data-link-desc="在只有 phpMyAdmin 或有限遠端連線的無 SSH 環境裡，怎麼建立可靠的資料庫備份策略、schema 變更紀律與還原演練流程">無 SSH 環境的資料庫備份與變更管理</a>：接手環境的資料庫備份策略</li>
</ul>
]]></content:encoded></item><item><title>規模演進</title><link>https://tarrragon.github.io/blog/monitoring/04-collector/scaling-evolution/</link><pubDate>Fri, 19 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/monitoring/04-collector/scaling-evolution/</guid><description>&lt;p>Collector 的儲存方案是可插拔 storage backend — 同一個 binary 透過啟動參數選擇不同的 storage implementation。Go 的 interface composition 讓 storage 分成 BasicStorage（所有 backend 共用）和 AnalyticsStorage（PostgreSQL 層新增），內部實作（SQLite / PostgreSQL / 時間序列 DB）分離，切換是 config change 而非重寫程式碼。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-go" data-lang="go">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="kd">type&lt;/span> &lt;span class="nx">BasicStorage&lt;/span> &lt;span class="kd">interface&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl"> &lt;span class="nf">Store&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nx">event&lt;/span> &lt;span class="nx">Event&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="kt">error&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl"> &lt;span class="nf">Query&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nx">filter&lt;/span> &lt;span class="nx">QueryFilter&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">([]&lt;/span>&lt;span class="nx">Event&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="kt">error&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl"> &lt;span class="nf">Close&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="kt">error&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl"> &lt;span class="nf">Downsample&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="kt">error&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl"> &lt;span class="nf">Purge&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="kt">error&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="kd">type&lt;/span> &lt;span class="nx">AnalyticsStorage&lt;/span> &lt;span class="kd">interface&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl"> &lt;span class="nx">BasicStorage&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl"> &lt;span class="nf">Aggregate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nx">spec&lt;/span> &lt;span class="nx">AggregateSpec&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nx">AggregateResult&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="kt">error&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl"> &lt;span class="nf">Funnel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nx">steps&lt;/span> &lt;span class="p">[]&lt;/span>&lt;span class="kt">string&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">timeWindow&lt;/span> &lt;span class="nx">Duration&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nx">FunnelResult&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="kt">error&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl"> &lt;span class="nf">Cohort&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nx">groupBy&lt;/span> &lt;span class="kt">string&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">metric&lt;/span> &lt;span class="kt">string&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nx">CohortResult&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="kt">error&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="p">}&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>SQLite implementation 只實作 &lt;code>BasicStorage&lt;/code>。PostgreSQL implementation 實作 &lt;code>AnalyticsStorage&lt;/code>。Dashboard 用 Go 的 type assertion（&lt;code>if as, ok := storage.(AnalyticsStorage); ok { ... }&lt;/code>）判斷能力 — funnel/cohort 視圖在 SQLite 模式下不顯示入口，而非顯示後報錯。&lt;/p>
&lt;p>選擇哪個 backend 取決於部署場景和查詢需求：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>場景&lt;/th>
 &lt;th>Backend&lt;/th>
 &lt;th>啟動參數&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>自架簡單版（零依賴）&lt;/td>
 &lt;td>SQLite&lt;/td>
 &lt;td>&lt;code>--storage=sqlite&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>需要聚合分析的自用版&lt;/td>
 &lt;td>PostgreSQL&lt;/td>
 &lt;td>&lt;code>--storage=postgres --dsn=...&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>高併發 + 長期保留&lt;/td>
 &lt;td>時間序列 DB&lt;/td>
 &lt;td>&lt;code>--storage=timescale --dsn=...&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="sqlite-backendday-one-預設">SQLite Backend（day-one 預設）&lt;/h2>
&lt;p>SQLite 是嵌入式資料庫，編譯進 collector binary 中，不需要額外 server。Go 用 &lt;code>modernc.org/sqlite&lt;/code>（pure Go、無 CGO 依賴、效能約為 CGO driver mattn/go-sqlite3 的 60-80%，自用規模下足夠），開源使用者 &lt;code>go build &amp;amp;&amp;amp; ./collector&lt;/code> 就能跑，部署步驟為零。WAL mode 允許讀寫並行 — dashboard 的 SELECT 查詢不會被 ingestion 的 INSERT 阻塞，反之亦然。寫入之間的競爭由 busy_timeout 處理。&lt;/p>
&lt;h3 id="能力範圍">能力範圍&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>索引查詢&lt;/strong>：按 type、name、timestamp 建索引，查詢從全表掃描變成索引查找&lt;/li>
&lt;li>&lt;strong>SQL 聚合&lt;/strong>：&lt;code>SELECT name, COUNT(*) FROM events WHERE type='error' GROUP BY name&lt;/code> — 一行 SQL 完成分群計數&lt;/li>
&lt;li>&lt;strong>跨欄位過濾&lt;/strong>：&lt;code>WHERE type='error' AND name LIKE 'terminal.%' AND ts &amp;gt; '2026-06-18'&lt;/code>&lt;/li>
&lt;li>&lt;strong>寫入&lt;/strong>：WAL mode 下每秒數千筆 append 寫入&lt;/li>
&lt;/ul>
&lt;h3 id="events-主表-ddl">Events 主表 DDL&lt;/h3>
&lt;p>Events 表的欄位從 &lt;a href="https://tarrragon.github.io/blog/monitoring/02-log-schema/event-schema-fields/" data-link-title="event.schema.json 完整欄位解說" data-link-desc="監控事件的 JSON Schema 定義 — 每個欄位的語意、必填/選填、資料型別和設計理由">event.schema.json&lt;/a> 的 JSON 結構推導。Source 的 nested object 攤平成獨立 column — 方便 SQL 查詢和索引，不需要每次從 JSON 裡 extract。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">INTEGER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">AUTOINCREMENT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">v&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">INTEGER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">DEFAULT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">type&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">ts&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">source_sdk&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">source_app&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">source_version&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">source_platform&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">source_os&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">session_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">session_started&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">level&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">data&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">error_message&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">error_stack&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">error_type&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">receive_ts&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>source_sdk&lt;/code> 獨立成 column 讓「按 SDK 來源篩選」（&lt;code>WHERE source_sdk = 'python'&lt;/code>）不需要從 JSON extract。&lt;code>data&lt;/code> 用 TEXT 存 JSON。SQLite 沒有原生 JSON 型別，但 3.38+ 支援 &lt;code>json_extract()&lt;/code> 函式做查詢（&lt;code>WHERE json_extract(data, '$.duration_ms') &amp;gt; 1000&lt;/code>）。&lt;code>session_id&lt;/code> 獨立成 column 讓 session 回放的 JOIN 不需要 JSON extract。&lt;code>error_stack&lt;/code> 獨立成 column 讓 error 調查時全文搜尋 stack trace 不需要 JSON extract。&lt;code>receive_ts&lt;/code> 是 collector 收到事件的時間，和 SDK 端的 &lt;code>ts&lt;/code> 對照可估算 clock drift。&lt;/p></description><content:encoded><![CDATA[<p>Collector 的儲存方案是可插拔 storage backend — 同一個 binary 透過啟動參數選擇不同的 storage implementation。Go 的 interface composition 讓 storage 分成 BasicStorage（所有 backend 共用）和 AnalyticsStorage（PostgreSQL 層新增），內部實作（SQLite / PostgreSQL / 時間序列 DB）分離，切換是 config change 而非重寫程式碼。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kd">type</span> <span class="nx">BasicStorage</span> <span class="kd">interface</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="nf">Store</span><span class="p">(</span><span class="nx">event</span> <span class="nx">Event</span><span class="p">)</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="nf">Query</span><span class="p">(</span><span class="nx">filter</span> <span class="nx">QueryFilter</span><span class="p">)</span> <span class="p">([]</span><span class="nx">Event</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="nf">Close</span><span class="p">()</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="nf">Downsample</span><span class="p">()</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="nf">Purge</span><span class="p">()</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="kd">type</span> <span class="nx">AnalyticsStorage</span> <span class="kd">interface</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="nx">BasicStorage</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="nf">Aggregate</span><span class="p">(</span><span class="nx">spec</span> <span class="nx">AggregateSpec</span><span class="p">)</span> <span class="p">(</span><span class="nx">AggregateResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="nf">Funnel</span><span class="p">(</span><span class="nx">steps</span> <span class="p">[]</span><span class="kt">string</span><span class="p">,</span> <span class="nx">timeWindow</span> <span class="nx">Duration</span><span class="p">)</span> <span class="p">(</span><span class="nx">FunnelResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="nf">Cohort</span><span class="p">(</span><span class="nx">groupBy</span> <span class="kt">string</span><span class="p">,</span> <span class="nx">metric</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="nx">CohortResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>SQLite implementation 只實作 <code>BasicStorage</code>。PostgreSQL implementation 實作 <code>AnalyticsStorage</code>。Dashboard 用 Go 的 type assertion（<code>if as, ok := storage.(AnalyticsStorage); ok { ... }</code>）判斷能力 — funnel/cohort 視圖在 SQLite 模式下不顯示入口，而非顯示後報錯。</p>
<p>選擇哪個 backend 取決於部署場景和查詢需求：</p>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>Backend</th>
          <th>啟動參數</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>自架簡單版（零依賴）</td>
          <td>SQLite</td>
          <td><code>--storage=sqlite</code></td>
      </tr>
      <tr>
          <td>需要聚合分析的自用版</td>
          <td>PostgreSQL</td>
          <td><code>--storage=postgres --dsn=...</code></td>
      </tr>
      <tr>
          <td>高併發 + 長期保留</td>
          <td>時間序列 DB</td>
          <td><code>--storage=timescale --dsn=...</code></td>
      </tr>
  </tbody>
</table>
<h2 id="sqlite-backendday-one-預設">SQLite Backend（day-one 預設）</h2>
<p>SQLite 是嵌入式資料庫，編譯進 collector binary 中，不需要額外 server。Go 用 <code>modernc.org/sqlite</code>（pure Go、無 CGO 依賴、效能約為 CGO driver mattn/go-sqlite3 的 60-80%，自用規模下足夠），開源使用者 <code>go build &amp;&amp; ./collector</code> 就能跑，部署步驟為零。WAL mode 允許讀寫並行 — dashboard 的 SELECT 查詢不會被 ingestion 的 INSERT 阻塞，反之亦然。寫入之間的競爭由 busy_timeout 處理。</p>
<h3 id="能力範圍">能力範圍</h3>
<ul>
<li><strong>索引查詢</strong>：按 type、name、timestamp 建索引，查詢從全表掃描變成索引查找</li>
<li><strong>SQL 聚合</strong>：<code>SELECT name, COUNT(*) FROM events WHERE type='error' GROUP BY name</code> — 一行 SQL 完成分群計數</li>
<li><strong>跨欄位過濾</strong>：<code>WHERE type='error' AND name LIKE 'terminal.%' AND ts &gt; '2026-06-18'</code></li>
<li><strong>寫入</strong>：WAL mode 下每秒數千筆 append 寫入</li>
</ul>
<h3 id="events-主表-ddl">Events 主表 DDL</h3>
<p>Events 表的欄位從 <a href="/blog/monitoring/02-log-schema/event-schema-fields/" data-link-title="event.schema.json 完整欄位解說" data-link-desc="監控事件的 JSON Schema 定義 — 每個欄位的語意、必填/選填、資料型別和設計理由">event.schema.json</a> 的 JSON 結構推導。Source 的 nested object 攤平成獨立 column — 方便 SQL 查詢和索引，不需要每次從 JSON 裡 extract。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="n">AUTOINCREMENT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">v</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="k">type</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">ts</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">source_sdk</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="n">source_app</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">source_version</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="n">source_platform</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="n">source_os</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">session_id</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="n">session_started</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="k">level</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="k">data</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="n">error_message</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">    </span><span class="n">error_stack</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">    </span><span class="n">error_type</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">    </span><span class="n">receive_ts</span><span class="w"> </span><span class="nb">TEXT</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p><code>source_sdk</code> 獨立成 column 讓「按 SDK 來源篩選」（<code>WHERE source_sdk = 'python'</code>）不需要從 JSON extract。<code>data</code> 用 TEXT 存 JSON。SQLite 沒有原生 JSON 型別，但 3.38+ 支援 <code>json_extract()</code> 函式做查詢（<code>WHERE json_extract(data, '$.duration_ms') &gt; 1000</code>）。<code>session_id</code> 獨立成 column 讓 session 回放的 JOIN 不需要 JSON extract。<code>error_stack</code> 獨立成 column 讓 error 調查時全文搜尋 stack trace 不需要 JSON extract。<code>receive_ts</code> 是 collector 收到事件的時間，和 SDK 端的 <code>ts</code> 對照可估算 clock drift。</p>
<p>PostgreSQL 版本的差異：<code>data</code> 改成 <code>JSONB</code> 型別（原生索引和查詢）、<code>source_*</code> 可保持為 nested JSON（PostgreSQL 的 JSONB 查詢效能足夠）或維持攤平（和 SQLite 版本保持一致）。</p>
<h3 id="建議索引">建議索引</h3>
<p>建表時一起建索引，覆蓋 dashboard 的核心查詢模式：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_type_ts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="p">(</span><span class="k">type</span><span class="p">,</span><span class="w"> </span><span class="n">ts</span><span class="p">);</span><span class="w">    </span><span class="c1">-- 按 type + 時間過濾（error 列表、趨勢圖）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_session</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="p">(</span><span class="n">session_id</span><span class="p">);</span><span class="w">   </span><span class="c1">-- 按 session 回放
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_name</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="p">(</span><span class="n">name</span><span class="p">);</span><span class="w">            </span><span class="c1">-- 按 name 分群計數（功能使用排行）</span></span></span></code></pre></div><p>Day-one 建表時就建，不是效能出問題後才加。</p>
<h3 id="適用規模">適用規模</h3>
<p>單日事件量在十萬筆以下、SQLite 資料庫在 1GB 以下。索引查詢在毫秒級完成。自用工具和小型團隊的日常使用通常在這個範圍。</p>
<h3 id="分層保留與降採樣">分層保留與降採樣</h3>
<p>保留策略從查詢需求反推，每一種查詢需要的資料粒度和回溯深度不同。回溯越深的查詢需要的粒度越粗 — debug 需要最近幾天的逐筆事件，cohort 留存需要一整年的資料但每週一筆聚合數字就夠。</p>
<table>
  <thead>
      <tr>
          <th>查詢用途</th>
          <th>需要的粒度</th>
          <th>回溯深度</th>
          <th>對應表</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Debug 定位</td>
          <td>逐筆原始</td>
          <td>天</td>
          <td>events</td>
      </tr>
      <tr>
          <td>Funnel</td>
          <td>逐筆 event</td>
          <td>週～月</td>
          <td>events</td>
      </tr>
      <tr>
          <td>Error 趨勢</td>
          <td>每小時計數</td>
          <td>月～季</td>
          <td>hourly_summary</td>
      </tr>
      <tr>
          <td>Cohort</td>
          <td>每天計數</td>
          <td>季～年</td>
          <td>daily_summary</td>
      </tr>
      <tr>
          <td>RFM 分群</td>
          <td>每月聚合</td>
          <td>年</td>
          <td>monthly_summary</td>
      </tr>
  </tbody>
</table>
<p>SQLite 中的實作是三張摘要表加定期 job：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 摘要表
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">hourly_summary</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">hour</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="k">count</span><span class="w"> </span><span class="nb">INTEGER</span><span class="p">,</span><span class="w"> </span><span class="n">error_count</span><span class="w"> </span><span class="nb">INTEGER</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="k">UNIQUE</span><span class="p">(</span><span class="n">hour</span><span class="p">,</span><span class="w"> </span><span class="k">type</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">daily_summary</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="nb">date</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="k">count</span><span class="w"> </span><span class="nb">INTEGER</span><span class="p">,</span><span class="w"> </span><span class="n">unique_sessions</span><span class="w"> </span><span class="nb">INTEGER</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="k">UNIQUE</span><span class="p">(</span><span class="nb">date</span><span class="p">,</span><span class="w"> </span><span class="k">type</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 降採樣（Downsample，每小時跑一次，幂等 — 重跑只更新不重複）
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">hourly_summary</span><span class="w"> </span><span class="p">(</span><span class="n">hour</span><span class="p">,</span><span class="w"> </span><span class="k">type</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">,</span><span class="w"> </span><span class="n">error_count</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">strftime</span><span class="p">(</span><span class="s1">&#39;%Y-%m-%dT%H:00:00&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">ts</span><span class="p">),</span><span class="w"> </span><span class="k">type</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">       </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">),</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="k">CASE</span><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="k">type</span><span class="o">=</span><span class="s1">&#39;error&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">ELSE</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="k">END</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">ts</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="s1">&#39;now&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;-1 hour&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"></span><span class="c1">-- 清理（Purge，每天跑一次，分批刪除避免長時間鎖定）
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="c1"></span><span class="k">DELETE</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">rowid</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">rowid</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">ts</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="s1">&#39;now&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;-7 days&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10000</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="c1">-- 重複執行直到影響行數為 0
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="c1"></span><span class="k">DELETE</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">hourly_summary</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">hour</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="s1">&#39;now&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;-90 days&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="k">DELETE</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">daily_summary</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="nb">date</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="s1">&#39;now&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;-365 days&#39;</span><span class="p">);</span></span></span></code></pre></div><p>保留期限由 collector config 設定，數字的來源是「哪些查詢需要回溯多遠」：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="nt">retention</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="nt">raw_events</span><span class="p">:</span><span class="w"> </span><span class="l">7d</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="nt">hourly_summary</span><span class="p">:</span><span class="w"> </span><span class="l">90d</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="nt">daily_summary</span><span class="p">:</span><span class="w"> </span><span class="l">365d</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">  </span><span class="nt">monthly_summary</span><span class="p">:</span><span class="w"> </span><span class="l">forever</span></span></span></code></pre></div><p>Storage interface 的 <code>Downsample()</code> 和 <code>Purge()</code> 由 collector 的定時排程觸發（Go 的 <code>time.Ticker</code>）。每個 storage backend 各自實作 — SQLite 用上述 SQL、PostgreSQL 用相同邏輯但可以加 partial index 加速、時間序列 DB 的 continuous aggregate 和 retention policy 原生支援。</p>
<h3 id="為什麼是聚合而非抽樣">為什麼是聚合而非抽樣</h3>
<p>原始事件的保留期到期後，需要決定如何保留歷史統計。降採樣有兩種思路。<strong>抽樣保留</strong>是同事件名稱（<code>name</code> 欄位）同小時保留一筆原始事件、刪除其餘，保留了逐筆查詢能力但喪失準確計數。<strong>聚合摘要</strong>是把一小時內的事件壓成一筆計數記錄，喪失逐筆細節但保留準確統計。</p>
<p>Collector 選擇聚合摘要——捨棄逐筆細節，換取準確計數。降採樣後的資料用途是趨勢圖和長期統計，這些查詢需要「過去 30 天每小時的 error 總數」而非「某一筆原始 error 的 stack trace」。</p>
<p>這意味著原始事件 purge（定期清理過期事件）後，超過保留期的逐筆查詢會回傳空結果。Dashboard 在回溯超過原始事件保留期的時間範圍時，應切換到上方的摘要表（<code>hourly_summary</code>/<code>daily_summary</code>）查詢——顯示趨勢圖而非事件列表。設計方向是查詢 API 的 <code>from</code> 參數超過 <code>retention.raw_events</code> 時自動降級到摘要表，或回傳提示告知 client 該時間範圍只有聚合資料（初版 collector 尚未實作此降級邏輯）。</p>
<h3 id="觸發切換到-postgresql-的訊號">觸發切換到 PostgreSQL 的訊號</h3>
<p><strong>寫入爭搶</strong>：SQLite 是單寫者模型。高併發寫入（多個 SDK 同時 flush、每秒數百筆以上持續發生）會出現 <code>database is locked</code> 錯誤。WAL mode 能緩解但不能根治。</p>
<p><strong>聚合查詢效能不足</strong>：Dashboard 需要的聚合查詢（「過去 30 天每小時的 error 數量趨勢」「funnel 的每步轉換率」）在資料量成長後變慢。SQLite 沒有 parallel query 和 partial index 等進階 OLAP 能力。</p>
<p><strong>跨實例需求</strong>：需要多個 collector 實例共用同一個資料庫時，SQLite 的單檔案模型無法跨主機存取。</p>
<h2 id="postgresql-backend分析觸發">PostgreSQL Backend（分析觸發）</h2>
<p>PostgreSQL 是獨立的資料庫 server，提供多連線並行寫入、進階索引（GIN for JSONB、partial index）和完整的 SQL 分析能力。切換到 PostgreSQL 意味著 collector 從「零依賴單一 binary」變成「binary + 外部 DB」，運維複雜度上升。</p>
<h3 id="觸發條件">觸發條件</h3>
<p>SQLite 的寫入爭搶或聚合效能成為瓶頸時切換。具體訊號：<code>database is locked</code> 錯誤頻率超過每分鐘一次、或 dashboard 的聚合查詢超過 3 秒。</p>
<h3 id="切換方式">切換方式</h3>
<p>切換是 config change：把 <code>--storage=sqlite</code> 改成 <code>--storage=postgres --dsn=postgres://...</code>。資料遷移用匯出 + 匯入完成：</p>
<ol>
<li>從 SQLite 匯出事件為 JSONL（<code>monitor export --format=jsonl</code>）</li>
<li>在 PostgreSQL 建立 events 表（schema 和 SQLite 相同，data 欄位改用 JSONB）</li>
<li>匯入 JSONL 到 PostgreSQL（<code>monitor import --storage=postgres --file=events.jsonl</code>）</li>
<li>切換啟動參數、確認查詢正常後停用 SQLite 檔案</li>
</ol>
<p>Storage interface 保證 collector 的 ingestion、query、rule engine 邏輯不需要改動 — 只有 storage implementation 層切換。</p>
<h3 id="能力增量">能力增量</h3>
<ul>
<li><strong>並行寫入</strong>：多個 SDK 同時 flush 不會 lock</li>
<li><strong>JSONB 索引</strong>：對 data 欄位的特定 key 建索引（<code>CREATE INDEX ON events ((data-&gt;&gt;'name'))</code>）</li>
<li><strong>Window function</strong>：funnel 和 cohort 分析的 SQL 基礎</li>
<li><strong>Read replica</strong>：寫入和查詢分離，dashboard 的查詢不影響 ingestion 效能</li>
</ul>
<h2 id="時間序列-db-backend長期演進">時間序列 DB Backend（長期演進）</h2>
<p>時間序列資料庫（TimescaleDB、InfluxDB、VictoriaMetrics）專門為高頻 append 寫入和時間分桶聚合設計。TimescaleDB 基於 PostgreSQL 擴展，Storage interface 的 PostgreSQL implementation 可以直接複用、加上 hypertable 和 continuous aggregate。</p>
<h3 id="觸發條件-1">觸發條件</h3>
<p>每秒數萬筆以上的持續寫入、或需要自動 downsampling（每分鐘的原始資料保留 7 天、每小時的聚合保留 90 天、每天的聚合永久保留）。多數自用工具和小型團隊不會到達這個規模。</p>
<h3 id="能力增量-1">能力增量</h3>
<ul>
<li><strong>時間分桶原生操作</strong>：<code>time_bucket('1 hour', ts)</code> 替代手動 DATE_TRUNC</li>
<li><strong>Continuous aggregate</strong>：預計算的聚合結果自動更新</li>
<li><strong>壓縮</strong>：歷史資料自動壓縮，TB 級資料可查詢</li>
<li><strong>Retention policy</strong>：按時間自動清理舊資料</li>
</ul>
<h2 id="jsonl-匯出debug-用途">JSONL 匯出（debug 用途）</h2>
<p>JSONL 不作為主要 storage backend，而是作為匯出格式保留人類可讀性和 grep 友好性。<code>monitor export --format=jsonl</code> 把 storage 中的事件匯出為每行一個 JSON 物件的檔案，讓開發者可以用 grep / jq 做臨時查詢或把資料搬到其他工具。</p>
<p>JSONL 匯出也是備份和遷移的中介格式 — SQLite 損壞時從 JSONL 重建、切換到 PostgreSQL 時從 JSONL 匯入。</p>
<p>匯出使用 streaming — 從 storage 逐筆讀取、逐行寫出檔案，記憶體使用和事件總量無關。300 萬筆事件（約 900MB JSONL）的匯出不需要載入全部資料到記憶體。匯出的 JSONL 檔案包含事件明文（已 redaction 的欄位除外），匯出後不受 collector 的存取控制保護，應注意存放位置和存取權限。</p>
<h2 id="演進原則">演進原則</h2>
<p><strong>按觀察到的瓶頸切換</strong>。<code>database is locked</code> 錯誤頻率、聚合查詢延遲、磁碟使用量 — 這些是可觀察的訊號。「未來可能有百萬筆事件」是預測。按訊號行動，不按預測行動。</p>
<p><strong>切換是 config change</strong>。Storage interface 確保切換 backend 時 collector 的其他邏輯（ingestion、query API、rule engine、dashboard）不需要改動。切換的成本是資料遷移，不是程式碼重寫。</p>
<p><strong>SQLite 是安全的起點</strong>。多數開源使用者會停留在 SQLite backend — 單日萬筆以下、索引查詢毫秒級、零依賴部署。只有明確的效能瓶頸才值得引入外部 DB 的運維成本。</p>
<h2 id="下一步路由">下一步路由</h2>
<ul>
<li>Collector 的整體架構 → <a href="/blog/monitoring/04-collector/architecture/" data-link-title="Collector 架構" data-link-desc="HTTP endpoint → JSON Schema 驗證 → 儲存 → 查詢 → rule engine 的五段式處理鏈路">Collector 架構</a></li>
<li>查詢 API 的設計（跨 backend 統一） → <a href="/blog/monitoring/04-collector/query-api/" data-link-title="查詢 API 設計" data-link-desc="CLI grep 友好的 JSONL 結構 &#43; HTTP 查詢 endpoint — 兩種查詢介面各自的適用場景和設計要點">查詢 API 設計</a></li>
<li>資料庫選型的通用指南 → <a href="/blog/backend/01-database/" data-link-title="模組一：資料庫與持久化" data-link-desc="整理 SQL、transaction、migration 與 repository adapter 的後端實務">backend 01 資料庫</a></li>
<li>效能瓶頸的判讀方法 → <a href="/blog/backend/09-performance-capacity/" data-link-title="模組九：效能工程與容量規劃" data-link-desc="把『目前配置能撐多少、要加多少機器』變成可量化、可驗證、可改進的工程流程">backend 09 效能容量</a></li>
<li>水平擴展的基礎概念 → <a href="/blog/devops/02-horizontal-scaling/" data-link-title="模組二：水平擴展" data-link-desc="一個實例不夠時怎麼加第二個 — stateless 設計、shared storage、session 處理的工程約束">DevOps 水平擴展</a></li>
<li>Error fingerprint 的 DDL 擴充 → <a href="/blog/monitoring/04-collector/error-fingerprint/" data-link-title="Error Fingerprint 與去重分群" data-link-desc="把大量 error 事件歸組成可管理的 issue 列表 — fingerprint 演算法、message normalization、error_groups 表設計、自架方案的務實邊界">Error Fingerprint 與去重分群</a></li>
</ul>
]]></content:encoded></item><item><title>功能分層與 Backend 選擇</title><link>https://tarrragon.github.io/blog/monitoring/04-collector/feature-tier-boundary/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/monitoring/04-collector/feature-tier-boundary/</guid><description>&lt;p>Collector 的可插拔 Storage Backend 分成兩個功能層級。分界線是查詢模式 — SQLite 能高效處理的查詢定義了簡單版的功能邊界，超出的查詢需求觸發 PostgreSQL 的引入。所有事件都經過同一個 Ingestion domain，差異在 Query 和 Dashboard domain 能提供什麼能力。&lt;/p>
&lt;h2 id="sqlite-層開發者工具">SQLite 層：開發者工具&lt;/h2>
&lt;p>SQLite 層提供的功能聚焦在「開發者自己 debug 和監控」。所有查詢都是單一維度的 — 按時間、按類型、按名稱過濾，不需要跨事件 JOIN 或跨使用者聚合。&lt;/p>
&lt;h3 id="承載的功能">承載的功能&lt;/h3>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>功能&lt;/th>
 &lt;th>查詢模式&lt;/th>
 &lt;th>SQL 範例&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>最近 error 列表&lt;/td>
 &lt;td>按 type + 時間過濾&lt;/td>
 &lt;td>&lt;code>WHERE type='error' ORDER BY ts DESC LIMIT 20&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Error 計數（按 name 分群）&lt;/td>
 &lt;td>單表 GROUP BY&lt;/td>
 &lt;td>&lt;code>SELECT name, COUNT(*) FROM events WHERE type='error' GROUP BY name&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>單次 session 回放&lt;/td>
 &lt;td>按 session_id 過濾&lt;/td>
 &lt;td>&lt;code>WHERE session_id='xxx' ORDER BY ts&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>事件時間軸&lt;/td>
 &lt;td>按時間排序&lt;/td>
 &lt;td>&lt;code>WHERE ts BETWEEN ? AND ? ORDER BY ts&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>基本 rule engine&lt;/td>
 &lt;td>逐筆事件評估&lt;/td>
 &lt;td>收到事件時逐條比對 rule（不需要查歷史）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>CLI 查詢&lt;/td>
 &lt;td>任意過濾&lt;/td>
 &lt;td>&lt;code>WHERE type=? AND name LIKE ? AND ts &amp;gt; ?&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>這些功能覆蓋開發者日常 debug 和監控的核心操作 — 查錯誤、看時間軸、回放 session、設規則告警。&lt;/p>
&lt;h3 id="對應的-dashboard-視圖">對應的 Dashboard 視圖&lt;/h3>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>視圖&lt;/th>
 &lt;th>顯示&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>總覽頁&lt;/td>
 &lt;td>最近 1 小時的事件計數（按 type 分）+ 最近 error 列表&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>事件詳情&lt;/td>
 &lt;td>單筆事件的完整 JSON&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Session 回放&lt;/td>
 &lt;td>單次 session 內的事件序列&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h3 id="對應的事件消費">對應的事件消費&lt;/h3>
&lt;p>SQLite 層消費所有四類事件，但消費方式是「單筆或單 session 級查詢」：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>事件類型&lt;/th>
 &lt;th>消費方式&lt;/th>
 &lt;th>保留需求&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>event&lt;/td>
 &lt;td>按名稱計數、按 session 排列&lt;/td>
 &lt;td>原始 7 天（debug）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>error&lt;/td>
 &lt;td>按名稱分群、按時間排列、看 stack trace&lt;/td>
 &lt;td>原始 30 天（error 追蹤價值較長）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>metric&lt;/td>
 &lt;td>按名稱查最近 N 筆的值&lt;/td>
 &lt;td>原始 7 天 + 每小時聚合 90 天&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>lifecycle&lt;/td>
 &lt;td>按 session 排列、看狀態轉換&lt;/td>
 &lt;td>原始 7 天&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="postgresql-層行為分析">PostgreSQL 層：行為分析&lt;/h2>
&lt;p>PostgreSQL 層在 SQLite 層的基礎上加入「跨 session、跨使用者的聚合分析」。這些查詢需要 JOIN 多張表、計算時間窗口、處理大量資料的 GROUP BY — SQLite 的單寫者模型和有限的查詢最佳化器在這些場景下效能不足。&lt;/p>
&lt;h3 id="觸發引入-postgresql-的功能需求">觸發引入 PostgreSQL 的功能需求&lt;/h3>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>功能需求&lt;/th>
 &lt;th>為什麼 SQLite 不夠&lt;/th>
 &lt;th>PostgreSQL 提供什麼&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>Funnel 分析&lt;/strong>&lt;/td>
 &lt;td>跨大量 session 的 multi-step JOIN 和聚合效能不足&lt;/td>
 &lt;td>Window functions + 高效 JOIN&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Cohort 留存&lt;/strong>&lt;/td>
 &lt;td>需要按「註冊週」分群、計算每週的回訪率&lt;/td>
 &lt;td>Date functions + 大規模 GROUP BY&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>RFM 分群&lt;/strong>&lt;/td>
 &lt;td>需要跨所有使用者計算 recency/frequency/monetary&lt;/td>
 &lt;td>全表聚合 + 分位數計算&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>時間趨勢 dashboard&lt;/strong>&lt;/td>
 &lt;td>需要「過去 30 天每小時的 error P95」&lt;/td>
 &lt;td>時間分桶 + percentile 函數&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>高併發寫入&lt;/strong>&lt;/td>
 &lt;td>多個 SDK 同時 flush 且持續出現 database is locked&lt;/td>
 &lt;td>連線池 + 並行寫入&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>長期保留 + 聚合&lt;/strong>&lt;/td>
 &lt;td>降採樣的 materialized view&lt;/td>
 &lt;td>REFRESH MATERIALIZED VIEW&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h3 id="判斷公式">判斷公式&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">需要 funnel / cohort / RFM 任一 → PostgreSQL
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">需要跨使用者聚合（不只看自己的資料） → PostgreSQL
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">需要高併發寫入（多個 SDK 同時 flush 且持續出現 database is locked 錯誤） → PostgreSQL
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">以上都不需要 → SQLite 足夠&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="對應的-dashboard-視圖sqlite-層不提供">對應的 Dashboard 視圖（SQLite 層不提供）&lt;/h3>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>視圖&lt;/th>
 &lt;th>查詢模式&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Funnel 漏斗&lt;/td>
 &lt;td>多步驟轉換率（session 級 JOIN）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cohort 留存表&lt;/td>
 &lt;td>時間窗口 × 群組矩陣&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>RFM 分群散佈&lt;/td>
 &lt;td>三維度分位數計算&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Error 趨勢圖（長期）&lt;/td>
 &lt;td>30 天 × 每小時的時間序列&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>效能 P95 趨勢&lt;/td>
 &lt;td>percentile_cont 視窗函數&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h3 id="對應的事件消費-1">對應的事件消費&lt;/h3>
&lt;p>PostgreSQL 層消費的事件和 SQLite 相同（Ingestion 不變），但消費方式從「單筆/單 session」擴展到「跨 session/跨使用者」：&lt;/p></description><content:encoded><![CDATA[<p>Collector 的可插拔 Storage Backend 分成兩個功能層級。分界線是查詢模式 — SQLite 能高效處理的查詢定義了簡單版的功能邊界，超出的查詢需求觸發 PostgreSQL 的引入。所有事件都經過同一個 Ingestion domain，差異在 Query 和 Dashboard domain 能提供什麼能力。</p>
<h2 id="sqlite-層開發者工具">SQLite 層：開發者工具</h2>
<p>SQLite 層提供的功能聚焦在「開發者自己 debug 和監控」。所有查詢都是單一維度的 — 按時間、按類型、按名稱過濾，不需要跨事件 JOIN 或跨使用者聚合。</p>
<h3 id="承載的功能">承載的功能</h3>
<table>
  <thead>
      <tr>
          <th>功能</th>
          <th>查詢模式</th>
          <th>SQL 範例</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>最近 error 列表</td>
          <td>按 type + 時間過濾</td>
          <td><code>WHERE type='error' ORDER BY ts DESC LIMIT 20</code></td>
      </tr>
      <tr>
          <td>Error 計數（按 name 分群）</td>
          <td>單表 GROUP BY</td>
          <td><code>SELECT name, COUNT(*) FROM events WHERE type='error' GROUP BY name</code></td>
      </tr>
      <tr>
          <td>單次 session 回放</td>
          <td>按 session_id 過濾</td>
          <td><code>WHERE session_id='xxx' ORDER BY ts</code></td>
      </tr>
      <tr>
          <td>事件時間軸</td>
          <td>按時間排序</td>
          <td><code>WHERE ts BETWEEN ? AND ? ORDER BY ts</code></td>
      </tr>
      <tr>
          <td>基本 rule engine</td>
          <td>逐筆事件評估</td>
          <td>收到事件時逐條比對 rule（不需要查歷史）</td>
      </tr>
      <tr>
          <td>CLI 查詢</td>
          <td>任意過濾</td>
          <td><code>WHERE type=? AND name LIKE ? AND ts &gt; ?</code></td>
      </tr>
  </tbody>
</table>
<p>這些功能覆蓋開發者日常 debug 和監控的核心操作 — 查錯誤、看時間軸、回放 session、設規則告警。</p>
<h3 id="對應的-dashboard-視圖">對應的 Dashboard 視圖</h3>
<table>
  <thead>
      <tr>
          <th>視圖</th>
          <th>顯示</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>總覽頁</td>
          <td>最近 1 小時的事件計數（按 type 分）+ 最近 error 列表</td>
      </tr>
      <tr>
          <td>事件詳情</td>
          <td>單筆事件的完整 JSON</td>
      </tr>
      <tr>
          <td>Session 回放</td>
          <td>單次 session 內的事件序列</td>
      </tr>
  </tbody>
</table>
<h3 id="對應的事件消費">對應的事件消費</h3>
<p>SQLite 層消費所有四類事件，但消費方式是「單筆或單 session 級查詢」：</p>
<table>
  <thead>
      <tr>
          <th>事件類型</th>
          <th>消費方式</th>
          <th>保留需求</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>event</td>
          <td>按名稱計數、按 session 排列</td>
          <td>原始 7 天（debug）</td>
      </tr>
      <tr>
          <td>error</td>
          <td>按名稱分群、按時間排列、看 stack trace</td>
          <td>原始 30 天（error 追蹤價值較長）</td>
      </tr>
      <tr>
          <td>metric</td>
          <td>按名稱查最近 N 筆的值</td>
          <td>原始 7 天 + 每小時聚合 90 天</td>
      </tr>
      <tr>
          <td>lifecycle</td>
          <td>按 session 排列、看狀態轉換</td>
          <td>原始 7 天</td>
      </tr>
  </tbody>
</table>
<h2 id="postgresql-層行為分析">PostgreSQL 層：行為分析</h2>
<p>PostgreSQL 層在 SQLite 層的基礎上加入「跨 session、跨使用者的聚合分析」。這些查詢需要 JOIN 多張表、計算時間窗口、處理大量資料的 GROUP BY — SQLite 的單寫者模型和有限的查詢最佳化器在這些場景下效能不足。</p>
<h3 id="觸發引入-postgresql-的功能需求">觸發引入 PostgreSQL 的功能需求</h3>
<table>
  <thead>
      <tr>
          <th>功能需求</th>
          <th>為什麼 SQLite 不夠</th>
          <th>PostgreSQL 提供什麼</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Funnel 分析</strong></td>
          <td>跨大量 session 的 multi-step JOIN 和聚合效能不足</td>
          <td>Window functions + 高效 JOIN</td>
      </tr>
      <tr>
          <td><strong>Cohort 留存</strong></td>
          <td>需要按「註冊週」分群、計算每週的回訪率</td>
          <td>Date functions + 大規模 GROUP BY</td>
      </tr>
      <tr>
          <td><strong>RFM 分群</strong></td>
          <td>需要跨所有使用者計算 recency/frequency/monetary</td>
          <td>全表聚合 + 分位數計算</td>
      </tr>
      <tr>
          <td><strong>時間趨勢 dashboard</strong></td>
          <td>需要「過去 30 天每小時的 error P95」</td>
          <td>時間分桶 + percentile 函數</td>
      </tr>
      <tr>
          <td><strong>高併發寫入</strong></td>
          <td>多個 SDK 同時 flush 且持續出現 database is locked</td>
          <td>連線池 + 並行寫入</td>
      </tr>
      <tr>
          <td><strong>長期保留 + 聚合</strong></td>
          <td>降採樣的 materialized view</td>
          <td>REFRESH MATERIALIZED VIEW</td>
      </tr>
  </tbody>
</table>
<h3 id="判斷公式">判斷公式</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">需要 funnel / cohort / RFM 任一 → PostgreSQL
</span></span><span class="line"><span class="ln">2</span><span class="cl">需要跨使用者聚合（不只看自己的資料） → PostgreSQL
</span></span><span class="line"><span class="ln">3</span><span class="cl">需要高併發寫入（多個 SDK 同時 flush 且持續出現 database is locked 錯誤） → PostgreSQL
</span></span><span class="line"><span class="ln">4</span><span class="cl">以上都不需要 → SQLite 足夠</span></span></code></pre></div><h3 id="對應的-dashboard-視圖sqlite-層不提供">對應的 Dashboard 視圖（SQLite 層不提供）</h3>
<table>
  <thead>
      <tr>
          <th>視圖</th>
          <th>查詢模式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Funnel 漏斗</td>
          <td>多步驟轉換率（session 級 JOIN）</td>
      </tr>
      <tr>
          <td>Cohort 留存表</td>
          <td>時間窗口 × 群組矩陣</td>
      </tr>
      <tr>
          <td>RFM 分群散佈</td>
          <td>三維度分位數計算</td>
      </tr>
      <tr>
          <td>Error 趨勢圖（長期）</td>
          <td>30 天 × 每小時的時間序列</td>
      </tr>
      <tr>
          <td>效能 P95 趨勢</td>
          <td>percentile_cont 視窗函數</td>
      </tr>
  </tbody>
</table>
<h3 id="對應的事件消費-1">對應的事件消費</h3>
<p>PostgreSQL 層消費的事件和 SQLite 相同（Ingestion 不變），但消費方式從「單筆/單 session」擴展到「跨 session/跨使用者」：</p>
<table>
  <thead>
      <tr>
          <th>事件類型</th>
          <th>SQLite 層消費</th>
          <th>PostgreSQL 層新增消費</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>event</td>
          <td>按名稱計數</td>
          <td>funnel 步驟轉換、cohort 行為分群</td>
      </tr>
      <tr>
          <td>error</td>
          <td>按名稱分群</td>
          <td>跨版本 error 率比較、P95 回應時間趨勢</td>
      </tr>
      <tr>
          <td>metric</td>
          <td>最近 N 筆值</td>
          <td>長期趨勢（materialized view 預聚合）</td>
      </tr>
      <tr>
          <td>lifecycle</td>
          <td>單 session 排列</td>
          <td>session 長度分佈、留存率計算</td>
      </tr>
  </tbody>
</table>
<h2 id="domain-的分層影響">Domain 的分層影響</h2>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>SQLite 層</th>
          <th>PostgreSQL 層新增</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Ingestion</strong></td>
          <td>HTTP POST → 驗證 → 寫入</td>
          <td>不變（寫入目標換 backend）</td>
      </tr>
      <tr>
          <td><strong>Storage</strong></td>
          <td>SQLite embedded</td>
          <td>PostgreSQL + 連線池</td>
      </tr>
      <tr>
          <td><strong>Query</strong></td>
          <td>單表過濾 + 單表 GROUP BY</td>
          <td>JOIN + window function + percentile</td>
      </tr>
      <tr>
          <td><strong>Rule</strong></td>
          <td>逐筆事件即時評估</td>
          <td>不變（rule 不依賴聚合查詢）</td>
      </tr>
      <tr>
          <td><strong>Dashboard</strong></td>
          <td>總覽 + 事件詳情 + session 回放</td>
          <td>新增 funnel / cohort / RFM / 趨勢圖</td>
      </tr>
  </tbody>
</table>
<p>Ingestion 和 Rule 兩個 domain 和 storage backend 無關 — 事件進來的方式和規則評估的邏輯不因 backend 改變。Query 和 Dashboard 是分層影響最大的兩個 domain — PostgreSQL 層的查詢能力決定了 Dashboard 能提供什麼視圖。</p>
<h2 id="實作邊界">實作邊界</h2>
<p>Storage interface 用 Go 的 interface composition 分成兩層：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kd">type</span> <span class="nx">BasicStorage</span> <span class="kd">interface</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="nf">Store</span><span class="p">(</span><span class="nx">event</span> <span class="nx">Event</span><span class="p">)</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="nf">Query</span><span class="p">(</span><span class="nx">filter</span> <span class="nx">QueryFilter</span><span class="p">)</span> <span class="p">([]</span><span class="nx">Event</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="nf">Close</span><span class="p">()</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="nf">Downsample</span><span class="p">()</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="nf">Purge</span><span class="p">()</span> <span class="kt">error</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="kd">type</span> <span class="nx">AnalyticsStorage</span> <span class="kd">interface</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="nx">BasicStorage</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">    <span class="nf">Aggregate</span><span class="p">(</span><span class="nx">spec</span> <span class="nx">AggregateSpec</span><span class="p">)</span> <span class="p">(</span><span class="nx">AggregateResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="nf">Funnel</span><span class="p">(</span><span class="nx">steps</span> <span class="p">[]</span><span class="kt">string</span><span class="p">,</span> <span class="nx">timeWindow</span> <span class="nx">Duration</span><span class="p">)</span> <span class="p">(</span><span class="nx">FunnelResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="nf">Cohort</span><span class="p">(</span><span class="nx">groupBy</span> <span class="kt">string</span><span class="p">,</span> <span class="nx">metric</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="nx">CohortResult</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>SQLite implementation 只實作 <code>BasicStorage</code>。PostgreSQL implementation 實作 <code>AnalyticsStorage</code>。Dashboard 用 Go 的 type assertion（<code>if as, ok := storage.(AnalyticsStorage); ok { ... }</code>）判斷能力 — funnel/cohort 視圖在 SQLite 模式下不顯示入口，而非顯示後報錯。</p>
<h2 id="下一步路由">下一步路由</h2>
<ul>
<li>可插拔 Storage Backend 的架構 → <a href="/blog/monitoring/04-collector/scaling-evolution/" data-link-title="規模演進" data-link-desc="可插拔 Storage Backend 架構 — SQLite 預設、PostgreSQL 觸發切換、時間序列 DB 長期演進">規模演進</a></li>
<li>事件枚舉方法（哪些事件要收） → <a href="/blog/monitoring/01-mental-model/event-enumeration-method/" data-link-title="事件枚舉與補齊檢查" data-link-desc="從操作盤點系統性地推導出完整的事件清單 — 四類補齊檢查確保沒有遺漏、粒度判準確保每個事件只記一個事實">事件枚舉與補齊檢查</a></li>
<li>分層保留策略 → <a href="/blog/monitoring/04-collector/scaling-evolution/" data-link-title="規模演進" data-link-desc="可插拔 Storage Backend 架構 — SQLite 預設、PostgreSQL 觸發切換、時間序列 DB 長期演進">規模演進的分層保留段</a></li>
<li>Funnel 分析的完整方法論 → <a href="/blog/monitoring/08-business-analytics/funnel-analysis/" data-link-title="Funnel Analysis" data-link-desc="使用者在哪一步流失 — 從事件序列計算每步轉換率、找出流失最嚴重的步驟、區分設計問題和技術問題">Funnel analysis</a></li>
<li>查詢消費模式（各場景需要什麼事件）→ <a href="/blog/monitoring/04-collector/query-consumption-patterns/" data-link-title="查詢消費模式" data-link-desc="Debug / Alerting / 產品決策 / 安全審計 / 效能監控 — 五種查詢場景各需要什麼事件、什麼欄位、什麼查詢模式">查詢消費模式</a></li>
</ul>
]]></content:encoded></item><item><title>從 collector 資料做基礎 funnel 分析</title><link>https://tarrragon.github.io/blog/monitoring/08-business-analytics/self-hosted-funnel/</link><pubDate>Fri, 19 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/monitoring/08-business-analytics/self-hosted-funnel/</guid><description>&lt;p>自架 collector 收集的事件資料可以做基礎的 funnel 分析，不需要商業方案。分析的深度取決於 storage backend 的查詢能力 — SQLite 層能做每步事件計數，PostgreSQL 層能做 session 級轉換率分析。功能分層的完整定義見 &lt;a href="https://tarrragon.github.io/blog/monitoring/04-collector/feature-tier-boundary/" data-link-title="功能分層與 Backend 選擇" data-link-desc="SQLite 層和 PostgreSQL 層各自承載哪些功能 — 分界線是查詢模式而非資料量、觸發升級的是功能需求而非規模成長">功能分層與 Backend 選擇&lt;/a>。&lt;/p>
&lt;h2 id="定義-funnel-步驟">定義 funnel 步驟&lt;/h2>
&lt;p>Funnel 分析的第一步是列出每一步和對應的事件名稱。以一個透過 WebSocket 連接遠端終端機的 app 連線流程為例：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>步驟&lt;/th>
 &lt;th>事件名稱&lt;/th>
 &lt;th>意義&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>1&lt;/td>
 &lt;td>terminal.connect.start&lt;/td>
 &lt;td>使用者點擊連線&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>2&lt;/td>
 &lt;td>auth.biometric.success&lt;/td>
 &lt;td>生物辨識通過&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>3&lt;/td>
 &lt;td>terminal.connect.done&lt;/td>
 &lt;td>WebSocket 連線成功&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>4&lt;/td>
 &lt;td>terminal.input.submit&lt;/td>
 &lt;td>使用者開始打字&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="sqlite-層每步事件計數">SQLite 層：每步事件計數&lt;/h2>
&lt;p>SQLite backend 能做的 funnel 是「每步有多少事件觸發」— 單表 GROUP BY，不需要跨事件 JOIN。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">COUNT&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">as&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">count&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;terminal.connect.start&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;auth.biometric.success&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;terminal.connect.done&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;terminal.input.submit&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">AND&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ts&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">datetime&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;now&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;-7 days&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">GROUP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>步驟 N 的轉換率 = 步驟 N 的事件數 / 步驟 N-1 的事件數。流失率 = 1 - 轉換率。&lt;/p>
&lt;h3 id="能做的">能做的&lt;/h3>
&lt;ul>
&lt;li>每步事件計數（單表 GROUP BY）&lt;/li>
&lt;li>按 source.version 或 source.platform 分群（加 WHERE 條件）&lt;/li>
&lt;li>按天/按週看趨勢（strftime 分桶 + GROUP BY）&lt;/li>
&lt;/ul>
&lt;h3 id="做不到的">做不到的&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Session 級轉換率&lt;/strong>：「同一個 session 完成步驟 1 到步驟 4 的比例」需要 JOIN 同 session 的多個事件、跨所有 session 聚合。SQLite 能做這個 JOIN，但在大量 session 時效能不足。&lt;/li>
&lt;li>&lt;strong>步驟間耗時&lt;/strong>：「使用者在步驟 1 和步驟 2 之間等了多久」需要 self-join on session_id + timestamp 差值計算。&lt;/li>
&lt;li>&lt;strong>漏斗順序驗證&lt;/strong>：確認使用者是按 1→2→3→4 順序完成、不是跳步。&lt;/li>
&lt;/ul>
&lt;h2 id="postgresql-層session-級-funnel">PostgreSQL 層：Session 級 funnel&lt;/h2>
&lt;p>PostgreSQL backend 提供 window function 和高效 JOIN，能做完整的 session 級 funnel 分析。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">WITH&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">session_steps&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">session_id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">ROW_NUMBER&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">OVER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">session_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ts&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">as&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">step_order&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;terminal.connect.start&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;auth.biometric.success&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;terminal.connect.done&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;terminal.input.submit&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">AND&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ts&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">NOW&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">INTERVAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;7 days&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="n">session_max_step&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">session_id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">MAX&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">step_order&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">as&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">reached&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">session_steps&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">GROUP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">session_id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">reached&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">COUNT&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">as&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">sessions&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">session_max_step&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">GROUP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">reached&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">reached&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="新增能力">新增能力&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Session 級轉換率&lt;/strong>：每個 session 到達了哪一步、在哪一步流失&lt;/li>
&lt;li>&lt;strong>步驟間耗時&lt;/strong>：LAG window function 計算相鄰步驟的 timestamp 差值&lt;/li>
&lt;li>&lt;strong>漏斗順序驗證&lt;/strong>：用 ROW_NUMBER + CASE 確認步驟順序&lt;/li>
&lt;li>&lt;strong>Cohort 分群的 funnel&lt;/strong>：按使用者註冊日期 / 版本 / 平台分群看不同 cohort 的 funnel 差異&lt;/li>
&lt;/ul>
&lt;h2 id="jsonl-匯出後的臨時分析">JSONL 匯出後的臨時分析&lt;/h2>
&lt;p>Collector 的 &lt;code>monitor export --format=jsonl&lt;/code> 可以匯出事件為 JSONL 格式。匯出後用 grep + jq 做一次性的臨時分析：&lt;/p></description><content:encoded><![CDATA[<p>自架 collector 收集的事件資料可以做基礎的 funnel 分析，不需要商業方案。分析的深度取決於 storage backend 的查詢能力 — SQLite 層能做每步事件計數，PostgreSQL 層能做 session 級轉換率分析。功能分層的完整定義見 <a href="/blog/monitoring/04-collector/feature-tier-boundary/" data-link-title="功能分層與 Backend 選擇" data-link-desc="SQLite 層和 PostgreSQL 層各自承載哪些功能 — 分界線是查詢模式而非資料量、觸發升級的是功能需求而非規模成長">功能分層與 Backend 選擇</a>。</p>
<h2 id="定義-funnel-步驟">定義 funnel 步驟</h2>
<p>Funnel 分析的第一步是列出每一步和對應的事件名稱。以一個透過 WebSocket 連接遠端終端機的 app 連線流程為例：</p>
<table>
  <thead>
      <tr>
          <th>步驟</th>
          <th>事件名稱</th>
          <th>意義</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>terminal.connect.start</td>
          <td>使用者點擊連線</td>
      </tr>
      <tr>
          <td>2</td>
          <td>auth.biometric.success</td>
          <td>生物辨識通過</td>
      </tr>
      <tr>
          <td>3</td>
          <td>terminal.connect.done</td>
          <td>WebSocket 連線成功</td>
      </tr>
      <tr>
          <td>4</td>
          <td>terminal.input.submit</td>
          <td>使用者開始打字</td>
      </tr>
  </tbody>
</table>
<h2 id="sqlite-層每步事件計數">SQLite 層：每步事件計數</h2>
<p>SQLite backend 能做的 funnel 是「每步有多少事件觸發」— 單表 GROUP BY，不需要跨事件 JOIN。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="k">count</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;terminal.connect.start&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;auth.biometric.success&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">               </span><span class="s1">&#39;terminal.connect.done&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;terminal.input.submit&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">ts</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="s1">&#39;now&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;-7 days&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">name</span><span class="p">;</span></span></span></code></pre></div><p>步驟 N 的轉換率 = 步驟 N 的事件數 / 步驟 N-1 的事件數。流失率 = 1 - 轉換率。</p>
<h3 id="能做的">能做的</h3>
<ul>
<li>每步事件計數（單表 GROUP BY）</li>
<li>按 source.version 或 source.platform 分群（加 WHERE 條件）</li>
<li>按天/按週看趨勢（strftime 分桶 + GROUP BY）</li>
</ul>
<h3 id="做不到的">做不到的</h3>
<ul>
<li><strong>Session 級轉換率</strong>：「同一個 session 完成步驟 1 到步驟 4 的比例」需要 JOIN 同 session 的多個事件、跨所有 session 聚合。SQLite 能做這個 JOIN，但在大量 session 時效能不足。</li>
<li><strong>步驟間耗時</strong>：「使用者在步驟 1 和步驟 2 之間等了多久」需要 self-join on session_id + timestamp 差值計算。</li>
<li><strong>漏斗順序驗證</strong>：確認使用者是按 1→2→3→4 順序完成、不是跳步。</li>
</ul>
<h2 id="postgresql-層session-級-funnel">PostgreSQL 層：Session 級 funnel</h2>
<p>PostgreSQL backend 提供 window function 和高效 JOIN，能做完整的 session 級 funnel 分析。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">session_steps</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">session_id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">         </span><span class="n">ROW_NUMBER</span><span class="p">()</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">(</span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">session_id</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">ts</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">step_order</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;terminal.connect.start&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;auth.biometric.success&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">                 </span><span class="s1">&#39;terminal.connect.done&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;terminal.input.submit&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="k">AND</span><span class="w"> </span><span class="n">ts</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">NOW</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;7 days&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="n">session_max_step</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">session_id</span><span class="p">,</span><span class="w"> </span><span class="k">MAX</span><span class="p">(</span><span class="n">step_order</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">reached</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">  </span><span class="k">FROM</span><span class="w"> </span><span class="n">session_steps</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">  </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">session_id</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">reached</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">sessions</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">session_max_step</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">reached</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">reached</span><span class="p">;</span></span></span></code></pre></div><h3 id="新增能力">新增能力</h3>
<ul>
<li><strong>Session 級轉換率</strong>：每個 session 到達了哪一步、在哪一步流失</li>
<li><strong>步驟間耗時</strong>：LAG window function 計算相鄰步驟的 timestamp 差值</li>
<li><strong>漏斗順序驗證</strong>：用 ROW_NUMBER + CASE 確認步驟順序</li>
<li><strong>Cohort 分群的 funnel</strong>：按使用者註冊日期 / 版本 / 平台分群看不同 cohort 的 funnel 差異</li>
</ul>
<h2 id="jsonl-匯出後的臨時分析">JSONL 匯出後的臨時分析</h2>
<p>Collector 的 <code>monitor export --format=jsonl</code> 可以匯出事件為 JSONL 格式。匯出後用 grep + jq 做一次性的臨時分析：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">for</span> step in terminal.connect.start auth.biometric.success terminal.connect.done terminal.input.submit<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  <span class="nv">count</span><span class="o">=</span><span class="k">$(</span>grep <span class="s2">&#34;\&#34;name\&#34;:\&#34;</span><span class="nv">$step</span><span class="s2">\&#34;&#34;</span> exported-events.jsonl <span class="p">|</span> wc -l<span class="k">)</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">  <span class="nb">echo</span> <span class="s2">&#34;</span><span class="nv">$step</span><span class="s2">: </span><span class="nv">$count</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="k">done</span></span></span></code></pre></div><p>JSONL 臨時分析適合「快速看一眼大概數字」的場景。持續性的 funnel 監控應該用 SQLite 或 PostgreSQL 的 SQL 查詢，結果穩定且可重現。</p>
<h2 id="自架-vs-商業方案">自架 vs 商業方案</h2>
<table>
  <thead>
      <tr>
          <th>需求</th>
          <th>自架能力</th>
          <th>商業方案</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>每步事件計數</td>
          <td>SQLite GROUP BY</td>
          <td>Mixpanel / Amplitude 內建</td>
      </tr>
      <tr>
          <td>Session 級轉換率</td>
          <td>PostgreSQL window function</td>
          <td>Mixpanel / Amplitude 內建</td>
      </tr>
      <tr>
          <td>視覺化 funnel 漏斗圖</td>
          <td>自建 dashboard</td>
          <td>商業方案內建、拖拉設定</td>
      </tr>
      <tr>
          <td>即時更新</td>
          <td>定期重算 + dashboard 刷新</td>
          <td>商業方案即時</td>
      </tr>
      <tr>
          <td>A/B test 分群 funnel</td>
          <td>PostgreSQL + feature flag</td>
          <td>Optimizely / LaunchDarkly 整合</td>
      </tr>
  </tbody>
</table>
<p>自用工具場景下，SQLite 層的每步事件計數通常足夠。商業產品需要 session 級分析時，PostgreSQL 層的 SQL 能力和商業方案的分析能力在功能上對等，差異在 UI 和設定便利性。</p>
<h2 id="下一步路由">下一步路由</h2>
<ul>
<li>Funnel 分析的完整方法論 → <a href="/blog/monitoring/08-business-analytics/funnel-analysis/" data-link-title="Funnel Analysis" data-link-desc="使用者在哪一步流失 — 從事件序列計算每步轉換率、找出流失最嚴重的步驟、區分設計問題和技術問題">Funnel analysis</a></li>
<li>事件設計如何影響分析品質 → <a href="/blog/monitoring/08-business-analytics/behavior-event-design/" data-link-title="行為事件設計" data-link-desc="事件命名規範、屬性設計、funnel 定義 — 行為分析的品質取決於事件設計的品質">行為事件設計</a></li>
<li>功能分層定義 → <a href="/blog/monitoring/04-collector/feature-tier-boundary/" data-link-title="功能分層與 Backend 選擇" data-link-desc="SQLite 層和 PostgreSQL 層各自承載哪些功能 — 分界線是查詢模式而非資料量、觸發升級的是功能需求而非規模成長">功能分層與 Backend 選擇</a></li>
<li>去識別化是分析的入場條件 → <a href="/blog/monitoring/07-security-privacy/" data-link-title="模組七：資安與隱私" data-link-desc="SDK redaction / transport 加密 / collector access control / 去識別化 — 蒐集的資料本身就是風險資產">模組七 資安與隱私</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL → PostgreSQL：從 SQL dialect diff 跑出來的 Type A 6-phase migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-postgresql/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-postgresql/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a>。本文是 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type A 的標準形態實證。&lt;/p>&lt;/blockquote>
&lt;h2 id="三類-sql-dialect-diff-sample先看具體差距">三類 SQL dialect diff sample：先看具體差距&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- 1. Auto increment / sequence
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">INT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">AUTO_INCREMENT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 或 PG 10+:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">INT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">GENERATED&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ALWAYS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IDENTITY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 2. String concatenation
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL: CONCAT(a, b) 或 a || b 在 ANSI mode
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CONCAT&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">first_name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">last_name&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL: a || b 或 CONCAT(a, b)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">first_name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">last_name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 注意: PostgreSQL 對 NULL || x = NULL、MySQL CONCAT 對 NULL 處理不同
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 3. UPSERT
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">INSERT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INTO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;Alice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">DUPLICATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">UPDATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL (9.5+)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">INSERT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INTO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;Alice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CONFLICT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">DO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">UPDATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">SET&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXCLUDED&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">23&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">24&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 4. Index hint / FORCE INDEX
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">25&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">26&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FORCE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">idx_created_at&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2025-01-01&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">27&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL: 沒對應 syntax、依賴 planner + statistics
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">28&lt;/span>&lt;span class="cl">&lt;span class="c1">-- 必要時用 enable_seqscan=off 或 pg_hint_plan extension
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">29&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">30&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 5. JSON path
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">31&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL 5.7+
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">32&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">data&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="s1">&amp;#39;$.name&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">33&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">34&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">data&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="s1">&amp;#39;name&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">35&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">data&lt;/span>&lt;span class="o">-&amp;gt;&amp;gt;&lt;/span>&lt;span class="s1">&amp;#39;name&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- 取出 text&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>5 個 sample 看出 MySQL → PostgreSQL 主要工作是 &lt;em>SQL dialect translation&lt;/em>；不是 5-10 個函數差、是 &lt;em>跨整個 application SQL surface 的 audit + 改寫&lt;/em>。對應 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> 結果：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor migration playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> 跟 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a>。本文是 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> Type A 的標準形態實證。</p></blockquote>
<h2 id="三類-sql-dialect-diff-sample先看具體差距">三類 SQL dialect diff sample：先看具體差距</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. Auto increment / sequence
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1">-- MySQL
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INT</span><span class="w"> </span><span class="n">AUTO_INCREMENT</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- 或 PG 10+:
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INT</span><span class="w"> </span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">IDENTITY</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- 2. String concatenation
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1">-- MySQL: CONCAT(a, b) 或 a || b 在 ANSI mode
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">CONCAT</span><span class="p">(</span><span class="n">first_name</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="p">,</span><span class="w"> </span><span class="n">last_name</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL: a || b 或 CONCAT(a, b)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">first_name</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">last_name</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- 注意: PostgreSQL 對 NULL || x = NULL、MySQL CONCAT 對 NULL 處理不同
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- 3. UPSERT
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1">-- MySQL
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alice&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">DUPLICATE</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="k">UPDATE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">VALUES</span><span class="p">(</span><span class="n">name</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL (9.5+)
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alice&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">CONFLICT</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="k">DO</span><span class="w"> </span><span class="k">UPDATE</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">EXCLUDED</span><span class="p">.</span><span class="n">name</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"></span><span class="c1">-- 4. Index hint / FORCE INDEX
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1">-- MySQL
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">FORCE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="p">(</span><span class="n">idx_created_at</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL: 沒對應 syntax、依賴 planner + statistics
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="c1">-- 必要時用 enable_seqscan=off 或 pg_hint_plan extension
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="w"></span><span class="c1">-- 5. JSON path
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="c1">-- MySQL 5.7+
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">data</span><span class="o">-&gt;</span><span class="s1">&#39;$.name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">data</span><span class="o">-&gt;</span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">data</span><span class="o">-&gt;&gt;</span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 取出 text</span></span></span></code></pre></div><p>5 個 sample 看出 MySQL → PostgreSQL 主要工作是 <em>SQL dialect translation</em>；不是 5-10 個函數差、是 <em>跨整個 application SQL surface 的 audit + 改寫</em>。對應 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> 結果：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評估</th>
          <th>等級</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>SQL dialect 差大、CREATE TABLE / INDEX / function 都差</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>兩者都 OLTP RDBMS、replication 概念對等但語法不同</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>同 SQL RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>同 1 個</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>ORM 多數能 cover、raw SQL 必改</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
<p>主導維度 Schema = High、走 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Type A 6-phase playbook</a> 標準結構。</p>
<h2 id="phase-0rule-audit--sql-surface-盤點">Phase 0：rule audit + SQL surface 盤點</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. 列所有 stored procedure
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">routine_schema</span><span class="p">,</span><span class="w"> </span><span class="k">routine_name</span><span class="p">,</span><span class="w"> </span><span class="n">routine_type</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">information_schema</span><span class="p">.</span><span class="n">routines</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">routine_schema</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;mysql&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;sys&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;information_schema&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;performance_schema&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- 2. 列所有 trigger
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">trigger_name</span><span class="p">,</span><span class="w"> </span><span class="n">event_object_table</span><span class="p">,</span><span class="w"> </span><span class="n">action_statement</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">information_schema</span><span class="p">.</span><span class="n">triggers</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 3. 列所有 view
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">table_name</span><span class="p">,</span><span class="w"> </span><span class="n">view_definition</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">information_schema</span><span class="p">.</span><span class="n">views</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- 4. 列所有 index 含 prefix length
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL 對 prefix index 處理不同、要逐個 audit</span></span></span></code></pre></div><p>Audit 主要產出三類清單：</p>
<ul>
<li><strong>Direct port</strong>：標準 SQL feature、PG 直接接受</li>
<li><strong>Translate</strong>：MySQL-specific syntax、需要改寫（UPSERT / CONCAT NULL 行為 / index hint）</li>
<li><strong>Refactor</strong>：MySQL-specific behavior（auto_increment session-level / SELECT FOUND_ROWS / GROUP BY 寬鬆 / TEXT 隱性 cast）— 不能直接 port、application code 也要改</li>
</ul>
<h2 id="phase-1schema-對位">Phase 1：schema 對位</h2>
<table>
  <thead>
      <tr>
          <th>MySQL</th>
          <th>PostgreSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>INT AUTO_INCREMENT</code></td>
          <td><code>INT GENERATED ALWAYS AS IDENTITY</code> 或 <code>SERIAL</code></td>
      </tr>
      <tr>
          <td><code>TINYINT(1)</code> (boolean usage)</td>
          <td><code>BOOLEAN</code></td>
      </tr>
      <tr>
          <td><code>DATETIME</code></td>
          <td><code>TIMESTAMP WITHOUT TIME ZONE</code></td>
      </tr>
      <tr>
          <td><code>DATETIME(6)</code> (microsecond)</td>
          <td><code>TIMESTAMP(6)</code></td>
      </tr>
      <tr>
          <td><code>VARCHAR(N)</code> with charset</td>
          <td><code>VARCHAR(N)</code> (UTF-8 always)</td>
      </tr>
      <tr>
          <td><code>TEXT</code></td>
          <td><code>TEXT</code> (no length limit)</td>
      </tr>
      <tr>
          <td><code>LONGTEXT</code></td>
          <td><code>TEXT</code></td>
      </tr>
      <tr>
          <td><code>JSON</code></td>
          <td><code>JSONB</code> (推薦、indexed) 或 <code>JSON</code></td>
      </tr>
      <tr>
          <td><code>ENUM('a','b','c')</code></td>
          <td>自定 <code>TYPE foo AS ENUM('a','b','c')</code> 或 <code>VARCHAR + CHECK</code></td>
      </tr>
      <tr>
          <td><code>SET('a','b')</code></td>
          <td>Array <code>TEXT[]</code> + CHECK</td>
      </tr>
      <tr>
          <td><code>BINARY(N)</code></td>
          <td><code>BYTEA</code></td>
      </tr>
      <tr>
          <td>Index prefix <code>KEY (col(10))</code></td>
          <td>Functional index <code>CREATE INDEX ON t (LEFT(col, 10))</code></td>
      </tr>
      <tr>
          <td><code>FULLTEXT INDEX</code></td>
          <td><code>tsvector</code> + GIN index</td>
      </tr>
      <tr>
          <td>Geographic types</td>
          <td>PostGIS extension（必須先裝）</td>
      </tr>
  </tbody>
</table>
<p>Schema 對位表存版控、application code refactor 時對照。</p>
<h2 id="phase-2translation-pipeline3-tier-跟-splunk--elastic-類似">Phase 2：Translation pipeline（3-tier 跟 Splunk → Elastic 類似）</h2>
<h3 id="tier-1vendor--community-tool">Tier 1：vendor / community tool</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># pgloader：成熟工具、cover ~70-80% schema + data</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgloader mysql://user:pass@mysql-host/dbname <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>         postgresql://user:pass@pg-host/dbname
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 或 AWS DMS（managed、適合 RDS / Aurora target）</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># DMS task: Full Load + CDC</span></span></span></code></pre></div><h3 id="tier-2自家-sql-refactor">Tier 2：自家 SQL refactor</h3>
<p>對 ORM 不能 cover 的 raw SQL：</p>
<ul>
<li>Manual grep <code>application code</code> 找 <code>auto_increment</code> / <code>ON DUPLICATE KEY</code> / <code>FORCE INDEX</code> / <code>FOUND_ROWS()</code> / <code>CONCAT NULL</code></li>
<li>寫 codemod / lint rule、CI 強制 check（PG-incompatible SQL block PR）</li>
</ul>
<h3 id="tier-3tricky-case-manual">Tier 3：tricky case manual</h3>
<p>例：MySQL <code>SELECT * FROM t1, t2 WHERE t1.id = t2.id GROUP BY t1.id</code>（implicit GROUP BY 寬鬆）— PG 嚴格 GROUP BY 必須 list 所有 non-aggregate column；application code refactor 必要。</p>
<h2 id="phase-3parallel-run">Phase 3：Parallel run</h2>
<p>雙寫 + 雙讀比對 1-2 個月：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Application ──→ MySQL (write + read primary)
</span></span><span class="line"><span class="ln">2</span><span class="cl">            └─→ PostgreSQL (write only + read shadow)
</span></span><span class="line"><span class="ln">3</span><span class="cl">                                    ↓
</span></span><span class="line"><span class="ln">4</span><span class="cl">                            Diff checker (latency / result diff)</span></span></code></pre></div><p><code>pt-table-checksum</code> (MySQL) + 自家 checksum scanner 對 sample table 跑 daily checksum、找 schema 對位錯。</p>
<h2 id="phase-4cutover">Phase 4：Cutover</h2>
<ul>
<li>設 application maintenance window（30 分鐘）</li>
<li>Drain MySQL write、等 last LSN propagated to PG</li>
<li>Application switch connection string → PG</li>
<li>解除 maintenance、monitor 24-48 hours</li>
</ul>
<h2 id="phase-5cleanup">Phase 5：Cleanup</h2>
<ul>
<li>MySQL read-only 1-2 週（fallback window）</li>
<li>之後 stop replication、decommission MySQL</li>
</ul>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1auto_increment-vs-serial-跨-transaction-行為差">Case 1：Auto_increment vs SERIAL 跨 transaction 行為差</h3>
<p><strong>徵兆</strong>：cutover 後某 batch job 跑得比 MySQL 慢 5-10x、PG log 顯示 sequence 競爭。</p>
<p><strong>根因</strong>：MySQL <code>AUTO_INCREMENT</code> 取值受 <code>innodb_autoinc_lock_mode</code> 控制（8.0 預設 mode=2 interleaved 可並行、mode=0 才是 table-level lock；詳見 <a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">Lock contention</a>）、PG <code>SERIAL</code> 是 <em>sequence-level non-transactional</em>；mode=0 場景跟 PG SERIAL 差異最大、mode=2 跟 PG SERIAL 行為較接近（皆可亂號、皆可並行）。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>改 UUID v7 / bigserial</strong>：消除 sequence 競爭</li>
<li><strong>bigserial + cache</strong>：<code>CREATE SEQUENCE ... CACHE 100</code>、batch 預取 100 個 ID 降 contention</li>
<li><strong>批量 insert 改 COPY</strong>：<code>COPY t FROM STDIN</code> 是 PG 對 batch 最快路徑</li>
</ol>
<h3 id="case-2charset--collation-跑出-unicode-異常">Case 2：Charset / collation 跑出 unicode 異常</h3>
<p><strong>徵兆</strong>：cutover 後某些用戶名 / 中文文字 query 對不到結果、<code>SELECT * WHERE name = '張三'</code> 返回空。</p>
<p><strong>根因</strong>：MySQL default <code>utf8mb3</code>（3-byte UTF-8、不能存 emoji / 部分 unicode）、PG default <code>UTF8</code> 全 unicode；資料遷移時 MySQL 端的 utf8mb3 column 帶到 PG 後 <em>bytes 不變</em> 但 <em>collation rule 變</em>；string comparison 結果差。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration audit</strong>：MySQL 強制 <code>utf8mb4</code>、avoid utf8mb3 data</li>
<li><strong>Collation 對位</strong>：MySQL <code>utf8mb4_unicode_ci</code> → PG <code>LC_COLLATE = 'C.utf8'</code> 或 ICU collation</li>
<li><strong>Application encoding contract</strong>：明示 UTF-8 全範圍、不接受 utf8mb3-only client</li>
</ol>
<h3 id="case-3case-sensitivity-反轉">Case 3：Case sensitivity 反轉</h3>
<p><strong>徵兆</strong>：cutover 後 application query <code>SELECT * FROM users</code> 報錯 <code>relation does not exist</code>；但 <code>SELECT * FROM &quot;Users&quot;</code> works。</p>
<p><strong>根因</strong>：MySQL Linux default <em>table name case-sensitive</em>、Windows <em>case-insensitive</em>、配置 <code>lower_case_table_names</code> 影響；PG <em>all identifier folded to lowercase unless quoted</em>。MySQL on macOS 開發環境是 case-insensitive、PG 嚴格 case-sensitive、application code 端可能用 mixed case。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Schema migration 階段強制 lowercase</strong>：所有 table / column name 統一 lowercase</li>
<li><strong>Application code refactor</strong>：grep raw SQL 找 mixed case identifier、改 lowercase</li>
<li><strong>ORM 端設定 <code>naming_strategy</code></strong>：JPA / Hibernate 等明示 lowercase mapping</li>
</ol>
<h3 id="case-4replication-行為差cdc-pipeline-失效">Case 4：Replication 行為差、CDC pipeline 失效</h3>
<p><strong>徵兆</strong>：MySQL 端 binlog-based CDC（Debezium MySQL connector）跑得好好的、cutover 後 PG 端要重建 CDC pipeline、初期 1-2 週 message 模式異常。</p>
<p><strong>根因</strong>：MySQL binlog row format vs PG logical replication slot 完全不同 protocol；Debezium 對兩家連接器是 <em>獨立</em> binary、message schema 部分對等但不直通。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-cutover 建 PG 端 CDC</strong>：Debezium PG connector 提前部署、初期跟 MySQL CDC 並存比對</li>
<li><strong>Schema registry 同步</strong>：Avro schema 從 MySQL 端 export、註冊 PG 端 connector 用同 schema</li>
<li><strong>Consumer 端 idempotent</strong>：cutover 期間 dual-source、consumer 必須 idempotent 避免 duplicate</li>
</ol>
<h3 id="case-5fulltext-index-對應-tsvectorapplication-search-broken">Case 5：FULLTEXT INDEX 對應 tsvector、application search broken</h3>
<p><strong>徵兆</strong>：cutover 後 application 全文搜尋功能失效、<code>MATCH(name) AGAINST('xxx')</code> 不被 PG 認；application 端 raw SQL 對 search 寫死。</p>
<p><strong>根因</strong>：MySQL <code>FULLTEXT INDEX</code> + <code>MATCH ... AGAINST</code> syntax PG 不支援；PG 用 <code>tsvector + ts_rank + to_tsquery</code>、概念對等但 syntax 完全不同。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration</strong>：列 application 用到的 fulltext search 場景、改寫成 tsvector pattern</li>
<li><strong>大型 search 改 Elasticsearch / Meilisearch</strong>：fulltext 是專門 search engine 的本職、不該用 RDBMS 解</li>
<li><strong>降級為 LIKE</strong>：簡單 case <code>WHERE name ILIKE '%xxx%'</code>、performance 較差但相容性好</li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL</th>
          <th>PostgreSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance cost</td>
          <td>對等（同 EC2 / RDS spec）</td>
          <td>對等</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>對等</td>
          <td>對等</td>
      </tr>
      <tr>
          <td>Connection pooling</td>
          <td>proxysql / mysql-proxy</td>
          <td>PgBouncer（更成熟）</td>
      </tr>
      <tr>
          <td>Index performance</td>
          <td>對等</td>
          <td>對等</td>
      </tr>
      <tr>
          <td>JSON performance</td>
          <td>Improving</td>
          <td>JSONB 領先</td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Async binlog</td>
          <td>Async streaming + logical</td>
      </tr>
      <tr>
          <td>Extension ecosystem</td>
          <td>少</td>
          <td>大（PostGIS / TimescaleDB / pgvector）</td>
      </tr>
      <tr>
          <td>Migration cost (one-time)</td>
          <td>-</td>
          <td>2-6 FTE 月 × project length（含 application）</td>
      </tr>
  </tbody>
</table>
<p>Migration 主要 cost 在 <em>application code refactor + dual-write window operational</em>、不是 DB itself。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-postgresql--aurora-migration-串接">跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a> 串接</h3>
<p>部分組織走 <em>MySQL → PostgreSQL → Aurora</em> 兩段：</p>
<ul>
<li>先 MySQL → self-managed PostgreSQL（schema 對位 + application 改）</li>
<li>穩定後 self-managed PostgreSQL → Aurora（operational simplification）</li>
</ul>
<p>不要一次跑 <em>MySQL → Aurora PostgreSQL compat</em>、認知負擔太大、failure mode 互相干擾。</p>
<h3 id="跟-logical-replication--debezium-對位">跟 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> 對位</h3>
<p>PG 端 CDC pipeline 在 cutover 完成後立刻可用；可作為 <em>downstream CDC 重建</em> 的契機、設計 outbox pattern 更穩。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>MySQL 8 vs PostgreSQL 16 feature gap</strong>：MySQL 8 加了 CTE / window function / generated column；2025+ feature parity 漸高、migration ROI 評估會變</li>
<li><strong>Reverse migration</strong>（PG → MySQL）：少見、通常是 application 端 dependency lock-in（用了 MySQL-specific stored procedure）</li>
<li><strong>MariaDB → PostgreSQL</strong>：跟 MySQL → PG 類似、MariaDB 部分 syntax 略接近 PG（如 <code>RETURNING</code>）</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>Source / target vendor：<a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> / <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>後續路線：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a></li>
<li>平行 migration playbook：<a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a>（同為 Type A 高 schema 差）</li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a>（本文驗證 Type A 標準形態）</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PostgreSQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>Patroni-based HA&lt;/em> 的 lifecycle 設計 — 從正常運作到 failover 完成的 5 段、每段配置 + failure mode + recovery。&lt;/p>&lt;/blockquote>
&lt;h2 id="failover-lifecycle5-段不是一條曲線">Failover lifecycle：5 段不是一條曲線&lt;/h2>
&lt;p>PostgreSQL 原生沒有 auto-failover；primary 掛了、application 卡死、SRE 手動 promote standby — 整個過程通常 5-30 分鐘。Patroni 把這條鏈拆成 &lt;em>自動化的 5 段 lifecycle&lt;/em>、每段有自己的 trigger、配置、失敗模式：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>段&lt;/th>
 &lt;th>觸發&lt;/th>
 &lt;th>動作&lt;/th>
 &lt;th>失敗模式&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>1. Detection&lt;/strong>&lt;/td>
 &lt;td>Leader heartbeat 在 DCS（etcd / Consul）失聯&lt;/td>
 &lt;td>Standby 們開始觀察、累積失聯時間到 TTL&lt;/td>
 &lt;td>DCS 本身分裂 → false detection 啟動失敗 failover&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>2. Election&lt;/strong>&lt;/td>
 &lt;td>TTL 過、DCS 開放 leader lock&lt;/td>
 &lt;td>Standby 競爭寫 leader key（DCS quorum-based）&lt;/td>
 &lt;td>Network partition → 兩邊都自認 leader（split-brain）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>3. Promotion&lt;/strong>&lt;/td>
 &lt;td>新 leader 寫 DCS key 成功&lt;/td>
 &lt;td>跑 &lt;code>pg_ctl promote&lt;/code>、停 streaming replication、開始接寫&lt;/td>
 &lt;td>Standby 落後太多 → 拒 promote 或承接時資料缺&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>4. Reconfiguration&lt;/strong>&lt;/td>
 &lt;td>Patroni REST API 通知 routing 層&lt;/td>
 &lt;td>HAProxy / PgBouncer 切流量到新 leader&lt;/td>
 &lt;td>Routing 層 health check 慢 → 流量持續打舊 leader&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>5. Recovery&lt;/strong>&lt;/td>
 &lt;td>舊 leader 恢復（手動 / 自動）&lt;/td>
 &lt;td>跑 &lt;code>pg_rewind&lt;/code> + 重接 streaming replication 為 standby&lt;/td>
 &lt;td>WAL divergence 太大 → 必須重 base backup&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>每段都有獨立配置、不是「設一個 timeout 就好」。後面分段展開。&lt;/p>
&lt;h2 id="stage-1detection--dcs-heartbeat-跟-ttl">Stage 1：Detection — DCS heartbeat 跟 TTL&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c"># patroni.yml 核心配置&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">scope&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">myapp-pg-cluster&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/db/&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pg-node-1 &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># 跟 hostname 一致&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">etcd&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">hosts&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">etcd1:2379,etcd2:2379,etcd3:2379 &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># DCS quorum&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">protocol&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">https&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">bootstrap&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">dcs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ttl&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">30&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># leader lock TTL&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">loop_wait&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">10&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># patroni 主循環間隔&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">retry_timeout&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">10&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># DCS retry 上限&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">maximum_lag_on_failover&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">1048576&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># standby 落後 1MB 內才能 promote&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">synchronous_mode&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">false&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># async / sync 取捨&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>關鍵直覺：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PostgreSQL 在 OLTP 譜系的定位、本文聚焦 <em>Patroni-based HA</em> 的 lifecycle 設計 — 從正常運作到 failover 完成的 5 段、每段配置 + failure mode + recovery。</p></blockquote>
<h2 id="failover-lifecycle5-段不是一條曲線">Failover lifecycle：5 段不是一條曲線</h2>
<p>PostgreSQL 原生沒有 auto-failover；primary 掛了、application 卡死、SRE 手動 promote standby — 整個過程通常 5-30 分鐘。Patroni 把這條鏈拆成 <em>自動化的 5 段 lifecycle</em>、每段有自己的 trigger、配置、失敗模式：</p>
<table>
  <thead>
      <tr>
          <th>段</th>
          <th>觸發</th>
          <th>動作</th>
          <th>失敗模式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>1. Detection</strong></td>
          <td>Leader heartbeat 在 DCS（etcd / Consul）失聯</td>
          <td>Standby 們開始觀察、累積失聯時間到 TTL</td>
          <td>DCS 本身分裂 → false detection 啟動失敗 failover</td>
      </tr>
      <tr>
          <td><strong>2. Election</strong></td>
          <td>TTL 過、DCS 開放 leader lock</td>
          <td>Standby 競爭寫 leader key（DCS quorum-based）</td>
          <td>Network partition → 兩邊都自認 leader（split-brain）</td>
      </tr>
      <tr>
          <td><strong>3. Promotion</strong></td>
          <td>新 leader 寫 DCS key 成功</td>
          <td>跑 <code>pg_ctl promote</code>、停 streaming replication、開始接寫</td>
          <td>Standby 落後太多 → 拒 promote 或承接時資料缺</td>
      </tr>
      <tr>
          <td><strong>4. Reconfiguration</strong></td>
          <td>Patroni REST API 通知 routing 層</td>
          <td>HAProxy / PgBouncer 切流量到新 leader</td>
          <td>Routing 層 health check 慢 → 流量持續打舊 leader</td>
      </tr>
      <tr>
          <td><strong>5. Recovery</strong></td>
          <td>舊 leader 恢復（手動 / 自動）</td>
          <td>跑 <code>pg_rewind</code> + 重接 streaming replication 為 standby</td>
          <td>WAL divergence 太大 → 必須重 base backup</td>
      </tr>
  </tbody>
</table>
<p>每段都有獨立配置、不是「設一個 timeout 就好」。後面分段展開。</p>
<h2 id="stage-1detection--dcs-heartbeat-跟-ttl">Stage 1：Detection — DCS heartbeat 跟 TTL</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># patroni.yml 核心配置</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">scope</span><span class="p">:</span><span class="w"> </span><span class="l">myapp-pg-cluster</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">/db/</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">pg-node-1                               </span><span class="w"> </span><span class="c"># 跟 hostname 一致</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="nt">etcd</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">  </span><span class="nt">hosts</span><span class="p">:</span><span class="w"> </span><span class="l">etcd1:2379,etcd2:2379,etcd3:2379      </span><span class="w"> </span><span class="c"># DCS quorum</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">  </span><span class="nt">protocol</span><span class="p">:</span><span class="w"> </span><span class="l">https</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="nt">bootstrap</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">  </span><span class="nt">dcs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="nt">ttl</span><span class="p">:</span><span class="w"> </span><span class="m">30</span><span class="w">                                     </span><span class="c"># leader lock TTL</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="nt">loop_wait</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">                               </span><span class="c"># patroni 主循環間隔</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="nt">retry_timeout</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">                           </span><span class="c"># DCS retry 上限</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="nt">maximum_lag_on_failover</span><span class="p">:</span><span class="w"> </span><span class="m">1048576</span><span class="w">            </span><span class="c"># standby 落後 1MB 內才能 promote</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="nt">synchronous_mode</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">                     </span><span class="c"># async / sync 取捨</span></span></span></code></pre></div><p>關鍵直覺：</p>
<ul>
<li><strong>TTL (30s) = leader 失聯多久才被視為 dead</strong>。設太短（&lt; 15s）會把 transient network jitter 當 dead；設太長（&gt; 60s）unavailability 拖長</li>
<li><strong>loop_wait + retry_timeout &lt; TTL</strong>：Patroni 必須在 TTL 內成功跟 DCS 互動 N 次、<code>loop_wait=10 + retry_timeout=10</code> 給每個循環 20s buffer</li>
<li><strong>maximum_lag_on_failover</strong>：standby WAL 落後超過這個閾值就 <em>不參與 election</em>；防止「promote 一個落後 5 分鐘的 standby」資料丟失</li>
</ul>
<h2 id="stage-2election--dcs-quorum--watchdog-防-split-brain">Stage 2：Election — DCS quorum + watchdog 防 split-brain</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="nt">watchdog</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">required                               </span><span class="w"> </span><span class="c"># required / automatic / off</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="nt">device</span><span class="p">:</span><span class="w"> </span><span class="l">/dev/watchdog</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="nt">safety_margin</span><span class="p">:</span><span class="w"> </span><span class="m">5</span></span></span></code></pre></div><p>Election 期間最大風險是 <em>split-brain</em> — network partition 下、舊 leader 還活著但跟 DCS 斷線；新 leader 從 standby 升上來、application 同時連兩個 PostgreSQL 寫。資料 divergence 後 <em>無法自動 reconcile</em>。</p>
<p>防護機制兩層：</p>
<ol>
<li><strong>DCS quorum</strong>：etcd / Consul 至少 3 node、過半 quorum 才能寫 leader key — 少數派 partition 無法 elect 新 leader</li>
<li><strong>Watchdog (Linux kernel)</strong>：required mode 強制 — Patroni 必須定期 <em>poke</em> <code>/dev/watchdog</code>、若 Patroni 自己掛或被 OS 凍結、kernel 自動 reboot 整台機器、避免舊 leader 在 DCS 失聯後繼續接寫</li>
</ol>
<p>Watchdog <code>required</code> 是 production-grade 的硬要求 — <code>automatic</code> / <code>off</code> 在 split-brain 場景下無法防護。</p>
<h2 id="stage-3promotion--pg_ctl--replication-slot-切換">Stage 3：Promotion — pg_ctl + replication slot 切換</h2>
<p>新 leader 寫 DCS key 成功後、Patroni 自動執行：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Patroni 內部、不要手動跑</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pg_ctl promote -D /var/lib/postgresql/data
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># postgresql.auto.conf 移除 primary_conninfo</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># postgresql.auto.conf 重新計算 timeline ID</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 啟動接寫</span></span></span></code></pre></div><p>Promotion 期間關鍵議題：</p>
<ul>
<li><strong>timeline divergence</strong>：新 leader 開新 timeline ID（從 leader 失聯時的 LSN 開始）；其他 standby 需要 <code>pg_rewind</code> 把自己的 WAL fork 點對齊新 timeline</li>
<li><strong>replication slot 處理</strong>：舊 leader 上的 replication slot 在 DCS 中已 stale、新 leader 重建 slot；如果 logical replication consumer 沒 idempotent、會 replay 部分訊息</li>
<li><strong>promotion latency</strong>：通常 3-10 秒（pg_ctl 本身 &lt; 5s、加 DCS 寫確認）</li>
</ul>
<h2 id="stage-4reconfiguration--client-routing-切換">Stage 4：Reconfiguration — client routing 切換</h2>
<p>PostgreSQL 自己升 leader 還不夠、application 不知道；要靠前端 routing 層轉發。三種典型 pattern：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[client] → [HAProxy / pgBouncer] → [pg-node-1 (leader)]
</span></span><span class="line"><span class="ln">2</span><span class="cl">                                 → [pg-node-2 (standby, read)]
</span></span><span class="line"><span class="ln">3</span><span class="cl">                                 → [pg-node-3 (standby, read)]</span></span></code></pre></div><p>Patroni REST API 暴露 <code>/leader</code> / <code>/replica</code> / <code>/health</code> endpoint、HAProxy 用 <em>health check</em> 跑這些 endpoint：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl"># haproxy.cfg
</span></span><span class="line"><span class="ln">2</span><span class="cl">backend pg-write
</span></span><span class="line"><span class="ln">3</span><span class="cl">  option httpchk OPTIONS /leader
</span></span><span class="line"><span class="ln">4</span><span class="cl">  http-check expect status 200
</span></span><span class="line"><span class="ln">5</span><span class="cl">  server pg-node-1 pg-node-1:5432 check port 8008
</span></span><span class="line"><span class="ln">6</span><span class="cl">  server pg-node-2 pg-node-2:5432 check port 8008 backup
</span></span><span class="line"><span class="ln">7</span><span class="cl">  server pg-node-3 pg-node-3:5432 check port 8008 backup</span></span></code></pre></div><p>Reconfiguration 期間關鍵延遲：</p>
<ul>
<li>HAProxy health check 間隔（預設 2s）+ failure threshold（預設 3 次）= ~6s 切換感應</li>
<li>PgBouncer 不主動 health check、要靠 application 端 retry 跟 connection drop 觸發重連</li>
<li>整個 reconfiguration 端到端通常 10-20s（含 PostgreSQL promotion 時間）</li>
</ul>
<h2 id="stage-5recovery--pg_rewind-跟-base-backup-取捨">Stage 5：Recovery — pg_rewind 跟 base backup 取捨</h2>
<p>舊 leader 恢復後變 standby，但 WAL 已 divergence — 必須選一條 recovery path：</p>
<ul>
<li><strong><code>pg_rewind</code></strong>：rewind 舊 leader WAL 到分歧點、重新接 streaming replication；條件 = 分歧 WAL 量小（&lt; 幾 GB）且 timeline 可對齊</li>
<li><strong>重 base backup</strong>：用 <code>pg_basebackup</code> 從新 leader 拉完整 base + WAL；條件 = 任何時候都可、但時間長（TB 級 1-4 小時）</li>
</ul>
<p>Patroni 預設嘗試 pg_rewind、失敗才退 base backup。production 配置：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="nt">postgresql</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="nt">use_pg_rewind</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="nt">remove_data_directory_on_rewind_failure</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">   </span><span class="c"># rewind 失敗自動清 data dir、再 base backup</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="nt">remove_data_directory_on_diverged_timelines</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span></span></span></code></pre></div><h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1split-brain-due-to-dcs-partition">Case 1：Split-brain due to DCS partition</h3>
<p><strong>徵兆</strong>：兩個 PostgreSQL node 都在接寫、application 大量寫入 conflict / unique constraint violation。</p>
<p><strong>根因</strong>：DCS（etcd）partition — 兩個 etcd node 在 partition 兩側、都自認 quorum；其實是 split-vote、兩邊都不應該。Patroni 在兩邊各 elect 一個 leader。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>DCS 必須奇數 node（3 / 5 / 7）、過半 quorum 嚴格 enforce</li>
<li>DCS 部署跨 AZ / region 時、quorum size 要考慮 partition 機率（3 AZ 各 1 node 是 production 最低標）</li>
<li>Watchdog <code>required</code> mode 是最後一道閘門 — DCS partition 加 quorum 失靈時、watchdog 強制 reboot 失聯 node</li>
</ol>
<h3 id="case-2standby-落後太多無法-failover">Case 2：Standby 落後太多、無法 failover</h3>
<p><strong>徵兆</strong>：primary 失聯後、Patroni log 顯示 <code>Following members have lag greater than maximum_lag_on_failover</code>、所有 standby 都被拒 promote、cluster unavailable。</p>
<p><strong>根因</strong>：maximum_lag_on_failover 設 1MB、但 standby replication lag 累積到 50MB（write-heavy workload + slow disk on standby）。安全機制觸發、但代價是 <em>無 standby 可升</em>、需要人工降低門檻或等 standby catch up。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>預防</strong>：standby 容量 / IO 對齊 primary、避免 lag 累積；prometheus alert <code>pg_replication_lag_bytes &gt; 10MB</code> 觸發前 catch</li>
<li><strong>臨時</strong>：手動 <code>patronictl edit-config</code> 把 maximum_lag_on_failover 暫時拉到 50MB、接受可能丟 50MB worth of writes、換 availability</li>
<li><strong>長期</strong>：sync replication（一個 standby 強制同步）、保證至少一個 standby zero-lag</li>
</ol>
<h3 id="case-3promotion-後-application-connection-storm">Case 3：Promotion 後 application connection storm</h3>
<p><strong>徵兆</strong>：failover 完成後 30-120 秒內、application log 大量 <code>connection refused</code> / <code>password authentication failed</code>、application 自己 retry storm。</p>
<p><strong>根因</strong>：新 leader 剛 promote、PostgreSQL <code>max_connections</code> 容量還在 warm up（shared memory / cache 未 prime）、application 同時湧入大量 connection request；應用 retry 不夠 jitter、queue 堆積。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>Application 用 <em>exponential backoff with jitter</em>、不要 immediate retry</li>
<li>PgBouncer / connection pool 限制每 application instance 對 PG 的 connection 上限、不直連 PG</li>
<li>預先在 standby 跑 <code>pg_prewarm</code> 把熱表 cache 預熱、promotion 後 cache miss 不爆</li>
</ol>
<h3 id="case-4pg_rewind-失敗退到-base-backup-沒做">Case 4：pg_rewind 失敗、退到 base backup 沒做</h3>
<p><strong>徵兆</strong>：舊 leader 恢復後、Patroni log 顯示 <code>pg_rewind failed</code>、舊 leader 一直 STARTING、無法重接 cluster；SRE 手動跑 pg_basebackup 才恢復。</p>
<p><strong>根因</strong>：<code>remove_data_directory_on_rewind_failure: false</code>（預設）— rewind 失敗時 Patroni 不主動清 data dir、需要 SRE 手動處理；運維沒 runbook、卡在這步幾小時。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>Production 設 <code>remove_data_directory_on_rewind_failure: true</code> + <code>remove_data_directory_on_diverged_timelines: true</code>、讓 Patroni 自動 fallback</li>
<li>data dir 跑在獨立 PV / disk、清掉風險可控（不要跑 root disk）</li>
<li>容量規劃：base backup 時間預估納入 RTO（TB 級 base backup 1-4 小時、不是 RTO 30 分鐘所能承受）</li>
</ol>
<h3 id="case-5watchdog-觸發整機-reboot誤殺">Case 5：Watchdog 觸發整機 reboot、誤殺</h3>
<p><strong>徵兆</strong>：production server 在無故障時 unexpected reboot、<code>dmesg</code> 顯示 <code>watchdog: BUG: soft lockup</code>。</p>
<p><strong>根因</strong>：Patroni 主循環因 etcd 短暫慢回應卡住 60+ 秒、kernel watchdog 觸發 reboot；但實際 PostgreSQL 沒 hang、是 Patroni-watchdog 鏈過敏。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><code>safety_margin</code> 設大一點（10-15）、給 Patroni loop_wait 抖動空間</li>
<li>etcd 跟 Patroni 部署在低延遲 network 內（同 AZ &lt; 5ms）、跨 region etcd 不建議</li>
<li>watchdog device 用 softdog（軟體模擬）vs 硬體 watchdog、debug 時 softdog 容易觀察</li>
</ol>
<h2 id="容量規劃">容量規劃</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>估算</th>
          <th>警戒</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster size</td>
          <td>3-5 node（含 leader + 2-4 standby）</td>
          <td>&lt; 3 不能 HA（單 standby 失敗整 cluster 掛）</td>
      </tr>
      <tr>
          <td>DCS size</td>
          <td>3 / 5 / 7 node（奇數 quorum）</td>
          <td>etcd 5 node 是 prod standard</td>
      </tr>
      <tr>
          <td>TTL</td>
          <td>30s（default 30、production 20-60）</td>
          <td>&lt; 15s 過敏、&gt; 60s 過鈍</td>
      </tr>
      <tr>
          <td>maximum_lag_on_failover</td>
          <td>1MB（default）</td>
          <td>大表 write-heavy 可放 10-100MB</td>
      </tr>
      <tr>
          <td>Synchronous standby</td>
          <td>1 個 sync + N 個 async 是 production 預設</td>
          <td>全 async 容易丟資料、全 sync write latency 爆</td>
      </tr>
      <tr>
          <td>RTO</td>
          <td>10-30 秒（detection 30s 內 + promotion 5-10s + reconfig 5s）</td>
          <td>&gt; 60s 要 audit 鏈路</td>
      </tr>
      <tr>
          <td>RPO</td>
          <td>sync mode 接近 0、async mode 跟 lag 同數量級</td>
          <td>async 在 disk IO 慢時 lag 可能 MB-GB level</td>
      </tr>
  </tbody>
</table>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-pgbouncer-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer</a> 整合</h3>
<p>PgBouncer 不主動感知 Patroni failover、要靠：</p>
<ol>
<li><strong>HAProxy 在 PgBouncer 上層</strong>：HAProxy 跑 Patroni health check、PgBouncer connection 重新路由</li>
<li><strong>PgBouncer reload</strong>：failover 後 SRE / automation 跑 <code>pgbouncer -R</code>、強制重連 backend</li>
<li><strong>Connection pool drain</strong>：application 端 connection pool 設 <code>pool_lifetime_max=5min</code>、舊 connection 自然汰換</li>
</ol>
<h3 id="跟-cert-managertls-rotation">跟 cert-manager（TLS rotation）</h3>
<p>Patroni REST API 跟 PostgreSQL streaming replication 都用 TLS、cert rotation 不能停服務：</p>
<ol>
<li>cert-manager 自動換證後、Patroni 跟 PostgreSQL 都需要 reload（不是 restart）</li>
<li><code>patronictl reload &lt;cluster&gt;</code> 不會觸發 failover、只 reload config</li>
<li>PostgreSQL <code>pg_ctl reload</code> 是 SIGHUP、平滑載入新 cert</li>
</ol>
<h3 id="跟-backup--pitr">跟 backup / PITR</h3>
<p>Patroni 不管 backup — 但 standby promotion 後、WAL archive 必須跟新 leader 的 timeline 對齊：</p>
<ol>
<li>WAL archive 命令模板含 <code>%t</code>（timeline）：<code>archive_command = 'wal-g wal-push %p'</code></li>
<li>Backup tool（pgBackRest / WAL-G）支援 timeline 切換、archive 不會中斷</li>
<li>詳見 <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL archiving deep article</a></li>
</ol>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Multi-region Patroni</strong>：跨 region 部署的 DCS quorum 設計、跟單 region 的取捨完全不同</li>
<li><strong>PostgreSQL 16+ streaming replication slot 持久化</strong>：簡化 standby promotion 後 logical consumer 重連</li>
<li><strong>跟 Kubernetes operator 整合</strong>：Patroni 跑在 K8s 時、StatefulSet + pod identity + DCS 部署模式</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>上游 chapter：<a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">High Concurrency Access</a> — connection / replication / HA 全鏈</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer 配置</a> / <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/dynamic-credential/" data-link-title="HashiCorp Vault Dynamic Credential：lease 治理跟 application 整合的實作層" data-link-desc="Vault database secrets engine 怎麼配、application 怎麼 renew lease、production 五大踩雷（lease 過期 race、DB max_connections 撞牆、Vault sealed、token expire、scope 過寬）、容量規劃跟 vault-agent injector 整合">Vault Dynamic Credential</a></li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN + replication slot 的三軸組合</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-topology/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-topology/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>streaming replication topology&lt;/em> — 從 single primary 到 multi-standby 部署的 3 個 trade-off 軸 + LSN + replication slot 機制。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="replication-的-3-個-trade-off-軸--mode-選擇">Replication 的 3 個 trade-off 軸 + mode 選擇&lt;/h2>
&lt;p>PG streaming replication mode 選擇看起來是「async 還是 sync」、實際是 3 個獨立 trade-off 軸的組合、async / sync / quorum-based sync 是這些軸的常見組合 &lt;em>名稱&lt;/em>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>軸&lt;/th>
 &lt;th>端 A&lt;/th>
 &lt;th>端 B&lt;/th>
 &lt;th>PG 旋鈕&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>Durability&lt;/strong>&lt;/td>
 &lt;td>primary 寫完就 commit&lt;/td>
 &lt;td>至少一個 standby 收到才 commit&lt;/td>
 &lt;td>&lt;code>synchronous_commit&lt;/code> / &lt;code>synchronous_standby_names&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Latency&lt;/strong>&lt;/td>
 &lt;td>client 等 primary 寫完 OK&lt;/td>
 &lt;td>client 等 standby ack（額外 RTT）&lt;/td>
 &lt;td>同上&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Consistency&lt;/strong>&lt;/td>
 &lt;td>standby 隨時可能 stale&lt;/td>
 &lt;td>standby 跟 primary 保證讀到一致&lt;/td>
 &lt;td>application read routing rule（不是 replication 旋鈕）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>跟這三軸獨立的、是 &lt;em>replication 機制本身的可維護性&lt;/em>：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>LSN（Log Sequence Number）&lt;/strong>：PG 用全域 byte offset 標 WAL 進度、所有 standby 同步用 LSN 對齊、不像 MySQL 早期 binlog position + file 雙欄&lt;/li>
&lt;li>&lt;strong>Replication slot&lt;/strong>：primary 紀錄每個 standby 已接收的 LSN、防 standby 失聯期間 WAL 被清掉、是 streaming replication 的 &lt;em>持久化進度追蹤&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology&lt;/a> 對比、PG 的 LSN + replication slot 直接內建 &lt;em>standby 進度追蹤&lt;/em>、不像 MySQL 5.7- 要靠 binlog position + GTID 雙機制；但 slot 是 &lt;em>primary 紀錄&lt;/em>、orphan slot 是 PG-specific 議題（slot 留 WAL 直到 standby 重連、standby 永久失聯 → primary disk 爆）。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>streaming replication topology</em> — 從 single primary 到 multi-standby 部署的 3 個 trade-off 軸 + LSN + replication slot 機制。</p></blockquote>
<hr>
<h2 id="replication-的-3-個-trade-off-軸--mode-選擇">Replication 的 3 個 trade-off 軸 + mode 選擇</h2>
<p>PG streaming replication mode 選擇看起來是「async 還是 sync」、實際是 3 個獨立 trade-off 軸的組合、async / sync / quorum-based sync 是這些軸的常見組合 <em>名稱</em>：</p>
<table>
  <thead>
      <tr>
          <th>軸</th>
          <th>端 A</th>
          <th>端 B</th>
          <th>PG 旋鈕</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Durability</strong></td>
          <td>primary 寫完就 commit</td>
          <td>至少一個 standby 收到才 commit</td>
          <td><code>synchronous_commit</code> / <code>synchronous_standby_names</code></td>
      </tr>
      <tr>
          <td><strong>Latency</strong></td>
          <td>client 等 primary 寫完 OK</td>
          <td>client 等 standby ack（額外 RTT）</td>
          <td>同上</td>
      </tr>
      <tr>
          <td><strong>Consistency</strong></td>
          <td>standby 隨時可能 stale</td>
          <td>standby 跟 primary 保證讀到一致</td>
          <td>application read routing rule（不是 replication 旋鈕）</td>
      </tr>
  </tbody>
</table>
<p>跟這三軸獨立的、是 <em>replication 機制本身的可維護性</em>：</p>
<ul>
<li><strong>LSN（Log Sequence Number）</strong>：PG 用全域 byte offset 標 WAL 進度、所有 standby 同步用 LSN 對齊、不像 MySQL 早期 binlog position + file 雙欄</li>
<li><strong>Replication slot</strong>：primary 紀錄每個 standby 已接收的 LSN、防 standby 失聯期間 WAL 被清掉、是 streaming replication 的 <em>持久化進度追蹤</em></li>
</ul>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a> 對比、PG 的 LSN + replication slot 直接內建 <em>standby 進度追蹤</em>、不像 MySQL 5.7- 要靠 binlog position + GTID 雙機制；但 slot 是 <em>primary 紀錄</em>、orphan slot 是 PG-specific 議題（slot 留 WAL 直到 standby 重連、standby 永久失聯 → primary disk 爆）。</p>
<h2 id="async-streamingdefault--高-throughput-的代價">Async streaming：default + 高 throughput 的代價</h2>
<p>Async 是 PG 預設、行為：</p>
<ol>
<li>Primary 寫 WAL 進 <code>pg_wal/</code> 目錄、commit、回應 client OK</li>
<li>WAL sender process 把 WAL stream 給 standby</li>
<li>Standby WAL receiver 寫 standby 的 <code>pg_wal/</code>、startup 進程 redo 套用</li>
</ol>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>Durability：primary commit 後 standby 還沒收 → primary 永久故障 → <em>data loss</em>（已 commit 的 transaction 在 standby 不存在）</li>
<li>Latency：client 寫入延遲 = primary 自身 fsync WAL 的時間（<code>fsync=on</code> + <code>synchronous_commit=on</code> 預設、通常 &lt; 1ms 在 SSD / NVMe）</li>
<li>Consistency：standby 可能 lag、application 讀 standby 會 stale；用 <code>pg_stat_replication.write_lag / flush_lag / replay_lag</code> 看</li>
</ul>
<p><strong>配置</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf on primary</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">wal_level</span> <span class="o">=</span> <span class="s">replica          # 至少 replica（logical 是 superset）</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">max_wal_senders</span> <span class="o">=</span> <span class="s">10         # 並行 WAL sender process 數（依 standby 數量）</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">wal_keep_size</span> <span class="o">=</span> <span class="s">1024MB       # WAL 保留量（slot 為主、但 backup buffer）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">synchronous_commit</span> <span class="o">=</span> <span class="s">on      # 預設、primary 自己 fsync WAL</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># synchronous_standby_names 留空 = async</span></span></span></code></pre></div><p><strong>適用</strong>：</p>
<ul>
<li>主流選擇（90% 場景）</li>
<li>Failover loss 在容忍範圍（多數 web 應用容忍 1-2 秒 data loss）</li>
<li>Read scaling 為主要 driver、絕對 durability 非首要</li>
</ul>
<h2 id="sync-streaming至少一個-standby-flush-wal-才-commit">Sync streaming：至少一個 standby flush WAL 才 commit</h2>
<p>Sync mode 在 async 基礎上加 <em>primary 等指定 standby flush WAL 才回 client</em>：</p>
<ol>
<li>Primary 寫 WAL、send to standby</li>
<li>Standby 收到 WAL、寫進 <code>pg_wal/</code>、fsync、回 ack</li>
<li><em>Primary 等 ack</em> → commit → 回 client</li>
</ol>
<p><code>synchronous_commit</code> 有 5 個 level、不是 binary：</p>
<table>
  <thead>
      <tr>
          <th>Level</th>
          <th>行為</th>
          <th>Latency 影響</th>
          <th>Crash data loss</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>off</code></td>
          <td>primary 不等自己 fsync、background flush</td>
          <td>+0</td>
          <td>primary crash 丟 0-1 秒</td>
      </tr>
      <tr>
          <td><code>local</code></td>
          <td>primary fsync own WAL（不等 standby）</td>
          <td>baseline</td>
          <td>primary crash 0、standby 丟</td>
      </tr>
      <tr>
          <td><code>remote_write</code></td>
          <td>primary fsync + standby 收到（不必 standby fsync）</td>
          <td>+1 RTT 大致</td>
          <td>OS crash on standby 丟</td>
      </tr>
      <tr>
          <td><code>on</code> (預設)</td>
          <td>primary fsync + standby fsync（standby 收進 disk）</td>
          <td>+1 RTT + fsync</td>
          <td>全 crash 都不丟</td>
      </tr>
      <tr>
          <td><code>remote_apply</code></td>
          <td>primary fsync + standby fsync + standby 已 <em>replay</em>（visible to read）</td>
          <td>+1 RTT + fsync + replay</td>
          <td>全 crash 都不丟 + replica 立刻可讀</td>
      </tr>
  </tbody>
</table>
<p><strong>配置（synchronous）</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">synchronous_commit</span> <span class="o">=</span> <span class="s">on</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">synchronous_standby_names</span> <span class="o">=</span> <span class="s">&#39;FIRST 1 (standby1, standby2)&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># &#39;FIRST 1&#39; = 第一個 active standby ack 即可</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># &#39;ANY 2 (s1, s2, s3)&#39; = 任 2 個 ack 即可（quorum-based）</span></span></span></code></pre></div><p><strong>Quorum-based sync</strong>：用 <code>ANY N</code> 語法、達到 N 個 ack 就 commit、提高 latency stability（不依賴特定 standby）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">synchronous_standby_names</span> <span class="o">=</span> <span class="s">&#39;ANY 2 (standby1, standby2, standby3)&#39;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 3 個 standby 中任 2 個 ack 即 commit</span></span></span></code></pre></div><p><strong>適用</strong>：</p>
<ul>
<li>金融交易 / 訂單 / payment ledger（不允許 data loss）</li>
<li>已有 multi-AZ deploy、replica 物理上可靠</li>
<li>可接受寫入延遲 +1-3ms (跨 AZ)</li>
</ul>
<p><strong>不適用</strong>：</p>
<ul>
<li>跨 region sync（RTT 50-200ms）— 寫吞吐砍半、改用 <em>region-local sync + cross-region async</em></li>
<li>寫吞吐 &gt; 50K WPS + 容忍 sub-second loss — async 即可</li>
</ul>
<h2 id="lsn--replication-slotpg-的進度追蹤機制">LSN + Replication Slot：PG 的進度追蹤機制</h2>
<p>PG 每個 WAL 寫入都標 <em>LSN</em>（64-bit byte offset）。Standby 紀錄 <em>已收到 / 已 flush / 已 replay</em> 的 LSN、primary 透過 streaming protocol 知道每個 standby 進度。</p>
<p><strong>Replication slot</strong> 是 <em>primary 端的 standby 進度紀錄</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 建 physical replication slot（給 streaming replication 用）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby1_slot&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 查 slot 狀態
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">,</span><span class="w"> </span><span class="n">confirmed_flush_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">lag</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="p">;</span></span></span></code></pre></div><p><strong>Slot 的核心責任</strong>：</p>
<ul>
<li><em>防 WAL premature deletion</em>：standby 失聯（restart / network blip）、primary 仍保留 slot 對應 LSN 之後的 WAL、standby 重連可繼續 stream</li>
<li><em>無需 base backup re-build</em>：跟沒 slot 的 standby 對比、有 slot 的 standby 失聯後重連、不用重建</li>
</ul>
<p><strong>Slot 跟 <code>wal_keep_size</code></strong>：</p>
<ul>
<li><code>wal_keep_size</code>（PG 13+）/ <code>wal_keep_segments</code>（&lt; 13）：minimum WAL 保留量、不依賴 slot</li>
<li>Slot 是 <em>動態保留</em>：直到 slot 的 standby 推進 LSN 才釋放對應 WAL</li>
<li>兩者組合：<code>wal_keep_size</code> 是底線、slot 是 standby-specific 動態保留</li>
</ul>
<p><strong>Standby 配置（用 slot）</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># standby1 postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">primary_conninfo</span> <span class="o">=</span> <span class="s">&#39;host=primary.example.com port=5432 user=replication password=...&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">primary_slot_name</span> <span class="o">=</span> <span class="s">&#39;standby1_slot&#39;   # 用 primary 上預先建的 slot</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">hot_standby</span> <span class="o">=</span> <span class="s">on                       # 讓 standby 接受 read query</span></span></span></code></pre></div><p><code>standby.signal</code> 空檔案在 PG_DATA 內、告訴 PG 這是 standby、進入 recovery mode。</p>
<h2 id="配置-step-by-stepsync-streaming--slot">配置 step-by-step（sync streaming + slot）</h2>
<p>實務最常見組合：sync streaming + replication slot + cross-AZ replica。</p>
<h3 id="step-1primary-配置">Step 1：Primary 配置</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="na">wal_level</span> <span class="o">=</span> <span class="s">replica</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="na">max_wal_senders</span> <span class="o">=</span> <span class="s">10</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">max_replication_slots</span> <span class="o">=</span> <span class="s">10</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">synchronous_commit</span> <span class="o">=</span> <span class="s">on</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">synchronous_standby_names</span> <span class="o">=</span> <span class="s">&#39;FIRST 1 (standby1, standby2)&#39;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">wal_keep_size</span> <span class="o">=</span> <span class="s">1024MB</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># pg_hba.conf — 允許 replication 連線</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="na">host replication replication 10.0.0.0/16 scram-sha-256</span></span></span></code></pre></div><p>Restart primary 套用。</p>
<h3 id="step-2建-replication-user--slot">Step 2：建 replication user + slot</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">USER</span><span class="w"> </span><span class="n">replication</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="n">REPLICATION</span><span class="w"> </span><span class="n">PASSWORD</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby1_slot&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby2_slot&#39;</span><span class="p">);</span></span></span></code></pre></div><h3 id="step-3standby-base-backup">Step 3：Standby base backup</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 在 standby 上跑</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pg_basebackup -h primary.example.com -D /var/lib/postgresql/data <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  -U replication -P -X stream <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  -S standby1_slot -R
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># -R: 自動生成 standby.signal + primary_conninfo</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># -X stream: 邊 backup 邊 stream 增量 WAL（避免 backup 期間 WAL gap）</span></span></span></code></pre></div><h3 id="step-4standby-啟動">Step 4：Standby 啟動</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># standby /var/lib/postgresql/data/postgresql.auto.conf 已有：</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># primary_conninfo = &#39;host=primary.example.com user=replication password=... application_name=standby1&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># primary_slot_name = &#39;standby1_slot&#39;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl">pg_ctl -D /var/lib/postgresql/data start</span></span></code></pre></div><h3 id="step-5驗證">Step 5：驗證</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Primary: 確認 standby 連上
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">application_name</span><span class="p">,</span><span class="w"> </span><span class="k">state</span><span class="p">,</span><span class="w"> </span><span class="n">sync_state</span><span class="p">,</span><span class="w"> </span><span class="n">write_lag</span><span class="p">,</span><span class="w"> </span><span class="n">flush_lag</span><span class="p">,</span><span class="w"> </span><span class="n">replay_lag</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_replication</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 應顯示 standby1 / streaming / sync / 各 lag
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Standby: 確認在 recovery + 收到 WAL
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_is_in_recovery</span><span class="p">(),</span><span class="w"> </span><span class="n">pg_last_wal_receive_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">pg_last_wal_replay_lsn</span><span class="p">();</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-standby-lag-暴衝--single-replay-process-bottleneck">1. Standby lag 暴衝 — Single replay process bottleneck</h3>
<p>PG standby 是 <em>single startup process</em> 套用 WAL（不像 MySQL multi-thread replication）、primary 高並發寫入時 standby 跟不上、lag 從 &lt; 100ms 飆到分鐘級。常見觸發：批次 UPDATE / DELETE、大 transaction、index 建立、autovacuum 大量 dead tuple cleanup。</p>
<p>修法：</p>
<ul>
<li><em>Parallel WAL apply</em>（PG 14+）：<code>max_parallel_workers_per_gather</code> 增加 background worker、但仍受 startup process 主導</li>
<li>對 <em>read scaling</em> 場景接受 standby lag、application 用 <em>primary read 對 latency-critical query</em></li>
<li><em>Cascading replication</em> 對 high-fan-out 解決 sender CPU bottleneck、但 standby replay 仍 single-thread</li>
</ul>
<p>監控：<code>pg_stat_replication.replay_lag</code> 是 <em>最後一個 commit 到 standby replay 的時間差</em>、超過 threshold 即告警。</p>
<h3 id="2-sync-standby-失聯時-primary-commit-卡住">2. Sync standby 失聯時 primary commit 卡住</h3>
<p><code>synchronous_standby_names = 'FIRST 1 (standby1)'</code> + standby1 down → primary commit <em>等永遠</em>。Application 全部 timeout。</p>
<p>修法：</p>
<ul>
<li>用 <code>ANY N</code> quorum：<code>synchronous_standby_names = 'ANY 1 (standby1, standby2)'</code> — 任一 standby ack 即可</li>
<li>設多 standby、防單一失聯</li>
<li>監控 sync standby 健康、自動 failover 切 sync mode 到其他 standby（Patroni 自動做）</li>
<li>緊急情況：在 primary 跑 <code>ALTER SYSTEM SET synchronous_standby_names = ''; SELECT pg_reload_conf();</code> 暫時退 async（接受 data loss risk）</li>
</ul>
<h3 id="3-orphan-replication-slot--primary-disk-爆">3. Orphan replication slot — Primary disk 爆</h3>
<p>Standby 失聯（永久故障 / 重 decommission 但忘了 drop slot）、primary slot 持續保留 WAL、<code>pg_wal/</code> 累積到 disk 滿、primary 也掛。</p>
<p>修法：</p>
<ul>
<li>
<p>監控 <code>pg_replication_slots.active</code> — <code>false</code> 持續 &gt; N 小時是警訊</p>
</li>
<li>
<p>監控 slot lag：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">retained_wal</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">retained_wal</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">10</span><span class="n">GB</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p>設 <code>max_slot_wal_keep_size</code>（PG 13+）— slot 對應 WAL 超過 limit 自動 invalidate slot（standby 之後要 base backup 重來）</p>
</li>
<li>
<p>DR runbook 紀錄 <em>standby 退役流程</em> 必須包含 <code>pg_drop_replication_slot('xxx')</code></p>
</li>
</ul>
<h3 id="4-cascading-replication-雪崩">4. Cascading replication 雪崩</h3>
<p>Topology <code>primary → standby1 → standby2 → ...</code>（每層遞迴 stream）。Standby1 startup process 卡住、後續 standby 都被 block、整條 chain 雪崩。</p>
<p>修法：</p>
<ul>
<li>避免超過 2 層 cascade（primary → tier1 → tier2 是上限）</li>
<li>跨 region 用 <em>region-local tier1 + cross-region tier2</em>、不是長 chain</li>
<li>真的大規模、改用 <em>binlog server</em> style：<a href="https://github.com/postgresml/PgCat">Citus / PgCat</a> 等中介、或 logical replication 解耦</li>
</ul>
<h3 id="5-failover-後-timeline-分歧">5. Failover 後 timeline 分歧</h3>
<p>Primary 失敗、standby1 promote 為新 primary、其他 standby（standby2 / 3）原本連舊 primary、必須重新連 standby1。但 PG 用 <em>timeline</em>（每次 promotion 增 1）標 WAL 分支、原 standby 的 timeline 跟新 primary 不同。重連時看到 timeline mismatch、報錯。</p>
<p>修法：</p>
<ul>
<li><em>pg_rewind</em> 工具：對比新 primary 跟舊 standby 的 timeline 分歧點、把舊 standby 上 <em>新 primary 沒有的 WAL</em> 倒退、然後從分歧點重新跟新 primary 同步</li>
<li><em>Base backup re-build</em>：對舊 standby 重建 — 慢但保證乾淨</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni</a> 自動處理 pg_rewind / base backup 選擇</li>
</ul>
<h2 id="容量--cost-對照">容量 / cost 對照</h2>
<table>
  <thead>
      <tr>
          <th>配置</th>
          <th>寫吞吐影響</th>
          <th>Standby overhead</th>
          <th>適合 workload</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Async streaming + slot</td>
          <td>baseline</td>
          <td>低（WAL receive + startup）</td>
          <td>高吞吐、容忍 sub-second loss</td>
      </tr>
      <tr>
          <td>Sync <code>remote_write</code> + 1 standby</td>
          <td>-5% ~ -10%</td>
          <td>同上 + RTT</td>
          <td>一般 production、可接受 OS crash 丟</td>
      </tr>
      <tr>
          <td>Sync <code>on</code> + 1 standby</td>
          <td>-10% ~ -20%</td>
          <td>同上 + fsync</td>
          <td>金融、訂單、不容忍 data loss</td>
      </tr>
      <tr>
          <td>Sync <code>on</code> + ANY 2 quorum</td>
          <td>-15% ~ -30%</td>
          <td>同上、跨 AZ</td>
          <td>強 durability + multi-AZ HA</td>
      </tr>
      <tr>
          <td>Sync <code>remote_apply</code> + 1 standby</td>
          <td>-20% ~ -40%</td>
          <td>同上 + replay</td>
          <td>強一致 read on standby（少用、成本高）</td>
      </tr>
  </tbody>
</table>
<p>跨 AZ sync 通常加 1-3ms、跨 region 加 50-200ms — 寫密集 workload 跨 region sync 通常不划算、改用 <em>region-local sync + cross-region async chain</em>。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="patroni-ha">Patroni HA</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni</a> 是 PG HA 自動 failover 標準、依賴 DCS（etcd / Consul）+ 本文 replication topology。Patroni 自動：</p>
<ul>
<li>偵測 primary 失聯、promote 適合 standby</li>
<li>處理 timeline 分歧（pg_rewind）</li>
<li>重配 sync standby（避免 sync standby 失聯卡 primary）</li>
</ul>
<h3 id="logical-replication--debezium">Logical Replication + Debezium</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical replication + Debezium</a> 是 <em>跟 streaming replication 共用 WAL</em> 但不同 abstraction — logical decoding output event、streaming replication output physical bytes。Logical replication slot 跟 physical slot 共存、各自獨立 retention。</p>
<h3 id="pitr--wal-archiving">PITR + WAL Archiving</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> 用 <em>archive_command</em> 把 WAL ship 到 S3、跟 streaming replication 並行：</p>
<ul>
<li>Streaming：給 <em>活的 standby</em>（real-time read scaling / HA）</li>
<li>Archive：給 <em>PITR + 新 standby base backup source</em></li>
</ul>
<p>兩者使用同一 WAL stream、不衝突。</p>
<h3 id="connection-路由pgbouncer--readwrite-split">Connection 路由（PgBouncer + read/write split）</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer</a> 不做 read/write split（transaction pool 不看 SQL）。Read replica routing 通常用 <em>application-level</em> 或 <em>HAProxy 監控 standby health</em>。</p>
<h3 id="跟-mysql-replication-topology-對比">跟 MySQL Replication Topology 對比</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PG streaming replication</th>
          <th>MySQL replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>進度追蹤</td>
          <td>LSN（單一 byte offset）</td>
          <td>GTID 或 binlog (file, position)</td>
      </tr>
      <tr>
          <td>標準工具</td>
          <td>streaming replication（physical）+ logical</td>
          <td>binlog ROW format</td>
      </tr>
      <tr>
          <td>Sync 機制</td>
          <td><code>synchronous_commit</code> + standby names</td>
          <td>semi-sync plugin</td>
      </tr>
      <tr>
          <td>Quorum</td>
          <td><code>ANY N</code> syntax</td>
          <td><code>rpl_semi_sync_master_wait_for_slave_count</code></td>
      </tr>
      <tr>
          <td>Replay parallelism</td>
          <td>Single startup process</td>
          <td>Multi-thread (logical clock / writeset)</td>
      </tr>
      <tr>
          <td>Replica routing</td>
          <td>PgBouncer 不看 SQL、需外接</td>
          <td>ProxySQL 內建 query routing</td>
      </tr>
  </tbody>
</table>
<p>兩者 high-level 對等、低層機制有顯著差異。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PG Patroni HA</a>（HA failover、依賴本文 replication topology）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a>（不同 abstraction、共用 WAL）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PG PITR + WAL Archiving</a>（streaming + archive 並行）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PG PgBouncer</a>（connection pool、不做 read/write split）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（sibling、不同機制）</li>
<li><a href="/blog/backend/knowledge-cards/quorum/" data-link-title="Quorum" data-link-desc="分散式系統以多數節點同意作為提交或讀取有效性的門檻">quorum 卡片</a> / <a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">stale-read 卡片</a> / <a href="/blog/backend/knowledge-cards/eventual-consistency/" data-link-title="Eventual Consistency" data-link-desc="允許短暫不一致、最終收斂到同一資料狀態的一致性語意">eventual-consistency 卡片</a></li>
<li>官方：<a href="https://www.postgresql.org/docs/current/warm-standby.html">PG Streaming Replication</a> / <a href="https://www.postgresql.org/docs/current/app-pgbasebackup.html">pg_basebackup</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/online-schema-change/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/online-schema-change/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>online schema change&lt;/em> — 先看 PG ALTER 哪些已 fast catalog-only、再看 pg_repack / pg-osc 何時必要。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;p>跟 MySQL 不同：PG 大量 schema change &lt;em>內建&lt;/em> fast catalog-only 行為、不必走 ghost table tool。MySQL 對應的 gh-ost / pt-online-schema-change 之於 PG 是 &lt;em>少數場景才需要的 escape hatch&lt;/em>、不是 standard practice。&lt;/p>
&lt;p>寫作 OSC 時必須 &lt;em>先看 PG 自身 ALTER 行為&lt;/em>、確認真的需要再上 pg_repack / pg-osc — 否則徒增複雜度。&lt;/p>
&lt;h2 id="pg-alter-table-的-fast--slow-分類">PG ALTER TABLE 的 fast / slow 分類&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- ALTER TABLE 的操作大致三類&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="類-afast-catalog-only-1-秒metadata-改">類 A：Fast catalog-only（&amp;lt; 1 秒、metadata 改）&lt;/h3>
&lt;p>PG 9.4+ / 11+ 多數 ALTER 已 catalog-only：&lt;/p>
&lt;ul>
&lt;li>&lt;code>ADD COLUMN col TYPE NULL DEFAULT NULL&lt;/code> — 直接 metadata、不 rewrite&lt;/li>
&lt;li>&lt;code>ADD COLUMN col TYPE NOT NULL DEFAULT &amp;lt;constant&amp;gt;&lt;/code>（PG 11+）— optimizer 把 default 存在 metadata、舊 row read 時動態返回 default、不 rewrite&lt;/li>
&lt;li>&lt;code>DROP COLUMN&lt;/code> — metadata 標 dropped、實際 row 不 rewrite（VACUUM 之後逐步清理）&lt;/li>
&lt;li>&lt;code>ALTER COLUMN ... SET DEFAULT &amp;lt;constant&amp;gt;&lt;/code> — metadata&lt;/li>
&lt;li>&lt;code>RENAME COLUMN&lt;/code> / &lt;code>RENAME TABLE&lt;/code> — metadata&lt;/li>
&lt;li>&lt;code>ADD CONSTRAINT ... NOT VALID&lt;/code> — 標記 constraint 不 validate、之後 &lt;code>VALIDATE CONSTRAINT&lt;/code> 才 scan&lt;/li>
&lt;li>&lt;code>ALTER COLUMN ... TYPE&lt;/code> 同 binary-compat 類型（&lt;code>VARCHAR(10) → VARCHAR(20)&lt;/code>、&lt;code>TEXT → VARCHAR&lt;/code> 等）— catalog-only&lt;/li>
&lt;/ul>
&lt;p>這類 ALTER &lt;em>直接跑、不必任何工具&lt;/em>。&lt;/p>
&lt;h3 id="類-block-heavyrewrites-tableproduction-慎用">類 B：Lock heavy（rewrites table、production 慎用）&lt;/h3>
&lt;p>需要 &lt;em>rewrite 整張 table&lt;/em>、ACCESS EXCLUSIVE lock 整個 ALTER 期間：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>online schema change</em> — 先看 PG ALTER 哪些已 fast catalog-only、再看 pg_repack / pg-osc 何時必要。</p></blockquote>
<hr>
<p>跟 MySQL 不同：PG 大量 schema change <em>內建</em> fast catalog-only 行為、不必走 ghost table tool。MySQL 對應的 gh-ost / pt-online-schema-change 之於 PG 是 <em>少數場景才需要的 escape hatch</em>、不是 standard practice。</p>
<p>寫作 OSC 時必須 <em>先看 PG 自身 ALTER 行為</em>、確認真的需要再上 pg_repack / pg-osc — 否則徒增複雜度。</p>
<h2 id="pg-alter-table-的-fast--slow-分類">PG ALTER TABLE 的 fast / slow 分類</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- ALTER TABLE 的操作大致三類</span></span></span></code></pre></div><h3 id="類-afast-catalog-only-1-秒metadata-改">類 A：Fast catalog-only（&lt; 1 秒、metadata 改）</h3>
<p>PG 9.4+ / 11+ 多數 ALTER 已 catalog-only：</p>
<ul>
<li><code>ADD COLUMN col TYPE NULL DEFAULT NULL</code> — 直接 metadata、不 rewrite</li>
<li><code>ADD COLUMN col TYPE NOT NULL DEFAULT &lt;constant&gt;</code>（PG 11+）— optimizer 把 default 存在 metadata、舊 row read 時動態返回 default、不 rewrite</li>
<li><code>DROP COLUMN</code> — metadata 標 dropped、實際 row 不 rewrite（VACUUM 之後逐步清理）</li>
<li><code>ALTER COLUMN ... SET DEFAULT &lt;constant&gt;</code> — metadata</li>
<li><code>RENAME COLUMN</code> / <code>RENAME TABLE</code> — metadata</li>
<li><code>ADD CONSTRAINT ... NOT VALID</code> — 標記 constraint 不 validate、之後 <code>VALIDATE CONSTRAINT</code> 才 scan</li>
<li><code>ALTER COLUMN ... TYPE</code> 同 binary-compat 類型（<code>VARCHAR(10) → VARCHAR(20)</code>、<code>TEXT → VARCHAR</code> 等）— catalog-only</li>
</ul>
<p>這類 ALTER <em>直接跑、不必任何工具</em>。</p>
<h3 id="類-block-heavyrewrites-tableproduction-慎用">類 B：Lock heavy（rewrites table、production 慎用）</h3>
<p>需要 <em>rewrite 整張 table</em>、ACCESS EXCLUSIVE lock 整個 ALTER 期間：</p>
<ul>
<li><code>ALTER COLUMN ... TYPE</code> binary 不相容類型（<code>INT → BIGINT</code> 永遠 rewrite、<code>TEXT → INT</code> 也是）— 雖然語意「擴大」、底層 4-byte 跟 8-byte storage 不同、全表 rewrite + ACCESS EXCLUSIVE 不可省</li>
<li><code>ALTER COLUMN ... SET NOT NULL</code> 對既有 nullable column（要 scan 整 table）</li>
<li><code>ALTER COLUMN ... DROP IDENTITY</code></li>
<li><code>ALTER TABLE ... SET TABLESPACE</code></li>
</ul>
<p>這類 ALTER 對大表 <em>production 不能直接跑</em>、要 ghost table tool。</p>
<h3 id="類-cconcurrent-index--online-operation無-table-lock">類 C：Concurrent index / online operation（無 table lock）</h3>
<ul>
<li><code>CREATE INDEX CONCURRENTLY</code> — 不 lock 寫入、background build、慢但安全</li>
<li><code>REINDEX INDEX CONCURRENTLY</code>（PG 12+） — 同上</li>
<li><code>DROP INDEX CONCURRENTLY</code> — 短 ACCESS EXCLUSIVE lock 只在最後 swap</li>
</ul>
<h2 id="何時需要-ghost-table-tool">何時需要 ghost table tool</h2>
<p>只在以下場景才需要 pg_repack / pg-osc：</p>
<ol>
<li><strong>Rewrite-required type change</strong>（類 B <code>ALTER COLUMN TYPE</code>）對大表</li>
<li><strong>VACUUM FULL 替代</strong>：pg_repack 比 VACUUM FULL 安全（不 lock 整表）</li>
<li><strong>Bloat 重組</strong>：大表 dead tuple 累積、想完整 rewrite</li>
</ol>
<p>對「add column」「drop column」「create index」等場景 <em>PG 內建 fast 已夠</em>、不必 ghost table tool。</p>
<h2 id="tool-1pg_repack--trigger-based--雙-table-swap">Tool 1：pg_repack — Trigger-based + 雙 table swap</h2>
<p>pg_repack 是 PG community 標準 online table rewrite 工具：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">pg_repack -h primary.example.com -p <span class="m">5432</span> -d production -U postgres <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --table<span class="o">=</span>orders --no-superuser-check</span></span></code></pre></div><p><strong>Mechanism</strong>：</p>
<ol>
<li>CREATE <code>repack.table_&lt;oid&gt;</code> 跟原表同 schema</li>
<li>在原表加 3 個 trigger：INSERT / UPDATE / DELETE → 寫入 log table <code>repack.log_&lt;oid&gt;</code></li>
<li>從原表 <code>INSERT INTO repack.table_&lt;oid&gt; SELECT * FROM original</code> 複製 row</li>
<li>邊複製邊 apply log table 紀錄的變更</li>
<li>切換：rename 原表 → original_old、rename repack.table_<oid> → original（atomic）</li>
<li>Drop 舊原表跟 trigger / log</li>
</ol>
<p><strong>Trade-off</strong>：</p>
<ul>
<li><em>Trigger overhead</em>：每個 primary 寫入加 trigger 執行（10-30% 寫吞吐降）</li>
<li><em>FK 處理</em>：需要 drop &amp; re-create FK referencing original table（pg_repack 自動處理但有 lock window）</li>
<li>適用 <em>PG-version 綁定</em> — pg_repack 13 不能對 PG 14 cluster 跑</li>
</ul>
<p><strong>配置</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Primary 安裝
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_repack</span><span class="p">;</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Repack orders</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pg_repack -d production --table<span class="o">=</span>orders
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># 監控 lock：另一 session 跑 SELECT * FROM pg_stat_activity</span></span></span></code></pre></div><h2 id="tool-2pg-osc--pg-online-schema-change--wal-shipping-style">Tool 2：pg-osc / pg-online-schema-change — WAL-shipping style</h2>
<p><a href="https://github.com/shayonj/pg-osc">pg-osc</a>（Shayon Mukherjee、2023）是較新的工具、模仿 gh-ost mechanism：</p>
<p><strong>Mechanism</strong>：</p>
<ol>
<li>用 logical replication slot 從 primary WAL stream 變更</li>
<li>CREATE shadow table + 套 ALTER 變更</li>
<li>Stream WAL event 同步 shadow table（不靠 trigger）</li>
<li>完成後 swap</li>
</ol>
<p><strong>Trade-off</strong>：</p>
<ul>
<li><em>Primary 寫入 overhead</em>：0（WAL 已存在）</li>
<li>比 pg_repack 較新（社群驗證度低）</li>
<li>適合 <em>trigger overhead 不可接受</em> 的高吞吐 production</li>
</ul>
<p><strong>配置</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 用 gem install</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">gem install pg_online_schema_change
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Run</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">pg-online-schema-change perform <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  --alter-statement<span class="o">=</span><span class="s2">&#34;ALTER TABLE orders ADD COLUMN status VARCHAR(20)&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="se"></span>  --schema<span class="o">=</span>public <span class="se">\
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="se"></span>  --dbname<span class="o">=</span>production <span class="se">\
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="se"></span>  --host<span class="o">=</span>primary.example.com</span></span></code></pre></div><h2 id="配置-step-by-steppg_repack-為主">配置 step-by-step（pg_repack 為主）</h2>
<p>實務多數 PG OSC 用 pg_repack。pg-osc 是 high-write-throughput escape hatch。</p>
<h3 id="step-1安裝--確認版本">Step 1：安裝 + 確認版本</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 安裝 pg_repack（versioned）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_repack</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_available_extensions</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pg_repack&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 確認 installed_version 跟 PG major version 對齊</span></span></span></code></pre></div><h3 id="step-2跑-pg_repack">Step 2：跑 pg_repack</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">pg_repack -h primary -d production -U postgres <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --table<span class="o">=</span>orders <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --jobs<span class="o">=</span><span class="m">4</span> <span class="se">\ </span>                      <span class="c1"># 並行 worker</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  --wait-timeout<span class="o">=</span><span class="m">60</span> <span class="se">\ </span>             <span class="c1"># 等 lock 超時（秒）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">  --no-kill-backend                <span class="c1"># 不主動 kill 卡 lock 的 query</span></span></span></code></pre></div><h3 id="step-3監控">Step 3：監控</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 看 pg_repack 進度
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pid</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">,</span><span class="w"> </span><span class="k">state</span><span class="p">,</span><span class="w"> </span><span class="n">wait_event_type</span><span class="p">,</span><span class="w"> </span><span class="n">wait_event</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_activity</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">query</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;%repack%&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- 看 lock 狀態
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_locks</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">relation</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">oid</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">relname</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;repack.table_xxx&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><h3 id="step-4驗證">Step 4：驗證</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 跑完後對比 row count + 抽樣 query
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- 跟 pg_repack 之前 count 對比</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-alter-直接跑沒看是不是-fast-變-lock-heavy">1. ALTER 直接跑沒看是不是 fast 變 lock heavy</h3>
<p><code>ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending'</code> — 預期 catalog-only（PG 11+）、但若 PG 10 跑這個就會 rewrite 整表、ACCESS EXCLUSIVE lock 幾小時。</p>
<p>修法：</p>
<ul>
<li>寫 schema migration 前 <em>確認 PG version</em></li>
<li>看 <a href="https://www.postgresql.org/docs/current/sql-altertable.html">PG ALTER doc</a>、each subcommand 標 <em>Note</em> 段是否 fast</li>
<li>Production 跑前 staging 測 + 監控 <code>pg_stat_activity</code> lock wait</li>
</ul>
<h3 id="2-vacuum-full-誤用--production-downtime">2. VACUUM FULL 誤用 — Production downtime</h3>
<p><code>VACUUM FULL</code> 等於「rewrite 整表 + ACCESS EXCLUSIVE lock」。Production 跑 = 表變 unavailable 幾分鐘到幾小時。</p>
<p>修法：</p>
<ul>
<li><em>永遠用 pg_repack</em> 取代 VACUUM FULL（除非 maintenance window）</li>
<li>對 bloat 議題、定期跑 pg_repack</li>
<li>autovacuum tuning 第一優先（<a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a> 詳細）</li>
</ul>
<h3 id="3-pg_repack-version-mismatch">3. pg_repack version mismatch</h3>
<p>PG cluster 升 14、但 <code>pg_repack</code> extension 還是 13 版本。試 ALTER 跑 <code>pg_repack</code> 命令、ERROR: <code>program &quot;pg_repack 14.x&quot; does not match installed extension &quot;pg_repack 13.x&quot;</code>。</p>
<p>修法：</p>
<ul>
<li>升 PG cluster 後 <em>立即 ALTER EXTENSION pg_repack UPDATE</em></li>
<li>若 pg_repack 還沒釋出對應 PG 版本（早期升級）、暫時用 pg-osc 替代或等待</li>
<li>升級 runbook 紀錄 pg_repack 是 <em>必同步升級的 extension</em></li>
</ul>
<h3 id="4-create-index-concurrently-失敗清理">4. CREATE INDEX CONCURRENTLY 失敗清理</h3>
<p><code>CREATE INDEX CONCURRENTLY</code> 跑到一半被 cancel（用戶 Ctrl-C / connection drop）、產生 <em>invalid index</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">indexrelid</span><span class="p">::</span><span class="n">regclass</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_index</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">indisvalid</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- 顯示一個 idx_orders_status_invalid</span></span></span></code></pre></div><p>Invalid index 仍佔 disk、但 optimizer 不會用。</p>
<p>修法：</p>
<ul>
<li>跑 <code>DROP INDEX CONCURRENTLY idx_orders_status_invalid</code></li>
<li>之後重新 <code>CREATE INDEX CONCURRENTLY</code></li>
<li>避免在 connection 不穩的 session 跑長時間 CREATE INDEX CONCURRENTLY、改用 cron 或 deploy pipeline</li>
</ul>
<h3 id="5-generated-stored-column-不能-online-add">5. Generated stored column 不能 online ADD</h3>
<p><code>ADD COLUMN total NUMERIC GENERATED ALWAYS AS (price * qty) STORED</code> — <em>stored</em> generated column 必須 rewrite 整表計算 column value、不是 catalog-only。</p>
<p>修法：</p>
<ul>
<li>
<p>用 <code>GENERATED ALWAYS AS (...) VIRTUAL</code>（PG 18+）— 不存實際 value、catalog-only</p>
</li>
<li>
<p>或 <em>先加 nullable column + backfill + 加 NOT NULL constraint</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="nb">NUMERIC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">price</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">qty</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="p">...;</span><span class="w">  </span><span class="c1">-- chunked
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">ALTER</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 之後加 trigger 或 application 層維護 total</span></span></span></code></pre></div></li>
<li>
<p>或用 pg_repack 跑 rewrite ADD GENERATED STORED</p>
</li>
</ul>
<h2 id="容量--時間估算">容量 / 時間估算</h2>
<p>對 100 GB 表、ADD COLUMN 加 index 為例：</p>
<table>
  <thead>
      <tr>
          <th>操作</th>
          <th>時間</th>
          <th>Lock 影響</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>ADD COLUMN col TYPE NULL</code> (PG 11+)</td>
          <td>&lt; 1 秒</td>
          <td>ACCESS EXCLUSIVE（毫秒級）</td>
      </tr>
      <tr>
          <td><code>ADD COLUMN col TYPE NOT NULL DEFAULT 0</code> (PG 11+)</td>
          <td>&lt; 1 秒</td>
          <td>ACCESS EXCLUSIVE（毫秒級）</td>
      </tr>
      <tr>
          <td><code>CREATE INDEX CONCURRENTLY</code></td>
          <td>2-6 小時</td>
          <td>無 table lock</td>
      </tr>
      <tr>
          <td><code>pg_repack table</code></td>
          <td>4-8 小時</td>
          <td>短 ACCESS EXCLUSIVE（swap）</td>
      </tr>
      <tr>
          <td><code>ALTER COLUMN TYPE</code> rewrite</td>
          <td>4-8 小時</td>
          <td>ACCESS EXCLUSIVE 全程</td>
      </tr>
      <tr>
          <td><code>VACUUM FULL</code></td>
          <td>同 pg_repack</td>
          <td>ACCESS EXCLUSIVE 全程（不要跑）</td>
      </tr>
  </tbody>
</table>
<h2 id="跟-mysql-gh-ost--pt-osc-對照">跟 MySQL gh-ost / pt-osc 對照</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PG pg_repack</th>
          <th>PG pg-osc</th>
          <th>MySQL gh-ost</th>
          <th>MySQL pt-osc</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>機制</td>
          <td>Trigger + log table</td>
          <td>WAL logical stream</td>
          <td>Binlog stream</td>
          <td>Trigger + log table</td>
      </tr>
      <tr>
          <td>Primary 寫 overhead</td>
          <td>中（trigger）</td>
          <td>0（WAL 已存在）</td>
          <td>0（binlog 已存在）</td>
          <td>中（trigger）</td>
      </tr>
      <tr>
          <td>Throttle 支援</td>
          <td>部分</td>
          <td>支援</td>
          <td>強</td>
          <td>部分</td>
      </tr>
      <tr>
          <td>Pause / Resume</td>
          <td>不支援</td>
          <td>不支援</td>
          <td>支援</td>
          <td>不支援</td>
      </tr>
      <tr>
          <td>工具成熟度</td>
          <td>高</td>
          <td>中（2023+）</td>
          <td>高</td>
          <td>高</td>
      </tr>
      <tr>
          <td>Use case 比例</td>
          <td>PG 主流（90% case）</td>
          <td>高吞吐 escape hatch</td>
          <td>MySQL 主流（dev）</td>
          <td>MySQL legacy + FK</td>
      </tr>
  </tbody>
</table>
<p>PG OSC tool 使用頻率比 MySQL 低 — 因為 PG 內建 fast ALTER 已 cover 90% schema change、ghost table tool 只對 <em>少數 rewrite-required</em> 場景。</p>
<p>詳見 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a> — sibling、不同 use case mix。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>ALTER TABLE / pg_repack / pg-osc 都產生 WAL、會 replicate 到 standby。Standby 上的 long-running query 可能跟 ALTER 衝突、被 <code>hot_standby_feedback</code> 影響 primary autovacuum。詳見 <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>。</p>
<h3 id="跟-autovacuum-tuning">跟 Autovacuum Tuning</h3>
<p>Schema change 後常產生 dead tuple、autovacuum 需要重新 cover。詳見 <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>。</p>
<h3 id="跟-logical-replication">跟 Logical Replication</h3>
<p>logical replication 透過 publication / subscription 同步 — DDL <em>不會</em> logical replicate（PG 16 之前）、必須 <em>在 publisher / subscriber 各自跑 DDL</em>。詳見 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>。</p>
<h3 id="跟-patroni-ha">跟 Patroni HA</h3>
<p>Patroni promote 新 primary 後、pg_repack extension state（slot / catalog）跟著走、新 primary 仍可繼續 pg_repack。詳見 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a>。</p>
<h2 id="何時用哪個">何時用哪個</h2>
<table>
  <thead>
      <tr>
          <th>情境</th>
          <th>選擇</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ADD COLUMN nullable / DROP COLUMN / RENAME 等</td>
          <td>直接 ALTER（fast catalog-only）</td>
      </tr>
      <tr>
          <td>CREATE INDEX 大表</td>
          <td><code>CREATE INDEX CONCURRENTLY</code></td>
      </tr>
      <tr>
          <td>ALTER COLUMN TYPE rewrite（大表）</td>
          <td>pg_repack</td>
      </tr>
      <tr>
          <td>Bloat 重組</td>
          <td>pg_repack</td>
      </tr>
      <tr>
          <td>高吞吐 + trigger overhead 不可接受</td>
          <td>pg-osc</td>
      </tr>
      <tr>
          <td>ADD GENERATED STORED column</td>
          <td>nullable + backfill + constraint</td>
      </tr>
      <tr>
          <td>Cluster on Cloud（RDS / Aurora）</td>
          <td>RDS / Aurora 內建 fast DDL 多數已 cover、pg_repack 視 vendor 支援</td>
      </tr>
  </tbody>
</table>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（ALTER 跟 streaming replication 互動）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a>（schema change 後 vacuum 議題）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a>（DDL 不 replicate 議題）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PG Patroni HA</a>（HA 跟 pg_repack 整合）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a>（sibling、tool ecosystem 不同）</li>
<li><a href="/blog/backend/knowledge-cards/expand-contract/" data-link-title="Expand / Contract" data-link-desc="說明先擴充相容面、再收斂舊路徑的遷移做法">Expand / Contract 卡片</a>（schema migration 設計原則）</li>
<li>官方：<a href="https://www.postgresql.org/docs/current/sql-altertable.html">ALTER TABLE</a> / <a href="https://github.com/reorg/pg_repack">pg_repack GitHub</a> / <a href="https://github.com/shayonj/pg-osc">pg-osc GitHub</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Connection Scaling：process-per-connection model 跟為什麼 pooler 是必裝</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-scaling/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-scaling/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>connection scaling 的根因&lt;/em> — 為什麼 PG 比多數 DB 更需要 pooler、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &amp;#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config&lt;/a> 是 &lt;em>根因 vs 配置&lt;/em> 的關係。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="connection-per-process-model-是-pg-的結構性選擇">Connection-per-Process Model 是 PG 的結構性選擇&lt;/h2>
&lt;p>PG 接受 client connection 時的行為跟多數現代 DB 不同：每個 connection 由 postmaster &lt;code>fork()&lt;/code> 一個獨立的 OS process（backend）來服務。這個 process 在 connection lifetime 內專屬該 client、不跟其他 client 共享。&lt;/p>
&lt;p>對比常見 DB 的 connection model：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Vendor&lt;/th>
 &lt;th>Connection model&lt;/th>
 &lt;th>每 connection 資源&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PostgreSQL&lt;/td>
 &lt;td>Process-per-connection（fork）&lt;/td>
 &lt;td>5-15MB RAM、獨立 PID&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>MySQL&lt;/td>
 &lt;td>Thread-per-connection&lt;/td>
 &lt;td>256KB-2MB RAM、共享 process&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Oracle&lt;/td>
 &lt;td>Shared server / dedicated 可選&lt;/td>
 &lt;td>配置決定&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>SQL Server&lt;/td>
 &lt;td>Thread-per-connection（pooled）&lt;/td>
 &lt;td>~512KB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>MongoDB&lt;/td>
 &lt;td>Thread-per-connection&lt;/td>
 &lt;td>~1MB&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>PG 選 process 不選 thread 是 1990s 設計決定 — 當時 thread library 在多 UNIX 平台不穩定、process 隔離性更好（一個 backend crash 不會帶倒整個 DB）。這個 trade-off 一路保留到今天、是 PG 在 high-connection-count workload 的 &lt;em>結構性負擔&lt;/em>。&lt;/p>
&lt;h2 id="量化connection-數量對-ram-跟-cpu-的壓力">量化：connection 數量對 RAM 跟 CPU 的壓力&lt;/h2>
&lt;p>一個 PG backend process 的 RAM footprint 由三部分組成：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">backend_rss ≈ shared_buffers_attach + process_private + work_mem 高水位&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>shared_buffers&lt;/code> 是所有 backend 共享的、不重複計、但 &lt;code>process_private&lt;/code>（catalog cache / plan cache / temp buffer）跟 &lt;code>work_mem&lt;/code> 是 per-backend：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Workload 類型&lt;/th>
 &lt;th>process_private&lt;/th>
 &lt;th>work_mem 高水位&lt;/th>
 &lt;th>單 backend RAM&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Idle / 簡單 OLTP&lt;/td>
 &lt;td>3-5MB&lt;/td>
 &lt;td>4MB&lt;/td>
 &lt;td>7-9MB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>中等 query（join / sort）&lt;/td>
 &lt;td>5-8MB&lt;/td>
 &lt;td>16-64MB&lt;/td>
 &lt;td>21-72MB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Heavy analytical（CTE / window）&lt;/td>
 &lt;td>8-15MB&lt;/td>
 &lt;td>256MB+&lt;/td>
 &lt;td>264MB+&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>500 個 connection、平均 30MB 各 ≈ 15GB RAM 給 backend processes（還沒算 shared_buffers）。這是 PG 在 cloud instance 上很快撞到 RAM ceiling 的根因。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>connection scaling 的根因</em> — 為什麼 PG 比多數 DB 更需要 pooler、跟 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a> 是 <em>根因 vs 配置</em> 的關係。</p></blockquote>
<hr>
<h2 id="connection-per-process-model-是-pg-的結構性選擇">Connection-per-Process Model 是 PG 的結構性選擇</h2>
<p>PG 接受 client connection 時的行為跟多數現代 DB 不同：每個 connection 由 postmaster <code>fork()</code> 一個獨立的 OS process（backend）來服務。這個 process 在 connection lifetime 內專屬該 client、不跟其他 client 共享。</p>
<p>對比常見 DB 的 connection model：</p>
<table>
  <thead>
      <tr>
          <th>Vendor</th>
          <th>Connection model</th>
          <th>每 connection 資源</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PostgreSQL</td>
          <td>Process-per-connection（fork）</td>
          <td>5-15MB RAM、獨立 PID</td>
      </tr>
      <tr>
          <td>MySQL</td>
          <td>Thread-per-connection</td>
          <td>256KB-2MB RAM、共享 process</td>
      </tr>
      <tr>
          <td>Oracle</td>
          <td>Shared server / dedicated 可選</td>
          <td>配置決定</td>
      </tr>
      <tr>
          <td>SQL Server</td>
          <td>Thread-per-connection（pooled）</td>
          <td>~512KB</td>
      </tr>
      <tr>
          <td>MongoDB</td>
          <td>Thread-per-connection</td>
          <td>~1MB</td>
      </tr>
  </tbody>
</table>
<p>PG 選 process 不選 thread 是 1990s 設計決定 — 當時 thread library 在多 UNIX 平台不穩定、process 隔離性更好（一個 backend crash 不會帶倒整個 DB）。這個 trade-off 一路保留到今天、是 PG 在 high-connection-count workload 的 <em>結構性負擔</em>。</p>
<h2 id="量化connection-數量對-ram-跟-cpu-的壓力">量化：connection 數量對 RAM 跟 CPU 的壓力</h2>
<p>一個 PG backend process 的 RAM footprint 由三部分組成：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">backend_rss ≈ shared_buffers_attach + process_private + work_mem 高水位</span></span></code></pre></div><p><code>shared_buffers</code> 是所有 backend 共享的、不重複計、但 <code>process_private</code>（catalog cache / plan cache / temp buffer）跟 <code>work_mem</code> 是 per-backend：</p>
<table>
  <thead>
      <tr>
          <th>Workload 類型</th>
          <th>process_private</th>
          <th>work_mem 高水位</th>
          <th>單 backend RAM</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Idle / 簡單 OLTP</td>
          <td>3-5MB</td>
          <td>4MB</td>
          <td>7-9MB</td>
      </tr>
      <tr>
          <td>中等 query（join / sort）</td>
          <td>5-8MB</td>
          <td>16-64MB</td>
          <td>21-72MB</td>
      </tr>
      <tr>
          <td>Heavy analytical（CTE / window）</td>
          <td>8-15MB</td>
          <td>256MB+</td>
          <td>264MB+</td>
      </tr>
  </tbody>
</table>
<p>500 個 connection、平均 30MB 各 ≈ 15GB RAM 給 backend processes（還沒算 shared_buffers）。這是 PG 在 cloud instance 上很快撞到 RAM ceiling 的根因。</p>
<p>CPU 層面、<code>fork()</code> 系統呼叫在 Linux 通常 1-3ms、context switch ~3-5μs。100 connection burst 在 1 秒內進來、accumulated fork cost 100-300ms、加 query 本身的 CPU 跟 scheduler latency、平均 query 延遲會跳 2-5x。</p>
<h2 id="三個-guc-互動max_connections--shared_buffers--work_mem">三個 GUC 互動：max_connections / shared_buffers / work_mem</h2>
<p>PG 的 memory 規劃由這三個 GUC 互動決定、不能獨立調：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">total_RAM ≈ shared_buffers + (max_connections × work_mem 高水位) + OS overhead</span></span></code></pre></div><p>實務 sizing 規則（16GB instance、OLTP workload）：</p>
<table>
  <thead>
      <tr>
          <th>GUC</th>
          <th>建議值</th>
          <th>理由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>shared_buffers</code></td>
          <td>25% RAM（4GB）</td>
          <td>太大 OS file cache 收益遞減、&lt; 25% wastes RAM</td>
      </tr>
      <tr>
          <td><code>work_mem</code></td>
          <td>8-32MB</td>
          <td>每 query operation 用一份、不是每 connection 一份</td>
      </tr>
      <tr>
          <td><code>max_connections</code></td>
          <td>100-200</td>
          <td>超過 200 需 pooler、不是調更大</td>
      </tr>
      <tr>
          <td><code>effective_cache_size</code></td>
          <td>50-75% RAM</td>
          <td>planner 估 cost 用、不是實際配置</td>
      </tr>
      <tr>
          <td><code>maintenance_work_mem</code></td>
          <td>64-512MB</td>
          <td>VACUUM / CREATE INDEX 用</td>
      </tr>
  </tbody>
</table>
<p><code>max_connections = 1000</code> 是常見 anti-pattern — 真實 active query 可能只 50-100、剩下都 idle、但每個還是吃 RAM 跟 process slot、context switch overhead 還在。</p>
<h2 id="pooler-為什麼是-production-prerequisite">Pooler 為什麼是 <em>production prerequisite</em></h2>
<blockquote>
<p>本段是「為什麼必裝」、實際 PgBouncer 配置看 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a>。</p></blockquote>
<p>Pooler 的核心責任是 <em>把 N 個 application connection multiplex 成 M 個 PG backend（M ≪ N）</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Application (3000 connection)
</span></span><span class="line"><span class="ln">2</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">Pooler（PgBouncer / PgCat）
</span></span><span class="line"><span class="ln">4</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">5</span><span class="cl">PostgreSQL (50 backend process)</span></span></code></pre></div><p>Application 看到的是 <em>無限 connection 池</em>、PG 看到的是 <em>穩定 50 個 backend</em>。三個層次的效益：</p>
<ol>
<li><strong>RAM 節省</strong>：3000 connection × 30MB = 90GB → 50 backend × 30MB = 1.5GB</li>
<li><strong>Fork() cost 攤平</strong>：backend 重用、不是每個 client 都 fork</li>
<li><strong>Connection storm 緩衝</strong>：application 重啟 / scaling event 不會直接打到 PG</li>
</ol>
<p>Pooler 有三種 pool mode、各有 application 層相容性 trade-off：</p>
<table>
  <thead>
      <tr>
          <th>Pool mode</th>
          <th>Session 隔離</th>
          <th>適用 application</th>
          <th>PG feature 限制</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Session</td>
          <td>每 client 獨佔 1 backend</td>
          <td>用 prepared statement、SET、temp table</td>
          <td>等同沒 pool、僅救 fork cost</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>每 transaction 換 backend</td>
          <td>多數 stateless API（最常用）</td>
          <td>不能用 session-level state</td>
      </tr>
      <tr>
          <td>Statement</td>
          <td>每 statement 換 backend</td>
          <td>Read-only / analytical</td>
          <td>不能用 transaction</td>
      </tr>
  </tbody>
</table>
<p>Production 多數選 transaction pool — 救 RAM 又保留 transaction semantics、代價是 application 不能用 session-level <code>SET</code>、<code>LISTEN/NOTIFY</code>、prepared statement（部分 pooler 已支援）。</p>
<h2 id="application-side-pool-vs-middleware-pool-vs-rds-proxy">Application-side Pool vs Middleware Pool vs RDS Proxy</h2>
<p>三層 pool 都能解 connection 問題、但解的問題不同：</p>
<table>
  <thead>
      <tr>
          <th>層級</th>
          <th>代表</th>
          <th>解的問題</th>
          <th>限制</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application-side（driver）</td>
          <td>HikariCP（Java）/ pgx pool（Go）/ asyncpg / Sequelize</td>
          <td>Connection 重用 + lifecycle 管理</td>
          <td>仍每 app instance 開 N 個到 PG、總量沒收斂</td>
      </tr>
      <tr>
          <td>Middleware pooler</td>
          <td>PgBouncer / PgCat</td>
          <td>Multiplex 所有 application instance 到少數 backend</td>
          <td>多一跳 latency 0.1-1ms、需自管 HA</td>
      </tr>
      <tr>
          <td>Cloud-managed proxy</td>
          <td>RDS Proxy / Cloud SQL Proxy</td>
          <td>Multiplex + IAM auth + Secrets Manager integration</td>
          <td>Latency 1-3ms、cost premium、PG feature 受限</td>
      </tr>
  </tbody>
</table>
<p><strong>典型 production 拓撲</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Application (HikariCP pool 10/instance × 50 instance = 500)
</span></span><span class="line"><span class="ln">2</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">PgBouncer transaction pool（50 backend）
</span></span><span class="line"><span class="ln">4</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">5</span><span class="cl">PostgreSQL primary</span></span></code></pre></div><p>Application pool 救 fork cost、PgBouncer 救 backend 總量、兩層各做各的事不衝突。</p>
<p><strong>雙層 pool 配置容易出錯</strong>：application pool size 5 + PgBouncer default_pool_size 50 + 100 個 app instance、application 願意開 500 connection、PgBouncer 只給 50 個 backend — 多 450 個 application connection wait、看起來像「DB 慢」但實際是 pool 不足。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="case-1connection-storm重啟--autoscale-同時打進來">Case 1：Connection storm（重啟 / autoscale 同時打進來）</h3>
<p><strong>情境</strong>：Kubernetes rolling restart、200 個 pod 同時重連、每 pod 開 20 個 connection、瞬間 4000 個 connection 嘗試打到 PG。</p>
<p>PG <code>max_connections = 500</code> 直接拒絕 3500 個、application 看到 <code>FATAL: sorry, too many clients already</code>、retry storm 雪上加霜。</p>
<p>修法：</p>
<ul>
<li>PgBouncer 在前面、application 連 PgBouncer 不直連 PG</li>
<li><code>reserve_pool_size = 5</code> 給管理流量留 buffer</li>
<li>Application 端加 jittered exponential backoff、避免 retry 同步</li>
</ul>
<h3 id="case-2fork-cost-在-burst-流量">Case 2：fork() cost 在 burst 流量</h3>
<p><strong>情境</strong>：Cron job 每分鐘整點觸發、500 個 worker 同時開 short-lived connection 跑 30ms query、結束關閉。</p>
<p>每分鐘 500 次 <code>fork()</code> + 500 次 <code>exit()</code>、fork cost 500-1500ms、CPU spike、其他 OLTP query 延遲飆。</p>
<p>修法：</p>
<ul>
<li>Worker 改 connect 到 PgBouncer transaction pool、backend 重用、fork 只在 PgBouncer 首次拓展時</li>
<li>或 worker 改成 long-lived process + 內部 task queue、避免每分鐘重 fork</li>
</ul>
<h3 id="case-3shared_buffers-跟-max_connections-互相壓縮">Case 3：shared_buffers 跟 max_connections 互相壓縮</h3>
<p><strong>情境</strong>：16GB instance、<code>shared_buffers = 8GB</code>（50%）、<code>max_connections = 800</code>、<code>work_mem = 16MB</code>。</p>
<p>預估 RAM：8GB + 800 × ~30MB = 32GB ≫ 16GB instance、OOM kill 來訪。</p>
<p>修法（重新分配）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">shared_buffers</span> <span class="o">=</span> <span class="s">4GB           # 25%</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">max_connections</span> <span class="o">=</span> <span class="s">200          # 透過 PgBouncer multiplex</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">work_mem</span> <span class="o">=</span> <span class="s">16MB</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">effective_cache_size</span> <span class="o">=</span> <span class="s">12GB</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">maintenance_work_mem</span> <span class="o">=</span> <span class="s">512MB</span></span></span></code></pre></div><p>關鍵：<code>max_connections</code> 不是調更大救 connection 不足、是調 <em>PgBouncer pool size</em> 拓展 application 容量。</p>
<h3 id="case-4double-pool-配置失敗">Case 4：Double-pool 配置失敗</h3>
<p><strong>情境</strong>：Application HikariCP pool size = 50、50 個 instance、PgBouncer <code>default_pool_size = 20</code>、PG <code>max_connections = 100</code>。</p>
<p>Application 願意開 2500 個 connection、PgBouncer 只給 20 個 backend、application thread 大量 block 在 PgBouncer 等 backend 釋出。</p>
<p>修法：</p>
<ul>
<li>計算 <em>application 願意的並發</em> vs <em>PgBouncer 允許的 backend</em> vs <em>PG max_connections</em> 三層匹配</li>
<li>通常 <code>application_total_connection ≪ pgbouncer_max_client_conn</code> + <code>pgbouncer_default_pool_size + reserve ≪ pg_max_connections</code></li>
<li>Monitor PgBouncer <code>SHOW POOLS</code> 的 <code>cl_waiting</code>、長期 &gt; 0 表示 pool 不足</li>
</ul>
<h3 id="case-5max_connections-設太大反而慢">Case 5：max_connections 設太大反而慢</h3>
<p><strong>情境</strong>：team 看到 <code>connection refused</code>、把 <code>max_connections</code> 從 200 調到 2000、想說「給更多 connection 應該更好」。</p>
<p>調完 throughput 反而降 30% — context switch overhead、planner cache 競爭、lock manager 競爭都跟 connection 數線性放大。</p>
<p>修法：</p>
<ul>
<li><code>max_connections</code> 上限通常 200-500、超過要靠 pooler multiplex</li>
<li>用 <code>pg_stat_activity</code> 看真實 active connection（state != &lsquo;idle&rsquo;）、通常 &lt; 100</li>
<li>真實上限 = active 高水位 × 安全係數 1.5、不是「未來可能會用到的數量」</li>
</ul>
<h2 id="跟-mysql-connection-model-對比">跟 MySQL connection model 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PostgreSQL</th>
          <th>MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Connection 模型</td>
          <td>Process-per-connection（fork）</td>
          <td>Thread-per-connection</td>
      </tr>
      <tr>
          <td>單 connection RAM</td>
          <td>5-15MB（idle）/ 30-200MB（heavy）</td>
          <td>256KB-2MB</td>
      </tr>
      <tr>
          <td>Fork / spawn cost</td>
          <td>1-3ms</td>
          <td>&lt; 100μs</td>
      </tr>
      <tr>
          <td>Pooler 必要性</td>
          <td><strong>強烈必要</strong>（300+ connection 必裝）</td>
          <td>中等（ProxySQL 對特定 case 有用）</td>
      </tr>
      <tr>
          <td>主流 pooler</td>
          <td>PgBouncer / PgCat</td>
          <td>ProxySQL / MySQL Router</td>
      </tr>
  </tbody>
</table>
<p>MySQL thread-per-connection model 讓它在 high-connection-count workload 上 <em>看起來</em> 更省 — 但 PG 透過 PgBouncer 達到的 application 看到的容量跟 MySQL 直連是一樣的、只是多一層 indirection。</p>
<p>實務影響：</p>
<ul>
<li>MySQL 直連 1000 connection 還 OK、PG 直連 1000 connection 通常 OOM</li>
<li>PG + PgBouncer 1000 application connection、後端 50 backend、表現跟 MySQL 1000 直連相當</li>
<li>沒有 <em>PG 更耗 RAM</em> 的本質結論、是 <em>PG 預設不 multiplex、需要外掛 multiplex 層</em></li>
</ul>
<h2 id="pg-17-的-connection-進展">PG 17+ 的 connection 進展</h2>
<p>PG 17（2024）對 connection 仍維持 process-per-connection、但有幾個減壓改進：</p>
<ul>
<li><strong>Per-process memory 降低</strong>：catalog cache 改 generational allocator、idle backend RAM 降 ~20%</li>
<li><strong>Subscriber-side parallel apply</strong>：logical replication 減少 connection 開銷</li>
<li><strong><code>io_combine_limit</code></strong>：buffered read 合併、降 syscall overhead</li>
</ul>
<p>但 <em>process-per-connection model 本身</em> 沒換 — 短期內 PG 仍需 pooler。長期方向（PG 18+ 討論）可能引入 thread-based backend、但目前是 experimental patch。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a>：PgBouncer 操作配置 + 5 case</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">replication-topology</a>：Read replica + connection 分流</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>：<code>work_mem</code> 影響 plan</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">mvcc-lock-model</a>：connection idle in transaction 卡 vacuum</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a>：autovacuum 也吃 connection slot</li>
</ul>
<h2 id="下一步">下一步</h2>
<ul>
<li>連到 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a> 學配置細節</li>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 回到全圖</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/index-selection/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/index-selection/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>index 選型&lt;/em> — 何時用哪種 index、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization&lt;/a> 的「為什麼這個 plan 慢」互補。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="6-種-index-method-對應-workload">6 種 Index Method 對應 Workload&lt;/h2>
&lt;p>PG 有 6 種 index access method、各有自己擅長的 query pattern：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Index method&lt;/th>
 &lt;th>適用 query pattern&lt;/th>
 &lt;th>典型 column type&lt;/th>
 &lt;th>儲存成本&lt;/th>
 &lt;th>&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>B-tree&lt;/td>
 &lt;td>&lt;code>=&lt;/code> / &lt;code>&amp;lt;&lt;/code> / &lt;code>&amp;gt;&lt;/code> / &lt;code>BETWEEN&lt;/code> / &lt;code>IS NULL&lt;/code> / &lt;code>LIKE 'prefix%'&lt;/code>&lt;/td>
 &lt;td>任何 scalar、最常用&lt;/td>
 &lt;td>中&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Hash&lt;/td>
 &lt;td>純 &lt;code>=&lt;/code> 比對&lt;/td>
 &lt;td>scalar、不常用&lt;/td>
 &lt;td>低&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>GIN&lt;/td>
 &lt;td>&lt;code>@&amp;gt;&lt;/code> / &lt;code>?&lt;/code> / `?&lt;/td>
 &lt;td>` / FTS / array 包含&lt;/td>
 &lt;td>JSONB / tsvector / array&lt;/td>
 &lt;td>高（write 慢）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>GiST&lt;/td>
 &lt;td>範圍 / 空間 / 自訂 operator&lt;/td>
 &lt;td>geometry / tsvector / range&lt;/td>
 &lt;td>中&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>SP-GiST&lt;/td>
 &lt;td>Non-balanced 樹結構&lt;/td>
 &lt;td>IP / phone prefix / quad-tree&lt;/td>
 &lt;td>中&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>BRIN&lt;/td>
 &lt;td>大表的 range scan、physical order 跟 logical order 相關&lt;/td>
 &lt;td>timestamp / id（append-only）&lt;/td>
 &lt;td>極低&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>選錯 index 的代價：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>index 選型</em> — 何時用哪種 index、跟 <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a> 的「為什麼這個 plan 慢」互補。</p></blockquote>
<hr>
<h2 id="6-種-index-method-對應-workload">6 種 Index Method 對應 Workload</h2>
<p>PG 有 6 種 index access method、各有自己擅長的 query pattern：</p>
<table>
  <thead>
      <tr>
          <th>Index method</th>
          <th>適用 query pattern</th>
          <th>典型 column type</th>
          <th>儲存成本</th>
          <th></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B-tree</td>
          <td><code>=</code> / <code>&lt;</code> / <code>&gt;</code> / <code>BETWEEN</code> / <code>IS NULL</code> / <code>LIKE 'prefix%'</code></td>
          <td>任何 scalar、最常用</td>
          <td>中</td>
          <td></td>
      </tr>
      <tr>
          <td>Hash</td>
          <td>純 <code>=</code> 比對</td>
          <td>scalar、不常用</td>
          <td>低</td>
          <td></td>
      </tr>
      <tr>
          <td>GIN</td>
          <td><code>@&gt;</code> / <code>?</code> / `?</td>
          <td>` / FTS / array 包含</td>
          <td>JSONB / tsvector / array</td>
          <td>高（write 慢）</td>
      </tr>
      <tr>
          <td>GiST</td>
          <td>範圍 / 空間 / 自訂 operator</td>
          <td>geometry / tsvector / range</td>
          <td>中</td>
          <td></td>
      </tr>
      <tr>
          <td>SP-GiST</td>
          <td>Non-balanced 樹結構</td>
          <td>IP / phone prefix / quad-tree</td>
          <td>中</td>
          <td></td>
      </tr>
      <tr>
          <td>BRIN</td>
          <td>大表的 range scan、physical order 跟 logical order 相關</td>
          <td>timestamp / id（append-only）</td>
          <td>極低</td>
          <td></td>
      </tr>
  </tbody>
</table>
<p>選錯 index 的代價：</p>
<ul>
<li><strong>Write workload</strong>：每 write 都更新所有相關 index、5 個 unused index = 5x write 放大</li>
<li><strong>Storage</strong>：JSONB 加 GIN 可能比表本身還大</li>
<li><strong>Plan misjudge</strong>：planner 看到 index 不一定用、<code>EXPLAIN</code> 才確認</li>
</ul>
<h2 id="b-tree預設選擇95-workload-適用">B-tree：預設選擇、95% workload 適用</h2>
<p>B-tree 是 PG 預設 index、CREATE INDEX 不指定 method 就是 B-tree：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_user_id</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_created_at</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">);</span></span></span></code></pre></div><p>B-tree 擅長的 query：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 等值
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">42</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- 範圍
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2025-01-31&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- IS NULL
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">shipped_at</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Prefix LIKE
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">sku</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;ABC%&#39;</span><span class="p">;</span></span></span></code></pre></div><p>B-tree 不擅長：</p>
<ul>
<li><code>LIKE '%suffix'</code>（前綴 wildcard）→ 改 trigram + GIN</li>
<li><code>column @&gt; array</code>（包含）→ 改 GIN</li>
<li>JSON 內部 path query → 改 GIN on JSONB</li>
</ul>
<p><strong>Multi-column B-tree</strong> 的順序很重要：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 假設常 query: WHERE user_id = ? AND status = ?
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_user_status</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">status</span><span class="p">);</span><span class="w">  </span><span class="c1">-- 對
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_status_user</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">status</span><span class="p">,</span><span class="w"> </span><span class="n">user_id</span><span class="p">);</span><span class="w">  </span><span class="c1">-- 錯（status 選擇性低）</span></span></span></code></pre></div><p>順序原則：</p>
<ol>
<li><strong>等值 column 在前</strong>（高選擇性）</li>
<li><strong>範圍 column 在後</strong>（B-tree leftmost 規則）</li>
<li><strong>selectivity 高的在前</strong>（filter 更多 row）</li>
</ol>
<h2 id="ginjsonb--fts--array-的標配">GIN：JSONB / FTS / Array 的標配</h2>
<p>GIN（Generalized Inverted Index）對「一個 value 內含多個 sub-element」的 column 高效：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- JSONB
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Array
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_tags</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">tags</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Full-text search
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_content</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">content</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Trigram（fuzzy match）
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_trgm</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_name_trgm</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="n">gin_trgm_ops</span><span class="p">);</span></span></span></code></pre></div><p>GIN 代價：</p>
<ul>
<li><strong>Write 慢 2-10x</strong>：每個 sub-element 都要更新 inverted index</li>
<li><strong>Storage 大</strong>：可能比表還大</li>
<li><strong>Vacuum 沉重</strong>：bloat 累積快</li>
</ul>
<p><strong>Operator class</strong> 選擇影響大：</p>
<table>
  <thead>
      <tr>
          <th>Op class</th>
          <th>適用</th>
          <th>索引大小</th>
          <th>支援 operator</th>
          <th></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>jsonb_ops</code>（預設）</td>
          <td>通用</td>
          <td>大</td>
          <td><code>@&gt;</code> / <code>?</code> / `?</td>
          <td><code>/</code>?&amp;`</td>
      </tr>
      <tr>
          <td><code>jsonb_path_ops</code></td>
          <td>只 <code>@&gt;</code> containment</td>
          <td>1/3-1/2</td>
          <td>只 <code>@&gt;</code></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p>只用 <code>@&gt;</code> query 時、<code>jsonb_path_ops</code> 救大量 storage。</p>
<h2 id="gist範圍--空間--自訂">GiST：範圍 / 空間 / 自訂</h2>
<p>GiST（Generalized Search Tree）擅長範圍跟空間：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 範圍 type（PostgreSQL 內建 int4range / tsrange 等）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_bookings_period</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">bookings</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GiST</span><span class="w"> </span><span class="p">(</span><span class="n">period</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 空間（PostGIS）
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_locations_geom</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">locations</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GiST</span><span class="w"> </span><span class="p">(</span><span class="n">geom</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Exclusion constraint（範圍不重疊）
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">bookings</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">no_overlap</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="n">EXCLUDE</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GiST</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">=</span><span class="p">,</span><span class="w"> </span><span class="n">period</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="p">);</span></span></span></code></pre></div><p>GiST vs GIN 對 FTS 的選擇：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>GIN</th>
          <th>GiST</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lookup 速度</td>
          <td>快 3x</td>
          <td>慢</td>
      </tr>
      <tr>
          <td>Update 速度</td>
          <td>慢 3x</td>
          <td>快</td>
      </tr>
      <tr>
          <td>索引大小</td>
          <td>大</td>
          <td>小</td>
      </tr>
      <tr>
          <td>適合場景</td>
          <td>Read-heavy FTS</td>
          <td>Write-heavy / 即時更新</td>
      </tr>
  </tbody>
</table>
<p>多數 FTS workload 選 GIN — read 占多、index size 換 query latency 划算。</p>
<h2 id="brin大表--physical-order-correlated">BRIN：大表 + Physical Order Correlated</h2>
<p>BRIN（Block Range Index）對 <em>physical 儲存順序跟 logical 順序強相關</em> 的 column 高效：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- timestamp column（append-only insert、physical 順序 = 時間順序）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_events_created_at</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">BRIN</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">);</span></span></span></code></pre></div><p>BRIN 機制：每個 block range（預設 128 page）記 min/max、query 時跳過 range 外的 block。</p>
<p>適用場景：</p>
<ul>
<li><strong>append-only 表</strong>：log、metrics、events</li>
<li><strong>大表</strong>（10GB+）：B-tree 太貴、BRIN 1/1000 大小</li>
<li><strong>column physical order 跟 query 一致</strong>：時間欄、自增 id</li>
</ul>
<p><strong>BRIN 失效情境</strong>：</p>
<ul>
<li>UPDATE 破壞 physical order（row 被 vacuum 移到別 block）→ BRIN 失效</li>
<li>隨機 insert（uuid / hash id）→ BRIN range 完全沒選擇性</li>
</ul>
<p><strong>何時不該用 BRIN</strong>：表 &lt; 1GB（沒省 storage 收益）、column 沒 physical order correlation（CLUSTER 後可能改善）。</p>
<h2 id="partial-index條件式-index-救-storage">Partial Index：條件式 index 救 storage</h2>
<p>對 <em>只 query 部分 row</em> 的 column、partial index 救大量 storage：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 只 index unshipped order
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_unshipped</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">shipped_at</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- 只 index active user
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_active</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">email</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- 只 index 高金額 transaction
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_high_value</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span></span></span></code></pre></div><p>Partial index 的 query 要 <em>完全匹配 WHERE 條件</em> 才用得到：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 用得到 partial index
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">shipped_at</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 用不到（planner 不 prove WHERE 包含 partial 條件）
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="p">;</span></span></span></code></pre></div><p>實務 size 救法：unshipped order 只 1% 總量、partial index 1/100 大小。</p>
<h2 id="expression-index對函式結果-index">Expression Index：對函式結果 index</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 對 lowercased email index（case-insensitive search）
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_email_lower</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="k">lower</span><span class="p">(</span><span class="n">email</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">lower</span><span class="p">(</span><span class="n">email</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">lower</span><span class="p">(</span><span class="s1">&#39;USER@example.com&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- 對 JSONB 內部欄位
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_category</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">((</span><span class="n">metadata</span><span class="o">-&gt;&gt;</span><span class="s1">&#39;category&#39;</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="o">-&gt;&gt;</span><span class="s1">&#39;category&#39;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;shoes&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- 對日期截斷
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_day</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">date_trunc</span><span class="p">(</span><span class="s1">&#39;day&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">));</span></span></span></code></pre></div><p>Expression 必須 IMMUTABLE — <code>now()</code> / <code>random()</code> 不能用、<code>timezone('UTC', ts)</code> 可以。</p>
<h2 id="covering-indexinclude避免回表">Covering Index（INCLUDE）：避免回表</h2>
<p>PG 11+ 支援 INCLUDE column：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 只 index user_id、但 query 常要 email
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_user_id_covering</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w"> </span><span class="n">INCLUDE</span><span class="w"> </span><span class="p">(</span><span class="n">email</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Index-only scan：不用回表
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">email</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">42</span><span class="p">;</span></span></span></code></pre></div><p>INCLUDE column 不參與 sorting / equality、只放 leaf node、救 IO。</p>
<h2 id="index-選擇決策樹">Index 選擇決策樹</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Query pattern 是什麼？
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">├─ 等值 / 範圍 / prefix LIKE / IS NULL
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│  └─ B-tree（90% 場景）
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│     ├─ 只 query 部分 row？→ Partial B-tree
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">│     ├─ 對函式結果？→ Expression B-tree
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│     └─ 需要回表更多 column？→ Covering（INCLUDE）
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">├─ JSONB 內部 query / array 包含 / FTS
</span></span><span class="line"><span class="ln">10</span><span class="cl">│  └─ GIN
</span></span><span class="line"><span class="ln">11</span><span class="cl">│     ├─ 只用 @&gt;？→ jsonb_path_ops 救 storage
</span></span><span class="line"><span class="ln">12</span><span class="cl">│     └─ FTS write-heavy？→ 改 GiST
</span></span><span class="line"><span class="ln">13</span><span class="cl">│
</span></span><span class="line"><span class="ln">14</span><span class="cl">├─ 範圍 type（int4range / tsrange）/ 空間
</span></span><span class="line"><span class="ln">15</span><span class="cl">│  └─ GiST
</span></span><span class="line"><span class="ln">16</span><span class="cl">│
</span></span><span class="line"><span class="ln">17</span><span class="cl">├─ 大表 + append-only + physical order correlated
</span></span><span class="line"><span class="ln">18</span><span class="cl">│  └─ BRIN
</span></span><span class="line"><span class="ln">19</span><span class="cl">│
</span></span><span class="line"><span class="ln">20</span><span class="cl">├─ 純 equality + 簡單 column
</span></span><span class="line"><span class="ln">21</span><span class="cl">│  └─ Hash（很少用、B-tree 通常更好）
</span></span><span class="line"><span class="ln">22</span><span class="cl">│
</span></span><span class="line"><span class="ln">23</span><span class="cl">└─ Non-balanced 樹（IP prefix / quad-tree）
</span></span><span class="line"><span class="ln">24</span><span class="cl">   └─ SP-GiST（罕見）</span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="case-1過度-indexwrite-放大">Case 1：過度 index（write 放大）</h3>
<p><strong>情境</strong>：team「為了 query 快」對 20 個 column 各建 index、寫入量大時 INSERT 慢 10x。</p>
<p>每個 INSERT 要更新 20 個 index、WAL volume 也跟著放大、replication lag 拉長。</p>
<p>修法：</p>
<ul>
<li>用 <code>pg_stat_user_indexes</code> 找 <em>idx_scan = 0</em> 的 index、可能根本沒用</li>
<li>用 <code>pg_stat_statements</code> 找實際被執行的 query、反推真正需要的 index</li>
<li>同 column 多 index（user_id 單欄 + (user_id, status) 多欄）通常可拆掉單欄</li>
</ul>
<h3 id="case-2partial-index-條件跟-query-不匹配">Case 2：Partial index 條件跟 query 不匹配</h3>
<p><strong>情境</strong>：建 <code>WHERE status = 'active'</code> partial index、application query 寫 <code>WHERE status IN ('active')</code>、planner 不 prove 等價、不用 index。</p>
<p>修法：</p>
<ul>
<li>Partial 條件用最 generic form（避免 IN / OR 跟 = 的差異）</li>
<li>寫完用 <code>EXPLAIN</code> 驗證 query 真的用到 partial index</li>
<li>Application 統一 query 寫法、不要混 <code>=</code> 跟 <code>IN</code> 跟 <code>ANY</code></li>
</ul>
<h3 id="case-3b-tree-對-jsonb-內部欄位無效">Case 3：B-tree 對 JSONB 內部欄位無效</h3>
<p><strong>情境</strong>：對 <code>metadata</code> JSONB column 建 B-tree、query <code>metadata-&gt;&gt;'category' = 'shoes'</code> 不用 index。</p>
<p>B-tree 對 <em>整個 JSONB</em> 排序、但 path query 不是整個 JSONB 的比對。</p>
<p>修法：</p>
<ul>
<li>對固定 path 建 expression index：<code>CREATE INDEX ... ON products ((metadata-&gt;&gt;'category'))</code></li>
<li>對動態 path 建 GIN index：<code>CREATE INDEX ... USING GIN (metadata)</code></li>
<li>兩者並存可、<code>EXPLAIN</code> 看 planner 選哪個</li>
</ul>
<h3 id="case-4brin-對非-correlated-資料無效">Case 4：BRIN 對非 correlated 資料無效</h3>
<p><strong>情境</strong>：對 <code>user_id</code> 建 BRIN index（user_id 是隨機 UUID）、query 完全跑 seq scan。</p>
<p>UUID 沒 physical order correlation、每個 block range 的 min/max 涵蓋整個 ID space、BRIN 完全沒 prune 效果。</p>
<p>修法：</p>
<ul>
<li>BRIN 只用 <code>timestamp</code> / 自增 <code>id</code> / 其他自然 correlate 的 column</li>
<li>用 <code>pg_stats</code> 看 <code>correlation</code> value、&lt; 0.1 就不適合 BRIN</li>
<li>真要對 random column 加 index、回 B-tree</li>
</ul>
<h3 id="case-5multi-column-index-順序錯">Case 5：Multi-column index 順序錯</h3>
<p><strong>情境</strong>：常見 query <code>WHERE status = 'pending' AND user_id = 42</code>、建 index <code>(status, user_id)</code>、效能差。</p>
<p><code>status</code> 只 5 個 distinct value、選擇性 1/5；<code>user_id</code> 1M distinct、選擇性 1/1M。Index leftmost 是 status、scan range 太大。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 拆兩個或調順序
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_user_status</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">status</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 或加 partial 限定低選擇性 column
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_pending</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pending&#39;</span><span class="p">;</span></span></span></code></pre></div><h2 id="跟-mysql-index-差異">跟 MySQL Index 差異</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PostgreSQL</th>
          <th>MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Index method</td>
          <td>6 種（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）</td>
          <td>主要 B-tree、空間另算 R-tree</td>
      </tr>
      <tr>
          <td>預設</td>
          <td>B-tree</td>
          <td>B-tree（InnoDB clustered）</td>
      </tr>
      <tr>
          <td>Clustered index</td>
          <td>沒有原生（CLUSTER 一次性）</td>
          <td>InnoDB primary key 永遠 clustered</td>
      </tr>
      <tr>
          <td>Covering</td>
          <td>INCLUDE（PG 11+）</td>
          <td>自然支援（secondary index 帶 PK）</td>
      </tr>
      <tr>
          <td>JSON index</td>
          <td>GIN on JSONB（強）</td>
          <td>functional index on JSON（弱）</td>
      </tr>
      <tr>
          <td>Partial index</td>
          <td>原生支援</td>
          <td>8.0+ 支援（受限）</td>
      </tr>
      <tr>
          <td>Expression index</td>
          <td>原生支援</td>
          <td>5.7+ functional index</td>
      </tr>
      <tr>
          <td>BRIN-like</td>
          <td>原生</td>
          <td>沒有</td>
      </tr>
      <tr>
          <td>Spatial</td>
          <td>GiST / PostGIS</td>
          <td>R-tree（基本）</td>
      </tr>
  </tbody>
</table>
<p>PG index 系統比 MySQL 表達力高、但代價是 <em>選對 index method 是 application 責任</em>、MySQL 預設 B-tree 多數場景夠用。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>：EXPLAIN 看 index 用沒用</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>：JSONB + GIN 細節</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/full-text-search/" data-link-title="PostgreSQL Full-Text Search：tsvector / tsquery / GIN index 跟 pg_trgm fuzzy 三層搜尋" data-link-desc="PG 內建 full-text search 用 *tsvector / tsquery / GIN index* 三件組、適合中小規模搜尋（&lt; 100M 文件）；pg_trgm 提供 fuzzy match。本文走 FTS 機制（tsvector 是 lexeme &#43; position 的 vector）、3 種 query（match / ranking / weighted）、multi-language support、跟 pg_trgm fuzzy match 互補、5 production 踩雷（dictionary 選錯 / GIN 跟 GiST 取捨 / ranking 評分權重 / multi-language column 處理 / 何時不該用 PG FTS 改 Elasticsearch）">full-text-search</a>：FTS + GIN</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a>：index bloat</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">online-schema-change</a>：CREATE INDEX CONCURRENTLY</li>
</ul>
<h2 id="下一步">下一步</h2>
<ul>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a> 驗證 index 有沒有被 plan 用到</li>
<li>回 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 看全圖</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/citus-distributed/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/citus-distributed/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>Citus distributed extension&lt;/em> — 把 PG 變成 sharded cluster 的方式。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;p>當 PG single-primary 寫吞吐撞上單機極限（50K-100K WPS）、選項三條：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Application 層 sharding&lt;/strong>：應用層自管 shard routing&lt;/li>
&lt;li>&lt;strong>Citus&lt;/strong>：PG extension、自動 routing + cross-shard query&lt;/li>
&lt;li>&lt;strong>Distributed SQL&lt;/strong>（CockroachDB / Aurora DSQL / Spanner）：不同 engine&lt;/li>
&lt;/ol>
&lt;p>選 Citus 的核心 driver：&lt;em>保留 PG SQL syntax + extension 生態&lt;/em>。但「應用層幾乎不必改」是樂觀說法 — 實際上 application 必須圍繞 distribution column 重設計（query 加 filter / transaction 限定同 shard / reference table 量控制）、跟 Vitess 比 cross-shard query 自動化弱。代價是 &lt;em>coordinator / worker 部署複雜度 + cross-shard query 限制 + application schema 改造工作量&lt;/em>。&lt;/p>
&lt;p>閱讀本文前可先對齊 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding&lt;/a> 的 shard key、routing、resharding 與 cross-shard query 語意；容量失衡時再接 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition&lt;/a>。&lt;/p>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess sharding&lt;/a> 的核心差異：Citus 是 &lt;em>PG extension&lt;/em>（PG 自己跑）、Vitess 是 &lt;em>獨立 proxy + tablet 系統&lt;/em>（包 MySQL）。Citus 用 PG 原生機制（FDW / extension hook）、Vitess 是 &lt;em>外部包裝&lt;/em>。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>Citus distributed extension</em> — 把 PG 變成 sharded cluster 的方式。</p></blockquote>
<hr>
<p>當 PG single-primary 寫吞吐撞上單機極限（50K-100K WPS）、選項三條：</p>
<ol>
<li><strong>Application 層 sharding</strong>：應用層自管 shard routing</li>
<li><strong>Citus</strong>：PG extension、自動 routing + cross-shard query</li>
<li><strong>Distributed SQL</strong>（CockroachDB / Aurora DSQL / Spanner）：不同 engine</li>
</ol>
<p>選 Citus 的核心 driver：<em>保留 PG SQL syntax + extension 生態</em>。但「應用層幾乎不必改」是樂觀說法 — 實際上 application 必須圍繞 distribution column 重設計（query 加 filter / transaction 限定同 shard / reference table 量控制）、跟 Vitess 比 cross-shard query 自動化弱。代價是 <em>coordinator / worker 部署複雜度 + cross-shard query 限制 + application schema 改造工作量</em>。</p>
<p>閱讀本文前可先對齊 <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding</a> 的 shard key、routing、resharding 與 cross-shard query 語意；容量失衡時再接 <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a>。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess sharding</a> 的核心差異：Citus 是 <em>PG extension</em>（PG 自己跑）、Vitess 是 <em>獨立 proxy + tablet 系統</em>（包 MySQL）。Citus 用 PG 原生機制（FDW / extension hook）、Vitess 是 <em>外部包裝</em>。</p>
<h2 id="citus-架構coordinator--worker">Citus 架構：Coordinator + Worker</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">                ┌─────────────────┐
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   Application  │   Coordinator   │  ← 對外 PG wire protocol、planner、routing
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">                │   (Citus + PG)  │
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">                └────┬─────┬──────┘
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                     │     │
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">              ┌──────┘     └──────┐
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">              ▼                   ▼
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        ┌──────────┐         ┌──────────┐
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        │ Worker 1 │         │ Worker 2 │  ← 各跑 PG + Citus extension
</span></span><span class="line"><span class="ln">10</span><span class="cl">        │  (PG)    │         │  (PG)    │
</span></span><span class="line"><span class="ln">11</span><span class="cl">        │ shard 1,3│         │ shard 2,4│
</span></span><span class="line"><span class="ln">12</span><span class="cl">        └──────────┘         └──────────┘</span></span></code></pre></div><p><strong>Coordinator</strong>：</p>
<ul>
<li>對 application 看起來像 PG（同 port / 同 wire protocol）</li>
<li>接 SQL → Citus planner 把 query 分解 + route 給 worker</li>
<li>不存 data（distributed table 的 shard 在 worker 上）</li>
<li>存 <em>metadata</em>（哪個 shard 在哪個 worker）</li>
</ul>
<p><strong>Worker</strong>：</p>
<ul>
<li>標準 PG instance + Citus extension</li>
<li>各存若干 shard</li>
<li>接 coordinator 來的 query、跑 local execute、回結果</li>
</ul>
<p><strong>Shard</strong>：</p>
<ul>
<li>Distributed table 拆成 N 個 shard（預設 32）</li>
<li>每 shard 是 worker 上的 <em>physical PG table</em>（含 <code>_&lt;shardid&gt;</code> 後綴）</li>
<li>行為跟一般 PG table 一樣、可以直接連 worker 用 PG 工具 access</li>
</ul>
<h2 id="3-種-table-type">3 種 Table Type</h2>
<h3 id="distributed-table--跨-shard-切分">Distributed table — 跨 shard 切分</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 建一般 PG table
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">BIGSERIAL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="k">TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w">  </span><span class="c1">-- PK 必須含 distribution column
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 用 Citus 把它變 distributed
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span></span></span></code></pre></div><p><code>user_id</code> 是 <em>distribution column</em> — Citus 用它的 hash 決定 row 屬哪個 shard。<code>PK 必須含 distribution column</code>（跟 MySQL partitioning 同要求）。</p>
<p>跟 Vitess Vindex 對比：</p>
<ul>
<li>Citus：hash distribution column → shard（單一 hash function、不可選 algorithm）</li>
<li>Vitess：Vindex 可選多種（hash / lookup_hash / xxhash / null）</li>
</ul>
<h3 id="reference-table--全-shard-共有">Reference table — 全 shard 共有</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">price</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_reference_table</span><span class="p">(</span><span class="s1">&#39;products&#39;</span><span class="p">);</span></span></span></code></pre></div><p><code>products</code> 在 <em>每個 worker 都有完整 copy</em>、寫入 coordinator 廣播給所有 worker。</p>
<p>用途：</p>
<ul>
<li>小 lookup table（country code / product category 等）</li>
<li>跨 distributed table JOIN 時、reference table 在每 worker 上、不必 cross-shard</li>
<li>寫入頻率低（廣播 cost 跟 worker 數 linear）</li>
</ul>
<h3 id="local-table--coordinator-上的-pg-table">Local table — Coordinator 上的 PG table</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">audit_log</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">event</span><span class="w"> </span><span class="n">JSONB</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 不調用 Citus function、預設留在 coordinator</span></span></span></code></pre></div><p>行為跟一般 PG table 一樣。用於 <em>不需 distribute</em> 的 table（如 admin metadata）。</p>
<h2 id="colocation跨-distributed-table-同-shard-對齊">Colocation：跨 distributed table 同 shard 對齊</h2>
<p>當兩個 distributed table 都用 <em>同 distribution column</em>（例如 <code>user_id</code>）+ 同 shard count、Citus 自動 colocate：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;user_addresses&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">colocate_with</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;orders&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Colocate 後：</p>
<ul>
<li><code>user_id = 100</code> 的 orders 跟 user_addresses 在 <em>同一 worker shard</em></li>
<li>JOIN 不跨 worker、效率高</li>
<li>可用 PG 原生 FK constraint（cross-table 但同 shard）</li>
</ul>
<p>Colocate 是 Citus 設計的核心 <em>跨 table 一致性</em> 機制。沒 colocate 的 cross-table query 變 cross-worker、效率大降。</p>
<h2 id="配置-step-by-steplocal-cluster">配置 step-by-step（local cluster）</h2>
<p>Production 用 Citus Cloud（Microsoft 託管）或 Azure Cosmos DB for PostgreSQL（同 engine）。Self-hosted：</p>
<h3 id="step-1coordinator--worker-都裝-pg--citus">Step 1：Coordinator + worker 都裝 PG + Citus</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 在每個 node（coordinator + 2 worker）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">apt install postgresql-14
</span></span><span class="line"><span class="ln">3</span><span class="cl">apt install postgresql-14-citus-12.0
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="nv">shared_preload_libraries</span> <span class="o">=</span> <span class="s1">&#39;citus&#39;</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl">systemctl restart postgresql</span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 在每個 node 跑
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">citus</span><span class="p">;</span></span></span></code></pre></div><h3 id="step-2coordinator-註冊-worker">Step 2：Coordinator 註冊 worker</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 在 coordinator 跑
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">citus_add_node</span><span class="p">(</span><span class="s1">&#39;worker1.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">5432</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">citus_add_node</span><span class="p">(</span><span class="s1">&#39;worker2.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">5432</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 確認
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">citus_get_active_worker_nodes</span><span class="p">();</span></span></span></code></pre></div><h3 id="step-3建-distributed-table">Step 3：建 distributed table</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">BIGSERIAL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="k">TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Citus 自動把 <code>orders</code> 拆成 32 個 shard（<code>orders_102008</code> 等）、分配到 worker。</p>
<h3 id="step-4application-連-coordinator">Step 4：Application 連 coordinator</h3>
<p>Application connection string 連 coordinator IP / port（不必知道 worker 存在）。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 從 application 跑 query、Citus 透明 route
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">12345</span><span class="p">,</span><span class="w"> </span><span class="mi">50</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- → Citus 看 user_id=12345 hash 屬 shard 17、route 給對應 worker
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">12345</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- → Single-shard query、極快
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="c1">-- → Cross-shard aggregation、Citus 並行跑、合併結果</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-distribution-column-選錯--cross-shard-query-變主流">1. Distribution column 選錯 — Cross-shard query 變主流</h3>
<p>選 <code>created_at</code> 或 <code>id</code>（auto increment）作 distribution column、看起來均勻、實際 <em>application query 多以 user_id 為主</em>、變成 <em>每個 query 都 cross-shard</em>、performance 雪崩。</p>
<p>修法：</p>
<ul>
<li><em>Distribution column 選 application 最常 filter / join 的 column</em>（通常是 <code>tenant_id</code> / <code>user_id</code>）</li>
<li>Audit application top query、確認 distribution column 對齊 query pattern</li>
<li>改 distribution column 要 <em>rewrite 所有 shard</em>、像 resharding、大工程</li>
</ul>
<h3 id="2-cross-shard-transaction-限制">2. Cross-shard transaction 限制</h3>
<p>跨多 shard 的 transaction（如：UPDATE 兩個 user_id 不同的 row）Citus 用 <em>2PC</em>（two-phase commit）但有限制：</p>
<ul>
<li>Multi-statement transaction 跨 shard 需明確開 <code>SET citus.multi_shard_modify_mode = 'sequential'</code></li>
<li>部分 isolation level 不保證 serializable across shards</li>
<li>DDL 跨 shard 是 sequential</li>
</ul>
<p>修法：</p>
<ul>
<li>Schema design 避免 cross-shard transaction（同 colocation group 內 transaction 沒問題）</li>
<li>必要 cross-shard 場景明確設 multi-shard mode</li>
<li>對 <em>strict cross-shard consistency</em>、考慮 distributed SQL（CockroachDB / Aurora DSQL）</li>
</ul>
<h3 id="3-reference-table-過大--寫入廣播-cost-爆">3. Reference table 過大 — 寫入廣播 cost 爆</h3>
<p>Reference table 在每 worker 都有 copy、寫入 <em>廣播給所有 worker</em>。Reference table 100K row + 高頻寫入 → 寫一次寫 N worker、cost N x。</p>
<p>修法：</p>
<ul>
<li>Reference table 限 <em>小 + 寫入頻率低</em> 的 lookup data</li>
<li>超大表不該是 reference table、考慮 distributed</li>
<li>監控 reference table 寫入 rate、超 threshold 重新評估</li>
</ul>
<h3 id="4-colocate-沒對齊--隱性-cross-shard-join">4. Colocate 沒對齊 — 隱性 cross-shard JOIN</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 看似可以、實際 cross-shard 慢
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">user_addresses</span><span class="w"> </span><span class="n">ua</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ua</span><span class="p">.</span><span class="n">user_id</span><span class="p">;</span></span></span></code></pre></div><p>若 <code>user_addresses</code> 沒 <code>colocate_with =&gt; 'orders'</code>、兩表 shard 分配獨立、JOIN 跨 worker。</p>
<p>修法：</p>
<ul>
<li>建相關 table 時 <code>colocate_with</code> 對齊</li>
<li>用 <code>SELECT * FROM citus_tables</code> 看 colocation_id、確認對齊</li>
<li>跨非 colocate table 的 JOIN 用 <em>materialized view</em> 或 application 層拆 query 避開</li>
</ul>
<h3 id="5-worker-failover--coordinator-必須知道">5. Worker failover — Coordinator 必須知道</h3>
<p>Worker 故障、Citus 預設 <em>coordinator 看到 query 失敗、不自動 failover</em>。</p>
<p>修法（Citus 11+）：</p>
<ul>
<li>用 <em>shard replication</em>（<code>citus.shard_replication_factor = 2</code>）— 每 shard 在 2 個 worker 有 copy</li>
<li>配 PG streaming replication 在 worker 層、外加 Patroni 管 failover</li>
<li>Coordinator 失敗 → 整個 cluster 失能、coordinator 也要 HA（Patroni）</li>
</ul>
<p>跟 Vitess 對比 Citus 的 HA story 較弱、production 必須認真規劃。</p>
<h2 id="何時用-citus">何時用 Citus</h2>
<table>
  <thead>
      <tr>
          <th>條件</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-tenant SaaS、tenant_id 為自然 distribution</td>
          <td>是</td>
      </tr>
      <tr>
          <td>寫吞吐 &gt; 50K WPS、單 PG 撐不住</td>
          <td>是</td>
      </tr>
      <tr>
          <td>需要保留 PG SQL + extension（pgvector / TimescaleDB）</td>
          <td>是</td>
      </tr>
      <tr>
          <td>應用 query pattern 80% 都用同一 distribution column</td>
          <td>是</td>
      </tr>
      <tr>
          <td>應用大量 ad-hoc cross-tenant aggregation</td>
          <td>否（cross-shard 慢）</td>
      </tr>
      <tr>
          <td>強 cross-shard consistency 需求</td>
          <td>否（用 CockroachDB）</td>
      </tr>
      <tr>
          <td>想 zero-ops managed</td>
          <td>Azure Cosmos DB for PostgreSQL（同 engine）</td>
      </tr>
  </tbody>
</table>
<h2 id="容量規劃">容量規劃</h2>
<ul>
<li>Coordinator: 中等 CPU + RAM、metadata 不大、不存 data</li>
<li>Worker: per-worker spec 同 single PG production</li>
<li>Shard count: 預設 32、實務常設 worker count × 4-8</li>
<li>Replication factor: production 至少 2</li>
</ul>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>Coordinator + worker 各跑 PG streaming replication、Citus 不取代 PG replication。Worker failover 用 Patroni / streaming replication。詳見 <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>。</p>
<h3 id="跟-pg-extensions">跟 PG Extensions</h3>
<p>Citus 跟其他 PG extension 多數兼容（pgvector / TimescaleDB / pg_stat_statements）— 它維持 <em>extension</em> 形態，保留 PostgreSQL 生態接點。詳見 <em>PG Extension Ecosystem</em> 篇（待寫）。</p>
<h3 id="跟-mysql-vitess">跟 MySQL Vitess</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Citus</th>
          <th>Vitess</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>部署模型</td>
          <td>PG extension</td>
          <td>獨立 proxy + tablet</td>
      </tr>
      <tr>
          <td>主要場景</td>
          <td>Multi-tenant SaaS</td>
          <td>超大規模分片</td>
      </tr>
      <tr>
          <td>Cross-shard JOIN</td>
          <td>colocate 對齊 + reference table</td>
          <td>VTGate 自動 split + aggregate</td>
      </tr>
      <tr>
          <td>FK</td>
          <td>同 colocation 內可用</td>
          <td>Vitess 18+ 支援、cross-shard 限制</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>依賴 Patroni + replication factor</td>
          <td>VTOrc + replication</td>
      </tr>
      <tr>
          <td>學習曲線</td>
          <td>中（PG ops 經驗夠）</td>
          <td>高（4 component）</td>
      </tr>
  </tbody>
</table>
<p>Citus 對 <em>PG-native</em> 場景更平順、Vitess 對 <em>MySQL-native</em> 場景更平順、不直接競爭。詳見 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess Sharding</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（per-worker replication）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a>（cross-shard transaction lock 行為）</li>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>（Citus vs CockroachDB vs Spanner）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess Sharding</a>（sibling、不同實作）</li>
<li><a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor</a>（Azure Cosmos DB for PostgreSQL = managed Citus）</li>
<li>官方：<a href="https://docs.citusdata.com/">Citus Documentation</a> / <a href="https://github.com/citusdata/citus">Citus on GitHub</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/sql-features-baseline/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/sql-features-baseline/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>SQL features baseline&lt;/em> — PG 早期就有的、MySQL 8.0 才補的、PG 仍領先的、給從 MySQL 評估 PG 的讀者 reference。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-sql-工程深度的歷史錨點">PG SQL 工程深度的歷史錨點&lt;/h2>
&lt;p>PG 在 SQL feature 上長期領先 MySQL：&lt;/p>
&lt;ul>
&lt;li>2009 (PG 8.4)：CTE / window function / recursive query&lt;/li>
&lt;li>2013 (PG 9.3)：lateral derived table / materialized view&lt;/li>
&lt;li>2014 (PG 9.4)：JSONB / partial index 早就有 / GIN index&lt;/li>
&lt;li>2015 (PG 9.5)：UPSERT (&lt;code>ON CONFLICT&lt;/code>)&lt;/li>
&lt;li>2017 (PG 10)：declarative partitioning / logical replication / multi-column statistics&lt;/li>
&lt;/ul>
&lt;p>MySQL 8.0（2018）才補 CTE / window / lateral / JSON_TABLE / hash join — &lt;em>PG 早 9 年起步&lt;/em>。&lt;/p>
&lt;p>對 &lt;em>從 MySQL 評估 PG&lt;/em> 的讀者來說、PG 的 SQL 工程深度不只是「該有的都有」、更多是「PG 結構性領先的特性 + MySQL 8.0 補了哪些 + PG 仍領先哪些」。&lt;/p>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features&lt;/a> 對比視角：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>SQL features baseline</em> — PG 早期就有的、MySQL 8.0 才補的、PG 仍領先的、給從 MySQL 評估 PG 的讀者 reference。</p></blockquote>
<hr>
<h2 id="pg-sql-工程深度的歷史錨點">PG SQL 工程深度的歷史錨點</h2>
<p>PG 在 SQL feature 上長期領先 MySQL：</p>
<ul>
<li>2009 (PG 8.4)：CTE / window function / recursive query</li>
<li>2013 (PG 9.3)：lateral derived table / materialized view</li>
<li>2014 (PG 9.4)：JSONB / partial index 早就有 / GIN index</li>
<li>2015 (PG 9.5)：UPSERT (<code>ON CONFLICT</code>)</li>
<li>2017 (PG 10)：declarative partitioning / logical replication / multi-column statistics</li>
</ul>
<p>MySQL 8.0（2018）才補 CTE / window / lateral / JSON_TABLE / hash join — <em>PG 早 9 年起步</em>。</p>
<p>對 <em>從 MySQL 評估 PG</em> 的讀者來說、PG 的 SQL 工程深度不只是「該有的都有」、更多是「PG 結構性領先的特性 + MySQL 8.0 補了哪些 + PG 仍領先哪些」。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a> 對比視角：</p>
<ul>
<li>MySQL 8.0 視角：「我終於補齊 + 跟 PG 對比」</li>
<li>PG 視角：「我長期領先 + MySQL 8.0 才追上某些、其他我仍領先」</li>
</ul>
<h2 id="pg-結構性領先特性mysql-沒對應--弱對應">PG 結構性領先特性（MySQL 沒對應 / 弱對應）</h2>
<h3 id="1-materialized-view">1. Materialized View</h3>
<p>PG 9.3+ 內建 materialized view：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">orders_summary</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">order_count</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">total</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">user_id</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 手動 refresh
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="n">REFRESH</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">orders_summary</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- 或 concurrent refresh（PG 9.4+、不 lock read）
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="n">REFRESH</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">CONCURRENTLY</span><span class="w"> </span><span class="n">orders_summary</span><span class="p">;</span></span></span></code></pre></div><p>用途：</p>
<ul>
<li>預計算複雜 aggregation、查詢時極快</li>
<li>Concurrent refresh 不 lock read</li>
<li>可建 index on materialized view</li>
</ul>
<p><strong>MySQL 對應</strong>：沒原生 materialized view。常見替代：</p>
<ul>
<li>Trigger + summary table（手動維護）</li>
<li>Application 層 caching layer</li>
<li>用 view + cache layer（不是 materialization）</li>
</ul>
<p>MySQL 8.0+ 仍無原生 materialized view。</p>
<h3 id="2-partial-index">2. Partial Index</h3>
<p>PG 預設支援 partial index — 對 <em>滿足條件的 row</em> 才建 index：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 只對 active user 建 index
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_active_email</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="p">(</span><span class="n">email</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Index size 比 full index 小很多、query 性能跟 full index 一樣
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">email</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;x@y.com&#39;</span><span class="p">;</span></span></span></code></pre></div><p>用途：</p>
<ul>
<li><em>Soft-delete</em> 場景：對 <code>deleted_at IS NULL</code> 建 partial index</li>
<li><em>Hot subset</em> 場景：對 <code>status = 'pending'</code> 等熱資料建 partial</li>
<li>Index 大小 / 寫入成本大降</li>
</ul>
<p><strong>MySQL 對應</strong>：MySQL 沒原生 partial index。MySQL 8.0+ 有 <em>functional index</em> 但跟 partial 不同。MySQL 替代：</p>
<ul>
<li>Generated column + index（接近、但維護複雜）</li>
<li>或接受 full index cost</li>
</ul>
<h3 id="3-foreign-data-wrapper-fdw">3. Foreign Data Wrapper (FDW)</h3>
<p>PG FDW 讓 query 跨外部資料源：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">postgres_fdw</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SERVER</span><span class="w"> </span><span class="n">remote_db</span><span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">DATA</span><span class="w"> </span><span class="n">WRAPPER</span><span class="w"> </span><span class="n">postgres_fdw</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">OPTIONS</span><span class="w"> </span><span class="p">(</span><span class="k">host</span><span class="w"> </span><span class="s1">&#39;remote.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dbname</span><span class="w"> </span><span class="s1">&#39;analytics&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">USER</span><span class="w"> </span><span class="n">MAPPING</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="n">localuser</span><span class="w"> </span><span class="n">SERVER</span><span class="w"> </span><span class="n">remote_db</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">OPTIONS</span><span class="w"> </span><span class="p">(</span><span class="k">user</span><span class="w"> </span><span class="s1">&#39;remoteuser&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">remote_orders</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span><span class="w"> </span><span class="n">SERVER</span><span class="w"> </span><span class="n">remote_db</span><span class="w"> </span><span class="k">OPTIONS</span><span class="w"> </span><span class="p">(</span><span class="k">table_name</span><span class="w"> </span><span class="s1">&#39;orders&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- 在 local PG query remote table
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">remote_orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span></span></span></code></pre></div><p>支援 FDW：<code>postgres_fdw</code> / <code>mysql_fdw</code> / <code>oracle_fdw</code> / <code>mongo_fdw</code> / <code>file_fdw</code> / <code>redis_fdw</code> 等。</p>
<p><strong>MySQL 對應</strong>：MySQL 8.0+ 有 FEDERATED engine（受限、不推薦）。實務上 MySQL 跨 DB query 用 application 層處理。</p>
<h3 id="4-jsonb--gin-indexpg-結構性優勢">4. JSONB + GIN Index（PG 結構性優勢）</h3>
<p>PG JSONB 是 <em>binary 儲存</em> + 可 <em>直接 GIN index</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">metadata</span><span class="w"> </span><span class="n">JSONB</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- GIN index over JSONB
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- 快 query
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@?</span><span class="w"> </span><span class="s1">&#39;$.variants[*].price &gt; 100&#39;</span><span class="p">;</span></span></span></code></pre></div><p><strong>MySQL 對應</strong>：MySQL 8.0 JSON_TABLE 是 SQL standard、但 <em>index 必須 generated column workaround</em>（不能 GIN index over JSON）。</p>
<p>詳見 <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a> JSON_TABLE vs PG JSONB 對比段。</p>
<h3 id="5-range-types--exclusion-constraints">5. Range Types + Exclusion Constraints</h3>
<p>PG range types + exclusion constraints 防止 <em>時間範圍重疊</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">reservations</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">room_id</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">during</span><span class="w"> </span><span class="n">TSRANGE</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">EXCLUDE</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">=</span><span class="p">,</span><span class="w"> </span><span class="n">during</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- INSERT 重疊 booking 自動 reject
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">reservations</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="p">,</span><span class="w"> </span><span class="n">during</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;[2026-05-19 10:00, 2026-05-19 12:00)&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">reservations</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="p">,</span><span class="w"> </span><span class="n">during</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;[2026-05-19 11:00, 2026-05-19 13:00)&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- ERROR: conflicting key value violates exclusion constraint</span></span></span></code></pre></div><p><strong>MySQL 對應</strong>：完全沒對應、必須 application 層 enforce。</p>
<h3 id="6-check-constraint--domain-type">6. CHECK Constraint + Domain Type</h3>
<p>PG <code>CHECK</code> constraint 真執行（MySQL 8.0 才補）+ user-defined <code>DOMAIN</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DOMAIN</span><span class="w"> </span><span class="n">positive_int</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">INT</span><span class="w"> </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">VALUE</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">quantity</span><span class="w"> </span><span class="n">positive_int</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="w"> </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">amount</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p><strong>MySQL 對應</strong>：8.0+ 有 CHECK constraint enforcement（5.7 可寫但不執行）。沒 user-defined DOMAIN。</p>
<h3 id="7-extension-ecosystem">7. Extension Ecosystem</h3>
<p>PG extension 是 <em>結構優勢</em>：</p>
<ul>
<li><code>pg_partman</code>：自動 partition lifecycle</li>
<li><code>pg_repack</code>：online table rewrite</li>
<li><code>pg_stat_statements</code>：query stats</li>
<li><code>pgvector</code>：vector similarity search</li>
<li><code>pg_cron</code>：scheduled job</li>
<li><code>PostGIS</code>：GIS</li>
<li><code>TimescaleDB</code>：time-series</li>
<li><code>Citus</code>：sharding</li>
</ul>
<p><strong>MySQL 對應</strong>：MySQL plugin 機制有、生態遠遠不如。詳見 <em>PG Extension Ecosystem</em> 篇（待寫）。</p>
<h2 id="mysql-80-補齊的-pg-既有特性">MySQL 8.0 補齊的 PG 既有特性</h2>
<table>
  <thead>
      <tr>
          <th>特性</th>
          <th>PG 推出</th>
          <th>MySQL 推出</th>
          <th>差異後說明</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CTE</td>
          <td>8.4 (2009)</td>
          <td>8.0 (2018)</td>
          <td>MySQL 補語法、行為 PG 12+ 跟 MySQL 接近</td>
      </tr>
      <tr>
          <td>Window function</td>
          <td>8.4 (2009)</td>
          <td>8.0 (2018)</td>
          <td>兩家都標準、frame spec 細節有差</td>
      </tr>
      <tr>
          <td>Lateral derived table</td>
          <td>9.3 (2013)</td>
          <td>8.0.14 (2019)</td>
          <td>MySQL 後加、planner 不如 PG 成熟</td>
      </tr>
      <tr>
          <td>Hash join</td>
          <td>早就有</td>
          <td>8.0.18 (2019)</td>
          <td>MySQL 受限（equality on indexed column）</td>
      </tr>
      <tr>
          <td>JSON_TABLE</td>
          <td>17 (2024)</td>
          <td>8.0 (2018)</td>
          <td>MySQL 較早、PG 17+ 補進、PG 自己有 JSONB 路線</td>
      </tr>
      <tr>
          <td>CHECK constraint</td>
          <td>早就有</td>
          <td>8.0 (2018)</td>
          <td>MySQL 5.7 可寫但不執行</td>
      </tr>
      <tr>
          <td>Role-based auth</td>
          <td>早就有</td>
          <td>8.0 (2018)</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Atomic DDL</td>
          <td>早就有</td>
          <td>8.0 (2018)</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Common keyword</td>
          <td>完整</td>
          <td>8.0 補</td>
          <td>MySQL 5.7 缺很多 (window/rank/lateral 等)</td>
      </tr>
  </tbody>
</table>
<p>MySQL 8.0 是 <em>補齊 9 年 SQL standard 落後</em>、不是 <em>新領先 PG</em>。</p>
<h2 id="pg-仍領先的特性">PG 仍領先的特性</h2>
<p>對應「MySQL 8.0 補了 → PG 仍沒輸」的視角。以下 14 條中、<em>production 影響最大</em> 的是 Materialized view / Partial index / JSONB GIN / Full-text search 跟 Range / Exclusion constraints（schema-level expressiveness）；<em>次要但常用</em> 的是 Multi-column statistics 跟 Procedural language；<em>非典型但 niche 重要</em> 的是 User-defined DOMAIN / Generic table inheritance（讀者不必然知道、但 ORM 跟 schema migration 工具會用）：</p>
<table>
  <thead>
      <tr>
          <th>PG 領先特性</th>
          <th>MySQL 對應狀態</th>
          <th>補充</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Materialized view</td>
          <td>無原生</td>
          <td>application-side 重算成本高</td>
      </tr>
      <tr>
          <td>Partial index</td>
          <td>無（functional index 不等同）</td>
          <td>對 boolean / status column 救 storage</td>
      </tr>
      <tr>
          <td>FDW</td>
          <td>弱（FEDERATED engine 不推薦）</td>
          <td>跨 DB query escape hatch</td>
      </tr>
      <tr>
          <td>JSONB GIN index</td>
          <td>無（generated column workaround）</td>
          <td>JSON workload 結構性差</td>
      </tr>
      <tr>
          <td>Range types</td>
          <td>無</td>
          <td>booking / availability schema 救命</td>
      </tr>
      <tr>
          <td>Exclusion constraints</td>
          <td>無</td>
          <td>range overlap 防護</td>
      </tr>
      <tr>
          <td>User-defined DOMAIN</td>
          <td>無</td>
          <td>column-level type constraint</td>
      </tr>
      <tr>
          <td>Extension ecosystem</td>
          <td>弱</td>
          <td>pgvector / TimescaleDB / PostGIS</td>
      </tr>
      <tr>
          <td>Full-text search 成熟</td>
          <td>InnoDB FTS 較弱</td>
          <td>tsvector + GIN + pg_trgm 三層</td>
      </tr>
      <tr>
          <td>Multi-column statistics</td>
          <td>8.0 histograms 部分對應、PG 更廣</td>
          <td>planner 更準</td>
      </tr>
      <tr>
          <td>Procedural language</td>
          <td>PL/pgSQL + 多語言（PL/Python / PL/Perl 等）</td>
          <td>Stored procedure（不擴語言）</td>
      </tr>
      <tr>
          <td>Recursive CTE 深度</td>
          <td>Unlimited</td>
          <td>1000（cte_max_recursion_depth）</td>
      </tr>
      <tr>
          <td>LSN-based replication</td>
          <td>簡潔</td>
          <td>binlog file+position（GTID 緩解）</td>
      </tr>
      <tr>
          <td>Generic table inheritance</td>
          <td>早就有</td>
          <td>無（multi-tenant schema 結構用）</td>
      </tr>
  </tbody>
</table>
<h2 id="對從-mysql-評估-pg的讀者">對「從 MySQL 評估 PG」的讀者</h2>
<p>讀者通常從 MySQL 8.0 過來、問題是 <em>「PG 比 MySQL 強在哪、弱在哪」</em>：</p>
<h3 id="pg-比-mysql-強">PG 比 MySQL 強</h3>
<ul>
<li><em>SQL 工程深度</em>：上面列的 7 個結構優勢</li>
<li><em>Extension ecosystem</em>：pgvector / TimescaleDB / Citus / pg_partman 等</li>
<li><em>Optimizer</em>：planner 對複雜 query 更成熟</li>
<li><em>Concurrency model</em>：MVCC + 少 lock（<a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>）</li>
</ul>
<h3 id="pg-比-mysql-弱">PG 比 MySQL 弱</h3>
<ul>
<li><em>Replication 機制簡潔度</em>：MySQL GTID 比 PG WAL + replication slot 配置簡單（<a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>）</li>
<li><em>Sharding ecosystem</em>：Vitess / PlanetScale 比 Citus 規模驗證高</li>
<li><em>Operational tooling 廣度</em>：pt-toolkit / gh-ost / Orchestrator 等</li>
<li><em>VACUUM 維護</em>：PG MVCC 必須 VACUUM、autovacuum 配錯議題多（<a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>）</li>
</ul>
<h3 id="選-pg-的核心-driver">選 PG 的核心 driver</h3>
<p>對 SQL 工程深度、extension、複雜 query / OLAP-style workload 的場景、PG 仍是首選。對純簡單 OLTP + 大規模 sharding、MySQL + Vitess 仍 competitive。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>：PG MVCC 是 SQL feature 的並行控制基礎</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>：PG planner 對 window / CTE / hash join 成熟</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">Citus Distributed</a>：extension 之一、體現 extension 生態</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>：MVCC 代價、跟 SQL feature 並行控制相關</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a>（concurrency 基礎）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（planner 成熟度）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">PG Citus Distributed</a>（extension example）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a>（MVCC 維護）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>（sibling、反向視角）</li>
<li>官方：<a href="https://www.postgresql.org/about/featurematrix/">PostgreSQL Features</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/bdr-multi-master/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/bdr-multi-master/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>multi-master / active-active replication&lt;/em> — 不是 PG 預設、需要 extension。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-預設沒-multi-master得用-extension">PG 預設沒 multi-master、得用 extension&lt;/h2>
&lt;p>PG core 是 &lt;em>single-primary streaming replication&lt;/em>：&lt;/p>
&lt;ul>
&lt;li>寫入只能進 primary&lt;/li>
&lt;li>Standby 接受 read（hot_standby）但拒絕 write&lt;/li>
&lt;li>Failover 後新 primary 接管、不能多入口&lt;/li>
&lt;/ul>
&lt;p>對需要 &lt;em>active-active&lt;/em>（多 region 各自接受 local write）的場景、PG 提供 3 條 extension 路徑：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>方案&lt;/th>
 &lt;th>來源&lt;/th>
 &lt;th>機制&lt;/th>
 &lt;th>License&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>BDR&lt;/strong>&lt;/td>
 &lt;td>EDB（Enterprise）&lt;/td>
 &lt;td>Logical replication-based、雙向&lt;/td>
 &lt;td>商業（EDB 訂閱）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>pgEdge&lt;/strong>&lt;/td>
 &lt;td>pgEdge Inc.&lt;/td>
 &lt;td>基於 BDR、開源、加 Spock extension&lt;/td>
 &lt;td>開源（Spock）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Bucardo&lt;/strong>&lt;/td>
 &lt;td>community&lt;/td>
 &lt;td>Trigger-based、async、Perl 寫&lt;/td>
 &lt;td>開源（BSD）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>每條路徑有不同 trade-off。對 99% PG production case、&lt;em>不需要 multi-master&lt;/em> — single-primary streaming replication + read replica scaling 已夠。Multi-master 是 &lt;em>特殊需求&lt;/em>（跨 region active-active write / 不可中斷 maintenance）才上。&lt;/p>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &amp;#43; certification* 整個機制不同。本文走 GR 機制（GCE &amp;#43; certification &amp;#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication&lt;/a> 對比：MySQL GR 是 &lt;em>官方內建&lt;/em>（5.7+）、PG 沒對應內建選項。MySQL 用戶 GR / InnoDB Cluster 直接套、PG 用戶要選 extension + license trade-off。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>multi-master / active-active replication</em> — 不是 PG 預設、需要 extension。</p></blockquote>
<hr>
<h2 id="pg-預設沒-multi-master得用-extension">PG 預設沒 multi-master、得用 extension</h2>
<p>PG core 是 <em>single-primary streaming replication</em>：</p>
<ul>
<li>寫入只能進 primary</li>
<li>Standby 接受 read（hot_standby）但拒絕 write</li>
<li>Failover 後新 primary 接管、不能多入口</li>
</ul>
<p>對需要 <em>active-active</em>（多 region 各自接受 local write）的場景、PG 提供 3 條 extension 路徑：</p>
<table>
  <thead>
      <tr>
          <th>方案</th>
          <th>來源</th>
          <th>機制</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BDR</strong></td>
          <td>EDB（Enterprise）</td>
          <td>Logical replication-based、雙向</td>
          <td>商業（EDB 訂閱）</td>
      </tr>
      <tr>
          <td><strong>pgEdge</strong></td>
          <td>pgEdge Inc.</td>
          <td>基於 BDR、開源、加 Spock extension</td>
          <td>開源（Spock）</td>
      </tr>
      <tr>
          <td><strong>Bucardo</strong></td>
          <td>community</td>
          <td>Trigger-based、async、Perl 寫</td>
          <td>開源（BSD）</td>
      </tr>
  </tbody>
</table>
<p>每條路徑有不同 trade-off。對 99% PG production case、<em>不需要 multi-master</em> — single-primary streaming replication + read replica scaling 已夠。Multi-master 是 <em>特殊需求</em>（跨 region active-active write / 不可中斷 maintenance）才上。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a> 對比：MySQL GR 是 <em>官方內建</em>（5.7+）、PG 沒對應內建選項。MySQL 用戶 GR / InnoDB Cluster 直接套、PG 用戶要選 extension + license trade-off。</p>
<h2 id="multi-master-三方案對比">Multi-master 三方案對比</h2>
<h3 id="方案-1bdr-edb-postgres-distributed">方案 1：BDR (EDB Postgres Distributed)</h3>
<p>EDB 商業 distributed 方案、跑在 EDB Postgres Advanced Server 或 PG community 上。</p>
<p><strong>特性</strong>：</p>
<ul>
<li>雙向 logical replication、N-way active-active</li>
<li>Built-in conflict detection + resolution（LWW / column-level / user-defined）</li>
<li>Eager（sync）跟 async 兩種 mode</li>
<li>Tightly integrated with EDB tooling</li>
</ul>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>商業 license、EDB 訂閱</li>
<li>對 cross-region multi-master 成熟（北美 enterprise 廣用）</li>
<li>對 <em>新 PG version</em> 通常滯後幾個月</li>
</ul>
<h3 id="方案-2pgedge基於-spock-extension">方案 2：pgEdge（基於 Spock extension）</h3>
<p>pgEdge 開源 multi-master、基於 <em>Spock</em> extension（從 BDR 衍生）：</p>
<p><strong>特性</strong>：</p>
<ul>
<li>開源、可自管</li>
<li>跟 BDR 架構接近、無 license fee</li>
<li>Conflict resolution 用 LWW + column-level</li>
<li>對 <em>edge / 地理分散</em> 場景設計</li>
</ul>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>較新（2023+）、社群驗證度低於 BDR</li>
<li>Conflict resolution policy 比 BDR 簡單</li>
<li>部分 EDB 商業 feature 沒對應</li>
</ul>
<h3 id="方案-3bucardo">方案 3：Bucardo</h3>
<p>PG community async multi-master、Perl 寫、trigger-based：</p>
<p><strong>特性</strong>：</p>
<ul>
<li>完全開源</li>
<li>Trigger-based（不依賴 logical replication）</li>
<li>支援 multi-source replication（fan-in / fan-out）</li>
</ul>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>Async only — <em>higher latency conflict</em></li>
<li>Trigger overhead（影響 primary 寫吞吐）</li>
<li>維護 Perl + tools chain 不普及</li>
<li>對 <em>Sync 一致性</em> 需求不適用</li>
</ul>
<h2 id="multi-master-conflict-model">Multi-Master Conflict Model</h2>
<p>任何 multi-master 方案都要解決 <em>同一 row 兩地同時改</em> 的 conflict：</p>
<h3 id="conflict-來源">Conflict 來源</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Region A (primary 1)          Region B (primary 2)
</span></span><span class="line"><span class="ln">2</span><span class="cl">UPDATE orders                 UPDATE orders
</span></span><span class="line"><span class="ln">3</span><span class="cl">SET status=&#39;shipped&#39;          SET status=&#39;cancelled&#39;
</span></span><span class="line"><span class="ln">4</span><span class="cl">WHERE id=100                  WHERE id=100
</span></span><span class="line"><span class="ln">5</span><span class="cl">     ↓                              ↓
</span></span><span class="line"><span class="ln">6</span><span class="cl">   合併？哪個贏？</span></span></code></pre></div><p>跨 region 兩地各自 commit、replication lag 期間發現 conflict、必須 <em>自動 resolve</em>（不能丟給 application）。</p>
<h3 id="conflict-resolution-strategies">Conflict Resolution Strategies</h3>
<p><strong>1. Last-Write-Wins (LWW)</strong> — 最常見：</p>
<ul>
<li>比較 transaction commit timestamp、晚的贏</li>
<li>簡單但 <em>data loss</em>（前一個 commit 的變更被覆蓋）</li>
<li>需要 <em>clock 同步</em>（NTP）—  clock skew 造成不可預測</li>
</ul>
<p><strong>2. Column-level conflict resolution</strong>：</p>
<ul>
<li>不同 column 各自 LWW（status column 跟 amount column 獨立解）</li>
<li>比 row-level LWW 細、但需 application semantics 配合</li>
</ul>
<p><strong>3. User-defined trigger</strong>：</p>
<ul>
<li>寫 PG function 解 conflict</li>
<li>對 <em>特殊 business logic</em>（如：金額相加、不是覆蓋）有用</li>
<li>維護成本高</li>
</ul>
<p><strong>4. Manual reconciliation</strong>：</p>
<ul>
<li>Conflict 寫進 log table、application / DBA 手動處理</li>
<li>對 <em>無法自動 resolve</em> 場景（如金融）</li>
<li>高 ops cost</li>
</ul>
<p>對 99% case 用 LWW、接受 small data loss、application 設計 <em>idempotent / commutative</em> 操作避免衝突。</p>
<h3 id="conflict-機率取決於-application-pattern">Conflict 機率取決於 application pattern</h3>
<ul>
<li><em>Tenant-isolated</em> application（user_id 各自寫自己的 row）：基本無 conflict</li>
<li><em>Shared counter / inventory</em> application：高 conflict、multi-master 不適合</li>
<li><em>Append-only event log</em>：conflict 低、適合 multi-master</li>
</ul>
<h2 id="配置-step-by-steppgedge-為主">配置 step-by-step（pgEdge 為主）</h2>
<p>pgEdge 開源、最常見的 self-hosted 選擇。</p>
<h3 id="step-1在每個-region-node-裝-pgedge">Step 1：在每個 region node 裝 pgEdge</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Install pgEdge CLI</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">curl -fsSL https://pgedge-upstream.s3.amazonaws.com/REPO/install.py <span class="p">|</span> python3
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Setup PG + Spock + pgEdge</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">./pgedge install pg16
</span></span><span class="line"><span class="ln">6</span><span class="cl">./pgedge install spock</span></span></code></pre></div><h3 id="step-2配置每個-node">Step 2：配置每個 node</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 在 node1（us-east） 跑
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">node_create</span><span class="p">(</span><span class="n">node_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;node1&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node1.example.com port=5432 dbname=production&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 在 node2（eu-west）跑
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">node_create</span><span class="p">(</span><span class="n">node_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;node2&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node2.example.com port=5432 dbname=production&#39;</span><span class="p">);</span></span></span></code></pre></div><h3 id="step-3建-replication-set--subscribe">Step 3：建 replication set + subscribe</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 在 node1 建 default replication set + 加 tables
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">repset_add_all_tables</span><span class="p">(</span><span class="s1">&#39;default&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- 在 node1 subscribe node2
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">sub_create</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">subscription_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sub_n1_n2&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">provider_dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node2.example.com port=5432 dbname=production&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 在 node2 subscribe node1（雙向）
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">sub_create</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">subscription_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sub_n2_n1&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="n">provider_dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node1.example.com port=5432 dbname=production&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><h3 id="step-4設-conflict-resolution">Step 4：設 conflict resolution</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 設 LWW（預設）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">conflict_resolution_setting_set</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">conflict_type</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;update_origin_change&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">resolution_setting</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;apply_remote&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><h3 id="step-5驗證">Step 5：驗證</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 看 subscription 狀態
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">subscription</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 看 replication lag
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_replication</span><span class="p">;</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-lww-data-loss--application-沒設計-commutative">1. LWW data loss — Application 沒設計 commutative</h3>
<p>LWW 預設、兩 region 同時 UPDATE 同 row → 晚的 commit 贏、早的丟失。Application 看不到「我寫的不見了」、debug 困難。</p>
<p>修法：</p>
<ul>
<li>Application schema 設計 <em>tenant-isolated</em>（user_id 各自寫自己 row）</li>
<li>對 <em>shared counter / inventory</em> 用 <em>commutative operation</em>（INCREMENT not SET）</li>
<li>重要寫入加 <em>audit log</em> — conflict 仍寫到 audit、application 看 audit 知道發生過</li>
<li>真的需要 strict consistency 別用 multi-master、用 single-primary + reader 或 distributed SQL</li>
</ul>
<h3 id="2-sequence-collision--two-region-各自-next-同號">2. Sequence collision — Two region 各自 next 同號</h3>
<p><code>SERIAL</code> / <code>IDENTITY</code> 用 sequence、兩 region 各自 nextval 可能拿到同 number、INSERT 衝突（PK duplicate）。</p>
<p>修法：</p>
<ul>
<li>用 <em>staggered sequence range</em>：node1 用 1-1M、node2 用 1M+1 到 2M（用 <code>setval</code>）</li>
<li>或用 <em>UUID</em>（v4 / v7）作 PK、跨 node 無 collision</li>
<li>或 <em>sequence per-node namespace</em>：<code>CREATE SEQUENCE orders_id_node1 START 1 INCREMENT 2</code>（odd vs even）</li>
</ul>
<h3 id="3-ddl-replication-不自動">3. DDL replication 不自動</h3>
<p>PG logical replication（pgEdge / BDR 基礎）<em>不自動 replicate DDL</em>。每 node <code>CREATE TABLE</code> / <code>ALTER TABLE</code> 必須 <em>分別跑</em>。</p>
<p>修法：</p>
<ul>
<li>用 <em>deployment automation</em>（Ansible / Terraform）對所有 node 同時跑 DDL</li>
<li>pgEdge 提供 <code>spock.replicate_ddl(...)</code> 把 DDL 轉成可 replicate event</li>
<li>BDR Enterprise 有 <em>DDL replication</em>（商業 feature）</li>
<li>DDL 變更前確認 <em>所有 node 都健康</em>、減少 partial state</li>
</ul>
<h3 id="4-conflict-log-治理--log-table-爆滿">4. Conflict log 治理 — Log table 爆滿</h3>
<p>每個 conflict 寫進 <code>spock.conflict_log</code> / <code>bdr.conflict_history</code> 等 table、log 累積 disk 爆。</p>
<p>修法：</p>
<ul>
<li>設 <em>log retention</em>：cron 定期 archive + delete 老 conflict log</li>
<li>監控 conflict rate — 高 conflict rate 是 application 設計問題（不是 ops 問題）</li>
<li>對 <em>strict business</em> conflict 寫進 application-level audit table、不只 system log</li>
</ul>
<h3 id="5-failover-後-timeline-分歧">5. Failover 後 timeline 分歧</h3>
<p>Multi-master 設計上 <em>每 region 是 primary</em>、Region A 掛了 Region B 接管 — 但 Region A 復活後 <em>仍認為自己是 primary</em>。如果 Region A 復活前已有寫入沒 replicate 出去、resolution 跟 LWW 衝突。</p>
<p>修法：</p>
<ul>
<li><em>Fence Region A 復活</em>：物理 fence（network firewall）+ 手動 unfence 流程</li>
<li>用 <em>etcd / Consul</em> 跟 BDR / Spock 整合 leader election（避免 split-brain）</li>
<li>對 cross-region multi-master、必須有 <em>runbook</em> 處理 region 復活流程、不靠自動</li>
</ul>
<h2 id="何時用-multi-master-vs-不用">何時用 multi-master vs 不用</h2>
<table>
  <thead>
      <tr>
          <th>情境</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>真正 cross-region active-active write 需求</td>
          <td>BDR / pgEdge</td>
      </tr>
      <tr>
          <td>不可中斷 maintenance（zero downtime upgrade）</td>
          <td>BDR / pgEdge</td>
      </tr>
      <tr>
          <td>高 conflict rate（shared counter / inventory）</td>
          <td>不要 multi-master、用 distributed SQL</td>
      </tr>
      <tr>
          <td>Read scaling 為主、可接受 stale read</td>
          <td>streaming replication + read replica（更簡單）</td>
      </tr>
      <tr>
          <td>Strict consistency 需求</td>
          <td>single-primary + sync replication 或 Aurora DSQL / Spanner</td>
      </tr>
      <tr>
          <td>預算敏感 + 不想養 BDR / pgEdge ops</td>
          <td>不要 multi-master、用 managed distributed SQL</td>
      </tr>
  </tbody>
</table>
<h2 id="跟-mysql-group-replication-對比">跟 MySQL Group Replication 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PG Multi-Master</th>
          <th>MySQL Group Replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>內建？</td>
          <td>否、需 extension</td>
          <td>是、5.7+ 內建</td>
      </tr>
      <tr>
          <td>商業 vs 開源</td>
          <td>BDR 商業 / pgEdge 開源</td>
          <td>Oracle 商業 / community 都行</td>
      </tr>
      <tr>
          <td>Sync mode</td>
          <td>可（BDR eager）</td>
          <td>是（certification-based）</td>
      </tr>
      <tr>
          <td>Conflict resolution</td>
          <td>LWW / column / user-defined</td>
          <td>Certification-based（distributed transaction）</td>
      </tr>
      <tr>
          <td>Production maturity</td>
          <td>BDR 高、pgEdge 中</td>
          <td>高（Oracle 推）</td>
      </tr>
      <tr>
          <td>Use case 比例</td>
          <td>少（PG 多用 single-primary）</td>
          <td>較多（MySQL 推 InnoDB Cluster）</td>
      </tr>
  </tbody>
</table>
<p>MySQL GR 內建 + Oracle 推、PG 沒對應內建。對 multi-master 需求重的 org、MySQL 走 GR 路徑更直接。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication Topology</h3>
<p>Multi-master 是 <em>streaming replication 之上的 logical replication 加雙向</em>、不取代 streaming。Streaming 仍給 standby / failover、multi-master 給 active-active write。詳見 <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>。</p>
<h3 id="跟-logical-replication">跟 Logical Replication</h3>
<p>pgEdge / BDR 都基於 logical replication slot、跟 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> 共用 PG logical decoding infrastructure、但 <em>配置 + tooling</em> 不同。</p>
<h3 id="跟-mvcc">跟 MVCC</h3>
<p>Multi-master 的 conflict 在 <em>commit 後</em> 偵測（async）、不在 transaction 內。跟單機 MVCC（同 cluster 內 transaction snapshot）不同層。詳見 <a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（streaming + multi-master 共存）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a>（logical decoding 基礎）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a>（multi-master conflict vs 單機 MVCC）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PG Patroni HA</a>（single-primary HA 替代方案）</li>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>（multi-master vs distributed SQL）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a>（sibling、不同實作）</li>
<li>官方：<a href="https://www.enterprisedb.com/products/edb-postgres-distributed-bdr">EDB BDR</a> / <a href="https://www.pgedge.com/">pgEdge</a> / <a href="https://github.com/pgEdge/spock">Spock GitHub</a> / <a href="https://bucardo.org/">Bucardo</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/query-optimization/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/query-optimization/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>query optimization&lt;/em> — EXPLAIN ANALYZE / auto_explain / pg_hint_plan 三層工具跟 4 個實際 case。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="4-個常見-production-case">4 個常見 production case&lt;/h2>
&lt;p>PG query 慢的 root cause 多數是 &lt;em>planner 選錯 plan&lt;/em>。從以下 4 個 case 進入 query optimization：&lt;/p>
&lt;h3 id="case-15-秒--50ms--seq-scan-vs-index">Case 1：5 秒 → 50ms — Seq scan vs index&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- 慢 (5 秒)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">amount&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">JOIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">customer_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">region&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;TW&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AND&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2026-05-01&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>EXPLAIN (ANALYZE, BUFFERS)&lt;/code>：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Hash Join (cost=20000..50000 rows=100 width=...) (actual time=4900..5000 rows=10000)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> -&amp;gt; Seq Scan on customers c (cost=0..20000 rows=1000000 width=...)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> Filter: (region = &amp;#39;TW&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> Rows Removed by Filter: 900000
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> -&amp;gt; Hash (cost=...)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl"> -&amp;gt; Index Scan on orders_created_idx&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>問題：&lt;code>customers.region&lt;/code> 沒 index、planner 選 seq scan、實際 region=TW 只 10% row。修法：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CONCURRENTLY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">idx_customers_region&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">region&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ANALYZE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- 更新 statistics、讓 planner 看到新 index&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>加完 5 秒降 50ms。&lt;/p>
&lt;h3 id="case-230-秒--200ms--hash-join-沒觸發用-nested-loop">Case 2：30 秒 → 200ms — Hash join 沒觸發、用 nested loop&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">count&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LEFT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">JOIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">GROUP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>EXPLAIN ANALYZE 顯示 &lt;em>Nested Loop&lt;/em> 跑 1M 次 inner loop、執行 30 秒。Planner 估錯 row count、選 nested loop。Hash join 應該 &amp;lt; 200ms。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>query optimization</em> — EXPLAIN ANALYZE / auto_explain / pg_hint_plan 三層工具跟 4 個實際 case。</p></blockquote>
<hr>
<h2 id="4-個常見-production-case">4 個常見 production case</h2>
<p>PG query 慢的 root cause 多數是 <em>planner 選錯 plan</em>。從以下 4 個 case 進入 query optimization：</p>
<h3 id="case-15-秒--50ms--seq-scan-vs-index">Case 1：5 秒 → 50ms — Seq scan vs index</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 慢 (5 秒)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">amount</span><span class="p">,</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">name</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">c</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">;</span></span></span></code></pre></div><p><code>EXPLAIN (ANALYZE, BUFFERS)</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Hash Join  (cost=20000..50000 rows=100 width=...) (actual time=4900..5000 rows=10000)
</span></span><span class="line"><span class="ln">2</span><span class="cl">  -&gt;  Seq Scan on customers c  (cost=0..20000 rows=1000000 width=...)
</span></span><span class="line"><span class="ln">3</span><span class="cl">      Filter: (region = &#39;TW&#39;)
</span></span><span class="line"><span class="ln">4</span><span class="cl">      Rows Removed by Filter: 900000
</span></span><span class="line"><span class="ln">5</span><span class="cl">  -&gt;  Hash  (cost=...)
</span></span><span class="line"><span class="ln">6</span><span class="cl">      -&gt;  Index Scan on orders_created_idx</span></span></code></pre></div><p>問題：<code>customers.region</code> 沒 index、planner 選 seq scan、實際 region=TW 只 10% row。修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">CONCURRENTLY</span><span class="w"> </span><span class="n">idx_customers_region</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">customers</span><span class="p">(</span><span class="n">region</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">customers</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 更新 statistics、讓 planner 看到新 index</span></span></span></code></pre></div><p>加完 5 秒降 50ms。</p>
<h3 id="case-230-秒--200ms--hash-join-沒觸發用-nested-loop">Case 2：30 秒 → 200ms — Hash join 沒觸發、用 nested loop</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="n">o</span><span class="p">.</span><span class="n">id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="n">u</span><span class="w"> </span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">name</span><span class="p">;</span></span></span></code></pre></div><p>EXPLAIN ANALYZE 顯示 <em>Nested Loop</em> 跑 1M 次 inner loop、執行 30 秒。Planner 估錯 row count、選 nested loop。Hash join 應該 &lt; 200ms。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ANALYZE</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- 提高 default_statistics_target 對 critical column
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">ALTER</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="k">STATISTICS</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span></span></span></code></pre></div><p>統計精度提升、planner 估 row count 準、自動切 hash join。</p>
<h3 id="case-38-秒--100ms--multi-column-統計缺">Case 3：8 秒 → 100ms — Multi-column 統計缺</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pending&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="p">;</span></span></span></code></pre></div><p><code>status = 'pending'</code> 5% row、<code>region = 'TW'</code> 10% row。Planner 假設兩 column 獨立、估 0.5% (5K row)。實際 status=&lsquo;pending&rsquo; 跟 region=&lsquo;TW&rsquo; 強相關（TW 訂單多 pending）、實際 4% (40K row)。Planner 估錯 8x、選錯 plan。</p>
<p>修法（PG 10+）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">STATISTICS</span><span class="w"> </span><span class="n">stats_orders_status_region</span><span class="w"> </span><span class="p">(</span><span class="n">dependencies</span><span class="p">,</span><span class="w"> </span><span class="n">ndistinct</span><span class="p">,</span><span class="w"> </span><span class="n">mcv</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">status</span><span class="p">,</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 之後 planner 知道 status+region 相關度、估準</span></span></span></code></pre></div><h3 id="case-420-秒--5-秒--parallel-query-沒觸發">Case 4：20 秒 → 5 秒 — Parallel query 沒觸發</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">region</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">),</span><span class="w"> </span><span class="k">sum</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">region</span><span class="p">;</span></span></span></code></pre></div><p><code>orders</code> 100M row、預期 PG parallel scan + parallel aggregate、實際 single worker 跑 20 秒。</p>
<p>EXPLAIN：<code>Workers Planned: 0</code>。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">max_parallel_workers_per_gather</span> <span class="o">=</span> <span class="s">4</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">max_parallel_workers</span> <span class="o">=</span> <span class="s">8</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">max_worker_processes</span> <span class="o">=</span> <span class="s">16</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">parallel_setup_cost</span> <span class="o">=</span> <span class="s">100        # 預設 1000、降低讓 planner 更敢 parallel</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">parallel_tuple_cost</span> <span class="o">=</span> <span class="s">0.01       # 預設 0.1</span></span></span></code></pre></div><p>並行後 5 秒。</p>
<h2 id="explain-三層工具">EXPLAIN 三層工具</h2>
<h3 id="tool-1explain--plan-preview">Tool 1：EXPLAIN — Plan preview</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>輸出每個 node 的 <em>估計</em> cost / row count / width。<strong>用於 quick plan check</strong>。</p>
<p>關鍵欄位：</p>
<ul>
<li><code>Plan node 類型</code>：<code>Seq Scan</code> &lt; <code>Index Scan</code> &lt; <code>Index Only Scan</code>、警訊看 <em>unexpected</em> node type</li>
<li><code>cost=START..END</code>：planner 估的 cost、START 是 startup cost、END 是 total</li>
<li><code>rows</code>：估計 output row 數</li>
<li><code>width</code>：每 row average byte（影響 sort / hash memory）</li>
</ul>
<h3 id="tool-2explain-analyze--實際執行--對比-estimate">Tool 2：EXPLAIN ANALYZE — 實際執行 + 對比 estimate</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">,</span><span class="w"> </span><span class="k">VERBOSE</span><span class="p">)</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>差別：實際 <em>跑 query</em>、輸出實際 row count / time、跟 estimate 對比：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Hash Join  (cost=20000..50000 rows=100) (actual time=400..500 rows=10000 loops=1)</span></span></code></pre></div><p><code>rows=100 (estimate)</code> vs <code>rows=10000 (actual)</code> — 估錯 100x、planner 可能選錯 plan。<code>BUFFERS</code> 顯示 disk read vs buffer cache hit。</p>
<p><strong>注意</strong>：EXPLAIN ANALYZE <em>實際跑 query</em>、修改性 query（UPDATE / DELETE）會真的改 data。讀 query 安全。修改性 query 包 transaction：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">EXPLAIN</span><span class="w"> </span><span class="k">ANALYZE</span><span class="w"> </span><span class="k">UPDATE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;x&#39;</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ROLLBACK</span><span class="p">;</span></span></span></code></pre></div><h3 id="tool-3auto_explain--production-query-自動-capture">Tool 3：auto_explain — Production query 自動 capture</h3>
<p><code>auto_explain</code> extension 自動 log slow query 的 plan：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">shared_preload_libraries</span> <span class="o">=</span> <span class="s">&#39;auto_explain&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">auto_explain.log_min_duration</span> <span class="o">=</span> <span class="s">&#39;1s&#39;    # 超過 1 秒 log plan</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">auto_explain.log_analyze</span> <span class="o">=</span> <span class="s">on            # 含 ANALYZE 統計</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">auto_explain.log_buffers</span> <span class="o">=</span> <span class="s">on</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">auto_explain.log_format</span> <span class="o">=</span> <span class="s">&#39;json&#39;         # JSON 格式給工具消費</span></span></span></code></pre></div><p>Production slow query 自動進 log、不必手動 EXPLAIN。組合 pg_stat_statements + auto_explain 是 PG 標準 query observability。</p>
<h2 id="pg_hint_plan-vs-planner-guc">pg_hint_plan vs Planner GUC</h2>
<p>PG 兩種方式 nudge planner：</p>
<h3 id="planner-gucglobal">Planner GUC（global）</h3>
<p><code>postgresql.conf</code> 內：</p>
<ul>
<li><code>enable_seqscan = off</code> — 禁用 seq scan（force index）</li>
<li><code>enable_nestloop = off</code> — 禁用 nested loop（force hash/merge join）</li>
<li><code>random_page_cost = 1.1</code> — SSD 設低（預設 4 是 HDD assumption）</li>
<li><code>effective_cache_size = '16GB'</code> — buffer pool + OS cache 估、影響 planner</li>
</ul>
<p>GUC 是 <em>global</em> — 影響所有 query。對 <em>單一 query 用 hint</em>：</p>
<h3 id="pg_hint_plan-extensionper-query-hint">pg_hint_plan extension（per-query hint）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 強制特定 plan
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="cm">/*+ IndexScan(orders idx_orders_status) NestLoop(orders customers) */</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>Hint 形態：</p>
<ul>
<li><code>IndexScan(t1 idx_name)</code> — 強制 index scan</li>
<li><code>SeqScan(t1)</code> — 強制 seq scan</li>
<li><code>HashJoin(t1 t2)</code> / <code>NestLoop(t1 t2)</code> / <code>MergeJoin(t1 t2)</code></li>
<li><code>Leading(t1 t2 t3)</code> — 強制 join order</li>
<li><code>Rows(t1 t2 #100)</code> — 強制 row 估計</li>
</ul>
<p><strong>推薦</strong>：</p>
<ul>
<li>全 cluster 行為：用 GUC（如 <code>random_page_cost</code>）</li>
<li>單 query 行為：用 pg_hint_plan（不污染其他 query）</li>
<li>不要過度 hint — planner 多數時候 <em>是對的</em>、hint 是 last resort</li>
</ul>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-statistics-過時--planner-估錯-row-count">1. Statistics 過時 — Planner 估錯 row count</h3>
<p><code>ANALYZE</code> 是 autovacuum 一部分、預設 <em>autovacuum_analyze_scale_factor=0.1</em>（10% row 變動才 analyze）。對 <em>快速 grow 的表</em>（log / event）、ANALYZE 跟不上、planner 用過時 statistics。</p>
<p>修法：</p>
<ul>
<li>
<p>對 critical table 設 <em>較 aggressive autovacuum_analyze_scale_factor</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="n">autovacuum_analyze_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">02</span><span class="p">);</span></span></span></code></pre></div></li>
<li>
<p>對 <em>大批量寫入後</em>、手動 <code>ANALYZE events;</code></p>
</li>
<li>
<p>監控 <code>pg_stat_user_tables.last_analyze</code> — 跟 row count 比、判定是否需手動 trigger</p>
</li>
</ul>
<h3 id="2-multi-column-statistics--planner-假設-column-獨立">2. Multi-column statistics — Planner 假設 column 獨立</h3>
<p>如 Case 3、單 column statistics 對 <em>相關 column</em> 估錯。</p>
<p>修法：</p>
<ul>
<li>對 <em>常一起 query 的 column 組合</em>、建 <code>CREATE STATISTICS</code>（PG 10+）</li>
<li>3 種 type：<code>dependencies</code>（functional dependency）、<code>ndistinct</code>（multi-column distinct count）、<code>mcv</code>（most common value combinations）</li>
<li>設完 <em>必須跑 ANALYZE</em> 才生效</li>
</ul>
<h3 id="3-cost-base-setting-不對齊硬體--planner-偏-seq-scan">3. Cost-base setting 不對齊硬體 — Planner 偏 seq scan</h3>
<p>預設 <code>random_page_cost = 4</code>、<code>seq_page_cost = 1</code> 是 <em>HDD assumption</em>（random IO 比 sequential 慢 4x）。SSD / NVMe random / seq IO 差別小、planner 不該 4x penalty random。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- SSD
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">random_page_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- NVMe
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">random_page_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_reload_conf</span><span class="p">();</span></span></span></code></pre></div><p><code>random_page_cost</code> 改了 planner 對 index scan 的 cost 估計更準、自動選 index 更積極。</p>
<h3 id="4-effective_cache_size-不對齊實際-ram">4. <code>effective_cache_size</code> 不對齊實際 RAM</h3>
<p><code>effective_cache_size</code> 預設 4 GB、planner 假設 buffer pool + OS cache 共 4 GB。實際 server 64 GB RAM、<code>shared_buffers = 16GB</code>、OS page cache ~30 GB、實際可用 cache 46 GB。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">effective_cache_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;46GB&#39;</span><span class="p">;</span><span class="w">  </span><span class="c1">-- shared_buffers + OS cache 估</span></span></span></code></pre></div><p>提升後 planner 估 query 多數 page 在 cache、降低 <em>估計 random IO cost</em>、選 index 更積極。</p>
<h3 id="5-parallel-query-不觸發">5. Parallel query 不觸發</h3>
<p>預設 <code>max_parallel_workers_per_gather = 2</code>、有些 workload 不夠。或 <em>table size 太小</em>、<code>min_parallel_table_scan_size = 8MB</code> 預設、小表不 parallel。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">max_parallel_workers_per_gather</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">4</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">parallel_setup_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">parallel_tuple_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">01</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">min_parallel_table_scan_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;0&#39;</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 任何 size 都 parallel</span></span></span></code></pre></div><p>監控 <code>EXPLAIN</code> 的 <code>Workers Planned</code> 數量、看是否真 parallel。</p>
<h2 id="觀測-metric">觀測 metric</h2>
<p>Production 持續 monitor：</p>
<ul>
<li><code>pg_stat_statements</code>：每個 query digest 累計 calls / time / rows / IO</li>
<li><code>auto_explain</code> log：slow query 的實際 plan + ANALYZE 統計</li>
<li><code>pg_stat_user_tables.last_analyze</code> / <code>last_autoanalyze</code>：statistics 新鮮度</li>
<li><code>pg_stat_user_indexes.idx_scan</code>：每個 index 使用次數 — 0 表示沒用、可考慮 drop</li>
</ul>
<p>把這些丟進 Datadog / Prometheus（用 <code>postgres_exporter</code> / <code>pg_exporter</code>）做 trend analysis。</p>
<h2 id="跟-mysql-query-optimization-對照">跟 MySQL Query Optimization 對照</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PG</th>
          <th>MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Query plan preview</td>
          <td><code>EXPLAIN</code></td>
          <td><code>EXPLAIN</code></td>
      </tr>
      <tr>
          <td>實際執行統計</td>
          <td><code>EXPLAIN ANALYZE</code></td>
          <td><code>EXPLAIN ANALYZE</code> (8.0+)</td>
      </tr>
      <tr>
          <td>Auto-capture</td>
          <td><code>auto_explain</code> extension</td>
          <td><code>slow_query_log</code> + <code>pt-query-digest</code></td>
      </tr>
      <tr>
          <td>Optimizer trace</td>
          <td>log_planner_stats / log_executor_stats</td>
          <td><code>optimizer_trace</code> (JSON)</td>
      </tr>
      <tr>
          <td>Per-query hint</td>
          <td><code>pg_hint_plan</code> extension</td>
          <td>optimizer hint comment (<code>/*+ */</code>)</td>
      </tr>
      <tr>
          <td>Multi-column statistics</td>
          <td><code>CREATE STATISTICS</code></td>
          <td>無原生（依賴 index 統計）</td>
      </tr>
      <tr>
          <td>Parallel query</td>
          <td>Full (scan / agg / join, PG 9.6+)</td>
          <td>受限 (8.0 hash join)</td>
      </tr>
      <tr>
          <td>Cost-base setting</td>
          <td>random_page_cost / effective_cache_size</td>
          <td>隱性、optimizer 預設</td>
      </tr>
  </tbody>
</table>
<p>PG planner 整體成熟、複雜 OLAP-style query 處理較好。MySQL 8.0 補了不少（histograms / hash join）但複雜 query 仍弱於 PG。詳見 <a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">MySQL Query Optimization</a>。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-autovacuum-tuning">跟 Autovacuum Tuning</h3>
<p>ANALYZE 是 autovacuum 一部分、autovacuum 跟不上 → statistics 過時 → planner 估錯。詳見 <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>。</p>
<h3 id="跟-replication-topology">跟 Replication Topology</h3>
<p>Standby 上跑 query 用同 statistics（streaming replication copy 整個 system catalog）、planner 行為一致。但 <em>standby 有 hot_standby_feedback</em> 影響 primary autovacuum / ANALYZE 行為。詳見 <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>。</p>
<h3 id="跟-partitioning">跟 Partitioning</h3>
<p>Partition pruning 跟 query plan 緊密 — <code>EXPLAIN</code> 看是否 prune 對的 partition。詳見 <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>。</p>
<h2 id="何時用-pg_hint_plan-vs-guc">何時用 pg_hint_plan vs GUC</h2>
<table>
  <thead>
      <tr>
          <th>情境</th>
          <th>選擇</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>全 cluster 行為（如 SSD random_page_cost）</td>
          <td>GUC</td>
      </tr>
      <tr>
          <td>單一 critical query 強制特定 plan</td>
          <td>pg_hint_plan</td>
      </tr>
      <tr>
          <td>暫時 disable 某類 plan 給 debug</td>
          <td><code>SET enable_xxx=off</code> per-session</td>
      </tr>
      <tr>
          <td>Production stable use</td>
          <td>GUC + multi-column statistics 為主、hint 為 last resort</td>
      </tr>
  </tbody>
</table>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a>（ANALYZE 跟 statistics 新鮮度）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（standby planner 行為）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">PG Declarative Partitioning</a>（partition pruning）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">MySQL Query Optimization</a>（sibling、不同 optimizer 成熟度）</li>
<li>官方：<a href="https://www.postgresql.org/docs/current/sql-explain.html">EXPLAIN</a> / <a href="https://github.com/ossc-db/pg_hint_plan">pg_hint_plan</a> / <a href="https://www.postgresql.org/docs/current/auto-explain.html">auto_explain</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL MVCC + Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>MVCC + lock model&lt;/em> — PG 並行控制機制跟跟 MySQL lock-based 不同。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-mvcc每次更新都-新增-tuple不改舊版">PG MVCC：每次更新都 &lt;em>新增 tuple&lt;/em>、不改舊版&lt;/h2>
&lt;p>PG 的並行控制核心是 &lt;em>Multi-Version Concurrency Control&lt;/em> — UPDATE 不修改原 row、是 &lt;em>新增&lt;/em> 一個 tuple version、舊 version 留在 table 直到 VACUUM 清理：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">原 row: (id=1, status=&amp;#39;pending&amp;#39;, xmin=100, xmax=NULL)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓ UPDATE status=&amp;#39;shipped&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">新 tuple: (id=1, status=&amp;#39;shipped&amp;#39;, xmin=200, xmax=NULL)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">舊 tuple 標 xmax=200（不刪、給其他 transaction 看舊 version）&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>xmin&lt;/code> / &lt;code>xmax&lt;/code> 是 &lt;em>creator transaction id&lt;/em> / &lt;em>destroyer transaction id&lt;/em>。每個 SELECT 用 &lt;em>snapshot&lt;/em>（含當下 active transaction list）判斷哪些 tuple 對自己可見：&lt;/p>
&lt;ul>
&lt;li>自己 transaction id &amp;gt; tuple.xmin 且 (tuple.xmax = NULL 或自己 transaction id &amp;lt; tuple.xmax) → 可見&lt;/li>
&lt;li>否則 → 看不到（過去 / 未來版本）&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>結果&lt;/strong>：&lt;/p>
&lt;ul>
&lt;li>&lt;em>Readers 不 lock writers&lt;/em>：SELECT 看 snapshot、不 block UPDATE&lt;/li>
&lt;li>&lt;em>Writers 不 lock readers&lt;/em>：UPDATE 寫新 tuple、不影響正在跑的 SELECT snapshot&lt;/li>
&lt;li>&lt;em>Writers 只 lock 同一 row 的 writers&lt;/em>：兩個 UPDATE 同 row 才 conflict&lt;/li>
&lt;/ul>
&lt;p>跟 MySQL InnoDB &lt;em>lock-based&lt;/em>（&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">Lock Contention&lt;/a>）對比：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>MVCC + lock model</em> — PG 並行控制機制跟跟 MySQL lock-based 不同。</p></blockquote>
<hr>
<h2 id="pg-mvcc每次更新都-新增-tuple不改舊版">PG MVCC：每次更新都 <em>新增 tuple</em>、不改舊版</h2>
<p>PG 的並行控制核心是 <em>Multi-Version Concurrency Control</em> — UPDATE 不修改原 row、是 <em>新增</em> 一個 tuple version、舊 version 留在 table 直到 VACUUM 清理：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">原 row:    (id=1, status=&#39;pending&#39;, xmin=100, xmax=NULL)
</span></span><span class="line"><span class="ln">2</span><span class="cl">                 ↓ UPDATE status=&#39;shipped&#39;
</span></span><span class="line"><span class="ln">3</span><span class="cl">新 tuple:  (id=1, status=&#39;shipped&#39;, xmin=200, xmax=NULL)
</span></span><span class="line"><span class="ln">4</span><span class="cl">舊 tuple 標 xmax=200（不刪、給其他 transaction 看舊 version）</span></span></code></pre></div><p><code>xmin</code> / <code>xmax</code> 是 <em>creator transaction id</em> / <em>destroyer transaction id</em>。每個 SELECT 用 <em>snapshot</em>（含當下 active transaction list）判斷哪些 tuple 對自己可見：</p>
<ul>
<li>自己 transaction id &gt; tuple.xmin 且 (tuple.xmax = NULL 或自己 transaction id &lt; tuple.xmax) → 可見</li>
<li>否則 → 看不到（過去 / 未來版本）</li>
</ul>
<p><strong>結果</strong>：</p>
<ul>
<li><em>Readers 不 lock writers</em>：SELECT 看 snapshot、不 block UPDATE</li>
<li><em>Writers 不 lock readers</em>：UPDATE 寫新 tuple、不影響正在跑的 SELECT snapshot</li>
<li><em>Writers 只 lock 同一 row 的 writers</em>：兩個 UPDATE 同 row 才 conflict</li>
</ul>
<p>跟 MySQL InnoDB <em>lock-based</em>（<a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">Lock Contention</a>）對比：</p>
<ul>
<li>MySQL：SELECT FOR UPDATE 用 gap lock 防 phantom、deadlock 機率高</li>
<li>PG：MVCC + snapshot 自然防 phantom（read 看 snapshot）、deadlock 少</li>
</ul>
<p>但 PG 代價是 <em>VACUUM 治理</em> — dead tuple 不清理會佔 disk + 影響 query 效率。詳見 <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>。</p>
<h2 id="pg-4-種-lock">PG 4 種 lock</h2>
<p>PG 仍有 lock、但場景跟 MySQL 不同：</p>
<h3 id="1-row-level-lock--主要由-update--delete--select-for-update-取">1. Row-level lock — 主要由 UPDATE / DELETE / SELECT FOR UPDATE 取</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">UPDATE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- 對 id=100 row 加 ROW EXCLUSIVE lock
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1">-- 其他 transaction 試 UPDATE / DELETE id=100 必須等</span></span></span></code></pre></div><p>Row-level lock <em>不 block reader</em>（SELECT 看 snapshot、不檢查 lock）。</p>
<h3 id="2-table-level-lock--ddl-跟少數-select-for-場景">2. Table-level lock — DDL 跟少數 SELECT FOR 場景</h3>
<p>PG 有 8 種 table lock mode、嚴重程度遞增：</p>
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>行為</th>
          <th>衝突</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ACCESS SHARE</td>
          <td>SELECT 跑</td>
          <td>跟 ACCESS EXCLUSIVE 衝突</td>
      </tr>
      <tr>
          <td>ROW SHARE</td>
          <td>SELECT FOR UPDATE / FOR SHARE</td>
          <td>跟 EXCLUSIVE 衝突</td>
      </tr>
      <tr>
          <td>ROW EXCLUSIVE</td>
          <td>UPDATE / DELETE / INSERT</td>
          <td>跟 SHARE 衝突</td>
      </tr>
      <tr>
          <td>SHARE UPDATE EXCLUSIVE</td>
          <td>VACUUM / ANALYZE / CREATE INDEX CONCURRENTLY</td>
          <td>跟同 mode + 高 mode 衝突</td>
      </tr>
      <tr>
          <td>SHARE</td>
          <td>CREATE INDEX（non-concurrent）</td>
          <td>跟 ROW EXCLUSIVE 衝突</td>
      </tr>
      <tr>
          <td>SHARE ROW EXCLUSIVE</td>
          <td>CREATE TRIGGER / 某些 ALTER</td>
          <td>跟 ROW EXCLUSIVE 衝突</td>
      </tr>
      <tr>
          <td>EXCLUSIVE</td>
          <td>REFRESH MATERIALIZED VIEW</td>
          <td>跟所有 + 自身衝突</td>
      </tr>
      <tr>
          <td>ACCESS EXCLUSIVE</td>
          <td>DROP / ALTER TABLE / VACUUM FULL</td>
          <td>跟所有衝突</td>
      </tr>
  </tbody>
</table>
<p>DDL（ALTER / DROP）拿 ACCESS EXCLUSIVE、跟所有衝突。Production 跑 ALTER 必須短時間或走 <a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">Online Schema Change</a>。</p>
<h3 id="3-advisory-lock--application-自己控">3. Advisory lock — Application 自己控</h3>
<p>PG 提供 <em>advisory lock</em> 給 application 用、不關 row / table 結構：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Session 1
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_advisory_lock</span><span class="p">(</span><span class="mi">12345</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- 跑 critical section
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_advisory_unlock</span><span class="p">(</span><span class="mi">12345</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Session 2
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_try_advisory_lock</span><span class="p">(</span><span class="mi">12345</span><span class="p">);</span><span class="w">  </span><span class="c1">-- 試取、不阻塞、返回 false</span></span></span></code></pre></div><p>用途：</p>
<ul>
<li>Application-level 互斥（如：cron job 同時只跑一個）</li>
<li>跨 connection 同步（PG-managed mutex）</li>
<li>Distributed transaction coordinator（lightweight）</li>
</ul>
<p>跟 row lock 不同：advisory lock 不關 row、application 自定義 lock ID 語義。</p>
<h3 id="4-predicate-lock--serializable-isolation-才用">4. Predicate lock — SERIALIZABLE isolation 才用</h3>
<p>PG SERIALIZABLE 用 <em>Serializable Snapshot Isolation (SSI)</em>、追蹤 <em>predicate</em>（query 條件）而不是 <em>row</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SET</span><span class="w"> </span><span class="k">TRANSACTION</span><span class="w"> </span><span class="k">ISOLATION</span><span class="w"> </span><span class="k">LEVEL</span><span class="w"> </span><span class="k">SERIALIZABLE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Predicate lock 紀錄這個 query 看了哪些 predicate
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pending&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 其他 transaction INSERT pending order
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1">-- 提交時：PG 偵測 anomaly、rollback 之一
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">COMMIT</span><span class="p">;</span></span></span></code></pre></div><p>跟 MySQL gap lock 不同：</p>
<ul>
<li>MySQL gap lock：<em>pre-lock</em>、防 phantom 在 query 期間</li>
<li>PG predicate lock：<em>post-detect</em>、commit 時偵測 anomaly、退回 transaction</li>
</ul>
<p>PG SSI 對 <em>寫入吞吐影響低</em>（不 pre-lock）、但 <em>transaction rollback 機率高</em>（要 application retry）。</p>
<h2 id="pg-預設-isolationread-committed">PG 預設 isolation：READ COMMITTED</h2>
<p>PG 預設 READ COMMITTED、跟 MySQL InnoDB 預設 REPEATABLE READ 不同：</p>
<table>
  <thead>
      <tr>
          <th>Isolation</th>
          <th>PG 行為</th>
          <th>MySQL InnoDB 對應</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>READ UNCOMMITTED</td>
          <td>PG 視為 READ COMMITTED（不真的支援 dirty read）</td>
          <td>MySQL 真支援</td>
      </tr>
      <tr>
          <td>READ COMMITTED</td>
          <td>每 statement 看當下 committed snapshot（PG 預設）</td>
          <td>一致</td>
      </tr>
      <tr>
          <td>REPEATABLE READ</td>
          <td>Transaction 內 fixed snapshot（純 MVCC）</td>
          <td>MVCC snapshot + gap lock 防 phantom（兩者都 MVCC、差在 phantom 防護機制：PG 靠 snapshot version visibility、InnoDB 加 gap lock pre-lock 範圍）</td>
      </tr>
      <tr>
          <td>SERIALIZABLE</td>
          <td>SSI、commit 時偵測 anomaly</td>
          <td>強 lock + gap</td>
      </tr>
  </tbody>
</table>
<p><strong>對 application code 含意</strong>：</p>
<ul>
<li>PG REPEATABLE READ 對 <em>寫入吞吐</em> 影響低（不 pre-lock、只 retry）</li>
<li>沒 gap lock → INSERT 不被 lock-induced 阻塞</li>
<li>Deadlock 機率比 MySQL 低數量級</li>
</ul>
<p>實務 PG production：用預設 READ COMMITTED 即可、SERIALIZABLE 留給 <em>strict consistency 需求</em>（金融 / 訂單）但接受 retry。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-idle-transaction-卡-vacuum--bloat-暴增">1. Idle transaction 卡 vacuum — Bloat 暴增</h3>
<p>PG MVCC 仰賴 <em>VACUUM 清理 dead tuple</em>。VACUUM 只清理 <em>沒 active transaction 看得到的 dead tuple</em>。如果有 <em>idle in transaction</em> session 持續開著（application connection pool 連線忘關 transaction）、VACUUM 看不到 <em>該 transaction snapshot 之後的 dead tuple</em>、累積 bloat。</p>
<p>修法：</p>
<ul>
<li>監控 <code>pg_stat_activity</code> 看 <code>state = 'idle in transaction'</code> 持續時間</li>
<li>設 <code>idle_in_transaction_session_timeout = '5min'</code> — 超時 PG 自動 kill 該 session</li>
<li>Application connection pool 配置 <em>不留 transaction 開著</em>（如：pgBouncer transaction pool 自動 commit / rollback）</li>
</ul>
<h3 id="2-select-for-update-跨-transaction--application-retry-麻煩">2. SELECT FOR UPDATE 跨 transaction — Application retry 麻煩</h3>
<p>跟 MySQL 不同：PG SELECT FOR UPDATE 不會 <em>block 其他 SELECT</em>（讀仍可繼續）、但 <em>block 其他 UPDATE / FOR UPDATE</em>。若 application 在 transaction 內 SELECT FOR UPDATE、其他 transaction 等。</p>
<p>如果 application 設計 <em>跨 transaction 持 lock</em>（如：取 lock + return UI + 等用戶操作 + commit）、容易撞 idle in transaction 跟其他 transaction wait。</p>
<p>修法：</p>
<ul>
<li><em>Transaction 短</em>：取 FOR UPDATE → 立刻處理 → commit、不跨 user interaction</li>
<li>跨 user interaction 用 <em>advisory lock</em> 或 application-level state machine、不依賴 row lock</li>
</ul>
<h3 id="3-advisory-lock-沒釋放--session-結束才自動釋放">3. Advisory lock 沒釋放 — Session 結束才自動釋放</h3>
<p><code>pg_advisory_lock()</code> 拿了、沒 <code>pg_advisory_unlock()</code>、lock 直到 <em>session 結束</em> 才自動釋放。Connection pool 重複使用同 connection、可能繼承前面留的 lock。</p>
<p>修法：</p>
<ul>
<li>用 <code>pg_advisory_lock</code> 必 <code>try/finally pg_advisory_unlock</code></li>
<li>或用 <em>session-level</em> 用 transaction-scoped：<code>pg_advisory_xact_lock()</code> — commit / rollback 自動釋放</li>
<li>監控 <code>pg_locks</code> 看 advisory lock count、長期累積是警訊</li>
</ul>
<h3 id="4-bloat-不只是-vacuum-沒跑是-active-transaction-阻擋-vacuum">4. Bloat 不只是 vacuum 沒跑、是 <em>active transaction 阻擋 vacuum</em></h3>
<p>第 #1 點延伸：vacuum 已跑、但 bloat 仍持續成長、原因不是 vacuum 不夠、是 <em>active transaction 阻擋 vacuum 看 dead tuple</em>。</p>
<p>修法：</p>
<ul>
<li>不只看 <code>last_vacuum</code>、看 <em>VACUUM 跑了但沒收回多少</em></li>
<li><code>SELECT * FROM pg_stat_progress_vacuum</code> 看 VACUUM 進度</li>
<li><code>SELECT * FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin</code> — 看誰阻擋 vacuum</li>
<li>詳見 <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a></li>
</ul>
<h3 id="5-serializable-下-transaction-rollback--application-必須-retry">5. SERIALIZABLE 下 transaction rollback — Application 必須 retry</h3>
<p><code>SET TRANSACTION ISOLATION LEVEL SERIALIZABLE</code> 後、PG SSI 偵測到 anomaly 會 <em>rollback transaction</em>、application 看到 <code>serialization failure</code>、必須 retry。</p>
<p>對 <em>不知道要 retry</em> 的 application、SERIALIZABLE 變 production bug。</p>
<p>修法：</p>
<ul>
<li>Application code 加 <em>retry middleware</em>：catch <code>SQLSTATE 40001 (serialization_failure)</code> → exponential backoff retry</li>
<li>不必所有 transaction 走 SERIALIZABLE — 只對 <em>strict consistency 需求</em> 場景 set</li>
<li>高並發 SERIALIZABLE workload 容易 rollback storm、考慮拆 transaction 縮短時間</li>
</ul>
<h2 id="觀測-metric">觀測 metric</h2>
<p>Production 監控：</p>
<ul>
<li><code>pg_stat_activity</code>：active session / idle in transaction / wait_event</li>
<li><code>pg_locks</code>：當前 lock 列表、用 join 看誰 block 誰</li>
<li><code>pg_stat_database.deadlocks</code>：deadlock 計數（PG 較低、但仍要監控）</li>
<li><code>pg_stat_user_tables.n_dead_tup</code> / <code>n_live_tup</code>：dead tuple 比例 — bloat 指標</li>
<li><code>pg_stat_progress_vacuum</code>：VACUUM 進度</li>
</ul>
<h2 id="跟-mysql-lock-model-對比">跟 MySQL Lock Model 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PG MVCC</th>
          <th>MySQL InnoDB Lock</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>主要機制</td>
          <td>MVCC + snapshot</td>
          <td>Lock-based + MVCC mixed</td>
      </tr>
      <tr>
          <td>Readers vs Writers</td>
          <td>不互 block</td>
          <td>預設 RR 下 gap lock 影響</td>
      </tr>
      <tr>
          <td>Deadlock 機率</td>
          <td>低（無 gap lock）</td>
          <td>中-高（gap lock 主要來源）</td>
      </tr>
      <tr>
          <td>Phantom 防護</td>
          <td>Snapshot 自然防 + SSI predicate lock</td>
          <td>Gap lock 預先 lock</td>
      </tr>
      <tr>
          <td>預設 isolation</td>
          <td>READ COMMITTED</td>
          <td>REPEATABLE READ</td>
      </tr>
      <tr>
          <td>成本</td>
          <td>Dead tuple + VACUUM 治理</td>
          <td>Lock contention 治理</td>
      </tr>
      <tr>
          <td>Application code</td>
          <td>SERIALIZABLE 需 retry</td>
          <td>寫得不錯多數時 OK</td>
      </tr>
  </tbody>
</table>
<p>兩者解決同一問題（並行控制）、用不同策略。PG 用 <em>空間換時間</em>（保留多版本 tuple、讀寫不互鎖、但需 VACUUM 清理）、MySQL 用 <em>時間換空間</em>（lock 等待、但不必清舊版本）。</p>
<p><strong>選擇判讀</strong>：</p>
<ul>
<li>High 並發 OLTP、寫 / 讀都重：PG MVCC 通常更好（讀不 block 寫）</li>
<li>簡單 OLTP + 不想管 VACUUM：MySQL InnoDB 對 ops 簡單</li>
<li>需要 SERIALIZABLE 強一致：PG SSI 對寫吞吐影響低</li>
<li>已有 MySQL 生態 / 工具鏈：MySQL Lock 知識可繼續用</li>
</ul>
<p>詳見 <a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">MySQL Lock Contention</a> — 完整 MySQL lock 機制。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-autovacuum-tuning">跟 Autovacuum Tuning</h3>
<p>MVCC 仰賴 VACUUM、autovacuum 是 PG 並行控制的 <em>維護成本</em>。VACUUM 跑慢 / 沒跑 → bloat → query 慢。詳見 <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>。</p>
<h3 id="跟-replication-topology">跟 Replication Topology</h3>
<p><code>hot_standby_feedback = on</code> 讓 standby 上 long-running query 不被 vacuum 取消、但 <em>standby 把 oldest xmin 推回 primary</em>、primary autovacuum 變保守、增加 bloat。詳見 <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>。</p>
<h3 id="跟-connection-pool">跟 Connection Pool</h3>
<p>pgBouncer transaction pooling 模式下、advisory lock / SELECT FOR UPDATE 跨 transaction 行為 <em>broken</em>（不同 transaction 可能進不同 backend connection）。詳見 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer Config</a>。</p>
<h3 id="跟-query-optimization">跟 Query Optimization</h3>
<p>長 transaction 跑慢 query 期間、其他 transaction 看到 snapshot bloat、planner 估錯 dead tuple ratio。詳見 <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a>（VACUUM 是 MVCC 必要成本）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（hot_standby_feedback 影響）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PG pgBouncer</a>（transaction pooling 跟 lock 互動）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">PG Online Schema Change</a>（DDL lock 議題）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（snapshot bloat 影響 planner）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">MySQL Lock Contention</a>（sibling、不同模型）</li>
<li><a href="/blog/backend/knowledge-cards/isolation-level/" data-link-title="Isolation Level" data-link-desc="說明資料庫交易隔離級別如何影響並發讀寫結果">Isolation Level 卡片</a></li>
<li>官方：<a href="https://www.postgresql.org/docs/current/mvcc.html">PG MVCC</a> / <a href="https://www.postgresql.org/docs/current/transaction-iso.html">PG Concurrency Control</a> / <a href="https://www.postgresql.org/docs/current/explicit-locking.html">Explicit Locking</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL JSONB Deep Dive：Binary Storage + GIN Index 為什麼是結構性優勢</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>JSONB deep dive&lt;/em> — binary storage + GIN index 的結構性優勢。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="json-vs-jsonb選-jsonb">JSON vs JSONB：選 JSONB&lt;/h2>
&lt;p>PG 9.2 加 &lt;code>JSON&lt;/code> type、9.4 加 &lt;code>JSONB&lt;/code>。99% 場景用 JSONB：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>JSON&lt;/th>
 &lt;th>JSONB&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>儲存&lt;/td>
 &lt;td>純文字（原樣保存）&lt;/td>
 &lt;td>Binary decomposed format&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Parse cost&lt;/td>
 &lt;td>每次 query parse&lt;/td>
 &lt;td>Insert 時 parse 一次&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index 支援&lt;/td>
 &lt;td>Limited（functional index）&lt;/td>
 &lt;td>GIN / functional / partial 都行&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operator 支援&lt;/td>
 &lt;td>有限（→ / →&amp;gt;）&lt;/td>
 &lt;td>完整（@&amp;gt; / ? / @? / ? 等）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Duplicate key&lt;/td>
 &lt;td>保留（原樣）&lt;/td>
 &lt;td>只保留最後一個（normalize）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Key order&lt;/td>
 &lt;td>保留&lt;/td>
 &lt;td>不保留&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Whitespace&lt;/td>
 &lt;td>保留&lt;/td>
 &lt;td>不保留&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>JSONB 唯一缺點是 &lt;em>binary 儲存（不保留 key order / whitespace / duplicate）&lt;/em>。99% application 不在意這些。&lt;/p>
&lt;p>從 &lt;em>application semantics&lt;/em> 視角、JSONB 是 PG JSON 的 &lt;em>the right type&lt;/em>、JSON 是 &lt;em>legacy / niche&lt;/em>。&lt;/p>
&lt;h2 id="jsonb-gin-index核心結構性優勢">JSONB GIN Index：核心結構性優勢&lt;/h2>
&lt;p>PG GIN（Generalized Inverted Index）可以對 JSONB 內所有 key/value pair 建 inverted index：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">JSONB&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- GIN index
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">idx_products_metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">USING&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">GIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>加完後、JSONB query 用 GIN index 加速：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- @&amp;gt; (contains) 用 GIN
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">@&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;{&amp;#34;category&amp;#34;: &amp;#34;shoes&amp;#34;}&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- ? (has key) 用 GIN
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">?&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;discount&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- ?| (has any of these keys) 用 GIN
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">?|&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">array&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;discount&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;promotion&amp;#39;&lt;/span>&lt;span class="p">];&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>跟 MongoDB index 對比、PG 不必 &lt;em>預先 define&lt;/em> JSON path index、&lt;code>USING GIN (metadata)&lt;/code> 對 &lt;em>整個 JSONB document 任意 path&lt;/em> 都有效。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>JSONB deep dive</em> — binary storage + GIN index 的結構性優勢。</p></blockquote>
<hr>
<h2 id="json-vs-jsonb選-jsonb">JSON vs JSONB：選 JSONB</h2>
<p>PG 9.2 加 <code>JSON</code> type、9.4 加 <code>JSONB</code>。99% 場景用 JSONB：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>JSON</th>
          <th>JSONB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>儲存</td>
          <td>純文字（原樣保存）</td>
          <td>Binary decomposed format</td>
      </tr>
      <tr>
          <td>Parse cost</td>
          <td>每次 query parse</td>
          <td>Insert 時 parse 一次</td>
      </tr>
      <tr>
          <td>Index 支援</td>
          <td>Limited（functional index）</td>
          <td>GIN / functional / partial 都行</td>
      </tr>
      <tr>
          <td>Operator 支援</td>
          <td>有限（→ / →&gt;）</td>
          <td>完整（@&gt; / ? / @? / ? 等）</td>
      </tr>
      <tr>
          <td>Duplicate key</td>
          <td>保留（原樣）</td>
          <td>只保留最後一個（normalize）</td>
      </tr>
      <tr>
          <td>Key order</td>
          <td>保留</td>
          <td>不保留</td>
      </tr>
      <tr>
          <td>Whitespace</td>
          <td>保留</td>
          <td>不保留</td>
      </tr>
  </tbody>
</table>
<p>JSONB 唯一缺點是 <em>binary 儲存（不保留 key order / whitespace / duplicate）</em>。99% application 不在意這些。</p>
<p>從 <em>application semantics</em> 視角、JSONB 是 PG JSON 的 <em>the right type</em>、JSON 是 <em>legacy / niche</em>。</p>
<h2 id="jsonb-gin-index核心結構性優勢">JSONB GIN Index：核心結構性優勢</h2>
<p>PG GIN（Generalized Inverted Index）可以對 JSONB 內所有 key/value pair 建 inverted index：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">metadata</span><span class="w"> </span><span class="n">JSONB</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- GIN index
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span></span></span></code></pre></div><p>加完後、JSONB query 用 GIN index 加速：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- @&gt; (contains) 用 GIN
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- ? (has key) 用 GIN
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="s1">&#39;discount&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- ?| (has any of these keys) 用 GIN
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?|</span><span class="w"> </span><span class="nb">array</span><span class="p">[</span><span class="s1">&#39;discount&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;promotion&#39;</span><span class="p">];</span></span></span></code></pre></div><p>跟 MongoDB index 對比、PG 不必 <em>預先 define</em> JSON path index、<code>USING GIN (metadata)</code> 對 <em>整個 JSONB document 任意 path</em> 都有效。</p>
<h3 id="jsonb_ops-vs-jsonb_path_ops"><code>jsonb_ops</code> vs <code>jsonb_path_ops</code></h3>
<p>PG GIN 對 JSONB 有兩種 <em>operator class</em>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th><code>jsonb_ops</code>（預設）</th>
          <th><code>jsonb_path_ops</code></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>索引內容</td>
          <td>Key + value 都索引</td>
          <td>只索引 path → value pair</td>
      </tr>
      <tr>
          <td>Index size</td>
          <td>大</td>
          <td>小（約一半）</td>
      </tr>
      <tr>
          <td>支援 operator</td>
          <td><code>@&gt; / ? / ?| / ?&amp;</code></td>
          <td>只 <code>@&gt;</code> (containment)</td>
      </tr>
      <tr>
          <td>適用</td>
          <td>多種 query pattern</td>
          <td>只用 <code>@&gt;</code> 的場景</td>
      </tr>
  </tbody>
</table>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- jsonb_ops（預設）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_meta_default</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- jsonb_path_ops（小、快、但只支援 @&gt;）
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_meta_path</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="w"> </span><span class="n">jsonb_path_ops</span><span class="p">);</span></span></span></code></pre></div><p><strong>選擇</strong>：</p>
<ul>
<li>只跑 <code>@&gt;</code> containment query → <code>jsonb_path_ops</code>（index 小、快）</li>
<li>跑 <code>?</code> / <code>?|</code> / <code>?&amp;</code> key existence query → <code>jsonb_ops</code>（預設）</li>
</ul>
<h2 id="operator--path-query">Operator + Path Query</h2>
<p>JSONB 提供豐富 operator + jsonpath：</p>
<h3 id="operator">Operator</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Extract value（returns jsonb）
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Extract text（returns text）
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">-&gt;&gt;</span><span class="w"> </span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Path extract
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">#&gt;</span><span class="w"> </span><span class="s1">&#39;{variants, 0, price}&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">#&gt;&gt;</span><span class="w"> </span><span class="s1">&#39;{variants, 0, price}&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 返回 text
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- Containment（用 GIN index）
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;, &#34;active&#34;: true}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- Reverse containment
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="s1">&#39;{&#34;sub&#34;: &#34;value&#34;}&#39;</span><span class="w"> </span><span class="o">&lt;@</span><span class="w"> </span><span class="n">metadata</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="c1">-- Key existence
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="s1">&#39;discount&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?|</span><span class="w"> </span><span class="nb">array</span><span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;b&#39;</span><span class="p">];</span><span class="w">  </span><span class="c1">-- 任一 key
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?&amp;</span><span class="w"> </span><span class="nb">array</span><span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;b&#39;</span><span class="p">];</span><span class="w">  </span><span class="c1">-- 全部 key</span></span></span></code></pre></div><h3 id="jsonpathpg-12">jsonpath（PG 12+）</h3>
<p>SQL/JSON jsonpath 是 SQL standard、PG 12+ 支援：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- jsonb_path_query：展開 path 結果
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">jsonb_path_query</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*].price&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- jsonb_path_exists：返 boolean
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">jsonb_path_exists</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*] ? (@.price &gt; 100)&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- jsonb_path_query_array：返 array of result
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">jsonb_path_query_array</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.tags[*]&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span></span></span></code></pre></div><p>jsonpath 比 PG-specific operator 標準化、跨 vendor portable。</p>
<h2 id="partial-jsonb-index">Partial JSONB Index</h2>
<p>對 <em>只 query subset row</em> 的場景、建 partial index：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 只對 active product 建 metadata index
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_active_products_metadata</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Query active products + JSONB filter
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="c1">-- → planner 用 partial GIN index</span></span></span></code></pre></div><p>Partial index 比 full GIN 小很多、write cost 低、index hit rate 高。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-大-jsonb--toast--性能崩潰">1. 大 JSONB + TOAST — 性能崩潰</h3>
<p>JSONB &gt; 2 KB 自動進 TOAST（PG 內外部 storage）、每次 query read 該 row 都要 <em>de-TOAST</em>（拉外部 storage 再合併）。大 JSONB（&gt; 50 KB）每次 query 慢 10-100x。</p>
<p>修法：</p>
<ul>
<li>把 <em>大 attribute 拆獨立 column</em>（如 <code>description TEXT</code> 不放 metadata）</li>
<li>用 <em>JSON path index</em> 對 hot path 加速、不必每次讀整個 JSONB</li>
<li>用 <code>pg_column_size(metadata)</code> 監控 JSONB size 分布、找 outlier</li>
<li>對 truly 大 document（&gt; 1 MB）考慮 separate table 或 object storage</li>
</ul>
<h3 id="2-nested-update--整個-jsonb-重寫">2. Nested update — 整個 JSONB 重寫</h3>
<p>PG 沒 <em>atomic partial update</em>。修改 nested key 必須讀整個 JSONB → 修改 → 寫回：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">UPDATE</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">jsonb_set</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;{discount}&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;0.2&#39;</span><span class="p">::</span><span class="n">jsonb</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 等同於：讀 metadata、改 discount、寫回整個 metadata</span></span></span></code></pre></div><p>對 <em>大 JSONB + 高頻 update</em> 場景、寫吞吐受限。跟 MongoDB <code>$set</code> operator 對應 <em>partial document update</em> 不同。</p>
<p>修法：</p>
<ul>
<li>對 <em>high-update nested key</em> 拆獨立 column</li>
<li>Application 層 batch update（攢一批一次 update）</li>
<li>接受 PG JSONB <em>是 immutable-replace</em> 心智模型、不是 <em>mutable in-place</em></li>
</ul>
<h3 id="3-index-選錯-op-class---query-走-full-scan">3. Index 選錯 op class — <code>?</code> query 走 full scan</h3>
<p>對 <code>jsonb_path_ops</code> index、<code>?</code> key existence query 走 <em>full scan</em>（不用 index）。Application 看 query 慢、查 EXPLAIN 才發現 index 沒用。</p>
<p>修法：</p>
<ul>
<li>設計階段確認 <em>application query pattern</em>：只用 <code>@&gt;</code> 還是會用 <code>?</code></li>
<li>多 query pattern → <code>jsonb_ops</code>（預設）</li>
<li>純 containment → <code>jsonb_path_ops</code>（省 index size）</li>
<li>不確定先用預設、production 觀察後再優化</li>
</ul>
<h3 id="4-jsonb_path_query-跟-jsonb_path_exists-行為差">4. <code>jsonb_path_query</code> 跟 <code>jsonb_path_exists</code> 行為差</h3>
<ul>
<li><code>jsonb_path_query(metadata, '$.variants[*].price')</code> — 展開、每個 match return 一 row</li>
<li><code>jsonb_path_exists(metadata, '$.variants[*]')</code> — return boolean（true if any match）</li>
</ul>
<p>Application 想要「過濾 row」用前者寫成：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 錯：返多 row 給每個 product、結果 row count 暴增
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">jsonb_path_query</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*].price&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span></span></span></code></pre></div><p>應該：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 對：只過濾 product
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">jsonb_path_exists</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*] ? (@.price &gt; 100)&#39;</span><span class="p">);</span></span></span></code></pre></div><p>修法：</p>
<ul>
<li>區分 <em>exists 過濾 row</em> vs <em>query 展開 row</em></li>
<li>過濾用 <code>jsonb_path_exists</code> 或 <code>@&gt;</code> operator</li>
<li>展開用 <code>jsonb_path_query</code> + 配合 <code>LATERAL</code> 或 subquery</li>
</ul>
<h3 id="5-partial-index-條件不對齊-query">5. Partial index 條件不對齊 query</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_active_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Application query 但 status 沒 explicit
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- → 不用 partial index（planner 不知道 status=&#39;active&#39; 條件）</span></span></span></code></pre></div><p>修法：</p>
<ul>
<li>
<p>Application query <em>必須包含 partial index 的 WHERE 條件</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p>確認 planner 用 partial index：<code>EXPLAIN</code> 看 <code>Index Scan using idx_active_metadata</code></p>
</li>
<li>
<p>不對齊 query pattern 的 partial index = waste</p>
</li>
</ul>
<h2 id="何時用-jsonb-vs-拆-column">何時用 JSONB vs 拆 column</h2>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>選擇</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>不規則 schema（user-generated metadata / customization）</td>
          <td>JSONB</td>
      </tr>
      <tr>
          <td>半結構化 + 5-10 個常 query key</td>
          <td>JSONB + GIN partial index</td>
      </tr>
      <tr>
          <td>規則 schema、column 數量穩定</td>
          <td>拆 column（更快 / index 易）</td>
      </tr>
      <tr>
          <td>Nested 結構 + 經常需要展開 query</td>
          <td>JSONB + jsonb_path_query</td>
      </tr>
      <tr>
          <td>大 document（&gt; 1 KB）+ 高頻 update</td>
          <td>拆 column 或 separate table</td>
      </tr>
      <tr>
          <td>完全 schemaless workload</td>
          <td>考慮 MongoDB 而非 PG</td>
      </tr>
  </tbody>
</table>
<p>JSONB 是 <em>PG 適合 semi-structured data</em> 的工具、不是 <em>MongoDB 替代品</em>。對 <em>主要結構化 + 少量 JSON</em> 場景 JSONB 完美；對 <em>主要 JSON / 複雜 nested aggregation</em> 場景 MongoDB 仍是專業選擇。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-query-optimization">跟 Query Optimization</h3>
<p>JSONB query 的 planner 行為：</p>
<ul>
<li><code>@&gt;</code> containment 對 jsonb_ops / jsonb_path_ops 都用 GIN</li>
<li><code>?</code> 只對 jsonb_ops 用 GIN</li>
<li>jsonb_path_exists 用 <em>functional index</em>（不是 GIN）</li>
<li>看 EXPLAIN 確認用對 index、詳見 <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a></li>
</ul>
<h3 id="跟-sql-features-baseline">跟 SQL Features Baseline</h3>
<p>JSONB 是 PG 結構性領先特性之一、詳見 <a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">SQL Features Baseline</a>。</p>
<h3 id="跟-mvcc--lock-model">跟 MVCC + Lock Model</h3>
<p>JSONB UPDATE 整個 column 重寫、每次 update 創新 tuple、跟 row update 相同 MVCC behavior。詳見 <a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>。</p>
<h3 id="跟-mysql-json_table">跟 MySQL JSON_TABLE</h3>
<p>MySQL 8.0 JSON_TABLE 跟 PG jsonpath 類似（都 SQL standard）、但 <em>index 機制</em> 完全不同：</p>
<ul>
<li>PG：JSONB + GIN index over 整個 column</li>
<li>MySQL：JSON column + generated column + index over generated</li>
</ul>
<p>PG JSONB GIN 是 <em>結構性領先</em>、MySQL 短期內難對應。詳見 <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>。</p>
<h2 id="觀測-metric">觀測 metric</h2>
<ul>
<li><code>pg_column_size(metadata)</code> — 每 row JSONB size 分布</li>
<li><code>pg_relation_size('idx_name')</code> — JSONB GIN index 大小</li>
<li><code>pg_stat_user_indexes.idx_scan</code> — JSONB index 使用次數</li>
<li>TOAST table size：<code>SELECT pg_relation_size(reltoastrelid) FROM pg_class WHERE relname='products'</code></li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">PG SQL Features Baseline</a>（JSONB 是 PG 結構領先之一）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（JSONB index 用對）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a>（JSONB update 跟 MVCC）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>（JSON_TABLE vs JSONB 對比）</li>
<li><a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB vendor</a>（純 document workload 替代）</li>
<li>官方：<a href="https://www.postgresql.org/docs/current/functions-json.html">PG JSON Functions</a> / <a href="https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING">JSONB Indexing</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>extension ecosystem&lt;/em> — PG 結構性產品線擴張的機制。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="extension-不只是-plugin是產品線擴張">Extension 不只是 plugin、是產品線擴張&lt;/h2>
&lt;p>PG extension 機制讓 &lt;em>第三方加新 type / function / operator / index access method / planner hook&lt;/em>、深度整合到 PG core。對比其他 DB 的 plugin model（MySQL plugin / MongoDB plugin）、PG extension 是 &lt;em>更深的 SPI&lt;/em>。&lt;/p>
&lt;p>結果：&lt;/p>
&lt;ul>
&lt;li>pgvector → PG 變 vector similarity search DB（取代 Pinecone / Weaviate）&lt;/li>
&lt;li>TimescaleDB → PG 變 time-series DB（取代 InfluxDB）&lt;/li>
&lt;li>Citus → PG 變 sharded cluster&lt;/li>
&lt;li>PostGIS → PG 變 GIS DB&lt;/li>
&lt;li>pg_cron → PG 變 scheduled job runner&lt;/li>
&lt;li>pgvectorscale → 大規模 vector index&lt;/li>
&lt;/ul>
&lt;p>對 &lt;em>vendor lock-in 敏感&lt;/em> / &lt;em>想統一 stack&lt;/em> 的 org、PG extension 提供 &lt;em>用 PG 取代多個 specialized DB&lt;/em> 的可能。&lt;/p>
&lt;p>但 &lt;em>統一 stack 的代價&lt;/em>：PG 主庫 ops 風險集中（一個 PG 掛 = vector / time-series / GIS / cron 全掛）、extension 跟 PG version 對齊矩陣多一道升級顧慮、規模上限通常比專業 DB 低（pgvector 100M+ vs Pinecone 10B+ / TimescaleDB 100K rows/s vs InfluxDB 500K+）。決策框架：&lt;em>中小規模 + 已用 PG + 不想多管系統&lt;/em> → extension；&lt;em>大規模 + 純該 workload + 有專業 team&lt;/em> → specialized DB。&lt;/p>
&lt;h2 id="extension-lifecycle">Extension Lifecycle&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- 看可用 extension
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_available_extensions&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 安裝（在 OS 層、要有對應 package）
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1">-- apt install postgresql-14-pg-stat-statements
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Enable in DB
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_stat_statements&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 確認
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_extension&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 升級 extension
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">ALTER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_stat_statements&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">UPDATE&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 移除
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">DROP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_stat_statements&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>每個 extension 有：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>extension ecosystem</em> — PG 結構性產品線擴張的機制。</p></blockquote>
<hr>
<h2 id="extension-不只是-plugin是產品線擴張">Extension 不只是 plugin、是產品線擴張</h2>
<p>PG extension 機制讓 <em>第三方加新 type / function / operator / index access method / planner hook</em>、深度整合到 PG core。對比其他 DB 的 plugin model（MySQL plugin / MongoDB plugin）、PG extension 是 <em>更深的 SPI</em>。</p>
<p>結果：</p>
<ul>
<li>pgvector → PG 變 vector similarity search DB（取代 Pinecone / Weaviate）</li>
<li>TimescaleDB → PG 變 time-series DB（取代 InfluxDB）</li>
<li>Citus → PG 變 sharded cluster</li>
<li>PostGIS → PG 變 GIS DB</li>
<li>pg_cron → PG 變 scheduled job runner</li>
<li>pgvectorscale → 大規模 vector index</li>
</ul>
<p>對 <em>vendor lock-in 敏感</em> / <em>想統一 stack</em> 的 org、PG extension 提供 <em>用 PG 取代多個 specialized DB</em> 的可能。</p>
<p>但 <em>統一 stack 的代價</em>：PG 主庫 ops 風險集中（一個 PG 掛 = vector / time-series / GIS / cron 全掛）、extension 跟 PG version 對齊矩陣多一道升級顧慮、規模上限通常比專業 DB 低（pgvector 100M+ vs Pinecone 10B+ / TimescaleDB 100K rows/s vs InfluxDB 500K+）。決策框架：<em>中小規模 + 已用 PG + 不想多管系統</em> → extension；<em>大規模 + 純該 workload + 有專業 team</em> → specialized DB。</p>
<h2 id="extension-lifecycle">Extension Lifecycle</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 看可用 extension
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_available_extensions</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- 安裝（在 OS 層、要有對應 package）
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1">-- apt install postgresql-14-pg-stat-statements
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Enable in DB
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 確認
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 升級 extension
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="w"> </span><span class="k">UPDATE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- 移除
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"></span><span class="k">DROP</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="p">;</span></span></span></code></pre></div><p>每個 extension 有：</p>
<ul>
<li><em>Version</em> — 跟 PG version 綁定（如 pg_stat_statements 14 / 15 / 16）</li>
<li><em>Schema</em> — 安裝到 <code>public</code> 或專屬 schema</li>
<li><em>Dependencies</em> — 部分 extension 依賴其他（如 PostGIS 依賴 pg_trgm）</li>
<li><em>Trusted vs untrusted</em> — trusted 可以 non-superuser 安裝（PG 13+）</li>
</ul>
<h2 id="6-個-production-critical-extension">6 個 Production-Critical Extension</h2>
<h3 id="1-pg_stat_statements--query-stats必裝">1. pg_stat_statements — Query stats（必裝）</h3>
<p>任何 production PG cluster 都該裝：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">shared_preload_libraries</span> <span class="o">=</span> <span class="s">&#39;pg_stat_statements&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">pg_stat_statements.max</span> <span class="o">=</span> <span class="s">5000</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">pg_stat_statements.track</span> <span class="o">=</span> <span class="s">all</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Top 10 query by total time
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">query</span><span class="p">,</span><span class="w"> </span><span class="n">calls</span><span class="p">,</span><span class="w"> </span><span class="n">total_exec_time</span><span class="p">,</span><span class="w"> </span><span class="n">mean_exec_time</span><span class="p">,</span><span class="w"> </span><span class="k">rows</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">total_exec_time</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>對應 MySQL <code>events_statements_summary_by_digest</code>。詳見 <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>。</p>
<h3 id="2-pg_partman--自動-partition-lifecycle">2. pg_partman — 自動 partition lifecycle</h3>
<p>PG declarative partitioning 需要 <em>手動建 / drop partition</em>。pg_partman 自動化：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_partman</span><span class="w"> </span><span class="k">SCHEMA</span><span class="w"> </span><span class="n">partman</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- 設 events 表自動 monthly partition
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">partman</span><span class="p">.</span><span class="n">create_parent</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">p_parent_table</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;public.events&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">p_control</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;created_at&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">p_type</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;range&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="n">p_interval</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;1 month&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">p_premake</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="mi">6</span><span class="w">  </span><span class="c1">-- 預先建 6 個未來 partition
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- 跑 maintenance（建未來 partition + drop 老 partition）
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">partman</span><span class="p">.</span><span class="n">run_maintenance</span><span class="p">(</span><span class="n">p_analyze</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="k">false</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- 預設用 pg_cron 排程</span></span></span></code></pre></div><p>對 <em>time-series partition</em> workload 必裝。詳見 <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>。</p>
<h3 id="3-pg_repack--online-table-rewrite">3. pg_repack — Online table rewrite</h3>
<p>詳見 <a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">Online Schema Change</a>。</p>
<h3 id="4-pgvector--vector-similarity-search">4. pgvector — Vector similarity search</h3>
<p>LLM embedding / semantic search 場景必裝：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">vector</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">content</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">embedding</span><span class="w"> </span><span class="n">VECTOR</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span><span class="w">  </span><span class="c1">-- OpenAI text-embedding-3-small 1536-dim
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- HNSW index（pgvector 0.5+）
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">HNSW</span><span class="w"> </span><span class="p">(</span><span class="n">embedding</span><span class="w"> </span><span class="n">vector_cosine_ops</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- 找最相似的 5 個
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="p">::</span><span class="n">vector</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span></span></span></code></pre></div><p>對 <em>中小規模 RAG / semantic search</em> workload、pgvector 在 PG 內跑、不必跨 Pinecone / Weaviate / Qdrant 等獨立服務。</p>
<p>對 <em>超大規模</em> vector workload（&gt; 1 億 vector）考慮 pgvectorscale（pgvector 的 streaming variant）或專業 vector DB。</p>
<h3 id="5-timescaledb--time-series-擴展">5. TimescaleDB — Time-series 擴展</h3>
<p>把 PG 變 time-series DB：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">timescaledb</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">metrics</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">time</span><span class="w"> </span><span class="n">TIMESTAMPTZ</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">device_id</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">value</span><span class="w"> </span><span class="n">DOUBLE</span><span class="w"> </span><span class="k">PRECISION</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- 轉成 hypertable（auto-partition by time）
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_hypertable</span><span class="p">(</span><span class="s1">&#39;metrics&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;time&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- Continuous aggregate（materialized view 自動 refresh）
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">metrics_5min</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">continuous</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">time_bucket</span><span class="p">(</span><span class="s1">&#39;5 minutes&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">bucket</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">       </span><span class="n">device_id</span><span class="p">,</span><span class="w"> </span><span class="k">avg</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">metrics</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">bucket</span><span class="p">,</span><span class="w"> </span><span class="n">device_id</span><span class="p">;</span></span></span></code></pre></div><p>對 IoT / monitoring / financial tick data 場景、TimescaleDB 比純 PG 寫吞吐高 10x+。</p>
<h3 id="6-postgis--gis-extension">6. PostGIS — GIS extension</h3>
<p>地理 / 空間 query 業界標準：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">postgis</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">stores</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="k">location</span><span class="w"> </span><span class="n">GEOGRAPHY</span><span class="p">(</span><span class="n">POINT</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">stores</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="k">location</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- 找 1 km 內的 store
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">stores</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">ST_DWithin</span><span class="p">(</span><span class="k">location</span><span class="p">,</span><span class="w"> </span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">05</span><span class="p">)::</span><span class="n">geography</span><span class="p">,</span><span class="w"> </span><span class="mi">1000</span><span class="p">);</span></span></span></code></pre></div><p>PostGIS 是 GIS workload 業界標準、其他 DB GIS 能力都對標 PostGIS。</p>
<h2 id="其他常用-extension">其他常用 extension</h2>
<p>除 6 個 production-critical 之外、以下是 <em>特定場景常用</em> 的 extension — 分四類：排程跟 utility（<code>pg_cron</code> / <code>pg_trgm</code> / <code>uuid-ossp</code>）、type 擴展（<code>hstore</code> / <code>citext</code> / <code>pgcrypto</code>）、跨 DB 整合（<code>postgres_fdw</code> / <code>mysql_fdw</code>）、observability / debug 工具（<code>pg_buffercache</code> / <code>pg_visibility</code> / <code>auto_explain</code>）：</p>
<table>
  <thead>
      <tr>
          <th>Extension</th>
          <th>用途</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pg_cron</code></td>
          <td>排程 SQL job（不必外部 cron）</td>
      </tr>
      <tr>
          <td><code>pg_trgm</code></td>
          <td>Fuzzy string match / similarity</td>
      </tr>
      <tr>
          <td><code>uuid-ossp</code></td>
          <td>UUID 產生</td>
      </tr>
      <tr>
          <td><code>hstore</code></td>
          <td>Key-value pair type</td>
      </tr>
      <tr>
          <td><code>citext</code></td>
          <td>Case-insensitive text type</td>
      </tr>
      <tr>
          <td><code>pgcrypto</code></td>
          <td>加密 / hash function</td>
      </tr>
      <tr>
          <td><code>postgres_fdw</code></td>
          <td>PG → PG foreign table</td>
      </tr>
      <tr>
          <td><code>mysql_fdw</code></td>
          <td>PG → MySQL foreign table</td>
      </tr>
      <tr>
          <td><code>pg_buffercache</code></td>
          <td>Buffer pool 內容檢視</td>
      </tr>
      <tr>
          <td><code>pg_visibility</code></td>
          <td>Visibility map 檢視（debug bloat）</td>
      </tr>
      <tr>
          <td><code>auto_explain</code></td>
          <td>Slow query 自動 log plan</td>
      </tr>
      <tr>
          <td><code>wal2json</code></td>
          <td>Logical decoding output 為 JSON</td>
      </tr>
      <tr>
          <td><code>Citus</code></td>
          <td>Distributed PG</td>
      </tr>
      <tr>
          <td><code>pgvector</code></td>
          <td>Vector similarity</td>
      </tr>
      <tr>
          <td><code>pglogical</code></td>
          <td>Logical replication（功能比 native 強）</td>
      </tr>
      <tr>
          <td><code>pg_squeeze</code></td>
          <td>pg_repack 替代</td>
      </tr>
  </tbody>
</table>
<p>實務組合：observability 三件套（<code>pg_stat_statements</code> + <code>auto_explain</code> + <code>pg_buffercache</code>）幾乎是 production 標配；FDW 是「跨 DB query」的 escape hatch、但 cross-DB query 效能差、適合 reporting 不適合 OLTP。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-extension-version-跟-pg-version-對齊">1. Extension version 跟 PG version 對齊</h3>
<p>PG cluster 升 14 → 15 後、extension（pg_stat_statements / pg_partman / pgvector 等）必須有對應 15 版本。早期升級 / niche extension 可能還沒釋出。</p>
<p>修法：</p>
<ul>
<li>升 PG cluster 前 <em>先確認所有 extension 都有對應 PG version 釋出版本</em></li>
<li>升完 PG cluster <em>立即跑 <code>ALTER EXTENSION xxx UPDATE</code></em></li>
<li>Upgrade runbook 紀錄每個 extension 的版本兼容狀態</li>
</ul>
<h3 id="2-managed-pg-限制-extension-列表">2. Managed PG 限制 extension 列表</h3>
<p>AWS RDS / Aurora PG / Cloud SQL / Azure DB for PostgreSQL 各自有 <em>支援 extension 白名單</em>：</p>
<ul>
<li>不在白名單的 extension 不能 install</li>
<li>部分 extension 限定特定 PG version</li>
<li>Untrusted extension 通常不允許</li>
</ul>
<p>常見 <em>managed 不支援</em> 的 extension：</p>
<ul>
<li><code>pg_repack</code>（Aurora 有限支援、RDS 部分 version 支援）</li>
<li><code>pglogical</code>（部分 cloud 不支援）</li>
<li><code>pg_cron</code>（cloud 通常用 managed scheduler 取代）</li>
<li>Custom extension（自寫 .so）</li>
</ul>
<p>修法：</p>
<ul>
<li>評估 managed PG 之前、先查 <em>vendor 支援 extension 列表</em></li>
<li>Self-hosted vs managed 的 <em>跨雲 portability</em> 議題：extension 是 lock-in source</li>
<li>如果 application 強依賴某 extension（如 PostGIS），確認 cloud 支援</li>
</ul>
<h3 id="3-extension-upgrade-order">3. Extension upgrade order</h3>
<p><code>pg_upgrade</code> 升 PG major version 後、extension 也要升。順序：</p>
<ol>
<li><em>pg_upgrade</em> PG binary + cluster</li>
<li>對每個 DB 跑 <code>ALTER EXTENSION xxx UPDATE</code></li>
<li>部分 extension（如 PostGIS）需要 <em>特殊升級程序</em>（<code>SELECT postgis_extensions_upgrade()</code>）</li>
</ol>
<p>修法：</p>
<ul>
<li>升 PG 後 <em>先測 staging cluster</em> 確認 extension upgrade 流程</li>
<li>PostGIS / TimescaleDB / Citus 有自己 upgrade 程序、必須遵循 vendor doc</li>
<li>升完跑 <code>\dx</code> 看每個 extension 版本</li>
</ul>
<h3 id="4-shared_preload_libraries-衝突">4. <code>shared_preload_libraries</code> 衝突</h3>
<p>部分 extension（pg_stat_statements / auto_explain / TimescaleDB / Citus / pg_cron）必須在 <code>shared_preload_libraries</code> 加進去、需要 <em>重啟 PG</em>。</p>
<p>衝突情境：</p>
<ul>
<li>pg_partman + TimescaleDB 都用 background worker、worker 上限不夠</li>
<li><code>max_worker_processes</code> 預設 8、不夠時某些 extension 起不起來</li>
</ul>
<p>修法：</p>
<ul>
<li>列出所有 shared_preload extension、確認 order（部分有 dependency）</li>
<li>提高 <code>max_worker_processes = 16</code> / <code>max_parallel_workers = 8</code> 等</li>
<li>重啟 PG 才生效、計入 maintenance window</li>
</ul>
<h3 id="5-extension-跟-logical-replication-互動">5. Extension 跟 logical replication 互動</h3>
<p>Logical replication（pglogical / native）不自動 replicate extension state（function / type definition）。Subscriber 沒裝對應 extension、replicate event 失敗。</p>
<p>修法：</p>
<ul>
<li>Subscriber 必須 <em>先安裝</em> publisher 用的 extension</li>
<li>Extension 版本 <em>publisher / subscriber 對齊</em></li>
<li>對 extension-heavy schema、考慮用 <em>streaming replication</em>（physical）而非 logical</li>
</ul>
<h2 id="cloud-vendor-對-extension-的支援">Cloud Vendor 對 Extension 的支援</h2>
<table>
  <thead>
      <tr>
          <th>Vendor</th>
          <th>常見 extension 支援</th>
          <th>限制</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AWS RDS PostgreSQL</td>
          <td>pg_stat_statements / pg_partman / pgvector / pg_repack</td>
          <td>部分 version 限制 / 不能 install custom</td>
      </tr>
      <tr>
          <td>AWS Aurora PostgreSQL</td>
          <td>同 RDS、加 Aurora-specific</td>
          <td>pg_repack 限版本</td>
      </tr>
      <tr>
          <td>GCP Cloud SQL</td>
          <td>標準 extension 廣支援</td>
          <td>pg_cron / pgvector OK</td>
      </tr>
      <tr>
          <td>Azure DB for PostgreSQL</td>
          <td>廣泛支援 + Azure 整合</td>
          <td>Citus（managed 即 Cosmos DB for PG）</td>
      </tr>
      <tr>
          <td>Self-hosted</td>
          <td>全部</td>
          <td>自己維護</td>
      </tr>
  </tbody>
</table>
<p>對 <em>extension-heavy</em> application、self-hosted PG 仍是必要選擇。Managed PG 適合 <em>標準 extension</em> workload。</p>
<h2 id="何時用-pg-extension-取代專業-db">何時用 PG extension 取代專業 DB</h2>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>用 extension 還是專業 DB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>&lt; 100M vector + RAG / semantic search</td>
          <td>pgvector（單一 stack 省 ops）</td>
      </tr>
      <tr>
          <td>大規模 vector search &gt; 10M with high QPS</td>
          <td>專業 vector DB（Pinecone / Qdrant）</td>
      </tr>
      <tr>
          <td>Time-series &lt; 100 TB</td>
          <td>TimescaleDB</td>
      </tr>
      <tr>
          <td>Time-series &gt; 100 TB + high cardinality</td>
          <td>專業 TS DB（InfluxDB / VictoriaMetrics）</td>
      </tr>
      <tr>
          <td>GIS</td>
          <td>PostGIS（業界標準）</td>
      </tr>
      <tr>
          <td>Sharded &lt; 10 TB + multi-tenant</td>
          <td>Citus</td>
      </tr>
      <tr>
          <td>Sharded &gt; 100 TB</td>
          <td>distributed SQL（CockroachDB / TiDB）</td>
      </tr>
      <tr>
          <td>Scheduled job</td>
          <td>pg_cron（簡單）/ Airflow（複雜）</td>
      </tr>
  </tbody>
</table>
<p>對中小規模、PG + extension 是 <em>簡化 stack</em> 的有效路徑。規模超過時、專業 DB 仍是首選。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">Citus Distributed</a>：extension 一例、可看 extension model</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>：pg_stat_statements + auto_explain 必用</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">Online Schema Change</a>：pg_repack 是 extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>：pg_partman 是 extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">SQL Features Baseline</a>：extension 是 PG 結構性領先之一</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">PG SQL Features Baseline</a>（extension 是結構優勢）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">PG Citus Distributed</a>（extension example）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">PG Online Schema Change</a>（pg_repack）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">PG Declarative Partitioning</a>（pg_partman）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（pg_stat_statements + auto_explain）</li>
<li>官方：<a href="https://www.postgresql.org/docs/current/extend-extensions.html">PG Extensions</a> / <a href="https://github.com/pgvector/pgvector">pgvector</a> / <a href="https://docs.timescale.com/">TimescaleDB</a> / <a href="https://postgis.net/">PostGIS</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Full-Text Search：tsvector / tsquery / GIN index 跟 pg_trgm fuzzy 三層搜尋</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/full-text-search/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/full-text-search/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>full-text search&lt;/em> — 內建 tsvector / tsquery + pg_trgm fuzzy match。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-fts-機制tsvector--tsquery--gin-index">PG FTS 機制：tsvector + tsquery + GIN index&lt;/h2>
&lt;p>PG 內建 full-text search 三件組：&lt;/p>
&lt;ul>
&lt;li>&lt;code>tsvector&lt;/code>：document 轉成 &lt;em>lexeme&lt;/em>（字根 + position）vector、normalized 後存&lt;/li>
&lt;li>&lt;code>tsquery&lt;/code>：搜尋字串 parse 成 query 形式&lt;/li>
&lt;li>GIN index：對 tsvector 加 inverted index&lt;/li>
&lt;/ul>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- Document
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;The quick brown fox jumps over the lazy dog&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 結果：&amp;#39;brown&amp;#39;:3 &amp;#39;dog&amp;#39;:9 &amp;#39;fox&amp;#39;:4 &amp;#39;jump&amp;#39;:5 &amp;#39;lazi&amp;#39;:8 &amp;#39;quick&amp;#39;:2
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="c1">-- The/over 是 stop word 被過濾、jumps/lazy 轉字根、保留 position
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Query
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsquery&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;fox &amp;amp; dog&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 結果：&amp;#39;fox&amp;#39; &amp;amp; &amp;#39;dog&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Match
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;The quick brown fox&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">@@&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsquery&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;fox &amp;amp; quick&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- → true&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Index&lt;/strong>：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">articles&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">body&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- GIN index over tsvector (動態 cast)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">idx_articles_fts&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">articles&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">USING&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">GIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">body&lt;/span>&lt;span class="p">));&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Query 用 index
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">articles&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">body&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">@@&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsquery&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;postgres &amp;amp; index&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &amp;#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&amp;#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &amp;#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">JSONB GIN index&lt;/a> 同 GIN access method、不同 indexed expression。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>full-text search</em> — 內建 tsvector / tsquery + pg_trgm fuzzy match。</p></blockquote>
<hr>
<h2 id="pg-fts-機制tsvector--tsquery--gin-index">PG FTS 機制：tsvector + tsquery + GIN index</h2>
<p>PG 內建 full-text search 三件組：</p>
<ul>
<li><code>tsvector</code>：document 轉成 <em>lexeme</em>（字根 + position）vector、normalized 後存</li>
<li><code>tsquery</code>：搜尋字串 parse 成 query 形式</li>
<li>GIN index：對 tsvector 加 inverted index</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Document
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;The quick brown fox jumps over the lazy dog&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- 結果：&#39;brown&#39;:3 &#39;dog&#39;:9 &#39;fox&#39;:4 &#39;jump&#39;:5 &#39;lazi&#39;:8 &#39;quick&#39;:2
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1">-- The/over 是 stop word 被過濾、jumps/lazy 轉字根、保留 position
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- Query
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;fox &amp; dog&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- 結果：&#39;fox&#39; &amp; &#39;dog&#39;
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Match
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;The quick brown fox&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;fox &amp; quick&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- → true</span></span></span></code></pre></div><p><strong>Index</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">title</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">body</span><span class="w"> </span><span class="nb">TEXT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- GIN index over tsvector (動態 cast)
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_fts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">body</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- Query 用 index
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">body</span><span class="p">)</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres &amp; index&#39;</span><span class="p">);</span></span></span></code></pre></div><p>跟 <a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">JSONB GIN index</a> 同 GIN access method、不同 indexed expression。</p>
<h2 id="generated-column-加速">Generated column 加速</h2>
<p>每次 query 都跑 <code>to_tsvector(...)</code> 浪費 CPU。用 <em>generated column</em> 預存：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="n">tsvector</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)))</span><span class="w"> </span><span class="n">STORED</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_fts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">fts</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Query 簡化
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Stored generated column 是 PG 12+、自動跟 row update 同步。</p>
<h2 id="ranking--加權">Ranking + 加權</h2>
<p>PG FTS 提供 <code>ts_rank</code> / <code>ts_rank_cd</code> 給結果排序：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 簡單 ranking
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">ts_rank</span><span class="p">(</span><span class="n">fts</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rank</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="p">,</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres &amp; index&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">query</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">query</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>加權（A &gt; B &gt; C &gt; D）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Title 比 body 重要
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">=</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)),</span><span class="w"> </span><span class="s1">&#39;A&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)),</span><span class="w"> </span><span class="s1">&#39;B&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- Query 用加權 ranking
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">       </span><span class="n">ts_rank</span><span class="p">(</span><span class="n">fts</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">,</span><span class="w"> </span><span class="mi">32</span><span class="w"> </span><span class="cm">/* normalize by document length */</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rank</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="p">,</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">query</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">query</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span></span></span></code></pre></div><p><code>ts_rank</code> 第三 parameter 是 normalization flag：</p>
<ul>
<li>0：no normalization</li>
<li>1：divide by document length</li>
<li>32：divide by uniqueness（避免短 doc 一律 rank 高）</li>
</ul>
<h2 id="multi-language-support">Multi-language Support</h2>
<p>PG 內建多種語言 dictionary：<code>english</code> / <code>french</code> / <code>german</code> / <code>spanish</code> / <code>simple</code>（不做 stemming）等。</p>
<p>對 <em>中文 / 日文 / 韓文</em>、PG 預設無支援、需要 extension：</p>
<ul>
<li><code>zhparser</code>（中文、用 SCWS 分詞）</li>
<li><code>pgroonga</code>（多語言、支援中日韓）</li>
<li><code>RUM index</code>（PG 自己 + 可選 dictionary）</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 中文用 zhparser
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">zhparser</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">SEARCH</span><span class="w"> </span><span class="n">CONFIGURATION</span><span class="w"> </span><span class="n">chinese</span><span class="w"> </span><span class="p">(</span><span class="n">PARSER</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">zhparser</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">SEARCH</span><span class="w"> </span><span class="n">CONFIGURATION</span><span class="w"> </span><span class="n">chinese</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">ADD</span><span class="w"> </span><span class="n">MAPPING</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="n">v</span><span class="p">,</span><span class="n">a</span><span class="p">,</span><span class="n">i</span><span class="p">,</span><span class="n">e</span><span class="p">,</span><span class="n">l</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="k">simple</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- 使用
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;我愛 PostgreSQL 資料庫&#39;</span><span class="p">);</span></span></span></code></pre></div><p>對 <em>主要英文 search</em> 場景 PG built-in 夠用、對 <em>主要 CJK search</em> 需要 extension。</p>
<h2 id="pg_trgm--fuzzy-string-match">pg_trgm — Fuzzy String Match</h2>
<p>PG FTS 對 <em>精確字根 match</em> 強、對 <em>拼錯 / similar string</em> 弱。<code>pg_trgm</code> extension 提供 trigram-based fuzzy match：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_trgm</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- 對 column 建 GIN trigram index
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_name_trgm</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="n">gin_trgm_ops</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- Fuzzy match（similarity threshold 預設 0.3）
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">%</span><span class="w"> </span><span class="s1">&#39;jhon&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- → 找到 &#39;John&#39;、&#39;Johan&#39;、&#39;Johnny&#39; 等 similar string
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 顯式 similarity score
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">similarity</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;jhon&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">similarity</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;jhon&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span></span></span></code></pre></div><p>用途：</p>
<ul>
<li>Autocomplete / typeahead suggestion</li>
<li>拼錯容錯（user 輸入 typo）</li>
<li>ILIKE 加速（<code>name ILIKE '%jhon%'</code> 走 GIN trigram index）</li>
</ul>
<p>跟 FTS 互補：</p>
<ul>
<li>FTS：full document search、tokenize / stemming / ranking</li>
<li>pg_trgm：short string similarity、typo tolerance</li>
</ul>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-dictionary-選錯--中文搜不到">1. Dictionary 選錯 — 中文搜不到</h3>
<p>對中文 column 用 <code>to_tsvector('english', text)</code>、不分詞、整段當一個 token、搜不到任何結果。</p>
<p>修法：</p>
<ul>
<li>中文用 <code>zhparser</code> / <code>pgroonga</code></li>
<li>多語言 column 拆 <em>per-language column</em> 或用 <code>simple</code> dictionary（不 stemming、字元級 match）</li>
<li>確認 dictionary 選對：<code>SELECT to_tsvector('chinese', '...')</code> 看分詞結果</li>
</ul>
<h3 id="2-gin-vs-gist-取捨選錯">2. GIN vs GiST 取捨選錯</h3>
<p>PG FTS 有兩種 index access method：</p>
<ul>
<li><em>GIN</em>：read fast、write slow、size 大、適合 <em>read-heavy</em></li>
<li><em>GiST</em>：read 慢、write fast、size 小、適合 <em>write-heavy 或 small doc</em></li>
</ul>
<p>預設選 GIN、適合 90% search workload。對 <em>寫入頻繁 + 文件小</em> 場景 GiST。</p>
<p>修法：</p>
<ul>
<li>預設 GIN</li>
<li>寫吞吐 &gt; 10K WPS 場景考慮 GiST 或 <em>bulk index</em>（先 disable index、bulk insert、重建 index）</li>
<li>GIN 有 <code>fastupdate</code> option、buffering 加速寫入（trade-off：read 慢）</li>
</ul>
<h3 id="3-ranking-評分權重不對齊-business">3. Ranking 評分權重不對齊 business</h3>
<p><code>ts_rank</code> 預設不考慮 <em>field weight</em>、<code>ts_rank_cd</code> 考慮 cover density、兩者結果不同。Application 不知道 <em>自己 query 對應哪個 rank function</em>、結果隨機。</p>
<p>修法：</p>
<ul>
<li>顯式選 ranking function：<code>ts_rank</code> 一般用、<code>ts_rank_cd</code> 對 <em>proximity 重要</em> 場景</li>
<li>設 <em>field weight</em>（A &gt; B &gt; C &gt; D）反映 business priority（title &gt; body &gt; tags）</li>
<li>對 <em>搜尋結果</em> 用 A/B test 評估 ranking 質量、不靠直覺</li>
</ul>
<h3 id="4-multi-language-column-處理">4. Multi-language column 處理</h3>
<p>Application 同表存多語言 row（user-generated content、不同 language）、用單一 <code>to_tsvector('english', ...)</code> 對中文 row 搜不到、對 french row 也 stem 錯。</p>
<p>修法：</p>
<ul>
<li>
<p>加 <code>language</code> column 標每 row 語言</p>
</li>
<li>
<p>用 dynamic dictionary：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="n">tsvector</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">to_tsvector</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">        </span><span class="k">CASE</span><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="k">language</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;zh&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="s1">&#39;chinese&#39;</span><span class="p">::</span><span class="n">regconfig</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">             </span><span class="k">WHEN</span><span class="w"> </span><span class="k">language</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;fr&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="s1">&#39;french&#39;</span><span class="p">::</span><span class="n">regconfig</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">             </span><span class="k">ELSE</span><span class="w"> </span><span class="s1">&#39;english&#39;</span><span class="p">::</span><span class="n">regconfig</span><span class="w"> </span><span class="k">END</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">        </span><span class="n">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">    </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">STORED</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p>Query 時用對應語言 <code>to_tsquery</code></p>
</li>
</ul>
<h3 id="5-何時不該用-pg-fts--應該換-elasticsearch--opensearch">5. 何時不該用 PG FTS — 應該換 Elasticsearch / OpenSearch</h3>
<p>PG FTS 適合 <em>中小規模搜尋</em>、不適合：</p>
<ul>
<li><em>&gt; 100M document</em> high-QPS search</li>
<li>需要 <em>complex aggregation</em>（faceted search）</li>
<li>需要 <em>advanced ranking</em>（BM25 / learning to rank）</li>
<li>需要 <em>分散式 search</em>（PG FTS 是 single-node）</li>
<li>需要 <em>near-real-time indexing</em>（PG GIN update 較慢）</li>
</ul>
<p>對這些場景、用 Elasticsearch / OpenSearch / Meilisearch / Typesense 等專業 search engine。</p>
<p>PG FTS <em>優勢</em> 是 <em>跟 OLTP data 同 transaction</em> — 不需要 ETL 同步 search index、application 寫 PG 立即 searchable。對 application data + search 是 <em>同源</em> 的場景 PG FTS 比較適合。</p>
<h2 id="何時用-pg-fts">何時用 PG FTS</h2>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>選擇</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application internal search（admin / dashboard）</td>
          <td>PG FTS</td>
      </tr>
      <tr>
          <td>&lt; 10M document、低 QPS（&lt; 100/s）</td>
          <td>PG FTS</td>
      </tr>
      <tr>
          <td>Search 跟 OLTP data 同 transaction needed</td>
          <td>PG FTS</td>
      </tr>
      <tr>
          <td>Fuzzy / typo tolerance</td>
          <td>PG FTS + pg_trgm</td>
      </tr>
      <tr>
          <td>&gt; 100M document + high QPS</td>
          <td>Elasticsearch / OpenSearch</td>
      </tr>
      <tr>
          <td>Faceted aggregation</td>
          <td>Elasticsearch / OpenSearch</td>
      </tr>
      <tr>
          <td>Vector similarity（semantic search）</td>
          <td>pgvector（同 PG）</td>
      </tr>
  </tbody>
</table>
<p>PG FTS + pgvector 組合對 <em>中小規模 hybrid keyword + semantic search</em> 是強選擇。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">JSONB Deep Dive</a>：JSONB 跟 FTS 都用 GIN</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">Extension Ecosystem</a>：pg_trgm / pgroonga / zhparser 都是 extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>：FTS query 的 EXPLAIN</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>：FTS GIN index 在 standby 自動 replicate</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">PG Extension Ecosystem</a>（pg_trgm / pgroonga）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">PG JSONB Deep Dive</a>（共用 GIN）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（FTS query plan）</li>
<li>官方：<a href="https://www.postgresql.org/docs/current/textsearch.html">PG Full-Text Search</a> / <a href="https://www.postgresql.org/docs/current/pgtrgm.html">pg_trgm</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Replication Slot Management：Physical / Logical / Failover Slot 治理</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-slot-management/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-slot-management/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>replication slot management&lt;/em> — physical / logical / failover slot 三類治理。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="replication-slot-兩大類">Replication Slot 兩大類&lt;/h2>
&lt;p>PG 兩種 replication slot：&lt;/p>
&lt;h3 id="physical-replication-slot">Physical Replication Slot&lt;/h3>
&lt;p>對應 &lt;em>streaming replication&lt;/em>（physical WAL byte-level）：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_create_physical_replication_slot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;standby1_slot&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用於：&lt;/p>
&lt;ul>
&lt;li>Streaming replication standby（&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &amp;#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &amp;#43; LSN-based 進度追蹤 &amp;#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &amp;#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &amp;#43; logical replication 整合">Replication Topology&lt;/a>）&lt;/li>
&lt;li>pg_basebackup 用 slot 防 WAL 清理&lt;/li>
&lt;li>高 lag standby 防 WAL premature deletion&lt;/li>
&lt;/ul>
&lt;h3 id="logical-replication-slot">Logical Replication Slot&lt;/h3>
&lt;p>對應 &lt;em>logical replication / logical decoding&lt;/em>：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_create_logical_replication_slot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;my_slot&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;pgoutput&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 或用 wal2json plugin
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_create_logical_replication_slot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;debezium_slot&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;wal2json&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用於：&lt;/p>
&lt;ul>
&lt;li>PG-to-PG logical replication（publication / subscription）&lt;/li>
&lt;li>CDC（Debezium / Maxwell / pg_logical_emitter）&lt;/li>
&lt;li>Multi-master replication（BDR / pgEdge / Spock）&lt;/li>
&lt;/ul>
&lt;p>logical slot 跟 physical slot 共存、各自獨立 retention。&lt;/p>
&lt;h2 id="slot-lifecycle">Slot Lifecycle&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">建立 → active（有 consumer）→ inactive（consumer 失聯）→ drop
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> WAL 持續累積（直到推進 LSN 或 drop）&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>狀態查詢&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>replication slot management</em> — physical / logical / failover slot 三類治理。</p></blockquote>
<hr>
<h2 id="replication-slot-兩大類">Replication Slot 兩大類</h2>
<p>PG 兩種 replication slot：</p>
<h3 id="physical-replication-slot">Physical Replication Slot</h3>
<p>對應 <em>streaming replication</em>（physical WAL byte-level）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby1_slot&#39;</span><span class="p">);</span></span></span></code></pre></div><p>用於：</p>
<ul>
<li>Streaming replication standby（<a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>）</li>
<li>pg_basebackup 用 slot 防 WAL 清理</li>
<li>高 lag standby 防 WAL premature deletion</li>
</ul>
<h3 id="logical-replication-slot">Logical Replication Slot</h3>
<p>對應 <em>logical replication / logical decoding</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_logical_replication_slot</span><span class="p">(</span><span class="s1">&#39;my_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;pgoutput&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- 或用 wal2json plugin
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_logical_replication_slot</span><span class="p">(</span><span class="s1">&#39;debezium_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;wal2json&#39;</span><span class="p">);</span></span></span></code></pre></div><p>用於：</p>
<ul>
<li>PG-to-PG logical replication（publication / subscription）</li>
<li>CDC（Debezium / Maxwell / pg_logical_emitter）</li>
<li>Multi-master replication（BDR / pgEdge / Spock）</li>
</ul>
<p>logical slot 跟 physical slot 共存、各自獨立 retention。</p>
<h2 id="slot-lifecycle">Slot Lifecycle</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">建立 → active（有 consumer）→ inactive（consumer 失聯）→ drop
</span></span><span class="line"><span class="ln">2</span><span class="cl">                                    ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">                              WAL 持續累積（直到推進 LSN 或 drop）</span></span></code></pre></div><p><strong>狀態查詢</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">       </span><span class="n">slot_type</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">       </span><span class="n">active</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">       </span><span class="n">restart_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">       </span><span class="n">confirmed_flush_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">retained_wal</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="p">;</span></span></span></code></pre></div><p>關鍵欄位：</p>
<ul>
<li><code>slot_type</code>：<code>physical</code> / <code>logical</code></li>
<li><code>active</code>：true / false（consumer 是否連著）</li>
<li><code>restart_lsn</code>：slot 起點 LSN、primary 必須保留這以後的 WAL</li>
<li><code>confirmed_flush_lsn</code>：logical slot 已 confirm flush 的 LSN</li>
<li><code>retained_wal</code>：當前因 slot 累積的 WAL</li>
</ul>
<h2 id="failover-slot-synchronization-pg-17">Failover Slot Synchronization (PG 17+)</h2>
<p>PG 17 之前的 <em>痛點</em>：logical replication slot 是 <em>primary 上的 state</em>、failover 後 <em>新 primary 沒這個 slot</em>、CDC consumer 失聯、需要重建（大工程）。</p>
<p>PG 17 加 <em>failover slot synchronization</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- PG 17+：標 slot 為 failover-tracked
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1">-- signature: pg_create_logical_replication_slot(slot_name, plugin, temporary, two_phase, failover)
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_logical_replication_slot</span><span class="p">(</span><span class="s1">&#39;my_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;pgoutput&#39;</span><span class="p">,</span><span class="w"> </span><span class="k">false</span><span class="p">,</span><span class="w"> </span><span class="k">false</span><span class="p">,</span><span class="w"> </span><span class="k">true</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">--                                                                          ↑
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1">--                                                                     failover=true（第 5 個參數）
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1">-- 注意：第 4 個參數是 two_phase（這裡 false）、第 5 個才是 failover
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- Standby 上 enable sync_replication_slots
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">sync_replication_slots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">on</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_reload_conf</span><span class="p">();</span></span></span></code></pre></div><p><code>sync_replication_slots = on</code> 後、physical replication 同步 slot state 到 standby。Failover promote standby 後、logical slot 仍可用、CDC consumer 重連即可。</p>
<p>PG 17 之前用 <a href="https://www.pgedge.com/">pgEdge</a> / <em>pglogical</em> 等 extension 提供類似功能、現在 PG core 內建。</p>
<h2 id="orphan-slot-治理">Orphan Slot 治理</h2>
<p><code>active = false</code> 的 slot 持續累積 WAL、disk 爆是 PG production 經典事故。</p>
<h3 id="監控-orphan-slot">監控 orphan slot</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 找 inactive 太久的 slot
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">retained_wal</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">active</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">1024</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">1024</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">1024</span><span class="p">;</span><span class="w">  </span><span class="c1">-- &gt; 1 GB</span></span></span></code></pre></div><h3 id="自動-invalidate-slotpg-13">自動 invalidate slot（PG 13+）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- postgresql.conf
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">max_slot_wal_keep_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;50GB&#39;</span><span class="p">;</span><span class="w">  </span><span class="c1">-- slot 累積 &gt; 50GB 自動 invalidate</span></span></span></code></pre></div><p>當 slot 累積 WAL 超過 <code>max_slot_wal_keep_size</code>、PG 自動 invalidate slot（<code>active=false</code> 且不再保留 WAL）。Consumer 重連會 fail、必須重建（base backup + new slot）。</p>
<p>這是 <em>trade-off</em>：</p>
<ul>
<li>設 limit → 保護 disk、但 consumer 失聯 → 大重建工作</li>
<li>不設 limit → consumer 失聯 OK、但 disk 爆</li>
</ul>
<p>實務多數設 <code>max_slot_wal_keep_size</code> 給 <em>disk capacity 50%</em>、避免徹底 disk full。</p>
<h3 id="手動-drop-orphan-slot">手動 drop orphan slot</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 確認 slot 真的不需要
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">slot_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;old_standby_slot&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Drop
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_drop_replication_slot</span><span class="p">(</span><span class="s1">&#39;old_standby_slot&#39;</span><span class="p">);</span></span></span></code></pre></div><p>DR runbook 必須包含 <em>standby 退役流程</em>：先 standby fence、再 primary drop slot。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-orphan-slot-disk-爆">1. Orphan slot disk 爆</h3>
<p>最經典 PG 事故：standby decomission 沒 drop slot、primary 持續保留 WAL、<code>pg_wal/</code> 累積到 disk full、primary 也掛。</p>
<p>修法：</p>
<ul>
<li>監控 <code>pg_replication_slots</code> + <code>pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))</code> retained_wal</li>
<li>設 <code>max_slot_wal_keep_size</code>（PG 13+）— hard limit</li>
<li>Standby 退役 runbook 強制 <em>先 fence、再 drop slot</em></li>
<li>Cron job 自動 alert orphan slot</li>
</ul>
<h3 id="2-logical-slot-lag--cdc-consumer-跟不上">2. Logical slot lag — CDC consumer 跟不上</h3>
<p>Logical decoding 比 physical replication 慢（per-transaction logical event 重組）。CDC consumer（Debezium）跟不上 → slot lag 累積。</p>
<p>修法：</p>
<ul>
<li>監控 <code>pg_replication_slots.confirmed_flush_lsn</code> 跟 primary <code>pg_current_wal_lsn()</code> 對比</li>
<li>CDC consumer 性能調整（throughput / batch size）</li>
<li>Throttle source writes（如果不能升 consumer）</li>
<li>對 hot table 拆 publication / subscription、避免單 slot 處理所有變更</li>
</ul>
<p>詳見 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>。</p>
<h3 id="3-failover-後-logical-slot-丟pg-16-之前">3. Failover 後 logical slot 丟（PG 16 之前）</h3>
<p>PG 16 之前、failover promote standby、新 primary 沒有原 logical slot。CDC consumer 試連、ERROR: <code>replication slot &quot;xxx&quot; does not exist</code>。</p>
<p>修法（PG 17+）：</p>
<ul>
<li>用 <em>failover slot synchronization</em>（如上）</li>
<li><code>pg_create_logical_replication_slot(...,  failover := true)</code></li>
<li>Standby <code>sync_replication_slots = on</code></li>
</ul>
<p>修法（PG 16-）：</p>
<ul>
<li>用 <a href="https://www.2ndquadrant.com/en/resources/pglogical/">pglogical</a> 或 <a href="https://www.pgedge.com/">pgEdge</a> extension</li>
<li>Failover runbook 包含 <em>新 primary 重建 logical slot</em>（CDC consumer 重 snapshot）</li>
<li>Pre-create slot on standby + manual sync（早期 workaround）</li>
</ul>
<h3 id="4-wal_keep_size-跟-slot-衝突">4. <code>wal_keep_size</code> 跟 slot 衝突</h3>
<p><code>wal_keep_size</code>（PG 13+）/ <code>wal_keep_segments</code>（&lt; 13）跟 slot 都會保留 WAL：</p>
<ul>
<li><code>wal_keep_size</code>：固定 minimum WAL 保留量</li>
<li>Slot：動態保留直到 consumer 推進</li>
</ul>
<p>兩者一起 set 時：實際保留 WAL = <code>max(wal_keep_size, slot 需要的量)</code>。</p>
<p>修法：</p>
<ul>
<li><code>wal_keep_size</code> 設小（如 1-2 GB）作 <em>minimum backup</em></li>
<li>主要靠 slot 動態保留 — 給 active consumer</li>
<li>監控 <code>pg_wal/</code> 大小 + 拆解 retention source（<code>wal_keep_size</code> vs slot 各佔多少）</li>
</ul>
<h3 id="5-slot-數量上限">5. Slot 數量上限</h3>
<p><code>max_replication_slots</code> 預設 10、不夠時新 slot 建不出來、報錯。</p>
<p>修法：</p>
<ul>
<li>Production 大 cluster 設 <code>max_replication_slots = 50</code> 或更多</li>
<li>對 <em>standby + logical replication + CDC consumer</em> 同時跑、計算需要的 slot 數</li>
<li>監控 <code>SELECT count(*) FROM pg_replication_slots</code> 接近 limit 時告警</li>
</ul>
<h2 id="slot-naming-convention">Slot Naming Convention</h2>
<p>Production 大 cluster 多 slot、命名 convention 重要：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">&lt;consumer-type&gt;_&lt;consumer-name&gt;_&lt;purpose&gt;
</span></span><span class="line"><span class="ln">2</span><span class="cl">例：
</span></span><span class="line"><span class="ln">3</span><span class="cl">- physical_standby1_replication
</span></span><span class="line"><span class="ln">4</span><span class="cl">- physical_standby2_replication
</span></span><span class="line"><span class="ln">5</span><span class="cl">- logical_debezium_orders_cdc
</span></span><span class="line"><span class="ln">6</span><span class="cl">- logical_pgedge_node2_subscription
</span></span><span class="line"><span class="ln">7</span><span class="cl">- physical_pgbasebackup_temp（base backup 用、completed 後 drop）</span></span></code></pre></div><p>清楚命名讓 <em>看 slot 名</em> 就知道用途、誰負責、能不能 drop。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>：physical slot 給 streaming replication 用</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>：logical slot 給 CDC</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/bdr-multi-master/" data-link-title="PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理" data-link-desc="PG 預設是 single-primary、active-active 多寫入入口需要 *BDR (EDB)* / *pgEdge* / *Bucardo* 等 extension。本文走 3 種 multi-master 方案對比、conflict detection &#43; resolution model、async vs sync 取捨、配置 step-by-step（pgEdge 為主）、5 production 踩雷（last-write-wins data loss / sequence collision / DDL replication / conflict log 治理 / failover 後 timeline 分歧）、跟 MySQL Group Replication sibling 對比">BDR / Multi-Master</a>：multi-master 大量用 logical slot</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a>：WAL archive 跟 slot 是兩種 WAL retention 機制、可並行</li>
</ul>
<h2 id="監控-metric">監控 metric</h2>
<p>Production 持續監控：</p>
<ul>
<li><code>pg_replication_slots.active</code> — 失聯 slot</li>
<li><code>pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)</code> — slot 累積 WAL</li>
<li><code>pg_replication_slots.confirmed_flush_lsn</code> vs <code>pg_current_wal_lsn()</code> — logical slot lag</li>
<li><code>pg_ls_waldir()</code> 看 <code>pg_wal/</code> 目錄大小</li>
<li><code>count(*) FROM pg_replication_slots</code> 對 <code>max_replication_slots</code> 比例</li>
</ul>
<p>把這些丟進 Datadog / Prometheus + alert。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（physical slot 用途）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a>（logical slot 用途）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/bdr-multi-master/" data-link-title="PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理" data-link-desc="PG 預設是 single-primary、active-active 多寫入入口需要 *BDR (EDB)* / *pgEdge* / *Bucardo* 等 extension。本文走 3 種 multi-master 方案對比、conflict detection &#43; resolution model、async vs sync 取捨、配置 step-by-step（pgEdge 為主）、5 production 踩雷（last-write-wins data loss / sequence collision / DDL replication / conflict log 治理 / failover 後 timeline 分歧）、跟 MySQL Group Replication sibling 對比">PG BDR / Multi-Master</a>（multi-master 大量 slot）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PG PITR + WAL Archiving</a>（WAL retention 兩種機制）</li>
<li>官方：<a href="https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS">PG Replication Slots</a> / <a href="https://www.postgresql.org/docs/current/logicaldecoding.html">Logical Replication Slot</a></li>
</ul>
]]></content:encoded></item><item><title>TimescaleDB Deep Dive：Hypertable / Continuous Aggregate / Compression 把 PG 變 Time-Series DB</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/timescaledb-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/timescaledb-deep-dive/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>TimescaleDB extension&lt;/em> — 用 PG 解 time-series workload 的路徑、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem&lt;/a> 是 &lt;em>單一 extension 細節 vs ecosystem 全景&lt;/em> 的關係。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="timescaledb-是-pg-的-time-series-specialization">TimescaleDB 是 PG 的 &lt;em>Time-Series Specialization&lt;/em>&lt;/h2>
&lt;p>TimescaleDB 不是獨立 DB、是 PG extension：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">timescaledb&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>加完後、PG 多三個 time-series 專屬機制：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Hypertable&lt;/strong>：對 time column 自動 partition、應用層看是一張表&lt;/li>
&lt;li>&lt;strong>Continuous aggregate&lt;/strong>：incremental refresh 的 materialized view&lt;/li>
&lt;li>&lt;strong>Compression&lt;/strong>：對舊 chunk 壓縮（columnar-like format）&lt;/li>
&lt;/ol>
&lt;p>跟專業 time-series DB（InfluxDB / Prometheus / VictoriaMetrics）對比、TimescaleDB 的賣點不是「最快」而是「PG ecosystem 一致」：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>TimescaleDB&lt;/th>
 &lt;th>InfluxDB&lt;/th>
 &lt;th>Prometheus&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Query 語言&lt;/td>
 &lt;td>標準 SQL&lt;/td>
 &lt;td>InfluxQL / Flux&lt;/td>
 &lt;td>PromQL&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>寫入效能&lt;/td>
 &lt;td>中（10-100K rows/s）&lt;/td>
 &lt;td>高（500K+ rows/s）&lt;/td>
 &lt;td>中（pull-based scrape）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>壓縮&lt;/td>
 &lt;td>90%+（columnar compression）&lt;/td>
 &lt;td>高&lt;/td>
 &lt;td>高&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Join&lt;/td>
 &lt;td>完整 SQL join&lt;/td>
 &lt;td>弱&lt;/td>
 &lt;td>不支援&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>跟既有 PG schema&lt;/td>
 &lt;td>同一個 DB、可 join&lt;/td>
 &lt;td>獨立&lt;/td>
 &lt;td>獨立&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>生態&lt;/td>
 &lt;td>完整 PG ecosystem&lt;/td>
 &lt;td>自家 ecosystem&lt;/td>
 &lt;td>自家 ecosystem&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Open source&lt;/td>
 &lt;td>Apache 2.0（部分功能 TSL license）&lt;/td>
 &lt;td>MIT&lt;/td>
 &lt;td>Apache 2.0&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>何時選 TimescaleDB&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>TimescaleDB extension</em> — 用 PG 解 time-series workload 的路徑、跟 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 是 <em>單一 extension 細節 vs ecosystem 全景</em> 的關係。</p></blockquote>
<hr>
<h2 id="timescaledb-是-pg-的-time-series-specialization">TimescaleDB 是 PG 的 <em>Time-Series Specialization</em></h2>
<p>TimescaleDB 不是獨立 DB、是 PG extension：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">timescaledb</span><span class="p">;</span></span></span></code></pre></div><p>加完後、PG 多三個 time-series 專屬機制：</p>
<ol>
<li><strong>Hypertable</strong>：對 time column 自動 partition、應用層看是一張表</li>
<li><strong>Continuous aggregate</strong>：incremental refresh 的 materialized view</li>
<li><strong>Compression</strong>：對舊 chunk 壓縮（columnar-like format）</li>
</ol>
<p>跟專業 time-series DB（InfluxDB / Prometheus / VictoriaMetrics）對比、TimescaleDB 的賣點不是「最快」而是「PG ecosystem 一致」：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>TimescaleDB</th>
          <th>InfluxDB</th>
          <th>Prometheus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Query 語言</td>
          <td>標準 SQL</td>
          <td>InfluxQL / Flux</td>
          <td>PromQL</td>
      </tr>
      <tr>
          <td>寫入效能</td>
          <td>中（10-100K rows/s）</td>
          <td>高（500K+ rows/s）</td>
          <td>中（pull-based scrape）</td>
      </tr>
      <tr>
          <td>壓縮</td>
          <td>90%+（columnar compression）</td>
          <td>高</td>
          <td>高</td>
      </tr>
      <tr>
          <td>Join</td>
          <td>完整 SQL join</td>
          <td>弱</td>
          <td>不支援</td>
      </tr>
      <tr>
          <td>跟既有 PG schema</td>
          <td>同一個 DB、可 join</td>
          <td>獨立</td>
          <td>獨立</td>
      </tr>
      <tr>
          <td>生態</td>
          <td>完整 PG ecosystem</td>
          <td>自家 ecosystem</td>
          <td>自家 ecosystem</td>
      </tr>
      <tr>
          <td>Open source</td>
          <td>Apache 2.0（部分功能 TSL license）</td>
          <td>MIT</td>
          <td>Apache 2.0</td>
      </tr>
  </tbody>
</table>
<p><strong>何時選 TimescaleDB</strong>：</p>
<ul>
<li>Application 已用 PG、不想多管一套 time-series DB</li>
<li>需要 join time-series 跟 application 表（user / device metadata）</li>
<li>不需 InfluxDB 級寫入速度（&lt; 100K rows/s）</li>
<li>Team SQL 熟、PromQL / Flux 學習成本不想付</li>
</ul>
<p><strong>何時選 InfluxDB / Prometheus（不選 TimescaleDB）</strong>：</p>
<ul>
<li>High-cardinality metric（10M+ unique series）— TSDB-purpose-built engine 在 cardinality 跟 retention 上比 hypertable 高效</li>
<li>Pull-based scrape model（Prometheus）跟 alerting / Grafana 生態深整合</li>
<li>PromQL operator（<code>rate()</code> / <code>histogram_quantile()</code>）對 metric query 比 SQL 直覺</li>
<li>TSL license 不能接受（TimescaleDB 部分功能在 Timescale License、不是純 Apache 2.0）</li>
<li>Operational team 已熟 InfluxDB / Prometheus、不想多學 PG 維運</li>
</ul>
<h2 id="hypertable自動-time-based-partitioning">Hypertable：自動 Time-based Partitioning</h2>
<p>普通 PG 表變 hypertable：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">sensor_data</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">time</span><span class="w">        </span><span class="n">TIMESTAMPTZ</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">sensor_id</span><span class="w">   </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">temperature</span><span class="w"> </span><span class="n">DOUBLE</span><span class="w"> </span><span class="k">PRECISION</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">humidity</span><span class="w">    </span><span class="n">DOUBLE</span><span class="w"> </span><span class="k">PRECISION</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- 變 hypertable、按 time 自動 partition
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_hypertable</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;time&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Hypertable 機制：</p>
<ul>
<li>後台自動拆 <em>chunk</em>（child partition）by time interval（預設 7 天）</li>
<li>Application 看到的是 <code>sensor_data</code> 一張表、實際資料分散在 <code>_timescaledb_internal._hyper_*_chunk</code> 表</li>
<li>Query 自動 chunk pruning（只掃命中時間範圍的 chunk）</li>
</ul>
<p><strong>Chunk interval 選擇</strong>很關鍵：</p>
<table>
  <thead>
      <tr>
          <th>Chunk interval</th>
          <th>適用</th>
          <th>問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 小時</td>
          <td>高頻 metrics（每秒 100+ row）</td>
          <td>Chunk 太多、catalog 膨脹</td>
      </tr>
      <tr>
          <td>1 天</td>
          <td>中高頻（每秒 10-100 row）</td>
          <td>OK</td>
      </tr>
      <tr>
          <td>7 天（預設）</td>
          <td>中頻（每分鐘 row）</td>
          <td>OK</td>
      </tr>
      <tr>
          <td>30 天</td>
          <td>低頻（每小時 row）</td>
          <td>OK</td>
      </tr>
  </tbody>
</table>
<p>通用原則：<em>每個 chunk 25% RAM</em>、超過退化 disk IO。Production 監控 <code>chunk_size</code> 跟 <code>shared_buffers</code> ratio 自動調。</p>
<p><strong>Multi-dimensional hypertable</strong>（time + space partition）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 按 time + device_id 雙維 partition
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_hypertable</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;time&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">partitioning_column</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;sensor_id&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">number_partitions</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="mi">16</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>適用 sensor 數 1000+ 的 IoT workload、單 chunk 太大時用 space partition 拆。</p>
<h2 id="continuous-aggregatecaggincremental-materialized-view">Continuous Aggregate（CAGG）：Incremental Materialized View</h2>
<p>普通 PG materialized view 是 <em>全量重算</em>、TimescaleDB CAGG 是 <em>incremental refresh</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1 小時粒度聚合
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">sensor_hourly</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">continuous</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">time_bucket</span><span class="p">(</span><span class="s1">&#39;1 hour&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">hour</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">sensor_id</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="k">avg</span><span class="p">(</span><span class="n">temperature</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">avg_temp</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="k">max</span><span class="p">(</span><span class="n">temperature</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">max_temp</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="k">min</span><span class="p">(</span><span class="n">temperature</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">min_temp</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">sample_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">sensor_data</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">hour</span><span class="p">,</span><span class="w"> </span><span class="n">sensor_id</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- 加 refresh policy（每 30 分鐘 refresh 過去 1 天）
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_continuous_aggregate_policy</span><span class="p">(</span><span class="s1">&#39;sensor_hourly&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="n">start_offset</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">    </span><span class="n">end_offset</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;30 minutes&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">    </span><span class="n">schedule_interval</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;30 minutes&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>CAGG 機制：</p>
<ul>
<li>記錄哪些 time bucket 已 materialize、哪些 stale</li>
<li>Refresh 時只重算 stale bucket、不全量</li>
<li>Query CAGG 自動 fallback 到原 hypertable 補最新資料（real-time aggregation）</li>
</ul>
<p><strong>CAGG vs 普通 MV 對比</strong>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>TimescaleDB CAGG</th>
          <th>普通 PG MV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Refresh 模式</td>
          <td>Incremental</td>
          <td>全量重算</td>
      </tr>
      <tr>
          <td>Refresh 時間</td>
          <td>秒級</td>
          <td>表大時數十分鐘</td>
      </tr>
      <tr>
          <td>Real-time fallback</td>
          <td>自動補最新</td>
          <td>不支援、需手動 union</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>多一份 aggregated</td>
          <td>多一份 aggregated</td>
      </tr>
      <tr>
          <td>Policy</td>
          <td>內建排程</td>
          <td>需 pg_cron / 外部排程</td>
      </tr>
  </tbody>
</table>
<p><strong>CAGG hierarchy</strong>（多層聚合）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 從 1 hour CAGG 再聚合到 1 day
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">sensor_daily</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">continuous</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">time_bucket</span><span class="p">(</span><span class="s1">&#39;1 day&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">hour</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">day</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="n">sensor_id</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">    </span><span class="k">avg</span><span class="p">(</span><span class="n">avg_temp</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">daily_avg</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">sensor_hourly</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">day</span><span class="p">,</span><span class="w"> </span><span class="n">sensor_id</span><span class="p">;</span></span></span></code></pre></div><p>Application query 不同時間範圍時自動命中對應粒度、不必每次掃原始資料。</p>
<h2 id="compression把舊-chunk-壓-90">Compression：把舊 Chunk 壓 90%+</h2>
<p>舊 chunk 可以開啟 compression：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 開啟 compression（必須先設定 segment by）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">sensor_data</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">timescaledb</span><span class="p">.</span><span class="n">compress</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">timescaledb</span><span class="p">.</span><span class="n">compress_segmentby</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sensor_id&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">timescaledb</span><span class="p">.</span><span class="n">compress_orderby</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;time DESC&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- 自動壓縮 policy：7 天前 chunk 壓
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_compression_policy</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;7 days&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Compression 機制：</p>
<ul>
<li>把 chunk 內 row 按 <code>segmentby</code> 分組</li>
<li>每組內按 <code>orderby</code> 排序後、把每 column 變成 <em>columnar array</em></li>
<li>對 array 用 type-specific 壓縮（Gorilla for float / delta-of-delta for timestamp / dictionary for string）</li>
</ul>
<p>實際壓縮率：</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>壓縮率</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>IoT sensor（重複值多）</td>
          <td>95-98%</td>
      </tr>
      <tr>
          <td>Application metrics</td>
          <td>90-95%</td>
      </tr>
      <tr>
          <td>Trade tick（隨機浮點）</td>
          <td>70-85%</td>
      </tr>
      <tr>
          <td>Log line（高 cardinality string）</td>
          <td>50-70%</td>
      </tr>
  </tbody>
</table>
<p><strong>Compression 限制</strong>（重要）：</p>
<ul>
<li>壓縮後 chunk <strong>不能 UPDATE / DELETE 單 row</strong>（要先 decompress）</li>
<li>壓縮後 chunk <strong>不能加 column</strong>（要 decompress 所有 chunk）</li>
<li>壓縮後 chunk 只能 <em>append new row</em>、不能改舊 row</li>
<li>DDL 變更（加 column / 改 index）需 decompress</li>
</ul>
<p>實務：compression 是 <em>write-once cold data</em> 的工具、active OLTP chunk 不開。</p>
<h2 id="retention-policy自動刪舊資料">Retention Policy：自動刪舊資料</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 1 年前 chunk 自動刪
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_retention_policy</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;1 year&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Retention drop 整個 chunk（不是 DELETE row）、O(1) 操作、不產生 bloat。</p>
<p>CAGG 有獨立 retention：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 原始資料只留 30 天、aggregated 留 5 年
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_retention_policy</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;30 days&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_retention_policy</span><span class="p">(</span><span class="s1">&#39;sensor_hourly&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;5 years&#39;</span><span class="p">);</span></span></span></code></pre></div><p>這是 TimescaleDB 跟普通 PG partitioning 最大的價值差 — 普通 PG 要自己寫 cron drop partition、TimescaleDB policy 內建。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="case-1chunk-size-不對catalog-膨脹">Case 1：Chunk size 不對、catalog 膨脹</h3>
<p><strong>情境</strong>：sensor 每秒寫 10 row、chunk_interval 設 1 小時、一年產 8760 chunk、<code>pg_class</code> 撐到 200 萬 row、planner 變慢。</p>
<p>修法：</p>
<ul>
<li>Chunk 數量上限 ~10000、超過 catalog overhead 出現</li>
<li>重設 chunk_interval：<code>SELECT set_chunk_time_interval('sensor_data', INTERVAL '1 day');</code></li>
<li>已存在 chunk 不會自動 merge、要靠 retention drop 自然消化</li>
</ul>
<h3 id="case-2cagg-refresh-落後-real-time">Case 2：CAGG refresh 落後 real-time</h3>
<p><strong>情境</strong>：CAGG refresh policy 每 1 小時跑、application 期待「即時 dashboard」、看到的數字落後 1 小時。</p>
<p>修法：</p>
<ul>
<li>縮短 <code>schedule_interval</code>（5 分鐘）</li>
<li>用 <code>real-time aggregation</code>（預設 ON、CAGG 自動 union 原始資料）</li>
<li>確認 <code>materialized_only = false</code>（real-time aggregation 開啟）</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">sensor_hourly</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">materialized_only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">false</span><span class="p">);</span></span></span></code></pre></div><h3 id="case-3compression-後想-update">Case 3：Compression 後想 UPDATE</h3>
<p><strong>情境</strong>：發現某個歷史 row 數值錯、想 UPDATE、報錯 <em>cannot update/delete from compressed chunk</em>。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 找到該 chunk 並 decompress
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">decompress_chunk</span><span class="p">(</span><span class="k">c</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">show_chunks</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">older_than</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;7 days&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">c</span><span class="p">::</span><span class="nb">text</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;%_5_chunk&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- UPDATE 完再 compress 回去
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">sensor_data</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">temperature</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">22</span><span class="p">.</span><span class="mi">5</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">compress_chunk</span><span class="p">(...);</span></span></span></code></pre></div><p>或設計階段就避免 — compression 用在 <em>immutable data</em>、有可能改的留未壓。</p>
<h3 id="case-4hypertable-不能加-fk-到-non-hypertable">Case 4：Hypertable 不能加 FK 到 non-hypertable</h3>
<p><strong>情境</strong>：想對 <code>sensor_data</code> 加 FK 到 <code>sensors</code> 表、報錯 <em>foreign key constraints with hypertables are not supported</em>。</p>
<p>修法：</p>
<ul>
<li>Application 層維護 referential integrity</li>
<li>或反過來：<code>sensors</code> 可以 FK 到 hypertable（特定方向支援）</li>
<li>TimescaleDB 2.11+ 部分支援 FK from hypertable、但限制多</li>
</ul>
<h3 id="case-5timescaledb-跟-pg-主版本對齊">Case 5：TimescaleDB 跟 PG 主版本對齊</h3>
<p><strong>情境</strong>：PG 升級 14 → 16、TimescaleDB extension 沒對應升級、PG 啟動 fail。</p>
<p>TimescaleDB 跟 PG 版本對齊矩陣：</p>
<table>
  <thead>
      <tr>
          <th>TimescaleDB</th>
          <th>支援 PG version</th>
          <th>備註</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2.11+</td>
          <td>13, 14, 15</td>
          <td></td>
      </tr>
      <tr>
          <td>2.13+</td>
          <td>13, 14, 15, 16</td>
          <td>加 PG 16 支援</td>
      </tr>
      <tr>
          <td>2.15.x</td>
          <td>13, 14, 15, 16</td>
          <td>最後支援 PG 13 的 minor</td>
      </tr>
      <tr>
          <td>2.16+</td>
          <td>14, 15, 16</td>
          <td>PG 13 drop</td>
      </tr>
      <tr>
          <td>2.17+</td>
          <td>14, 15, 16, 17</td>
          <td>PG 17 加入（需 17.2+ binary 對齊）</td>
      </tr>
      <tr>
          <td>2.18+</td>
          <td>14, 15, 16, 17</td>
          <td>PG 17 完整支援</td>
      </tr>
      <tr>
          <td>2.23+</td>
          <td>14, 15, 16, 17, 18</td>
          <td>PG 18 加入</td>
      </tr>
  </tbody>
</table>
<p>修法：</p>
<ul>
<li>升 PG 前先升 TimescaleDB 到支援目標 PG 版本的 extension</li>
<li>Production 升級順序：TimescaleDB minor upgrade → PG major upgrade → TimescaleDB final upgrade</li>
<li>Cloud managed（Timescale Cloud）自動處理</li>
</ul>
<h2 id="跟-pg-原生-partitioning-對比">跟 PG 原生 Partitioning 對比</h2>
<p>PG 10+ 有 declarative partitioning、不一定要 TimescaleDB：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>TimescaleDB hypertable</th>
          <th>PG declarative partitioning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>自動建 chunk</td>
          <td>是</td>
          <td>否（需手動或 pg_partman）</td>
      </tr>
      <tr>
          <td>Chunk pruning</td>
          <td>自動</td>
          <td>自動（需 partition key）</td>
      </tr>
      <tr>
          <td>Retention 內建</td>
          <td>是</td>
          <td>否（pg_partman 或自寫 cron）</td>
      </tr>
      <tr>
          <td>Compression</td>
          <td>內建 columnar</td>
          <td>否</td>
      </tr>
      <tr>
          <td>Continuous aggregate</td>
          <td>內建</td>
          <td>否（自寫 incremental refresh）</td>
      </tr>
      <tr>
          <td>跨 chunk index</td>
          <td>統一 management</td>
          <td>Per-partition index</td>
      </tr>
      <tr>
          <td>Cardinality limit</td>
          <td>10000+ chunk OK</td>
          <td>1000+ partition 就慢</td>
      </tr>
  </tbody>
</table>
<p>何時用原生 partitioning（不用 TimescaleDB）：</p>
<ul>
<li>不需要 compression / CAGG</li>
<li>Partition 數 &lt; 1000</li>
<li>已用 pg_partman 不想換</li>
<li>公司禁用 TSL license（TimescaleDB 部分功能受限）</li>
</ul>
<p>何時用 TimescaleDB：</p>
<ul>
<li>高頻 time-series（compression 必要）</li>
<li>需要 CAGG（手寫 incremental MV 成本高）</li>
<li>Partition 數 &gt; 1000</li>
<li>IoT / metrics / observability workload</li>
</ul>
<p>詳細 partitioning 機制看 <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">declarative-partitioning</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>：PG extension 全景</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">declarative-partitioning</a>：原生 partitioning</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>：IoT payload 用 JSONB 儲存</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a>：hypertable autovacuum 行為</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">major-version-upgrade</a>：TimescaleDB + PG 升級順序</li>
</ul>
<h2 id="下一步">下一步</h2>
<ul>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 了解其他 PG 擴展選項</li>
<li>回 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 看全圖</li>
</ul>
]]></content:encoded></item><item><title>pgvector Deep Dive：HNSW / IVFFlat 取捨跟跟專業 Vector DB 對比</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgvector-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgvector-deep-dive/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>pgvector extension&lt;/em> — 用 PG 解 vector search workload 的路徑、是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem&lt;/a> 內最受關注的 extension。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pgvector-是-pg-變-vector-db-的最短路徑">pgvector 是 PG 變 Vector DB 的最短路徑&lt;/h2>
&lt;p>pgvector 加兩件事：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">vector&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 加 vector column（dimension 必須事先決定）
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">content&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">vector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1536&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- OpenAI ada-002 維度
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 三種 distance operator
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;lt;-&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;[0.1, 0.2, ...]&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LIMIT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- L2
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;lt;#&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;[0.1, 0.2, ...]&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LIMIT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- inner product
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;lt;=&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;[0.1, 0.2, ...]&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LIMIT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- cosine&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Operator 對應：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>pgvector extension</em> — 用 PG 解 vector search workload 的路徑、是 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 內最受關注的 extension。</p></blockquote>
<hr>
<h2 id="pgvector-是-pg-變-vector-db-的最短路徑">pgvector 是 PG 變 Vector DB 的最短路徑</h2>
<p>pgvector 加兩件事：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">vector</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- 加 vector column（dimension 必須事先決定）
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">content</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">embedding</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span><span class="w">  </span><span class="c1">-- OpenAI ada-002 維度
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 三種 distance operator
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">  </span><span class="c1">-- L2
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;#&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">  </span><span class="c1">-- inner product
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">  </span><span class="c1">-- cosine</span></span></span></code></pre></div><p>Operator 對應：</p>
<table>
  <thead>
      <tr>
          <th>Operator</th>
          <th>意義</th>
          <th>適用</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>&lt;-&gt;</code></td>
          <td>L2 distance</td>
          <td>通用、空間距離</td>
      </tr>
      <tr>
          <td><code>&lt;#&gt;</code></td>
          <td>Negative inner product</td>
          <td>normalized vector、cosine 等價</td>
      </tr>
      <tr>
          <td><code>&lt;=&gt;</code></td>
          <td>Cosine distance</td>
          <td>embedding 比較最常用</td>
      </tr>
  </tbody>
</table>
<p>對 OpenAI / Cohere / sentence-transformers embedding、通常用 <code>&lt;=&gt;</code>（cosine）— embedding model 訓練時是 cosine objective。</p>
<h2 id="ann-index-是-vector-search-的核心">ANN Index 是 Vector Search 的核心</h2>
<p>不加 index 的 <code>ORDER BY embedding &lt;=&gt; ?</code> 是 <em>full scan</em>：</p>
<ul>
<li>100K row、1536 dim、每 query ~2-5s（不可用）</li>
<li>1M row 直接超時</li>
</ul>
<p>pgvector 提供兩種 <em>Approximate Nearest Neighbor</em>（ANN）index：</p>
<table>
  <thead>
      <tr>
          <th>Index</th>
          <th>Build 時間</th>
          <th>Query 時間</th>
          <th>Recall@10</th>
          <th>Memory cost</th>
          <th>Update 行為</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>IVFFlat</td>
          <td>快（分鐘級）</td>
          <td>中（10-100ms）</td>
          <td>90-95%</td>
          <td>中（lists 數量）</td>
          <td>Insert OK、需重建保持 recall</td>
      </tr>
      <tr>
          <td>HNSW</td>
          <td>慢（小時級）</td>
          <td>快（1-10ms）</td>
          <td>95-99%</td>
          <td>高（2-4x 資料）</td>
          <td>Insert OK、graph 漸進維護</td>
      </tr>
  </tbody>
</table>
<p><strong>選 IVFFlat 的場景</strong>：</p>
<ul>
<li>Embedding 量 &lt; 1M</li>
<li>Build 時間敏感（CI / batch 環境）</li>
<li>Memory 緊</li>
<li>接受重建 cost（每月 / 每季）</li>
</ul>
<p><strong>選 HNSW 的場景</strong>：</p>
<ul>
<li>Embedding 量 1M-100M</li>
<li>Query latency &lt; 50ms 要求</li>
<li>Memory 充足</li>
<li>Insert 量穩定（不會爆炸性增長）</li>
</ul>
<h2 id="ivfflat分-cluster-找鄰居">IVFFlat：分 Cluster 找鄰居</h2>
<p>IVFFlat 機制：</p>
<ol>
<li><strong>Build</strong>：跑 k-means 把所有 vector 分 <code>lists</code> 個 cluster</li>
<li><strong>Query</strong>：先找最近的 <code>probes</code> 個 cluster、再在這些 cluster 內找 nearest neighbor</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Build（lists 數量重要）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">ivfflat</span><span class="w"> </span><span class="p">(</span><span class="n">embedding</span><span class="w"> </span><span class="n">vector_cosine_ops</span><span class="p">)</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">lists</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Query 時調 probes 換 recall vs latency
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">ivfflat</span><span class="p">.</span><span class="n">probes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p><strong>Lists 跟 probes sizing 規則</strong>（pgvector 官方建議）：</p>
<table>
  <thead>
      <tr>
          <th>Row count</th>
          <th>lists 建議</th>
          <th>probes 建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>&lt; 1M</td>
          <td><code>rows / 1000</code></td>
          <td><code>sqrt(lists)</code></td>
      </tr>
      <tr>
          <td>&gt; 1M</td>
          <td><code>sqrt(rows)</code></td>
          <td><code>sqrt(lists)</code></td>
      </tr>
  </tbody>
</table>
<p>實務：100K row → lists=100 / probes=10、1M row → lists=1000 / probes=32。</p>
<p><strong>IVFFlat 的 recall drift</strong>：cluster 是 build 時固定的、新 insert 的 vector 進入「最近 cluster」、但隨資料分布改變、cluster center 可能不再代表性、recall 隨時間下降。</p>
<p>修法：定期 <code>REINDEX INDEX CONCURRENTLY ...</code>（每月 / 每 100K 新 row）。</p>
<h2 id="hnswmulti-level-graph-找鄰居">HNSW：Multi-level Graph 找鄰居</h2>
<p>HNSW（Hierarchical Navigable Small World）機制：</p>
<ol>
<li>多層 graph、上層稀疏、下層密集</li>
<li>Query 從上層 entry point 開始、逐層找近鄰、最後在底層精細搜尋</li>
<li>Insert 漸進維護 graph、不必重建</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Build（兩個關鍵參數）
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">hnsw</span><span class="w"> </span><span class="p">(</span><span class="n">embedding</span><span class="w"> </span><span class="n">vector_cosine_ops</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">16</span><span class="p">,</span><span class="w"> </span><span class="n">ef_construction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">64</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- Query 時調 ef_search
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">hnsw</span><span class="p">.</span><span class="n">ef_search</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p><strong>參數含義</strong>：</p>
<table>
  <thead>
      <tr>
          <th>參數</th>
          <th>含義</th>
          <th>預設</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>m</code></td>
          <td>每 node 最多鄰居數</td>
          <td>16</td>
          <td>大 → recall 高、memory 多</td>
      </tr>
      <tr>
          <td><code>ef_construction</code></td>
          <td>Build 時 graph 質量參數</td>
          <td>64</td>
          <td>大 → build 慢、graph 質量好</td>
      </tr>
      <tr>
          <td><code>ef_search</code></td>
          <td>Query 時搜尋範圍</td>
          <td>40</td>
          <td>大 → recall 高、latency 高</td>
      </tr>
  </tbody>
</table>
<p><strong>Build cost 真實量級</strong>（1M vector × 1536 dim）：</p>
<table>
  <thead>
      <tr>
          <th>配置</th>
          <th>Build 時間</th>
          <th>Memory</th>
          <th>Recall@10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>m=8, ef_construction=32</td>
          <td>30 min</td>
          <td>4GB</td>
          <td>92%</td>
      </tr>
      <tr>
          <td>m=16, ef_construction=64</td>
          <td>2 hour</td>
          <td>8GB</td>
          <td>96%</td>
      </tr>
      <tr>
          <td>m=32, ef_construction=200</td>
          <td>8 hour</td>
          <td>16GB</td>
          <td>98%</td>
      </tr>
  </tbody>
</table>
<p>Production 多數選中間 <code>m=16, ef_construction=64</code>、recall / cost 平衡。</p>
<h2 id="hybrid-searchvector--filter-一起">Hybrid Search：Vector + Filter 一起</h2>
<p>Vector search 加 SQL filter 是 pgvector 比專業 vector DB 強的場景：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Vector + metadata filter
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;tech&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>但這裡有個 <em>pgvector 的踩雷</em>：filter 跟 ANN index 互動有兩種模式：</p>
<ol>
<li><strong>Pre-filter</strong>（planner 選）：先 filter 出符合條件的 row、再對 subset 跑 vector ordering → 不用 ANN index、可能慢</li>
<li><strong>Post-filter</strong>：用 ANN index 找 top-N、再 filter、可能 N 不夠補</li>
</ol>
<p>pgvector 0.8+（2024-10 release）加入 <em>iterative index scan</em>：HNSW / IVFFlat 一邊掃 graph 一邊 filter、效能比 pre-filter 好 5-10x。0.7+（2024-07）加 halfvec / binary quantization / parallel HNSW build。</p>
<p>實務：filter selectivity 高（&lt; 10%）時、考慮對 filter column 加 index 走 pre-filter；selectivity 低（&gt; 50%）走 iterative scan。</p>
<h2 id="quantization-跟-dimension-reduction">Quantization 跟 Dimension Reduction</h2>
<p>1536 dim float32 vector 一筆 6KB、1M row 6GB、加 HNSW index 後 ~20GB。Memory 緊時的省法：</p>
<h3 id="half-precisionpgvector-07">Half-precision（pgvector 0.7+）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">embedding</span><span class="w"> </span><span class="n">halfvec</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p><code>halfvec</code> 是 float16、storage 減半、recall 損失通常 &lt; 1%。</p>
<h3 id="binary-quantization">Binary quantization</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 把每維壓成 1 bit
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">hnsw</span><span class="w"> </span><span class="p">(</span><span class="n">embedding</span><span class="w"> </span><span class="n">bit_hamming_ops</span><span class="p">);</span></span></span></code></pre></div><p>Recall 下降明顯（85-90%）、但 storage 1/32、適合「先粗篩再 rerank」hybrid pipeline。</p>
<h3 id="dimension-reduction">Dimension reduction</h3>
<p>訓練 PCA / Matryoshka model 把 1536 dim 降到 256-512 dim、recall 通常損失 &lt; 3%、storage 1/3-1/6。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="case-1dimension-超-2000-限制">Case 1：Dimension 超 2000 限制</h3>
<p><strong>情境</strong>：要用 OpenAI text-embedding-3-large（3072 dim）、<code>CREATE TABLE ... embedding vector(3072)</code> 報錯。</p>
<p>pgvector <code>vector</code> type 上限 2000 dim（IVFFlat / HNSW index 限制）。</p>
<p>修法：</p>
<ul>
<li>改用 <code>halfvec</code>（pgvector 0.7+ 支援 4000 dim）</li>
<li>用 Matryoshka 截斷到 2000 dim 以下</li>
<li>換 embedding model（OpenAI text-embedding-3-small 1536 dim / 可截斷到 256-1024）</li>
</ul>
<h3 id="case-2hnsw-build-太慢">Case 2：HNSW build 太慢</h3>
<p><strong>情境</strong>：1M row build HNSW、跑 8 小時、blocking production。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 用 CONCURRENTLY 不 block
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">CONCURRENTLY</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">hnsw</span><span class="w"> </span><span class="p">(...);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 開 maintenance_work_mem
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">maintenance_work_mem</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;8GB&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- 開 parallel
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">max_parallel_maintenance_workers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">7</span><span class="p">;</span></span></span></code></pre></div><p>仍慢的話、考慮：</p>
<ul>
<li>切分 batch insert + index（適合 read-heavy）</li>
<li>用 IVFFlat 短期上線、之後再切 HNSW</li>
<li>改用 cloud managed pgvector（提供更大 instance）</li>
</ul>
<h3 id="case-3ivfflat-不重建-recall-漂移">Case 3：IVFFlat 不重建 recall 漂移</h3>
<p><strong>情境</strong>：IVFFlat build 時資料 100K、現在 500K、新資料 recall 從 92% 降到 75%、user 抱怨「找不到相關文件」。</p>
<p>修法：</p>
<ul>
<li>Monitor recall：定期跑 ground-truth eval（brute-force 對比）</li>
<li>設定 reindex policy：每 100K 新 row 或每月 reindex</li>
<li>換 HNSW：insert 漸進維護、不需 reindex（trade-off：build 更慢）</li>
</ul>
<h3 id="case-4hybrid-search-filter-selectivity-沒設計">Case 4：Hybrid search filter selectivity 沒設計</h3>
<p><strong>情境</strong>：query <code>WHERE user_id = ? ORDER BY embedding &lt;=&gt; ?</code>、user_id 高選擇性（1/1M）、planner 選 vector index scan、掃到 top-K 全不符 user_id、補抓無止盡。</p>
<p>修法：</p>
<ul>
<li><code>EXPLAIN</code> 看 planner 選 pre-filter 還是 vector-first</li>
<li>對 <code>user_id</code> 加 B-tree index、強 planner pre-filter（hint 不容易、用 statistics）</li>
<li>pgvector 0.8+ 用 iterative scan、自動處理</li>
<li>設計 schema：高選擇性 filter（user_id）建議走 pre-filter；低選擇性（category）走 iterative</li>
</ul>
<h3 id="case-5memory-budget-沒抓">Case 5：Memory budget 沒抓</h3>
<p><strong>情境</strong>：1M vector × 1536 dim × HNSW（m=16）= ~12GB index、shared_buffers 8GB、index 不在 cache、每 query disk IO、latency 100ms+。</p>
<p>修法：</p>
<ul>
<li>算 vector + index memory：<code>row × dim × 4 bytes × (1 + index_overhead)</code></li>
<li><code>shared_buffers</code> 至少能放 hot index portion</li>
<li>不行就降 dim（halfvec）/ 升 instance / 拆 sharded</li>
</ul>
<h2 id="跟專業-vector-db-對比">跟專業 Vector DB 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>pgvector</th>
          <th>Pinecone</th>
          <th>Weaviate</th>
          <th>Milvus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Query 介面</td>
          <td>SQL</td>
          <td>REST/gRPC API</td>
          <td>GraphQL / REST</td>
          <td>gRPC</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>95-99%（HNSW）</td>
          <td>95-99%</td>
          <td>95-99%</td>
          <td>95-99%</td>
      </tr>
      <tr>
          <td>Throughput</td>
          <td>中（PG 限制）</td>
          <td>高</td>
          <td>高</td>
          <td>高</td>
      </tr>
      <tr>
          <td>Hybrid search</td>
          <td>強（完整 SQL）</td>
          <td>中（metadata filter）</td>
          <td>中</td>
          <td>中</td>
      </tr>
      <tr>
          <td>跟既有 PG 整合</td>
          <td>完美（同 DB join）</td>
          <td>需 sync</td>
          <td>需 sync</td>
          <td>需 sync</td>
      </tr>
      <tr>
          <td>Multi-tenant</td>
          <td>row-level（PG 一致）</td>
          <td>內建</td>
          <td>內建</td>
          <td>partition</td>
      </tr>
      <tr>
          <td>Open source</td>
          <td>是</td>
          <td>否</td>
          <td>是</td>
          <td>是</td>
      </tr>
      <tr>
          <td>Operational cost</td>
          <td>跟 PG 一樣（管 PG 即可）</td>
          <td>Managed-only</td>
          <td>需自管或 cloud</td>
          <td>需自管或 cloud</td>
      </tr>
      <tr>
          <td>Scale 上限</td>
          <td>10M-100M vector</td>
          <td>10B+</td>
          <td>1B+</td>
          <td>10B+</td>
      </tr>
  </tbody>
</table>
<p><strong>選 pgvector 的場景</strong>：</p>
<ul>
<li>Application 已用 PG、不想多管系統</li>
<li>Vector 量 &lt; 100M</li>
<li>需要 join vector + relational</li>
<li>Team SQL 熟、不想學 API SDK</li>
<li>Cost 敏感（managed Pinecone 1M vector 月 ~$70+）</li>
</ul>
<p><strong>選專業 vector DB 的場景</strong>：</p>
<ul>
<li>Vector 量 &gt; 5-20M（依 dim / QPS / recall 要求、pgvector 在這個級別 + 高 QPS 已開始痛、不必撐到 100M 才換）</li>
<li>純 vector workload（沒 relational integration）</li>
<li>需要 multi-tenant SaaS</li>
<li>Throughput 要求極高（&gt; 10K QPS）</li>
<li>不想自管 HNSW build / memory budget / recall drift（managed Pinecone 把這層 ops 轉嫁、cost 換 ops 時間）</li>
<li>需要 dim &gt; 2000（pgvector vector type 限制、halfvec 可到 4000、再大需 dimension reduction）</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>：其他 PG extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>：embedding 通常配 metadata JSONB</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/index-selection/" data-link-title="PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹" data-link-desc="PG 有 6 種 index method（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）跟 partial / expression / covering 三種變體、不是「都用 B-tree 就好」。每種 index 有自己的 query pattern、儲存代價、write amplification 跟 maintenance 成本。本文走 6 種 index 的適用 workload 對照、決策樹、partial / expression / covering / multi-column 變體、5 production 踩雷（過度 index / partial 條件不對 / B-tree 對 JSON 無效 / BRIN 對非 correlated 資料無效 / multi-column 順序錯）、跟 query-optimization 的 EXPLAIN 互補">index-selection</a>：B-tree / GIN / HNSW 整體比較</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>：vector query 的 EXPLAIN</li>
</ul>
<h2 id="下一步">下一步</h2>
<ul>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 探索其他 PG 擴展可能</li>
<li>回 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 看全圖</li>
</ul>
]]></content:encoded></item><item><title>PostGIS Deep Dive：Geometry / Geography 型別、GiST 空間索引跟 ST_* 函式生態</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/postgis-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/postgis-deep-dive/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>PostGIS extension&lt;/em> — PG 變 GIS DB 的標配、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem&lt;/a> 是 &lt;em>單一 extension 細節 vs ecosystem 全景&lt;/em> 的關係。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="postgis-是-pg-的-gis-specialization">PostGIS 是 PG 的 &lt;em>GIS Specialization&lt;/em>&lt;/h2>
&lt;p>PostGIS 是 PG 最成熟的 extension 之一（2001 年起、25 年歷史）、產業地位等同 OracleSpatial / SQL Server geography：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">postgis&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>加完後 PG 多兩件事：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>空間型別&lt;/strong>：&lt;code>geometry&lt;/code>（平面）/ &lt;code>geography&lt;/code>（地球曲面）/ &lt;code>raster&lt;/code>（柵格）&lt;/li>
&lt;li>&lt;strong>1000+ 函式&lt;/strong>：&lt;code>ST_Distance&lt;/code> / &lt;code>ST_Within&lt;/code> / &lt;code>ST_Buffer&lt;/code> / &lt;code>ST_Intersects&lt;/code> 等&lt;/li>
&lt;/ol>
&lt;p>用 PostGIS 解的典型 workload：&lt;/p>
&lt;ul>
&lt;li>「離我最近的 N 家店」（k-NN）&lt;/li>
&lt;li>「半徑 1km 內的所有 POI」（radius query）&lt;/li>
&lt;li>「兩個 polygon 是否重疊」（intersection）&lt;/li>
&lt;li>「polyline 總長度」（measurement）&lt;/li>
&lt;li>「行政區包含哪些 point」（containment）&lt;/li>
&lt;/ul>
&lt;h2 id="geometry-vs-geography選錯付學費">Geometry vs Geography：選錯付學費&lt;/h2>
&lt;p>PostGIS 提供兩種空間型別、用途完全不同：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>&lt;code>geometry&lt;/code>&lt;/th>
 &lt;th>&lt;code>geography&lt;/code>&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>座標系統&lt;/td>
 &lt;td>平面（笛卡兒）&lt;/td>
 &lt;td>地球曲面（spheroid）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>距離單位&lt;/td>
 &lt;td>座標系統決定（meter / degree）&lt;/td>
 &lt;td>永遠 meter&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>跨經度 180°&lt;/td>
 &lt;td>不處理&lt;/td>
 &lt;td>自動處理&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>適用範圍&lt;/td>
 &lt;td>小區域（單一城市 / 國家）&lt;/td>
 &lt;td>全球&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>函式覆蓋&lt;/td>
 &lt;td>1000+ 函式&lt;/td>
 &lt;td>約 300 函式&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>效能&lt;/td>
 &lt;td>快（平面計算）&lt;/td>
 &lt;td>慢 2-5x（球面計算）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index 行為&lt;/td>
 &lt;td>GiST 直接&lt;/td>
 &lt;td>GiST 直接&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>選 &lt;code>geography&lt;/code> 的場景&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>PostGIS extension</em> — PG 變 GIS DB 的標配、跟 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 是 <em>單一 extension 細節 vs ecosystem 全景</em> 的關係。</p></blockquote>
<hr>
<h2 id="postgis-是-pg-的-gis-specialization">PostGIS 是 PG 的 <em>GIS Specialization</em></h2>
<p>PostGIS 是 PG 最成熟的 extension 之一（2001 年起、25 年歷史）、產業地位等同 OracleSpatial / SQL Server geography：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">postgis</span><span class="p">;</span></span></span></code></pre></div><p>加完後 PG 多兩件事：</p>
<ol>
<li><strong>空間型別</strong>：<code>geometry</code>（平面）/ <code>geography</code>（地球曲面）/ <code>raster</code>（柵格）</li>
<li><strong>1000+ 函式</strong>：<code>ST_Distance</code> / <code>ST_Within</code> / <code>ST_Buffer</code> / <code>ST_Intersects</code> 等</li>
</ol>
<p>用 PostGIS 解的典型 workload：</p>
<ul>
<li>「離我最近的 N 家店」（k-NN）</li>
<li>「半徑 1km 內的所有 POI」（radius query）</li>
<li>「兩個 polygon 是否重疊」（intersection）</li>
<li>「polyline 總長度」（measurement）</li>
<li>「行政區包含哪些 point」（containment）</li>
</ul>
<h2 id="geometry-vs-geography選錯付學費">Geometry vs Geography：選錯付學費</h2>
<p>PostGIS 提供兩種空間型別、用途完全不同：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th><code>geometry</code></th>
          <th><code>geography</code></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>座標系統</td>
          <td>平面（笛卡兒）</td>
          <td>地球曲面（spheroid）</td>
      </tr>
      <tr>
          <td>距離單位</td>
          <td>座標系統決定（meter / degree）</td>
          <td>永遠 meter</td>
      </tr>
      <tr>
          <td>跨經度 180°</td>
          <td>不處理</td>
          <td>自動處理</td>
      </tr>
      <tr>
          <td>適用範圍</td>
          <td>小區域（單一城市 / 國家）</td>
          <td>全球</td>
      </tr>
      <tr>
          <td>函式覆蓋</td>
          <td>1000+ 函式</td>
          <td>約 300 函式</td>
      </tr>
      <tr>
          <td>效能</td>
          <td>快（平面計算）</td>
          <td>慢 2-5x（球面計算）</td>
      </tr>
      <tr>
          <td>Index 行為</td>
          <td>GiST 直接</td>
          <td>GiST 直接</td>
      </tr>
  </tbody>
</table>
<p><strong>選 <code>geography</code> 的場景</strong>：</p>
<ul>
<li>全球範圍 application（跨國 / 跨大陸）</li>
<li>距離精準度要求高（球面比平面誤差小）</li>
<li>不需要複雜空間運算（geography 函式較少）</li>
</ul>
<p><strong>選 <code>geometry</code> 的場景</strong>：</p>
<ul>
<li>單一城市 / 國家內 application</li>
<li>需要完整 ST_* 函式（90% 函式只支援 geometry）</li>
<li>效能敏感</li>
</ul>
<p>實務多數 production 選 <code>geometry</code> + 適合的 SRID（用 local projection）— 既快又精準。</p>
<h2 id="srid-跟-projection為什麼-4326-vs-3857-是-gis-第一課">SRID 跟 Projection：為什麼 4326 vs 3857 是 GIS 第一課</h2>
<p>SRID（Spatial Reference System Identifier）定義「座標數字怎麼解讀」：</p>
<table>
  <thead>
      <tr>
          <th>SRID</th>
          <th>名稱</th>
          <th>適用</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>4326</td>
          <td>WGS 84（GPS）</td>
          <td>經緯度、最常見、Google Maps API</td>
      </tr>
      <tr>
          <td>3857</td>
          <td>Web Mercator</td>
          <td>Web tile map（OpenStreetMap）</td>
      </tr>
      <tr>
          <td>3826</td>
          <td>TWD97 / TM2 zone 121</td>
          <td>台灣 local projection、米為單位</td>
      </tr>
      <tr>
          <td>2272</td>
          <td>NAD83 / Pennsylvania</td>
          <td>美國 state plane（各州不同）</td>
      </tr>
  </tbody>
</table>
<p><strong>為什麼選 local projection（3826）而不是經緯度（4326）</strong>：</p>
<ul>
<li>經緯度單位是 <em>度</em>、不是距離 — <code>ST_Distance</code> 直接算出來是「度」、不是「米」</li>
<li>距離計算需 <code>ST_DistanceSphere</code> 或 <code>geography</code> cast、計算 cost 高</li>
<li>Local projection 是「平面投影」、<code>ST_Distance</code> 直接是米、<code>ST_Area</code> 直接是平方米</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 4326 經緯度直接算 → 結果不是米
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_Distance</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5654</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0330</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w">  </span><span class="c1">-- 台北 101
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5170</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0478</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">)</span><span class="w">   </span><span class="c1">-- 台北車站
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">  </span><span class="c1">-- ~0.05（這是「度」）
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- 轉 3826（台灣本地投影）才是米
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_Distance</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">ST_Transform</span><span class="p">(</span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5654</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0330</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="n">ST_Transform</span><span class="p">(</span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5170</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0478</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">  </span><span class="c1">-- ~5300（米）
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 或用 geography cast
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_Distance</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5654</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0330</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">)::</span><span class="n">geography</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5170</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0478</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">)::</span><span class="n">geography</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">  </span><span class="c1">-- ~5300（米）</span></span></span></code></pre></div><p><strong>典型 schema 設計</strong>（台灣 application）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="c1">-- 儲存 4326（跟 Google Maps API 對齊）
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">location_4326</span><span class="w"> </span><span class="n">geometry</span><span class="p">(</span><span class="n">Point</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="c1">-- 預計算 3826（給距離 / 面積 query 用）
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">location_3826</span><span class="w"> </span><span class="n">geometry</span><span class="p">(</span><span class="n">Point</span><span class="p">,</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w"> </span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">        </span><span class="p">(</span><span class="n">ST_Transform</span><span class="p">(</span><span class="n">location_4326</span><span class="p">,</span><span class="w"> </span><span class="mi">3826</span><span class="p">))</span><span class="w"> </span><span class="n">STORED</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_pois_location_3826</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="n">location_3826</span><span class="p">);</span></span></span></code></pre></div><h2 id="gist-空間索引r-tree-的-pg-實作">GiST 空間索引：R-tree 的 PG 實作</h2>
<p>PostGIS 用 PG 內建 GiST 做空間索引（內部是 R-tree 變體）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_pois_geom</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="n">location_3826</span><span class="p">);</span></span></span></code></pre></div><p>GiST 對空間 query 加速的場景：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 範圍 query（box overlap）
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">location_3826</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">ST_MakeEnvelope</span><span class="p">(</span><span class="mi">290000</span><span class="p">,</span><span class="w"> </span><span class="mi">2760000</span><span class="p">,</span><span class="w"> </span><span class="mi">305000</span><span class="p">,</span><span class="w"> </span><span class="mi">2775000</span><span class="p">,</span><span class="w"> </span><span class="mi">3826</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- 半徑 query（用 ST_DWithin 才走 index）
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">ST_DWithin</span><span class="p">(</span><span class="n">location_3826</span><span class="p">,</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">300000</span><span class="p">,</span><span class="w"> </span><span class="mi">2770000</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">),</span><span class="w"> </span><span class="mi">1000</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- k-NN（PostGIS 2.0+ &lt;-&gt; operator）
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">location_3826</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">300000</span><span class="p">,</span><span class="w"> </span><span class="mi">2770000</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dist</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">location_3826</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">300000</span><span class="p">,</span><span class="w"> </span><span class="mi">2770000</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p><strong>index 用沒用到的關鍵</strong>：</p>
<table>
  <thead>
      <tr>
          <th>Query 寫法</th>
          <th>走 index？</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>ST_DWithin(a, b, dist)</code></td>
          <td>是</td>
      </tr>
      <tr>
          <td><code>ST_Distance(a, b) &lt; dist</code></td>
          <td>否（必 full scan）</td>
      </tr>
      <tr>
          <td><code>a &amp;&amp; bbox</code></td>
          <td>是</td>
      </tr>
      <tr>
          <td><code>ST_Intersects(a, bbox)</code></td>
          <td>是</td>
      </tr>
      <tr>
          <td><code>a &lt;-&gt; b ORDER BY ... LIMIT n</code></td>
          <td>是（k-NN）</td>
      </tr>
      <tr>
          <td><code>ST_Equals(a, b)</code></td>
          <td>否</td>
      </tr>
  </tbody>
</table>
<p>Production 寫法守則：能用 <code>ST_DWithin</code> 就不用 <code>ST_Distance(...) &lt; ?</code>、語意一樣但 index 行為差很多。</p>
<h2 id="st_-函式生態產業級全套">ST_* 函式生態：產業級全套</h2>
<p>PostGIS 1000+ 函式分類（典型用到的）：</p>
<table>
  <thead>
      <tr>
          <th>類別</th>
          <th>代表函式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>建構</td>
          <td><code>ST_MakePoint</code> / <code>ST_MakeLine</code> / <code>ST_MakePolygon</code></td>
      </tr>
      <tr>
          <td>關係判定</td>
          <td><code>ST_Intersects</code> / <code>ST_Within</code> / <code>ST_Contains</code> / <code>ST_Touches</code></td>
      </tr>
      <tr>
          <td>距離 / 大小</td>
          <td><code>ST_Distance</code> / <code>ST_DWithin</code> / <code>ST_Length</code> / <code>ST_Area</code></td>
      </tr>
      <tr>
          <td>變換</td>
          <td><code>ST_Buffer</code> / <code>ST_Union</code> / <code>ST_Difference</code> / <code>ST_Intersection</code></td>
      </tr>
      <tr>
          <td>投影</td>
          <td><code>ST_Transform</code> / <code>ST_SetSRID</code></td>
      </tr>
      <tr>
          <td>格式轉換</td>
          <td><code>ST_AsGeoJSON</code> / <code>ST_AsKML</code> / <code>ST_AsText</code> / <code>ST_GeomFromGeoJSON</code></td>
      </tr>
      <tr>
          <td>路徑 / 拓樸</td>
          <td><code>ST_ShortestLine</code> / <code>ST_LineMerge</code></td>
      </tr>
      <tr>
          <td>聚合</td>
          <td><code>ST_Collect</code> / <code>ST_ConvexHull</code> / <code>ST_Centroid</code></td>
      </tr>
      <tr>
          <td>簡化</td>
          <td><code>ST_Simplify</code> / <code>ST_SimplifyPreserveTopology</code></td>
      </tr>
  </tbody>
</table>
<p><strong>Web tile 場景</strong>典型 query：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 給定 z/x/y tile、找這個 tile 內的所有 POI
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">ST_AsMVTGeom</span><span class="p">(</span><span class="n">location_3857</span><span class="p">,</span><span class="w"> </span><span class="n">ST_TileEnvelope</span><span class="p">(</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">geom</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">location_3857</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">ST_TileEnvelope</span><span class="p">(</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">);</span></span></span></code></pre></div><p><code>ST_AsMVTGeom</code> + <code>ST_AsMVT</code> 直接產 Mapbox Vector Tile binary、給前端 Leaflet / Mapbox GL JS 用。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="case-1geometry-用錯-srid">Case 1：Geometry 用錯 SRID</h3>
<p><strong>情境</strong>：app 寫入時用 4326、query 時用 3826 ST_Transform、忘記給某個 column 設 SRID、index 失效。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 確認 SRID
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_SRID</span><span class="p">(</span><span class="k">location</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- 強 type 約束（column type 寫死 SRID）
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">ALTER</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="k">location</span><span class="w"> </span><span class="k">TYPE</span><span class="w"> </span><span class="n">geometry</span><span class="p">(</span><span class="n">Point</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="k">USING</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="k">location</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- Check constraint 防錯
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">chk_location_srid</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">ST_SRID</span><span class="p">(</span><span class="k">location</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">4326</span><span class="p">);</span></span></span></code></pre></div><h3 id="case-2geography-不能用所有-st_-函式">Case 2：Geography 不能用所有 ST_* 函式</h3>
<p><strong>情境</strong>：用 <code>geography</code> 想跑 <code>ST_Buffer</code>、報錯或結果不對。</p>
<p><code>ST_Buffer</code> 對 geography 走 spheroid 近似、邊界 case 結果跟 geometry 不一致；很多函式（<code>ST_Voronoi</code> / <code>ST_Delaunay</code> 等）只支援 geometry。</p>
<p>修法：</p>
<ul>
<li>簡單距離 query 用 geography</li>
<li>複雜空間運算用 geometry + 適合 projection</li>
<li>不確定哪些函式支援 geography、看 PostGIS docs <em>Geography Support Functions</em> 清單</li>
</ul>
<h3 id="case-3gist-index-不對-st_distance-生效">Case 3：GiST index 不對 ST_Distance 生效</h3>
<p><strong>情境</strong>：query <code>ST_Distance(location, ?) &lt; 1000</code>、<code>EXPLAIN</code> 顯示 full scan、加 index 也沒用。</p>
<p><code>ST_Distance</code> 算完才 filter、planner 沒辦法用 GiST。</p>
<p>修法：</p>
<ul>
<li>改 <code>ST_DWithin(location, ?, 1000)</code> — 語意一樣、會走 GiST</li>
<li>確認 index 是對 <em>被 query 的 column</em> 建的（不是 transform 後的 expression）</li>
</ul>
<h3 id="case-4cluster-on-geom-後-brin-失效">Case 4：CLUSTER on geom 後 BRIN 失效</h3>
<p><strong>情境</strong>：對 <code>pois</code> 跑 <code>CLUSTER pois USING idx_pois_geom</code> 想加速空間查、但同時對 <code>created_at</code> 用 BRIN index、BRIN 完全失效。</p>
<p>CLUSTER 重組 physical order 跟 GiST 對齊、<code>created_at</code> physical order correlation 從 1.0 變 0.0、BRIN range 沒選擇性。</p>
<p>修法：</p>
<ul>
<li>不要 CLUSTER 大表（一次性、影響其他 column）</li>
<li>換 partition by time + GiST per-partition（取兩者）</li>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/index-selection/" data-link-title="PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹" data-link-desc="PG 有 6 種 index method（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）跟 partial / expression / covering 三種變體、不是「都用 B-tree 就好」。每種 index 有自己的 query pattern、儲存代價、write amplification 跟 maintenance 成本。本文走 6 種 index 的適用 workload 對照、決策樹、partial / expression / covering / multi-column 變體、5 production 踩雷（過度 index / partial 條件不對 / B-tree 對 JSON 無效 / BRIN 對非 correlated 資料無效 / multi-column 順序錯）、跟 query-optimization 的 EXPLAIN 互補">index-selection</a> 的 BRIN 段</li>
</ul>
<h3 id="case-5ewkb-vs-wkb-跨工具相容">Case 5：EWKB vs WKB 跨工具相容</h3>
<p><strong>情境</strong>：用 PostGIS export 給其他 GIS 工具（QGIS / Shapely / ogr2ogr）、resort 抱怨格式不對。</p>
<p>PostGIS 內部用 EWKB（Extended Well-Known Binary）— 多帶 SRID。多數 GIS 工具讀 WKB（標準）。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Export 標準 WKB
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_AsBinary</span><span class="p">(</span><span class="n">geom</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 或 GeoJSON（跨工具最相容）
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_AsGeoJSON</span><span class="p">(</span><span class="n">geom</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- 或 Shapefile via ogr2ogr
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1">-- ogr2ogr -f &#34;ESRI Shapefile&#34; output.shp PG:&#34;...&#34; -sql &#34;SELECT * FROM pois&#34;</span></span></span></code></pre></div><h2 id="跟專業-gis-db-對比">跟專業 GIS DB 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PostGIS</th>
          <th>Oracle Spatial</th>
          <th>SQL Server geography</th>
          <th>MongoDB GeoJSON</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>函式覆蓋</td>
          <td>1000+</td>
          <td>800+</td>
          <td>200+</td>
          <td>~20</td>
      </tr>
      <tr>
          <td>Raster 支援</td>
          <td>是</td>
          <td>是</td>
          <td>否</td>
          <td>否</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>是（PostGIS Topology）</td>
          <td>是</td>
          <td>否</td>
          <td>否</td>
      </tr>
      <tr>
          <td>3D 支援</td>
          <td>是（PostGIS SFCGAL）</td>
          <td>是</td>
          <td>部分</td>
          <td>否</td>
      </tr>
      <tr>
          <td>License</td>
          <td>GPL</td>
          <td>商業</td>
          <td>商業</td>
          <td>開源</td>
      </tr>
      <tr>
          <td>Tile generation</td>
          <td>內建（ST_AsMVT）</td>
          <td>否</td>
          <td>否</td>
          <td>否</td>
      </tr>
      <tr>
          <td>跟 PG 整合</td>
          <td>完美</td>
          <td>跟 Oracle 一體</td>
          <td>跟 SQL Server 一體</td>
          <td>獨立</td>
      </tr>
      <tr>
          <td>工業界使用</td>
          <td>OpenStreetMap / 各國國土測繪</td>
          <td>大型企業</td>
          <td>Microsoft 生態</td>
          <td>簡單 location app</td>
      </tr>
  </tbody>
</table>
<p><strong>選 PostGIS 的場景</strong>（90% GIS workload）：</p>
<ul>
<li>Application 已用 PG</li>
<li>需要完整 GIS 函式生態（路網 / 等高線 / 流域分析）</li>
<li>開源 / cost 敏感</li>
<li>跟 OGR / GDAL / QGIS 互通</li>
</ul>
<p><strong>選專業 GIS DB 的場景</strong>：</p>
<ul>
<li>已綁定 Oracle / SQL Server license</li>
<li>極專業 GIS（3D 城市模型 / LIDAR / GPU 加速）</li>
<li>純 location app 不需 relational（MongoDB GeoJSON 足夠）</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>：其他 PG extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/index-selection/" data-link-title="PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹" data-link-desc="PG 有 6 種 index method（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）跟 partial / expression / covering 三種變體、不是「都用 B-tree 就好」。每種 index 有自己的 query pattern、儲存代價、write amplification 跟 maintenance 成本。本文走 6 種 index 的適用 workload 對照、決策樹、partial / expression / covering / multi-column 變體、5 production 踩雷（過度 index / partial 條件不對 / B-tree 對 JSON 無效 / BRIN 對非 correlated 資料無效 / multi-column 順序錯）、跟 query-optimization 的 EXPLAIN 互補">index-selection</a>：GiST 跟其他 index 對比</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>：空間 query 的 EXPLAIN</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>：POI metadata 用 JSONB 儲存</li>
</ul>
<h2 id="下一步">下一步</h2>
<ul>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 探索其他 PG 擴展可能</li>
<li>回 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 看全圖</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PostgreSQL MVCC 的 vacuum 必要性、本文聚焦 &lt;em>autovacuum 在 production write-heavy workload 為什麼追不上&lt;/em> 的根因 + 各維度 tuning。&lt;/p>&lt;/blockquote>
&lt;h2 id="你的-autovacuum-永遠追不上-bloat--為什麼">你的 autovacuum 永遠追不上 bloat — 為什麼&lt;/h2>
&lt;p>write-heavy table 的常見故事：上線時表 10GB、3 個月後 30GB、6 個月 80GB；DBA 看 &lt;code>pg_stat_user_tables&lt;/code> 發現 &lt;code>n_dead_tup&lt;/code> 比 &lt;code>n_live_tup&lt;/code> 還多、&lt;code>pg_stat_progress_vacuum&lt;/code> 顯示 autovacuum 一直在跑、但 dead tuple 從沒清乾淨。表本身才 5M row、實際磁碟卻佔 80GB。&lt;/p>
&lt;p>這不是 PostgreSQL bug、是 autovacuum &lt;em>cost-based throttling 預設保守&lt;/em> 的設計意圖 — autovacuum 不該影響 OLTP query 性能、所以每跑一段就 sleep。預設 &lt;code>autovacuum_vacuum_cost_limit=200&lt;/code> + &lt;code>autovacuum_vacuum_cost_delay=2ms&lt;/code> 在 write-heavy 表（每秒幾千 UPDATE）下、清理速度 &lt;em>永遠慢於&lt;/em> dead tuple 產生速度。預設配置適合 read-heavy / write-light workload；OLTP write-heavy 必須調。&lt;/p>
&lt;h2 id="mvcc-跟-dead-tuplevacuum-在解什麼">MVCC 跟 dead tuple：vacuum 在解什麼&lt;/h2>
&lt;p>PostgreSQL MVCC：每次 UPDATE 都是 &lt;em>insert new row + mark old row as deleted&lt;/em>；DELETE 是 &lt;em>mark as deleted、不立刻釋放空間&lt;/em>。dead tuple 在 disk 上佔位、但不能被 query 讀到。autovacuum 的責任：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>回收 dead tuple 空間&lt;/strong> 供新 row reuse（不縮 table 大小、是 free space map）&lt;/li>
&lt;li>&lt;strong>更新 visibility map&lt;/strong> 讓 index-only scan 跳過 heap fetch&lt;/li>
&lt;li>&lt;strong>凍結老 row 的 xid&lt;/strong>（freeze）避免 xid wraparound 災難&lt;/li>
&lt;li>&lt;strong>重整 index B-tree&lt;/strong> 標記 dead pointer（不刪 index page）&lt;/li>
&lt;/ol>
&lt;p>Vacuum 不縮表 — 真要縮要跑 &lt;code>VACUUM FULL&lt;/code>（全表 exclusive lock、production 不能跑）或 &lt;code>pg_repack&lt;/code>（online repack tool）。預期 vacuum 只能 &lt;em>讓表停止長大&lt;/em>、不能 &lt;em>讓表變小&lt;/em>。&lt;/p>
&lt;h2 id="tuningcost-based-throttle-跟-trigger-threshold">Tuning：cost-based throttle 跟 trigger threshold&lt;/h2>
&lt;h3 id="cost-based-throttle全-instance">Cost-based throttle（全 instance）&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-ini" data-lang="ini">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># postgresql.conf&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="na">autovacuum_vacuum_cost_limit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">2000 # 預設 200、production 拉 5-10 倍&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="na">autovacuum_vacuum_cost_delay&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">2ms # 預設 2ms、不太需要動&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="na">autovacuum_max_workers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">6 # 預設 3、CPU 多時拉到 6-10&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="na">maintenance_work_mem&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">1GB # 預設 64MB、單一 vacuum 用的記憶體&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>直覺：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PostgreSQL MVCC 的 vacuum 必要性、本文聚焦 <em>autovacuum 在 production write-heavy workload 為什麼追不上</em> 的根因 + 各維度 tuning。</p></blockquote>
<h2 id="你的-autovacuum-永遠追不上-bloat--為什麼">你的 autovacuum 永遠追不上 bloat — 為什麼</h2>
<p>write-heavy table 的常見故事：上線時表 10GB、3 個月後 30GB、6 個月 80GB；DBA 看 <code>pg_stat_user_tables</code> 發現 <code>n_dead_tup</code> 比 <code>n_live_tup</code> 還多、<code>pg_stat_progress_vacuum</code> 顯示 autovacuum 一直在跑、但 dead tuple 從沒清乾淨。表本身才 5M row、實際磁碟卻佔 80GB。</p>
<p>這不是 PostgreSQL bug、是 autovacuum <em>cost-based throttling 預設保守</em> 的設計意圖 — autovacuum 不該影響 OLTP query 性能、所以每跑一段就 sleep。預設 <code>autovacuum_vacuum_cost_limit=200</code> + <code>autovacuum_vacuum_cost_delay=2ms</code> 在 write-heavy 表（每秒幾千 UPDATE）下、清理速度 <em>永遠慢於</em> dead tuple 產生速度。預設配置適合 read-heavy / write-light workload；OLTP write-heavy 必須調。</p>
<h2 id="mvcc-跟-dead-tuplevacuum-在解什麼">MVCC 跟 dead tuple：vacuum 在解什麼</h2>
<p>PostgreSQL MVCC：每次 UPDATE 都是 <em>insert new row + mark old row as deleted</em>；DELETE 是 <em>mark as deleted、不立刻釋放空間</em>。dead tuple 在 disk 上佔位、但不能被 query 讀到。autovacuum 的責任：</p>
<ol>
<li><strong>回收 dead tuple 空間</strong> 供新 row reuse（不縮 table 大小、是 free space map）</li>
<li><strong>更新 visibility map</strong> 讓 index-only scan 跳過 heap fetch</li>
<li><strong>凍結老 row 的 xid</strong>（freeze）避免 xid wraparound 災難</li>
<li><strong>重整 index B-tree</strong> 標記 dead pointer（不刪 index page）</li>
</ol>
<p>Vacuum 不縮表 — 真要縮要跑 <code>VACUUM FULL</code>（全表 exclusive lock、production 不能跑）或 <code>pg_repack</code>（online repack tool）。預期 vacuum 只能 <em>讓表停止長大</em>、不能 <em>讓表變小</em>。</p>
<h2 id="tuningcost-based-throttle-跟-trigger-threshold">Tuning：cost-based throttle 跟 trigger threshold</h2>
<h3 id="cost-based-throttle全-instance">Cost-based throttle（全 instance）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">autovacuum_vacuum_cost_limit</span> <span class="o">=</span> <span class="s">2000          # 預設 200、production 拉 5-10 倍</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">autovacuum_vacuum_cost_delay</span> <span class="o">=</span> <span class="s">2ms            # 預設 2ms、不太需要動</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">autovacuum_max_workers</span> <span class="o">=</span> <span class="s">6                    # 預設 3、CPU 多時拉到 6-10</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">maintenance_work_mem</span> <span class="o">=</span> <span class="s">1GB                    # 預設 64MB、單一 vacuum 用的記憶體</span></span></span></code></pre></div><p>直覺：</p>
<ul>
<li><code>cost_limit</code> 是每個 cycle 能消費多少「cost」、cost 由 page read / dirty / hit 加總；拉高 = 每次 cycle 處理更多 page</li>
<li>拉 <code>cost_limit</code> 比 <code>cost_delay</code> 直接 — delay 太低（&lt; 1ms）OS scheduler 抖動就無效</li>
<li><code>max_workers</code> 限同時跑的 vacuum；partition 多時容易爆滿、要拉</li>
<li><code>maintenance_work_mem</code> 影響 index vacuum 速度、SSD 環境 1-2GB 是 sweet spot</li>
</ul>
<h3 id="per-table-override精準到-hot-table">Per-table override（精準到 hot table）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 對 hot write-heavy 表加強
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">autovacuum_vacuum_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">05</span><span class="p">,</span><span class="w">      </span><span class="c1">-- 預設 0.2、5% dead 就觸發
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_vacuum_threshold</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w">          </span><span class="c1">-- 預設 50、絕對值底線
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_vacuum_cost_limit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">5000</span><span class="p">,</span><span class="w">         </span><span class="c1">-- 該表獨立 cost_limit
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_analyze_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">05</span><span class="p">,</span><span class="w">      </span><span class="c1">-- analyze 也跟著
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_freeze_max_age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100000000</span><span class="w">        </span><span class="c1">-- anti-wraparound 提前
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 對 append-only 表（log table）降頻
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">audit_log</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">  </span><span class="n">autovacuum_vacuum_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span><span class="w">        </span><span class="c1">-- 50% dead 才觸發（極少 UPDATE / DELETE）
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_freeze_max_age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000000</span><span class="w">       </span><span class="c1">-- freeze 延後
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="p">);</span></span></span></code></pre></div><p>關鍵：<em>hot table 比 default 緊、cold table 比 default 鬆</em>、不要把所有表用同套配置。Production cluster 通常 5-20 個 hot table 需要 per-table tuning。</p>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1write-heavy-hot-tableautovacuum-永遠跑不完">Case 1：write-heavy hot table，autovacuum 永遠跑不完</h3>
<p><strong>徵兆</strong>：<code>pg_stat_user_tables.n_dead_tup</code> 持續高於 <code>n_live_tup</code>、<code>pg_stat_progress_vacuum</code> 顯示某表 vacuum 跑了 6+ 小時還在 <code>scanning heap</code>、表 size 持續長大。</p>
<p><strong>根因</strong>：default <code>cost_limit=200</code> 對該表 write rate（~5000 UPDATE/s）下、vacuum 處理速度 &lt; dead tuple 產生速度；單次 autovacuum 跑完整表要 12 小時、但表 5% bloat 觸發又啟動下一輪。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>對該表 <code>ALTER TABLE ... SET (autovacuum_vacuum_cost_limit = 10000)</code> — 該表 vacuum 不受全 instance 限制</li>
<li><code>maintenance_work_mem</code> 拉到 2GB（單 vacuum）</li>
<li>短期：手動 <code>VACUUM (VERBOSE, ANALYZE) events;</code> 在 maintenance window 跑、catch up</li>
<li>長期：考慮 partitioning — partition 後 vacuum 只動最近 partition、不掃整表</li>
</ol>
<h3 id="case-2長-transaction-卡住-vacuum-的-xmin-horizon">Case 2：長 transaction 卡住 vacuum 的 xmin horizon</h3>
<p><strong>徵兆</strong>：autovacuum 看似有跑、但 <code>n_dead_tup</code> 不降；<code>pg_stat_activity</code> 看到一個跑了 8 小時的 SELECT（report query 或 idle in transaction）。</p>
<p><strong>根因</strong>：vacuum 只能回收「不會被任何 active transaction 看到」的 dead tuple；長 transaction 的 xmin 鎖死 vacuum 能回收的範圍、即使 autovacuum 不停跑、能回收的 row 數為 0。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>預防</strong>：application 端用 <code>statement_timeout</code> + <code>idle_in_transaction_session_timeout</code>（30 分鐘）強制終止 long transaction</li>
<li><strong>偵測</strong>：<code>SELECT pid, now() - xact_start FROM pg_stat_activity WHERE state = 'idle in transaction'</code> 定期掃</li>
<li><strong>臨時</strong>：kill 長 transaction（<code>pg_cancel_backend(pid)</code> / <code>pg_terminate_backend(pid)</code>）、autovacuum 下次跑就能回收</li>
<li><strong>架構</strong>：報表 query 跑在 standby、不要在 primary 開 long transaction</li>
</ol>
<h3 id="case-3anti-wraparound-vacuum-在-peak-觸發">Case 3：Anti-wraparound vacuum 在 peak 觸發</h3>
<p><strong>徵兆</strong>：production 流量高峰時 PostgreSQL CPU 100%、<code>pg_stat_progress_vacuum</code> 顯示 anti-wraparound vacuum 正在跑、application latency 暴漲；log 出現 <code>database &quot;myapp&quot; must be vacuumed within X transactions</code>。</p>
<p><strong>根因</strong>：autovacuum_freeze_max_age（預設 200M）到了、PostgreSQL <em>強制</em> 跑 anti-wraparound vacuum（即使在 peak）；這個 vacuum <em>不受 cost_limit 限制</em>、跑到完才停、表大時要幾小時、跟 OLTP query 搶 IO。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>預防</strong>：<code>autovacuum_freeze_max_age</code> 拉到 1B（10 億）、給 freeze 更多時間在 off-peak 自然發生</li>
<li><strong>per-table freeze</strong>：hot table 設 <code>autovacuum_freeze_max_age = 100M</code>（提前在 off-peak freeze）、cold table 設 800M（避免不必要 freeze）</li>
<li><strong>緊急</strong>：手動跑 <code>VACUUM (FREEZE, VERBOSE) table_name;</code> 在 maintenance window 預先 freeze</li>
<li><strong>監測</strong>：<code>SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r' ORDER BY age(relfrozenxid) DESC LIMIT 20;</code> 看哪些表逼近 wraparound</li>
</ol>
<h3 id="case-4partition-table-把-autovacuum_max_workers-跑滿">Case 4：Partition table 把 autovacuum_max_workers 跑滿</h3>
<p><strong>徵兆</strong>：partition 後（時間 partition、12 個月分區）、autovacuum 跑很慢、<code>pg_stat_activity</code> 看到 3 個 autovacuum worker 都在跑 partition 表、其他 hot table queue 等很久。</p>
<p><strong>根因</strong>：<code>autovacuum_max_workers=3</code> 預設、每個 partition 算獨立 table；100 個 partition 中 50 個都需要 vacuum、worker 滿、其他 table 排隊。</p>
<p><strong>修法</strong>：</p>
<ol>
<li>拉 <code>autovacuum_max_workers</code> 到 6-10（依 CPU core 數）</li>
<li>cold partition 設 <code>autovacuum_enabled = false</code>（已不寫的舊 partition）、減少 worker 競爭</li>
<li>partition 數量本身要克制 — 100+ partition 是訊號該重新評估 partition strategy</li>
</ol>
<h3 id="case-5index-bloat-沒被-vacuum-處理">Case 5：Index bloat 沒被 vacuum 處理</h3>
<p><strong>徵兆</strong>：表 vacuum 跑完了、<code>n_dead_tup</code> 為 0、但 index size 持續長大；query 用該 index 越來越慢、跟 sequential scan 差不多。</p>
<p><strong>根因</strong>：autovacuum 只處理 <em>heap</em>（table data）跟 <em>index leaf pages</em>；index B-tree 內部結構 fragmentation 不被 vacuum 處理。dead pointer 留在 index leaf page、查詢仍 traverse 過、IO 多。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><code>REINDEX CONCURRENTLY</code> 線上重建 index（PG 12+）、不鎖表</li>
<li>監測 index bloat：<code>pgstattuple_approx</code> extension 或 <code>pg_repack</code></li>
<li>預防：B-tree index 設計避免 high cardinality + 大量 UPDATE 同欄位（typical 場景：status column update）；考慮 <em>partial index</em> 或 <em>hash index</em>（PG 10+ logged）</li>
<li>大量 bloat index 用 <code>pg_repack</code> 重建（不需要 superuser、不鎖表）</li>
</ol>
<h2 id="容量規劃">容量規劃</h2>
<p>vacuum capacity 用 <em>跟得上 dead tuple 產生速度</em> 衡量：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>估算方式</th>
          <th>警戒</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>dead tuple 產生 rate</td>
          <td><code>UPDATE/s + DELETE/s + ~10% INSERT/s（HOT update miss）</code></td>
          <td>跟 vacuum rate 對比</td>
      </tr>
      <tr>
          <td>vacuum 處理 rate</td>
          <td><code>cost_limit / cost_delay × page_size</code>、~MB/s 數量級</td>
          <td>跟 dead tuple rate 對比</td>
      </tr>
      <tr>
          <td>autovacuum_max_workers</td>
          <td>partition 數 + hot table 數 / 3-5</td>
          <td>100+ partition 必須拉 worker</td>
      </tr>
      <tr>
          <td>maintenance_work_mem</td>
          <td>1-2GB / vacuum worker</td>
          <td>全 worker 跑時的記憶體上限要 sizing</td>
      </tr>
      <tr>
          <td>anti-wraparound 觸發頻率</td>
          <td>預設 200M xid、write-heavy ~ 1-2 週觸發一次</td>
          <td>拉到 1B 後 ~ 2-3 月一次</td>
      </tr>
      <tr>
          <td>Bloat ratio</td>
          <td><code>pg_stat_user_tables.n_dead_tup / n_live_tup</code></td>
          <td>&gt; 50% 表示 vacuum 追不上</td>
      </tr>
  </tbody>
</table>
<p>實務 default：</p>
<ul>
<li>OLTP write-heavy（事件 / 訂單）：cost_limit 2000-5000、scale_factor 0.05、freeze_max_age 100M</li>
<li>OLTP read-heavy（user / config）：default 即可</li>
<li>Append-only log：scale_factor 0.5、freeze_max_age 800M、<code>autovacuum_enabled = false</code> for cold partition</li>
</ul>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-partitioning-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">partitioning</a> 整合</h3>
<p>partitioning 是 vacuum 問題的長期解：</p>
<ul>
<li>大表（&gt; 100GB）vacuum 時間隨 size 線性、partition 後 vacuum 只動最近 partition</li>
<li>Cold partition <code>autovacuum_enabled = false</code> 完全停掉、新數據只在 hot partition</li>
<li>缺點：partition 數量爆炸時、autovacuum_max_workers 也要拉</li>
</ul>
<h3 id="跟-monitoring-整合">跟 monitoring 整合</h3>
<p>關鍵 metric：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- bloat 比例
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">relname</span><span class="p">,</span><span class="w"> </span><span class="n">n_dead_tup</span><span class="p">,</span><span class="w"> </span><span class="n">n_live_tup</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">       </span><span class="n">round</span><span class="p">(</span><span class="n">n_dead_tup</span><span class="p">::</span><span class="nb">numeric</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="k">nullif</span><span class="p">(</span><span class="n">n_live_tup</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dead_pct</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_user_tables</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">n_live_tup</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">n_dead_tup</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">20</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- vacuum 進度
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_progress_vacuum</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- xid wraparound 距離
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">datname</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">(</span><span class="n">datfrozenxid</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_database</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span></span></span></code></pre></div><p>Prometheus alert 三條：<code>dead_pct &gt; 30</code>、<code>vacuum_running_seconds &gt; 3600</code>、<code>xid_age &gt; 500000000</code>。</p>
<h3 id="跟-backup-window">跟 backup window</h3>
<p>VACUUM FREEZE 在 backup 前跑能減少 backup size（freeze tuple 不需要 special handling）：</p>
<ol>
<li>每週 maintenance window 跑 <code>VACUUM (FREEZE, ANALYZE) hot_table</code> — 預先 freeze + 更新 stats</li>
<li>backup 前避免長 transaction、確保 vacuum 能跑</li>
</ol>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>HOT update 跟 fillfactor</strong>：UPDATE 同頁可重用空間、fillfactor 80 為 hot table 留 20% buffer</li>
<li><strong><code>pg_repack</code> vs <code>VACUUM FULL</code></strong>：online vs offline、長期維護工具選擇</li>
<li><strong>PostgreSQL 14+ parallel vacuum</strong>：index vacuum 平行化、大表受益明顯</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>上游 chapter：<a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">High Concurrency Access</a> — vacuum 是 concurrency 治理一環</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a> / <a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>（為什麼會有 dead tuple、跟 lock 互動）</li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>Migration Playbook：Cloud SQL for PostgreSQL → Cloud Spanner</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/spanner/migrate-from-cloud-sql-pg/</link><pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/spanner/migrate-from-cloud-sql-pg/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Cloud Spanner&lt;/a> overview 的 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook。走 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendor-article-spec/" data-link-title="資料庫 Vendor 文章撰寫規格" data-link-desc="把 PostgreSQL 與 MySQL batch 的正文經驗整理成資料庫 vendor overview、deep article 與 migration playbook 的撰寫規格">vendor-article-spec&lt;/a> Migration Playbook 規格 + &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology&lt;/a> Type E（paradigm shift）。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關 — Evidence 段列的證據是 gate 通過條件、不是 nice-to-have。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="driver為什麼遷什麼條件不該遷">Driver：為什麼遷、什麼條件不該遷&lt;/h2>
&lt;h3 id="啟動壓力">啟動壓力&lt;/h3>
&lt;p>single-region Cloud SQL PostgreSQL primary 觸到容量上限（connection、write throughput、storage IOPS、region 故障風險）、產品要求跨 region active-active write、external consistency 是契約而非 nice-to-have。讀者要先確認自己面對的是「real 跨 region write residency」、不是「想用更強的技術」 — driver 段的核心責任是排除空泛動機。&lt;/p>
&lt;h3 id="主要-driver-候選">主要 driver 候選&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Global write residency&lt;/strong>：用戶分散全球、各地寫入本地 region、跨 region 一致性是產品要求&lt;/li>
&lt;li>&lt;strong>External consistency 對帳契約&lt;/strong>：跨 region 交易順序錯誤會導致對帳爆炸（金融、計費、ticketing）&lt;/li>
&lt;li>&lt;strong>單 primary 容量天花板&lt;/strong>：Cloud SQL 最大 instance 仍撐不住、應用層 sharding 是大工程&lt;/li>
&lt;li>&lt;strong>跨 region read latency&lt;/strong>：read 從各地直接打本地 replica、Cloud SQL read replica 受 single-primary 寫入 throughput 限制&lt;/li>
&lt;/ul>
&lt;h3 id="no-go-condition基礎">No-go condition（基礎）&lt;/h3>
&lt;p>流量集中單 region、跨 region 只是 DR 需求 → 維持 Cloud SQL + read replica + cross-region async DR 更便宜。這條 no-go 不複雜、但團隊常被 marketing 推著跳過 — 在自家 traffic dashboard 上 audit 一遍「write 來自哪些 region、各占比多少」、若 90%+ 來自單 region、Spanner 沒有 benefit。&lt;/p>
&lt;h3 id="no-go-conditionsizing-barrier">No-go condition（sizing barrier）&lt;/h3>
&lt;p>小 / 中型 PostgreSQL workload 的成本門檻 — Spanner 早期最小單位 100 processing units（≈ 1 node）對中小負載偏貴、過去是 sizing barrier；2021+ 推出 100 pu 起跳的 granular sizing 後雖然可從小開始、但 100 pu × per-pu monthly cost 加上跨 region replication 仍可能比 Cloud SQL HA 設定貴數倍。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Cloud Spanner</a> overview 的 <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook。走 <a href="/blog/backend/01-database/vendor-article-spec/" data-link-title="資料庫 Vendor 文章撰寫規格" data-link-desc="把 PostgreSQL 與 MySQL batch 的正文經驗整理成資料庫 vendor overview、deep article 與 migration playbook 的撰寫規格">vendor-article-spec</a> Migration Playbook 規格 + <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology</a> Type E（paradigm shift）。每階段切換用 <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a> 把關 — Evidence 段列的證據是 gate 通過條件、不是 nice-to-have。</p></blockquote>
<hr>
<h2 id="driver為什麼遷什麼條件不該遷">Driver：為什麼遷、什麼條件不該遷</h2>
<h3 id="啟動壓力">啟動壓力</h3>
<p>single-region Cloud SQL PostgreSQL primary 觸到容量上限（connection、write throughput、storage IOPS、region 故障風險）、產品要求跨 region active-active write、external consistency 是契約而非 nice-to-have。讀者要先確認自己面對的是「real 跨 region write residency」、不是「想用更強的技術」 — driver 段的核心責任是排除空泛動機。</p>
<h3 id="主要-driver-候選">主要 driver 候選</h3>
<ul>
<li><strong>Global write residency</strong>：用戶分散全球、各地寫入本地 region、跨 region 一致性是產品要求</li>
<li><strong>External consistency 對帳契約</strong>：跨 region 交易順序錯誤會導致對帳爆炸（金融、計費、ticketing）</li>
<li><strong>單 primary 容量天花板</strong>：Cloud SQL 最大 instance 仍撐不住、應用層 sharding 是大工程</li>
<li><strong>跨 region read latency</strong>：read 從各地直接打本地 replica、Cloud SQL read replica 受 single-primary 寫入 throughput 限制</li>
</ul>
<h3 id="no-go-condition基礎">No-go condition（基礎）</h3>
<p>流量集中單 region、跨 region 只是 DR 需求 → 維持 Cloud SQL + read replica + cross-region async DR 更便宜。這條 no-go 不複雜、但團隊常被 marketing 推著跳過 — 在自家 traffic dashboard 上 audit 一遍「write 來自哪些 region、各占比多少」、若 90%+ 來自單 region、Spanner 沒有 benefit。</p>
<h3 id="no-go-conditionsizing-barrier">No-go condition（sizing barrier）</h3>
<p>小 / 中型 PostgreSQL workload 的成本門檻 — Spanner 早期最小單位 100 processing units（≈ 1 node）對中小負載偏貴、過去是 sizing barrier；2021+ 推出 100 pu 起跳的 granular sizing 後雖然可從小開始、但 100 pu × per-pu monthly cost 加上跨 region replication 仍可能比 Cloud SQL HA 設定貴數倍。</p>
<p><strong>來源 9.C10「判讀」段第 3 點</strong>：Spanner 早期 100 pu 起跳是 sizing barrier、後來推出 granular sizing 才讓中小負載可從小開始。<strong>Dogfood 邊界明示</strong>：9.C10 case 揭露的 sizing 結構是 Google 內部 dogfood 的 capacity 規劃語言、不是 customer-facing pricing 承諾；客戶實際成本要看當期 Spanner pricing + region + replication config。</p>
<p>觸發 sizing no-go 的條件：</p>
<table>
  <thead>
      <tr>
          <th>信號</th>
          <th>判讀</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>workload row count &lt; 數百萬</td>
          <td>100 pu 對這個資料量過 over-provision</td>
      </tr>
      <tr>
          <td>QPS &lt; 1000</td>
          <td>100 pu 容量遠超實際 traffic、cost / QPS ratio 高</td>
      </tr>
      <tr>
          <td>單 region 即可滿足合規</td>
          <td>跨 region replication cost 是純浪費</td>
      </tr>
      <tr>
          <td>Cloud SQL HA 設定已 cover SLA</td>
          <td>升 Spanner 沒 marginal benefit</td>
      </tr>
  </tbody>
</table>
<p>觸發任一條 → 強烈建議走 Cloud SQL HA、不升 Spanner。判讀時要把 Cloud SQL HA cost vs Spanner 100 pu cost 對比清楚、避免讀者「想用新技術」而升級。</p>
<h3 id="no-go-condition應用層延遲容忍">No-go condition（應用層延遲容忍）</h3>
<p>應用層延遲容忍 &lt; 50ms write 的 workload 不該升 Spanner — 跨 region Spanner write 在物理光速硬限下達 100-200ms（<a href="../consistency-models-comparison/">consistency-models-comparison</a> 的 cross-region quorum 段）。延遲敏感 workload 升級後會在 p99 直接撞牆、回退時資料已經寫進 Spanner、roll back 成本巨大。</p>
<p><strong>來源 9.C10「判讀」段第 2 點 + 「策略」段第 3 點</strong>：「external consistency 必須等多區 quorum、跨洲交易延遲可達 100-200ms」。<strong>Dogfood 邊界明示</strong>：9.C10 揭露的數量級是 Google internal observation、客戶實際 latency 隨 voting region 配置變化、引用時要附條件。</p>
<p>觸發 latency no-go 的場景：</p>
<ul>
<li>實時報價系統（毫秒級回應）</li>
<li>高頻交易（HFT）</li>
<li>遊戲 leaderboard 寫入</li>
<li>低延遲 OLTP（金融下單、支付路由）</li>
</ul>
<p>觸發任一條 → 強烈建議走 Cloud SQL 單 region、或考慮把 <em>跨 region 一致性需求</em> 重新審視（是否真的需要強一致、能不能改 event-driven async reconcile）。</p>
<h3 id="替代方案排除">替代方案排除</h3>
<ul>
<li><strong>Aurora DSQL</strong>：AWS 生態、若團隊在 GCP、跨雲不合</li>
<li><strong>CockroachDB</strong>：要自管或想 PostgreSQL wire 但不選 GCP 託管時可考慮、本 playbook 不對照</li>
<li><strong>Citus on Cloud SQL</strong>：multi-region write 不是強項、不解 cross-region external consistency 需求</li>
</ul>
<h3 id="case-anchor--dogfood-邊界">Case anchor + dogfood 邊界</h3>
<p><strong>無強 customer case</strong>。9.C10 是 Google 內部 dogfood、不是公開遷移 case；本 playbook 用 Spanner overview 的 PostgreSQL dialect 路徑 + 官方 migration guide + 通用 pattern。引用時必須明示「9.C10 揭露的線性 scaling / line-rate 設計目標是 Spanner 設計依據、不等於客戶遷移後可獲得的 capacity」。</p>
<p>對照 case：<a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered Aurora 受監管 banking</a> — 雖然是 Aurora、不是 Spanner、但揭露「受監管 OLTP 遷移要算合規 lead time」「資料駐留限制 = 容量規劃 per-市場」這兩條結論在 Spanner 遷移同樣適用。讀者若是受監管產業、跨 region instance config 還要疊上 voting region 是否落在合規市場的 audit。</p>
<h2 id="diff-audit6-規格面--sizing--cost-第-7-面">Diff Audit（6 規格面 + sizing / cost 第 7 面）</h2>
<h3 id="schema-diff">Schema diff</h3>
<p>PostgreSQL DDL → Spanner PostgreSQL dialect 對照：</p>
<table>
  <thead>
      <tr>
          <th>PostgreSQL 特性</th>
          <th>Spanner 對應</th>
          <th>動作</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>SERIAL</code></td>
          <td>bit-reversed sequence</td>
          <td>改 primary key 策略、避免 hot split</td>
      </tr>
      <tr>
          <td><code>JSONB</code></td>
          <td><code>JSON</code> type</td>
          <td>大部分相容、複雜 path query 重寫</td>
      </tr>
      <tr>
          <td><code>ARRAY</code></td>
          <td><code>ARRAY</code></td>
          <td>OK</td>
      </tr>
      <tr>
          <td><code>PARTITION BY</code></td>
          <td>不直接支援</td>
          <td>改成 interleaved table 或單表</td>
      </tr>
      <tr>
          <td><code>FOREIGN KEY</code></td>
          <td>保留 FK constraint + 考慮 <a href="/blog/backend/knowledge-cards/interleaved-table/" data-link-title="Interleaved Table" data-link-desc="Spanner 把 parent / child table row 物理交錯儲存、parent &#43; child JOIN 不跨 split">Interleaved Table</a></td>
          <td>parent-child access pattern 改 interleaved</td>
      </tr>
      <tr>
          <td><code>B-tree INDEX</code></td>
          <td>OK</td>
          <td>直接遷</td>
      </tr>
      <tr>
          <td><code>GIN / GiST INDEX</code></td>
          <td>不支援</td>
          <td>用 <code>STORING</code> column 取代部分需求、其餘改應用層</td>
      </tr>
      <tr>
          <td><code>CHECK constraint</code></td>
          <td>部分支援（time-sensitive、查最新文件）</td>
          <td>audit 每條 constraint</td>
      </tr>
      <tr>
          <td><code>UDF / stored procedure</code></td>
          <td>少數支援</td>
          <td>改應用層或 client-side compute</td>
      </tr>
      <tr>
          <td><code>TRIGGER</code></td>
          <td>不支援</td>
          <td>改 application 層或 Spanner change streams</td>
      </tr>
  </tbody>
</table>
<p>interleaved table 設計參考 <a href="../schema-migration-interleaved-tables/">schema-migration-interleaved-tables</a>。讀者要在 schema audit 階段就決定哪些 parent-child 該 interleave、避免後悔成本。</p>
<h3 id="operational-diff">Operational diff</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Cloud SQL</th>
          <th>Spanner</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>基礎架構</td>
          <td>VM-based</td>
          <td>API-based</td>
      </tr>
      <tr>
          <td>認證</td>
          <td>postgres user / role</td>
          <td>IAM role / service account</td>
      </tr>
      <tr>
          <td>備份</td>
          <td>pg_dump / pgBackRest</td>
          <td>point-in-time backup（PITR）</td>
      </tr>
      <tr>
          <td>監控</td>
          <td>postgres-flavor（pg_stat_*）</td>
          <td>Cloud Monitoring <code>spanner.*</code></td>
      </tr>
      <tr>
          <td>Connection pool</td>
          <td>PgBouncer</td>
          <td>SDK 內 gRPC pool</td>
      </tr>
      <tr>
          <td>Vacuum</td>
          <td>必要</td>
          <td>不存在（MVCC 機制不同）</td>
      </tr>
      <tr>
          <td>Replication lag</td>
          <td>需監控</td>
          <td>不存在 single-primary 概念</td>
      </tr>
  </tbody>
</table>
<p>不再需要的 Cloud SQL 責任：vacuum、autovacuum tuning、connection pool（PgBouncer）、replication lag 監控、Patroni HA。</p>
<p>新增 Spanner 責任：processing unit capacity 預測、TrueTime ε 觀測（<a href="../truetime-api-depth/">truetime-api-depth</a>）、long-running schema operation 跟蹤、IAM 細粒度權限。</p>
<h3 id="paradigm-diff">Paradigm diff</h3>
<p>從 single-primary OLTP → 跨 region distributed SQL：</p>
<ul>
<li>transaction commit latency：&lt; 5ms → 50-200ms（跨洲、含 <a href="/blog/backend/knowledge-cards/commit-wait/" data-link-title="Commit Wait" data-link-desc="Spanner external consistency 的核心機制 — read-write transaction 拿 commit timestamp s 後等到 TT.after(s) 才 ACK、wait ≈ 2ε、付 latency tax 換 commit 順序 = real-time 順序">Commit Wait</a> + cross-region quorum）</li>
<li>external consistency 是 default（不再是 isolation level 選擇題）</li>
<li>transaction 上限：Cloud SQL 無硬限 → Spanner 10s timeout、要重構成短交易</li>
<li>read consistency：default eventual → default strong、需顯式選 bounded staleness</li>
</ul>
<p>詳細 consistency model 差異看 <a href="../consistency-models-comparison/">consistency-models-comparison</a>。</p>
<h3 id="component-diff">Component diff</h3>
<p>退役：</p>
<ul>
<li>PgBouncer / pgcat（connection pool）</li>
<li>Cloud SQL HA / Patroni cluster</li>
<li>pgBackRest（備份外掛）</li>
<li>Citus extension（若有用）</li>
<li>各種 postgres extension（時間敏感、逐個 audit 是否 Spanner 支援等效）</li>
</ul>
<p>新增：</p>
<ul>
<li>Spanner client library（Go / Java / Node / Python）</li>
<li>Dataflow（用於 bulk export-import）</li>
<li>Datastream / Database Migration Service（用於 CDC catch-up）</li>
<li>Spanner Studio（query UI）</li>
</ul>
<h3 id="application-diff">Application diff</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Cloud SQL（PostgreSQL client）</th>
          <th>Spanner</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ORM</td>
          <td>全 PG ORM 相容</td>
          <td>PostgreSQL dialect 相容部分 ORM、查最新 dialect 支援列表</td>
      </tr>
      <tr>
          <td>Connection model</td>
          <td>process-per-connection（postgres）</td>
          <td>stateless gRPC client（SDK 內 pool）</td>
      </tr>
      <tr>
          <td>Transaction model</td>
          <td>可長交易</td>
          <td>10s timeout、需短交易</td>
      </tr>
      <tr>
          <td>Timestamp 使用</td>
          <td>app 內 <code>now()</code> / <code>CURRENT_TIMESTAMP</code></td>
          <td>改用 <code>PENDING_COMMIT_TIMESTAMP</code> sentinel</td>
      </tr>
      <tr>
          <td>Cursor / prepared statement</td>
          <td>全支援</td>
          <td>部分支援、查 SDK 文件</td>
      </tr>
      <tr>
          <td>Stored procedure</td>
          <td>全支援</td>
          <td>少數支援、業務邏輯改應用層</td>
      </tr>
  </tbody>
</table>
<p>ORM 兼容性是 time-sensitive claim — JPA / Hibernate / SQLAlchemy 在 Spanner PostgreSQL dialect 上的行為隨 dialect 版本演進、實作前查最新 vendor docs。讀者要把 ORM 兼容測試放 Phase 0、不能假設「PostgreSQL ORM 直接搬到 Spanner」。</p>
<h3 id="data-topology-diff">Data topology diff</h3>
<ul>
<li>Single primary（write）+ read replica → multi-region voting + read-only replica</li>
<li>Primary key 設計：避免單調遞增（SERIAL）造成 hot split、改 UUID 或 bit-reversed</li>
<li>Partition：PostgreSQL declarative partition → Spanner 不需要顯式 partition（自動 split）</li>
</ul>
<h3 id="sizing--cost-diff第-7-規格面">Sizing / cost diff（第 7 規格面）</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Cloud SQL</th>
          <th>Spanner</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>計費單位</td>
          <td>instance class（vCPU / RAM）+ storage IOPS + HA add-on</td>
          <td>100 processing units 起跳 ≈ 1 node</td>
      </tr>
      <tr>
          <td>起跳成本</td>
          <td>小型 instance 月成本可控（小型 HA $50-200/月）</td>
          <td>100 pu × per-pu monthly rate、月成本是 Cloud SQL 小型 HA 的數倍</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>獨立計費（GB / month）</td>
          <td>含在 node count 內、無單獨 storage charge</td>
      </tr>
      <tr>
          <td>Throughput cap</td>
          <td>隨 instance class</td>
          <td>隨 pu 線性擴展</td>
      </tr>
      <tr>
          <td>跨 region replication</td>
          <td>額外 read replica cost</td>
          <td>含在 multi-region instance config 內</td>
      </tr>
      <tr>
          <td>Egress</td>
          <td>跨 region 額外</td>
          <td>跨 region 額外</td>
      </tr>
  </tbody>
</table>
<p>觸發 sizing audit 的時機：workload 行數、QPS、跨 region 需求都明確後、把「Cloud SQL HA monthly bill」對「Spanner 100 pu × monthly rate + egress」做 cost crossover 分析、無法 cost crossover 證明 → 不升。</p>
<p>Cost crossover 不是「Spanner 成本必須低於 Cloud SQL」、是「Spanner 多付的成本要對應到具體 benefit」：</p>
<ul>
<li>若 benefit 是 multi-region write residency、Spanner 多付的 cost 換得跨 region 一致性 — 對齊</li>
<li>若 benefit 只是「更新的技術」、Spanner 多付的 cost 沒對應產品價值 — 不升</li>
</ul>
<h3 id="type-判定">Type 判定</h3>
<p><strong>Type E（paradigm shift）</strong>、不是 drop-in。schema / app / operation / data topology / cost 五軸都動、不能用 Type B（drop-in）思路規劃 phase。詳細 type 判定方法看 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology</a>。</p>
<h2 id="phase-plan9-段每段有驗證門檻">Phase Plan：9 段、每段有驗證門檻</h2>
<h3 id="phase-0--compatibility-audit--sizing-audit">Phase 0 — Compatibility audit + sizing audit</h3>
<p>跑 schema-converter（pgloader / Spanner migration tool）、列出 incompatible feature、決定哪些改 schema、哪些改 app。hot key 風險評估（SERIAL primary key、單調遞增 timestamp）。</p>
<p>同時跑 sizing audit：</p>
<ul>
<li>估 target Spanner pu 數（基於 QPS、storage size、cross-region replication factor）</li>
<li>做 Cloud SQL HA cost vs Spanner cost crossover 分析</li>
<li>若 cost crossover 證明不出來 → halt migration、回到 driver 段重審</li>
</ul>
<p>Phase 0 是 migration 的決策閘門 — 不過閘門就停、不浪費 Phase 1+ 的 engineering effort。</p>
<h3 id="phase-1--target-schema-design">Phase 1 — Target schema design</h3>
<ul>
<li>interleaved table 設計（base on Phase 0 access pattern audit）</li>
<li>Index 重寫（GIN / GiST 用 STORING column 替代、其他用 B-tree）</li>
<li>Primary key 反序（避免 hot split）</li>
<li>Storing column 選擇（trade-off：query latency vs index size）</li>
</ul>
<p>Output 是 target DDL、跟原 PostgreSQL schema 並排 diff 文件、給 application 團隊審。</p>
<h3 id="phase-2--application-dual-target-preparation">Phase 2 — Application dual-target preparation</h3>
<ul>
<li>抽象 DB layer（repository pattern、避免直接呼 SQL）</li>
<li>SDK 並存（go-pg + Spanner client）</li>
<li>Feature flag 控制讀寫路徑（read-from-pg / read-from-spanner / dual-write）</li>
<li>Transaction 模式 audit（長交易拆短）</li>
</ul>
<h3 id="phase-3--bulk-initial-load">Phase 3 — Bulk initial load</h3>
<p>Cloud SQL → Cloud Storage（CSV / Avro）→ Dataflow → Spanner。Row count + checksum 驗證、column-level diff sample。</p>
<h3 id="phase-4--cdc-catch-up">Phase 4 — CDC catch-up</h3>
<p>Datastream from Cloud SQL → Dataflow → Spanner。Replication lag &lt; 1s 為前進門檻、sustained 24h。</p>
<h3 id="phase-5--shadow-read">Phase 5 — Shadow read</h3>
<p>Production read 同時打 Cloud SQL 跟 Spanner、diff log 異常。至少 7 天觀察、divergence rate &lt; 0.1%、p99 latency Spanner &lt; 1.5x Cloud SQL。</p>
<h3 id="phase-6--dual-write">Phase 6 — Dual write</h3>
<p>Cloud SQL 為 source-of-truth、Spanner 為 mirror。偵測 dual write divergence、評估是否提早升 source-of-truth。</p>
<h3 id="phase-7--cutover">Phase 7 — Cutover</h3>
<p>read-only window（&lt; 5 min）→ 最後 catch-up → switch source-of-truth → cutover application write。</p>
<h3 id="phase-8--cleanup">Phase 8 — Cleanup</h3>
<p>退役 Cloud SQL primary、保留 backup、清 PgBouncer / Patroni / 監控 dashboard。</p>
<h3 id="stage-0-variant-規劃">Stage 0 variant 規劃</h3>
<p>若 read-only window 不可接受（24/7 不能停機的金融 / 醫療系統）、Phase 6 dual write 期間做 conflict resolution（last-writer-wins + manual reconcile）、進入 fail-forward 模式、不走 read-only cutover。</p>
<h2 id="evidence每階段驗證材料">Evidence：每階段驗證材料</h2>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Phase 0</td>
          <td>incompatible feature 清單、預估改動 SP、hot key 風險 row count、<strong>sizing audit 報告</strong>（target pu 數估算 + Cloud SQL HA vs Spanner cost crossover 月 / 年成本對比）</td>
      </tr>
      <tr>
          <td>Phase 1</td>
          <td>DDL diff report、預估 backfill 時間（基於 row count + Spanner 文件）</td>
      </tr>
      <tr>
          <td>Phase 3</td>
          <td>row count 對齊、column-level checksum、payload sample diff</td>
      </tr>
      <tr>
          <td>Phase 4</td>
          <td>CDC lag &lt; 1s sustained 24h、error rate &lt; 0.01%</td>
      </tr>
      <tr>
          <td>Phase 5</td>
          <td>shadow read divergence rate &lt; 0.1%、p99 latency Spanner &lt; 1.5x Cloud SQL</td>
      </tr>
      <tr>
          <td>Phase 6</td>
          <td>dual write divergence &lt; 0.01%、reconcile queue 不積壓</td>
      </tr>
      <tr>
          <td>Phase 7</td>
          <td>cutover window 內 write 一致性、回到 Phase 6 的條件（rollback path）</td>
      </tr>
  </tbody>
</table>
<p><strong>Cost crossover 報告</strong>（Phase 0 必交付）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Item                          | Cloud SQL HA | Spanner 100 pu | Delta
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">------------------------------|--------------|----------------|------
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">Compute monthly               | $X           | $Y             | $Y-X
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">Storage monthly               | $A           | (included)     | -$A
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">Cross-region replication      | $B           | (included)     | -$B
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Egress (est)                  | $C           | $C             | $0
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">Total monthly                 | $X+A+B+C     | $Y+C           | $Y-X-A-B
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">Annual                        | 12*above     | 12*above       | -
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">Benefit (qualitative)         | -            | multi-region write residency / external consistency | -
</span></span><span class="line"><span class="ln">10</span><span class="cl">Crossover verdict             | -            | proceed / halt | -</span></span></code></pre></div><p>Verdict = <code>proceed</code> 才進 Phase 1；<code>halt</code> → 回到 Driver 段重審 driver 是否成立。</p>
<p>所有 evidence 進 incident decision log、回 <a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a>。</p>
<h2 id="cutover決策與-rollback">Cutover：決策與 rollback</h2>
<h3 id="cutover-window">Cutover window</h3>
<p>選用戶最低流量時段、&lt; 5 min read-only freeze、預先通知。受監管產業（對照 <a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered</a>）要算合規 lead time、每市場各自審。</p>
<h3 id="decision-owner">Decision owner</h3>
<p>DB lead + product lead + on-call SRE 共同 sign-off。受監管產業多加合規 owner。</p>
<h3 id="rollback-condition">Rollback condition</h3>
<ul>
<li>cutover 後 30 min 內 p99 write latency 持續 &gt; SLA 2x → rollback</li>
<li>error rate &gt; 1% sustained 5 min → rollback</li>
<li>對帳系統發現 divergence &gt; 0.1% → rollback</li>
</ul>
<h3 id="rollback-機制">Rollback 機制</h3>
<p>保留 Cloud SQL 為 read-only mirror 14 天、Spanner 改 read-only、reverse CDC（Spanner → Cloud SQL）需事先準備。Reverse CDC 在 Phase 4-6 期間就要 dry-run 過、不能 cutover 才第一次試。</p>
<p>連結 <a href="/blog/backend/knowledge-cards/rollback-window/" data-link-title="Rollback Window" data-link-desc="說明變更進入 production 後還能用哪種方式回退或改路線的時間與條件">rollback-window</a>、<a href="/blog/backend/knowledge-cards/rollback-condition/" data-link-title="Rollback Condition" data-link-desc="說明決策執行後出現哪些訊號時要撤回、回退或改路線">rollback-condition</a>。</p>
<h2 id="cleanup退役清單跟保留責任">Cleanup：退役清單跟保留責任</h2>
<h3 id="退役清單">退役清單</h3>
<ul>
<li>Cloud SQL primary instance</li>
<li>PgBouncer 配置</li>
<li>Patroni cluster</li>
<li>pgBackRest backup job（保留歸檔 90 天、依產業合規）</li>
<li>Datastream pipeline</li>
<li>Dataflow job</li>
</ul>
<h3 id="監控清理">監控清理</h3>
<p>postgres-specific dashboard（exporter / wal lag / autovacuum）改成 Spanner dashboard（commit_latencies / clock_skew_ms / cpu_utilization_by_priority）。</p>
<h3 id="文件--runbook-更新">文件 / runbook 更新</h3>
<p>postgres operation runbook 標記 deprecated、Spanner runbook 上線。新 runbook 含：</p>
<ul>
<li>DDL long-running operation 監控</li>
<li>TrueTime ε 異常處理</li>
<li>Cross-region instance failover drill</li>
<li>Cost monitoring alert</li>
</ul>
<h3 id="稽核--合規">稽核 / 合規</h3>
<p>保留 final pg_dump 7 年（依產業）、incident write-back 完成、合規市場各自留檔（對照 Standard Chartered case 的 per-市場合規 lead time）。</p>
<h2 id="邊界與整合sibling對照anti-recommendation">邊界與整合：sibling、對照、anti-recommendation</h2>
<h3 id="sibling-deep-articles">Sibling deep articles</h3>
<ul>
<li><a href="../truetime-api-depth/">truetime-api-depth</a>：app 對 timestamp 假設審計（Phase 2 必讀）</li>
<li><a href="../schema-migration-interleaved-tables/">schema-migration-interleaved-tables</a>：Phase 1 target schema 設計</li>
<li><a href="../consistency-models-comparison/">consistency-models-comparison</a>：Phase 0 應用層一致性要求釐清、Driver 段 latency no-go 的物理硬限</li>
</ul>
<h3 id="跟其他-migration-對照">跟其他 migration 對照</h3>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/" data-link-title="PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift" data-link-desc="Aurora DSQL（2024-12 re:Invent preview / 2025-05 GA）是 AWS 推的 PG wire-compatible *active-active distributed SQL*、跟 self-managed PG / Aurora PG 不同 paradigm（OCC &#43; snapshot isolation &#43; multi-region strong consistency）。Migration 結構是 *protocol drop-in &#43; paradigm shift*：app SQL 不太改、但 transaction retry / extension 缺位 / 多 region 一致性需重設計。本文走 DSQL vs Aurora PG vs self-managed PG 三軸對比、為什麼遷的三條 driver（global write / operational zero-touch / region resiliency）、Type E phased plan、5 production 踩雷（transaction retry 沒處理 / extension 缺位 / sequence throughput 限制 / Aurora PG 直升 DSQL 不可行 / region failover semantic）、跟 PG → Aurora 跟 PG → CockroachDB 對比">PostgreSQL → Aurora DSQL Migration</a>：兩者都是 PostgreSQL → distributed SQL paradigm shift、選 GCP / AWS 看生態</li>
<li><a href="/blog/backend/01-database/large-scale-db-migration/" data-link-title="1.12 大規模 DB 遷移實戰" data-link-desc="跨 DB 遷移的 dual-write、[shadow read](/backend/knowledge-cards/shadow-read/)、cutover、rollback 流程 — 從實戰案例提煉的工程做法">1.12 大規模 DB 遷移實戰</a>：通用大規模遷移方法論</li>
</ul>
<h3 id="跟-case-對照">跟 case 對照</h3>
<ul>
<li><a href="/blog/backend/09-performance-capacity/cases/spanner-planetary-scale-database-gcp/" data-link-title="9.C10 Cloud Spanner：每秒 10 億請求的全球一致性資料庫" data-link-desc="Google Cloud Spanner 內部峰值 10 億 req/sec、跨地區強一致 — 全球分散式 OLTP 容量參考">9.C10 Cloud Spanner planetary scale</a>：dogfood case、揭露 Spanner 設計目標、不是 customer-facing capacity reference</li>
<li><a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered Aurora banking</a>：受監管產業遷移要算合規 lead time、per-市場容量規劃</li>
</ul>
<h3 id="anti-recommendation">Anti-recommendation</h3>
<p>讀者讀完本文應該能判斷：</p>
<ul>
<li>若 driver 只是「想用新技術」→ 回 Cloud SQL</li>
<li>若 workload 小（QPS &lt; 1000、行數 &lt; 數百萬）→ Cloud SQL HA 更划算</li>
<li>若應用層延遲容忍 &lt; 50ms write → Cloud SQL 單 region</li>
<li>若 cost crossover 證明不出來 → halt migration、不升</li>
</ul>
<p>Driver 是真正跨 region write residency / external consistency 對帳契約 / 單 primary 容量天花板 → 才升。Migration playbook 的目標不是把所有 Cloud SQL workload 升到 Spanner、是把「適合升」的部分用低風險路徑遷過去。</p>
]]></content:encoded></item><item><title>PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/declarative-partitioning/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/declarative-partitioning/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明大表（&amp;gt; 1TB）需要 partitioning、本文聚焦 &lt;em>partition 真實價值在哪、為什麼多數人第一次 partition 都做錯&lt;/em>。&lt;/p>&lt;/blockquote>
&lt;h2 id="partition-不是把大表切小是讓-planner-pruning--縮小-maintenance-scope">Partition 不是「把大表切小」、是「讓 planner pruning + 縮小 maintenance scope」&lt;/h2>
&lt;p>剛開始學 partitioning 的人多半從「表太大、切小一點」直覺出發；切了之後發現 — &lt;em>query 變慢&lt;/em>（planner 還在看所有 partition）、&lt;em>INSERT 變慢&lt;/em>（trigger / partition routing overhead）、&lt;em>backup 沒變短&lt;/em>（總資料量沒變）。直覺錯了：partition 的工程價值來自兩個機制、跟「切小」沒直接關係：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Query planner pruning&lt;/strong>：planner 在 planning 階段 &lt;em>跳過&lt;/em> 不可能命中 partition key 的 partition、查詢只 scan 相關 partition；前提是 &lt;em>WHERE 條件含 partition key&lt;/em>、否則 planner 看完所有 partition、效能反而比單表差&lt;/li>
&lt;li>&lt;strong>Maintenance scope 縮小&lt;/strong>：vacuum / index rebuild / DROP / archive 只動單一 partition、不掃整表；vacuum 12 小時變 30 分鐘 / DROP 老資料 0.01 秒、是 partition 真正回本的地方&lt;/li>
&lt;/ol>
&lt;p>partition 是 &lt;em>為了 maintenance 跟 planner pruning&lt;/em> 設計、不是「表變小」設計。漏掉這個 framing、partition 配置會錯。&lt;/p>
&lt;h2 id="range--list--hashpartition-策略對應業務形狀">RANGE / LIST / HASH：partition 策略對應業務形狀&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- RANGE: 時間序列、log、event（最常見）
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">bigint&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">event_time&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">timestamptz&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">payload&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">jsonb&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">RANGE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">event_time&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events_2026_05&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">OF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FOR&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;2026-05-01&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;2026-06-01&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- LIST: tenant ID / region / status enum
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">bigint&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">tenant_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">int&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">...&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">LIST&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tenant_id&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders_tenant_premium&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">OF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FOR&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1001&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">1002&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">1003&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- HASH: 均勻散落（無自然 partition key）
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">23&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">bigint&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">24&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">...&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">25&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">HASH&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">26&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">27&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users_0&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">OF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">28&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FOR&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WITH&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">MODULUS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">REMAINDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>策略選擇關鍵：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明大表（&gt; 1TB）需要 partitioning、本文聚焦 <em>partition 真實價值在哪、為什麼多數人第一次 partition 都做錯</em>。</p></blockquote>
<h2 id="partition-不是把大表切小是讓-planner-pruning--縮小-maintenance-scope">Partition 不是「把大表切小」、是「讓 planner pruning + 縮小 maintenance scope」</h2>
<p>剛開始學 partitioning 的人多半從「表太大、切小一點」直覺出發；切了之後發現 — <em>query 變慢</em>（planner 還在看所有 partition）、<em>INSERT 變慢</em>（trigger / partition routing overhead）、<em>backup 沒變短</em>（總資料量沒變）。直覺錯了：partition 的工程價值來自兩個機制、跟「切小」沒直接關係：</p>
<ol>
<li><strong>Query planner pruning</strong>：planner 在 planning 階段 <em>跳過</em> 不可能命中 partition key 的 partition、查詢只 scan 相關 partition；前提是 <em>WHERE 條件含 partition key</em>、否則 planner 看完所有 partition、效能反而比單表差</li>
<li><strong>Maintenance scope 縮小</strong>：vacuum / index rebuild / DROP / archive 只動單一 partition、不掃整表；vacuum 12 小時變 30 分鐘 / DROP 老資料 0.01 秒、是 partition 真正回本的地方</li>
</ol>
<p>partition 是 <em>為了 maintenance 跟 planner pruning</em> 設計、不是「表變小」設計。漏掉這個 framing、partition 配置會錯。</p>
<h2 id="range--list--hashpartition-策略對應業務形狀">RANGE / LIST / HASH：partition 策略對應業務形狀</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- RANGE: 時間序列、log、event（最常見）
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">event_time</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2026_05</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-06-01&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- LIST: tenant ID / region / status enum
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">  </span><span class="n">tenant_id</span><span class="w"> </span><span class="nb">int</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">  </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">LIST</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders_tenant_premium</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="mi">1001</span><span class="p">,</span><span class="w"> </span><span class="mi">1002</span><span class="p">,</span><span class="w"> </span><span class="mi">1003</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"></span><span class="c1">-- HASH: 均勻散落（無自然 partition key）
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="n">user_id</span><span class="w"> </span><span class="nb">bigint</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">  </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">HASH</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users_0</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">users</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">MODULUS</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="n">REMAINDER</span><span class="w"> </span><span class="mi">0</span><span class="p">);</span></span></span></code></pre></div><p>策略選擇關鍵：</p>
<ul>
<li><strong>RANGE</strong> 適合 <em>時間 / 有序值</em> — query 多半帶 <code>WHERE event_time &gt;= X</code>、prune 效率最高；archive / drop 老資料是 <code>DROP PARTITION</code> 0.01 秒</li>
<li><strong>LIST</strong> 適合 <em>離散 enum / tenant</em> — query 帶 <code>WHERE tenant_id = X</code> prune；缺點是 tenant 增長要手動 ALTER ADD PARTITION</li>
<li><strong>HASH</strong> 適合 <em>均勻分散、沒自然 key</em> — query 多半 by-PK lookup、HASH 讓單 partition 大小均勻；prune 只在 <code>WHERE hash_key = X</code> 等值查詢觸發</li>
</ul>
<h3 id="選錯-partition-key-是最常見的錯誤">選錯 partition key 是最常見的錯誤</h3>
<p>例：events 表用 <code>user_id</code> HASH partition、但 query 多半 <code>WHERE event_time BETWEEN ...</code>、<code>user_id</code> 不在 WHERE — planner 沒法 prune、掃所有 partition、效能比單表更差（多了 partition routing overhead）。</p>
<p>partition key <em>必須</em> 對應 query 最常用的 WHERE filter；錯了就退化成 <em>維護面有好處、查詢面有壞處</em> 的尷尬狀態。</p>
<h2 id="partition-pruningplanner-怎麼決定跳過">Partition pruning：planner 怎麼決定跳過</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-05-15&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 期望輸出包含：
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1">--  Append (cost=...)
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1">--    -&gt; Seq Scan on events_2026_05  (cost=...)
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1">-- (只 scan 一個 partition、其他 partition pruned)</span></span></span></code></pre></div><p>pruning 觸發條件：</p>
<ol>
<li>WHERE 含 partition key 的 <em>constant expression</em>（<code>WHERE x = 5</code> 觸發；<code>WHERE x = some_function()</code> 不觸發 planning-time prune、但 PG 11+ execution-time prune 可救）</li>
<li>PG 11+ 支援 <em>execution-time pruning</em> — query plan 內含 partition key、runtime 才知道值（prepared statement / NestedLoop join）</li>
<li>partition key 不在 WHERE 時 — <em>全部 partition 掃</em>、是反指標、表示 partition strategy 不對</li>
</ol>
<h3 id="partition-wise-join--aggregate-pg-11">Partition-wise join / aggregate (PG 11+)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SET</span><span class="w"> </span><span class="n">enable_partitionwise_join</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">on</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="n">enable_partitionwise_aggregate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">on</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 兩個同 partition 策略的表 JOIN 時、planner 可 partition-wise 平行做
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">events_metadata</span><span class="w"> </span><span class="n">m</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">ON</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">event_time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">event_time</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">;</span></span></span></code></pre></div><p>需要兩個表 <em>partition strategy 完全一致</em>（同 partition key + 同 partition boundary）— 設計時對齊、後期不容易調整。</p>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1partition-key-選錯query-變慢">Case 1：partition key 選錯，query 變慢</h3>
<p><strong>徵兆</strong>：partition 後特定查詢從 200ms 變成 2000ms；EXPLAIN 顯示 <code>Append</code> 下面所有 partition 都被 scan、沒 partition 被 prune。</p>
<p><strong>根因</strong>：partition by <code>user_id</code> HASH、但 query 多用 <code>WHERE created_at BETWEEN X AND Y</code>；planner 不知道 user 在哪個 partition、必須掃全部。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>驗證 step</strong>：partition 前先 <code>pg_stat_statements</code> 看 top 10 query 的 WHERE pattern、partition key 必須對應其中 80% 流量的 filter</li>
<li><strong>修正</strong>：DROP partition strategy、改 partition by <code>created_at</code> RANGE；遷移用 <code>pg_dump --section=data</code> per-partition 重灌</li>
<li><strong>避免</strong>：partitioning 不可逆、設計階段 query pattern 沒看清楚不要動</li>
</ol>
<h3 id="case-2cross-partition-unique-constraint-不-enforce">Case 2：cross-partition unique constraint 不 enforce</h3>
<p><strong>徵兆</strong>：partition 後發現 application code 寫死 duplicate user_email、但 unique constraint 沒擋；DB 內有同 email 多筆。</p>
<p><strong>根因</strong>：PostgreSQL partition table 的 <code>UNIQUE</code> constraint <em>必須包含 partition key</em> — <code>UNIQUE (email)</code> 在 partition by <code>tenant_id</code> 的表上 <em>無法 enforce</em>（PostgreSQL 拒建）；workaround 用 <code>UNIQUE (email, tenant_id)</code>、但業務語意是「email 全域唯一」、PG 無法保證。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>架構</strong>：跨 partition 唯一性必須在 <em>application 層</em> enforce（lock + check 模式）</li>
<li><strong>替代</strong>：用 <em>non-partitioned</em> 表存唯一性目標（user_email_registry）、做寫入前 lookup</li>
<li><strong>設計階段檢查</strong>：partition by X、unique constraint 必須含 X；若業務要求 unique 不含 X、partition strategy 錯</li>
</ol>
<h3 id="case-3attach-partition-鎖表太久">Case 3：ATTACH PARTITION 鎖表太久</h3>
<p><strong>徵兆</strong>：新 month partition <code>ATTACH PARTITION</code> 跑 30 秒、期間整個 events 表 read 阻塞、application timeout 大量。</p>
<p><strong>根因</strong>：<code>ATTACH PARTITION</code> 預設加 <code>ACCESS EXCLUSIVE</code> lock 在 parent table、scan 整個新 partition 驗證 CHECK constraint；大 partition + 沒 CHECK constraint 預先驗證 → 鎖時間爆。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. 先把要 attach 的 partition 加 CHECK constraint，用 NOT VALID 不掃描
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2026_06</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">events_2026_06_range</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-06-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-07-01&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">VALID</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- 2. VALIDATE 用 SHARE UPDATE EXCLUSIVE lock、允許讀寫
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2026_06</span><span class="w"> </span><span class="n">VALIDATE</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">events_2026_06_range</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- 3. ATTACH 不再需要 scan（CHECK 已 VALIDATE 過）
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">ATTACH</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">events_2026_06</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-06-01&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-07-01&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- ATTACH 變 instant</span></span></span></code></pre></div><h3 id="case-4partition-數爆炸planner-planning-time-爆">Case 4：partition 數爆炸，planner planning time 爆</h3>
<p><strong>徵兆</strong>：partition 累積到 500+（daily partition 跑 1-2 年）、簡單 query EXPLAIN 顯示 planning_time 從 1ms 漲到 200ms、application response 變慢。</p>
<p><strong>根因</strong>：partition 越多 planner 要評估的 partition 越多、即使有 pruning、planning 階段也要 walk 全部 partition table；500+ partition 是 planning overhead 明顯的閾值。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>架構</strong>：partition granularity 對應 retention — 不要 daily partition 留 2 年（→ weekly / monthly）</li>
<li><strong>archive 老 partition</strong>：DETACH 老 partition、轉成 cold storage 表、planner 不再看</li>
<li><strong><code>enable_partition_pruning</code></strong> 預設 on、確保啟用</li>
<li><strong>PG 12+</strong>：planner 對 partition table 的 list 處理優化、planning time 上限拉高、但仍要控</li>
</ol>
<h3 id="case-5detach-後磁碟空間沒回收">Case 5：DETACH 後磁碟空間沒回收</h3>
<p><strong>徵兆</strong>：DETACH PARTITION 後 <code>pg_database_size</code> 沒下降、預期釋放 50GB；磁碟仍滿。</p>
<p><strong>根因</strong>：DETACH 只是把 partition 從 parent table <em>分離</em>、partition 自己仍是獨立表存在；要真釋放需要 <code>DROP TABLE detached_partition</code>。SRE 以為 DETACH = 刪掉。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 完整流程
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">DETACH</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">events_2024_01</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- events_2024_01 仍存在、佔磁碟
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 確認沒 query 在用後
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2024_01</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- 才釋放磁碟</span></span></span></code></pre></div><h3 id="routinearchive-workflow">Routine：archive workflow</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 月底跑：
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1">-- 1. detach 13 個月前的 partition
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">DETACH</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">events_2025_04</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 2. dump 到 cold storage
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="err">\</span><span class="k">COPY</span><span class="w"> </span><span class="n">events_2025_04</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="s1">&#39;/cold/events_2025_04.csv&#39;</span><span class="w"> </span><span class="p">(</span><span class="n">FORMAT</span><span class="w"> </span><span class="n">CSV</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- 3. drop 釋放磁碟
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2025_04</span><span class="p">;</span></span></span></code></pre></div><h2 id="容量規劃">容量規劃</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>估算</th>
          <th>警戒</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>單 partition size</td>
          <td>跟單表 vacuum 上限對齊（10-100GB sweet spot）</td>
          <td>&gt; 200GB 時考慮 sub-partition 或細化 granularity</td>
      </tr>
      <tr>
          <td>Partition 數量</td>
          <td>對應 retention × granularity</td>
          <td>&gt; 200 partition 時 planning time 開始浮現</td>
      </tr>
      <tr>
          <td>Partition key cardinality</td>
          <td>LIST：&lt; 100 / HASH：自定 modulus / RANGE：時間 + 維度</td>
          <td>太多獨立 partition value 用 HASH</td>
      </tr>
      <tr>
          <td>Cross-partition query 比例</td>
          <td>EXPLAIN 看 partition scan 數</td>
          <td>&gt; 30% query 掃 &gt; 50% partition 表示 key 選錯</td>
      </tr>
      <tr>
          <td>Maintenance window</td>
          <td>DROP / DETACH / ATTACH 各 partition 各自管</td>
          <td>hot partition 維護仍在 maintenance window</td>
      </tr>
  </tbody>
</table>
<p>實務 default：</p>
<ul>
<li>時間序列（events / log）：monthly RANGE partition、retention 12-24 個月</li>
<li>Multi-tenant（orders / records）：tenant_id LIST partition + 大 tenant 各自獨立 partition</li>
<li>均勻散落（user / metric）：8-16 個 HASH partition、單 partition 50-100GB</li>
</ul>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-autovacuum-tuning-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a> 整合</h3>
<p>partitioning 是 autovacuum 問題的長期解：</p>
<ol>
<li>Hot partition autovacuum 緊（scale_factor 0.05、cost_limit 5000）</li>
<li>Cold partition <code>autovacuum_enabled = false</code></li>
<li>但 partition 數爆會把 <code>autovacuum_max_workers</code> 跑滿、需要拉</li>
</ol>
<h3 id="跟-index-設計整合">跟 index 設計整合</h3>
<p>partition table 的 index 處理：</p>
<ol>
<li>PG 11+ 全域 index：<code>CREATE INDEX ON partitioned_table (...)</code> 自動在每 partition 建 local index</li>
<li><strong>不存在跨 partition unique</strong> — 只能 partition-local</li>
<li><strong>partition-wise index scan</strong>：PG 11+ 跟 partition-wise join 一起、index lookup 平行</li>
</ol>
<h3 id="跟-backup--pitr">跟 backup / PITR</h3>
<p>partition 不是 backup 替代品 — 但能加速 <em>partial restore</em>：</p>
<ol>
<li>只 restore 特定時段的 partition、不用 restore 整個表</li>
<li>對應 <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL archiving</a> 的 partial recovery scenario</li>
</ol>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Sub-partitioning</strong>：partition 內再 partition（時間 + tenant）、適合 multi-tenant + 時間序列</li>
<li><strong>pg_partman extension</strong>：自動建月 partition、不用 cron</li>
<li><strong>Foreign key to partitioned table</strong> (PG 12+)：跨 partition FK enforce、但 cascade 限制多</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>上游 chapter：<a href="/blog/backend/01-database/schema-design/" data-link-title="1.2 Schema Design 與資料建模" data-link-desc="整理 table、index、key、partition、denormalization 與命名規則">Schema Design</a> — partition 是 schema 決策</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a> / <a href="/blog/backend/01-database/vendors/postgresql/timescaledb-deep-dive/" data-link-title="TimescaleDB Deep Dive：Hypertable / Continuous Aggregate / Compression 把 PG 變 Time-Series DB" data-link-desc="TimescaleDB 是 PG extension（不是 fork）、用 *hypertable* 自動 partition by time、加 *continuous aggregate* 做 incremental materialized view、加 *compression* 對舊 chunk 壓 90%&#43;、把 PG 變成 InfluxDB / Prometheus 級 time-series DB。本文走 hypertable 機制、continuous aggregate 跟普通 MV 差異、compression policy、retention policy、5 production 踩雷（chunk size 不對 / CAGG refresh 落後 / compression 後 update 限制 / hypertable 不能加 FK / TimescaleDB 跟 PG 主版本對齊）、跟 PG 原生 partitioning 對比">TimescaleDB Deep Dive</a>（hypertable 是 partition 自動化）</li>
<li>後續路由：<a href="/blog/backend/01-database/vendors/postgresql/partition-redesign/" data-link-title="PostgreSQL Partition Redesign：當 monthly partition 越跑越慢" data-link-desc="PostgreSQL partition redesign 是 Type F「topology re-layout」第 2 個 dogfood — 從 monthly partition 改 daily / 從 range 改 list / 從單軸改 sub-partition；6 維 audit 皆 Low &#43; topology 軸 High；涵蓋 partition 不平衡偵測、ATTACH/DETACH 線上重劃、5 個 production 踩雷、跟 partition_pruning &#43; autovacuum 整合">Partition Redesign</a>（重排 partition strategy 的 migration playbook）</li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Logical Replication + Debezium CDC：replication slot × failure × recovery 對照</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 提到 logical decoding / Debezium CDC、本文聚焦 &lt;em>replication slot 生命週期 + 5 個 production failure mode 跟 recovery&lt;/em> 的對照。&lt;/p>&lt;/blockquote>
&lt;h2 id="replication-slot--failure--recovery-對照">Replication slot × Failure × Recovery 對照&lt;/h2>
&lt;p>Logical replication 跟 Debezium CDC 的 production 議題集中在 &lt;em>replication slot&lt;/em> — 它是 PostgreSQL 內保證 WAL 不被回收的 anchor point；slot 設不對、整個 CDC pipeline 失效。各 failure mode 對 slot 的影響跟 recovery 路徑：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Failure mode&lt;/th>
 &lt;th>對 slot 影響&lt;/th>
 &lt;th>Primary 端徵兆&lt;/th>
 &lt;th>Recovery 路徑&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Consumer 卡住 / lag&lt;/td>
 &lt;td>slot LSN 不前進、WAL 留著&lt;/td>
 &lt;td>&lt;code>pg_wal&lt;/code> 目錄持續長大、disk 撐爆&lt;/td>
 &lt;td>修 consumer / 加 throttle / 必要時 drop slot&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Consumer crash 無 restart&lt;/td>
 &lt;td>slot 留在 active state&lt;/td>
 &lt;td>跟 lag 同、不會自動清&lt;/td>
 &lt;td>手動 &lt;code>SELECT pg_drop_replication_slot('name')&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema change（ADD COLUMN）&lt;/td>
 &lt;td>多數 plugin 自動處理、無感&lt;/td>
 &lt;td>通常無感&lt;/td>
 &lt;td>-&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema change（DROP / RENAME COLUMN）&lt;/td>
 &lt;td>多數 plugin 直接斷&lt;/td>
 &lt;td>Consumer log 報錯、slot active 卻不前進&lt;/td>
 &lt;td>重建 publication / 重 init load&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Initial COPY&lt;/td>
 &lt;td>slot 建立時跑 snapshot、long-running tx&lt;/td>
 &lt;td>大表 COPY 期間鎖跟 WAL 都受影響&lt;/td>
 &lt;td>用 &lt;code>CREATE_REPLICATION_SLOT ... NOEXPORT_SNAPSHOT&lt;/code> 分階段&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Promotion (failover)&lt;/td>
 &lt;td>physical slot 跟 logical slot 處理不同&lt;/td>
 &lt;td>logical slot 在 PG 16- 不跨 failover&lt;/td>
 &lt;td>PG 16+ logical slot 持久化、或 consumer 重 init load&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replay storm（offset 重置）&lt;/td>
 &lt;td>slot 不變、consumer 重讀&lt;/td>
 &lt;td>Kafka 端流量爆、application 看 duplicate&lt;/td>
 &lt;td>Idempotent consumer 設計、或 transactional outbox&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>每個 failure mode 對應的詳細配置 + recovery 步驟、下面分段展開。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 提到 logical decoding / Debezium CDC、本文聚焦 <em>replication slot 生命週期 + 5 個 production failure mode 跟 recovery</em> 的對照。</p></blockquote>
<h2 id="replication-slot--failure--recovery-對照">Replication slot × Failure × Recovery 對照</h2>
<p>Logical replication 跟 Debezium CDC 的 production 議題集中在 <em>replication slot</em> — 它是 PostgreSQL 內保證 WAL 不被回收的 anchor point；slot 設不對、整個 CDC pipeline 失效。各 failure mode 對 slot 的影響跟 recovery 路徑：</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>對 slot 影響</th>
          <th>Primary 端徵兆</th>
          <th>Recovery 路徑</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Consumer 卡住 / lag</td>
          <td>slot LSN 不前進、WAL 留著</td>
          <td><code>pg_wal</code> 目錄持續長大、disk 撐爆</td>
          <td>修 consumer / 加 throttle / 必要時 drop slot</td>
      </tr>
      <tr>
          <td>Consumer crash 無 restart</td>
          <td>slot 留在 active state</td>
          <td>跟 lag 同、不會自動清</td>
          <td>手動 <code>SELECT pg_drop_replication_slot('name')</code></td>
      </tr>
      <tr>
          <td>Schema change（ADD COLUMN）</td>
          <td>多數 plugin 自動處理、無感</td>
          <td>通常無感</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Schema change（DROP / RENAME COLUMN）</td>
          <td>多數 plugin 直接斷</td>
          <td>Consumer log 報錯、slot active 卻不前進</td>
          <td>重建 publication / 重 init load</td>
      </tr>
      <tr>
          <td>Initial COPY</td>
          <td>slot 建立時跑 snapshot、long-running tx</td>
          <td>大表 COPY 期間鎖跟 WAL 都受影響</td>
          <td>用 <code>CREATE_REPLICATION_SLOT ... NOEXPORT_SNAPSHOT</code> 分階段</td>
      </tr>
      <tr>
          <td>Promotion (failover)</td>
          <td>physical slot 跟 logical slot 處理不同</td>
          <td>logical slot 在 PG 16- 不跨 failover</td>
          <td>PG 16+ logical slot 持久化、或 consumer 重 init load</td>
      </tr>
      <tr>
          <td>Replay storm（offset 重置）</td>
          <td>slot 不變、consumer 重讀</td>
          <td>Kafka 端流量爆、application 看 duplicate</td>
          <td>Idempotent consumer 設計、或 transactional outbox</td>
      </tr>
  </tbody>
</table>
<p>每個 failure mode 對應的詳細配置 + recovery 步驟、下面分段展開。</p>
<h2 id="logical-replication-基礎publication--subscription--slot">Logical replication 基礎：publication + subscription + slot</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Primary：建 publication
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="p">,</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Subscriber：建 subscription（自動建 replication slot）
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SUBSCRIPTION</span><span class="w"> </span><span class="n">app_sub</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">CONNECTION</span><span class="w"> </span><span class="s1">&#39;host=primary user=replicator dbname=app&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">  </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">slot_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;app_sub_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">copy_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">true</span><span class="p">);</span></span></span></code></pre></div><p>關鍵物件：</p>
<ul>
<li><strong>publication</strong>（primary 端）：宣告 <em>哪些表 + 哪些操作（INSERT/UPDATE/DELETE/TRUNCATE）</em> 對外暴露</li>
<li><strong>subscription</strong>（subscriber 端、若是 PG-to-PG）：訂閱 + 自動建 slot + 自動 initial COPY</li>
<li><strong>replication slot</strong>：primary 端、保證 <em>consumer 還沒消費的 WAL</em> 不被回收</li>
</ul>
<p><code>copy_data = true</code> 觸發 initial COPY（snapshot）+ 後續 streaming；<code>copy_data = false</code> 只 streaming、適合 already-in-sync 場景。</p>
<h2 id="debezium-cdc用-logical-replication-slot-但繞過-subscription">Debezium CDC：用 logical replication slot 但繞過 subscription</h2>
<p>Debezium 不是 PostgreSQL subscriber、是 <em>直接讀 replication slot</em> 的外部 consumer：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-properties" data-lang="properties"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Debezium PostgreSQL connector</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="na">connector.class</span><span class="o">=</span><span class="s">io.debezium.connector.postgresql.PostgresConnector</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="na">database.hostname</span><span class="o">=</span><span class="s">primary</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">database.dbname</span><span class="o">=</span><span class="s">app</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">plugin.name</span><span class="o">=</span><span class="s">pgoutput                            # 內建、PG 10+ 推薦</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">slot.name</span><span class="o">=</span><span class="s">debezium_app</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">publication.name</span><span class="o">=</span><span class="s">app_changes</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="na">publication.autocreate.mode</span><span class="o">=</span><span class="s">filtered            # debezium 自動建 publication</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="na">table.include.list</span><span class="o">=</span><span class="s">public.orders,public.events</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="na">snapshot.mode</span><span class="o">=</span><span class="s">initial                            # 起始 snapshot 後 streaming</span></span></span></code></pre></div><p>差異：</p>
<ul>
<li>Debezium 用 <code>pgoutput</code>（PG 10+ 內建）或 <code>wal2json</code>（外掛 plugin）解 WAL、轉成結構化事件送 Kafka</li>
<li>不像 PG-to-PG subscription、Debezium 沒 subscription object、是 <em>外部 consumer 自管</em> replication slot</li>
<li>Failure mode 上 <em>consumer 端是 Debezium 自己</em>、所以 lag 來源是 Debezium 處理速度 / Kafka 寫入速度</li>
</ul>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1consumer-lagslot-lsn-不前進primary-disk-爆">Case 1：consumer lag、slot LSN 不前進、primary disk 爆</h3>
<p><strong>徵兆</strong>：primary <code>pg_wal</code> 目錄持續長大、<code>df -h</code> 看磁碟 90%+；<code>pg_replication_slots</code> 看 <code>confirmed_flush_lsn</code> 卡在某 LSN、<code>pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)</code> 數十 GB。</p>
<p><strong>根因</strong>：consumer（Debezium / subscriber）處理慢於 primary 寫入；replication slot <em>保證 WAL 不回收</em>、但 consumer 沒消費 → WAL 堆積。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>監測</strong>：Prometheus alert <code>pg_replication_slot_lag_bytes &gt; 5GB</code> 觸發前 catch</li>
<li><strong>修 consumer</strong>：throttle primary 寫入 OR scale Debezium / subscriber 處理能力</li>
<li><strong>緊急</strong>：<code>SELECT pg_drop_replication_slot('debezium_app')</code> 釋放 WAL — 但 consumer 必須重 init load（資料缺一塊）</li>
<li><strong>架構</strong>：用 <em>max_slot_wal_keep_size</em>（PG 13+）設 slot 能保留 WAL 上限、超出自動 invalidate slot、保護 primary disk</li>
</ol>
<h3 id="case-2consumer-crash-後-slot-變-zombie">Case 2：consumer crash 後 slot 變 zombie</h3>
<p><strong>徵兆</strong>：Debezium pod OOM crash、新 pod 起來時報 <code>slot is active for PID X</code>、無法 attach；primary 端 <code>pg_replication_slots.active = true</code>、<code>active_pid</code> 指向已經死掉的 process。</p>
<p><strong>根因</strong>：PostgreSQL 把 slot 標 active 是基於 <em>當下有 connection</em>；consumer crash 但 connection 沒被 server 端發現（network 沒 RST）、slot 留在 active state。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 手動清 zombie slot
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_terminate_backend</span><span class="p">(</span><span class="n">active_pid</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="n">slot_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;debezium_app&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">active</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 或直接 drop（會丟資料、consumer 要重 init）
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_drop_replication_slot</span><span class="p">(</span><span class="s1">&#39;debezium_app&#39;</span><span class="p">);</span></span></span></code></pre></div><p>預防：</p>
<ol>
<li>PostgreSQL <code>tcp_keepalives_idle / interval / count</code> 設較短（300 / 60 / 6）、network drop 較快被發現</li>
<li>Consumer 端用 <em>graceful shutdown</em> + <code>pg_terminate_backend(active_pid)</code> 在 startup 前主動清 stale connection</li>
</ol>
<h3 id="case-3schema-changedrop--rename-column斷流">Case 3：schema change（DROP / RENAME COLUMN）斷流</h3>
<p><strong>徵兆</strong>：Debezium consumer 突然停 produce 訊息、log 報 <code>column XYZ does not exist</code>；primary 端 slot 還 active、但 <code>confirmed_flush_lsn</code> 不前進。</p>
<p><strong>根因</strong>：pgoutput plugin 把 WAL 解成 row event 時、用的 schema 是 <em>當下 catalog</em>；如果中間 DROP COLUMN、之前 WAL 內的 row event 含已不存在欄位、解析失敗。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>預防</strong>：schema change 走 <em>expand-contract pattern</em>
<ul>
<li>Phase 1: ADD COLUMN new_col（不影響 logical replication）</li>
<li>Phase 2: application 雙寫 old + new</li>
<li>Phase 3: 等 consumer catch up old column 訊息</li>
<li>Phase 4: DROP COLUMN old_col（此時無 in-flight WAL 帶 old_col）</li>
</ul>
</li>
<li><strong>緊急</strong>：DROP existing slot、重建 publication 跟 slot、consumer 重 init load</li>
<li><strong>長期</strong>：用 Debezium <em>snapshot.mode=schema_only_recovery</em> 在 schema 變動時不重灌資料、只 reset schema</li>
</ol>
<h3 id="case-4initial-copy-大表鎖太久">Case 4：initial COPY 大表鎖太久</h3>
<p><strong>徵兆</strong>：對 1TB 表跑 <code>CREATE SUBSCRIPTION ... WITH (copy_data=true)</code> 後、application 對該表 query / write 阻塞 30+ 分鐘；application timeout 大量。</p>
<p><strong>根因</strong>：initial COPY 默認跑在 <em>single transaction</em>、整個 snapshot LSN 鎖住、長 transaction 跟 vacuum 衝突；同時對 subscriber 端鎖表寫入。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>分階段 init</strong>：</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Primary：建 publication 不 copy
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">big_table</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Subscriber：建 subscription 不 copy
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SUBSCRIPTION</span><span class="w"> </span><span class="n">app_sub</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="k">CONNECTION</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">  </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">  </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">copy_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">false</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 手動跑 partition-by-partition COPY（若是 partition table）
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1">-- 或用 pg_dump / pg_basebackup 拿 snapshot</span></span></span></code></pre></div><ol start="2">
<li><strong>PG 16+ parallel init</strong>：<code>max_sync_workers_per_subscription = 4</code> 平行 COPY 多個表</li>
<li><strong>Debezium replacement</strong>：用 incremental snapshot（Debezium 1.6+）、background trickle copy、不鎖長 transaction</li>
</ol>
<h3 id="case-5replay-storm-後-consumer-offset-reset">Case 5：replay storm 後 consumer offset reset</h3>
<p><strong>徵兆</strong>：Debezium 修 bug / 重 deploy 後、<code>snapshot.mode=initial</code> 觸發整個資料重灌；Kafka topic 流量爆 10x、下游 application 看到大量 duplicate event。</p>
<p><strong>根因</strong>：Debezium offset store（Kafka topic 或 file）被誤刪 / corruption；重啟時不知道從哪 LSN 開始、預設 fall back 到 initial snapshot。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>預防</strong>：Debezium offset store 跟 Kafka cluster <em>backup 一起做</em>、不要單獨依賴 Kafka topic</li>
<li><strong>架構</strong>：consumer side 設計 <em>idempotent</em> — 用 event 自帶的 (source LSN + transaction ID) 當 dedupe key</li>
<li><strong>transactional outbox pattern</strong>：CDC 只 capture outbox 表、application 主動寫 outbox + business data 在同 transaction；duplicate 由 application 自己 dedupe</li>
</ol>
<h2 id="容量規劃">容量規劃</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>估算</th>
          <th>警戒</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Replication slot lag</td>
          <td><code>pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)</code></td>
          <td>&gt; 1GB lag 訊號 consumer 跟不上</td>
      </tr>
      <tr>
          <td>Primary <code>pg_wal</code> size</td>
          <td>retention × peak WAL rate</td>
          <td>預留 disk 容量 = max_slot_wal_keep_size + 30% buffer</td>
      </tr>
      <tr>
          <td>Debezium throughput</td>
          <td>~5-10K row/s 單 connector、多表平行可拉</td>
          <td>跟 primary write rate 對比</td>
      </tr>
      <tr>
          <td>Initial COPY time</td>
          <td>100GB ~ 10-30 分鐘（看 network + subscriber IO）</td>
          <td>TB 級必須分階段</td>
      </tr>
      <tr>
          <td>Slot 數量</td>
          <td>每 slot 佔 primary 一份 WAL 保留 buffer</td>
          <td>5+ slot 同時跑 disk 壓力倍增</td>
      </tr>
      <tr>
          <td>max_replication_slots</td>
          <td>預設 10、production 跑 CDC + standby 各佔 slot 要拉到 20-50</td>
          <td>達上限會拒新 slot 建立</td>
      </tr>
  </tbody>
</table>
<p>實務 default：</p>
<ul>
<li>Debezium production：1 connector per source schema、不要 1 connector 跨 50 個表</li>
<li>Slot retention：<code>max_slot_wal_keep_size = 100GB</code>、超出 invalidate slot 保護 primary</li>
<li>Monitor cadence：1 分鐘 sample lag + 5 分鐘 alert threshold</li>
</ul>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-patroni-ha-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> 整合</h3>
<p>logical slot 在 PG 16- 不跨 failover、是長期痛點：</p>
<ol>
<li><strong>PG 16-</strong>：failover 後 logical consumer 必須重 init（slot 在新 leader 上不存在）</li>
<li><strong>PG 16+</strong>：<code>failover</code> parameter 讓 logical slot 在 standby 同步、failover 後 consumer 直接接</li>
<li>Patroni 16+ 支援 logical slot persistence 配置、配合用</li>
</ol>
<h3 id="跟-kafka-outbox-pattern">跟 Kafka outbox pattern</h3>
<p>production-grade CDC 不直接 read business table、是 read <em>outbox table</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Application transaction
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(...)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(...);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">outbox</span><span class="w"> </span><span class="p">(</span><span class="n">event_type</span><span class="p">,</span><span class="w"> </span><span class="n">payload</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;order_created&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">now</span><span class="p">());</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">COMMIT</span><span class="p">;</span></span></span></code></pre></div><p>Debezium 只 capture outbox table、event payload 已是 application-shaped JSON、不用解 row event。好處：</p>
<ol>
<li>Schema change 不影響 CDC（outbox table schema 穩定）</li>
<li>跨表 transaction 對應到單 event（outbox 是業務語意層）</li>
<li>Replay 可靠 — outbox 是 append-only、可重讀</li>
</ol>
<h3 id="跟-partitioning-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">partitioning</a> 整合</h3>
<p>partitioned table 的 logical replication：</p>
<ol>
<li>PG 13+ <code>publish_via_partition_root = true</code> — publication 從 parent 角度看、不是 per-partition</li>
<li>Subscriber 端可 partition 不同 strategy（甚至不 partition）</li>
<li>Schema change 對 partition table 更複雜、走 expand-contract 嚴格</li>
</ol>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Logical replication conflict</strong>：subscriber 端寫衝突的處理（PG 17+ 加 conflict resolution）</li>
<li><strong>bi-directional replication（pg_active）</strong>：多 region active-active、衝突解決設計</li>
<li><strong>Decoder plugin 對比</strong>：pgoutput / wal2json / decoderbufs 效能跟易用性</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>上游 chapter：<a href="/blog/backend/01-database/schema-migration-rollout-evidence/" data-link-title="1.7 Schema Migration Rollout 證據（Schema Migration Rollout Evidence）實作示範" data-link-desc="以訂單付款狀態欄位演進示範 schema migration 如何產出 evidence、release gate 與 incident decision log。">Schema Migration Rollout Evidence</a> — schema change × CDC 對應</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> / <a href="/blog/backend/01-database/vendors/postgresql/replication-slot-management/" data-link-title="PostgreSQL Replication Slot Management：Physical / Logical / Failover Slot 治理" data-link-desc="PG replication slot 是 *primary 端的 standby 進度紀錄*、防 WAL premature deletion。但 orphan slot 會吃 disk、failover 後 logical slot 不會自動跟新 primary、是 PG 操作的 hidden complexity。本文走 physical / logical slot 差異、slot lifecycle、failover slot synchronization（PG 17&#43; 新特性）、orphan slot 治理、5 production 踩雷（orphan slot disk 爆 / logical slot lag / failover 後 slot 丟 / wal_keep_size 跟 slot 衝突 / connection 同時打 slot 數量限制）">Replication Slot Management</a>（slot lifecycle / orphan / failover sync）/ <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>（streaming + LSN 基礎）</li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL PITR + WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 backup / recovery 是 OLTP 必備能力、本文聚焦 &lt;em>PITR（Point-In-Time Recovery）的雙軌資料設計 + production 5 個 failure mode&lt;/em>。&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>Logical bug 在 production 部署、執行 6 小時後才發現 — 某個 batch job 把 50 萬筆 user.email 改成 NULL。此時：&lt;/p>
&lt;ul>
&lt;li>還原最新 daily backup（昨晚）→ 丟掉今天所有正常寫入（訂單、註冊）&lt;/li>
&lt;li>從 standby promote → standby 已同步 bug、跟 primary 同狀態&lt;/li>
&lt;li>從 application log 重建 → 部分操作不可逆（已寄出 email）&lt;/li>
&lt;/ul>
&lt;p>PITR 是這類 &lt;em>logical disaster&lt;/em> 的標準解 — 不還原到 backup 時間點、而是 &lt;em>還原到 bug 發生前一刻&lt;/em>（例：1 分鐘前）。需要 &lt;em>base backup + WAL archive&lt;/em> 雙軌資料：base backup 是 snapshot、WAL archive 是 snapshot 之後的所有寫入；recovery 時 replay WAL 到指定 timestamp / LSN / transaction ID。&lt;/p>
&lt;h2 id="核心概念base-backup--wal-archive-的雙軌設計">核心概念：base backup + WAL archive 的雙軌設計&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">[Base backup t0] + [WAL archive t0 → now]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓ ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> 全量 snapshot incremental log
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> ↓ ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> └────── recover to t_target ──→ [restored cluster at t_target]&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>兩個軌道各自獨立但必須對齊：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Base backup&lt;/strong>：某時刻整個 data dir 的 snapshot。&lt;code>pg_basebackup&lt;/code> / &lt;code>pgBackRest&lt;/code> / &lt;code>WAL-G&lt;/code> 都產這個；通常 &lt;em>每天 / 每週&lt;/em> 跑一次&lt;/li>
&lt;li>&lt;strong>WAL archive&lt;/strong>：base backup 之後每段 WAL 都 push 到外部 storage（S3 / GCS / NFS）。&lt;code>archive_command&lt;/code> 觸發、PostgreSQL 等到 archive 成功才 &lt;em>回收&lt;/em> 那段 WAL&lt;/li>
&lt;/ol>
&lt;p>兩者組合決定 RPO（recovery point objective）：&lt;/p>
&lt;ul>
&lt;li>RPO ≈ WAL archive frequency（streaming 即時、&lt;code>archive_timeout&lt;/code> 預設 1 分鐘）&lt;/li>
&lt;li>RPO 不是 base backup frequency — daily base backup + 每分鐘 archive WAL → RPO 1 分鐘&lt;/li>
&lt;/ul>
&lt;p>RTO（recovery time objective）跟 &lt;em>base backup size + WAL replay 量&lt;/em> 相關：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 backup / recovery 是 OLTP 必備能力、本文聚焦 <em>PITR（Point-In-Time Recovery）的雙軌資料設計 + production 5 個 failure mode</em>。</p></blockquote>
<h2 id="問題情境">問題情境</h2>
<p>Logical bug 在 production 部署、執行 6 小時後才發現 — 某個 batch job 把 50 萬筆 user.email 改成 NULL。此時：</p>
<ul>
<li>還原最新 daily backup（昨晚）→ 丟掉今天所有正常寫入（訂單、註冊）</li>
<li>從 standby promote → standby 已同步 bug、跟 primary 同狀態</li>
<li>從 application log 重建 → 部分操作不可逆（已寄出 email）</li>
</ul>
<p>PITR 是這類 <em>logical disaster</em> 的標準解 — 不還原到 backup 時間點、而是 <em>還原到 bug 發生前一刻</em>（例：1 分鐘前）。需要 <em>base backup + WAL archive</em> 雙軌資料：base backup 是 snapshot、WAL archive 是 snapshot 之後的所有寫入；recovery 時 replay WAL 到指定 timestamp / LSN / transaction ID。</p>
<h2 id="核心概念base-backup--wal-archive-的雙軌設計">核心概念：base backup + WAL archive 的雙軌設計</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[Base backup t0]  +  [WAL archive t0 → now]
</span></span><span class="line"><span class="ln">2</span><span class="cl">     ↓                       ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">  全量 snapshot          incremental log
</span></span><span class="line"><span class="ln">4</span><span class="cl">     ↓                       ↓
</span></span><span class="line"><span class="ln">5</span><span class="cl">     └────── recover to t_target ──→ [restored cluster at t_target]</span></span></code></pre></div><p>兩個軌道各自獨立但必須對齊：</p>
<ol>
<li><strong>Base backup</strong>：某時刻整個 data dir 的 snapshot。<code>pg_basebackup</code> / <code>pgBackRest</code> / <code>WAL-G</code> 都產這個；通常 <em>每天 / 每週</em> 跑一次</li>
<li><strong>WAL archive</strong>：base backup 之後每段 WAL 都 push 到外部 storage（S3 / GCS / NFS）。<code>archive_command</code> 觸發、PostgreSQL 等到 archive 成功才 <em>回收</em> 那段 WAL</li>
</ol>
<p>兩者組合決定 RPO（recovery point objective）：</p>
<ul>
<li>RPO ≈ WAL archive frequency（streaming 即時、<code>archive_timeout</code> 預設 1 分鐘）</li>
<li>RPO 不是 base backup frequency — daily base backup + 每分鐘 archive WAL → RPO 1 分鐘</li>
</ul>
<p>RTO（recovery time objective）跟 <em>base backup size + WAL replay 量</em> 相關：</p>
<ul>
<li>Restore base backup ~ 1-4 小時（TB 級）</li>
<li>WAL replay 時間 ~ archive 累積量 / replay throughput</li>
</ul>
<h2 id="step-by-step-配置">Step-by-step 配置</h2>
<h3 id="primaryarchive_command-設好">Primary：archive_command 設好</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">wal_level</span> <span class="o">=</span> <span class="s">replica                          # 預設 replica、PITR 需要</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">archive_mode</span> <span class="o">=</span> <span class="s">on                            # 啟用 archive</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">archive_command</span> <span class="o">=</span> <span class="s">&#39;wal-g wal-push %p&#39;        # 或 pgBackRest / 自寫 script</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">archive_timeout</span> <span class="o">=</span> <span class="s">60                         # 60s 無 WAL 時強制切 segment</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">max_wal_size</span> <span class="o">=</span> <span class="s">4GB</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">checkpoint_timeout</span> <span class="o">=</span> <span class="s">15min</span></span></span></code></pre></div><p><code>archive_command</code> 必須 <em>回 exit code 0 才算成功</em>；非 0 PostgreSQL retry、retry 失敗會在 <code>pg_wal</code> 堆積 WAL 直到 disk 滿。<strong>critical：archive_command 不能寫成 silent-fail</strong>。</p>
<h3 id="用-pgbackrest-取代手寫-script">用 pgBackRest 取代手寫 script</h3>
<p>production 強烈不建議自寫 archive script — pgBackRest / WAL-G / Barman 處理過所有 edge case：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># pgbackrest.conf</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">[global]</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="na">repo1-type</span><span class="o">=</span><span class="s">s3</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">repo1-s3-bucket</span><span class="o">=</span><span class="s">mybucket</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">repo1-s3-region</span><span class="o">=</span><span class="s">us-east-1</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">repo1-retention-full</span><span class="o">=</span><span class="s">4                       # 留 4 個 full backup</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">repo1-retention-diff</span><span class="o">=</span><span class="s">8                       # 留 8 個 differential</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="na">repo1-cipher-type</span><span class="o">=</span><span class="s">aes-256-cbc                # encrypt at rest</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="na">process-max</span><span class="o">=</span><span class="s">8                                # parallel restore</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="k">[main]</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="na">pg1-path</span><span class="o">=</span><span class="s">/var/lib/postgresql/16/main</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 跑 full backup</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgbackrest --stanza<span class="o">=</span>main backup --type<span class="o">=</span>full
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># archive_command 用 pgbackrest 內建</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="nv">archive_command</span> <span class="o">=</span> <span class="s1">&#39;pgbackrest --stanza=main archive-push %p&#39;</span></span></span></code></pre></div><p>pgBackRest 處理：parallel push、compression、encryption、checksum、archive replay timing、backup catalog、retention 自動清理。</p>
<h3 id="restorerecovery_target_time">Restore：recovery_target_time</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. 從 S3 / repo 拉 base backup</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgbackrest --stanza<span class="o">=</span>main --type<span class="o">=</span><span class="nb">time</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --target<span class="o">=</span><span class="s2">&#34;2026-05-18 14:30:00+00&#34;</span> restore
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 2. PostgreSQL 進 recovery mode、自動 replay WAL 到 target time</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># (pgBackRest 寫好 recovery.signal + postgresql.auto.conf)</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"># 3. 確認到目標 timestamp 後、promote</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl">pg_ctl promote</span></span></code></pre></div><p>Recovery target 三種：</p>
<ul>
<li><strong><code>recovery_target_time</code></strong>：到某 timestamp</li>
<li><strong><code>recovery_target_xid</code></strong>：到某 transaction ID（log 有 xid 才好定位）</li>
<li><strong><code>recovery_target_lsn</code></strong>：到某 WAL LSN（最精確、但需要事先記下 LSN）</li>
</ul>
<p>production 多用 timestamp、application log 有時間戳容易定位。</p>
<h2 id="故障演練--邊界-case">故障演練 / 邊界 case</h2>
<h3 id="case-1archive_command-靜默失敗">Case 1：archive_command 靜默失敗</h3>
<p><strong>徵兆</strong>：DBA 發現某 PITR test 時、最近 3 天的 WAL 在 S3 上沒有；但 PostgreSQL 沒 alert、<code>pg_wal</code> 也沒堆積（早就被回收？）。</p>
<p><strong>根因</strong>：archive_command 寫成 <code>aws s3 cp %p s3://bucket/... 2&gt;/dev/null</code> — 錯誤訊息被吞、exit code 卻是 0（cp 失敗但 redirect 後 shell wrapper 不傳 fail code）；PostgreSQL 以為成功、繼續 advance WAL pointer、舊 WAL 已回收、archive 上實際沒有。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>絕對不要靜默 exit code</strong>：archive_command 必須 <em>fail loud</em>、exit code 非 0</li>
<li><strong>用 pgBackRest / WAL-G</strong>、不自寫 shell 腳本</li>
<li><strong>monitoring</strong>：對 archive lag 寫 alert</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_last_archived_xact_time</span><span class="p">(),</span><span class="w"> </span><span class="n">now</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pg_last_archived_xact_time</span><span class="p">()</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">lag</span><span class="p">;</span></span></span></code></pre></div><p>alert if lag &gt; 5 minutes</p>
<ol start="4">
<li><strong>定期測試 restore</strong>：每月跑一次 PITR drill、實際從 archive restore + 驗證 timestamp</li>
</ol>
<h3 id="case-2wal-archive-lagprimary-disk-壓力">Case 2：WAL archive lag、primary disk 壓力</h3>
<p><strong>徵兆</strong>：<code>pg_wal</code> 目錄持續長大、<code>df -h</code> 90%+；<code>pg_stat_archiver</code> 顯示 <code>failed_count</code> 累積、<code>last_failed_time</code> 是 30 分鐘前；archive_command 寫不出去（S3 throttle / network 慢）。</p>
<p><strong>根因</strong>：archive_command 寫到 S3、但 S3 rate limit / connection timeout、PostgreSQL retry；WAL 一直在 <code>pg_wal</code> 不能回收、disk 持續長。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>預防</strong>：<code>archive_command</code> 內部 retry + parallel push（pgBackRest 自帶 <code>process-max</code>）</li>
<li><strong>alert</strong>：<code>pg_stat_archiver.failed_count</code> 增長 + primary disk usage &gt; 80%</li>
<li><strong>緊急</strong>：暫時改 archive_command 寫 local NFS / 其他 storage、等 S3 恢復再同步；不要直接 disable archive（會丟資料）</li>
<li><strong>架構</strong>：archive storage 至少跨 region 兩份、單一 storage 故障不影響 archive</li>
</ol>
<h3 id="case-3recovery-跑到-wrong-target-time">Case 3：recovery 跑到 wrong target time</h3>
<p><strong>徵兆</strong>：PITR 還原後資料看起來 <em>缺一塊</em>；DBA 後悔 — target time 設早了 30 分鐘、recovery 已 promote、後續 WAL 在新 timeline 上、回不去。</p>
<p><strong>根因</strong>：recovery 過程不可逆 — 一旦 promote 開新 timeline、舊 WAL 在新 timeline 上不會被 replay；想還原到更晚 timestamp 必須 <em>重新 restore base backup + WAL</em>。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong><code>recovery_target_action = pause</code></strong>（PG 13+）：到 target time 後 <em>暫停</em>、不自動 promote；DBA 手動 query 確認資料對才 promote</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">recovery_target_time</span> <span class="o">=</span> <span class="s">&#39;2026-05-18 14:30:00+00&#39;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">recovery_target_action</span> <span class="o">=</span> <span class="s">pause</span></span></span></code></pre></div><ol start="2">
<li><strong>多次 PITR 試錯</strong>：用 <em>獨立 staging cluster</em> restore、驗證 target time 對、再對 production 跑</li>
<li><strong>記錄 target time 來源</strong>：application log / event timestamp 多比對、避免時區錯亂（<code>+00</code> UTC 跟 local time 差）</li>
</ol>
<h3 id="case-4base-backup-過期未清storage-爆">Case 4：base backup 過期未清、storage 爆</h3>
<p><strong>徵兆</strong>：S3 backup bucket size 半年內從 200GB 漲到 5TB；DBA 才發現 retention 沒設、daily base backup 留 180 天。</p>
<p><strong>根因</strong>：archive_command 自寫腳本沒 retention 邏輯、或 pgBackRest 設了 <code>repo1-retention-full=180</code> 漏看；DB 容量本來就成長 + 每日 full backup 累積。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># pgBackRest retention：4 full + auto-expire archive</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">repo1-retention-full</span><span class="o">=</span><span class="s">4                         # 留 4 個 full backup</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">repo1-retention-diff</span><span class="o">=</span><span class="s">8                         # 留 8 個 differential</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">repo1-retention-archive</span><span class="o">=</span><span class="s">4                      # WAL archive 跟 full 對齊</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">repo1-retention-archive-type</span><span class="o">=</span><span class="s">full</span></span></span></code></pre></div><p>storage budgeting：</p>
<ul>
<li>daily full + diff + WAL archive ≈ 1-2x DB size / day</li>
<li>4-week retention → ~30-60x DB size storage</li>
<li>跨 region replication → 2-3x</li>
</ul>
<h3 id="case-5timeline-分歧後-recovery-模糊">Case 5：timeline 分歧後 recovery 模糊</h3>
<p><strong>徵兆</strong>：production 經歷一次 failover（Patroni promote）+ 之後又 PITR 一次；現在要再 PITR 到 failover 前一刻、archive 上有兩個 timeline、recovery target 搞不清要哪個。</p>
<p><strong>根因</strong>：每次 promote 開新 timeline ID（<code>.history</code> 檔）；archive storage 上同 LSN 可能對應不同 timeline；recovery target time 在分歧點附近、ambiguous。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong><code>recovery_target_timeline</code></strong> 明示要 follow 哪個 timeline</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">recovery_target_time</span> <span class="o">=</span> <span class="s">&#39;2026-05-15 10:00:00+00&#39;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">recovery_target_timeline</span> <span class="o">=</span> <span class="s">&#39;3&#39;                 # 要 follow timeline 3</span></span></span></code></pre></div><ol start="2">
<li><strong>熟悉 <code>.history</code> 檔</strong>：<code>/wal_archive/000000XX.history</code> 記錄 timeline 切換點、PITR 前先看</li>
<li><strong>預防</strong>：每次 promote 後 <em>立刻</em> 跑新的 base backup、簡化未來 PITR 流程（不用跨 timeline）</li>
</ol>
<h2 id="容量--cost-規劃">容量 / cost 規劃</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>估算</th>
          <th>警戒</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base backup size</td>
          <td>跟 DB data dir 大小成正比（PostgreSQL 內部 compression 後）</td>
          <td>每 backup ~ 0.5-1x DB size</td>
      </tr>
      <tr>
          <td>WAL archive size</td>
          <td>~5-50GB / day depending on write volume</td>
          <td>1TB DB / write-heavy 可能 100GB+ / day</td>
      </tr>
      <tr>
          <td>Storage retention</td>
          <td>4-12 weeks 典型</td>
          <td>30-60x DB size budget</td>
      </tr>
      <tr>
          <td>Base backup time</td>
          <td>TB 級 1-4 小時</td>
          <td>跑在 maintenance window</td>
      </tr>
      <tr>
          <td>Restore time</td>
          <td>base backup restore + WAL replay</td>
          <td>TB 級 PITR 通常 2-6 小時</td>
      </tr>
      <tr>
          <td>Network bandwidth</td>
          <td>full backup 期間 100-500 Mbps</td>
          <td>跨 region 注意 egress cost</td>
      </tr>
  </tbody>
</table>
<p>實務 default：</p>
<ul>
<li>Daily full backup + 4 weeks retention</li>
<li>WAL archive every 60s（<code>archive_timeout = 60</code>）</li>
<li>跨 region replication（S3 → S3 cross-region）</li>
<li>月度 restore drill 驗證可用</li>
</ul>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-patroni-ha-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> 整合</h3>
<p>Patroni 不管 backup，但 promotion 後 timeline 切換影響 archive：</p>
<ol>
<li>archive_command 用 <code>%t</code>（timeline）+ <code>%f</code>（filename）路徑、避免不同 timeline WAL 覆蓋</li>
<li>Patroni <code>recovery_conf</code> 包含 <code>restore_command</code>、standby clone 從 archive 拉</li>
<li>每次 Patroni failover 後跑 <em>full backup</em>、簡化未來 PITR</li>
</ol>
<h3 id="跟-logical-replication-對位">跟 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">logical replication</a> 對位</h3>
<p>PITR 跟 logical replication 服務不同 use case：</p>
<ul>
<li>PITR 是 <em>災難恢復</em>（logical bug / corruption）— 全量還原到某時刻</li>
<li>Logical replication 是 <em>連續 sync</em> — Kafka / 跨 DB 即時複製</li>
</ul>
<p>兩者 <em>都依賴 WAL</em>、但目標不同；同 PostgreSQL 可同時跑、互不衝突。</p>
<h3 id="跟-monitoring--alert">跟 monitoring + alert</h3>
<p>關鍵 metric：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- archive 健康度
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_archiver</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- archived_count, failed_count, last_archived_wal, last_archived_time
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- WAL 在 pg_wal 等待 archive 量
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_ls_waldir</span><span class="p">()</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">&#39;^[0-9A-F]{24}$&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- base backup 上次跑時間
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1">-- (pgBackRest API 或 backup catalog)</span></span></span></code></pre></div><p>Prometheus alert 三條：archive failed_count 增、archive lag &gt; 5min、base backup &gt; 25h 沒跑。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Incremental backup（PG 17+）</strong>：base backup 不全量、只 base + incremental</li>
<li><strong>Block-level differential</strong>：pgBackRest 已支援</li>
<li><strong>Cloud-native 替代</strong>：RDS / Aurora 用 storage-layer snapshot、不走 PITR 鏈</li>
<li><strong><code>pg_dump</code> vs PITR</strong>：pg_dump 是 logical backup（resume to different schema OK）、PITR 是 physical（必須同 version + same arch）</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>上游 chapter：<a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a> — PITR 是 migration 的失敗回退</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> / <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a></li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/major-version-upgrade/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/major-version-upgrade/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。寫作前判讀 &lt;em>不適用&lt;/em> &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> 的 5 type — 本文是該 methodology 「何時不該套」段的第 2 項實證（同 vendor major version upgrade）。&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼這篇不套-5-type-migration">為什麼這篇不套 5 type migration&lt;/h2>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> 對 PostgreSQL 14 → 17：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>評估&lt;/th>
 &lt;th>等級&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>同 PostgreSQL wire protocol、SQL syntax 99%+ 相容&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational model&lt;/td>
 &lt;td>同 PostgreSQL operational stack、tooling 不變&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Abstraction / paradigm&lt;/td>
 &lt;td>同 OLTP RDBMS&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Number of components&lt;/td>
 &lt;td>同 1 個&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>多數 application 不改&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>5 維皆 Low — 對映 Type B drop-in。但 &lt;em>實際工作量&lt;/em> 跟 drop-in 完全不同：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extension 相容性&lt;/strong>：pg14 的 extension 不一定能在 pg17 直接用（API 變動 / ABI break）&lt;/li>
&lt;li>&lt;strong>Breaking change&lt;/strong>：每個 major version 有 release-specific behavior change（pg17 移除 &lt;code>relation&lt;/code>/&lt;code>oid&lt;/code> 隱性 type、pg15 公開 &lt;code>pg_role&lt;/code> 規則變嚴）&lt;/li>
&lt;li>&lt;strong>Storage format&lt;/strong>：major version 之間 &lt;em>data dir 不向後相容&lt;/em>、必須 &lt;code>pg_upgrade&lt;/code> 或 dump-restore&lt;/li>
&lt;li>&lt;strong>Statistics 重建&lt;/strong>：upgrade 後 &lt;code>pg_statistic&lt;/code> 失效、必須跑 &lt;code>ANALYZE&lt;/code>、否則 query plan 退化&lt;/li>
&lt;li>&lt;strong>Replication slot&lt;/strong>：logical replication slot 不跨 major version&lt;/li>
&lt;/ul>
&lt;p>5 type 對映 &lt;em>跨 vendor process&lt;/em>、漏了 &lt;em>同 vendor 內升級&lt;/em> 的 upgrade-specific dimension。本文採用 &lt;em>deep article methodology 的 6-section + 額外 upgrade audit 段&lt;/em> 結構、不是 5 type 的任一個。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。寫作前判讀 <em>不適用</em> <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> 的 5 type — 本文是該 methodology 「何時不該套」段的第 2 項實證（同 vendor major version upgrade）。</p></blockquote>
<h2 id="為什麼這篇不套-5-type-migration">為什麼這篇不套 5 type migration</h2>
<p>跑 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> 對 PostgreSQL 14 → 17：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評估</th>
          <th>等級</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>同 PostgreSQL wire protocol、SQL syntax 99%+ 相容</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>同 PostgreSQL operational stack、tooling 不變</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>同 OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>同 1 個</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>多數 application 不改</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>5 維皆 Low — 對映 Type B drop-in。但 <em>實際工作量</em> 跟 drop-in 完全不同：</p>
<ul>
<li><strong>Extension 相容性</strong>：pg14 的 extension 不一定能在 pg17 直接用（API 變動 / ABI break）</li>
<li><strong>Breaking change</strong>：每個 major version 有 release-specific behavior change（pg17 移除 <code>relation</code>/<code>oid</code> 隱性 type、pg15 公開 <code>pg_role</code> 規則變嚴）</li>
<li><strong>Storage format</strong>：major version 之間 <em>data dir 不向後相容</em>、必須 <code>pg_upgrade</code> 或 dump-restore</li>
<li><strong>Statistics 重建</strong>：upgrade 後 <code>pg_statistic</code> 失效、必須跑 <code>ANALYZE</code>、否則 query plan 退化</li>
<li><strong>Replication slot</strong>：logical replication slot 不跨 major version</li>
</ul>
<p>5 type 對映 <em>跨 vendor process</em>、漏了 <em>同 vendor 內升級</em> 的 upgrade-specific dimension。本文採用 <em>deep article methodology 的 6-section + 額外 upgrade audit 段</em> 結構、不是 5 type 的任一個。</p>
<h2 id="結構-differentiatordeep-article--upgrade-audit">結構 differentiator：deep article + upgrade audit</h2>
<p>跟 single feature deep article（如 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer config</a> / <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a>）對照、本文多一段 <em>upgrade audit</em>；跟 migration playbook 對照、本文 <em>沒 phased translation / parallel run / cutover routing</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">問題情境（為什麼升）
</span></span><span class="line"><span class="ln">2</span><span class="cl">→ Upgrade audit（extension / breaking change / dependency）
</span></span><span class="line"><span class="ln">3</span><span class="cl">→ 升級方法選擇（pg_upgrade / logical / blue-green）
</span></span><span class="line"><span class="ln">4</span><span class="cl">→ Step-by-step 執行
</span></span><span class="line"><span class="ln">5</span><span class="cl">→ 故障演練
</span></span><span class="line"><span class="ln">6</span><span class="cl">→ Capacity / downtime trade-off
</span></span><span class="line"><span class="ln">7</span><span class="cl">→ 整合 / 下一步</span></span></code></pre></div><p>7 段、220-280 行。比 single feature deep article 多 1 段 audit、比 migration playbook 少 phased translation 章節。</p>
<h2 id="問題情境major-version-不只是-minor-bump">問題情境：major version 不只是 minor bump</h2>
<p>PostgreSQL major version（14 / 15 / 16 / 17）一年一版、每版含 <em>breaking change</em>、不是 minor bump。常見升級驅動：</p>
<ul>
<li><strong>EOL pressure</strong>：PostgreSQL 每版 maintained 5 年、pg14 EOL 2026-11；pg13 EOL 2025-11 已過、production 仍跑 pg13 是 risk</li>
<li><strong>新 feature 需求</strong>：pg15 MERGE / pg16 parallel hash join / pg17 incremental backup</li>
<li><strong>Cloud provider 強制</strong>：Aurora / RDS 對 EOL 版本停 minor patch、planned upgrade 不能拖</li>
</ul>
<p>不升級的代價：security patch 停發、新功能不能用、跟新 client / extension 漸增不相容。</p>
<h2 id="upgrade-audit">Upgrade audit</h2>
<p>升級前的硬閘門 audit、跳過任一個 production 必踩：</p>
<h3 id="audit-1extension-相容性">Audit 1：Extension 相容性</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">extname</span><span class="p">,</span><span class="w"> </span><span class="n">extversion</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">extname</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s1">&#39;plpgsql&#39;</span><span class="p">;</span></span></span></code></pre></div><p>對每個 extension 跑：</p>
<ol>
<li>對應 target version (pg17) 是否有 release？</li>
<li>ABI break？（如 PostGIS major version 對應 PG major version）</li>
<li>是否有 maintainer 持續更新？（TimescaleDB 已不 cover pg17 部分 feature）</li>
</ol>
<p>常見 pg14 → pg17 需要 <em>先升 extension</em> 的：PostGIS / TimescaleDB / pgaudit / pg_partman / pg_repack。</p>
<h3 id="audit-2breaking-change-pull">Audit 2：Breaking change pull</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 查 release note 累積 breaking change（pg14 → pg17 跨 3 個 major）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># pg15: deprecated public schema 預設 write 權限變嚴</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># pg16: regrole removed implicit casts</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># pg17: removed several deprecated columns from system catalogs</span></span></span></code></pre></div><p>對每個 breaking change：</p>
<ol>
<li>用 SQL grep / static analysis 找 application code 影響範圍</li>
<li>評估修改工作量（通常 50-95% 是 false alarm、5-10% 真實影響）</li>
<li>列出無法立刻修的、規劃 <em>逐 major 升</em> 而不是 <em>一次升 3 major</em></li>
</ol>
<h3 id="audit-3replication--logical-slot">Audit 3：Replication / logical slot</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">plugin</span><span class="p">,</span><span class="w"> </span><span class="n">slot_type</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="p">;</span></span></span></code></pre></div><p>major version upgrade 後：</p>
<ul>
<li><strong>Physical replication slot</strong>：standby 必須先升級到 <em>相同 major version</em> 才能跟新 primary</li>
<li><strong>Logical replication slot</strong>：<strong>不跨 major version</strong>、必須在 upgrade 前 drop、之後重建（消費者重 init load）</li>
<li>對應 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Debezium CDC</a> consumer 必須重 init</li>
</ul>
<h3 id="audit-4config-參數變更">Audit 4：Config 參數變更</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># diff postgresql.conf default 14 vs 17</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 重點: shared_preload_libraries / autovacuum_* / wal_level / synchronous_commit</span></span></span></code></pre></div><p>新 major version 預設值常變（pg14 → 17：<code>max_worker_processes</code> 預設變 / <code>unix_socket_directories</code> 行為差異）；自定 config 需逐項 review。</p>
<h3 id="audit-5statistics-重建計畫">Audit 5：Statistics 重建計畫</h3>
<p><code>pg_upgrade</code> 後 <code>pg_statistic</code> 重置、第一次跑 query plan 用空 stats、production 性能會塌；upgrade 計畫必須含：</p>
<ul>
<li><code>ANALYZE</code> 跑全 DB（小 DB ~10 分鐘、大 DB 1-3 小時）</li>
<li>多 stage <code>vacuumdb --analyze-in-stages</code> 先快速跑 baseline、再跑 full</li>
<li>Maintenance window 內預留 statistics 重建時間</li>
</ul>
<h2 id="升級方法選擇">升級方法選擇</h2>
<p>三種主流方法、依 downtime 容忍跟 DB 大小：</p>
<table>
  <thead>
      <tr>
          <th>方法</th>
          <th>Downtime</th>
          <th>風險</th>
          <th>適用</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pg_upgrade --link</code></td>
          <td>10-30 分鐘</td>
          <td>data dir 跟 OS package 同 host、回退複雜</td>
          <td>&lt; 500GB、可接受 30 分鐘 downtime</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>切換瞬間（&lt; 1 分鐘）</td>
          <td>設定複雜、long-running migration window</td>
          <td>TB 級、低 downtime 需求</td>
      </tr>
      <tr>
          <td>Blue-green deployment</td>
          <td>切換瞬間</td>
          <td>雙倍硬體、cutover 期間需嚴格 traffic shifting</td>
          <td>Cloud-managed（Aurora / RDS 內建）</td>
      </tr>
  </tbody>
</table>
<h3 id="pg_upgrade---link-流程"><code>pg_upgrade --link</code> 流程</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. install pg17 binary（不啟動）</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"># 2. stop pg14</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">sudo systemctl stop postgresql@14
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 3. 跑 pg_upgrade（hard link、不複製資料）</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">sudo -u postgres /usr/lib/postgresql/17/bin/pg_upgrade <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  --old-bindir<span class="o">=</span>/usr/lib/postgresql/14/bin <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --new-bindir<span class="o">=</span>/usr/lib/postgresql/17/bin <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  --old-datadir<span class="o">=</span>/var/lib/postgresql/14/main <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --new-datadir<span class="o">=</span>/var/lib/postgresql/17/main <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --link <span class="se">\
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="se"></span>  --jobs<span class="o">=</span><span class="m">8</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># 4. 啟動 pg17</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">sudo systemctl start postgresql@17
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"># 5. 跑 pg_upgrade 產出的 analyze script</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">sudo -u postgres /tmp/analyze_new_cluster.sh</span></span></code></pre></div><p><code>--link</code> 用 hard link、不複製 data dir、適合大 DB；缺點是 <em>回退到 pg14 不可能</em>（data dir 已被新 pg 修改）— 必須有完整 backup + tested restore。</p>
<h2 id="故障演練">故障演練</h2>
<h3 id="case-1extension-相容性沒先-auditupgrade-後啟動失敗">Case 1：Extension 相容性沒先 audit、upgrade 後啟動失敗</h3>
<p><strong>徵兆</strong>：pg_upgrade 跑完、<code>pg_ctl start</code> 失敗、log 顯示 <code>could not load library &quot;timescaledb-2.13.so&quot;</code>。</p>
<p><strong>根因</strong>：TimescaleDB 對應 pg14、pg17 需要 TimescaleDB 2.16+；pg_upgrade 階段沒 check、library path 找不到。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-upgrade audit</strong>：每個 extension 列出 target version 對應、預先升 extension（在 pg14 上跑、用 <code>ALTER EXTENSION ... UPDATE</code>）</li>
<li><strong>回退</strong>：data dir 用 <code>--link</code> 已不可逆、必須從 backup restore + 重試</li>
<li><strong>預防</strong>：staging 環境完整 dry-run、production upgrade 前已知 path 都驗證過</li>
</ol>
<h3 id="case-2application-用-deprecated-sql跑壞">Case 2：Application 用 deprecated SQL、跑壞</h3>
<p><strong>徵兆</strong>：upgrade 後某些 application query 直接 error <code>ERROR: type &quot;regtype&quot; does not have a cast</code>。</p>
<p><strong>根因</strong>：pg16 移除了某些隱性 cast、application code 用了 implicit cast、現在 explicit cast 才能跑。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-upgrade</strong>：跑 application test suite 對 pg17 staging、catch 不相容 query</li>
<li><strong>緊急</strong>：staging 找到的 query 在 production 改 application code、deploy 後再 upgrade DB</li>
<li><strong>長期</strong>：application code 用 ORM / query builder、避免 raw SQL 對 PG version-specific behavior 依賴</li>
</ol>
<h3 id="case-3analyze-沒跑production-query-性能崩">Case 3：<code>ANALYZE</code> 沒跑、production query 性能崩</h3>
<p><strong>徵兆</strong>：upgrade 後 5 分鐘、application latency p99 從 50ms 衝到 5000ms；query plan 從 index scan 退化到 seq scan。</p>
<p><strong>根因</strong>：<code>pg_upgrade</code> 重置 <code>pg_statistic</code>、planner 用空 stats 跑 plan、無法估 selectivity、保守選 seq scan。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># upgrade 完立刻跑 (順序)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">vacuumdb --all --analyze-in-stages --jobs<span class="o">=</span><span class="m">4</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># Stage 1: 最少 stats（快、~5 分鐘）</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Stage 2: 中 stats（~30 分鐘）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># Stage 3: 完整 stats（1-3 小時）</span></span></span></code></pre></div><p><code>--analyze-in-stages</code> 分 3 階段、第 1 階段就能讓 planner 做大致正確的決策；可在 maintenance window 內接受 stage 3 仍在跑。</p>
<h3 id="case-4logical-replication-slot-漏-dropdebezium-卡死">Case 4：Logical replication slot 漏 drop、Debezium 卡死</h3>
<p><strong>徵兆</strong>：upgrade 完開機後、Debezium connector log 顯示 <code>slot not found</code>、消費停滯；Kafka downstream 訊息斷流。</p>
<p><strong>根因</strong>：logical replication slot 不跨 major version、<code>pg_upgrade</code> 不自動處理 logical slot；upgrade 前沒 drop、新 cluster 上 slot 不存在。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-upgrade</strong>：列所有 logical replication slot、Debezium 暫停 consumer + drop slot</li>
<li><strong>Upgrade 後重建</strong>：用新 LSN starting position 建 slot、Debezium snapshot.mode=schema_only_recovery 取代 initial（避免重 init load）</li>
<li><strong>架構</strong>：未來考慮用 <em>outbox pattern</em>、CDC 只追 outbox 表、降低 logical slot 重建成本</li>
</ol>
<h3 id="case-5standby-沒同步升replication-斷">Case 5：Standby 沒同步升、replication 斷</h3>
<p><strong>徵兆</strong>：primary 升 pg17 後、standby 仍 pg14、replication 不通；<code>pg_stat_replication</code> 沒 standby connection。</p>
<p><strong>根因</strong>：streaming replication 不跨 major version；standby 必須 <em>先升</em> 或 <em>upgrade 後重 base backup</em>。</p>
<p><strong>修法</strong>：</p>
<p>兩種策略：</p>
<ol>
<li><strong>In-place upgrade standby</strong>：standby 也跑 <code>pg_upgrade</code>、但要先 stop streaming、升完重接（standby 端 archive_command + restore_command 對齊）</li>
<li><strong>Rebuild standby</strong>：upgrade primary 完、standby 跑 <code>pg_basebackup</code> 重建（適合 standby 容量小、network 快）</li>
</ol>
<p>Patroni HA 環境：用 <em>rolling upgrade</em> — 先升 sync standby、failover 過去、再升舊 primary 變新 standby。複雜度高、需要 staging 演練。</p>
<h2 id="capacity--downtime-trade-off">Capacity / downtime trade-off</h2>
<table>
  <thead>
      <tr>
          <th>方法</th>
          <th>Downtime 估算（500GB DB）</th>
          <th>硬體成本</th>
          <th>風險</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pg_upgrade --link</code></td>
          <td>15-30 分鐘（含 ANALYZE 1st stage）</td>
          <td>同當前</td>
          <td>高（不可逆）</td>
      </tr>
      <tr>
          <td><code>pg_upgrade --clone</code></td>
          <td>1-3 小時</td>
          <td>暫時 2x storage</td>
          <td>中</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>&lt; 1 分鐘 cutover</td>
          <td>暫時 2x compute + storage</td>
          <td>中（複雜）</td>
      </tr>
      <tr>
          <td>Blue-green</td>
          <td>切換瞬間（&lt; 30 秒）</td>
          <td>持續 2x（cutover 後可拆）</td>
          <td>低（cloud managed）</td>
      </tr>
  </tbody>
</table>
<p>實務 default：</p>
<ul>
<li>&lt; 100GB、可接受 30 分鐘 downtime：<code>pg_upgrade --link</code></li>
<li>100GB - 1TB、要求 &lt; 5 分鐘 downtime：logical replication（標準 PostgreSQL）</li>
<li>1TB+ 或 SLA 嚴格：blue-green via Aurora / RDS（cloud managed）</li>
</ul>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-patroni-ha-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> 整合</h3>
<p>HA cluster upgrade 流程：</p>
<ol>
<li>升新 standby（不在 cluster 中、physical / logical replicate 過去）</li>
<li>Promote 新 standby、舊 cluster failover 過去</li>
<li>重建剩餘 standby</li>
</ol>
<p>Patroni 17+ 支援 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">logical slot 跨 failover</a> — major version upgrade 期間 logical consumer 影響降低。</p>
<h3 id="跟-monitoring-整合">跟 monitoring 整合</h3>
<p>upgrade 期間特別關注的 metric：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Pre-upgrade baseline
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_database_size</span><span class="p">(</span><span class="s1">&#39;myapp&#39;</span><span class="p">),</span><span class="w"> </span><span class="k">version</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Post-upgrade verification
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_database_size</span><span class="p">(</span><span class="s1">&#39;myapp&#39;</span><span class="p">),</span><span class="w"> </span><span class="k">version</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_user_tables</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">last_analyze</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- 應該 = 0、若有未 analyze 表、ANALYZE 沒跑完</span></span></span></code></pre></div><p>Prometheus alert 三條：<code>pg_database_size</code> upgrade 後差異 &lt; 1%、<code>pg_stat_replication</code> lag &lt; 10s、<code>pg_query_p99_latency</code> 對 baseline &lt; 1.5x。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Aurora major version upgrade</strong>：blue-green deployment 是 default、流程跟 self-managed 完全不同、見 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a> 對位段</li>
<li><strong>Cross-major version skip upgrade</strong>：pg13 → pg17 跨 4 major、breaking change 累積、建議 <em>逐 major 升</em> 而不是 <em>single hop</em></li>
<li><strong>Extension lifecycle 管理</strong>：自動 audit extension 跟 PG version compatibility、每 quarter 跑 dry-run</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
<li>對位 migration：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a></li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a> / <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a>（本文驗證 <em>漏類</em>）</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL → Aurora Migration：protocol 相容、operational 重設計</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a>（self-managed source）跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&amp;#43;75% 效能改善的 production 證據">Aurora&lt;/a>（cloud-managed target）。跟前兩篇 migration（&lt;a href="https://tarrragon.github.io/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic&lt;/a> 高 schema 差 / &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &amp;#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB&lt;/a> drop-in）對照、本篇是 &lt;em>middle ground&lt;/em>：wire protocol drop-in、但 operational model 重設計。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關。&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼遷operational-cost--ha--dr-三條-driver">為什麼遷：operational cost / HA / DR 三條 driver&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>觸發場景&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>Operational cost&lt;/strong>&lt;/td>
 &lt;td>self-managed PostgreSQL + Patroni HA + pgBackRest backup + monitoring 需 0.5-2 FTE；Aurora 把這層責任轉嫁 AWS、SRE 專注 application&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>HA reliability&lt;/strong>&lt;/td>
 &lt;td>Patroni split-brain / DCS quorum 偶爾踩雷、production failover 4-15s；Aurora 自動 multi-AZ failover &amp;lt; 30s、shared storage 不丟資料&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>DR / backup&lt;/strong>&lt;/td>
 &lt;td>自管 PITR + cross-region replication 複雜；Aurora 內建 PITR + global database + backup retention 簡化&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>反向 driver（Aurora → self-managed）也存在 — 主要是 &lt;em>cost 在 10TB+ 規模時 Aurora 反而更貴&lt;/em>、或 &lt;em>需要 PostgreSQL extension Aurora 不支援&lt;/em>（pg_partman / pg_repack / TimescaleDB 等）。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a>（self-managed source）跟 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a>（cloud-managed target）。跟前兩篇 migration（<a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a> 高 schema 差 / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a> drop-in）對照、本篇是 <em>middle ground</em>：wire protocol drop-in、但 operational model 重設計。每階段切換用 <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a> 把關。</p></blockquote>
<h2 id="為什麼遷operational-cost--ha--dr-三條-driver">為什麼遷：operational cost / HA / DR 三條 driver</h2>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>觸發場景</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Operational cost</strong></td>
          <td>self-managed PostgreSQL + Patroni HA + pgBackRest backup + monitoring 需 0.5-2 FTE；Aurora 把這層責任轉嫁 AWS、SRE 專注 application</td>
      </tr>
      <tr>
          <td><strong>HA reliability</strong></td>
          <td>Patroni split-brain / DCS quorum 偶爾踩雷、production failover 4-15s；Aurora 自動 multi-AZ failover &lt; 30s、shared storage 不丟資料</td>
      </tr>
      <tr>
          <td><strong>DR / backup</strong></td>
          <td>自管 PITR + cross-region replication 複雜；Aurora 內建 PITR + global database + backup retention 簡化</td>
      </tr>
  </tbody>
</table>
<p>反向 driver（Aurora → self-managed）也存在 — 主要是 <em>cost 在 10TB+ 規模時 Aurora 反而更貴</em>、或 <em>需要 PostgreSQL extension Aurora 不支援</em>（pg_partman / pg_repack / TimescaleDB 等）。</p>
<h2 id="結構protocol-相容--operational-phased-的混合">結構：protocol 相容 + operational phased 的混合</h2>
<p>跟前兩篇對照、Aurora migration 結構是 <em>protocol drop-in</em>（application 不改 SQL）+ <em>operational redesign</em>（HA / backup / monitoring 全換）：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Splunk → Elastic（高 schema 差）</th>
          <th>Redis → DragonflyDB（drop-in）</th>
          <th>PostgreSQL → Aurora（middle）</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Wire protocol</td>
          <td>完全不同（SPL vs KQL）</td>
          <td>完全相同（RESP）</td>
          <td>完全相同（PostgreSQL wire）</td>
      </tr>
      <tr>
          <td>Schema / data model</td>
          <td>高差異（CIM vs ECS）</td>
          <td>完全相同</td>
          <td>完全相同</td>
      </tr>
      <tr>
          <td>Application code</td>
          <td>必改</td>
          <td>不改</td>
          <td>不改</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>不同</td>
          <td>相似</td>
          <td><strong>大差</strong></td>
      </tr>
      <tr>
          <td>HA / replication</td>
          <td>不同</td>
          <td>相似</td>
          <td><strong>完全重設計</strong></td>
      </tr>
      <tr>
          <td>Backup model</td>
          <td>不同</td>
          <td>簡化</td>
          <td><strong>完全換 AWS-native</strong></td>
      </tr>
      <tr>
          <td>Migration 週期</td>
          <td>4-9 個月</td>
          <td>1-4 週</td>
          <td>6-12 週</td>
      </tr>
      <tr>
          <td>Phased 結構需要</td>
          <td>6-phase 明顯</td>
          <td>不需要</td>
          <td><strong>混合</strong>（3 operational phase + drop-in cutover）</td>
      </tr>
  </tbody>
</table>
<p><strong>Hypothesis 驗證</strong>：migration playbook 結構由 <em>最大差異維度</em> 決定 — Splunk → Elastic 是 schema 差導向 phased、Aurora migration 是 operational 差導向局部 phased。</p>
<h2 id="operational-redesign-對位">Operational redesign 對位</h2>
<p>跟 self-managed PostgreSQL 比、Aurora 的 operational 模型差異：</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>Self-managed PostgreSQL</th>
          <th>Aurora</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage</td>
          <td>Local disk / EBS、跟 compute 一體</td>
          <td>Shared storage 跨 AZ 6 副本、跟 compute 解耦</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Patroni + DCS quorum + watchdog</td>
          <td>Aurora 自家 failover、shared storage 不重 promote</td>
      </tr>
      <tr>
          <td>Read replica</td>
          <td>Streaming replication + Patroni 管理</td>
          <td>Aurora reader endpoint、cluster 自動 routing</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest / WAL-G + S3</td>
          <td>自動 continuous backup + PITR（內建）</td>
      </tr>
      <tr>
          <td>Failover time</td>
          <td>15-60s（Patroni）</td>
          <td>&lt; 30s（同 AZ）/ 1-2 min（跨 AZ）</td>
      </tr>
      <tr>
          <td>Connection management</td>
          <td>PgBouncer 必裝</td>
          <td>RDS Proxy 推薦、Aurora 自家 connection pool</td>
      </tr>
      <tr>
          <td>Major version upgrade</td>
          <td>手動 + 停機</td>
          <td>Aurora 自家 blue/green deployment</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Prometheus + grafana-postgresql</td>
          <td>CloudWatch + Performance Insights</td>
      </tr>
      <tr>
          <td>Extension support</td>
          <td>自由安裝</td>
          <td><strong>白名單</strong>、限 AWS 認可 extension</td>
      </tr>
      <tr>
          <td>Custom config</td>
          <td>postgresql.conf 全控</td>
          <td>Parameter Group（限制）</td>
      </tr>
      <tr>
          <td>OS / kernel access</td>
          <td>完全控</td>
          <td><strong>無</strong>（fully managed）</td>
      </tr>
  </tbody>
</table>
<p>每一條 operational concept 都需要 migration plan、application code 不變但 <em>運維知識體系全換</em>。</p>
<h2 id="migration-流程3-phase-operational--drop-in-cutover">Migration 流程：3 phase operational + drop-in cutover</h2>
<h3 id="phase-0pre-migration-audit1-2-週">Phase 0：Pre-migration audit（1-2 週）</h3>
<ol>
<li><strong>Extension 清單對位</strong>：</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">extname</span><span class="p">,</span><span class="w"> </span><span class="n">extversion</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- 對照 Aurora supported extensions list
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1">-- 不支援的（pg_repack / pg_partman 部分 / TimescaleDB / Citus）需替代方案</span></span></span></code></pre></div><ol start="2">
<li><strong>Custom config 清單</strong>：</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">setting</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_settings</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">source</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s1">&#39;default&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- 對照 Aurora Parameter Group 可調項目</span></span></span></code></pre></div><ol start="3">
<li><strong>Capacity 評估</strong>：</li>
</ol>
<ul>
<li>當前 IOPS / connection / storage / WAL rate</li>
<li>對應 Aurora instance class（db.r6g.large to db.r6g.32xlarge）</li>
<li>估算 cost（vCPU + IOPS + storage + backup retention）</li>
</ul>
<ol start="4">
<li><strong>Application connection pool audit</strong>：</li>
</ol>
<ul>
<li>PgBouncer 配置是否能直接搬到 RDS Proxy</li>
<li>Connection string + IAM 認證準備</li>
</ul>
<h3 id="phase-1operational-infrastructure-準備2-3-週">Phase 1：Operational infrastructure 準備（2-3 週）</h3>
<ol>
<li>建 Aurora cluster（Terraform / CloudFormation）</li>
<li>設 Parameter Group、對位 self-managed 配置</li>
<li>設 Security Group + IAM role</li>
<li>設 RDS Proxy（推薦、connection 集中管理）</li>
<li>CloudWatch alert + Performance Insights baseline</li>
<li>Backup retention + PITR window 設定</li>
</ol>
<h3 id="phase-2data-migration取決於-dataset-大小">Phase 2：Data migration（取決於 dataset 大小）</h3>
<p>兩條路：</p>
<h4 id="路線-aaws-dms推薦中等規模--5tb">路線 A：AWS DMS（推薦中等規模 &lt; 5TB）</h4>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">self-managed Postgres ──(DMS)──→ Aurora
</span></span><span class="line"><span class="ln">2</span><span class="cl">                         |
</span></span><span class="line"><span class="ln">3</span><span class="cl">                  full load + CDC continuous</span></span></code></pre></div><ul>
<li>DMS task 設 <code>Full Load + Ongoing Replication</code></li>
<li>跑 full load 估算（100GB ~ 1-3 小時依 instance class）</li>
<li>CDC 持續直到 cutover</li>
</ul>
<h4 id="路線-blogical-replication推薦-5tb-或要精準控制">路線 B：Logical replication（推薦 5TB+ 或要精準控制）</h4>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Source：建 publication
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">migrate_pub</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">ALL</span><span class="w"> </span><span class="n">TABLES</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Aurora：建 subscription
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SUBSCRIPTION</span><span class="w"> </span><span class="n">migrate_sub</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">CONNECTION</span><span class="w"> </span><span class="s1">&#39;host=&lt;source&gt; dbname=&lt;db&gt; user=&lt;replicator&gt;&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">migrate_pub</span><span class="p">;</span></span></span></code></pre></div><ul>
<li>Initial COPY 跑完後 streaming</li>
<li>詳見 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
</ul>
<h3 id="phase-3cutover-跟-verification">Phase 3：Cutover 跟 verification</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. Application 端設 maintenance mode（block writes）
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. 等 replication lag → 0
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. 確認 Aurora 端 row count + checksum 對齊
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. Application connection string 切到 Aurora endpoint
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. 解除 maintenance mode
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. Self-managed 端 read-only 保留 1-2 週 standby</span></span></code></pre></div><p>Cutover window 視 dataset 大小：</p>
<ul>
<li>&lt; 100GB：1-2 小時</li>
<li>100GB - 1TB：2-4 小時</li>
<li>1TB+：考慮 <em>zero-downtime cutover</em> via blue-green deployment</li>
</ul>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1extension-不支援application-直接壞">Case 1：Extension 不支援、application 直接壞</h3>
<p><strong>徵兆</strong>：cutover 後 application 某些 query 報 <code>extension &quot;pg_repack&quot; not available</code>、batch job 壞。</p>
<p><strong>根因</strong>：Phase 0 audit 漏掉 application 用 pg_repack 做 maintenance；Aurora 不支援、self-managed 端的 cron job 改不過去。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration audit 必做</strong>：<code>SELECT extname FROM pg_extension</code> 對照 <a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Extensions.html">Aurora extension whitelist</a></li>
<li><strong>替代方案</strong>：
<ul>
<li>pg_repack → Aurora 自家 vacuum + storage auto-resize</li>
<li>TimescaleDB → 改 declarative partitioning 或換 Timestream</li>
<li>Citus → 評估保留 self-managed 或重設計 schema</li>
</ul>
</li>
<li><strong>退役策略</strong>：Extension 是 application 必要的、評估暫不遷或選 alternative cloud（如 AlloyDB / Citus on Azure）</li>
</ol>
<h3 id="case-2replication-slot-不直通">Case 2：Replication slot 不直通</h3>
<p><strong>徵兆</strong>：self-managed 端有 Debezium CDC 接 application 事件、cutover 後 CDC pipeline 直接壞、Kafka 端訊息斷流。</p>
<p><strong>根因</strong>：Aurora 對 logical replication slot 有限制 — 不直接支援 external consumer（如 Debezium）讀 slot；要走 <em>RDS Database Events</em> 或 <em>DMS CDC</em>。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration audit</strong>：列所有 logical consumer（Debezium / Kafka Connect / 自家 CDC）</li>
<li><strong>替代方案</strong>：
<ul>
<li>DMS CDC 取代 Debezium（Aurora 原生支援）</li>
<li>評估 RDS Database Activity Streams（newer feature）</li>
<li>重設計 CDC：application 寫 outbox 表、Aurora trigger 發 SNS → Lambda → Kafka</li>
</ul>
</li>
<li><strong>接受代價</strong>：CDC pipeline 重建是 2-4 週工作、納入 migration scope</li>
</ol>
<h3 id="case-3autovacuum-行為跟-self-managed-不同">Case 3：Autovacuum 行為跟 self-managed 不同</h3>
<p><strong>徵兆</strong>：cutover 後幾天、特定 hot table 的 bloat 數據異常、application 端 query latency p99 漲；CloudWatch Performance Insights 顯示 autovacuum 跑頻率比 self-managed 端高 3 倍。</p>
<p><strong>根因</strong>：Aurora 預設 Parameter Group 的 autovacuum 配置跟 self-managed 不同 — <code>autovacuum_vacuum_cost_limit</code> 預設更低、<code>vacuum_scale_factor</code> 更激進；shared storage 上 vacuum 行為不一樣。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Parameter Group 對位</strong>：把 self-managed <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a> 配置複製到 Aurora Parameter Group</li>
<li><strong>per-table tuning</strong>：hot table 的 <code>ALTER TABLE SET (autovacuum_*)</code> 可遷過去</li>
<li><strong>接受差異</strong>：Aurora storage 設計讓 vacuum 不一定要跟 self-managed 同 cadence、SRE 心智模型要調</li>
</ol>
<h3 id="case-4iam-認證強制application-端改-connection-logic">Case 4：IAM 認證強制、application 端改 connection logic</h3>
<p><strong>徵兆</strong>：production 切到 Aurora 後、application 仍用 password authentication、SOC team 要求改 IAM 認證（compliance）；application 連線 logic 大改、token rotation 邏輯也要加。</p>
<p><strong>根因</strong>：self-managed 端用固定 username/password、Aurora 推薦（部分情境強制）IAM authentication；token 15 分鐘輪換、application 必須改連線 SDK。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Migration scope 內包含</strong>：authentication migration 是必要工作、不能事後補</li>
<li><strong>SDK 整合</strong>：用 AWS SDK + RDS Proxy 抽象 token rotation、application 不直接管 token</li>
<li><strong>Hybrid 期間</strong>：保留 password auth 直到 application 全切 IAM、再 disable password auth</li>
</ol>
<h3 id="case-5cost-model-預估錯月底帳單炸">Case 5：Cost model 預估錯、月底帳單炸</h3>
<p><strong>徵兆</strong>：第一個月 Aurora 帳單比預估高 50-80%；IOPS / backup storage / I/O cost 都比預期多。</p>
<p><strong>根因</strong>：Aurora pricing 三層（compute instance / storage / I/O）—</p>
<ul>
<li>Storage：actual data + backup × retention</li>
<li>I/O：每個 read / write block 都計費（self-managed 不算）</li>
<li>Backup：超過 backup retention 部分 charged as snapshot storage</li>
</ul>
<p>self-managed 端習慣 <em>fixed EC2 + EBS</em> cost、Aurora I/O-based 計費對 high-IOPS workload 衝擊大。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration cost estimate</strong>：用 self-managed <code>pg_stat_database</code> 估 I/O 量、套 Aurora pricing calc</li>
<li><strong>I/O optimization</strong>：開 Aurora I/O-Optimized storage class（fixed monthly + 不算 I/O）、適合 high-IOPS workload</li>
<li><strong>Backup retention 控制</strong>：不要 default 35 天、依 compliance 調整（7-14 天通常夠）</li>
<li><strong>Reserved Instance</strong>：穩定 workload 預付 1-3 年、省 30-40%</li>
</ol>
<h2 id="capacity--cost-對照">Capacity / cost 對照</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Self-managed PostgreSQL（EC2 + EBS）</th>
          <th>Aurora</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance cost</td>
          <td>EC2 + EBS（compute + storage 自管）</td>
          <td>Aurora instance class + storage + I/O</td>
      </tr>
      <tr>
          <td>HA cost</td>
          <td>Patroni 跨 3 AZ + EBS 3 副本</td>
          <td>Aurora 跨 3 AZ shared storage（內建）</td>
      </tr>
      <tr>
          <td>Backup cost</td>
          <td>pgBackRest + S3 archive</td>
          <td>Aurora 自動 continuous backup（內建）</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-2 FTE（HA / backup / patching）</td>
          <td>0.1-0.3 FTE（application 端 + Parameter Group）</td>
      </tr>
      <tr>
          <td>1TB / month cost</td>
          <td>$400-800（含 HA）</td>
          <td>$700-1500（含 HA）</td>
      </tr>
      <tr>
          <td>10TB / month cost</td>
          <td>$2K-4K</td>
          <td>$4K-8K（I/O cost 顯著）</td>
      </tr>
      <tr>
          <td>50TB+ cost</td>
          <td>$10K-20K</td>
          <td>$30K+（cost 反轉、self-managed 更便宜）</td>
      </tr>
  </tbody>
</table>
<p><strong>判讀</strong>：&lt; 10TB workload Aurora 平攤 operational cost 後仍便宜；50TB+ workload Aurora cost 顯著高、要 reserved + I/O-Optimized 才有競爭力。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-patroni-ha-對位">跟 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> 對位</h3>
<p>Patroni 在 Aurora migration 後 <em>退役</em> — Aurora 自家 failover 取代；但 SRE 心智模型要調：</p>
<ul>
<li>Patroni 的 <code>pg_rewind</code> 概念不存在（shared storage）</li>
<li>Patroni 的 <code>synchronous_commit</code> 行為 Aurora 隱藏在 storage layer</li>
<li>Aurora 跨 region 用 <em>Global Database</em>、不是 Patroni cross-region setup</li>
</ul>
<h3 id="跟-pitr-對位">跟 <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR</a> 對位</h3>
<p>self-managed PITR rebuild 工作量大、Aurora PITR 是 native API call：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">aws rds restore-db-cluster-to-point-in-time <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --source-db-cluster-identifier myapp-prod <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --db-cluster-identifier myapp-prod-restored <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --restore-to-time 2026-05-19T14:30:00Z</span></span></code></pre></div><p>完全不需要 base backup + WAL replay 思維、storage layer 自動處理。</p>
<h3 id="跟-pgbouncer--rds-proxy">跟 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer</a> → RDS Proxy</h3>
<p>PgBouncer 多數情境可換 RDS Proxy：</p>
<ul>
<li>transaction pooling 等效</li>
<li>IAM authentication 整合</li>
<li>Connection pinning（Lambda / serverless workload）</li>
<li><strong>限制</strong>：RDS Proxy 對某些 PG 14+ feature 仍 catching up、prepared statements 行為差異</li>
</ul>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Aurora Serverless v2 評估</strong>：variable workload 適合、steady workload 反而貴</li>
<li><strong>Babelfish 評估</strong>：跑 SQL Server protocol on Aurora（多 source 遷移到 Aurora）</li>
<li><strong>Cross-region DR</strong>：Aurora Global Database vs self-managed cross-region streaming + Patroni</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>Source vendor：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Target vendor：<a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a></li>
<li>平行 migration playbook：<a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic Security</a> / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a></li>
<li>Aurora family 內進一步遷移：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/" data-link-title="PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift" data-link-desc="Aurora DSQL（2024-12 re:Invent preview / 2025-05 GA）是 AWS 推的 PG wire-compatible *active-active distributed SQL*、跟 self-managed PG / Aurora PG 不同 paradigm（OCC &#43; snapshot isolation &#43; multi-region strong consistency）。Migration 結構是 *protocol drop-in &#43; paradigm shift*：app SQL 不太改、但 transaction retry / extension 缺位 / 多 region 一致性需重設計。本文走 DSQL vs Aurora PG vs self-managed PG 三軸對比、為什麼遷的三條 driver（global write / operational zero-touch / region resiliency）、Type E phased plan、5 production 踩雷（transaction retry 沒處理 / extension 缺位 / sequence throughput 限制 / Aurora PG 直升 DSQL 不可行 / region failover semantic）、跟 PG → Aurora 跟 PG → CockroachDB 對比">→ Aurora DSQL</a>（從 Aurora PG 升 DSQL active-active distributed、Type E paradigm shift）</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a>（source）跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&amp;#43;75% 效能改善的 production 證據">Aurora&lt;/a>（DSQL 也屬 Aurora family、但 paradigm 不同）。跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &amp;#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">migrate-to-aurora&lt;/a>（PG → Aurora PG、protocol drop-in + operational redesign）跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &amp;#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb&lt;/a>（PG → CRDB、Type E paradigm shift）對照、本篇是 &lt;em>Aurora 內 PG → DSQL 的 paradigm shift&lt;/em>。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關。&lt;/p>&lt;/blockquote>
&lt;blockquote>
&lt;p>&lt;strong>時間錨點&lt;/strong>：Aurora DSQL 在 &lt;strong>2024-12 re:Invent preview&lt;/strong>、&lt;strong>2025-05-27 GA&lt;/strong>。本文 vendor claim 以 2025-2026 公開狀態為準、實際 migration 前請以 AWS docs 為準（feature 持續演進中）。&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼遷global-write--operational-zero-touch--region-resiliency-三條-driver">為什麼遷：Global Write / Operational Zero-touch / Region Resiliency 三條 driver&lt;/h2>
&lt;p>PG → DSQL 不是「自然演進」、是 &lt;em>application 需求超出 single-primary 模型&lt;/em> 時的 paradigm 換軌。三條典型 driver 各自對應一種 application 約束、不是「三選一」、而是「至少其中一條剛性、其他兩條是 bonus」：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a>（source）跟 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a>（DSQL 也屬 Aurora family、但 paradigm 不同）。跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">migrate-to-aurora</a>（PG → Aurora PG、protocol drop-in + operational redesign）跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a>（PG → CRDB、Type E paradigm shift）對照、本篇是 <em>Aurora 內 PG → DSQL 的 paradigm shift</em>。每階段切換用 <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a> 把關。</p></blockquote>
<blockquote>
<p><strong>時間錨點</strong>：Aurora DSQL 在 <strong>2024-12 re:Invent preview</strong>、<strong>2025-05-27 GA</strong>。本文 vendor claim 以 2025-2026 公開狀態為準、實際 migration 前請以 AWS docs 為準（feature 持續演進中）。</p></blockquote>
<h2 id="為什麼遷global-write--operational-zero-touch--region-resiliency-三條-driver">為什麼遷：Global Write / Operational Zero-touch / Region Resiliency 三條 driver</h2>
<p>PG → DSQL 不是「自然演進」、是 <em>application 需求超出 single-primary 模型</em> 時的 paradigm 換軌。三條典型 driver 各自對應一種 application 約束、不是「三選一」、而是「至少其中一條剛性、其他兩條是 bonus」：</p>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>觸發場景</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Global write</strong></td>
          <td>Application 需要多 region active-active write（不是 Aurora PG 的 single-writer + read replica）</td>
      </tr>
      <tr>
          <td><strong>Operational zero-touch</strong></td>
          <td>不想管 Patroni / PgBouncer / autovacuum / failover / backup retention、Aurora PG 已減一半、DSQL 進一步零接觸</td>
      </tr>
      <tr>
          <td><strong>Region resiliency</strong></td>
          <td>整 region 失效時應用無感切換（Aurora PG 是 cross-region replica 異步、DSQL 是 strong consistency 多 region）</td>
      </tr>
  </tbody>
</table>
<p>反向 driver（DSQL → Aurora PG）也存在：</p>
<ul>
<li>需要 PG extension（pgvector / TimescaleDB / PostGIS / pg_repack）— DSQL 不支援</li>
<li>Cost：DSQL 比 Aurora PG 貴 2-5x（依 region 數量）</li>
<li>Single-region OLTP 不需 distributed transaction 的 overhead</li>
</ul>
<h2 id="結構protocol-drop-in--paradigm-shift">結構：Protocol Drop-in + Paradigm Shift</h2>
<p>DSQL 是 PG wire-compatible（用 <code>psql</code> 連得上）、但內部是 <em>distributed SQL engine</em>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>self-managed PG</th>
          <th>Aurora PG</th>
          <th>Aurora DSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Wire protocol</td>
          <td>PG</td>
          <td>PG</td>
          <td>PG（subset）</td>
      </tr>
      <tr>
          <td>Architecture</td>
          <td>Single primary</td>
          <td>Single primary + shared storage</td>
          <td><strong>Active-active distributed</strong></td>
      </tr>
      <tr>
          <td>Multi-region write</td>
          <td>不支援（async replica）</td>
          <td>不支援（async replica）</td>
          <td><strong>Strong consistency 多 region</strong></td>
      </tr>
      <tr>
          <td>Transaction model</td>
          <td>MVCC + snapshot isolation</td>
          <td>MVCC + snapshot isolation</td>
          <td><strong>OCC + strong snapshot isolation</strong></td>
      </tr>
      <tr>
          <td>Extension</td>
          <td>任意</td>
          <td>AWS whitelist</td>
          <td><strong>無 extension 支援</strong></td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>全部自管</td>
          <td>AWS 管 storage / failover</td>
          <td>AWS 管全部、零接觸</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Patroni 15-60s</td>
          <td>Aurora 30s</td>
          <td>N/A（永遠 active-active、無 failover 概念）</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>Self-managed instance</td>
          <td>Instance hour + storage</td>
          <td>Per-DPU + multi-AZ replication</td>
      </tr>
  </tbody>
</table>
<p><strong>Paradigm shift 的核心</strong>：</p>
<ol>
<li><strong>Transaction semantic</strong>：DSQL 用 OCC（Optimistic Concurrency Control）+ strong snapshot isolation、跟 PG 預設 read committed / repeatable read snapshot 不同 — 同 row 有 concurrent write 時、commit 階段才偵測衝突 + abort、application 要 handle <code>40001</code> serialization_failure</li>
<li><strong>No extension</strong>：PostGIS / pgvector / TimescaleDB / pg_partman 都不能用、依賴這些 feature 的 application 要拆出去</li>
<li><strong>No connection pool stateful</strong>：DSQL 內建 connection pool、application 不能依賴 session state（temp table / prepared statement / advisory lock）</li>
</ol>
<h2 id="schema-gappg-對-dsql-限制">Schema gap：PG 對 DSQL 限制</h2>
<p>DSQL 是 PG-compatible <em>subset</em>、有幾類功能不支援：</p>
<table>
  <thead>
      <tr>
          <th>類別</th>
          <th>PG 支援</th>
          <th>DSQL 支援</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Extension</td>
          <td>是</td>
          <td>否（沒 <code>CREATE EXTENSION</code>）</td>
      </tr>
      <tr>
          <td>Foreign key constraint</td>
          <td>是</td>
          <td>否（application 維護 referential integrity）</td>
      </tr>
      <tr>
          <td>View / Materialized view</td>
          <td>是</td>
          <td>View 部分 / Materialized view 否</td>
      </tr>
      <tr>
          <td>JSON / JSONB</td>
          <td>是</td>
          <td>部分（無 GIN index 加速）</td>
      </tr>
      <tr>
          <td>Foreign data wrapper</td>
          <td>是</td>
          <td>否</td>
      </tr>
      <tr>
          <td>Stored procedure（PL/pgSQL）</td>
          <td>是</td>
          <td>部分（限制多）</td>
      </tr>
      <tr>
          <td>Trigger</td>
          <td>是</td>
          <td>部分</td>
      </tr>
      <tr>
          <td>LISTEN / NOTIFY</td>
          <td>是</td>
          <td>否</td>
      </tr>
      <tr>
          <td><code>SELECT ... FOR UPDATE</code></td>
          <td>是</td>
          <td>部分（DSQL OCC semantic）</td>
      </tr>
      <tr>
          <td>Sequence（serial / identity）</td>
          <td>是</td>
          <td>支援、但高吞吐有 coordination overhead</td>
      </tr>
      <tr>
          <td>Table partition</td>
          <td>是</td>
          <td>部分</td>
      </tr>
      <tr>
          <td>Logical replication slot</td>
          <td>是</td>
          <td>否</td>
      </tr>
  </tbody>
</table>
<p><strong>Migration 必做 schema audit</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 找所有 extension 依賴
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- 找 materialized view
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">schemaname</span><span class="p">,</span><span class="w"> </span><span class="n">matviewname</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_matviews</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- 找 sequence
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_sequences</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 找 FDW
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_foreign_server</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 找 trigger
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_trigger</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">tgisinternal</span><span class="p">;</span></span></span></code></pre></div><p>任何項目命中、都是 migration blocker。</p>
<h2 id="operational-redesign">Operational Redesign</h2>
<p>跟 self-managed PG 或 Aurora PG 比、DSQL operational model 大幅簡化但語意不同：</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>self-managed PG</th>
          <th>Aurora PG</th>
          <th>Aurora DSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage</td>
          <td>Local / EBS</td>
          <td>Shared 6 副本</td>
          <td>Distributed log + replicated state</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Patroni</td>
          <td>Aurora failover</td>
          <td>永遠 HA（無 failover 概念）</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest / WAL-G</td>
          <td>內建 continuous</td>
          <td>內建 continuous（更深整合）</td>
      </tr>
      <tr>
          <td>Connection pool</td>
          <td>PgBouncer / PgCat</td>
          <td>RDS Proxy 推薦</td>
          <td>內建（無需配置）</td>
      </tr>
      <tr>
          <td>Major version upgrade</td>
          <td>手動 + 停機</td>
          <td>Aurora blue/green</td>
          <td>完全 transparent（AWS 升）</td>
      </tr>
      <tr>
          <td>Read replica</td>
          <td>Streaming replication</td>
          <td>Reader endpoint</td>
          <td>無分（每 region 都讀寫）</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Prometheus / pg_stat_*</td>
          <td>CloudWatch + Performance Insights</td>
          <td>CloudWatch（簡化）</td>
      </tr>
      <tr>
          <td>預期 SRE FTE</td>
          <td>0.5-2</td>
          <td>0.2-0.5</td>
          <td>&lt; 0.1</td>
      </tr>
  </tbody>
</table>
<h2 id="migration-流程type-e-phased-plan">Migration 流程：Type E Phased Plan</h2>
<p>Type E paradigm shift 的 phased plan、跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a> 結構類似：</p>
<h3 id="phase-1schema--application-audit">Phase 1：Schema / Application Audit</h3>
<ul>
<li>跑 schema audit（extension / MV / FDW / sequence / trigger）</li>
<li>識別 application 哪些 query / transaction pattern 需重設計</li>
<li>估算 <em>能直接遷的 % vs 需重寫的 %</em>、典型 60-80% / 20-40%</li>
</ul>
<h3 id="phase-2application-改造不上-dsql先在-pg-跑">Phase 2：Application 改造（不上 DSQL、先在 PG 跑）</h3>
<ul>
<li>加 transaction retry middleware（攔截 <code>40001</code>、exponential backoff）</li>
<li>用 UUID 替代 serial / bigserial</li>
<li>移除依賴 LISTEN/NOTIFY 的功能（改 SQS / EventBridge）</li>
<li>移除 materialized view（改 application-side cache 或 incremental ETL）</li>
<li>Stored procedure 改 application code</li>
<li>在 PG 上跑 staging、確認新 application code 還對</li>
</ul>
<h3 id="phase-3dsql-cluster-建立--schema-遷">Phase 3：DSQL Cluster 建立 + Schema 遷</h3>
<ul>
<li>DSQL cluster create</li>
<li>DDL apply（subset of PG schema、無 extension）</li>
<li>DMS（Database Migration Service）initial load + ongoing replication</li>
<li>兩邊跑 shadow traffic、比對 query 結果</li>
</ul>
<h3 id="phase-4cutover">Phase 4：Cutover</h3>
<ul>
<li>Application 切 connection string 到 DSQL</li>
<li>保留 PG read-only 一週、出狀況 rollback</li>
<li>Monitor <code>40001</code> retry rate、scaling event 行為</li>
</ul>
<h3 id="phase-5多-region-拓展如適用">Phase 5：多 region 拓展（如適用）</h3>
<ul>
<li>加第二 region endpoint</li>
<li>Application 改 multi-region routing（latency-based）</li>
<li>Test region failure / network partition 行為</li>
</ul>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="case-1transaction-retry-沒處理">Case 1：Transaction Retry 沒處理</h3>
<p><strong>情境</strong>：PG 上「兩個 transaction 都 update 同 row」走 lock + wait；DSQL 同情境一個會收 <code>40001 serialization_failure</code>、application 沒 catch、user 看到 500 error。</p>
<p>修法：</p>
<ul>
<li>DAO 層加 retry middleware：catch <code>40001</code> + exponential backoff（jitter）</li>
<li>Retry 上限 3-5 次、超過回 4xx 給 user</li>
<li>Transaction 內不要做 side effect（API call / message send）、retry 會重做</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">def</span> <span class="nf">with_retry</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">max_attempts</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_attempts</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">            <span class="k">return</span> <span class="n">fn</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">        <span class="k">except</span> <span class="n">SerializationError</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">            <span class="k">if</span> <span class="n">attempt</span> <span class="o">==</span> <span class="n">max_attempts</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">                <span class="k">raise</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">            <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">((</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.05</span> <span class="o">+</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o">*</span> <span class="mf">0.05</span><span class="p">)</span></span></span></code></pre></div><h3 id="case-2extension-缺位feature-整段掉">Case 2：Extension 缺位、Feature 整段掉</h3>
<p><strong>情境</strong>：production PG 用 pgvector 做 RAG search、PostGIS 做 store locator、TimescaleDB 做 metrics — 切 DSQL 後三 feature 全沒。</p>
<p>修法：</p>
<ul>
<li>不要直接遷、評估 <em>which extension is load-bearing</em></li>
<li>pgvector → 外掛 Pinecone / Weaviate 或保留 PG 跑 vector workload</li>
<li>PostGIS → 保留 PG 跑 GIS workload</li>
<li>TimescaleDB → 切 Amazon Timestream 或保留 PG</li>
<li>DSQL 只放 <em>不依賴 extension</em> 的 transactional core</li>
</ul>
<p>實務常見拓撲：DSQL 跑 transactional core、附 PG（vector） + PG（GIS） + Timestream（metrics）。</p>
<h3 id="case-3sequence-高吞吐撞-coordination-overhead">Case 3：Sequence 高吞吐撞 Coordination Overhead</h3>
<p><strong>情境</strong>：<code>SERIAL</code> / <code>GENERATED AS IDENTITY</code> PK 在 DSQL 用、insert 量 1000+/s 時 sequence nextval 變成 bottleneck、insert latency 從 5ms 跳到 80-100ms+。</p>
<p>DSQL 有支援 sequence、但不是「local atomic counter」、是分散式 counter — 每次 nextval 需跨 region coordination 保證唯一性。低吞吐 OK、高吞吐撞牆。</p>
<p>修法：</p>
<ul>
<li>高吞吐表 PK 換 UUID v7（time-sortable、無 coordination）：<code>gen_random_uuid()</code> 或 application-side UUID v7 library</li>
<li>或 application-side ULID（time-sortable、12-byte 緊湊）</li>
<li>完全避免依賴「連續 integer PK」的 application 邏輯（reporting / paging 改用 <code>ORDER BY created_at, id</code>）</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 換 UUID PK
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">UUID</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="n">gen_random_uuid</span><span class="p">(),</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>低吞吐表（settings / config）保留 sequence OK；high-volume transactional 表（orders / events）建議 UUID。</p>
<h3 id="case-4aurora-pg-直升-dsql-想當-in-place">Case 4：Aurora PG 直升 DSQL 想當 in-place</h3>
<p><strong>情境</strong>：team 以為「Aurora PG 跟 Aurora DSQL 都是 Aurora、應該能直升」、申請 cluster modify、發現完全是兩個 service。</p>
<p>修法：</p>
<ul>
<li>不是 in-place upgrade、是 full migration（DMS + cutover）</li>
<li>把 DSQL 當完全新的 cluster type、走 Phase 1-4 完整流程</li>
<li>Aurora PG → Aurora DSQL 不比 PG → CRDB 容易、wire-compatible 只解 application connect 問題、不解 schema / paradigm 差異</li>
</ul>
<h3 id="case-5region-failover-semantic">Case 5：Region Failover Semantic</h3>
<p><strong>情境</strong>：team 以為「DSQL multi-region 等於高可用」、設計時假設「整 region 掛還是能寫」、實測發現「網絡分割時 DSQL 走 quorum、可能 reject write」。</p>
<p>DSQL 是 strong consistency 多 region、CAP 取 CP（不是 AP）—  network partition 時部分 region 會拒絕 write、不是「永遠可寫」。</p>
<p>修法：</p>
<ul>
<li>設計 application 要 handle write reject（partition recovery 後 retry）</li>
<li>不要把 DSQL 當「永遠可寫」的 cache 或 queue 用</li>
<li>真要 AP 行為、用 DynamoDB（global table）</li>
</ul>
<h2 id="capacity-規劃">Capacity 規劃</h2>
<p>DSQL 計費跟 Aurora PG 差很多：</p>
<table>
  <thead>
      <tr>
          <th>計費項目</th>
          <th>Aurora PG</th>
          <th>Aurora DSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance</td>
          <td>Per-instance hour</td>
          <td>無（serverless）</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>Per-GB-month</td>
          <td>Per-GB-month（多副本價）</td>
      </tr>
      <tr>
          <td>IO</td>
          <td>Per-million IO</td>
          <td>每 transaction 計費</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>Per-GB-month</td>
          <td>內建（無額外）</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>Cross-region replica（額外）</td>
          <td>每 region 全費 × N</td>
      </tr>
  </tbody>
</table>
<p>實務 cost：Aurora PG db.r6g.4xlarge multi-AZ 月 ~$2000 → DSQL 同 workload ~$5000-10000（依 region 數）。</p>
<p>何時 DSQL cost 划算：</p>
<ul>
<li>多 region active-active 需求剛性（不是 nice-to-have）</li>
<li>Operational FTE 節省超過 cost 差</li>
<li>Burst workload（DSQL 自動 scale、Aurora PG 預配置 idle 期浪費）</li>
</ul>
<h2 id="跟既有-migration-playbook-對比">跟既有 Migration Playbook 對比</h2>
<table>
  <thead>
      <tr>
          <th>Migration</th>
          <th>Type</th>
          <th>主結構</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">→ Aurora PG</a></td>
          <td>C</td>
          <td>Protocol drop-in + operational redesign</td>
      </tr>
      <tr>
          <td><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">→ CockroachDB</a></td>
          <td>E</td>
          <td>Paradigm shift（distributed SQL）</td>
      </tr>
      <tr>
          <td>→ Aurora DSQL（本篇）</td>
          <td>E</td>
          <td>Paradigm shift（PG-compatible distributed）</td>
      </tr>
  </tbody>
</table>
<p><strong>Aurora DSQL vs CockroachDB 選擇</strong>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Aurora DSQL</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PG compatibility</td>
          <td>Wire-compatible 較完整</td>
          <td>高、但有差異</td>
      </tr>
      <tr>
          <td>Vendor lock-in</td>
          <td>AWS only</td>
          <td>跨雲 / on-prem</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>AWS pricing</td>
          <td>自管或 CockroachDB Cloud</td>
      </tr>
      <tr>
          <td>Multi-region 模型</td>
          <td>Strong consistency 內建</td>
          <td>可配置（regional / global table）</td>
      </tr>
      <tr>
          <td>Extension</td>
          <td>完全沒</td>
          <td>部分（CDC / changefeed）</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Zero-touch</td>
          <td>自管或 managed</td>
      </tr>
  </tbody>
</table>
<p>選 DSQL：已綁 AWS、不想管基礎設施、需 PG semantic。
選 CRDB：跨雲、有自管 SRE、需要 fine-grained control。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">migrate-to-aurora</a>：Aurora PG 對比（Type C）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a>：CRDB 對比（Type E）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>：DSQL 不支援的 extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/connection-scaling/" data-link-title="PostgreSQL Connection Scaling：process-per-connection model 跟為什麼 pooler 是必裝" data-link-desc="PG 每個 client connection fork 一個 backend process（不是 thread）、RAM 成本 5-15MB/connection、context switch 跟 fork() cost 在 100&#43; connection 後線性放大、所以 pooler 不是 *optional optimization* 而是 *production prerequisite*。本文走 process-per-connection model 跟 MySQL thread-per-connection 對比、max_connections &#43; shared_buffers &#43; work_mem 三 GUC 互動、application-side pool vs middleware pool vs RDS Proxy 三層選擇、5 production 踩雷（connection storm / fork() cost 在 burst 流量 / shared_buffers 跟 connection 數壓縮 / double-pool 配置錯誤 / max_connections 設太大反而慢）、跟 PgBouncer config 互補不重複">connection-scaling</a>：DSQL 內建 pool 跟 PgBouncer 對比</li>
</ul>
<h2 id="下一步">下一步</h2>
<ul>
<li>看 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora overview</a> 認識 Aurora family</li>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a> 對比另一個 Type E migration</li>
<li>回 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 看全圖</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB&lt;/a>。本文是 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 多重歸類跟 tie-breaking&lt;/a> 規則的實證 — 三維皆 High 配對的處理方式不是「選 type A 或 type C 或 type E」、是 &lt;em>主導維度走 Type E、其他高維度獨立加段&lt;/em>。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關。&lt;/p>&lt;/blockquote>
&lt;h2 id="三維皆-high決策矩陣">三維皆 High：決策矩陣&lt;/h2>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> 對 PostgreSQL → CockroachDB：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>評估&lt;/th>
 &lt;th>等級&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>PostgreSQL wire protocol 兼容、但 SQL feature set 部分缺（CTE recursive 部分 / window function 部分 / extension 完全缺）&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational model&lt;/td>
 &lt;td>Single-node + Patroni → distributed Raft + 自動 rebalance；HA / backup / topology 全換&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Abstraction / paradigm&lt;/td>
 &lt;td>Single-node MVCC + transaction → distributed Serializable Snapshot Isolation (SSI)&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Number of components&lt;/td>
 &lt;td>同 1 個 DB cluster&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>Transaction retry pattern 必須改、ORM 可能需 patch&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>3 維 High + 1 維 Medium。按 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">methodology audit Step 5&lt;/a> 的多重歸類處理規則：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> 跟 <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a>。本文是 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 多重歸類跟 tie-breaking</a> 規則的實證 — 三維皆 High 配對的處理方式不是「選 type A 或 type C 或 type E」、是 <em>主導維度走 Type E、其他高維度獨立加段</em>。每階段切換用 <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a> 把關。</p></blockquote>
<h2 id="三維皆-high決策矩陣">三維皆 High：決策矩陣</h2>
<p>跑 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> 對 PostgreSQL → CockroachDB：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評估</th>
          <th>等級</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>PostgreSQL wire protocol 兼容、但 SQL feature set 部分缺（CTE recursive 部分 / window function 部分 / extension 完全缺）</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Single-node + Patroni → distributed Raft + 自動 rebalance；HA / backup / topology 全換</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Single-node MVCC + transaction → distributed Serializable Snapshot Isolation (SSI)</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>同 1 個 DB cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Transaction retry pattern 必須改、ORM 可能需 patch</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
<p>3 維 High + 1 維 Medium。按 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">methodology audit Step 5</a> 的多重歸類處理規則：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">主導維度判讀 (優先序): Schema &gt; Paradigm &gt; Operational &gt; Components
</span></span><span class="line"><span class="ln">2</span><span class="cl">
</span></span><span class="line"><span class="ln">3</span><span class="cl">實際應用: Schema High + Paradigm High + Operational High
</span></span><span class="line"><span class="ln">4</span><span class="cl">- Schema 是 High、但 CRDB 提供 PostgreSQL wire protocol 兼容
</span></span><span class="line"><span class="ln">5</span><span class="cl">- Paradigm 是 High、是 *單機 → 分散式* 的根本轉變、讀者最關心
</span></span><span class="line"><span class="ln">6</span><span class="cl">- Operational 是 High、但很大程度是 Paradigm 的 downstream
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl">→ 主結構選 Paradigm（Type E）、Schema + Operational 抽獨立段補充</span></span></code></pre></div><p>不強迫單一 type 標籤 — 本文是 <em>Type E 為主 + Type A / C 高維度增補</em> 的 multi-axis 形態。</p>
<h2 id="結構-differentiatortype-e-主結構--多軸增補段">結構 differentiator：Type E 主結構 + 多軸增補段</h2>
<p>跟前批 5 個 migration playbook 對照：</p>
<table>
  <thead>
      <tr>
          <th>結構元素</th>
          <th>Type A Splunk → Elastic</th>
          <th>Type B Redis → DragonflyDB</th>
          <th>Type C PostgreSQL → Aurora</th>
          <th>Type D Datadog → Grafana</th>
          <th>Type E Kafka ↔ NATS</th>
          <th><strong>本文（三維 High）</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Phased translation</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>partial</td>
      </tr>
      <tr>
          <td>Compatibility audit</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Operational redesign 對位</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td><strong>yes（獨立段）</strong></td>
      </tr>
      <tr>
          <td>Schema gap 對位</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td><strong>yes（獨立段）</strong></td>
      </tr>
      <tr>
          <td>Parallel streams</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Paradigm contrast</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Application 重設計</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>混合架構 long-term</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>partial（部分 workload）</td>
      </tr>
  </tbody>
</table>
<p>本文是「Type E 為主 + Type A schema gap 段 + Type C operational redesign 段」混合形態、9-10 章節、260-300 行。</p>
<h2 id="維度-1paradigm-shift主導">維度 1：Paradigm shift（主導）</h2>
<p>CRDB 是 <em>distributed SQL DB</em>、不是「PostgreSQL 多節點版」。核心差異：</p>
<table>
  <thead>
      <tr>
          <th>概念</th>
          <th>PostgreSQL</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transaction isolation</td>
          <td>MVCC、Read Committed default</td>
          <td>Serializable Snapshot Isolation (SSI)、強一致</td>
      </tr>
      <tr>
          <td>Transaction conflict</td>
          <td>First writer wins</td>
          <td>Retry-on-conflict、application 必須處理 <code>40001</code> retry code</td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Streaming replication + standby</td>
          <td>Raft consensus、每筆寫 quorum + 自動 rebalance</td>
      </tr>
      <tr>
          <td>Partition</td>
          <td>Declarative partitioning（手動）</td>
          <td>Automatic range-based + locality-aware</td>
      </tr>
      <tr>
          <td>Latency p99</td>
          <td>1-10ms（單 region）</td>
          <td>5-50ms（cross-AZ Raft quorum）</td>
      </tr>
      <tr>
          <td>Throughput limit</td>
          <td>單 primary 上限 ~10-50K TPS</td>
          <td>Linear scale by adding node、~5K TPS / node</td>
      </tr>
  </tbody>
</table>
<p>關鍵 paradigm 改變：<em>transaction 是 retry-able 操作、不是 atomic guaranteed</em>。所有 transaction code 需要包 retry loop（CRDB 提供 <code>cockroach_restart</code> savepoint）。</p>
<h2 id="維度-2schema-gappostgresql-features-crdb-不支援">維度 2：Schema gap（PostgreSQL features CRDB 不支援）</h2>
<p>CRDB 號稱 PostgreSQL-compatible、但 <em>covergence rate 80-90%</em>；常見 gap：</p>
<table>
  <thead>
      <tr>
          <th>PostgreSQL feature</th>
          <th>CRDB 狀態</th>
          <th>影響</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stored procedure / function (PL/pgSQL)</td>
          <td>Limited（CRDB 22.2+ 部分支援）</td>
          <td>Migration scope 內必須 audit + 改寫</td>
      </tr>
      <tr>
          <td>Common Table Expression (CTE) recursive</td>
          <td>Limited (depth + structure)</td>
          <td>複雜 CTE 可能跑不通、必須 query refactor</td>
      </tr>
      <tr>
          <td>Window function 全集</td>
          <td>Partial</td>
          <td>報表 query 需逐 case 驗證</td>
      </tr>
      <tr>
          <td>Extensions (pg_repack / pgaudit / TimescaleDB)</td>
          <td><strong>不支援</strong></td>
          <td>用 CRDB 自家 alternative 或自管 application 層</td>
      </tr>
      <tr>
          <td>Triggers</td>
          <td>Limited</td>
          <td>Audit / data integrity 邏輯遷到 application 層</td>
      </tr>
      <tr>
          <td>Custom types / domain</td>
          <td>Partial</td>
          <td>用 CHECK constraint 替代</td>
      </tr>
      <tr>
          <td>Geographic types (PostGIS)</td>
          <td>CRDB native geo support（語法不同）</td>
          <td>Spatial query 改寫</td>
      </tr>
      <tr>
          <td><code>SELECT FOR UPDATE</code> semantics</td>
          <td>對等但底層機制不同（distributed lock）</td>
          <td>注意 deadlock pattern 差異</td>
      </tr>
      <tr>
          <td>Advisory locks</td>
          <td><strong>不支援</strong></td>
          <td>Application 端用其他 distributed lock（Redis / Consul）</td>
      </tr>
  </tbody>
</table>
<p>Migration 必須 <em>先 audit 完整 SQL feature 使用</em>、列出 gap、評估解法或退役。</p>
<h2 id="維度-3operational-redesign">維度 3：Operational redesign</h2>
<p>CRDB operational model 完全不同：</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>PostgreSQL self-managed</th>
          <th>CRDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Patroni / Stolon + manual</td>
          <td><code>cockroach init</code> + 自動 Raft formation</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Patroni + DCS + watchdog</td>
          <td>內建 Raft、無 single primary</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Patroni-managed、15-60s</td>
          <td>透明 Raft re-election、&lt; 5s</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest + WAL archive</td>
          <td><code>BACKUP TO</code> (incremental + full)</td>
      </tr>
      <tr>
          <td>Restore</td>
          <td><code>pgBackRest restore</code> + PITR</td>
          <td><code>RESTORE FROM</code></td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Streaming + logical</td>
          <td>Built-in、無 logical replication 對等概念</td>
      </tr>
      <tr>
          <td>Schema migration</td>
          <td><code>pg_dump</code> / Flyway / Liquibase</td>
          <td><code>cockroach sql</code> + online schema change（無 lock）</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>pg_stat_* views + Prometheus exporter</td>
          <td>CRDB admin UI + Prometheus（schema 不同）</td>
      </tr>
      <tr>
          <td>Sizing</td>
          <td>Vertical scale（單 node big spec）</td>
          <td>Horizontal scale（多 node 小 spec）</td>
      </tr>
  </tbody>
</table>
<p>SRE 心智模型完全重訓：<em>無 primary 概念 / 無 streaming lag 概念 / 無 standby promote 概念</em>。</p>
<h2 id="migration-流程混合形態">Migration 流程（混合形態）</h2>
<p>不是線性 phased、是 <em>phased + parallel + partial</em> 混合：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Phase 0: scope 判讀
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  - 列 application、區分「適合 CRDB」vs「保留 PostgreSQL」
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  - SQL feature audit
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  - Application transaction pattern audit
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Phase 1: schema port + application 改寫
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  - DDL 轉成 CRDB syntax
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  - 不支援 extension 找 alternative
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  - Application transaction code 加 retry loop
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl">Phase 2: 雙寫期（部分 application 開始走 CRDB）
</span></span><span class="line"><span class="ln">12</span><span class="cl">  - 新 application 走 CRDB
</span></span><span class="line"><span class="ln">13</span><span class="cl">  - 舊 application 持續 PostgreSQL
</span></span><span class="line"><span class="ln">14</span><span class="cl">  - CDC bridge（Debezium → Kafka → CRDB consumer）
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">Phase 3: cutover 適合的 application
</span></span><span class="line"><span class="ln">17</span><span class="cl">  - 每個 application 獨立 cutover
</span></span><span class="line"><span class="ln">18</span><span class="cl">  - 不是「全 DB 一次切」
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">Phase 4: 長期混合架構
</span></span><span class="line"><span class="ln">21</span><span class="cl">  - 某些 workload 永遠保留 PostgreSQL（不適合分散式）
</span></span><span class="line"><span class="ln">22</span><span class="cl">  - CRDB 跑 distributed 適配 workload</span></span></code></pre></div><p>整體 3-6 個月、不收斂到全 CRDB。</p>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1transaction-retry-沒處理application-大量-40001-error">Case 1：Transaction retry 沒處理、application 大量 <code>40001</code> error</h3>
<p><strong>徵兆</strong>：cutover 後 application 5-10% transaction 報 <code>restart transaction: TransactionRetryWithProtoRefreshError</code>、業務 fail。</p>
<p><strong>根因</strong>：PostgreSQL Read Committed 不要求 application 處理 conflict、CRDB Serializable Isolation 必須 <em>retry-on-conflict</em>；application code 沒 retry loop。</p>
<p><strong>修法</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// CRDB transaction with retry</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">for</span> <span class="nx">retries</span> <span class="o">:=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">retries</span> <span class="p">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="nx">retries</span><span class="o">++</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="nx">tx</span><span class="p">,</span> <span class="nx">_</span> <span class="o">:=</span> <span class="nx">db</span><span class="p">.</span><span class="nf">Begin</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="c1">// ... transaction logic ...</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="nx">err</span> <span class="o">:=</span> <span class="nx">tx</span><span class="p">.</span><span class="nf">Commit</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="k">if</span> <span class="nx">err</span> <span class="o">!=</span> <span class="kc">nil</span> <span class="o">&amp;&amp;</span> <span class="nx">strings</span><span class="p">.</span><span class="nf">Contains</span><span class="p">(</span><span class="nx">err</span><span class="p">.</span><span class="nf">Error</span><span class="p">(),</span> <span class="s">&#34;40001&#34;</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        <span class="nx">time</span><span class="p">.</span><span class="nf">Sleep</span><span class="p">(</span><span class="nf">backoff</span><span class="p">(</span><span class="nx">retries</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        <span class="k">continue</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="k">break</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>framework-level：用 CRDB-provided client lib（go-cockroachdb / crdb-jdbc）有 retry helper。</p>
<h3 id="case-2extension-缺位application-feature-整段掉">Case 2：Extension 缺位、application feature 整段掉</h3>
<p><strong>徵兆</strong>：cutover 後 application 某個地理計算功能直接報錯、PostGIS 函數不存在；migrate 計畫漏看。</p>
<p><strong>根因</strong>：CRDB native geo 不同 syntax / API、PostGIS extension 不能直接搬。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration 必跑 extension audit</strong>：列所有 <code>pg_extension</code>、找對應 CRDB feature 或退役</li>
<li><strong>PostGIS 替代</strong>：CRDB native ST_* functions、部分 syntax 對齊但 spatial index 不同</li>
<li><strong>退役不能換的 feature</strong>：評估保留 PostgreSQL（混合架構）</li>
</ol>
<h3 id="case-3sequential-pk-撞-raft-quorum-瓶頸">Case 3：Sequential PK 撞 Raft quorum 瓶頸</h3>
<p><strong>徵兆</strong>：cutover 後寫入吞吐量 / latency 不如預期、CRDB cluster CPU &lt; 30% 但 write latency p99 high。</p>
<p><strong>根因</strong>：application 用 <code>AUTO_INCREMENT</code> / <code>SERIAL</code> 連續 PK；CRDB 把連續 key 放 <em>同一 range</em> / 同一 Raft group、寫入串行化、無法平行 scale。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>改 UUID v7 / <code>unique_rowid()</code></strong>：時序排序但散佈跨 range、自動 partition by hash</li>
<li><strong><code>PRIMARY KEY (region, id)</code></strong>：multi-region 場景 multi-tenancy 自然拆分</li>
<li><strong>不適合的 workload 留 PostgreSQL</strong>：不是所有 schema 都適合 distributed</li>
</ol>
<h3 id="case-4long-transaction-對-raft-衝擊">Case 4：Long transaction 對 Raft 衝擊</h3>
<p><strong>徵兆</strong>：跨 1 分鐘+ 的 transaction（batch processing / 大 ETL）大量 retry、最後失敗；同期間其他短 transaction 也 retry rate 上升。</p>
<p><strong>根因</strong>：CRDB long transaction holds intent on touched ranges、阻塞其他 transaction；SSI conflict 機率隨 transaction 時間平方增長。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Long transaction 拆短</strong>：batch 用多個 short transaction、checkpoint 在 application 層</li>
<li><strong>Heavy ETL 不跑 CRDB</strong>：用 CRDB CDC export 到 OLAP（Snowflake / BigQuery）跑 batch</li>
<li><strong>Read-only long transaction 用 follower read</strong>：<code>AS OF SYSTEM TIME</code> 不 hold intent、適合 reporting</li>
</ol>
<h3 id="case-5backup--restore-行為跟-postgresql-不同sre-runbook-失效">Case 5：Backup / restore 行為跟 PostgreSQL 不同、SRE runbook 失效</h3>
<p><strong>徵兆</strong>：DBA 嘗試 <code>pg_restore</code> 失敗、CRDB 端 backup format 完全不同；incident response 卡關 1-2 小時。</p>
<p><strong>根因</strong>：CRDB backup 是 <em>cluster-internal format</em>、不能用 PostgreSQL tooling；SRE runbook 仍是 PostgreSQL world、應急時心智模型錯位。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Runbook 重寫</strong>：CRDB-specific backup / restore 流程、SRE training</li>
<li><strong>DR drill</strong>：cutover 前跑完整 DR drill、用 CRDB tooling 完成、不依賴 PostgreSQL 經驗</li>
<li><strong>Multi-region backup</strong>：CRDB 跨 region backup 配置、避免單 region 故障</li>
</ol>
<h2 id="capacity-規劃">Capacity 規劃</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>PostgreSQL self-managed</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-node 上限</td>
          <td>~10-50K TPS（vertical scale 到 32-128 vCPU）</td>
          <td>~5K TPS / node（horizontal scale by adding node）</td>
      </tr>
      <tr>
          <td>跨 region</td>
          <td>高 latency 跨區 streaming</td>
          <td>設計 native、Locality-aware queries</td>
      </tr>
      <tr>
          <td>Sharding</td>
          <td>手動 partition / pg_partman</td>
          <td>自動 range-based</td>
      </tr>
      <tr>
          <td>Storage / TPS ratio</td>
          <td>不變</td>
          <td>Storage 跨 node 3x（Raft quorum 3-replica default）</td>
      </tr>
      <tr>
          <td>Total cost (10TB)</td>
          <td>$2-4K USD / month（self-managed）</td>
          <td>$5-10K USD / month（CRDB Cloud + 3x storage）</td>
      </tr>
  </tbody>
</table>
<p><strong>判讀</strong>：CRDB cost 顯著高、選 CRDB 必須是 <em>paradigm 需求</em>（distributed transaction / multi-region / linear scale）；單純成本 / availability 改善走 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">Aurora</a> 更划算。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-postgresql--aurora-migration-對比">跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a> 對比</h3>
<p>兩條 PostgreSQL 出路：</p>
<ul>
<li><strong>Aurora</strong>：operational simplification、protocol drop-in、cost 中等漲；適合 <em>不需 distributed transaction</em> 的 production</li>
<li><strong>CRDB</strong>：distributed paradigm shift、application 必須改、cost 顯著漲；適合 <em>真的需要 distributed</em> 的 workload</li>
</ul>
<p>多數 application 不需要 distributed transaction、Aurora 更合理；真正需要 cross-region 強一致 / linear scale by adding node 才走 CRDB。</p>
<h3 id="跟-application-transaction-pattern-重設計">跟 application transaction pattern 重設計</h3>
<p>CRDB 強制 application 改 transaction code、retry loop 必加。團隊心智模型轉換是 migration 主要 effort、技術部分相對少。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>CRDB → PostgreSQL reverse migration</strong>：當業務 simplify 後 distributed 不必要、reverse migration cost 高、實務上 CRDB 是 <em>single-direction lock-in</em></li>
<li><strong>CRDB Serverless</strong>：cost 起點低、burst workload 適合；steady workload 仍是 dedicated cluster</li>
<li><strong>Multi-region active-active</strong>：CRDB 真正強項、但網路成本爆、僅金融 / 政府客戶 ROI 合理</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>Source / target vendor：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> / <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a></li>
<li>對位 migration：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a>（另一條 PostgreSQL 出路）</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content 結構由最大差異維度決定</a>（本文驗證 <em>多重歸類 multi-axis 處理</em>）</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Partition Redesign：當 monthly partition 越跑越慢</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。對應 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F「Topology re-layout」&lt;/a> 第 2 個 dogfood（第 1 個是 &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding&lt;/a>）— 驗證 Type F anatomy 在不同 vendor 上的通用性。&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼-monthly-partition-越跑越慢">為什麼 monthly partition 越跑越慢&lt;/h2>
&lt;p>上線時 monthly range partition 設計很合理 — 每月一個 partition、12 個月一年、partition_pruning 在 &lt;code>WHERE event_time &amp;gt;= '2026-05-01'&lt;/code> 時跑單 partition、查詢快。但業務跑了 18 個月後：&lt;/p>
&lt;ul>
&lt;li>每月 partition size 從 50GB 漲到 500GB（流量 10x）&lt;/li>
&lt;li>單月查詢 &lt;code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'&lt;/code> 仍掃整月 500GB（partition_pruning 粒度只到 month）&lt;/li>
&lt;li>Vacuum 一個月 partition 需要 6-8 小時、跑不進 maintenance window&lt;/li>
&lt;li>DROP 老 partition 釋放 storage 是 monthly cadence、但 retention policy 要求 daily granularity&lt;/li>
&lt;/ul>
&lt;p>partition 設計需要 &lt;em>redesign&lt;/em>、不是「optimize」 — 從 monthly range partition 改成 daily range partition、partition 數量從 36 個（3 年 retention）變 1095 個。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。對應 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F「Topology re-layout」</a> 第 2 個 dogfood（第 1 個是 <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding</a>）— 驗證 Type F anatomy 在不同 vendor 上的通用性。</p></blockquote>
<h2 id="為什麼-monthly-partition-越跑越慢">為什麼 monthly partition 越跑越慢</h2>
<p>上線時 monthly range partition 設計很合理 — 每月一個 partition、12 個月一年、partition_pruning 在 <code>WHERE event_time &gt;= '2026-05-01'</code> 時跑單 partition、查詢快。但業務跑了 18 個月後：</p>
<ul>
<li>每月 partition size 從 50GB 漲到 500GB（流量 10x）</li>
<li>單月查詢 <code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'</code> 仍掃整月 500GB（partition_pruning 粒度只到 month）</li>
<li>Vacuum 一個月 partition 需要 6-8 小時、跑不進 maintenance window</li>
<li>DROP 老 partition 釋放 storage 是 monthly cadence、但 retention policy 要求 daily granularity</li>
</ul>
<p>partition 設計需要 <em>redesign</em>、不是「optimize」 — 從 monthly range partition 改成 daily range partition、partition 數量從 36 個（3 年 retention）變 1095 個。</p>
<p><a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> 結果：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評估</th>
          <th>等級</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>同 PostgreSQL、同 table 定義、partition key 不變</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>同 PostgreSQL operational stack</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>同 OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>同 1 個 DB</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>不改（partition_pruning 透明）</td>
          <td>Low</td>
      </tr>
      <tr>
          <td><strong>Data topology</strong></td>
          <td><strong>Partition strategy 從 monthly → daily</strong></td>
          <td><strong>High</strong></td>
      </tr>
  </tbody>
</table>
<p>6 維皆 Low + topology High = <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">Type F「Topology re-layout」</a>。</p>
<h2 id="pre-layout-analysispartition-不平衡偵測">Pre-layout analysis：partition 不平衡偵測</h2>
<p>執行 redesign 前必須先量化當前 topology：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. 每 partition size + row count
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">partition_name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">size</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">reltuples</span><span class="p">::</span><span class="nb">bigint</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">estimated_rows</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="n">pg_stat_get_last_vacuum_time</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">last_vacuum</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_inherits</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">parent</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhparent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">child</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhrelid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;events&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 2. partition_pruning 命中率
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2026-05-15&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="c1">-- 期望: 只 scan 1 partition (target: daily) 或 1 partition (current: monthly)
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1">-- 觀察: monthly 設計下、即使 query 只跨 15 天、planner 仍 scan 整月 partition (~500GB)
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="c1">-- 3. 找 partition imbalance
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">  </span><span class="n">to_char</span><span class="p">(</span><span class="n">event_time</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY-MM&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">row_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="c1">-- 找 hot month / cold month、判斷 redesign 後分佈</span></span></span></code></pre></div><p>Pre-layout 階段的 output：</p>
<ul>
<li><strong>當前 topology 量化</strong>：36 monthly partition、總 size 1.8TB、最大 partition 500GB、最小 50GB</li>
<li><strong>Hot key 分佈</strong>：80% 流量集中最近 3 個月</li>
<li><strong>Redesign 目標</strong>：daily partition、最近 3 個月 hot daily / 3 個月 + 之前 cold weekly / 1 年 + 之前 monthly（sub-partition strategy）</li>
<li><strong>Migration scope</strong>：1095 個 partition 不直接全建、按 retention policy 階段性</li>
</ul>
<h2 id="re-layout-機制attach--detach-線上重劃">Re-layout 機制：ATTACH / DETACH 線上重劃</h2>
<p>PostgreSQL 不支援「直接改 partition strategy」、必須走 <em>新 partition tree + 資料搬遷</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. 建新 daily partition table (parallel to events)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">event_time</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- 2. 預建未來 90 天 daily partition
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="n">format</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="s1">&#39;CREATE TABLE events_daily_%s PARTITION OF events_daily FOR VALUES FROM (%L) TO (%L)&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">to_char</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY_MM_DD&#39;</span><span class="p">),</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">generate_series</span><span class="p">(</span><span class="k">current_date</span><span class="p">,</span><span class="w"> </span><span class="k">current_date</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;90 days&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">d</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- 3. dual-write phase: application 同寫 events + events_daily
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1">-- (用 trigger 或 application-side)
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">()</span><span class="w"> </span><span class="k">RETURNS</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="err">$$</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">BEGIN</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="k">NEW</span><span class="p">.</span><span class="o">*</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">  </span><span class="k">RETURN</span><span class="w"> </span><span class="k">NEW</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"></span><span class="k">END</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"></span><span class="err">$$</span><span class="w"> </span><span class="k">LANGUAGE</span><span class="w"> </span><span class="n">plpgsql</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">AFTER</span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="k">FOR</span><span class="w"> </span><span class="k">EACH</span><span class="w"> </span><span class="k">ROW</span><span class="w"> </span><span class="k">EXECUTE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="w"></span><span class="c1">-- 4. backfill historical data per partition
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-05-02&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w"></span><span class="c1">-- ... 每天跑一個 day partition、avoid long transaction
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w"></span><span class="c1">-- 5. cutover: rename swap
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="w"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="w"></span><span class="k">COMMIT</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="w"></span><span class="c1">-- 6. 觀察 1-2 週、DROP events_old</span></span></span></code></pre></div><p>關鍵：rename swap 是 <em>single transaction</em>、cutover 瞬間發生；application connection 不需重連、但 prepared statement cache 可能要刷新。</p>
<h2 id="execution-flow-per-step">Execution flow per-step</h2>
<p>5 段、每段含 rollback boundary：</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>動作</th>
          <th>Rollback boundary</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 預建 partition</td>
          <td>建 events_daily + 90 天 partition、不影響 production</td>
          <td>DROP events_daily、無 impact</td>
      </tr>
      <tr>
          <td>2 Dual-write</td>
          <td>加 trigger 同寫兩端、observe diff</td>
          <td>DROP trigger、events_daily 留作 cleanup</td>
      </tr>
      <tr>
          <td>3 Backfill</td>
          <td>逐日 backfill 歷史資料、用 CHECK constraint 確保完整性</td>
          <td>DROP backfilled partition、不影響 source events</td>
      </tr>
      <tr>
          <td>4 Verify</td>
          <td>對 sample query 跑 events vs events_daily、確認 row count 一致</td>
          <td>仍在 dual-write、發現 diff 可暫停 cutover</td>
      </tr>
      <tr>
          <td>5 Cutover</td>
          <td>Rename swap</td>
          <td><strong>不可逆</strong>、回退需 reverse rename + dual-write restart</td>
      </tr>
  </tbody>
</table>
<p>Step 5 是不可逆邊界、應該排在 <em>低流量 maintenance window</em> 跑、且 cutover 前必須有 backup checkpoint。</p>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1backfill-期間-long-transaction-阻塞-vacuum">Case 1：Backfill 期間 long transaction 阻塞 vacuum</h3>
<p><strong>徵兆</strong>：backfill 跑 6 小時的 <code>INSERT INTO events_daily SELECT * FROM events WHERE ...</code>、期間 events 表的 autovacuum 完全不跑、dead tuple 累積、production query 變慢。</p>
<p><strong>根因</strong>：PostgreSQL transaction 期間 <em>xmin horizon 鎖死</em>、vacuum 只能回收「不會被任何 active transaction 看到」的 dead tuple；long backfill = long open transaction、vacuum 失效。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>拆 batch INSERT</strong>：每日 backfill 拆成 small batch（10 萬 row 一個 transaction）、每個 commit 釋放 xmin</li>
<li><strong>用 COPY 不用 INSERT</strong>：<code>COPY events_daily FROM (SELECT * FROM events WHERE ...)</code> 是 PG 對 batch 最快 + 對 vacuum 影響小</li>
<li><strong>Backfill 跑在 standby</strong>：用 logical replication 從 standby 拉資料、不在 primary 跑長 transaction</li>
</ol>
<h3 id="case-2trigger-dual-write-對-application-造成-latency">Case 2：Trigger dual-write 對 application 造成 latency</h3>
<p><strong>徵兆</strong>：加 trigger 後 application 寫入 latency p99 從 5ms 漲到 25-50ms；high-throughput batch job 直接 timeout。</p>
<p><strong>根因</strong>：每筆 INSERT 都觸發 trigger function 跑一次 INSERT 到 events_daily、IO 雙倍、index 也雙倍維護。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>改 application-side dual-write</strong>：application code 顯式寫兩端、用 connection pool batch 攤平 IO</li>
<li><strong>用 logical replication slot</strong>：events → events_daily 用 logical replication 取代 trigger、降 IO 衝擊</li>
<li><strong>dual-write 時間最小化</strong>：trigger 只在 backfill + verify 期間打開、cutover 前關掉</li>
</ol>
<h3 id="case-3partition_pruning-沒命中planner-仍掃所有-partition">Case 3：Partition_pruning 沒命中、planner 仍掃所有 partition</h3>
<p><strong>徵兆</strong>：cutover 完成後、application 端某些 query latency 從 200ms 跳到 5000ms；EXPLAIN 顯示 <code>Append</code> 下面所有 1095 個 partition 都被 scan。</p>
<p><strong>根因</strong>：partition 數量爆到 1000+、planner planning_time 對某些 query 變長（含 prepared statement 沒帶 partition key bound）；或 query 用了 <code>WHERE event_time = some_function(now())</code>、planning-time pruning 不觸發。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong><code>enable_partition_pruning = on</code></strong> 預設、確認沒被 disable</li>
<li><strong>PG 11+ runtime pruning</strong>：prepared statement 用 generic plan、runtime pruning 補位</li>
<li><strong>Sub-partition strategy</strong>：1095 個 daily 太多、改 <em>最近 90 天 daily / 之前 monthly</em> 混合 strategy、減 partition count</li>
<li><strong>Planner statistics</strong>：跑 <code>ANALYZE</code> 重建 statistics、partition 樹太大時 planner 需新 stats</li>
</ol>
<h3 id="case-4constraint-exclusion-失敗跨-partition-unique-不-enforce">Case 4：Constraint exclusion 失敗、跨 partition unique 不 enforce</h3>
<p><strong>徵兆</strong>：cutover 後發現某 user 的 event 在多個 partition 都有、unique constraint <code>(user_id, event_id)</code> 沒 enforce；data audit 抓到 duplicate。</p>
<p><strong>根因</strong>：PostgreSQL partition table 的 <code>UNIQUE</code> constraint <em>必須包含 partition key</em>；本來 monthly partition 下 <code>UNIQUE (user_id, event_id)</code> 加上 <code>event_time</code>（partition key）變 <code>UNIQUE (user_id, event_id, event_time)</code>、實際語意是「同月同 user 同 event_id 唯一」；改 daily 後變「同日同 user 同 event_id 唯一」— unique scope 從月變天、原本月內跨日 dedup 失效。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-redesign</strong>：明示 unique constraint 的 <em>時間 scope</em>、redesign 後 scope 縮小是否可接受</li>
<li><strong>Application-side dedup</strong>：跨 partition 唯一性走 application 層 lookup（用 Redis SETEX 暫存 key）</li>
<li><strong>退到 non-partitioned dedup 表</strong>：建獨立 user_events_dedup 表、application 寫入前先 lookup</li>
</ol>
<h3 id="case-5drop-老-partition-太頻繁shared_buffers-cache-miss-爆">Case 5：DROP 老 partition 太頻繁、shared_buffers cache miss 爆</h3>
<p><strong>徵兆</strong>：daily partition 上線後、每天凌晨 cron DROP <code>events_2025_05_18</code>（90 天前）；DROP 後 shared_buffers 大量 invalidate、application 端 query latency p99 從 10ms 跳到 100-200ms 持續 30 分鐘。</p>
<p><strong>根因</strong>：PostgreSQL shared_buffers cache 對被 DROP 表的 page 全部 invalidate；DROP 大 partition（10GB+）後 cache hit rate 從 99% 掉到 60%、application 等 disk IO。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>DROP 跑在 off-peak</strong>：凌晨 3-4 點 cron、避開業務高峰</li>
<li><strong>預熱 next partition</strong>：DROP 前用 <code>pg_prewarm</code> 主動 load 熱 partition 進 cache</li>
<li><strong>改 DETACH + DROP TABLE delayed</strong>：DETACH 是 fast、DROP TABLE 排到 weekly batch、降頻率</li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Monthly partition (current)</th>
          <th>Daily partition (target)</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Partition count</td>
          <td>36 (3 年 retention)</td>
          <td>1095 (3 年 retention)</td>
          <td>30x partition count、planner cost 略升</td>
      </tr>
      <tr>
          <td>Single partition size</td>
          <td>50-500GB</td>
          <td>1-20GB</td>
          <td>Daily 更易 vacuum</td>
      </tr>
      <tr>
          <td>DROP old data</td>
          <td>Monthly cadence</td>
          <td>Daily cadence</td>
          <td>更細 retention 控制</td>
      </tr>
      <tr>
          <td>Query latency</td>
          <td>跨 partition 多時 50-200ms</td>
          <td>跨 partition 少時 5-50ms</td>
          <td>Daily 多數 query 更快</td>
      </tr>
      <tr>
          <td>Planning time</td>
          <td>5-10ms</td>
          <td>50-100ms (對 generic plan)</td>
          <td>Planning overhead + 1 order</td>
      </tr>
      <tr>
          <td>Maintenance window</td>
          <td>Vacuum 1 partition 6 小時</td>
          <td>Vacuum 1 partition 5-30 分鐘</td>
          <td>維護視窗更小、可日跑</td>
      </tr>
  </tbody>
</table>
<p><strong>判讀</strong>：daily partition 適合 <em>高流量 + 跨日查詢多 + retention 細的場景</em>；超大 partition (TB 級單日) 仍要 sub-partition 拆。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-autovacuum-tuning-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a> 整合</h3>
<p>Daily partition 後 autovacuum 行為：</p>
<ul>
<li>每 daily partition 獨立 autovacuum、scale_factor + threshold per-partition tuning</li>
<li><code>autovacuum_max_workers</code> 要從 3 拉到 6-10（partition 數爆）</li>
<li>Cold partition (&gt; 30 天) <code>autovacuum_enabled = false</code>、不浪費 CPU</li>
</ul>
<h3 id="跟-patroni-ha-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> 整合</h3>
<p>Failover 期間 partition migration 不能跑、必須在 stable cluster state 執行；Patroni promote 後重新評估 partition health。</p>
<h3 id="跟-logical-replication--debezium-整合">跟 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> 整合</h3>
<p><code>publish_via_partition_root = true</code> 讓 publication 從 parent 角度看；CDC consumer 不需要對每個 partition 設 subscription。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>跨 daily partition 的 archive strategy</strong>：archive 到 S3 cold storage、daily granularity 給更細 retention 控制</li>
<li><strong>pg_partman extension</strong>：自動建 daily partition、不用 cron；但要先確認 Aurora / RDS 支援</li>
<li><strong>Sub-partitioning</strong>：未來流量爆時用「daily by time + list by tenant」雙軸 partition</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>（partition 基礎）/ <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a></li>
<li>平行 Type F dogfood：<a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis Cluster Re-sharding</a>（dogfood #1）/ <a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">MongoDB Shard + Multi-DC</a>（dogfood #3、F-multi-region sub-type）</li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content 結構由最大差異維度決定</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Data topology 是第 6 audit 維度</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Multi-Region GDPR Rollout：政策驅動的 migration 屬本 methodology 嗎</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/multi-region-gdpr-rollout/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/multi-region-gdpr-rollout/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。同時是 &lt;a href="https://tarrragon.github.io/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation&lt;/a> 第 1 點「6 維仍可能漏類（identity / consistency / residency 三軸候選）」的 &lt;em>residency 軸驗證&lt;/em>、跟 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration playbook methodology「何時不該套」段&lt;/a> 對「政策合規驅動」是否在 methodology scope 的反思。&lt;/p>&lt;/blockquote>
&lt;h2 id="政策驅動的-migration-屬本-methodology-嗎">政策驅動的 migration 屬本 methodology 嗎&lt;/h2>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> 「何時不該套」段曾把「compliance-driven migration」歸為排除情境、後來改寫為「不在排除範圍 — 法規驅動只是 driver、資料層仍走 type A-E 之一」。本文是該改寫的 &lt;em>正面實證&lt;/em> — GDPR EU residency 強制需求驅動 single-region → multi-region rollout、本文是 &lt;em>政策驅動但仍走 audit + type 對映流程&lt;/em> 的 case study。&lt;/p>
&lt;p>但 reviewer D 在第三輪 audit 提出：residency 不只是 &lt;em>driver&lt;/em>、本身是 &lt;em>cross-cutting constraint&lt;/em>、反向約束 topology + operational + schema；該不該升 &lt;em>獨立 audit 軸&lt;/em>？本文是該議題的 dogfood。&lt;/p>
&lt;h2 id="三層約束driver--topology--contract">三層約束：driver / topology / contract&lt;/h2>
&lt;p>GDPR 對 PostgreSQL multi-region rollout 的影響在三個層次：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。同時是 <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation</a> 第 1 點「6 維仍可能漏類（identity / consistency / residency 三軸候選）」的 <em>residency 軸驗證</em>、跟 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration playbook methodology「何時不該套」段</a> 對「政策合規驅動」是否在 methodology scope 的反思。</p></blockquote>
<h2 id="政策驅動的-migration-屬本-methodology-嗎">政策驅動的 migration 屬本 methodology 嗎</h2>
<p><a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> 「何時不該套」段曾把「compliance-driven migration」歸為排除情境、後來改寫為「不在排除範圍 — 法規驅動只是 driver、資料層仍走 type A-E 之一」。本文是該改寫的 <em>正面實證</em> — GDPR EU residency 強制需求驅動 single-region → multi-region rollout、本文是 <em>政策驅動但仍走 audit + type 對映流程</em> 的 case study。</p>
<p>但 reviewer D 在第三輪 audit 提出：residency 不只是 <em>driver</em>、本身是 <em>cross-cutting constraint</em>、反向約束 topology + operational + schema；該不該升 <em>獨立 audit 軸</em>？本文是該議題的 dogfood。</p>
<h2 id="三層約束driver--topology--contract">三層約束：driver / topology / contract</h2>
<p>GDPR 對 PostgreSQL multi-region rollout 的影響在三個層次：</p>
<ol>
<li><strong>Driver layer</strong>：EU 客戶資料必須 <em>物理上儲存在 EU</em>（GDPR Article 44-49）— 觸發 multi-region migration 的根本理由</li>
<li><strong>Topology layer</strong>：跨 region replication 不能 <em>自由跨 region 複製</em> EU 客戶資料、必須按 GDPR scope 分區；topology 設計受合規約束</li>
<li><strong>Contract layer</strong>：審計能 <em>demonstrate</em> 「EU 資料在 EU」、操作日誌 + replication evidence 必須可追溯；application + ops contract 多出合規 obligation</li>
</ol>
<p>跑 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">6 維 diff dimension audit</a> 對「single us-east → us-east + eu-west」：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評估</th>
          <th>等級</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>同 PostgreSQL、可能加 region column</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>HA / backup / monitoring 跨 region 重設計</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>同 OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>同 PostgreSQL instance + Patroni</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Routing logic by user region、必改</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Single → multi-region replication</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td><strong>Residency contract</strong></td>
          <td><strong>EU 資料禁止離開 EU、log + replication 範圍受約束</strong></td>
          <td><strong>High</strong></td>
      </tr>
  </tbody>
</table>
<p>6 維 audit 抓不到「Residency contract = High」這軸。用既有 6 維歸類、會走 Type F multi-axis（topology + operational + application change 多 High）+ 政策合規補強段；但這個歸類 <em>漏掉合規對 topology / operational / application 的反向約束</em>：</p>
<ul>
<li>Topology layer：6 維只 audit 「topology 是否變動」、漏 audit 「topology 範圍是否受合規約束」</li>
<li>Operational layer：6 維只 audit 「operational 是否重設計」、漏 audit 「audit log / encryption / access control 是否符合合規要求」</li>
<li>Application layer：6 維只 audit 「application code 是否改」、漏 audit 「資料 routing 是否符合 residency rule」</li>
</ul>
<p><strong>Residency 不只是 driver、是 cross-cutting constraint</strong>、會反向約束其他 3-4 維、且帶獨立工作量（合規 evidence collection / DPIA / audit prep）。</p>
<h2 id="residency-axis-是否獨立3-個論據">Residency axis 是否獨立：3 個論據</h2>
<p><strong>Yes、residency 是獨立軸</strong>：</p>
<ol>
<li><strong>可獨立發生</strong>：原本 multi-region setup、新增「PCI 強制信用卡資料只能 us-east」、是 <em>純 residency 變更</em>、其他 6 維皆 Low（topology 不重設計、operational 不重設計、application 加 routing rule 即可）；但 residency 約束 routing + log 範圍</li>
<li><strong>驅動工作量分佈</strong>：本文 multi-region GDPR rollout 工作量分佈：
<ul>
<li>Topology setup（logical replication / region setup）：~25%</li>
<li>Operational redesign（HA / backup / monitoring）：~20%</li>
<li>Application routing change（region detection / data filter）：~15%</li>
<li><strong>Residency compliance（DPIA / audit log / access control / encryption / evidence）：~40%</strong></li>
</ul>
</li>
<li><strong>Cross-cutting nature</strong>：residency 不只影響「資料放哪」、影響：
<ul>
<li>Backup 可不可以 cross-region store（多數 GDPR 不允許）</li>
<li>Audit log 是否包含 EU PII（需 EU 端 log + 跨 region log filter）</li>
<li>Encryption key 是否可 cross-region share（多數情境不允許）</li>
<li>Application access logs 是否含 EU IP / user ID</li>
</ul>
</li>
</ol>
<p><strong>No、residency 可塞 operational + driver</strong>：</p>
<ul>
<li>反論：residency 是 operational 子議題、加 audit + replication scope 規則就好</li>
<li>拒絕：residency 反向約束 topology / application / operational、且帶獨立合規工作量（DPIA / cross-border transfer agreement / data subject rights）；不是單純 operational 子議題</li>
</ul>
<p>實證：本文 migration 工作量 40% 在 compliance、確認 residency 是 <em>獨立工作量主軸</em>。</p>
<h2 id="結構type-f-multi-axis--residency-compliance-獨立段">結構：Type F multi-axis + residency compliance 獨立段</h2>
<p>本文結構是 <em>Type F 為主</em>（topology high + operational high）+ <em>residency compliance 獨立段</em>（不在 6 維任一個）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. 政策驅動的 migration 屬本 methodology 嗎（meta-reflection 開頭）
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. 三層約束：driver / topology / contract
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Residency axis 是否獨立的論據
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. 結構 differentiator（Type F multi-axis + residency compliance 段）
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. EU residency 對 topology / operational / application 的反向約束
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. Migration 流程（含 DPIA 跟 evidence collection 階段）
</span></span><span class="line"><span class="ln">7</span><span class="cl">7. Production 故障演練
</span></span><span class="line"><span class="ln">8</span><span class="cl">8. Capacity / cost（含合規 audit cost）
</span></span><span class="line"><span class="ln">9</span><span class="cl">9. 整合 / 下一步</span></span></code></pre></div><p>9 章節、240-270 行。比標準 Type F 多 1 段（residency compliance）+ 1 段（meta-reflection）。</p>
<h2 id="eu-residency-對其他維度的反向約束">EU residency 對其他維度的反向約束</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Residency rule → Topology constraint:
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">- EU customer data 不能 replicate to us-east
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">- Backup of EU table 不能 store in non-EU region
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">- Logical replication subscriber 在 us-east 必須 filter out EU data
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Residency rule → Operational constraint:
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">- Cross-region monitoring 不能 export EU PII to global SaaS (Datadog)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">- Audit log 含 EU user_id 必須 store 在 EU
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">- Encryption key (KMS) 不能 share 跨 region（EU 端用 EU KMS）
</span></span><span class="line"><span class="ln">10</span><span class="cl">- DBA / SRE access EU data 必須 from EU jurisdiction + 記 audit trail
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">Residency rule → Application constraint:
</span></span><span class="line"><span class="ln">13</span><span class="cl">- Application 必須 detect user region + route 對應 DB endpoint
</span></span><span class="line"><span class="ln">14</span><span class="cl">- Cross-region join / aggregate 對 EU user 必須走 EU 端 query
</span></span><span class="line"><span class="ln">15</span><span class="cl">- Data export feature 必須 reject 跨 region export request</span></span></code></pre></div><p>每條反向約束都是 <em>新工作量</em>、不在 6 維 audit 內。</p>
<h2 id="migration-流程含-dpia--evidence-collection">Migration 流程（含 DPIA + evidence collection）</h2>
<p>10 step、跨 5 個月：</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Step</th>
          <th>對應 6 維 / 合規</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0 Pre-migration</td>
          <td>1. DPIA（Data Protection Impact Assessment）</td>
          <td>Compliance pre-requisite</td>
      </tr>
      <tr>
          <td>0</td>
          <td>2. 法務 review 跨境傳輸 agreement</td>
          <td>Compliance</td>
      </tr>
      <tr>
          <td>1 Setup</td>
          <td>3. EU PostgreSQL cluster build + Patroni</td>
          <td>Operational + Topology</td>
      </tr>
      <tr>
          <td>1</td>
          <td>4. EU KMS + audit log + monitoring stack</td>
          <td>Operational + Residency</td>
      </tr>
      <tr>
          <td>2 Data</td>
          <td>5. Logical replication 設 filter（exclude EU table from us-east）</td>
          <td>Topology + Residency</td>
      </tr>
      <tr>
          <td>2</td>
          <td>6. Initial sync EU table 到 EU cluster</td>
          <td>Topology</td>
      </tr>
      <tr>
          <td>3 App</td>
          <td>7. Application 端加 region detection + routing</td>
          <td>Application change</td>
      </tr>
      <tr>
          <td>3</td>
          <td>8. Cross-region query banning（cross-region join 拒絕 EU table）</td>
          <td>Application + Residency</td>
      </tr>
      <tr>
          <td>4 Verify</td>
          <td>9. Compliance audit + evidence package</td>
          <td>Residency</td>
      </tr>
      <tr>
          <td>4</td>
          <td>10. DPO sign-off + DR drill</td>
          <td>Residency + Operational</td>
      </tr>
  </tbody>
</table>
<p>Step 1 + 9 + 10 是 <em>residency-specific</em>、不在既有 6 維內。</p>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1replication-filter-漏-tableeu-資料-leak-到-us-east">Case 1：Replication filter 漏 table、EU 資料 leak 到 us-east</h3>
<p><strong>徵兆</strong>：6 個月後 internal audit 發現 us-east 端 <code>customers</code> table 含 EU 客戶資料；replication filter 設定漏改、新加的 <code>eu_customer_extensions</code> table 被自動 replicate 到 us-east。</p>
<p><strong>根因</strong>：PostgreSQL logical replication publication 預設 <code>FOR ALL TABLES</code>、新加的 table 自動納入；應該明示 <code>FOR TABLE list...</code> 並 GDPR review。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Publication 改 explicit table list</strong>：<code>CREATE PUBLICATION xxx FOR TABLE users, orders, ...</code>、不用 <code>FOR ALL TABLES</code></li>
<li><strong>Schema change review 加 GDPR check</strong>：每個 DDL PR 必須答「新 table 是否含 EU PII、是否該 filter」</li>
<li><strong>Replication monitor</strong>：定期跑 <code>SELECT * FROM pg_publication_tables</code> 對照 expected list、漂移立刻 alert</li>
<li><strong>Evidence collection</strong>：filter 配置 + audit log 留檔、出事 DPO 知道何時 leak</li>
</ol>
<h3 id="case-2backup-跨-region-store合規違規">Case 2：Backup 跨 region store、合規違規</h3>
<p><strong>徵兆</strong>：跑 1 年後 GDPR audit 抓到 EU table 的 backup 存在 us-west S3 bucket；違反 Article 44-49 限制。</p>
<p><strong>根因</strong>：pgBackRest 預設用 <em>global S3 bucket</em>（在 us-east-1）；EU PostgreSQL cluster backup 跑去 us-east、跨境傳輸無 transfer mechanism。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Per-region backup config</strong>：EU cluster 用 EU S3 bucket（eu-west-1）、寫進 pgBackRest config</li>
<li><strong>Backup test</strong>：每月跑一次 backup restore drill、validate backup 是 from EU region</li>
<li><strong>Bucket policy 強 enforce</strong>：EU bucket 加 <code>aws:RequestedRegion=eu-west-1</code> 強制 region match</li>
<li><strong>Audit log archive 同理</strong>：log shipping 也必須 region-respect</li>
</ol>
<h3 id="case-3monitor-saas-收集-eu-pii合規-alert">Case 3：Monitor SaaS 收集 EU PII、合規 alert</h3>
<p><strong>徵兆</strong>：Datadog APM 收集了 EU customer 端 request 含 user_email 在 trace、被 DPO catch、required to delete 過去 90 天的 Datadog data。</p>
<p><strong>根因</strong>：APM trace 預設收集 application context、含 PII；Datadog 是 us-east SaaS、PII 跨境到 Datadog us-east、違規。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>APM scrub PII</strong>：application 端在 trace 前 scrub user_email / user_id 替換成 hash</li>
<li><strong>EU-specific monitor stack</strong>：EU PostgreSQL + APM 用 Grafana on EU EKS、不送 Datadog</li>
<li><strong>跨 region SaaS use 必須 audit</strong>：所有外部 SaaS（Datadog / Sentry / NewRelic）必須 GDPR-friendly 配置</li>
<li><strong>Privacy by design</strong>：log / trace 預設 scrub PII、不是 opt-in</li>
</ol>
<h3 id="case-4cross-region-query-跑-eu--us-資料residency-違規">Case 4：Cross-region query 跑 EU + US 資料、residency 違規</h3>
<p><strong>徵兆</strong>：BI dashboard 跑跨 region aggregation query（EU sales + US sales）、PostgreSQL FDW 從 us-east cluster query EU cluster、EU 端 server log 顯示「PII export to us-east」。</p>
<p><strong>根因</strong>：開發者用 PostgreSQL Foreign Data Wrapper（FDW）方便跑跨 region query、不知道這在 GDPR 視為跨境 PII export。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Architecture: aggregate at edge</strong>：BI 跑 <em>per-region aggregate</em>、再在 BI layer compose（無 PII）；不直接跨 region join</li>
<li><strong>FDW 限制</strong>：disable FDW from us-east → EU cluster、enforce one-way data flow</li>
<li><strong>DBA access policy</strong>：DBA 不能直接 query EU cluster 從 us-east jumpbox</li>
<li><strong>Query audit</strong>：production query log 跑 PII detection（regex / NER）、發現跨境 export 立即 alert</li>
</ol>
<h3 id="case-5dr-drill-跨-region-failover暴露-residency-assumption-失敗">Case 5：DR drill 跨 region failover、暴露 residency assumption 失敗</h3>
<p><strong>徵兆</strong>：DR drill「EU 完全不可用、切到 us-east」執行後、發現 us-east 端 <em>沒 EU 資料</em> — 因為一直 strict residency filter；business 端 EU 客戶 24 小時無法服務。</p>
<p><strong>根因</strong>：strict GDPR residency 跟 strict DR availability 衝突 — 要 <em>跨 region DR</em> 就要 <em>跨 region 持有資料</em>、要 <em>strict residency</em> 就 <em>DR 範圍受限</em>。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>DR strategy revision</strong>：EU 端 multi-AZ within EU、不靠跨 region；EU region 全不可用情境接受 longer RTO</li>
<li><strong>Compliance + DR negotiation</strong>：跟 DPO / 法務談 <em>DR 跨境 short-window 是否可接受</em>、簽 cross-border transfer agreement</li>
<li><strong>Backup recovery 在 EU 內</strong>：EU 端 backup 跨 AZ store、不跨 region；EU AZ 災難用 EU 另一個 AZ 重建</li>
<li><strong>明示 RTO trade-off</strong>：EU customer SLA 寫「regional DR 內 RTO 1 小時、global DR 24-48 小時」、residency 跟 DR 是 <em>互斥取捨</em></li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Single region</th>
          <th>Multi-region GDPR-compliant</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Infrastructure cost</td>
          <td>baseline</td>
          <td>+60-100%（雙 cluster + cross-region replication）</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-1</td>
          <td>1-2 FTE（雙 region SRE + compliance）</td>
      </tr>
      <tr>
          <td>Compliance cost</td>
          <td>0</td>
          <td>$50-200K USD setup（DPIA / audit / DPO time）+ ongoing</td>
      </tr>
      <tr>
          <td>Egress cost</td>
          <td>Low</td>
          <td>High（cross-region replication 流量）</td>
      </tr>
      <tr>
          <td>Application latency</td>
          <td>Single AZ</td>
          <td>EU customer 連 EU、低；US customer 連 US、低</td>
      </tr>
      <tr>
          <td>DR RTO</td>
          <td>30 分鐘 (single region)</td>
          <td>EU regional 1 小時 / global 24-48 小時</td>
      </tr>
      <tr>
          <td>Audit cost</td>
          <td>Minimal</td>
          <td>季度 DPIA + 年度 compliance audit</td>
      </tr>
  </tbody>
</table>
<p><strong>判讀</strong>：GDPR multi-region 成本 1.5-2.5x、但合規是 <em>必要 spend</em>、用 cost optimization 的框架看會誤判；多數歐洲業務 7+ 年回本（避免 4% revenue fine）。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-postgresql--aurora-對位">跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> 對位</h3>
<p>Aurora Global Database 可簡化跨 region setup、但 residency filter 仍需 application 端；不是「Aurora 就解決 GDPR」。</p>
<h3 id="跟-multi-dc-mongodb-對位">跟 <a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">Multi-DC MongoDB</a> 對位</h3>
<p>兩篇都是 multi-region rollout、但本文加合規維度；MongoDB 篇純 capacity + DR driver、本文加 residency constraint、結構不同。</p>
<h3 id="跟-128-self-aware-limitation-第-1-點對位">跟 #128 self-aware limitation 第 1 點對位</h3>
<p>本文驗證 <em>residency axis 候選</em>：</p>
<ul>
<li><strong>Yes 軸獨立</strong>：reverse-constrain topology + operational + application、且帶獨立 compliance 工作量（DPIA / evidence collection / DPO sign-off）</li>
<li><strong>作為 driver 不夠</strong>：methodology 把 residency 歸為 driver 太窄、忽略 cross-cutting constraint 性質</li>
</ul>
<p>未來 audit 可能擴 7 維（加 residency / compliance contract）；累積 PCI / HIPAA / SOX 等不同合規 case 後再評估。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>Identity + Consistency + Residency 三軸候選統合</strong>：本批 3 篇分別驗證、未來累積 evidence 後考慮獨立 #129 卡 / 擴 audit 到 7-8 維</li>
<li><strong>Schrems II + new EU data transfer rules</strong>：跨大西洋資料傳輸法規變動快、playbook 半衰期短</li>
<li><strong>Data localization in China / Russia / India</strong>：類似 GDPR 但細節不同、未來 case 累積後評估</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>平行 multi-region case：<a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">MongoDB Shard + Multi-DC</a></li>
<li>平行 axis 候選驗證：<a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/migrate-to-aws-secrets-manager/" data-link-title="Vault → AWS Secrets Manager：「secret」不是「secret」、identity model 才是核心差異" data-link-desc="Vault → AWS Secrets Manager migration 表面是 secret store 替換、實際核心是 identity model 對位（Vault token &#43; policy vs AWS IAM &#43; resource policy）；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 identity axis 候選 — identity 是否獨立 audit 軸；5 個 production 踩雷（IAM principal 對位 / dynamic credential 對等失敗 / lease lifecycle 模型不同 / audit log 結構差 / 計費模型反轉）">Vault → AWS Secrets Manager</a>（identity 候選）/ <a href="/blog/backend/01-database/vendors/dynamodb/consistency-model-optimization/" data-link-title="DynamoDB Strongly Consistent → Eventually Consistent：same protocol, different contract" data-link-desc="DynamoDB consistency model 從 strongly consistent read 改 eventually consistent read 是 50% cost 優化但風險集中在 application contract — 同 vendor / 同 protocol / 同 table / 不同 read consistency；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 consistency axis 候選；涵蓋 read pattern audit / 5 個 production 踩雷">DynamoDB Consistency Model</a>（consistency 候選）</li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation 第 1 點</a>（residency axis 候選驗證、本文是該驗證的 dogfood）</li>
</ul>
]]></content:encoded></item><item><title>從自管 PostgreSQL / MySQL 遷到 Aurora：operational redesign migration playbook</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/migrate-from-self-managed-pg-mysql/</link><pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/migrate-from-self-managed-pg-mysql/</guid><description>&lt;p>從自管 PostgreSQL / MySQL 遷到 Aurora 是 &lt;em>operational redesign hybrid&lt;/em>（Type C &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a>）— wire protocol 相容、application 不改、但 HA / backup / monitoring / capacity 模型完全不同。本 playbook 走 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration playbook 6 規格面&lt;/a>（Driver / Diff audit / Phase plan / Evidence / Cutover / Cleanup）、補三個 Aurora-specific 議題：(1) 合規禁止跨境複製的 no-go condition、(2) 合規驅動遷移的時程模型（市場數 × 平均審查月份）、(3) Aurora 不是 all-purpose store 邊界。每階段進入下一步前都要過 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> — Evidence 段列出的證據是 gate 條件、不是 nice-to-have。&lt;/p>
&lt;p>本 playbook 不重複 Aurora overview（請看 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&amp;#43;75% 效能改善的 production 證據">Aurora vendor 頁&lt;/a>）— 前置閱讀建議 &lt;a href="../storage-architecture/">Aurora storage architecture&lt;/a>（理解為什麼 operational redesign）、&lt;a href="../cross-az-failover-rto/">Aurora cross-AZ failover RTO&lt;/a>（HA redesign 主項）、&lt;a href="../read-replica-scaling/">Aurora read replica scaling&lt;/a>（fleet 治理 SSoT、含合規 driver）。&lt;/p>
&lt;h2 id="migration-type-判定">Migration type 判定&lt;/h2>
&lt;p>本 playbook 是 &lt;em>Type C：Operational redesign hybrid&lt;/em>：&lt;/p>
&lt;ul>
&lt;li>PostgreSQL / MySQL → Aurora wire protocol 相容、application 多數不改&lt;/li>
&lt;li>但 operational model（HA / backup / monitoring / capacity）完全不同、需要 redesign&lt;/li>
&lt;li>跟 Type A schema translation 差：不需要翻譯 application SQL&lt;/li>
&lt;li>跟 Type B drop-in 差：HA / backup / monitoring / capacity 模型需要 redesign&lt;/li>
&lt;li>跟 Type E paradigm shift 差：保留 single-primary SQL 跟 ACID transaction 語意&lt;/li>
&lt;/ul>
&lt;p>對照其他 Aurora-related migration playbook：&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/" data-link-title="PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift" data-link-desc="Aurora DSQL（2024-12 re:Invent preview / 2025-05 GA）是 AWS 推的 PG wire-compatible *active-active distributed SQL*、跟 self-managed PG / Aurora PG 不同 paradigm（OCC &amp;#43; snapshot isolation &amp;#43; multi-region strong consistency）。Migration 結構是 *protocol drop-in &amp;#43; paradigm shift*：app SQL 不太改、但 transaction retry / extension 缺位 / 多 region 一致性需重設計。本文走 DSQL vs Aurora PG vs self-managed PG 三軸對比、為什麼遷的三條 driver（global write / operational zero-touch / region resiliency）、Type E phased plan、5 production 踩雷（transaction retry 沒處理 / extension 缺位 / sequence throughput 限制 / Aurora PG 直升 DSQL 不可行 / region failover semantic）、跟 PG → Aurora 跟 PG → CockroachDB 對比">PG → Aurora DSQL&lt;/a> 是 Type E paradigm shift（distributed SQL、multi-region active-active）&lt;/li>
&lt;li>&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &amp;#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">PG → CockroachDB&lt;/a> 是 Type E paradigm shift + cross-cloud&lt;/li>
&lt;/ul>
&lt;h2 id="driver為什麼遷">Driver：為什麼遷&lt;/h2>
&lt;h3 id="主要-driver">主要 driver&lt;/h3>
&lt;ul>
&lt;li>團隊規模成長、DBA bandwidth 飽和、backup / failover / patch 操作負擔超過產品價值&lt;/li>
&lt;li>Read replica scaling 需求（傳統 streaming replication lag 秒級、Aurora 10-30ms — 詳見 &lt;a href="../read-replica-scaling/">Aurora read replica scaling&lt;/a>）&lt;/li>
&lt;li>Storage growth 痛點（local SSD 上限、resize 要 downtime、Aurora 自動 grow 到 128 TB）&lt;/li>
&lt;/ul>
&lt;h3 id="次要-driver">次要 driver&lt;/h3>
&lt;ul>
&lt;li>HA model 簡化（Patroni / Orchestrator → Aurora cluster endpoint、見 &lt;a href="../cross-az-failover-rto/">cross-AZ failover RTO&lt;/a>）&lt;/li>
&lt;li>Backup 自動化（pgBackRest / xtrabackup → Aurora automated backup + PITR）&lt;/li>
&lt;li>Multi-region DR 需求（&lt;a href="../global-database-multi-region/">Aurora Global Database&lt;/a>、但合規場景例外）&lt;/li>
&lt;/ul>
&lt;h3 id="no-go-condition嚴格遵守">No-go condition（嚴格遵守）&lt;/h3>
&lt;p>跨雲 / on-prem 需求觸動 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/vendor-lock-in/" data-link-title="Vendor Lock-In" data-link-desc="說明採用供應商產品後，其 API 與格式滲入程式碼造成的退出成本">vendor lock-in&lt;/a> — Aurora storage layer 是 AWS 專屬、wire protocol 相容不代表退出成本低、long-term 跨雲策略未定時 self-managed PG / MySQL 反而保留路徑。&lt;/p></description><content:encoded><![CDATA[<p>從自管 PostgreSQL / MySQL 遷到 Aurora 是 <em>operational redesign hybrid</em>（Type C <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a>）— wire protocol 相容、application 不改、但 HA / backup / monitoring / capacity 模型完全不同。本 playbook 走 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration playbook 6 規格面</a>（Driver / Diff audit / Phase plan / Evidence / Cutover / Cleanup）、補三個 Aurora-specific 議題：(1) 合規禁止跨境複製的 no-go condition、(2) 合規驅動遷移的時程模型（市場數 × 平均審查月份）、(3) Aurora 不是 all-purpose store 邊界。每階段進入下一步前都要過 <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a> — Evidence 段列出的證據是 gate 條件、不是 nice-to-have。</p>
<p>本 playbook 不重複 Aurora overview（請看 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor 頁</a>）— 前置閱讀建議 <a href="../storage-architecture/">Aurora storage architecture</a>（理解為什麼 operational redesign）、<a href="../cross-az-failover-rto/">Aurora cross-AZ failover RTO</a>（HA redesign 主項）、<a href="../read-replica-scaling/">Aurora read replica scaling</a>（fleet 治理 SSoT、含合規 driver）。</p>
<h2 id="migration-type-判定">Migration type 判定</h2>
<p>本 playbook 是 <em>Type C：Operational redesign hybrid</em>：</p>
<ul>
<li>PostgreSQL / MySQL → Aurora wire protocol 相容、application 多數不改</li>
<li>但 operational model（HA / backup / monitoring / capacity）完全不同、需要 redesign</li>
<li>跟 Type A schema translation 差：不需要翻譯 application SQL</li>
<li>跟 Type B drop-in 差：HA / backup / monitoring / capacity 模型需要 redesign</li>
<li>跟 Type E paradigm shift 差：保留 single-primary SQL 跟 ACID transaction 語意</li>
</ul>
<p>對照其他 Aurora-related migration playbook：</p>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/" data-link-title="PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift" data-link-desc="Aurora DSQL（2024-12 re:Invent preview / 2025-05 GA）是 AWS 推的 PG wire-compatible *active-active distributed SQL*、跟 self-managed PG / Aurora PG 不同 paradigm（OCC &#43; snapshot isolation &#43; multi-region strong consistency）。Migration 結構是 *protocol drop-in &#43; paradigm shift*：app SQL 不太改、但 transaction retry / extension 缺位 / 多 region 一致性需重設計。本文走 DSQL vs Aurora PG vs self-managed PG 三軸對比、為什麼遷的三條 driver（global write / operational zero-touch / region resiliency）、Type E phased plan、5 production 踩雷（transaction retry 沒處理 / extension 缺位 / sequence throughput 限制 / Aurora PG 直升 DSQL 不可行 / region failover semantic）、跟 PG → Aurora 跟 PG → CockroachDB 對比">PG → Aurora DSQL</a> 是 Type E paradigm shift（distributed SQL、multi-region active-active）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">PG → CockroachDB</a> 是 Type E paradigm shift + cross-cloud</li>
</ul>
<h2 id="driver為什麼遷">Driver：為什麼遷</h2>
<h3 id="主要-driver">主要 driver</h3>
<ul>
<li>團隊規模成長、DBA bandwidth 飽和、backup / failover / patch 操作負擔超過產品價值</li>
<li>Read replica scaling 需求（傳統 streaming replication lag 秒級、Aurora 10-30ms — 詳見 <a href="../read-replica-scaling/">Aurora read replica scaling</a>）</li>
<li>Storage growth 痛點（local SSD 上限、resize 要 downtime、Aurora 自動 grow 到 128 TB）</li>
</ul>
<h3 id="次要-driver">次要 driver</h3>
<ul>
<li>HA model 簡化（Patroni / Orchestrator → Aurora cluster endpoint、見 <a href="../cross-az-failover-rto/">cross-AZ failover RTO</a>）</li>
<li>Backup 自動化（pgBackRest / xtrabackup → Aurora automated backup + PITR）</li>
<li>Multi-region DR 需求（<a href="../global-database-multi-region/">Aurora Global Database</a>、但合規場景例外）</li>
</ul>
<h3 id="no-go-condition嚴格遵守">No-go condition（嚴格遵守）</h3>
<p>跨雲 / on-prem 需求觸動 <a href="/blog/backend/knowledge-cards/vendor-lock-in/" data-link-title="Vendor Lock-In" data-link-desc="說明採用供應商產品後，其 API 與格式滲入程式碼造成的退出成本">vendor lock-in</a> — Aurora storage layer 是 AWS 專屬、wire protocol 相容不代表退出成本低、long-term 跨雲策略未定時 self-managed PG / MySQL 反而保留路徑。</p>
<table>
  <thead>
      <tr>
          <th>條件</th>
          <th>為什麼是 no-go</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>跨雲 / on-prem 需求</td>
          <td>Aurora AWS-only、wire protocol 相容但 storage 是 AWS 專屬</td>
      </tr>
      <tr>
          <td>需要 latest upstream 特性</td>
          <td>Aurora 通常落後 upstream PostgreSQL / MySQL 1-2 major version</td>
      </tr>
      <tr>
          <td>預算極敏感</td>
          <td>Aurora 比 self-managed PostgreSQL / MySQL 貴 20-30%</td>
      </tr>
      <tr>
          <td>合規禁止跨境複製</td>
          <td>受監管市場 <a href="/blog/backend/knowledge-cards/data-residency/" data-link-title="Data Residency" data-link-desc="合規要求資料留在特定地理邊界內、跨境複製違反合規、推動 fleet 拓樸決策">Data Residency</a> <em>禁止跨境複製</em>、Aurora Global Database 在這種場景 <em>違反合規</em> — 要改用每市場獨立 cluster</td>
      </tr>
      <tr>
          <td>客製化 storage / I/O</td>
          <td>Aurora storage 是 AWS managed、不能客製化（vs self-managed 可以做 cgroup / quota / 自訂 storage 配置）</td>
      </tr>
  </tbody>
</table>
<p><strong>合規禁止跨境複製 no-go</strong>（<a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered 揭露</a>）：</p>
<p>受監管市場資料不能跨境複製、Aurora Global Database 在這種場景違反合規。讀者規劃 Aurora migration 時不能假設「Aurora 一定有 Global Database 選項」— 要改用每市場獨立 cluster（fleet 拓樸吸收合規邊界、見 <a href="../read-replica-scaling/">Aurora read replica scaling</a> fleet SSoT）。</p>
<h3 id="替代方案">替代方案</h3>
<ul>
<li><strong>RDS PostgreSQL / MySQL</strong>：更接近 upstream、單 AZ 便宜、不重寫 storage</li>
<li><strong>自管 + Patroni HA + pgBackRest</strong>：保留控制、跨雲可用</li>
<li><strong>CockroachDB / Aurora DSQL</strong>：multi-region active-active write 需求</li>
</ul>
<h3 id="case-anchor">Case anchor</h3>
<ul>
<li><a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">9.C23 Netflix Aurora consolidation</a>：多套 RDBMS 統一到 Aurora、driver 是 <em>operational consolidation</em>、不是純效能</li>
<li><a href="/blog/backend/09-performance-capacity/cases/draftkings-aurora-financial-ledger/" data-link-title="9.C4 DraftKings：Aurora 撐 100 萬 ops/min 的體育博彩金融帳本" data-link-desc="DraftKings 用 Aurora MySQL 跑體育博彩金融帳本、Super Bowl 流量 &#43;50% 不影響延遲">9.C4 DraftKings</a>：200 個 cluster、按業務切分（不是一個大 cluster + 200 schema）</li>
<li><a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered</a>：受監管場景、合規 lead time 是時程主項</li>
</ul>
<p><strong>Netflix scope warning（必引用）</strong>：</p>
<ul>
<li><a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">case「需要警惕」段第 2 點原文</a>：「Netflix 數據層遠不止 Aurora — 還有 Cassandra（playback metadata）、EVCache（cache layer）、Iceberg（data warehouse）。Aurora 主要是『需要 ACID 的 OLTP 工作負載』、不是『all-purpose store』」</li>
<li>工程含義：consolidation 是 <em>ACID OLTP 整合到 Aurora</em>、不是 <em>所有 store 整合到 Aurora</em></li>
<li>讀者規劃整合範圍時要明示什麼 workload 不在範圍（cache、analytics、time-series、search、KV 高峰）</li>
<li>「+75% performance improvement 是跨多 workload 的最大改善幅度、不是『每個 workload 都 +75%』。實際每個 workload 改善幅度從 10% 到 75% 不等」（case「需要警惕」段第 1 點）</li>
</ul>
<h2 id="diff-audit6-維-source--target-差異盤點">Diff audit：6 維 source / target 差異盤點</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>差異</th>
          <th>主導程度</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>PostgreSQL extension 相容性（pg_cron 改 Lambda / Step Functions、pg_partman 改 manual / native partitioning、TimescaleDB 不支援、PostGIS 支援）；MySQL plugin（HandlerSocket 不支援、audit plugin 改 CloudTrail）</td>
          <td>中</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>HA model、backup、monitoring、parameter management（postgresql.conf → DB parameter group / cluster parameter group）</td>
          <td>高（主導）</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>保留（single-primary SQL、ACID transaction、wire protocol）</td>
          <td>無變動</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>connection pool（PgBouncer → RDS Proxy 或保留 PgBouncer in front of Aurora）、logical replication（pglogical / Debezium → Aurora 原生支援、但有版本限制）</td>
          <td>中</td>
      </tr>
      <tr>
          <td>Application</td>
          <td>保留（connection string 改 endpoint、SSL config 改 RDS CA、driver 不改）</td>
          <td>低</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>保留（single-region scaling、若要 multi-region 走另一條 playbook to DSQL）；fleet 拓樸決策（拆幾個 cluster）詳見 <a href="../read-replica-scaling/">read replica scaling</a> fleet SSoT</td>
          <td>中-高</td>
      </tr>
  </tbody>
</table>
<p><strong>主導差異</strong>：Operational layer（HA / backup / monitoring）、不是 schema 或 application。</p>
<h3 id="schema-diff-細節">Schema diff 細節</h3>
<p><strong>PostgreSQL → Aurora PostgreSQL</strong>：</p>
<table>
  <thead>
      <tr>
          <th>Extension</th>
          <th>Aurora 支援</th>
          <th>Migration 策略</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>pg_cron</td>
          <td>不支援</td>
          <td>改 Lambda 排程 + RDS event 或 Step Functions</td>
      </tr>
      <tr>
          <td>pg_partman</td>
          <td>不支援</td>
          <td>改 native declarative partitioning（PostgreSQL 11+）</td>
      </tr>
      <tr>
          <td>TimescaleDB</td>
          <td>不支援</td>
          <td>改 native partition + materialized view、或保留 self-managed</td>
      </tr>
      <tr>
          <td>PostGIS</td>
          <td>支援</td>
          <td>直接遷</td>
      </tr>
      <tr>
          <td>pgvector</td>
          <td>支援（新版）</td>
          <td>確認 Aurora PostgreSQL version、可能需要升級</td>
      </tr>
      <tr>
          <td>pglogical</td>
          <td>不支援</td>
          <td>改 Aurora 原生 logical replication（有版本限制）</td>
      </tr>
  </tbody>
</table>
<p><strong>MySQL → Aurora MySQL</strong>：</p>
<table>
  <thead>
      <tr>
          <th>Plugin</th>
          <th>Aurora 支援</th>
          <th>Migration 策略</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HandlerSocket</td>
          <td>不支援</td>
          <td>改 SQL access 或 Aurora-specific KV cache</td>
      </tr>
      <tr>
          <td>Vault audit</td>
          <td>不支援</td>
          <td>改 AWS CloudTrail + RDS audit log</td>
      </tr>
      <tr>
          <td>MyRocks engine</td>
          <td>不支援</td>
          <td>改 InnoDB（Aurora 預設）、評估 storage 成本</td>
      </tr>
      <tr>
          <td>MaxScale</td>
          <td>不支援</td>
          <td>改 Aurora reader endpoint 或 RDS Proxy</td>
      </tr>
  </tbody>
</table>
<h3 id="operational-diff-細節">Operational diff 細節</h3>
<table>
  <thead>
      <tr>
          <th>元素</th>
          <th>Self-managed</th>
          <th>Aurora</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HA</td>
          <td>Patroni / Orchestrator + etcd / ZooKeeper</td>
          <td>Cluster endpoint + 自動 cross-AZ failover</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest / xtrabackup + S3 lifecycle</td>
          <td>Automated backup + manual snapshot + PITR</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Prometheus exporter + Grafana</td>
          <td>CloudWatch + Performance Insights</td>
      </tr>
      <tr>
          <td>Parameter</td>
          <td>postgresql.conf / my.cnf</td>
          <td>DB parameter group / cluster parameter group</td>
      </tr>
      <tr>
          <td>Failover testing</td>
          <td>Patroni <code>patronictl failover</code></td>
          <td><code>aws rds failover-db-cluster</code></td>
      </tr>
      <tr>
          <td>WAL / binlog 觀測</td>
          <td><code>pg_stat_wal</code> / <code>SHOW MASTER STATUS</code></td>
          <td>CloudWatch + Performance Insights wait events</td>
      </tr>
  </tbody>
</table>
<h3 id="application-diff-細節">Application diff 細節</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl"># Self-managed PostgreSQL
</span></span><span class="line"><span class="ln">2</span><span class="cl">jdbc:postgresql://primary.internal:5432/mydb?ssl=true&amp;sslmode=verify-full&amp;sslrootcert=/etc/ssl/postgresql.crt
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"># Aurora PostgreSQL
</span></span><span class="line"><span class="ln">5</span><span class="cl">jdbc:postgresql://my-cluster.cluster-xxx.us-east-1.rds.amazonaws.com:5432/mydb?ssl=true&amp;sslmode=verify-full&amp;sslrootcert=rds-ca.pem</span></span></code></pre></div><p>Application 改動量小：connection string 換 endpoint、SSL CA 換 RDS CA、driver 不變。</p>
<p>對應 knowledge card：<a href="/blog/backend/knowledge-cards/failover/" data-link-title="Failover" data-link-desc="說明主要服務或節點失效時如何切換到備援能力">failover</a>、<a href="/blog/backend/knowledge-cards/replication-lag/" data-link-title="Replication Lag" data-link-desc="說明資料副本落後正式來源多久，以及它如何影響讀取正確性">replication-lag</a>。</p>
<h2 id="phase-plan階段切換">Phase plan：階段切換</h2>
<h3 id="phase-0pre-migration-audit2-4-週">Phase 0：Pre-migration audit（2-4 週）</h3>
<p>工作：</p>
<ul>
<li>Extension audit：<code>SELECT * FROM pg_extension</code> / <code>SHOW PLUGINS</code>、列出 source 使用的 extension</li>
<li>Parameter audit：postgresql.conf vs Aurora parameter group、列差異</li>
<li>Application connection string audit：所有服務的 DB connection 點位</li>
<li>Benchmark baseline：write QPS / read QPS / p99 latency</li>
<li>Cost baseline：current self-managed monthly cost vs Aurora estimate</li>
</ul>
<p>Output：</p>
<ul>
<li>Migration feasibility report（含 no-go condition check）</li>
<li>Aurora cluster sizing 估算</li>
<li>Extension migration plan（each extension 對應的策略）</li>
</ul>
<h3 id="phase-1aurora-infra-準備1-2-週">Phase 1：Aurora infra 準備（1-2 週）</h3>
<p>工作：</p>
<ul>
<li>Aurora cluster 開設（dev / staging / prod）</li>
<li>Parameter group 對位（從 source postgresql.conf / my.cnf 翻譯到 Aurora parameter group）</li>
<li>SG / subnet / IAM 設定</li>
<li>RDS Proxy 配置（如需要）</li>
<li>CloudWatch dashboard + Performance Insights baseline</li>
<li>Backup retention 設定（1-35 天）</li>
</ul>
<p>Output：</p>
<ul>
<li>Aurora cluster 待 data load</li>
<li>Monitoring 已 ready、能對照 source 跟 target</li>
</ul>
<h3 id="phase-2data-migration2-8-週依資料量">Phase 2：Data migration（2-8 週、依資料量）</h3>
<p>三條 path、依場景選：</p>
<h4 id="path-aaws-dms-full-load--cdc">Path A：AWS DMS full load + CDC</h4>
<ul>
<li>適合：&lt; 1 TB、可接受 read-only 短窗口</li>
<li>流程：DMS full load → DMS CDC → application cutover</li>
<li>優點：managed、validation 工具齊全</li>
<li>缺點：CDC lag 受 DMS task config 影響、bulk DDL 不友善</li>
</ul>
<h4 id="path-bpg_dump--mysqldump--logical-replication-catch-up">Path B：pg_dump / mysqldump + logical replication catch-up</h4>
<ul>
<li>適合：&gt; 1 TB、要長 CDC 期、預算敏感</li>
<li>流程：snapshot → pg_dump / mysqldump → restore to Aurora → logical replication catch-up → application cutover</li>
<li>優點：成本低、可控性高</li>
<li>缺點：手動步驟多、要自己管 CDC lag</li>
</ul>
<h4 id="path-csnapshot-restore">Path C：Snapshot restore</h4>
<ul>
<li>適合：已在 RDS PostgreSQL / MySQL</li>
<li>流程：RDS snapshot → Aurora restore-from-snapshot → catch-up → application cutover</li>
<li>優點：最快、AWS-internal 操作</li>
<li>缺點：只適用 RDS source、不適用 self-managed</li>
</ul>
<h3 id="phase-3dual-read-validation1-2-週">Phase 3：Dual-read validation（1-2 週）</h3>
<p>工作：</p>
<ul>
<li>Application read 50/50 split source / target</li>
<li>比對 query 結果（per-table checksum + sampling）</li>
<li>量測 latency（Aurora p99 ≤ source × 1.2）</li>
<li>確認 stale read 比例 &lt; 0.01%</li>
</ul>
<p>Output：</p>
<ul>
<li>Validation report：query 結果差異、latency 對照</li>
<li>Go/no-go decision for cutover</li>
</ul>
<h3 id="phase-4cutover-1-小時-window">Phase 4：Cutover（&lt; 1 小時 window）</h3>
<p>工作：</p>
<ul>
<li>Source set read-only</li>
<li>CDC catch-up final（lag → 0）</li>
<li>Application switch endpoint（DNS / service discovery / config flag）</li>
<li>Smoke test（critical path query + write）</li>
<li>Monitor error rate + latency 1 小時</li>
</ul>
<p>Output：</p>
<ul>
<li>Cutover complete</li>
<li>Source 切到 read-only、保留作為 rollback 餘地</li>
</ul>
<h3 id="phase-5cleanup4-8-週">Phase 5：Cleanup（4-8 週）</h3>
<p>工作：</p>
<ul>
<li>Source 保留 1 個月 read-only（rollback window）</li>
<li>確認穩定後 snapshot → S3 archive → decommission</li>
<li>舊 monitoring / backup / runbook archive</li>
</ul>
<p>Output：</p>
<ul>
<li>Source decommissioned</li>
<li>新 runbook + monitoring 為 SSoT</li>
</ul>
<h3 id="本-phase-plan-適用範圍">本 phase plan 適用範圍</h3>
<p><strong>Non-regulated workload</strong>（一般 SaaS / e-commerce / 內部系統）。受監管場景（銀行 / 保險 / 醫療）請見下方「合規驅動遷移的時程模型」段、技術 phase 不變但 lead time 完全不同。</p>
<h2 id="合規驅動遷移的時程模型">合規驅動遷移的時程模型</h2>
<p>受監管產業遷移的關鍵時程是 <em>合規審查 lead time</em>、不是技術遷移時間 — 本段是補充給銀行 / 保險 / 醫療讀者、避免照本 playbook 走嚴重低估時程。</p>
<h3 id="standard-chartered-揭露的時程模型">Standard Chartered 揭露的時程模型</h3>
<p><a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered case</a> 「判讀」段第 3 點 + 「策略」段第 3 點原文：「每個受監管市場的審查可能 3-12 個月、合計遷移時程是『市場數 × 平均審查月份』、不是『技術遷移月份』」。</p>
<p>工程含義：</p>
<ul>
<li>技術 phase plan 假設 2-8 週 data migration + &lt; 1 小時 cutover</li>
<li>合規 lead time 是 <em>獨立軸</em>、可能比技術時程長一個數量級</li>
<li>不同市場合規進度不同步、可能要分批上線</li>
</ul>
<h3 id="合規時程組合">合規時程組合</h3>
<table>
  <thead>
      <tr>
          <th>軸</th>
          <th>時程估算</th>
          <th>不可壓縮原因</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>技術遷移</td>
          <td>2-8 週 data migration + &lt; 1 小時 cutover</td>
          <td>工程可控</td>
      </tr>
      <tr>
          <td>單市場合規審查</td>
          <td>3-12 個月（Standard Chartered case 揭露）</td>
          <td>監管機構 lead time、不是技術問題</td>
      </tr>
      <tr>
          <td>多市場合規 lead time</td>
          <td>市場數 × 平均審查月份（7 市場 × 6 個月 ≈ 3.5 年最壞情況）</td>
          <td>各市場各自審、平行度受監管機構文化影響</td>
      </tr>
      <tr>
          <td>跨境複製禁令審查</td>
          <td>包含在合規審查內、可能讓 Global Database 從候選變反指標</td>
          <td>監管要求 data residency、無 cross-region replication option</td>
      </tr>
  </tbody>
</table>
<h3 id="讀者判讀">讀者判讀</h3>
<ul>
<li>受監管場景 <em>不能</em> 用本 playbook 的「2-8 週 data migration + &lt; 1 小時 cutover」估時程交付給管理層 — 合規 lead time 是時程主項</li>
<li>受監管場景 <em>不能</em> 假設 Aurora Global Database 是 multi-region DR 選項 — 合規禁止跨境複製場景下 Global Database 違反合規（見 <a href="../global-database-multi-region/">global-database-multi-region</a>），要改用每市場獨立 cluster</li>
<li>合規場景的 phase plan 要把每市場當成獨立 mini-migration、用 <em>市場批次</em> 推進、不是一次 big bang</li>
</ul>
<p><strong>scope warning（必明示、case 自承）</strong>：Standard Chartered case 未公開是 PostgreSQL 還是 MySQL、未公開具體 cost 數字 — 引用時不能擴寫「Standard Chartered 用 Aurora PostgreSQL」這類細節（case 用「相關 case study」匿名標明）。</p>
<p><strong>合規時程 scope 警示</strong>：「3-12 個月、7 市場 × 6 個月 ≈ 3.5 年」是 Standard Chartered case 揭露範圍。實際合規 lead time 隨產業（銀行 / 保險 / 醫療）跟國家（東南亞 / 歐盟 / 北美 / 中東）差異大、不是恆定數字。讀者要把自家對應監管框架的實際 lead time 算進來、不是直接套 Standard Chartered 數字。</p>
<h2 id="evidence每階段驗證資料">Evidence：每階段驗證資料</h2>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Phase 0</td>
          <td>extension list、parameter diff、application SQL 抽樣 test on Aurora dev cluster</td>
      </tr>
      <tr>
          <td>Phase 1</td>
          <td>Aurora cluster ready、monitoring dashboard 跟 source 對照</td>
      </tr>
      <tr>
          <td>Phase 2</td>
          <td>DMS row count match、checksum（per-table MD5）、CDC replication lag &lt; 5 秒</td>
      </tr>
      <tr>
          <td>Phase 3</td>
          <td>query result diff &lt; 0.01%、p99 latency Aurora ≤ source × 1.2、application error rate baseline</td>
      </tr>
      <tr>
          <td>Phase 4</td>
          <td>cutover 完成後 1 小時內 error rate &lt; baseline × 2、write success rate 100%</td>
      </tr>
      <tr>
          <td>Phase 5</td>
          <td>30 天無 rollback trigger、cost 月帳對齊預估</td>
      </tr>
  </tbody>
</table>
<p><strong>受監管追加 evidence</strong>：</p>
<ul>
<li>每市場合規 sign-off 文件（central bank / 金融監管機關）</li>
<li>跨境複製禁令審查記錄</li>
<li>Data residency 驗證測試（資料未流出受監管市場 boundary）</li>
<li>Audit log 連續性驗證（source / target audit log 銜接）</li>
</ul>
<p><strong>回路徑</strong>：<a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a> 抽 CDC / latency evidence。</p>
<h2 id="cutover切流決策">Cutover：切流決策</h2>
<p><strong>Cutover window</strong>：</p>
<ul>
<li>建議 4 AM local time（lowest traffic）</li>
<li>預留 4 小時 buffer</li>
<li>受監管場景可能要在合規規定的 maintenance window（例如某些央行規定週日凌晨）</li>
</ul>
<p><strong>Rollback condition</strong>：</p>
<ul>
<li>error rate &gt; baseline × 5</li>
<li>write latency p99 &gt; baseline × 3 持續 10 分鐘</li>
<li>data corruption signal（checksum mismatch、unexpected row count drop）</li>
</ul>
<p><strong>Rollback path</strong>：</p>
<ul>
<li>Application connection string 切回 source</li>
<li>Source 仍 read-write（cutover 前留 read-write 路徑、若已 read-only 要先解凍）</li>
<li>CDC 反向同步（Aurora → source）catch-up</li>
</ul>
<p><strong>Decision owner</strong>：</p>
<ul>
<li>DBA lead + service owner + on-call SRE 三方 sign-off</li>
<li>受監管場景追加 compliance officer sign-off</li>
<li>Cutover decision log 記錄（<a href="/blog/backend/knowledge-cards/rollback-window/" data-link-title="Rollback Window" data-link-desc="說明變更進入 production 後還能用哪種方式回退或改路線的時間與條件">rollback window</a> / <a href="/blog/backend/knowledge-cards/rollback-condition/" data-link-title="Rollback Condition" data-link-desc="說明決策執行後出現哪些訊號時要撤回、回退或改路線">rollback condition</a> 文件化）</li>
</ul>
<p>對應 knowledge card：<a href="/blog/backend/knowledge-cards/rollback-window/" data-link-title="Rollback Window" data-link-desc="說明變更進入 production 後還能用哪種方式回退或改路線的時間與條件">rollback-window</a>、<a href="/blog/backend/knowledge-cards/rollback-condition/" data-link-title="Rollback Condition" data-link-desc="說明決策執行後出現哪些訊號時要撤回、回退或改路線">rollback-condition</a>。</p>
<h2 id="cleanup雙軌退役">Cleanup：雙軌退役</h2>
<table>
  <thead>
      <tr>
          <th>元素</th>
          <th>Cleanup 策略</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source database</td>
          <td>read-only 1 個月、確認穩定後 snapshot → S3 archive → decommission</td>
      </tr>
      <tr>
          <td>舊 monitoring</td>
          <td>Prometheus exporter 拆、Grafana dashboard archive、CloudWatch dashboard 為 SSoT</td>
      </tr>
      <tr>
          <td>舊 backup chain</td>
          <td>pgBackRest / xtrabackup retention 保留至合規邊界（金融 7 年、一般 90 天）</td>
      </tr>
      <tr>
          <td>舊 runbook</td>
          <td>Patroni / Orchestrator runbook archive、新 runbook 對 Aurora cluster endpoint</td>
      </tr>
      <tr>
          <td>舊 CDC connector</td>
          <td>DMS task 留 7 天觀察期 → delete；自管 Debezium / pglogical 在 source decommission 同時退役</td>
      </tr>
  </tbody>
</table>
<p><strong>不可逆 cleanup 邊界</strong>：</p>
<ul>
<li>Source decommission 後資料只能從 backup restore</li>
<li>確保 backup 可用性測試通過再 decommission</li>
<li>受監管場景要保留 source backup 到合規 retention（金融 7 年、可能更長）</li>
</ul>
<h2 id="案例對照">案例對照</h2>
<h3 id="netflix-aurora-consolidationoperational-consolidation-的價值">Netflix Aurora consolidation：operational consolidation 的價值</h3>
<p><a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">9.C23 Netflix</a> 多套 RDBMS（PostgreSQL / MySQL / Oracle）→ Aurora、+75% 效能 / -28% 成本。</p>
<p><strong>驗證的 driver</strong>：</p>
<ul>
<li>DB 種類太多本身是規模化的成本（每多一種 DB 多一套 DBA 知識 / backup / monitoring）</li>
<li>整合到 Aurora 釋放工程資源、不是純效能改善</li>
</ul>
<p><strong>case 自帶警示（必引用）</strong>：</p>
<ul>
<li>「+75% 是跨多 workload 最大改善幅度、不是每 workload 都 +75%」（case「需要警惕」段第 1 點）</li>
<li><strong>Aurora 非 all-purpose store 邊界</strong>：「Netflix 數據層遠不止 Aurora — 還有 Cassandra（playback metadata）、EVCache（cache layer）、Iceberg（data warehouse）。Aurora 主要是『需要 ACID 的 OLTP 工作負載』」（case「需要警惕」段第 2 點）</li>
</ul>
<p>工程含義：consolidation 是「ACID OLTP 整合到 Aurora」、不是「所有 store 整合到 Aurora」。讀者規劃整合範圍時要明示什麼 workload 不在範圍：</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>是否在 Aurora consolidation 範圍</th>
          <th>替代</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ACID OLTP</td>
          <td>是</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Playback metadata</td>
          <td>否（Netflix 用 Cassandra）</td>
          <td>Cassandra / ScyllaDB</td>
      </tr>
      <tr>
          <td>Cache layer</td>
          <td>否（Netflix 用 EVCache）</td>
          <td>EVCache / Redis / Memcached</td>
      </tr>
      <tr>
          <td>Data warehouse</td>
          <td>否（Netflix 用 Iceberg）</td>
          <td>Iceberg / Snowflake / Redshift</td>
      </tr>
      <tr>
          <td>Time-series</td>
          <td>否（性能不適合）</td>
          <td>InfluxDB / TimescaleDB self-managed</td>
      </tr>
      <tr>
          <td>Search</td>
          <td>否（無 inverted index 優化）</td>
          <td>Elasticsearch / OpenSearch</td>
      </tr>
  </tbody>
</table>
<h3 id="draftkingsfleet-拓樸-redesign">DraftKings：fleet 拓樸 redesign</h3>
<p><a href="/blog/backend/09-performance-capacity/cases/draftkings-aurora-financial-ledger/" data-link-title="9.C4 DraftKings：Aurora 撐 100 萬 ops/min 的體育博彩金融帳本" data-link-desc="DraftKings 用 Aurora MySQL 跑體育博彩金融帳本、Super Bowl 流量 &#43;50% 不影響延遲">9.C4 DraftKings</a> 200 個獨立 Aurora cluster、按業務切分（不是一個大 cluster + 200 schema）。</p>
<p><strong>驗證的 driver</strong>：</p>
<ul>
<li>Migration 不只是技術切換、也是 cluster 拓樸 redesign</li>
<li>業務本身可切分（每體育類別 / 每地理 / 每產品線）就在 migration 時順便拆 cluster</li>
<li>Blast radius 隔離跟容量規劃分散一起獲得</li>
</ul>
<p><strong>Fleet 拓樸決策</strong>：詳見 <a href="../read-replica-scaling/">Aurora read replica scaling</a> 邊界段 SSoT。本 playbook 提醒 <em>migration 是拆 cluster 的好時機</em>、不展開拓樸決策本身。</p>
<h3 id="standard-chartered合規-lead-time--跨境複製禁令">Standard Chartered：合規 lead time + 跨境複製禁令</h3>
<p><a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered</a> 受監管場景揭露：</p>
<ul>
<li>合規 lead time 是時程主項（3-12 個月 / 市場）</li>
<li>跨境複製禁止讓 Global Database 變反指標</li>
<li>每市場獨立 cluster + cross-AZ failover 是合規場景的標準解</li>
</ul>
<h3 id="反例aurora-不適合的場景">反例：Aurora 不適合的場景</h3>
<ul>
<li>Multi-region active-active write：見 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/" data-link-title="PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift" data-link-desc="Aurora DSQL（2024-12 re:Invent preview / 2025-05 GA）是 AWS 推的 PG wire-compatible *active-active distributed SQL*、跟 self-managed PG / Aurora PG 不同 paradigm（OCC &#43; snapshot isolation &#43; multi-region strong consistency）。Migration 結構是 *protocol drop-in &#43; paradigm shift*：app SQL 不太改、但 transaction retry / extension 缺位 / 多 region 一致性需重設計。本文走 DSQL vs Aurora PG vs self-managed PG 三軸對比、為什麼遷的三條 driver（global write / operational zero-touch / region resiliency）、Type E phased plan、5 production 踩雷（transaction retry 沒處理 / extension 缺位 / sequence throughput 限制 / Aurora PG 直升 DSQL 不可行 / region failover semantic）、跟 PG → Aurora 跟 PG → CockroachDB 對比">PG → Aurora DSQL Migration</a></li>
<li>跨雲：見 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">PG → CockroachDB Migration</a></li>
<li>極端寫入吞吐（&gt; 100K WPS）：考慮 sharding、CockroachDB、或 DynamoDB</li>
</ul>
<h2 id="邊界與整合--下一步">邊界與整合 / 下一步</h2>
<p><strong>Sibling playbook</strong>：</p>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/" data-link-title="PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift" data-link-desc="Aurora DSQL（2024-12 re:Invent preview / 2025-05 GA）是 AWS 推的 PG wire-compatible *active-active distributed SQL*、跟 self-managed PG / Aurora PG 不同 paradigm（OCC &#43; snapshot isolation &#43; multi-region strong consistency）。Migration 結構是 *protocol drop-in &#43; paradigm shift*：app SQL 不太改、但 transaction retry / extension 缺位 / 多 region 一致性需重設計。本文走 DSQL vs Aurora PG vs self-managed PG 三軸對比、為什麼遷的三條 driver（global write / operational zero-touch / region resiliency）、Type E phased plan、5 production 踩雷（transaction retry 沒處理 / extension 缺位 / sequence throughput 限制 / Aurora PG 直升 DSQL 不可行 / region failover semantic）、跟 PG → Aurora 跟 PG → CockroachDB 對比">PG → Aurora DSQL</a> — paradigm shift、Type E、multi-region active-active</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">PG → CockroachDB</a> — cross-cloud、paradigm shift</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PG → Aurora</a> — 既有 PG-specific playbook、可對照本 playbook 的 vendor-neutral 版本</li>
</ul>
<p><strong>Sibling deep article</strong>：</p>
<ul>
<li><a href="../storage-architecture/">Aurora storage architecture</a> — 理解 storage 設計才知道為什麼 operational redesign</li>
<li><a href="../cross-az-failover-rto/">Aurora cross-AZ failover RTO</a> — HA redesign 主項</li>
<li><a href="../read-replica-scaling/">Aurora read replica scaling</a> — fleet 治理 SSoT、含合規 driver</li>
<li><a href="../global-database-multi-region/">Aurora Global Database</a> — 合規禁止跨境複製的 anti-recommendation</li>
</ul>
<p><strong>1.x 章節互引</strong>：</p>
<ul>
<li><a href="/blog/backend/01-database/large-scale-db-migration/" data-link-title="1.12 大規模 DB 遷移實戰" data-link-desc="跨 DB 遷移的 dual-write、[shadow read](/backend/knowledge-cards/shadow-read/)、cutover、rollback 流程 — 從實戰案例提煉的工程做法">1.12 大規模 DB 遷移實戰</a> — migration 上游 framework</li>
</ul>
<p><strong>何時不用本 playbook</strong>：</p>
<ul>
<li>從 Aurora 遷到別處（反向、走對應的反向 playbook）</li>
<li>從 RDS PostgreSQL 升 Aurora PostgreSQL 是 in-place upgrade、用 RDS console「Convert to Aurora」即可、不需要這套 playbook</li>
<li>跨雲遷移：本 playbook 不涵蓋 GCP / Azure SQL → Aurora 流程</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor overview</a> — 服務定位、適用 / 不適用場景</li>
<li><a href="/blog/backend/knowledge-cards/failover/" data-link-title="Failover" data-link-desc="說明主要服務或節點失效時如何切換到備援能力">Failover 卡片</a> — 概念基底</li>
<li><a href="/blog/backend/knowledge-cards/replication-lag/" data-link-title="Replication Lag" data-link-desc="說明資料副本落後正式來源多久，以及它如何影響讀取正確性">Replication Lag 卡片</a> — operational diff 主軸</li>
<li><a href="/blog/backend/knowledge-cards/rollback-window/" data-link-title="Rollback Window" data-link-desc="說明變更進入 production 後還能用哪種方式回退或改路線的時間與條件">Rollback Window 卡片</a> — cutover decision</li>
<li><a href="/blog/backend/knowledge-cards/rollback-condition/" data-link-title="Rollback Condition" data-link-desc="說明決策執行後出現哪些訊號時要撤回、回退或改路線">Rollback Condition 卡片</a> — rollback trigger</li>
<li><a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">9.C23 Netflix</a> — operational consolidation 跟 Aurora 非 all-purpose store 邊界</li>
<li><a href="/blog/backend/09-performance-capacity/cases/draftkings-aurora-financial-ledger/" data-link-title="9.C4 DraftKings：Aurora 撐 100 萬 ops/min 的體育博彩金融帳本" data-link-desc="DraftKings 用 Aurora MySQL 跑體育博彩金融帳本、Super Bowl 流量 &#43;50% 不影響延遲">9.C4 DraftKings</a> — fleet 拓樸 redesign</li>
<li><a href="/blog/backend/09-performance-capacity/cases/standard-chartered-aurora-banking/" data-link-title="9.C14 Standard Chartered：受監管銀行的 Aurora 4000 TPS 容量提升" data-link-desc="Standard Chartered 銀行遷移到 Aurora 後吞吐量提升 10 倍至 4000 TPS、跨 7 個受監管市場">9.C14 Standard Chartered</a> — 合規 lead time + 跨境複製禁令</li>
<li><a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook 寫作方法論</a> — 本文遵循的 6 規格面寫作模板</li>
<li>官方：<a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Migrating.html">Aurora migration documentation</a></li>
</ul>
]]></content:encoded></item><item><title>Cosmos DB for PostgreSQL：基於 Citus 的分散式 PostgreSQL、跟核心 Cosmos DB 是不同產品、何時選它而非核心 Cosmos 或一般 PG</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/cosmos-for-postgresql/</link><pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/cosmos-for-postgresql/</guid><description>&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB&lt;/a> overview 的 deep article、寫作參照 &lt;a href="https://tarrragon.github.io/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">vendor deep article methodology&lt;/a>。Cosmos DB for PostgreSQL 是 Azure 在 2022 把 Citus（PostgreSQL 的分散式 extension）納入後推出的 &lt;em>分散式 PostgreSQL&lt;/em> 託管服務 — 它跑真正的 PostgreSQL engine、支援標準 SQL / JOIN / ACID 交易、把單表水平分片到多個 worker node。它跟本 vendor 頁主講的核心 Cosmos DB（NoSQL、multi-model、RU/s 計費）是 &lt;em>兩個不同產品&lt;/em>、只是共用品牌名稱。本文的主責任是釐清這個定位混淆、再講它的架構與選型判準：何時選它、何時該回核心 Cosmos DB、何時一般 PostgreSQL 就夠。&lt;/p>
&lt;p>本文沒有專屬 production case anchor：Cosmos DB for PostgreSQL 的公開 case 覆蓋稀薄、機制以 Azure / Citus vendor 規格與分散式 PostgreSQL 通用工程展開、選型判準用「scale-out PG vs NoSQL vs single-node PG」這個具體決策驅動。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Scope warning&lt;/strong>：本文涉及的服務命名、node 規格上限、Citus 版本、PostgreSQL major version 支援屬時間敏感、Azure 服務命名歷史上有變動、實作前以 &lt;a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/">Cosmos DB for PostgreSQL 官方文件&lt;/a> cross-verify。&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>典型觸發場景：team 在 Azure 上跑 PostgreSQL、單機 primary 撐到上限 — write throughput、資料量、或單表太大導致 index / vacuum / query 變慢。看到「Cosmos DB」以為是要把資料搬進 NoSQL、重寫 application 成 document model；或反過來、看到「Cosmos DB for PostgreSQL」以為它就是核心 Cosmos DB 的一個 PostgreSQL API、結果發現它是完全不同的東西。命名混淆讓選型從一開始就走偏。&lt;/p>
&lt;p>讀者徵兆：&lt;/p>
&lt;ul>
&lt;li>「單機 PostgreSQL 撐不住、但 application 是 SQL / JOIN / 交易重、不想重寫成 NoSQL」&lt;/li>
&lt;li>「Cosmos DB for PostgreSQL 跟核心 Cosmos DB 是同一個東西嗎」&lt;/li>
&lt;li>「它跟一般 Azure Database for PostgreSQL 差在哪、什麼時候才需要它」&lt;/li>
&lt;li>「跟 CockroachDB / Aurora / Spanner 這些 distributed SQL 怎麼選」&lt;/li>
&lt;/ul>
&lt;p>真實壓力：SQL workload 撐到單機上限時、選錯方向的成本是年級的。誤以為要遷 NoSQL 而重寫 application 是浪費；誤以為核心 Cosmos DB 有「PostgreSQL 相容」而選錯產品也是浪費。正確的選型要先把這個服務放回它真正的分類 — &lt;em>分散式 SQL&lt;/em>、見 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">distributed SQL&lt;/a>。&lt;/p>
&lt;h2 id="核心機制citus-based-coordinator-worker-分散式-postgresql">核心機制：Citus-based coordinator-worker 分散式 PostgreSQL&lt;/h2>
&lt;p>Cosmos DB for PostgreSQL 的底層是 Citus、把 PostgreSQL 從單機擴展成 coordinator + worker 的分散式叢集。它的關鍵概念有幾個。&lt;/p></description><content:encoded><![CDATA[<p>本文是 <a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB</a> overview 的 deep article、寫作參照 <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">vendor deep article methodology</a>。Cosmos DB for PostgreSQL 是 Azure 在 2022 把 Citus（PostgreSQL 的分散式 extension）納入後推出的 <em>分散式 PostgreSQL</em> 託管服務 — 它跑真正的 PostgreSQL engine、支援標準 SQL / JOIN / ACID 交易、把單表水平分片到多個 worker node。它跟本 vendor 頁主講的核心 Cosmos DB（NoSQL、multi-model、RU/s 計費）是 <em>兩個不同產品</em>、只是共用品牌名稱。本文的主責任是釐清這個定位混淆、再講它的架構與選型判準：何時選它、何時該回核心 Cosmos DB、何時一般 PostgreSQL 就夠。</p>
<p>本文沒有專屬 production case anchor：Cosmos DB for PostgreSQL 的公開 case 覆蓋稀薄、機制以 Azure / Citus vendor 規格與分散式 PostgreSQL 通用工程展開、選型判準用「scale-out PG vs NoSQL vs single-node PG」這個具體決策驅動。</p>
<blockquote>
<p><strong>Scope warning</strong>：本文涉及的服務命名、node 規格上限、Citus 版本、PostgreSQL major version 支援屬時間敏感、Azure 服務命名歷史上有變動、實作前以 <a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/">Cosmos DB for PostgreSQL 官方文件</a> cross-verify。</p></blockquote>
<h2 id="問題情境">問題情境</h2>
<p>典型觸發場景：team 在 Azure 上跑 PostgreSQL、單機 primary 撐到上限 — write throughput、資料量、或單表太大導致 index / vacuum / query 變慢。看到「Cosmos DB」以為是要把資料搬進 NoSQL、重寫 application 成 document model；或反過來、看到「Cosmos DB for PostgreSQL」以為它就是核心 Cosmos DB 的一個 PostgreSQL API、結果發現它是完全不同的東西。命名混淆讓選型從一開始就走偏。</p>
<p>讀者徵兆：</p>
<ul>
<li>「單機 PostgreSQL 撐不住、但 application 是 SQL / JOIN / 交易重、不想重寫成 NoSQL」</li>
<li>「Cosmos DB for PostgreSQL 跟核心 Cosmos DB 是同一個東西嗎」</li>
<li>「它跟一般 Azure Database for PostgreSQL 差在哪、什麼時候才需要它」</li>
<li>「跟 CockroachDB / Aurora / Spanner 這些 distributed SQL 怎麼選」</li>
</ul>
<p>真實壓力：SQL workload 撐到單機上限時、選錯方向的成本是年級的。誤以為要遷 NoSQL 而重寫 application 是浪費；誤以為核心 Cosmos DB 有「PostgreSQL 相容」而選錯產品也是浪費。正確的選型要先把這個服務放回它真正的分類 — <em>分散式 SQL</em>、見 <a href="/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">distributed SQL</a>。</p>
<h2 id="核心機制citus-based-coordinator-worker-分散式-postgresql">核心機制：Citus-based coordinator-worker 分散式 PostgreSQL</h2>
<p>Cosmos DB for PostgreSQL 的底層是 Citus、把 PostgreSQL 從單機擴展成 coordinator + worker 的分散式叢集。它的關鍵概念有幾個。</p>
<p>它跑 <em>真正的 PostgreSQL</em>。不是 wire-compat、不是 PostgreSQL API on top of NoSQL — 是 PostgreSQL engine 加 Citus extension。標準 SQL、JOIN、ACID 交易、PostgreSQL extension 生態（含部分如 PostGIS）都在。這跟核心 Cosmos DB（自己的 query language、SQL-like 但無 JOIN、RU/s 計費）是根本不同的東西。</p>
<p>架構是 coordinator-worker。coordinator node 接 query、根據 distribution column 把 query 路由 / 拆分到 worker node、worker 存實際的 shard。application 連 coordinator、看起來像連一個 PostgreSQL。</p>
<p>distribution column 是核心設計決策、類比核心 Cosmos DB 的 partition key 之於 NoSQL、也類比 <a href="../partition-key-design/">partition-key-design</a> 講的分散原則。表按 distribution column 的值分片到 worker；同一 distribution column 值的 row 落在同一 shard。JOIN 與交易若在同一 distribution column 值內、可以下推到單一 worker 高效執行（co-location）；跨 distribution column 的 JOIN / 交易要跨 worker 協調、較貴。</p>
<p>表分三種：distributed table（按 distribution column 分片、大表用）、reference table（每個 worker 全複本、小的維度表用、讓 JOIN co-locate）、local table（只在 coordinator）。建模的關鍵是把常一起 JOIN 的大表用 <em>同一 distribution column</em> 分片、達成 co-location。</p>
<h2 id="選型判準三方對照">選型判準：三方對照</h2>
<p>這是本文主判讀段。Cosmos DB for PostgreSQL 的正確位置是「single-node PG 不夠、但 workload 仍是 SQL 範式」的中間地帶。</p>
<p>選 Cosmos DB for PostgreSQL 的條件：</p>
<ul>
<li>workload 是 SQL 範式（關聯 schema、JOIN、交易）、不想 / 不能重寫成 NoSQL</li>
<li>single-node PostgreSQL 已達上限（write throughput / 資料量 / 單表大小）、且資料有好的 distribution column（多租戶的 tenant_id、time-series 的某維度）</li>
<li>工作負載偏向多租戶 SaaS 或 real-time analytics over fresh data — Citus 的典型適配場景</li>
<li>想留在 PostgreSQL 生態（SQL、extension、既有 tooling）而非進 NoSQL</li>
</ul>
<p>回核心 Cosmos DB（NoSQL）的條件：</p>
<ul>
<li>資料形狀已是 document / KV、access pattern 固定、不需要 JOIN 與複雜 SQL</li>
<li>需要 multi-model（document + graph + KV）、5 個 consistency level、turnkey multi-region active-active write</li>
<li>RU/s 容量抽象與 serverless 計費更符合 workload — 見 <a href="../ru-cost-model-sizing/">ru-cost-model-sizing</a></li>
</ul>
<p>一般 Azure Database for PostgreSQL（single-node managed PG）就夠的條件：</p>
<ul>
<li>single-node 還沒到上限 — 多數 OLTP baseline 用 vertical scaling + read replica 就夠、不需要分散式</li>
<li>沒有好的 distribution column — 分散式 PostgreSQL 沒有均勻 distribution column 會 hot worker、好處拿不到、複雜度卻全付</li>
<li>不想承擔 distributed SQL 的複雜度（distribution column 設計、co-location 規劃、跨 shard query 成本）</li>
</ul>
<p>判讀句：先確認 single-node PG 真的到上限、再確認 workload 是 SQL 範式（否則考慮 NoSQL）、最後確認有好的 distribution column。三個都成立、Cosmos DB for PostgreSQL 才是對的；缺任一個、回 single-node PG 或核心 Cosmos DB。</p>
<h3 id="跟其他-distributed-sql-的位置">跟其他 distributed SQL 的位置</h3>
<p>Cosmos DB for PostgreSQL 是 Azure 上、PostgreSQL-native、scale-out（co-location 設計驅動）的 distributed SQL。跟 <a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Spanner</a>（全球 external consistency、自己的 SQL 方言）、<a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a>（跨雲、PostgreSQL wire、自動 range 分散）、Aurora DSQL（AWS、全球 active-active）位置不同：Cosmos DB for PostgreSQL 強在「真 PostgreSQL engine + extension 生態 + co-location 控制」、弱在它的分散需要 distribution column 設計（不像 CockroachDB / Spanner 自動分 range）、且綁 Azure。</p>
<h2 id="操作流程">操作流程</h2>
<h3 id="建叢集與設定-distribution-column">建叢集與設定 distribution column</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 建 distributed table、按 tenant_id 分片（多租戶 SaaS 典型）
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">tenant_id</span><span class="w">   </span><span class="nb">bigint</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">event_id</span><span class="w">    </span><span class="nb">bigint</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">payload</span><span class="w">     </span><span class="n">jsonb</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w">  </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="n">now</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;events&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;tenant_id&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 維度小表設 reference table、讓 JOIN co-locate
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">tenants</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="w"> </span><span class="nb">bigint</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">text</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_reference_table</span><span class="p">(</span><span class="s1">&#39;tenants&#39;</span><span class="p">);</span></span></span></code></pre></div><p>驗證：<code>SELECT * FROM citus_tables;</code> 看每張表的 distribution column 與 shard 分布；對 distributed table 的查詢若帶 distribution column filter、<code>EXPLAIN</code> 顯示下推到單一 shard、不帶則 fan-out 到所有 worker。</p>
<h3 id="驗證-co-location">驗證 co-location</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 同 distribution column 的兩張 distributed table JOIN 應 co-located
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">colocation_id</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">citus_tables</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">colocation_id</span><span class="p">;</span></span></span></code></pre></div><p>驗證：常一起 JOIN 的大表落在同一 colocation group、JOIN 在 worker 本地完成、不跨 worker shuffle。</p>
<h3 id="加-worker-擴容">加 worker 擴容</h3>
<p>加 worker node 後 rebalance shard。驗證：rebalance 後 shard 在新舊 worker 間分布均勻、單一 worker 不再是 hot spot。</p>
<h3 id="rollback-boundary">Rollback boundary</h3>
<p>Cosmos DB for PostgreSQL 是叢集級服務、scale worker 是運維操作、可逆（縮回去）。但 <em>distribution column 一旦選定、改它要重建表 + 重灌資料</em> — 跟核心 Cosmos DB 的 partition key 不可改是同一類不可逆設計、見 <a href="../partition-key-design/">partition-key-design</a>。</p>
<h2 id="失敗模式">失敗模式</h2>
<h3 id="把它跟核心-cosmos-db-當同一產品選">把它跟核心 Cosmos DB 當同一產品選</h3>
<p>選型時把「Cosmos DB for PostgreSQL」當成「核心 Cosmos DB 的 PostgreSQL 介面」、規劃用 RU/s、找 consistency level 設定、結果整套 mental model 對不上 — 因為它是分散式 PostgreSQL、用 node 規格計費、用 PostgreSQL 的交易隔離級別。修法是選型第一步就確認「這是分散式 SQL、不是 NoSQL」、規劃按 PostgreSQL + Citus 的模型走、不要套核心 Cosmos DB 的概念。</p>
<h3 id="沒有好的-distribution-column-硬上分散式">沒有好的 distribution column 硬上分散式</h3>
<p>workload 沒有均勻的 distribution column（例如資料天然集中在少數 tenant）、硬分片後變 hot worker、分散式的好處拿不到、複雜度全付。徵兆是少數 worker CPU / IO 飽和、其他 worker 閒置。修法是選型階段就評估 distribution column 的 cardinality 與均勻度；不均勻時、要嘛留 single-node PG（垂直擴 + read replica）、要嘛重新設計 distribution column（如多租戶用 composite 或對 hot tenant 特殊處理）。</p>
<h3 id="大量跨-shard-query--非-co-located-join">大量跨 shard query / 非 co-located JOIN</h3>
<p>application query 大多不帶 distribution column filter、或常做跨 distribution column 的 JOIN、每個 query fan-out 到所有 worker + shuffle、latency 與成本都差。徵兆是 <code>EXPLAIN</code> 顯示 query 打所有 worker、p99 latency 高。修法是重新設計 schema 讓常一起查的表 co-located、把 distribution column 放進熱 query 的 filter；改不動時、這個 workload 可能不適合 scale-out PG、回 single-node 或考慮其他方案。</p>
<h3 id="該用-nosql-卻選了分散式-pg或反之">該用 NoSQL 卻選了分散式 PG（或反之）</h3>
<p>document / KV、固定 access pattern、不需要 JOIN 的 workload 選了 Cosmos DB for PostgreSQL、付了 SQL / distribution column 設計的複雜度卻沒用到關聯能力 — 這類 workload 核心 Cosmos DB（NoSQL）更自然。反過來、SQL / JOIN / 交易重的 workload 被推去核心 Cosmos DB（NoSQL）要重寫成 document model 也是錯。修法是回到「workload 是 SQL 範式還是 document / KV 範式」的根本判斷、見本文選型判準段與 <a href="../mongodb-api-vs-sql-api/">mongodb-api-vs-sql-api</a> 的範式判讀。</p>
<h3 id="anti-recommendationsingle-node-pg-沒到上限不要上">Anti-recommendation：single-node PG 沒到上限不要上</h3>
<p>分散式 PostgreSQL 帶來 distribution column 設計、co-location 規劃、跨 shard query 成本、rebalance 運維。single-node managed PostgreSQL 加 vertical scaling 與 read replica 能撐的 OLTP baseline 比多數團隊以為的大。沒有觸及 single-node 真實上限（write throughput 飽和、單表大到 maintenance 困難、資料量超出單機）就上分散式、是用複雜度換不存在的容量需求。</p>
<h2 id="容量與觀測">容量與觀測</h2>
<ul>
<li>必看 metric：各 worker node 的 CPU / IO / 連線（找 hot worker）、shard 在 worker 間的分布均勻度、跨 shard query 比例、coordinator 連線數</li>
<li>容量單位：node 規格（不是 RU/s）— 規劃是 coordinator + N worker 的 vCPU / memory / storage、跟核心 Cosmos DB 的 RU 思維完全不同、不要混用 <a href="../ru-cost-model-sizing/">ru-cost-model-sizing</a> 的 RU 模型來估這個服務</li>
<li>distribution column 均勻度是容量上限的真實決定因素 — 跟 <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a> 同模型、hot worker 讓名義叢集容量達不到</li>
<li>回 <a href="/blog/backend/09-performance-capacity/capacity-planning/" data-link-title="9.6 容量規劃模型" data-link-desc="peak forecast、headroom budget、growth curve、autoscaling sizing">9.6 容量規劃模型</a>：scale-out 的有效容量 = node 數 × 單 node 容量 × distribution 均勻度</li>
<li>Alert：單一 worker 飽和（distribution skew）、跨 shard query 比例上升、rebalance 後仍不均</li>
</ul>
<h2 id="邊界與整合">邊界與整合</h2>
<ul>
<li>定位釐清：本服務是 <em>分散式 PostgreSQL</em>、不是核心 Cosmos DB（NoSQL）— 共用品牌名稱、產品不同、選型不要混淆</li>
<li>跟核心 Cosmos DB 的分界：SQL / JOIN / 交易 + 到單機上限 → 本服務；document / KV / multi-model / multi-region active-active → 核心 Cosmos DB、見 <a href="../mongodb-api-vs-sql-api/">mongodb-api-vs-sql-api</a></li>
<li>跟 PostgreSQL vendor 的分界：single-node 沒到上限 → <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">Azure Database for PostgreSQL / 一般 PG</a>；PostgreSQL 既有的 <a href="/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/" data-link-title="Specialized PostgreSQL Variants" data-link-desc="pgvectorscale、Citus、TimescaleDB、PostGIS、AlloyDB、Cosmos DB for PostgreSQL、serverless PG 等 PostgreSQL 變體的選型邊界">Specialized PostgreSQL Variants</a> 段已把 Cosmos DB for PostgreSQL 列為 Citus-based 變體之一</li>
<li>跟其他 distributed SQL：<a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Spanner</a>（全球強一致）、<a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a>（跨雲、自動 range）— 本服務強在真 PostgreSQL engine + co-location 控制、弱在需 distribution column 設計 + 綁 Azure</li>
<li>distribution column 不可改：跟 <a href="../partition-key-design/">partition-key-design</a> 的 partition key 不可改是同類不可逆設計</li>
<li>Knowledge card：<a href="/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">distributed SQL</a> / <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a></li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor overview</a> — 本文是該頁尾 Cosmos DB for PostgreSQL backlog 的深度展開</li>
<li><a href="../mongodb-api-vs-sql-api/">mongodb-api-vs-sql-api</a> — SQL 範式 vs document / KV 範式的根本判讀</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor</a> / <a href="/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/" data-link-title="Specialized PostgreSQL Variants" data-link-desc="pgvectorscale、Citus、TimescaleDB、PostGIS、AlloyDB、Cosmos DB for PostgreSQL、serverless PG 等 PostgreSQL 變體的選型邊界">Specialized PostgreSQL Variants</a> — single-node PG 與 Citus 變體定位</li>
<li><a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Spanner vendor</a> / <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB vendor</a> — 其他 distributed SQL 對照</li>
<li><a href="../partition-key-design/">partition-key-design</a> — distribution column 不可改的同類設計</li>
<li><a href="/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">Distributed SQL 卡片</a> / <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition 卡片</a> — 概念基底</li>
<li>官方：<a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/">Azure Cosmos DB for PostgreSQL</a> / <a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/concepts-distributed-data">Citus distributed tables</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL pgBouncer 配置 + 連線池治理</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/</guid><description>&lt;p>PostgreSQL 的 connection 是 &lt;em>昂貴的 process&lt;/em>、每個 connection ~10MB RAM、idle connection 也吃 backend slot。當 application instance 數量爆炸（K8s replica × 多 deployment × pool size）、直接連 PostgreSQL 會把 backend slot 耗盡、新 connection 全 refuse — 即使 active query 不多。pgBouncer 是 &lt;em>connection pool proxy&lt;/em>、把幾千個 application connection 收斂成幾百個 PostgreSQL backend connection、production-grade PostgreSQL 部署的標配。&lt;/p>
&lt;p>本文不是 pgBouncer overview（請看 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor 頁&lt;/a> 中 connection pool 段）— 而是 &lt;em>production 部署 + 故障演練&lt;/em> 的實作層教學。覆蓋三層 pool（application → pgBouncer → PostgreSQL）的對齊、transaction pooling 跟 session pooling 的選擇陷阱、跟 HA failover 的整合、容量規劃。&lt;/p>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>典型觸發場景：團隊規模從 50 人爬到 200 人、microservice 從 20 個爬到 100 個、K8s replica 從 3 個爬到每服務 5-10 個。直連 PostgreSQL 的 connection 計算：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">100 service × 6 replica × 30 application pool = 18000 connection&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>PostgreSQL 預設 &lt;code>max_connections = 100&lt;/code>、production 設 &lt;code>max_connections = 500-1000&lt;/code> 已經是上限（每多一個都加 memory + context switch cost）。18000 連線打 PostgreSQL 直接打爆。&lt;/p>
&lt;p>進一步問題：&lt;/p>
&lt;ul>
&lt;li>一半 connection 是 &lt;em>idle&lt;/em>（application pool 預留、實際沒查詢）— 浪費 backend slot&lt;/li>
&lt;li>Cold start 時所有 replica 同時建 connection、瞬間 spike&lt;/li>
&lt;li>DB failover 時所有 application 同時 reconnect、prod-test pattern 跑不通&lt;/li>
&lt;li>DNS-based failover 時 application connection pool 不知道 backend 換了&lt;/li>
&lt;/ul>
&lt;p>pgBouncer 解這四個問題。但 &lt;em>引入 pgBouncer&lt;/em> 後又會引入新的問題層（pgBouncer 跟 application pool 不對齊、transaction pooling 的 session state 限制、HA 故障時 pgBouncer 也要 failover）— 本文討論這些。&lt;/p>
&lt;h2 id="核心概念pool-mode--sizing">核心概念：pool mode + sizing&lt;/h2>
&lt;p>pgBouncer 的 first-class concept 是 &lt;em>pool mode&lt;/em>、決定 application connection 跟 PostgreSQL backend connection 的綁定方式：&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL 的 connection 是 <em>昂貴的 process</em>、每個 connection ~10MB RAM、idle connection 也吃 backend slot。當 application instance 數量爆炸（K8s replica × 多 deployment × pool size）、直接連 PostgreSQL 會把 backend slot 耗盡、新 connection 全 refuse — 即使 active query 不多。pgBouncer 是 <em>connection pool proxy</em>、把幾千個 application connection 收斂成幾百個 PostgreSQL backend connection、production-grade PostgreSQL 部署的標配。</p>
<p>本文不是 pgBouncer overview（請看 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor 頁</a> 中 connection pool 段）— 而是 <em>production 部署 + 故障演練</em> 的實作層教學。覆蓋三層 pool（application → pgBouncer → PostgreSQL）的對齊、transaction pooling 跟 session pooling 的選擇陷阱、跟 HA failover 的整合、容量規劃。</p>
<h2 id="問題情境">問題情境</h2>
<p>典型觸發場景：團隊規模從 50 人爬到 200 人、microservice 從 20 個爬到 100 個、K8s replica 從 3 個爬到每服務 5-10 個。直連 PostgreSQL 的 connection 計算：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">100 service × 6 replica × 30 application pool = 18000 connection</span></span></code></pre></div><p>PostgreSQL 預設 <code>max_connections = 100</code>、production 設 <code>max_connections = 500-1000</code> 已經是上限（每多一個都加 memory + context switch cost）。18000 連線打 PostgreSQL 直接打爆。</p>
<p>進一步問題：</p>
<ul>
<li>一半 connection 是 <em>idle</em>（application pool 預留、實際沒查詢）— 浪費 backend slot</li>
<li>Cold start 時所有 replica 同時建 connection、瞬間 spike</li>
<li>DB failover 時所有 application 同時 reconnect、prod-test pattern 跑不通</li>
<li>DNS-based failover 時 application connection pool 不知道 backend 換了</li>
</ul>
<p>pgBouncer 解這四個問題。但 <em>引入 pgBouncer</em> 後又會引入新的問題層（pgBouncer 跟 application pool 不對齊、transaction pooling 的 session state 限制、HA 故障時 pgBouncer 也要 failover）— 本文討論這些。</p>
<h2 id="核心概念pool-mode--sizing">核心概念：pool mode + sizing</h2>
<p>pgBouncer 的 first-class concept 是 <em>pool mode</em>、決定 application connection 跟 PostgreSQL backend connection 的綁定方式：</p>
<ul>
<li><strong>Session pooling</strong>：application connection 拿到 backend connection 後、整個 session 期間都綁同一個 backend。tear-down 才釋放。語義跟「直連」一樣、不破壞 session state。但 <em>idle connection 仍占 backend slot</em>、收斂效率低、適合 <em>連線數不多但要保留 session state</em>（用了 prepared statement、temporary table、advisory lock 等）的場景。</li>
<li><strong>Transaction pooling</strong>：application connection 在 <em>transaction 邊界</em> 才綁 backend、commit / rollback 後立即釋放。同一個 application connection 不同 transaction 可能拿到不同 backend。收斂效率高（idle connection 完全不占 backend slot）、但 <em>session state 限制嚴</em> — 不能用 <code>SET</code> 改 session-level setting、不能用 prepared statement（除非 application 端禁用）、不能用 advisory lock 跨 transaction。</li>
<li><strong>Statement pooling</strong>：每個 statement 完就釋放 backend。極端高收斂但 <em>連 transaction 都不能跨 statement</em>、絕大多數 application 用不了、只在 batch query 場景。</li>
</ul>
<p><strong>Production 預設選 transaction pooling</strong>、application 端禁用 prepared statement（或用 <a href="https://www.pgbouncer.org/config.html#max_prepared_statements">PgBouncer-supported prepared statement</a>、需 pgBouncer 1.21+）。例外場景才開 session pooling。</p>
<p><strong>Pool sizing 公式</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">PostgreSQL max_connections     = pgBouncer N × default_pool_size + reserve
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgBouncer default_pool_size    = per-database backend connection 上限
</span></span><span class="line"><span class="ln">3</span><span class="cl">Application pool size          = 每 application instance 拿幾個 pgBouncer connection</span></span></code></pre></div><p>實例：50 個 application replica、每 instance pool 30 個、pgBouncer 後 default_pool_size = 20（per database）、3 個 database。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Total application → pgBouncer = 50 × 30 = 1500 connection
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgBouncer → PostgreSQL        = 3 × 20 = 60 connection
</span></span><span class="line"><span class="ln">3</span><span class="cl">PostgreSQL max_connections    = 60 + reserve (50 預留 admin / migration) = 110</span></span></code></pre></div><p>1500 → 110 收斂 13.6 倍、PostgreSQL 還在合理上限內。</p>
<h2 id="step-by-step-配置">Step-by-step 配置</h2>
<p><strong>pgBouncer.ini</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">[databases]</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="na">mydb</span> <span class="o">=</span> <span class="s">host=postgres-primary.internal port=5432 dbname=mydb auth_user=pgbouncer</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="k">[pgbouncer]</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">listen_port</span> <span class="o">=</span> <span class="s">6432</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">listen_addr</span> <span class="o">=</span> <span class="s">0.0.0.0</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">auth_type</span> <span class="o">=</span> <span class="s">scram-sha-256</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="na">auth_file</span> <span class="o">=</span> <span class="s">/etc/pgbouncer/userlist.txt</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="na">auth_query</span> <span class="o">=</span> <span class="s">SELECT usename, passwd FROM pg_shadow WHERE usename=$1</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="na">pool_mode</span> <span class="o">=</span> <span class="s">transaction</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="na">default_pool_size</span> <span class="o">=</span> <span class="s">20</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="na">min_pool_size</span> <span class="o">=</span> <span class="s">5</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="na">reserve_pool_size</span> <span class="o">=</span> <span class="s">10</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="na">reserve_pool_timeout</span> <span class="o">=</span> <span class="s">5</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="na">max_client_conn</span> <span class="o">=</span> <span class="s">2000</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="na">max_db_connections</span> <span class="o">=</span> <span class="s">100</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="na">server_idle_timeout</span> <span class="o">=</span> <span class="s">600</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="na">server_lifetime</span> <span class="o">=</span> <span class="s">3600</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="na">server_connect_timeout</span> <span class="o">=</span> <span class="s">15</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="na">server_login_retry</span> <span class="o">=</span> <span class="s">5</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="na">client_idle_timeout</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="na">client_login_timeout</span> <span class="o">=</span> <span class="s">60</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">
</span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="na">stats_period</span> <span class="o">=</span> <span class="s">60</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="na">log_connections</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="na">log_disconnections</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="na">log_pooler_errors</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">
</span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="na">admin_users</span> <span class="o">=</span> <span class="s">pgbouncer_admin</span>
</span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="na">stats_users</span> <span class="o">=</span> <span class="s">pgbouncer_stats</span></span></span></code></pre></div><p>關鍵欄位解釋：</p>
<ul>
<li><code>pool_mode = transaction</code>：絕大多數 production 場景</li>
<li><code>default_pool_size = 20</code>：每 database 對 PostgreSQL 的 backend connection 上限、調整時要算進 PostgreSQL <code>max_connections</code></li>
<li><code>reserve_pool_size = 10</code> + <code>reserve_pool_timeout = 5</code>：當 default_pool_size 用滿、等 5 秒還拿不到 connection 才用 reserve pool — 是 <em>突發 spike</em> 的 buffer、不是 baseline</li>
<li><code>max_client_conn = 2000</code>：application 端能連 pgBouncer 的最大數</li>
<li><code>server_lifetime = 3600</code>：每 1 小時強制 recycle backend connection、避免 long-lived connection 累積 memory bloat（PostgreSQL <code>pg_stat_activity</code> 看 connection age）</li>
<li><code>auth_query</code>：pgBouncer 直接從 PostgreSQL <code>pg_shadow</code> 拉密碼、不需要在 pgBouncer 本地維護 userlist — production 推薦做法</li>
</ul>
<p><strong>Application 端 pool 設定</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># 例：Spring Boot HikariCP</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.url</span><span class="p">:</span><span class="w"> </span><span class="l">jdbc:postgresql://pgbouncer.internal:6432/mydb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.maximum-pool-size</span><span class="p">:</span><span class="w"> </span><span class="m">30</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.minimum-idle</span><span class="p">:</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.connection-timeout</span><span class="p">:</span><span class="w"> </span><span class="m">30000</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.idle-timeout</span><span class="p">:</span><span class="w"> </span><span class="m">600000</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.max-lifetime</span><span class="p">:</span><span class="w"> </span><span class="m">1800000</span><span class="w">  </span><span class="c"># 30 min &lt; pgBouncer server_lifetime 60 min</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c"># 例：SQLAlchemy</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="l">engine = create_engine(</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="s2">&#34;postgresql://pgbouncer.internal:6432/mydb&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="l">pool_size=30,</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="l">max_overflow=5,</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="l">pool_pre_ping=True,       </span><span class="w"> </span><span class="c"># 必開、檢測 stale connection</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="l">pool_recycle=1800,        </span><span class="w"> </span><span class="c"># 30 min、跟 pgBouncer server_lifetime 對齊</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="l">)</span></span></span></code></pre></div><p><strong>Application 跟 pgBouncer 對齊</strong>：</p>
<ul>
<li>application <code>max-lifetime</code> &lt; pgBouncer <code>server_lifetime</code>：避免 application 拿到已被 pgBouncer recycle 的 connection</li>
<li><code>pool_pre_ping = True</code>：每次 checkout 前 send <code>SELECT 1</code>、檢測 stale connection — 對 transaction pooling 是必要的</li>
<li>application 端 <em>不要</em> 用 prepared statement（除非 pgBouncer 1.21+ 設 <code>max_prepared_statements</code>）</li>
</ul>
<h2 id="故障演練--邊界-case">故障演練 / 邊界 case</h2>
<h3 id="case-1pool-exhaustiondefault_pool_size-用滿">Case 1：Pool exhaustion（default_pool_size 用滿）</h3>
<p>徵兆：application log <code>ERROR: no more connections allowed</code>、pgBouncer log <code>pool is full</code>、pgBouncer admin console <code>SHOW POOLS</code> 顯示 <code>cl_waiting &gt; 0</code>。</p>
<p>Debug：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 連 pgBouncer admin
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="err">\</span><span class="k">c</span><span class="w"> </span><span class="n">pgbouncer</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">POOLS</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 看 cl_active / cl_waiting / sv_active / sv_idle
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">SERVERS</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- 看 server connection state（active / idle / used）</span></span></span></code></pre></div><p>修：</p>
<ul>
<li>短期：調高 <code>default_pool_size</code> 跟 PostgreSQL <code>max_connections</code>、配合 reserve pool</li>
<li>中期：找 <em>long-running query</em>（PostgreSQL <code>pg_stat_activity</code> 看 <code>query_start</code>、kill 過長 query）</li>
<li>長期：拆 database / 改 read replica / 移 OLAP query 到 data warehouse</li>
</ul>
<h3 id="case-2transaction-pooling-下-session-state-漏洞">Case 2：Transaction pooling 下 session state 漏洞</h3>
<p>徵兆：random 失敗 <code>prepared statement &quot;S_3&quot; does not exist</code>、<code>relation &quot;tmp_xxx&quot; does not exist</code>、advisory lock 不釋放。</p>
<p>原因：application 用了 prepared statement / temporary table / advisory lock、但 transaction commit 後 backend connection 釋放、下一個 transaction 拿到不同 backend、session state 不存在。</p>
<p>修：</p>
<ul>
<li>Application 框架禁用 prepared statement（JDBC <code>prepareThreshold=0</code>、SQLAlchemy <code>use_native_prepared_statements=False</code>）</li>
<li>temporary table 改 <a href="https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-UNLOGGED-TABLES">unlogged table</a> + cleanup</li>
<li>advisory lock 改 row-level lock 或 application-level lock（Redis）</li>
<li>或：切到 session pooling、犧牲收斂效率</li>
</ul>
<h3 id="case-3dns-based-failover-後-application-連到舊-master">Case 3：DNS-based failover 後 application 連到舊 master</h3>
<p>徵兆：PostgreSQL 切換 master 後、application 寫操作 <em>時好時壞</em>（看連到哪台）。</p>
<p>原因：pgBouncer 在 application 跟 PostgreSQL 之間、application 不知道 backend 換了；pgBouncer 自己也需要 reload config 才會連新 master。</p>
<p>修：</p>
<ul>
<li>pgBouncer 用 <code>RECONNECT</code> admin command 強制 close all backend connection、重連</li>
<li>配 Patroni / Stolon 等 HA 工具自動 trigger pgBouncer reconnect</li>
<li>application 端 <code>pool_pre_ping</code> 開啟、stale connection 自動踢</li>
</ul>
<h3 id="case-4server-lifetime-recycle-跟-in-flight-transaction-衝突">Case 4：Server lifetime recycle 跟 in-flight transaction 衝突</h3>
<p>徵兆：偶發 <code>server closed the connection unexpectedly</code>、跟 long-running transaction 重疊。</p>
<p>原因：pgBouncer <code>server_lifetime = 3600</code> 強制 recycle、但有 transaction 在跑時 pgBouncer 不會切、超過時間後仍會切。</p>
<p>修：</p>
<ul>
<li>確認沒有 <em>超過 1 小時</em> 的 transaction（PostgreSQL <code>pg_stat_activity</code> 看 <code>xact_start</code>）</li>
<li>必要時調高 <code>server_lifetime</code>、但 memory bloat 風險上升</li>
<li>application 端做 transaction timeout</li>
</ul>
<h3 id="case-5pgbouncer-自己-crash--oom">Case 5：pgBouncer 自己 crash / OOM</h3>
<p>徵兆：所有 application 同時失去 PostgreSQL 連線。</p>
<p>原因：pgBouncer 是 single-process（除非 1.21+ 用 <code>so_reuseport</code> 多 process）、memory leak / OOM / 部署事件都會打掉整個 connection layer。</p>
<p>修：</p>
<ul>
<li>多 pgBouncer instance + load balancer（HAProxy / Envoy）前置、application 連 LB</li>
<li><code>so_reuseport = 1</code>（1.21+）讓多個 pgBouncer process 共用 port</li>
<li>Resource limit 跟 alert：RSS &gt; N、connection count &gt; M</li>
<li>HA mode：active-passive 配 keepalived</li>
</ul>
<h2 id="容量--cost-規劃">容量 / cost 規劃</h2>
<p><strong>單一 pgBouncer 容量上限</strong>：</p>
<ul>
<li><code>max_client_conn</code>：實務 &lt; 5000 per instance（再高 CPU 跟 file descriptor 緊）</li>
<li><code>default_pool_size × database 數</code>：實務 &lt; 200 per instance</li>
<li>single process CPU bound：在 10K QPS 等級已經是瓶頸、要橫向 scale</li>
</ul>
<p><strong>何時加 pgBouncer instance</strong>：</p>
<ul>
<li>application connection 數突破 3000 / pgBouncer instance</li>
<li>pgBouncer CPU usage &gt; 60%（baseline、不算 spike）</li>
<li>跨 region application 需要 region-local pgBouncer</li>
</ul>
<p><strong>何時改架構（pgBouncer 不夠用）</strong>：</p>
<ul>
<li>PostgreSQL backend connection 數突破 500（即使有 pgBouncer 也撐不住）→ 改 read replica / partitioning / sharding</li>
<li>write 量太大（每秒 50K+ TPS）→ 改 sharding（<a href="https://vitess.io">Vitess</a> / <a href="https://www.citusdata.com">Citus</a>）或全球分散式 SQL（<a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>）</li>
<li>application 大量 prepared statement / session state 需求 → 改 <a href="https://github.com/postgresml/pgcat">PgCat</a>（Rust 寫、支援更完整的 session feature）或回 session pooling</li>
</ul>
<h2 id="整合--下一步">整合 / 下一步</h2>
<p><strong>跟 HA failover 整合</strong>（<a href="https://github.com/zalando/patroni">Patroni</a>）：</p>
<ul>
<li>Patroni 切換 master 後 trigger pgBouncer <code>RECONNECT</code></li>
<li>pgBouncer 透過 service discovery（Consul / etcd）拿新 master 位址、不是寫死在 config</li>
<li>application 不需感知 failover、connection 從 pgBouncer 拿到新 master 的 backend</li>
</ul>
<p><strong>跟監控整合</strong>：</p>
<ul>
<li>pgBouncer admin console <code>SHOW STATS</code> / <code>SHOW POOLS</code> / <code>SHOW SERVERS</code> 拉到 Prometheus（<a href="https://github.com/jbub/pgbouncer_exporter">pgbouncer_exporter</a>）</li>
<li>必看 metric：<code>cl_waiting</code>（等 backend 的 client 數）、<code>sv_active</code>（active backend 數）、<code>avg_query_time</code>、<code>avg_xact_time</code></li>
<li>Alert：<code>cl_waiting &gt; 0 持續 30s</code>、<code>server connection error rate &gt; 0</code></li>
</ul>
<p><strong>跟 application observability 整合</strong>：</p>
<ul>
<li>Application APM（<a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a> / Honeycomb / OpenTelemetry）的 DB span 顯示 <em>application 看到的 latency</em>、pgBouncer metric 顯示 <em>pgBouncer ↔ PostgreSQL latency</em> — 兩者差異揭露 connection wait time</li>
</ul>
<p><strong>何時 revisit 這個配置</strong>：</p>
<ul>
<li>application 數量倍增（trigger pool sizing 重算）</li>
<li>PostgreSQL 升級（pgBouncer 跟 PostgreSQL 版本相容性）</li>
<li>跨 region 部署（要不要 region-local pgBouncer）</li>
<li>切換到 RDS Proxy / Aurora Cluster Endpoint（managed alternative）</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a> — 本文是該頁尾「pgBouncer / PgCat 配置 best practice」backlog 的深度展開</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/connection-scaling/" data-link-title="PostgreSQL Connection Scaling：process-per-connection model 跟為什麼 pooler 是必裝" data-link-desc="PG 每個 client connection fork 一個 backend process（不是 thread）、RAM 成本 5-15MB/connection、context switch 跟 fork() cost 在 100&#43; connection 後線性放大、所以 pooler 不是 *optional optimization* 而是 *production prerequisite*。本文走 process-per-connection model 跟 MySQL thread-per-connection 對比、max_connections &#43; shared_buffers &#43; work_mem 三 GUC 互動、application-side pool vs middleware pool vs RDS Proxy 三層選擇、5 production 踩雷（connection storm / fork() cost 在 burst 流量 / shared_buffers 跟 connection 數壓縮 / double-pool 配置錯誤 / max_connections 設太大反而慢）、跟 PgBouncer config 互補不重複">Connection Scaling Deep Dive</a> — connection-per-process model 跟為什麼 pooler 是必裝（根因 vs 配置）</li>
<li><a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">1.1 高併發資料存取</a> — 上游：什麼時候需要 connection pool</li>
<li><a href="/blog/backend/knowledge-cards/connection-pool/" data-link-title="Connection Pool" data-link-desc="說明連線池如何限制下游資源並影響服務容量">Connection Pool 卡片</a> — 概念基底</li>
<li><a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章方法論</a> — 本文是該方法論的 demo #1</li>
<li><a href="/blog/backend/09-performance-capacity/cases/ntt-docomo-lemino-japanese-streaming/" data-link-title="9.C29 NTT DOCOMO Lemino：3 個月達 500 萬 MAU 的串流後端" data-link-desc="Lemino 用 DynamoDB &#43; AWS Media Services 撐 30 channels live &#43; 5M MAU、工程工時下降 90%">9.C29 Lemino RDB connection limit case</a> — connection 爆是 streaming surge 場景的 vendor-switch 主因</li>
<li>官方：<a href="https://www.pgbouncer.org/usage.html">pgBouncer Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>Aurora PostgreSQL I/O-Optimized Cost</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/aurora-io-optimized-cost/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/aurora-io-optimized-cost/</guid><description>&lt;p>Aurora PostgreSQL I/O-Optimized cost 的核心責任是把 Aurora storage configuration 從定價選項轉成 workload 決策。AWS 官方文件將 Aurora cluster storage configuration 分成 Aurora Standard 與 Aurora I/O-Optimized；前者適合一般 I/O 分布，後者針對 I/O 密集 workload 提供不同成本結構。&lt;/p>
&lt;p>本文的判讀錨點是：I/O-Optimized 是成本與 workload profile 決策，而非效能保證。要看的是 read / write I/O charge、storage、instance、backup、replica、query pattern、maintenance 與未來成長。&lt;/p>
&lt;p>官方文件路由的核心責任是固定時間敏感 claim。實作前先查 &lt;a href="https://docs.aws.amazon.com/en_us/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.StorageReliability.html">Aurora storage configurations&lt;/a> 與 &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.storage-type.html">supported engines / regions&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="cost-model">Cost Model&lt;/h2>
&lt;p>Cost model 的核心責任是拆解 Aurora bill 的來源。Aurora 成本通常包含 instance、storage、I/O request、backup、replica、data transfer 與 support / operation。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>成本項&lt;/th>
 &lt;th>Standard 判讀&lt;/th>
 &lt;th>I/O-Optimized 判讀&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Instance&lt;/td>
 &lt;td>仍依 instance / capacity 計費&lt;/td>
 &lt;td>仍依 instance / capacity 計費&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Storage&lt;/td>
 &lt;td>依儲存使用量&lt;/td>
 &lt;td>依 I/O-Optimized storage 設定&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>I/O requests&lt;/td>
 &lt;td>I/O 成本可成為主要變動項&lt;/td>
 &lt;td>I/O charge 結構改變，適合高 I/O workload&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup / snapshot&lt;/td>
 &lt;td>依保留與使用量&lt;/td>
 &lt;td>仍需納入總成本&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data transfer&lt;/td>
 &lt;td>跨 AZ / region / service 需審查&lt;/td>
 &lt;td>仍需納入總成本&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>成本評估要用真實帳單和 CloudWatch 指標。只用平均 QPS 估算會漏掉 batch job、vacuum、index build、replica、backfill 與報表查詢帶來的 I/O 尖峰。&lt;/p>
&lt;h2 id="workload-signals">Workload Signals&lt;/h2>
&lt;p>Workload signals 的核心責任是找出 I/O 是否為主要成本與瓶頸。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>訊號&lt;/th>
 &lt;th>意義&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>I/O request 成本占比高&lt;/td>
 &lt;td>Standard 可能受 I/O charge 影響大&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Buffer cache hit ratio 低&lt;/td>
 &lt;td>工作集超過 memory 或 query 掃描過重&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>大量 random read / write&lt;/td>
 &lt;td>storage I/O 壓力明顯&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>ETL / backfill 經常跑&lt;/td>
 &lt;td>短期 I/O spike 可能影響帳單與 latency&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index / query 設計已優化&lt;/td>
 &lt;td>成本切換更能反映真實 workload&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>先做 query 與 index review。若 I/O 來自缺 index、全表掃描、過度 eager loading 或不必要 backfill，直接切 I/O-Optimized 只會把浪費制度化。&lt;/p>
&lt;h2 id="evaluation-process">Evaluation Process&lt;/h2>
&lt;p>Evaluation process 的核心責任是讓切換決策可回溯。&lt;/p>
&lt;ol>
&lt;li>收集 30 到 90 天成本：instance、storage、I/O、backup、transfer。&lt;/li>
&lt;li>收集 workload 指標：read/write IOPS、cache hit、slow query、top SQL。&lt;/li>
&lt;li>標記特殊事件：migration、backfill、incident、seasonality。&lt;/li>
&lt;li>建立 Standard vs I/O-Optimized 成本試算。&lt;/li>
&lt;li>在 staging / canary 確認 application behavior。&lt;/li>
&lt;li>設定切換後 7 / 14 / 30 天回顧點。&lt;/li>
&lt;/ol>
&lt;p>試算要包含季節性。月初結算、年度促銷、批次報表與資料重整都可能讓 I/O profile 和普通週不同。&lt;/p></description><content:encoded><![CDATA[<p>Aurora PostgreSQL I/O-Optimized cost 的核心責任是把 Aurora storage configuration 從定價選項轉成 workload 決策。AWS 官方文件將 Aurora cluster storage configuration 分成 Aurora Standard 與 Aurora I/O-Optimized；前者適合一般 I/O 分布，後者針對 I/O 密集 workload 提供不同成本結構。</p>
<p>本文的判讀錨點是：I/O-Optimized 是成本與 workload profile 決策，而非效能保證。要看的是 read / write I/O charge、storage、instance、backup、replica、query pattern、maintenance 與未來成長。</p>
<p>官方文件路由的核心責任是固定時間敏感 claim。實作前先查 <a href="https://docs.aws.amazon.com/en_us/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.StorageReliability.html">Aurora storage configurations</a> 與 <a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.storage-type.html">supported engines / regions</a>；本文最後檢查日是 2026-05-22。</p>
<h2 id="cost-model">Cost Model</h2>
<p>Cost model 的核心責任是拆解 Aurora bill 的來源。Aurora 成本通常包含 instance、storage、I/O request、backup、replica、data transfer 與 support / operation。</p>
<table>
  <thead>
      <tr>
          <th>成本項</th>
          <th>Standard 判讀</th>
          <th>I/O-Optimized 判讀</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance</td>
          <td>仍依 instance / capacity 計費</td>
          <td>仍依 instance / capacity 計費</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>依儲存使用量</td>
          <td>依 I/O-Optimized storage 設定</td>
      </tr>
      <tr>
          <td>I/O requests</td>
          <td>I/O 成本可成為主要變動項</td>
          <td>I/O charge 結構改變，適合高 I/O workload</td>
      </tr>
      <tr>
          <td>Backup / snapshot</td>
          <td>依保留與使用量</td>
          <td>仍需納入總成本</td>
      </tr>
      <tr>
          <td>Data transfer</td>
          <td>跨 AZ / region / service 需審查</td>
          <td>仍需納入總成本</td>
      </tr>
  </tbody>
</table>
<p>成本評估要用真實帳單和 CloudWatch 指標。只用平均 QPS 估算會漏掉 batch job、vacuum、index build、replica、backfill 與報表查詢帶來的 I/O 尖峰。</p>
<h2 id="workload-signals">Workload Signals</h2>
<p>Workload signals 的核心責任是找出 I/O 是否為主要成本與瓶頸。</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>意義</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>I/O request 成本占比高</td>
          <td>Standard 可能受 I/O charge 影響大</td>
      </tr>
      <tr>
          <td>Buffer cache hit ratio 低</td>
          <td>工作集超過 memory 或 query 掃描過重</td>
      </tr>
      <tr>
          <td>大量 random read / write</td>
          <td>storage I/O 壓力明顯</td>
      </tr>
      <tr>
          <td>ETL / backfill 經常跑</td>
          <td>短期 I/O spike 可能影響帳單與 latency</td>
      </tr>
      <tr>
          <td>Index / query 設計已優化</td>
          <td>成本切換更能反映真實 workload</td>
      </tr>
  </tbody>
</table>
<p>先做 query 與 index review。若 I/O 來自缺 index、全表掃描、過度 eager loading 或不必要 backfill，直接切 I/O-Optimized 只會把浪費制度化。</p>
<h2 id="evaluation-process">Evaluation Process</h2>
<p>Evaluation process 的核心責任是讓切換決策可回溯。</p>
<ol>
<li>收集 30 到 90 天成本：instance、storage、I/O、backup、transfer。</li>
<li>收集 workload 指標：read/write IOPS、cache hit、slow query、top SQL。</li>
<li>標記特殊事件：migration、backfill、incident、seasonality。</li>
<li>建立 Standard vs I/O-Optimized 成本試算。</li>
<li>在 staging / canary 確認 application behavior。</li>
<li>設定切換後 7 / 14 / 30 天回顧點。</li>
</ol>
<p>試算要包含季節性。月初結算、年度促銷、批次報表與資料重整都可能讓 I/O profile 和普通週不同。</p>
<h2 id="migration-and-rollback">Migration and Rollback</h2>
<p>Migration and rollback 的核心責任是把 storage configuration change 放進變更流程。Aurora storage configuration 是 cluster-level decision，應先確認支援區域、engine version、切換限制、維護窗口與回退條件。</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-check</td>
          <td>engine version、region support、current bill</td>
      </tr>
      <tr>
          <td>Cost baseline</td>
          <td>近期成本與 I/O 指標</td>
      </tr>
      <tr>
          <td>Change window</td>
          <td>application traffic、maintenance</td>
      </tr>
      <tr>
          <td>Post-check</td>
          <td>latency、I/O、error、bill trend</td>
      </tr>
      <tr>
          <td>Review</td>
          <td>7 / 14 / 30 天成本與效能</td>
      </tr>
  </tbody>
</table>
<p>Rollback 條件要明確。若切換後成本下降未達目標、latency 沒改善、或 workload profile 改變，應重新評估 Standard 與 query optimization。</p>
<h2 id="anti-patterns">Anti-Patterns</h2>
<p>Anti-pattern 的核心責任是避免把計費選項當成效能調校。</p>
<table>
  <thead>
      <tr>
          <th>反模式</th>
          <th>風險</th>
          <th>修正方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>未看 top SQL 直接切換</td>
          <td>把壞 query 的成本包進新方案</td>
          <td>先做 query / index review</td>
      </tr>
      <tr>
          <td>用單日帳單推估全年</td>
          <td>忽略 seasonality</td>
          <td>至少看完整業務週期</td>
      </tr>
      <tr>
          <td>忽略 backup / transfer</td>
          <td>總成本估算失真</td>
          <td>全 bill component 一起比較</td>
      </tr>
      <tr>
          <td>切換後無 review</td>
          <td>成本漂移無 owner</td>
          <td>設定 7 / 14 / 30 天 tripwire</td>
      </tr>
  </tbody>
</table>
<p>I/O-Optimized 的價值來自成本結構對齊 workload。它應該是 FinOps 與 database operation 的共同決策。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Aurora I/O-Optimized cost 完成後，Aurora 遷移讀 <a href="../migrate-to-aurora/">PostgreSQL to Aurora Migration</a>；query 成本讀 <a href="../query-optimization/">Query Optimization</a>；capacity 與瓶頸判斷讀 <a href="/blog/backend/09-performance-capacity/bottleneck-localization/" data-link-title="9.5 瓶頸定位流程" data-link-desc="從 app 到 DB / cache / broker / 第三方 quota 的逐層瓶頸定位">Bottleneck Localization</a>。</p>
]]></content:encoded></item><item><title>Managed PostgreSQL Comparison</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/managed-pg-comparison/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/managed-pg-comparison/</guid><description>&lt;p>Managed PostgreSQL comparison 的核心責任是把「都是 PostgreSQL」拆成不同的操作責任邊界。Managed service 可能代管 backup、patch、replica、minor upgrade、monitoring、connection proxy、serverless scaling 或 branch workflow；但 application schema、query、migration、role、cost 與 incident decision 仍需要 team 承擔。&lt;/p>
&lt;p>本文的判讀錨點是：managed PostgreSQL 是 operation trade-off，而非 vendor-neutral checkbox。選型要看 workload、合規、extension、HA / DR、connection、cost visibility、exit route 與 team skill。&lt;/p>
&lt;p>官方文件路由的核心責任是固定 provider claim。實作前分別查 &lt;a href="https://docs.cloud.google.com/alloydb/docs">AlloyDB docs&lt;/a>、&lt;a href="https://cloud.google.com/sql/postgresql">Cloud SQL for PostgreSQL&lt;/a>、&lt;a href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/overview">Azure Database for PostgreSQL Flexible Server&lt;/a> 與 &lt;a href="https://supabase.com/docs/guides/deployment/branching">Supabase branching docs&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="provider-boundary">Provider Boundary&lt;/h2>
&lt;p>Provider boundary 的核心責任是定義 vendor 接手哪些資料庫操作。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>類型&lt;/th>
 &lt;th>代表選項&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Cloud managed PostgreSQL&lt;/td>
 &lt;td>RDS PostgreSQL、Cloud SQL、Azure PG&lt;/td>
 &lt;td>標準 PostgreSQL、雲平台整合&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Aurora PostgreSQL-compatible&lt;/td>
 &lt;td>Amazon Aurora PostgreSQL&lt;/td>
 &lt;td>AWS 生態、高可用 storage layer、read scaling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Serverless / branching PG&lt;/td>
 &lt;td>Neon、Supabase 部分能力&lt;/td>
 &lt;td>dev preview、稀疏 workload、快速分支&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Specialist managed PG&lt;/td>
 &lt;td>Crunchy Bridge 等&lt;/td>
 &lt;td>PostgreSQL 專業支援、extension 需求&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Self-managed&lt;/td>
 &lt;td>VM / K8s 上自管&lt;/td>
 &lt;td>需要完整控制、具備 DBA 能力&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Provider boundary 要寫成 responsibility matrix。誰負責 backup restore、major upgrade、extension enable、failover、connection proxy、audit export、encryption key、support ticket 與 incident decision。&lt;/p>
&lt;p>Serverless / branching PG 這一列的 Neon 與 Supabase 不在同一個 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/capability-outsourcing-depth/" data-link-title="Capability Outsourcing Depth（外包深度）" data-link-desc="說明外包一塊後端能力有三種深度（managed 基礎設施、feature SaaS、BaaS bundle）、深度決定保留多少控制權與遷出代價">外包深度&lt;/a>。Neon 是純 serverless PostgreSQL（managed 基礎設施）；Supabase 是把 Postgres 當其中一塊的 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/baas/" data-link-title="BaaS（Backend as a Service）" data-link-desc="說明把認證、資料庫、檔案儲存、推播打包成現成模組、由前端 SDK 直連的後端交付形態">BaaS bundle&lt;/a>（同時含 Auth、Storage、Realtime）。只需要資料庫、兩者皆可比較且 Neon 更輕；要連認證、儲存一起到位、才是 Supabase 的賣點。這個外包深度差異與「該買整個 bundle 還是只用它的 Postgres」的判讀、見 &lt;a href="https://tarrragon.github.io/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22 能力級買 vs 建&lt;/a>。&lt;/p>
&lt;h2 id="evaluation-dimensions">Evaluation Dimensions&lt;/h2>
&lt;p>Evaluation dimensions 的核心責任是讓比較避免只看價格或品牌。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PostgreSQL fidelity&lt;/td>
 &lt;td>engine version、extension、parameter、superuser 限制&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>HA / DR&lt;/td>
 &lt;td>AZ failover、cross-region replica、PITR、restore drill&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection&lt;/td>
 &lt;td>max connection、pooler、proxy、serverless cold start&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Migration&lt;/td>
 &lt;td>import/export、logical replication、downtime window&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Observability&lt;/td>
 &lt;td>logs、metrics、slow query、audit、SIEM export&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Security&lt;/td>
 &lt;td>network、IAM、KMS、TLS、RLS / pgAudit support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cost&lt;/td>
 &lt;td>instance、storage、I/O、backup、egress、support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Exit&lt;/td>
 &lt;td>dump、logical replication、snapshot portability&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>PostgreSQL fidelity 是第一關。若服務依賴 extension、logical decoding、superuser function、custom parameter 或 filesystem access，managed provider 的限制會直接影響可行性。&lt;/p></description><content:encoded><![CDATA[<p>Managed PostgreSQL comparison 的核心責任是把「都是 PostgreSQL」拆成不同的操作責任邊界。Managed service 可能代管 backup、patch、replica、minor upgrade、monitoring、connection proxy、serverless scaling 或 branch workflow；但 application schema、query、migration、role、cost 與 incident decision 仍需要 team 承擔。</p>
<p>本文的判讀錨點是：managed PostgreSQL 是 operation trade-off，而非 vendor-neutral checkbox。選型要看 workload、合規、extension、HA / DR、connection、cost visibility、exit route 與 team skill。</p>
<p>官方文件路由的核心責任是固定 provider claim。實作前分別查 <a href="https://docs.cloud.google.com/alloydb/docs">AlloyDB docs</a>、<a href="https://cloud.google.com/sql/postgresql">Cloud SQL for PostgreSQL</a>、<a href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/overview">Azure Database for PostgreSQL Flexible Server</a> 與 <a href="https://supabase.com/docs/guides/deployment/branching">Supabase branching docs</a>；本文最後檢查日是 2026-05-22。</p>
<h2 id="provider-boundary">Provider Boundary</h2>
<p>Provider boundary 的核心責任是定義 vendor 接手哪些資料庫操作。</p>
<table>
  <thead>
      <tr>
          <th>類型</th>
          <th>代表選項</th>
          <th>適合情境</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cloud managed PostgreSQL</td>
          <td>RDS PostgreSQL、Cloud SQL、Azure PG</td>
          <td>標準 PostgreSQL、雲平台整合</td>
      </tr>
      <tr>
          <td>Aurora PostgreSQL-compatible</td>
          <td>Amazon Aurora PostgreSQL</td>
          <td>AWS 生態、高可用 storage layer、read scaling</td>
      </tr>
      <tr>
          <td>Serverless / branching PG</td>
          <td>Neon、Supabase 部分能力</td>
          <td>dev preview、稀疏 workload、快速分支</td>
      </tr>
      <tr>
          <td>Specialist managed PG</td>
          <td>Crunchy Bridge 等</td>
          <td>PostgreSQL 專業支援、extension 需求</td>
      </tr>
      <tr>
          <td>Self-managed</td>
          <td>VM / K8s 上自管</td>
          <td>需要完整控制、具備 DBA 能力</td>
      </tr>
  </tbody>
</table>
<p>Provider boundary 要寫成 responsibility matrix。誰負責 backup restore、major upgrade、extension enable、failover、connection proxy、audit export、encryption key、support ticket 與 incident decision。</p>
<p>Serverless / branching PG 這一列的 Neon 與 Supabase 不在同一個 <a href="/blog/backend/knowledge-cards/capability-outsourcing-depth/" data-link-title="Capability Outsourcing Depth（外包深度）" data-link-desc="說明外包一塊後端能力有三種深度（managed 基礎設施、feature SaaS、BaaS bundle）、深度決定保留多少控制權與遷出代價">外包深度</a>。Neon 是純 serverless PostgreSQL（managed 基礎設施）；Supabase 是把 Postgres 當其中一塊的 <a href="/blog/backend/knowledge-cards/baas/" data-link-title="BaaS（Backend as a Service）" data-link-desc="說明把認證、資料庫、檔案儲存、推播打包成現成模組、由前端 SDK 直連的後端交付形態">BaaS bundle</a>（同時含 Auth、Storage、Realtime）。只需要資料庫、兩者皆可比較且 Neon 更輕；要連認證、儲存一起到位、才是 Supabase 的賣點。這個外包深度差異與「該買整個 bundle 還是只用它的 Postgres」的判讀、見 <a href="/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22 能力級買 vs 建</a>。</p>
<h2 id="evaluation-dimensions">Evaluation Dimensions</h2>
<p>Evaluation dimensions 的核心責任是讓比較避免只看價格或品牌。</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PostgreSQL fidelity</td>
          <td>engine version、extension、parameter、superuser 限制</td>
      </tr>
      <tr>
          <td>HA / DR</td>
          <td>AZ failover、cross-region replica、PITR、restore drill</td>
      </tr>
      <tr>
          <td>Connection</td>
          <td>max connection、pooler、proxy、serverless cold start</td>
      </tr>
      <tr>
          <td>Migration</td>
          <td>import/export、logical replication、downtime window</td>
      </tr>
      <tr>
          <td>Observability</td>
          <td>logs、metrics、slow query、audit、SIEM export</td>
      </tr>
      <tr>
          <td>Security</td>
          <td>network、IAM、KMS、TLS、RLS / pgAudit support</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>instance、storage、I/O、backup、egress、support</td>
      </tr>
      <tr>
          <td>Exit</td>
          <td>dump、logical replication、snapshot portability</td>
      </tr>
  </tbody>
</table>
<p>PostgreSQL fidelity 是第一關。若服務依賴 extension、logical decoding、superuser function、custom parameter 或 filesystem access，managed provider 的限制會直接影響可行性。</p>
<h2 id="workload-fit">Workload Fit</h2>
<p>Workload fit 的核心責任是把 provider 能力和產品需求對齊。</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>優先考量</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SaaS OLTP</td>
          <td>HA、PITR、connection pool、online migration</td>
      </tr>
      <tr>
          <td>Analytics-heavy OLTP</td>
          <td>read replica、I/O cost、work_mem、warehouse boundary</td>
      </tr>
      <tr>
          <td>Dev / preview env</td>
          <td>branching、fast restore、low idle cost</td>
      </tr>
      <tr>
          <td>Regulated workload</td>
          <td>audit、KMS、network isolation、retention</td>
      </tr>
      <tr>
          <td>Extension-heavy app</td>
          <td>PostGIS、pgvector、TimescaleDB、logical decoding support</td>
      </tr>
  </tbody>
</table>
<p>Serverless / branching PG 適合 preview 與稀疏 workload，但 sustained high-throughput production 要審查 cold start、connection、storage separation latency 與 cost curve。</p>
<p>Aurora PostgreSQL 適合 AWS-heavy 架構與高可用 storage layer，但要審查 PostgreSQL compatibility、parameter 限制、I/O cost 與 migration / exit。</p>
<h2 id="migration-and-exit">Migration and Exit</h2>
<p>Migration and exit 的核心責任是避免 managed service 變成單向門。導入前要先知道如何進去、如何出來。</p>
<table>
  <thead>
      <tr>
          <th>流程</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Import</td>
          <td>dump / restore、logical replication、DMS</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>freeze window、replica catch-up、validation</td>
      </tr>
      <tr>
          <td>Rollback</td>
          <td>source snapshot、write replay、DNS switch</td>
      </tr>
      <tr>
          <td>Exit</td>
          <td>pg_dump、logical replication、snapshot export</td>
      </tr>
      <tr>
          <td>Rehearsal</td>
          <td>staging restore、row count、checksum</td>
      </tr>
  </tbody>
</table>
<p>Exit route 要比口頭承諾更具體。至少要能在 staging 將資料匯出到 vanilla PostgreSQL 或下一個 managed provider，並跑 application smoke test。</p>
<h2 id="cost-review">Cost Review</h2>
<p>Cost review 的核心責任是把 managed convenience 轉成總成本。總成本包含 instance、storage、I/O、backup、replica、egress、support、observability、operation labor 與 incident cost。</p>
<table>
  <thead>
      <tr>
          <th>Cost driver</th>
          <th>常見誤判</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>I/O</td>
          <td>只看 instance price</td>
      </tr>
      <tr>
          <td>Backup retention</td>
          <td>長 retention 被忽略</td>
      </tr>
      <tr>
          <td>Cross-region replica</td>
          <td>data transfer / storage 增加</td>
      </tr>
      <tr>
          <td>Observability export</td>
          <td>log volume 與 SIEM 成本</td>
      </tr>
      <tr>
          <td>Serverless idle</td>
          <td>idle 低但 sustained workload 成本不同</td>
      </tr>
  </tbody>
</table>
<p>Cost review 要設 tripwire。當 I/O 成本占比提高、backup retention 變長、replica 增加或 serverless workload 變成常駐，重新評估方案。</p>
<h2 id="decision-route">Decision Route</h2>
<p>Decision route 的核心責任是把 provider 選型導向具體路線。</p>
<table>
  <thead>
      <tr>
          <th>需求</th>
          <th>優先路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>標準雲平台 PostgreSQL</td>
          <td>RDS / Cloud SQL / Azure PG</td>
      </tr>
      <tr>
          <td>AWS 生態 + HA storage layer</td>
          <td>Aurora PostgreSQL</td>
      </tr>
      <tr>
          <td>Preview branch / dev env</td>
          <td>Neon / Supabase branch workflow</td>
      </tr>
      <tr>
          <td>Extension / PG 專業支援</td>
          <td>specialist managed PG</td>
      </tr>
      <tr>
          <td>完整控制與特殊 extension</td>
          <td>self-managed PostgreSQL</td>
      </tr>
  </tbody>
</table>
<p>Managed provider 的最終選擇要回到 team skill。少維護元件是價值；把尚未理解的限制外包給 vendor，會在 incident 和 migration 時回來。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Managed PostgreSQL comparison 完成後，Aurora 遷移讀 <a href="../migrate-to-aurora/">PostgreSQL to Aurora Migration</a>；Aurora DSQL 讀 <a href="../migrate-to-aurora-dsql/">PostgreSQL to Aurora DSQL</a>；serverless / specialized variant 讀 <a href="../specialized-pg-variants/">Specialized PostgreSQL Variants</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Connection Pool Lab</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/connection-pool-lab/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/connection-pool-lab/</guid><description>&lt;p>PostgreSQL connection pool lab 的核心責任是讓讀者看到 connection pressure 如何從 application pool 傳到 PostgreSQL backend process。這篇承接 &lt;a href="../../connection-scaling/">Connection Scaling&lt;/a> 與 &lt;a href="../../pgbouncer-config/">PgBouncer Config&lt;/a>。&lt;/p>
&lt;p>本文的驗收標準是：你能比較 direct connection 與 PgBouncer transaction pooling，取得 &lt;code>pg_stat_activity&lt;/code>、PgBouncer &lt;code>SHOW POOLS&lt;/code>、latency / error sample 與 failure note。&lt;/p>
&lt;h2 id="baseline-direct-connections">Baseline Direct Connections&lt;/h2>
&lt;p>Baseline direct connections 的核心責任是先看 application 直連 PostgreSQL 時的 backend 數。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="nb">export&lt;/span> &lt;span class="nv">DATABASE_URL&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;postgres://lab_admin:lab_admin_pw@localhost:54329/appdb?sslmode=disable&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -c &lt;span class="s2">&amp;#34;SELECT count(*) FROM pg_stat_activity WHERE datname = current_database();&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用多個 terminal 或簡單 workload 產生 idle connection：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">for&lt;/span> i in &lt;span class="m">1&lt;/span> &lt;span class="m">2&lt;/span> &lt;span class="m">3&lt;/span> &lt;span class="m">4&lt;/span> 5&lt;span class="p">;&lt;/span> &lt;span class="k">do&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -c &lt;span class="s2">&amp;#34;SELECT pg_sleep(10);&amp;#34;&lt;/span> &lt;span class="p">&amp;amp;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="k">done&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -c &lt;span class="s2">&amp;#34;SELECT state, count(*) FROM pg_stat_activity WHERE datname = current_database() GROUP BY state;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>這一步證明每個 client session 會占用 PostgreSQL backend process。&lt;/p>
&lt;h2 id="add-pgbouncer">Add PgBouncer&lt;/h2>
&lt;p>Add PgBouncer 的核心責任是把 client connection 與 server connection 拆開。以下 compose fragment 可加入 local lab：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">pgbouncer&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">edoburu/pgbouncer:latest&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">environment&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">DB_HOST&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">postgres&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">DB_USER&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lab_admin&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">DB_PASSWORD&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lab_admin_pw&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">DB_NAME&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">appdb&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">POOL_MODE&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">transaction&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">MAX_CLIENT_CONN&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">100&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">DEFAULT_POOL_SIZE&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">5&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ports&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;64329:5432&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>啟動後設定 pooler URL：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="nb">export&lt;/span> &lt;span class="nv">POOL_URL&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;postgres://lab_admin:lab_admin_pw@localhost:64329/appdb?sslmode=disable&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="compare-pool-behavior">Compare Pool Behavior&lt;/h2>
&lt;p>Compare pool behavior 的核心責任是觀察 client 多、server 少的效果。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">for&lt;/span> i in &lt;span class="k">$(&lt;/span>seq &lt;span class="m">1&lt;/span> 20&lt;span class="k">)&lt;/span>&lt;span class="p">;&lt;/span> &lt;span class="k">do&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$POOL_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -c &lt;span class="s2">&amp;#34;SELECT pg_sleep(1);&amp;#34;&lt;/span> &lt;span class="p">&amp;amp;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="k">done&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -c &lt;span class="s2">&amp;#34;SELECT state, count(*) FROM pg_stat_activity WHERE datname = current_database() GROUP BY state;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>再進 PgBouncer admin console，實際命令依 image 設定調整：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;postgres://lab_admin:lab_admin_pw@localhost:64329/pgbouncer?sslmode=disable&amp;#34;&lt;/span> -c &lt;span class="s2">&amp;#34;SHOW POOLS;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>驗收重點是：client workload 增加時，PostgreSQL backend 數量被 pool size 控制，排隊發生在 pooler 層。&lt;/p>
&lt;h2 id="pool-exhaustion">Pool Exhaustion&lt;/h2>
&lt;p>Pool exhaustion 的核心責任是看過載時的錯誤與等待。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL connection pool lab 的核心責任是讓讀者看到 connection pressure 如何從 application pool 傳到 PostgreSQL backend process。這篇承接 <a href="../../connection-scaling/">Connection Scaling</a> 與 <a href="../../pgbouncer-config/">PgBouncer Config</a>。</p>
<p>本文的驗收標準是：你能比較 direct connection 與 PgBouncer transaction pooling，取得 <code>pg_stat_activity</code>、PgBouncer <code>SHOW POOLS</code>、latency / error sample 與 failure note。</p>
<h2 id="baseline-direct-connections">Baseline Direct Connections</h2>
<p>Baseline direct connections 的核心責任是先看 application 直連 PostgreSQL 時的 backend 數。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">export</span> <span class="nv">DATABASE_URL</span><span class="o">=</span><span class="s2">&#34;postgres://lab_admin:lab_admin_pw@localhost:54329/appdb?sslmode=disable&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;SELECT count(*) FROM pg_stat_activity WHERE datname = current_database();&#34;</span></span></span></code></pre></div><p>用多個 terminal 或簡單 workload 產生 idle connection：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">for</span> i in <span class="m">1</span> <span class="m">2</span> <span class="m">3</span> <span class="m">4</span> 5<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;SELECT pg_sleep(10);&#34;</span> <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="k">done</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;SELECT state, count(*) FROM pg_stat_activity WHERE datname = current_database() GROUP BY state;&#34;</span></span></span></code></pre></div><p>這一步證明每個 client session 會占用 PostgreSQL backend process。</p>
<h2 id="add-pgbouncer">Add PgBouncer</h2>
<p>Add PgBouncer 的核心責任是把 client connection 與 server connection 拆開。以下 compose fragment 可加入 local lab：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="w">  </span><span class="nt">pgbouncer</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">edoburu/pgbouncer:latest</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">      </span><span class="nt">DB_HOST</span><span class="p">:</span><span class="w"> </span><span class="l">postgres</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">      </span><span class="nt">DB_USER</span><span class="p">:</span><span class="w"> </span><span class="l">lab_admin</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">      </span><span class="nt">DB_PASSWORD</span><span class="p">:</span><span class="w"> </span><span class="l">lab_admin_pw</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">      </span><span class="nt">DB_NAME</span><span class="p">:</span><span class="w"> </span><span class="l">appdb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">      </span><span class="nt">POOL_MODE</span><span class="p">:</span><span class="w"> </span><span class="l">transaction</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">      </span><span class="nt">MAX_CLIENT_CONN</span><span class="p">:</span><span class="w"> </span><span class="m">100</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">      </span><span class="nt">DEFAULT_POOL_SIZE</span><span class="p">:</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;64329:5432&#34;</span></span></span></code></pre></div><p>啟動後設定 pooler URL：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="nb">export</span> <span class="nv">POOL_URL</span><span class="o">=</span><span class="s2">&#34;postgres://lab_admin:lab_admin_pw@localhost:64329/appdb?sslmode=disable&#34;</span></span></span></code></pre></div><h2 id="compare-pool-behavior">Compare Pool Behavior</h2>
<p>Compare pool behavior 的核心責任是觀察 client 多、server 少的效果。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">for</span> i in <span class="k">$(</span>seq <span class="m">1</span> 20<span class="k">)</span><span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  psql <span class="s2">&#34;</span><span class="nv">$POOL_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;SELECT pg_sleep(1);&#34;</span> <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="k">done</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;SELECT state, count(*) FROM pg_stat_activity WHERE datname = current_database() GROUP BY state;&#34;</span></span></span></code></pre></div><p>再進 PgBouncer admin console，實際命令依 image 設定調整：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;postgres://lab_admin:lab_admin_pw@localhost:64329/pgbouncer?sslmode=disable&#34;</span> -c <span class="s2">&#34;SHOW POOLS;&#34;</span></span></span></code></pre></div><p>驗收重點是：client workload 增加時，PostgreSQL backend 數量被 pool size 控制，排隊發生在 pooler 層。</p>
<h2 id="pool-exhaustion">Pool Exhaustion</h2>
<p>Pool exhaustion 的核心責任是看過載時的錯誤與等待。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">for</span> i in <span class="k">$(</span>seq <span class="m">1</span> 50<span class="k">)</span><span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  psql <span class="s2">&#34;</span><span class="nv">$POOL_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;BEGIN; SELECT pg_sleep(5); COMMIT;&#34;</span> <span class="p">&amp;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="k">done</span></span></span></code></pre></div><p>觀察：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;SELECT count(*) FROM pg_stat_activity WHERE datname = current_database();&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">psql <span class="s2">&#34;postgres://lab_admin:lab_admin_pw@localhost:64329/pgbouncer?sslmode=disable&#34;</span> -c <span class="s2">&#34;SHOW POOLS;&#34;</span></span></span></code></pre></div><p>Pool exhaustion 的 evidence 包含 waiting clients、timeout、application latency 與 error message。這些要接到 production alert。</p>
<h2 id="failure-note">Failure Note</h2>
<p>Failure note 的核心責任是把 lab 結果轉成 runbook。記錄三件事：</p>
<ol>
<li>Direct connection baseline backend 數。</li>
<li>PgBouncer transaction pooling 下 server connection 數。</li>
<li>Pool exhaustion 時的 latency / error / queue。</li>
</ol>
<p>若 application 使用 session state、prepared statement、temp table 或 advisory lock，還要補 transaction pooling compatibility matrix。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>完成本篇後，回到 <a href="../../connection-pooler-comparison/">Connection Pooler Comparison</a> 做選型；要看 PgBouncer production 設定讀 <a href="../../pgbouncer-config/">PgBouncer Config</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Connection Pooler Comparison</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-pooler-comparison/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-pooler-comparison/</guid><description>&lt;p>PostgreSQL connection pooler comparison 的核心責任是把連線數壓力、transaction 語意與維運責任拆開判讀。PostgreSQL backend process 成本高，application instance 擴張後，connection pooler 常成為保護資料庫的第一層容量控制。&lt;/p>
&lt;p>本文的判讀錨點是：pooler 解決的是 connection fan-out 與 queueing，而非查詢本身變快。查詢慢、lock wait、transaction 過長、index 錯誤仍要回到 &lt;a href="../query-optimization/">Query Optimization&lt;/a> 與 &lt;a href="../mvcc-lock-model/">MVCC / lock model&lt;/a>。&lt;/p>
&lt;h2 id="pooling-models">Pooling Models&lt;/h2>
&lt;p>Pooling model 的核心責任是決定 client connection 和 server connection 的綁定時間。PgBouncer 代表最常見的 PostgreSQL pooler 模型；官方文件將 pool mode 分成 session、transaction 與 statement。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>模式&lt;/th>
 &lt;th>Server connection 綁定&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;th>主要風險&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Session&lt;/td>
 &lt;td>client session 全程&lt;/td>
 &lt;td>使用 session state、temp table&lt;/td>
 &lt;td>壓縮率低&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Transaction&lt;/td>
 &lt;td>transaction 期間&lt;/td>
 &lt;td>Web API、短交易、Stateless query&lt;/td>
 &lt;td>session variable、prepared statement 語意受限&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Statement&lt;/td>
 &lt;td>single statement&lt;/td>
 &lt;td>特殊 read-only workload&lt;/td>
 &lt;td>transaction workflow 受限&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>App pool&lt;/td>
 &lt;td>application process 內&lt;/td>
 &lt;td>單服務、低 fan-out&lt;/td>
 &lt;td>多 instance 後總連線失控&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/transaction-pooling/" data-link-title="Transaction Pooling" data-link-desc="說明 connection pooler 的 transaction 綁定模式如何壓縮連線並改變 session 語意">Transaction pooling&lt;/a> 的價值在於把大量 idle client connection 收斂成少量 active server connection。它要求 application 把 session state 放回 request / transaction boundary，例如 timezone、role、search_path、prepared statement 與 advisory lock 都要明確管理。&lt;/p>
&lt;p>Session pooling 的價值在於相容性。若 application 大量使用 temp table、LISTEN / NOTIFY、session-level setting 或 server-side prepared statement，session pooling 能降低行為差異，但連線壓縮效果較弱。&lt;/p>
&lt;h2 id="product-boundary">Product Boundary&lt;/h2>
&lt;p>Product boundary 的核心責任是把 pooler 放在正確的維運位置。不同選項的責任邊界差異很大。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>選項&lt;/th>
 &lt;th>主要責任&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PgBouncer&lt;/td>
 &lt;td>輕量 PostgreSQL connection pooling&lt;/td>
 &lt;td>自管 VM / K8s、transaction pooling 標準路線&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Odyssey&lt;/td>
 &lt;td>多租戶與複雜 routing pooler&lt;/td>
 &lt;td>大型部署、需要進階 routing / auth&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>RDS Proxy&lt;/td>
 &lt;td>AWS managed connection proxy&lt;/td>
 &lt;td>RDS / Aurora 生態、希望降低 proxy 維運&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application pool&lt;/td>
 &lt;td>服務內部連線池&lt;/td>
 &lt;td>instance 數少、連線總量可控&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>No pooler&lt;/td>
 &lt;td>直接連 PostgreSQL&lt;/td>
 &lt;td>小型服務、低併發、連線數遠低於上限&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>PgBouncer 的操作重點是 mode、pool size、server reset query、auth、TLS 與 metrics。它很適合放在 application 與 database 中間，承擔連線排隊與 backpressure。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL connection pooler comparison 的核心責任是把連線數壓力、transaction 語意與維運責任拆開判讀。PostgreSQL backend process 成本高，application instance 擴張後，connection pooler 常成為保護資料庫的第一層容量控制。</p>
<p>本文的判讀錨點是：pooler 解決的是 connection fan-out 與 queueing，而非查詢本身變快。查詢慢、lock wait、transaction 過長、index 錯誤仍要回到 <a href="../query-optimization/">Query Optimization</a> 與 <a href="../mvcc-lock-model/">MVCC / lock model</a>。</p>
<h2 id="pooling-models">Pooling Models</h2>
<p>Pooling model 的核心責任是決定 client connection 和 server connection 的綁定時間。PgBouncer 代表最常見的 PostgreSQL pooler 模型；官方文件將 pool mode 分成 session、transaction 與 statement。</p>
<table>
  <thead>
      <tr>
          <th>模式</th>
          <th>Server connection 綁定</th>
          <th>適合情境</th>
          <th>主要風險</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Session</td>
          <td>client session 全程</td>
          <td>使用 session state、temp table</td>
          <td>壓縮率低</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>transaction 期間</td>
          <td>Web API、短交易、Stateless query</td>
          <td>session variable、prepared statement 語意受限</td>
      </tr>
      <tr>
          <td>Statement</td>
          <td>single statement</td>
          <td>特殊 read-only workload</td>
          <td>transaction workflow 受限</td>
      </tr>
      <tr>
          <td>App pool</td>
          <td>application process 內</td>
          <td>單服務、低 fan-out</td>
          <td>多 instance 後總連線失控</td>
      </tr>
  </tbody>
</table>
<p><a href="/blog/backend/knowledge-cards/transaction-pooling/" data-link-title="Transaction Pooling" data-link-desc="說明 connection pooler 的 transaction 綁定模式如何壓縮連線並改變 session 語意">Transaction pooling</a> 的價值在於把大量 idle client connection 收斂成少量 active server connection。它要求 application 把 session state 放回 request / transaction boundary，例如 timezone、role、search_path、prepared statement 與 advisory lock 都要明確管理。</p>
<p>Session pooling 的價值在於相容性。若 application 大量使用 temp table、LISTEN / NOTIFY、session-level setting 或 server-side prepared statement，session pooling 能降低行為差異，但連線壓縮效果較弱。</p>
<h2 id="product-boundary">Product Boundary</h2>
<p>Product boundary 的核心責任是把 pooler 放在正確的維運位置。不同選項的責任邊界差異很大。</p>
<table>
  <thead>
      <tr>
          <th>選項</th>
          <th>主要責任</th>
          <th>適合情境</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PgBouncer</td>
          <td>輕量 PostgreSQL connection pooling</td>
          <td>自管 VM / K8s、transaction pooling 標準路線</td>
      </tr>
      <tr>
          <td>Odyssey</td>
          <td>多租戶與複雜 routing pooler</td>
          <td>大型部署、需要進階 routing / auth</td>
      </tr>
      <tr>
          <td>RDS Proxy</td>
          <td>AWS managed connection proxy</td>
          <td>RDS / Aurora 生態、希望降低 proxy 維運</td>
      </tr>
      <tr>
          <td>Application pool</td>
          <td>服務內部連線池</td>
          <td>instance 數少、連線總量可控</td>
      </tr>
      <tr>
          <td>No pooler</td>
          <td>直接連 PostgreSQL</td>
          <td>小型服務、低併發、連線數遠低於上限</td>
      </tr>
  </tbody>
</table>
<p>PgBouncer 的操作重點是 mode、pool size、server reset query、auth、TLS 與 metrics。它很適合放在 application 與 database 中間，承擔連線排隊與 backpressure。</p>
<p>Managed proxy 的操作重點是平台限制、failover behavior、credential integration、latency overhead 與 observability。若 team 想少維護一個 pooler process，managed proxy 可以降低操作成本，但要接受雲平台邊界。</p>
<h2 id="decision-signals">Decision Signals</h2>
<p>Decision signals 的核心責任是判斷何時導入 pooler，以及導入哪一種。連線數壓力要用 evidence 說明。</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>代表問題</th>
          <th>建議路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>max_connections</code> 接近上限</td>
          <td>application fan-out 過高</td>
          <td>PgBouncer transaction pooling</td>
      </tr>
      <tr>
          <td>大量 idle connection</td>
          <td>client 連線長期閒置</td>
          <td>transaction pooling 或 app pool 調整</td>
      </tr>
      <tr>
          <td>failover 後 reconnect storm</td>
          <td>client 同時重連衝擊 primary</td>
          <td>pooler queue + jitter</td>
      </tr>
      <tr>
          <td>query latency 高但 connection 不高</td>
          <td>查詢 / lock / index 問題</td>
          <td>query optimization</td>
      </tr>
      <tr>
          <td>session state 依賴多</td>
          <td>transaction pooling 相容性風險</td>
          <td>session pooling 或 refactor session state</td>
      </tr>
  </tbody>
</table>
<p>Connection pooler 的成功訊號是 database backend count 下降、queue 可觀測、error rate 穩定、tail latency 受控。若導入後只是把 timeout 從 DB 移到 pooler，代表 capacity model 仍需調整。</p>
<h2 id="transaction-pooling-compatibility">Transaction Pooling Compatibility</h2>
<p>Transaction pooling compatibility 的核心責任是找出 application 對 session state 的隱性依賴。這些依賴要在 staging 先測出來。</p>
<table>
  <thead>
      <tr>
          <th>依賴類型</th>
          <th>風險</th>
          <th>修正策略</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>SET search_path</code></td>
          <td>下一個 transaction 可能換連線</td>
          <td>每個 transaction 明確設定或固定 schema</td>
      </tr>
      <tr>
          <td>temp table</td>
          <td>transaction 後 server connection 釋放</td>
          <td>改 permanent staging table 或 session mode</td>
      </tr>
      <tr>
          <td>prepared statement</td>
          <td>server-side state 不穩定</td>
          <td>使用 client-side prepare 或 session mode</td>
      </tr>
      <tr>
          <td>advisory lock</td>
          <td>lock ownership 混亂</td>
          <td>transaction-scoped lock 或移出 pooler path</td>
      </tr>
      <tr>
          <td>LISTEN / NOTIFY</td>
          <td>session channel 需要持續連線</td>
          <td>專用 direct connection</td>
      </tr>
  </tbody>
</table>
<p>Compatibility review 要在 repository / migration / background job 三個層面跑。Web request 通常容易改成 transaction-safe；migration tool、CDC job、worker queue 常有長連線與 session state，要分開配置。</p>
<h2 id="sizing-and-evidence">Sizing and Evidence</h2>
<p>Sizing and evidence 的核心責任是用 workload 設定 pool size。Pooler 設太大會把壓力直接傳到 PostgreSQL；設太小會造成 queue 與 timeout。</p>
<p>基本 sizing 步驟：</p>
<ol>
<li>量測 active query concurrency，而非只看 request concurrency。</li>
<li>設定 database 保留連線給 admin、replication、migration 與 emergency access。</li>
<li>每個 service 設定 pool quota，避免單一服務吃掉全部 backend。</li>
<li>觀測 wait time、server utilization、client timeout、query latency。</li>
<li>用 load test 驗證 failover / reconnect storm。</li>
</ol>
<p>Pooler dashboard 至少要有 client connections、server connections、waiting clients、pool wait time、server reuse、timeout count 與 authentication failure。</p>
<h2 id="anti-patterns">Anti-Patterns</h2>
<p>Anti-pattern 的核心責任是把 pooler 常見誤用提前排除。</p>
<table>
  <thead>
      <tr>
          <th>反模式</th>
          <th>風險</th>
          <th>修正方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>把 pool size 設到 DB 上限</td>
          <td>DB 失去保護層</td>
          <td>每個服務配額 + 保留 admin capacity</td>
      </tr>
      <tr>
          <td>transaction pooling 直接上線</td>
          <td>session state 依賴在 production 爆出</td>
          <td>staging compatibility matrix</td>
      </tr>
      <tr>
          <td>pooler 沒有 metrics</td>
          <td>queueing 事故難以判讀</td>
          <td>pooler dashboard + alert</td>
      </tr>
      <tr>
          <td>migration 共用 web pool</td>
          <td>長 DDL 卡住 web request</td>
          <td>migration 專用連線與維護窗口</td>
      </tr>
      <tr>
          <td>retry 無 jitter</td>
          <td>reconnect storm 放大</td>
          <td>exponential backoff + jitter</td>
      </tr>
  </tbody>
</table>
<p>Pooler 是 backpressure 元件。它要讓系統在過載時可排隊、可拒絕、可觀測，而非把所有請求推進 database。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Connection pooler comparison 完成後，實作層讀 <a href="../pgbouncer-config/">PgBouncer config</a>；要觀察連線壓力讀 <a href="../connection-scaling/">Connection Scaling</a>；需要演練讀 <a href="../hands-on/connection-pool-lab/">Connection Pool Lab</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Cross-region DR</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/cross-region-dr/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/cross-region-dr/</guid><description>&lt;p>PostgreSQL cross-region DR 的核心責任是把區域性事故下的資料恢復、服務切換與資料一致性風險寫成可演練流程。跨區 DR 通常由法規、業務連續性、雲區故障、區域隔離或高可用承諾觸發。&lt;/p>
&lt;p>本文的判讀錨點是：cross-region DR 是恢復策略，而非自動等同 multi-region active-active。PostgreSQL 可以透過 backup / WAL archive、physical standby、logical replication、managed service replica 或 application-level replication 支援不同 RPO / RTO；每種路線都有資料延遲、切換與回切成本。&lt;/p>
&lt;h2 id="dr-strategy">DR Strategy&lt;/h2>
&lt;p>DR strategy 的核心責任是把恢復目標和技術路線對齊。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>策略&lt;/th>
 &lt;th>RPO / RTO 型態&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Backup + WAL archive&lt;/td>
 &lt;td>RPO 依 WAL archive，RTO 依 restore&lt;/td>
 &lt;td>成本敏感、低頻災難復原&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-region standby&lt;/td>
 &lt;td>RPO 接近 replication lag，RTO 較短&lt;/td>
 &lt;td>需要較快啟動 read / promote&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Logical replication&lt;/td>
 &lt;td>table-level / selective DR&lt;/td>
 &lt;td>跨版本、跨 schema、局部資料同步&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Managed global DB&lt;/td>
 &lt;td>雲平台提供跨區 replica&lt;/td>
 &lt;td>希望降低自管複製與 promote 維運&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application replay&lt;/td>
 &lt;td>event / queue 重建狀態&lt;/td>
 &lt;td>domain event 已是 source of truth&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>RPO 要由業務定義。若付款、訂單、庫存只允許秒級遺失，backup-only 路線通常成本不足；若是內部報表或可重建資料，backup + WAL archive 可能足夠。&lt;/p>
&lt;h2 id="physical-vs-logical">Physical vs Logical&lt;/h2>
&lt;p>Physical vs logical 的核心責任是區分 byte-level recovery 與 row-level replication。Physical replica 保留 PostgreSQL cluster 層級狀態；&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/logical-replication/" data-link-title="Logical Replication" data-link-desc="說明以表為粒度解碼 row-level 變更的複製方式，對照 byte-level 的實體複製">logical replication&lt;/a> 提供 table / publication 層級彈性。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>Physical standby&lt;/th>
 &lt;th>Logical replication&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>粒度&lt;/td>
 &lt;td>cluster / database&lt;/td>
 &lt;td>table / publication&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>版本彈性&lt;/td>
 &lt;td>通常要求版本與系統相容&lt;/td>
 &lt;td>可支援跨版本 / selective migration&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DDL&lt;/td>
 &lt;td>跟隨 WAL / 需相容&lt;/td>
 &lt;td>需要 schema coordination&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failover&lt;/td>
 &lt;td>promote standby&lt;/td>
 &lt;td>application / target DB 切換&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>風險&lt;/td>
 &lt;td>replication lag、timeline&lt;/td>
 &lt;td>slot lag、schema drift、missing key&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Physical standby 適合整體 DR。它的 runbook 要處理 WAL archive、replication lag、promotion、timeline、DNS / connection string 切換與回切。&lt;/p>
&lt;p>Logical replication 適合局部資料或跨版本轉換。它的 runbook 要處理 publication、subscription、replication slot、schema migration ordering 與資料 diff。&lt;/p>
&lt;h2 id="failover-runbook">Failover Runbook&lt;/h2>
&lt;p>Failover runbook 的核心責任是把災難切換變成可演練步驟。最小流程包含 incident declare、source freeze、replica health check、promote、traffic switch、data validation 與 rollback / rebuild。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL cross-region DR 的核心責任是把區域性事故下的資料恢復、服務切換與資料一致性風險寫成可演練流程。跨區 DR 通常由法規、業務連續性、雲區故障、區域隔離或高可用承諾觸發。</p>
<p>本文的判讀錨點是：cross-region DR 是恢復策略，而非自動等同 multi-region active-active。PostgreSQL 可以透過 backup / WAL archive、physical standby、logical replication、managed service replica 或 application-level replication 支援不同 RPO / RTO；每種路線都有資料延遲、切換與回切成本。</p>
<h2 id="dr-strategy">DR Strategy</h2>
<p>DR strategy 的核心責任是把恢復目標和技術路線對齊。</p>
<table>
  <thead>
      <tr>
          <th>策略</th>
          <th>RPO / RTO 型態</th>
          <th>適合情境</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Backup + WAL archive</td>
          <td>RPO 依 WAL archive，RTO 依 restore</td>
          <td>成本敏感、低頻災難復原</td>
      </tr>
      <tr>
          <td>Cross-region standby</td>
          <td>RPO 接近 replication lag，RTO 較短</td>
          <td>需要較快啟動 read / promote</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>table-level / selective DR</td>
          <td>跨版本、跨 schema、局部資料同步</td>
      </tr>
      <tr>
          <td>Managed global DB</td>
          <td>雲平台提供跨區 replica</td>
          <td>希望降低自管複製與 promote 維運</td>
      </tr>
      <tr>
          <td>Application replay</td>
          <td>event / queue 重建狀態</td>
          <td>domain event 已是 source of truth</td>
      </tr>
  </tbody>
</table>
<p>RPO 要由業務定義。若付款、訂單、庫存只允許秒級遺失，backup-only 路線通常成本不足；若是內部報表或可重建資料，backup + WAL archive 可能足夠。</p>
<h2 id="physical-vs-logical">Physical vs Logical</h2>
<p>Physical vs logical 的核心責任是區分 byte-level recovery 與 row-level replication。Physical replica 保留 PostgreSQL cluster 層級狀態；<a href="/blog/backend/knowledge-cards/logical-replication/" data-link-title="Logical Replication" data-link-desc="說明以表為粒度解碼 row-level 變更的複製方式，對照 byte-level 的實體複製">logical replication</a> 提供 table / publication 層級彈性。</p>
<table>
  <thead>
      <tr>
          <th>面向</th>
          <th>Physical standby</th>
          <th>Logical replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>粒度</td>
          <td>cluster / database</td>
          <td>table / publication</td>
      </tr>
      <tr>
          <td>版本彈性</td>
          <td>通常要求版本與系統相容</td>
          <td>可支援跨版本 / selective migration</td>
      </tr>
      <tr>
          <td>DDL</td>
          <td>跟隨 WAL / 需相容</td>
          <td>需要 schema coordination</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>promote standby</td>
          <td>application / target DB 切換</td>
      </tr>
      <tr>
          <td>風險</td>
          <td>replication lag、timeline</td>
          <td>slot lag、schema drift、missing key</td>
      </tr>
  </tbody>
</table>
<p>Physical standby 適合整體 DR。它的 runbook 要處理 WAL archive、replication lag、promotion、timeline、DNS / connection string 切換與回切。</p>
<p>Logical replication 適合局部資料或跨版本轉換。它的 runbook 要處理 publication、subscription、replication slot、schema migration ordering 與資料 diff。</p>
<h2 id="failover-runbook">Failover Runbook</h2>
<p>Failover runbook 的核心責任是把災難切換變成可演練步驟。最小流程包含 incident declare、source freeze、replica health check、promote、traffic switch、data validation 與 rollback / rebuild。</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>操作</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Declare incident</td>
          <td>確認 primary region 事故範圍</td>
          <td>incident decision log</td>
      </tr>
      <tr>
          <td>Freeze source</td>
          <td>停止寫入或確認 source 已不可用</td>
          <td>last known LSN / timestamp</td>
      </tr>
      <tr>
          <td>Check replica</td>
          <td>lag、WAL received、read health</td>
          <td>replica status snapshot</td>
      </tr>
      <tr>
          <td>Promote</td>
          <td>promote standby 或啟用 target</td>
          <td>new timeline / role</td>
      </tr>
      <tr>
          <td>Switch traffic</td>
          <td>DNS、secret、connection string</td>
          <td>app smoke test</td>
      </tr>
      <tr>
          <td>Validate</td>
          <td>row count、critical invariant</td>
          <td>validation report</td>
      </tr>
      <tr>
          <td>Rebuild</td>
          <td>重建舊 primary 或新 standby</td>
          <td>follow-up runbook</td>
      </tr>
  </tbody>
</table>
<p>Failover 決策要有 owner。自動化可以執行步驟，但是否接受資料遺失、是否凍結寫入、是否 promote，仍需要明確責任人與 tripwire。</p>
<h2 id="data-reconciliation">Data Reconciliation</h2>
<p>Data reconciliation 的核心責任是處理 cross-region 切換後的資料差異。只要 replication lag 存在，failover 後就可能有未套用交易。</p>
<table>
  <thead>
      <tr>
          <th>差異類型</th>
          <th>處理方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>已提交但未複製</td>
          <td>從 source WAL / app log / event 補償</td>
      </tr>
      <tr>
          <td>client retry 重複寫入</td>
          <td>idempotency key / natural key 去重</td>
      </tr>
      <tr>
          <td>sequence / identity</td>
          <td>target sequence reset / collision check</td>
      </tr>
      <tr>
          <td>external side effect</td>
          <td>payment、email、queue 需對帳</td>
      </tr>
  </tbody>
</table>
<p>Reconciliation 要先定義 critical table。所有表都做 full diff 成本高；付款、訂單、權限、ledger、mutation log 等高風險資料要有專用 validation query。</p>
<h2 id="drill-design">Drill Design</h2>
<p>Drill design 的核心責任是定期驗證 RPO / RTO。DR 文件只有在演練後才可信。</p>
<p>演練至少包含：</p>
<ol>
<li>從 backup + WAL 還原到指定時間。</li>
<li>Promote standby 到 isolated environment。</li>
<li>Application 使用 DR endpoint 跑 smoke test。</li>
<li>計算實際 RPO / RTO。</li>
<li>記錄失敗點、人工步驟與下一次修正。</li>
</ol>
<p>演練應避開 production destructive action。使用 isolated VPC、staging app、read-only validation 與 mock external side effect。</p>
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go conditions 的核心責任是指出 PostgreSQL cross-region DR 的邊界。</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>建議路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>多區同時交易寫入是核心需求</td>
          <td>CockroachDB / Spanner / YugabyteDB 類 distributed SQL</td>
      </tr>
      <tr>
          <td>RPO 接近零且跨區距離大</td>
          <td>synchronous replication latency 成本評估</td>
      </tr>
      <tr>
          <td>Team 缺少 DR 演練能力</td>
          <td>managed service + vendor runbook</td>
      </tr>
      <tr>
          <td>數據 residency 限制跨區複製</td>
          <td>regional shard / policy-driven replication</td>
      </tr>
  </tbody>
</table>
<p>Cross-region DR 要誠實面對延遲。把每個 region 都變成 writer 需要 distributed transaction 模型；PostgreSQL DR 路線主要提供恢復與切換。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Cross-region DR 完成後，恢復實作讀 <a href="../pitr-wal-archiving/">PITR / WAL Archiving</a>；replication 架構讀 <a href="../replication-topology/">Replication Topology</a>；跨區 rollout 的資料政策讀 <a href="../multi-region-gdpr-rollout/">Multi-region GDPR Rollout</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Developer / DBA Responsibility Split</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/developer-dba-responsibility-split/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/developer-dba-responsibility-split/</guid><description>&lt;p>PostgreSQL developer / DBA responsibility split 的核心責任是把資料庫決策拆成 application ownership、database operation 與 platform governance。PostgreSQL 功能深，事故常跨 query、schema、connection、backup、replication 與 capacity；若責任分工模糊，問題會在 release 與 incident 時放大。&lt;/p>
&lt;p>本文的判讀錨點是：developer 和 DBA 分工要讓每個決策有清楚 owner、evidence、review gate 與 rollback，而非把資料庫丟給某一方。&lt;/p>
&lt;h2 id="ownership-map">Ownership Map&lt;/h2>
&lt;p>Ownership map 的核心責任是定義誰能改什麼、誰要驗證什麼。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>Developer owner&lt;/th>
 &lt;th>DBA / platform owner&lt;/th>
 &lt;th>Shared gate&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema design&lt;/td>
 &lt;td>domain model、constraint、query&lt;/td>
 &lt;td>naming、storage、partition、extension&lt;/td>
 &lt;td>migration review&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query performance&lt;/td>
 &lt;td>repository SQL、query shape&lt;/td>
 &lt;td>index、planner、statistics、capacity&lt;/td>
 &lt;td>explain evidence&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Migration&lt;/td>
 &lt;td>app compatibility、rollback&lt;/td>
 &lt;td>lock impact、DDL strategy、PITR&lt;/td>
 &lt;td>release gate&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection&lt;/td>
 &lt;td>pool usage、transaction length&lt;/td>
 &lt;td>pooler、max connection、proxy&lt;/td>
 &lt;td>load test&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup / DR&lt;/td>
 &lt;td>restore smoke test&lt;/td>
 &lt;td>WAL archive、PITR、replica&lt;/td>
 &lt;td>restore drill&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Security&lt;/td>
 &lt;td>tenant / workflow intent&lt;/td>
 &lt;td>role、RLS、audit、grant&lt;/td>
 &lt;td>access review&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>這張表的重點是 shared gate。Developer 最懂產品語意，DBA / platform 最懂資料庫風險；正式變更需要兩邊的 evidence 合併。&lt;/p>
&lt;h2 id="schema-and-migration">Schema and Migration&lt;/h2>
&lt;p>Schema and migration 的核心責任是讓 application release 與 database change 同步。Developer 應提供 business invariant、compatibility window、read/write path；DBA / platform 應審查 lock、index build、table rewrite、replica lag 與 rollback。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Migration 類型&lt;/th>
 &lt;th>Developer evidence&lt;/th>
 &lt;th>DBA / platform evidence&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Add nullable column&lt;/td>
 &lt;td>app read/write compatibility&lt;/td>
 &lt;td>DDL lock time、replica impact&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Add NOT NULL&lt;/td>
 &lt;td>backfill plan、default behavior&lt;/td>
 &lt;td>table rewrite / validation strategy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index build&lt;/td>
 &lt;td>query contract、expected selectivity&lt;/td>
 &lt;td>concurrent build、disk、bloat&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Partition change&lt;/td>
 &lt;td>routing logic、retention behavior&lt;/td>
 &lt;td>detach / attach、maintenance window&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Type change&lt;/td>
 &lt;td>serialization、API compatibility&lt;/td>
 &lt;td>cast risk、rewrite duration&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Migration review 要從 failure mode 開始。若 migration 卡住，誰停止 rollout；若 backfill 造成 lag，誰降速；若 app 新舊版本同時存在，哪個 schema 能兼容兩者。&lt;/p>
&lt;h2 id="query-and-capacity">Query and Capacity&lt;/h2>
&lt;p>Query and capacity 的核心責任是把 query shape 和 database resource 對齊。Developer 負責避免 N+1、長交易、無界查詢與錯誤 pagination；DBA / platform 負責 index、statistics、vacuum、work_mem、connection 與 storage。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL developer / DBA responsibility split 的核心責任是把資料庫決策拆成 application ownership、database operation 與 platform governance。PostgreSQL 功能深，事故常跨 query、schema、connection、backup、replication 與 capacity；若責任分工模糊，問題會在 release 與 incident 時放大。</p>
<p>本文的判讀錨點是：developer 和 DBA 分工要讓每個決策有清楚 owner、evidence、review gate 與 rollback，而非把資料庫丟給某一方。</p>
<h2 id="ownership-map">Ownership Map</h2>
<p>Ownership map 的核心責任是定義誰能改什麼、誰要驗證什麼。</p>
<table>
  <thead>
      <tr>
          <th>面向</th>
          <th>Developer owner</th>
          <th>DBA / platform owner</th>
          <th>Shared gate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema design</td>
          <td>domain model、constraint、query</td>
          <td>naming、storage、partition、extension</td>
          <td>migration review</td>
      </tr>
      <tr>
          <td>Query performance</td>
          <td>repository SQL、query shape</td>
          <td>index、planner、statistics、capacity</td>
          <td>explain evidence</td>
      </tr>
      <tr>
          <td>Migration</td>
          <td>app compatibility、rollback</td>
          <td>lock impact、DDL strategy、PITR</td>
          <td>release gate</td>
      </tr>
      <tr>
          <td>Connection</td>
          <td>pool usage、transaction length</td>
          <td>pooler、max connection、proxy</td>
          <td>load test</td>
      </tr>
      <tr>
          <td>Backup / DR</td>
          <td>restore smoke test</td>
          <td>WAL archive、PITR、replica</td>
          <td>restore drill</td>
      </tr>
      <tr>
          <td>Security</td>
          <td>tenant / workflow intent</td>
          <td>role、RLS、audit、grant</td>
          <td>access review</td>
      </tr>
  </tbody>
</table>
<p>這張表的重點是 shared gate。Developer 最懂產品語意，DBA / platform 最懂資料庫風險；正式變更需要兩邊的 evidence 合併。</p>
<h2 id="schema-and-migration">Schema and Migration</h2>
<p>Schema and migration 的核心責任是讓 application release 與 database change 同步。Developer 應提供 business invariant、compatibility window、read/write path；DBA / platform 應審查 lock、index build、table rewrite、replica lag 與 rollback。</p>
<table>
  <thead>
      <tr>
          <th>Migration 類型</th>
          <th>Developer evidence</th>
          <th>DBA / platform evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Add nullable column</td>
          <td>app read/write compatibility</td>
          <td>DDL lock time、replica impact</td>
      </tr>
      <tr>
          <td>Add NOT NULL</td>
          <td>backfill plan、default behavior</td>
          <td>table rewrite / validation strategy</td>
      </tr>
      <tr>
          <td>Index build</td>
          <td>query contract、expected selectivity</td>
          <td>concurrent build、disk、bloat</td>
      </tr>
      <tr>
          <td>Partition change</td>
          <td>routing logic、retention behavior</td>
          <td>detach / attach、maintenance window</td>
      </tr>
      <tr>
          <td>Type change</td>
          <td>serialization、API compatibility</td>
          <td>cast risk、rewrite duration</td>
      </tr>
  </tbody>
</table>
<p>Migration review 要從 failure mode 開始。若 migration 卡住，誰停止 rollout；若 backfill 造成 lag，誰降速；若 app 新舊版本同時存在，哪個 schema 能兼容兩者。</p>
<h2 id="query-and-capacity">Query and Capacity</h2>
<p>Query and capacity 的核心責任是把 query shape 和 database resource 對齊。Developer 負責避免 N+1、長交易、無界查詢與錯誤 pagination；DBA / platform 負責 index、statistics、vacuum、work_mem、connection 與 storage。</p>
<p>Query review 的最小 evidence：</p>
<ol>
<li>SQL text 或 repository method。</li>
<li>Expected cardinality 與資料量。</li>
<li><code>EXPLAIN</code> / <code>EXPLAIN ANALYZE</code> 結果。</li>
<li>Index 依賴與 fallback plan。</li>
<li>Timeout、pagination、transaction boundary。</li>
</ol>
<p>Capacity review 要把 query 放進 workload。單一 query 快不代表整體穩定；高頻 query、batch job、migration backfill、CDC consumer 都會共享 I/O、CPU、lock 與 WAL。</p>
<h2 id="incident-roles">Incident Roles</h2>
<p>Incident roles 的核心責任是讓資料庫事故有分工。Incident 發生時，developer 看 workflow、feature flag、traffic 與 recent deploy；DBA / platform 看 lock、replica、WAL、disk、pooler 與 backup。</p>
<table>
  <thead>
      <tr>
          <th>Incident</th>
          <th>Developer 第一反應</th>
          <th>DBA / platform 第一反應</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lock storm</td>
          <td>暫停相關 workflow、停 rollout</td>
          <td>查 blocking PID、DDL、transaction</td>
      </tr>
      <tr>
          <td>Connection exhaustion</td>
          <td>降低 app concurrency、停 retry storm</td>
          <td>pooler queue、max connection、admin access</td>
      </tr>
      <tr>
          <td>Replica lag</td>
          <td>暫停 heavy write / backfill</td>
          <td>WAL sender、slot、standby apply</td>
      </tr>
      <tr>
          <td>Bad migration</td>
          <td>block release、保留 failed state</td>
          <td>restore point、rollback / PITR</td>
      </tr>
      <tr>
          <td>Slow query spike</td>
          <td>feature flag、query owner</td>
          <td>plan regression、statistics、index</td>
      </tr>
  </tbody>
</table>
<p>Incident command 要保留決策紀錄。資料庫事故常有高壓操作，例如 kill session、promote replica、drop slot、restore backup；每個操作都要記錄原因與回復路線。</p>
<h2 id="review-cadence">Review Cadence</h2>
<p>Review cadence 的核心責任是把資料庫品質納入日常。建議節奏如下：</p>
<table>
  <thead>
      <tr>
          <th>節奏</th>
          <th>Review 內容</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>每個 release</td>
          <td>migration diff、new query、role / grant</td>
      </tr>
      <tr>
          <td>每週</td>
          <td>slow query、lock wait、replica lag、pool</td>
      </tr>
      <tr>
          <td>每月</td>
          <td>backup restore drill、index bloat、vacuum</td>
      </tr>
      <tr>
          <td>每季</td>
          <td>DR drill、major version plan、extension review</td>
      </tr>
  </tbody>
</table>
<p>Review cadence 要跟服務風險對齊。高交易量或合規服務需要更短週期；內部工具可以更輕量，但仍要保留 backup / restore evidence。</p>
<h2 id="handoff-artifact">Handoff Artifact</h2>
<p>Handoff artifact 的核心責任是讓下一位維護者能接手。</p>
<p>最小內容：</p>
<ol>
<li>Database owner、application owner、platform owner。</li>
<li>Schema migration process 與 rollback route。</li>
<li>Query review checklist。</li>
<li>Connection / pooler policy。</li>
<li>Backup / PITR / DR evidence。</li>
<li>Security / role / audit owner。</li>
<li>Incident escalation route。</li>
</ol>
<p>這份 artifact 應連回 <a href="../">PostgreSQL overview</a>、<a href="../hands-on/schema-migration-evidence-lab/">Schema Migration Evidence Lab</a> 與 <a href="../hands-on/pitr-restore-drill/">PITR Restore Drill</a>。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>責任分工建立後，migration gate 讀 <a href="../online-schema-change/">Online Schema Change</a>；連線責任讀 <a href="../connection-pooler-comparison/">Connection Pooler Comparison</a>；安全責任讀 <a href="../security-rls-audit-logging/">Security / RLS / Audit Logging</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL HA Failover Drill</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/ha-failover-drill/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/ha-failover-drill/</guid><description>&lt;p>PostgreSQL HA failover drill 的核心責任是讓讀者觀察 primary promotion 對 application、pooler 與 incident decision 的影響。這篇承接 &lt;a href="../../patroni-ha/">Patroni HA&lt;/a> 與 &lt;a href="../../cross-region-dr/">Cross-region DR&lt;/a>。&lt;/p>
&lt;p>本文的驗收標準是：你能記錄 failover timeline、replication lag snapshot、client error sample、data validation query 與 incident decision log entry。實際觸發方式依 Patroni、managed PostgreSQL 或雲平台而異；lab 重點是 evidence。&lt;/p>
&lt;h2 id="pre-failover-baseline">Pre-Failover Baseline&lt;/h2>
&lt;p>Pre-failover baseline 的核心責任是確認 primary / standby 狀態與 client route。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_is_in_recovery&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">now&lt;/span>&lt;span class="p">(),&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_current_wal_lsn&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">application_name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">state&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">sync_state&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">replay_lag&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_stat_replication&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>在 standby 查：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_is_in_recovery&lt;/span>&lt;span class="p">();&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">now&lt;/span>&lt;span class="p">(),&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_last_wal_receive_lsn&lt;/span>&lt;span class="p">(),&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_last_wal_replay_lsn&lt;/span>&lt;span class="p">();&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Baseline 要保存 primary host、standby host、replication lag、application connection string、pooler route 與 current timeline。&lt;/p>
&lt;h2 id="client-workload">Client Workload&lt;/h2>
&lt;p>Client workload 的核心責任是讓 failover 對 application 的影響可見。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">while&lt;/span> true&lt;span class="p">;&lt;/span> &lt;span class="k">do&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> date -u
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -c &lt;span class="s2">&amp;#34;INSERT INTO restore_markers(marker) VALUES (&amp;#39;failover-drill&amp;#39;) RETURNING id, created_at;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> sleep &lt;span class="m">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="k">done&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>這個 loop 會在 failover 期間產生成功、timeout、connection reset 或 read-only error。正式演練要用 synthetic workload，避免影響真實使用者。&lt;/p>
&lt;h2 id="trigger-failover">Trigger Failover&lt;/h2>
&lt;p>Trigger failover 的核心責任是以受控方式促成 promotion。Patroni lab 可以用 &lt;code>patronictl failover&lt;/code>；managed service 則用 provider failover / reboot with failover 功能。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">failover_start_time:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">trigger_method:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">old_primary:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">candidate:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">operator:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">reason:&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Failover 觸發前要先確認這是演練，並且 workload、backup、rollback 與 stakeholder 都已對齊。&lt;/p>
&lt;h2 id="observe-promotion">Observe Promotion&lt;/h2>
&lt;p>Observe promotion 的核心責任是記錄資料庫與 client 的時間線。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>時間點&lt;/th>
 &lt;th>Evidence&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Trigger issued&lt;/td>
 &lt;td>command / provider event&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Old primary down&lt;/td>
 &lt;td>connection error / health check&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>New primary promoted&lt;/td>
 &lt;td>&lt;code>pg_is_in_recovery() = false&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Client reconnect&lt;/td>
 &lt;td>first successful write&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Pooler stable&lt;/td>
 &lt;td>pool queue / server connection&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Validation complete&lt;/td>
 &lt;td>row count / marker sequence&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Promotion timeline 要保留秒級時間戳。這是評估 RTO、client retry 與 pooler behavior 的基礎。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL HA failover drill 的核心責任是讓讀者觀察 primary promotion 對 application、pooler 與 incident decision 的影響。這篇承接 <a href="../../patroni-ha/">Patroni HA</a> 與 <a href="../../cross-region-dr/">Cross-region DR</a>。</p>
<p>本文的驗收標準是：你能記錄 failover timeline、replication lag snapshot、client error sample、data validation query 與 incident decision log entry。實際觸發方式依 Patroni、managed PostgreSQL 或雲平台而異；lab 重點是 evidence。</p>
<h2 id="pre-failover-baseline">Pre-Failover Baseline</h2>
<p>Pre-failover baseline 的核心責任是確認 primary / standby 狀態與 client route。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_is_in_recovery</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">now</span><span class="p">(),</span><span class="w"> </span><span class="n">pg_current_wal_lsn</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">application_name</span><span class="p">,</span><span class="w"> </span><span class="k">state</span><span class="p">,</span><span class="w"> </span><span class="n">sync_state</span><span class="p">,</span><span class="w"> </span><span class="n">replay_lag</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_replication</span><span class="p">;</span></span></span></code></pre></div><p>在 standby 查：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_is_in_recovery</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">now</span><span class="p">(),</span><span class="w"> </span><span class="n">pg_last_wal_receive_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">pg_last_wal_replay_lsn</span><span class="p">();</span></span></span></code></pre></div><p>Baseline 要保存 primary host、standby host、replication lag、application connection string、pooler route 與 current timeline。</p>
<h2 id="client-workload">Client Workload</h2>
<p>Client workload 的核心責任是讓 failover 對 application 的影響可見。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">while</span> true<span class="p">;</span> <span class="k">do</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">  date -u
</span></span><span class="line"><span class="ln">3</span><span class="cl">  psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;INSERT INTO restore_markers(marker) VALUES (&#39;failover-drill&#39;) RETURNING id, created_at;&#34;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">  sleep <span class="m">1</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="k">done</span></span></span></code></pre></div><p>這個 loop 會在 failover 期間產生成功、timeout、connection reset 或 read-only error。正式演練要用 synthetic workload，避免影響真實使用者。</p>
<h2 id="trigger-failover">Trigger Failover</h2>
<p>Trigger failover 的核心責任是以受控方式促成 promotion。Patroni lab 可以用 <code>patronictl failover</code>；managed service 則用 provider failover / reboot with failover 功能。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">failover_start_time:
</span></span><span class="line"><span class="ln">2</span><span class="cl">trigger_method:
</span></span><span class="line"><span class="ln">3</span><span class="cl">old_primary:
</span></span><span class="line"><span class="ln">4</span><span class="cl">candidate:
</span></span><span class="line"><span class="ln">5</span><span class="cl">operator:
</span></span><span class="line"><span class="ln">6</span><span class="cl">reason:</span></span></code></pre></div><p>Failover 觸發前要先確認這是演練，並且 workload、backup、rollback 與 stakeholder 都已對齊。</p>
<h2 id="observe-promotion">Observe Promotion</h2>
<p>Observe promotion 的核心責任是記錄資料庫與 client 的時間線。</p>
<table>
  <thead>
      <tr>
          <th>時間點</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Trigger issued</td>
          <td>command / provider event</td>
      </tr>
      <tr>
          <td>Old primary down</td>
          <td>connection error / health check</td>
      </tr>
      <tr>
          <td>New primary promoted</td>
          <td><code>pg_is_in_recovery() = false</code></td>
      </tr>
      <tr>
          <td>Client reconnect</td>
          <td>first successful write</td>
      </tr>
      <tr>
          <td>Pooler stable</td>
          <td>pool queue / server connection</td>
      </tr>
      <tr>
          <td>Validation complete</td>
          <td>row count / marker sequence</td>
      </tr>
  </tbody>
</table>
<p>Promotion timeline 要保留秒級時間戳。這是評估 RTO、client retry 與 pooler behavior 的基礎。</p>
<h2 id="data-validation">Data Validation</h2>
<p>Data validation 的核心責任是確認 failover 後資料一致性。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">restore_markers</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">marker</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;failover-drill&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">max</span><span class="p">(</span><span class="n">created_at</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">restore_markers</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">status</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">accounts</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">status</span><span class="p">;</span></span></span></code></pre></div><p>若 workload 有 idempotency key，還要檢查 duplicate。若外部 side effect 參與交易，例如 payment 或 queue，必須有 reconciliation query。</p>
<h2 id="pooler-and-client-behavior">Pooler and Client Behavior</h2>
<p>Pooler and client behavior 的核心責任是確認 failover 後連線能重新指向新 primary。</p>
<p>檢查項目：</p>
<ol>
<li>Application retry 是否有 backoff / jitter。</li>
<li>PgBouncer / proxy 是否清掉舊 server connection。</li>
<li>DNS / endpoint TTL 是否符合 RTO。</li>
<li>Read-only error 是否被正確分類。</li>
<li>Migration / background job 是否暫停。</li>
</ol>
<p>Failover 的完成標準包含 database promote、client reconnect 與 pooler stable。若 client 長時間連到舊 primary 或 pooler 卡住，服務仍處於 unavailable 狀態。</p>
<h2 id="incident-decision-log">Incident Decision Log</h2>
<p>Incident decision log 的核心責任是把演練變成可審查紀錄。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Incident / drill id:
</span></span><span class="line"><span class="ln">2</span><span class="cl">Decision: promote standby
</span></span><span class="line"><span class="ln">3</span><span class="cl">Reason:
</span></span><span class="line"><span class="ln">4</span><span class="cl">Accepted data loss:
</span></span><span class="line"><span class="ln">5</span><span class="cl">RTO observed:
</span></span><span class="line"><span class="ln">6</span><span class="cl">Client impact:
</span></span><span class="line"><span class="ln">7</span><span class="cl">Validation result:
</span></span><span class="line"><span class="ln">8</span><span class="cl">Follow-up:</span></span></code></pre></div><p>每次 drill 都要產生 follow-up。常見 follow-up 是調整 retry、降低 DNS TTL、補 pooler command、增加 validation query 或改善 monitoring。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>完成本篇後，HA 架構讀 <a href="../../patroni-ha/">Patroni HA</a>；跨區災難復原讀 <a href="../../cross-region-dr/">Cross-region DR</a>；connection retry 與 pooler 行為讀 <a href="../connection-pool-lab/">Connection Pool Lab</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Hands-on 操作路線</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/</guid><description>&lt;p>PostgreSQL hands-on 操作路線的核心責任是把 overview 與 deep article 的判讀轉成可演練的操作流程。這一層對齊 LLM &lt;code>hands-on/&lt;/code> 的功能：讀者不只知道 PostgreSQL 的機制，也能在 local lab 或 staging 產出可驗證 artifact。&lt;/p>
&lt;h2 id="章節列表">章節列表&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>章節&lt;/th>
 &lt;th>主題&lt;/th>
 &lt;th>產出 artifact&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;a href="local-lab-quickstart/">Local lab quickstart&lt;/a>&lt;/td>
 &lt;td>Docker Compose 啟動 PostgreSQL、建立 schema、跑 sample workload&lt;/td>
 &lt;td>local DSN、schema migration log、basic metric snapshot&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;a href="connection-pool-lab/">Connection pool lab&lt;/a>&lt;/td>
 &lt;td>application pool → pgBouncer → PostgreSQL 的連線壓力演練&lt;/td>
 &lt;td>pool config、connection count evidence、failure note&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;a href="pitr-restore-drill/">PITR restore drill&lt;/a>&lt;/td>
 &lt;td>base backup + WAL archive + restore target time 的恢復演練&lt;/td>
 &lt;td>restore record、RPO / RTO evidence、validation query&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;a href="schema-migration-evidence-lab/">Schema migration evidence lab&lt;/a>&lt;/td>
 &lt;td>expand / contract migration、validation query、rollback condition&lt;/td>
 &lt;td>migration plan、row count、rollback note&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;a href="ha-failover-drill/">HA failover drill&lt;/a>&lt;/td>
 &lt;td>Patroni / managed failover 的 application impact 演練&lt;/td>
 &lt;td>failover timeline、client error sample、decision log&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="設計原則">設計原則&lt;/h2>
&lt;p>PostgreSQL hands-on 章節只收錄能產出 evidence 的操作。純安裝指令留給官方文件；本路線要教讀者如何知道設定生效、失敗時看到什麼、以及 evidence 要交給 04 / 06 / 08 的哪個 artifact。&lt;/p>
&lt;h2 id="引用路徑">引用路徑&lt;/h2>
&lt;ul>
&lt;li>上游：&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview&lt;/a>&lt;/li>
&lt;li>Deep article：&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &amp;#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer Config&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &amp;#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &amp;#43; WAL archive 構成 PITR 的雙軌資料、archive_command &amp;#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &amp;#43; monitoring 整合">PITR + WAL Archiving&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &amp;#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA&lt;/a>&lt;/li>
&lt;li>跨模組：&lt;a href="https://tarrragon.github.io/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">Observability Evidence Package&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/backend/06-reliability/migration-safety/" data-link-title="6.11 Migration Safety 與 DB Rollout" data-link-desc="把 schema migration 從一次性事件變成可逆、可漸進的 rollout 流程">Migration Safety&lt;/a>、&lt;a href="https://tarrragon.github.io/blog/backend/08-incident-response/incident-decision-log/" data-link-title="8.19 Incident Decision Log" data-link-desc="把事中假設、決策、證據、回退條件與責任人留下可復盤紀錄">Incident Decision Log&lt;/a>&lt;/li>
&lt;/ul></description><content:encoded><![CDATA[<p>PostgreSQL hands-on 操作路線的核心責任是把 overview 與 deep article 的判讀轉成可演練的操作流程。這一層對齊 LLM <code>hands-on/</code> 的功能：讀者不只知道 PostgreSQL 的機制，也能在 local lab 或 staging 產出可驗證 artifact。</p>
<h2 id="章節列表">章節列表</h2>
<table>
  <thead>
      <tr>
          <th>章節</th>
          <th>主題</th>
          <th>產出 artifact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="local-lab-quickstart/">Local lab quickstart</a></td>
          <td>Docker Compose 啟動 PostgreSQL、建立 schema、跑 sample workload</td>
          <td>local DSN、schema migration log、basic metric snapshot</td>
      </tr>
      <tr>
          <td><a href="connection-pool-lab/">Connection pool lab</a></td>
          <td>application pool → pgBouncer → PostgreSQL 的連線壓力演練</td>
          <td>pool config、connection count evidence、failure note</td>
      </tr>
      <tr>
          <td><a href="pitr-restore-drill/">PITR restore drill</a></td>
          <td>base backup + WAL archive + restore target time 的恢復演練</td>
          <td>restore record、RPO / RTO evidence、validation query</td>
      </tr>
      <tr>
          <td><a href="schema-migration-evidence-lab/">Schema migration evidence lab</a></td>
          <td>expand / contract migration、validation query、rollback condition</td>
          <td>migration plan、row count、rollback note</td>
      </tr>
      <tr>
          <td><a href="ha-failover-drill/">HA failover drill</a></td>
          <td>Patroni / managed failover 的 application impact 演練</td>
          <td>failover timeline、client error sample、decision log</td>
      </tr>
  </tbody>
</table>
<h2 id="設計原則">設計原則</h2>
<p>PostgreSQL hands-on 章節只收錄能產出 evidence 的操作。純安裝指令留給官方文件；本路線要教讀者如何知道設定生效、失敗時看到什麼、以及 evidence 要交給 04 / 06 / 08 的哪個 artifact。</p>
<h2 id="引用路徑">引用路徑</h2>
<ul>
<li>上游：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a></li>
<li>Deep article：<a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer Config</a>、<a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a>、<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></li>
<li>跨模組：<a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">Observability Evidence Package</a>、<a href="/blog/backend/06-reliability/migration-safety/" data-link-title="6.11 Migration Safety 與 DB Rollout" data-link-desc="把 schema migration 從一次性事件變成可逆、可漸進的 rollout 流程">Migration Safety</a>、<a href="/blog/backend/08-incident-response/incident-decision-log/" data-link-title="8.19 Incident Decision Log" data-link-desc="把事中假設、決策、證據、回退條件與責任人留下可復盤紀錄">Incident Decision Log</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Local Lab Quickstart</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/local-lab-quickstart/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/local-lab-quickstart/</guid><description>&lt;p>PostgreSQL local lab quickstart 的核心責任是建立後續 connection、migration、backup 與 failover 演練共用的本地環境。這個 lab 提供一個可重建的 PostgreSQL instance、app-facing user、baseline schema、seed data 與 basic evidence。&lt;/p>
&lt;p>本文的驗收標準是：你能啟動本地 PostgreSQL，套用 schema，跑 sample workload，取得 &lt;code>pg_stat_activity&lt;/code> / &lt;code>pg_stat_database&lt;/code> snapshot，最後 teardown 並重建。&lt;/p>
&lt;h2 id="docker-compose">Docker Compose&lt;/h2>
&lt;p>Docker Compose 的核心責任是讓 lab 環境可重建。建立 &lt;code>docker-compose.yml&lt;/code>：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="nt">services&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">postgres&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">image&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">postgres:16&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">environment&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">POSTGRES_USER&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lab_admin&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">POSTGRES_PASSWORD&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">lab_admin_pw&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">POSTGRES_DB&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">appdb&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ports&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;54329:5432&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">command&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;postgres&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;-c&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;log_min_duration_statement=100&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;-c&amp;#34;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>- &lt;span class="s2">&amp;#34;shared_preload_libraries=pg_stat_statements&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>啟動：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">docker compose up -d
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="nb">export&lt;/span> &lt;span class="nv">DATABASE_URL&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;postgres://lab_admin:lab_admin_pw@localhost:54329/appdb?sslmode=disable&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="baseline-schema">Baseline Schema&lt;/h2>
&lt;p>Baseline schema 的核心責任是建立可測 transaction、index、lock 與 migration 的資料模型。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;SQL&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="s">CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="s">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="s">CREATE TABLE accounts (
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="s"> id bigserial PRIMARY KEY,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="s"> tenant_id uuid NOT NULL,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="s"> owner_name text NOT NULL,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="s"> status text NOT NULL CHECK (status IN (&amp;#39;active&amp;#39;, &amp;#39;closed&amp;#39;)),
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="s"> created_at timestamptz NOT NULL DEFAULT now()
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="s">);
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="s">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="s">CREATE TABLE ledger_entries (
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="s"> id bigserial PRIMARY KEY,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="s"> account_id bigint NOT NULL REFERENCES accounts(id),
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="s"> amount_cents bigint NOT NULL CHECK (amount_cents &amp;lt;&amp;gt; 0),
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="s"> idempotency_key text NOT NULL UNIQUE,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="s"> created_at timestamptz NOT NULL DEFAULT now()
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">&lt;span class="s">);
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">&lt;span class="s">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">&lt;span class="s">CREATE INDEX idx_ledger_entries_account_created
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl">&lt;span class="s">ON ledger_entries(account_id, created_at DESC);
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">&lt;span class="s">SQL&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>這組 schema 後續可用於 migration、lock、PITR 與 pool lab。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL local lab quickstart 的核心責任是建立後續 connection、migration、backup 與 failover 演練共用的本地環境。這個 lab 提供一個可重建的 PostgreSQL instance、app-facing user、baseline schema、seed data 與 basic evidence。</p>
<p>本文的驗收標準是：你能啟動本地 PostgreSQL，套用 schema，跑 sample workload，取得 <code>pg_stat_activity</code> / <code>pg_stat_database</code> snapshot，最後 teardown 並重建。</p>
<h2 id="docker-compose">Docker Compose</h2>
<p>Docker Compose 的核心責任是讓 lab 環境可重建。建立 <code>docker-compose.yml</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">  </span><span class="nt">postgres</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">postgres:16</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">      </span><span class="nt">POSTGRES_USER</span><span class="p">:</span><span class="w"> </span><span class="l">lab_admin</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">      </span><span class="nt">POSTGRES_PASSWORD</span><span class="p">:</span><span class="w"> </span><span class="l">lab_admin_pw</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">      </span><span class="nt">POSTGRES_DB</span><span class="p">:</span><span class="w"> </span><span class="l">appdb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;54329:5432&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;postgres&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;-c&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;log_min_duration_statement=100&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;-c&#34;</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">      </span>- <span class="s2">&#34;shared_preload_libraries=pg_stat_statements&#34;</span></span></span></code></pre></div><p>啟動：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">docker compose up -d
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">export</span> <span class="nv">DATABASE_URL</span><span class="o">=</span><span class="s2">&#34;postgres://lab_admin:lab_admin_pw@localhost:54329/appdb?sslmode=disable&#34;</span></span></span></code></pre></div><h2 id="baseline-schema">Baseline Schema</h2>
<p>Baseline schema 的核心責任是建立可測 transaction、index、lock 與 migration 的資料模型。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="s">CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s">CREATE TABLE accounts (
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s">  id bigserial PRIMARY KEY,
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s">  tenant_id uuid NOT NULL,
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s">  owner_name text NOT NULL,
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s">  status text NOT NULL CHECK (status IN (&#39;active&#39;, &#39;closed&#39;)),
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s">  created_at timestamptz NOT NULL DEFAULT now()
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s">);
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s">CREATE TABLE ledger_entries (
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s">  id bigserial PRIMARY KEY,
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s">  account_id bigint NOT NULL REFERENCES accounts(id),
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s">  amount_cents bigint NOT NULL CHECK (amount_cents &lt;&gt; 0),
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s">  idempotency_key text NOT NULL UNIQUE,
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="s">  created_at timestamptz NOT NULL DEFAULT now()
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="s">);
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="s">CREATE INDEX idx_ledger_entries_account_created
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="s">ON ledger_entries(account_id, created_at DESC);
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>這組 schema 後續可用於 migration、lock、PITR 與 pool lab。</p>
<h2 id="seed-and-workload">Seed and Workload</h2>
<p>Seed and workload 的核心責任是產生可觀察的資料與查詢。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="s">INSERT INTO accounts(tenant_id, owner_name, status)
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s">VALUES
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s">  (&#39;00000000-0000-0000-0000-000000000001&#39;, &#39;Ada&#39;, &#39;active&#39;),
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s">  (&#39;00000000-0000-0000-0000-000000000002&#39;, &#39;Lin&#39;, &#39;active&#39;);
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s">INSERT INTO ledger_entries(account_id, amount_cents, idempotency_key)
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s">SELECT 1, 100, &#39;seed-ada-&#39; || g
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s">FROM generate_series(1, 100) AS g;
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s">SELECT a.owner_name, SUM(l.amount_cents) AS balance_cents
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s">FROM accounts a
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s">JOIN ledger_entries l ON l.account_id = a.id
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s">GROUP BY a.owner_name;
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>Sample workload 要保留 SQL 與輸出，作為後續 migration / restore validation 的 baseline。</p>
<h2 id="basic-evidence">Basic Evidence</h2>
<p>Basic evidence 的核心責任是把 lab 狀態保存成可比較 snapshot。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="s">SELECT current_database(), current_user, version();
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s">SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY relname;
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s">SELECT datname, numbackends, xact_commit, xact_rollback
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s">FROM pg_stat_database
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s">WHERE datname = current_database();
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s">SELECT pid, state, wait_event_type, query
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s">FROM pg_stat_activity
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s">WHERE datname = current_database();
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>這些查詢是 PostgreSQL lab 的最小 evidence。正式服務要再加入 slow query、lock wait、replica lag、backup status 與 pooler metrics。</p>
<h2 id="teardown">Teardown</h2>
<p>Teardown 的核心責任是讓 lab 可重跑。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">docker compose down -v</span></span></code></pre></div><p>重建後應能重新套用 schema 與 seed。若 lab 需要跨章節沿用資料，先用 <code>pg_dump</code> 保存 fixture，再 teardown。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>完成本篇後，連線壓力進入 <a href="../connection-pool-lab/">Connection Pool Lab</a>；migration evidence 進入 <a href="../schema-migration-evidence-lab/">Schema Migration Evidence Lab</a>；backup / PITR 進入 <a href="../pitr-restore-drill/">PITR Restore Drill</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Logical Decoding Plugins</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-decoding-plugins/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-decoding-plugins/</guid><description>&lt;p>PostgreSQL logical decoding plugins 的核心責任是把 WAL 中的變更轉成外部消費者可理解的事件格式。PostgreSQL 官方 logical decoding 文件說明，logical decoding 透過 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/replication-slot/" data-link-title="Replication Slot" data-link-desc="說明邏輯複製如何用 slot 追蹤消費進度，並對來源端造成保留壓力">replication slot&lt;/a> 將 WAL 變更解碼成 plugin output；output plugin 決定外部看到的是 PostgreSQL protocol、JSON、測試文字或自訂格式。&lt;/p>
&lt;p>本文的判讀錨點是：plugin 選型是 CDC contract 決策。它影響 schema evolution、事件欄位、delete 表示、transaction boundary、consumer compatibility、slot lag 與故障復原。&lt;/p>
&lt;h2 id="plugin-boundary">Plugin Boundary&lt;/h2>
&lt;p>Plugin boundary 的核心責任是定義 database 變更如何離開 PostgreSQL。常見選項包含內建 &lt;code>pgoutput&lt;/code>、測試用 &lt;code>test_decoding&lt;/code>、JSON-oriented plugin，以及 Debezium connector 支援的 plugin / protocol。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Plugin / path&lt;/th>
 &lt;th>主要責任&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>pgoutput&lt;/code>&lt;/td>
 &lt;td>PostgreSQL logical replication protocol&lt;/td>
 &lt;td>built-in logical replication、Debezium 常見路線&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>test_decoding&lt;/code>&lt;/td>
 &lt;td>人類可讀測試 output&lt;/td>
 &lt;td>lab、debug、教育用途&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>wal2json&lt;/code>&lt;/td>
 &lt;td>JSON change event&lt;/td>
 &lt;td>自訂 consumer、legacy CDC&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>decoderbufs&lt;/td>
 &lt;td>Protobuf event&lt;/td>
 &lt;td>強 schema contract 的 pipeline&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Native subscription&lt;/td>
 &lt;td>DB-to-DB replication&lt;/td>
 &lt;td>PostgreSQL 之間 table replication&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;code>pgoutput&lt;/code> 適合標準化 CDC。它與 publication / subscription model 對齊，能保留 PostgreSQL logical replication 的主路線。&lt;/p>
&lt;p>&lt;code>test_decoding&lt;/code> 適合教學與排錯。它讓人看到 transaction 裡發生的 insert / update / delete，但它的定位是測試與理解，不應作為正式 event contract。&lt;/p>
&lt;h2 id="replication-slot-responsibility">Replication Slot Responsibility&lt;/h2>
&lt;p>Replication slot responsibility 的核心責任是保護 consumer 進度，同時管理 WAL retention。Logical slot 會讓 PostgreSQL 保留尚未被 consumer 確認的 WAL；consumer 停住時，slot lag 會轉成 disk pressure。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Signal&lt;/th>
 &lt;th>意義&lt;/th>
 &lt;th>操作反應&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>confirmed_flush_lsn&lt;/code>&lt;/td>
 &lt;td>consumer 已確認的位置&lt;/td>
 &lt;td>用來判斷 CDC 進度&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>retained WAL size&lt;/td>
 &lt;td>slot 造成的 WAL 保留量&lt;/td>
 &lt;td>alert、調整 consumer、drop / advance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>inactive slot&lt;/td>
 &lt;td>consumer 離線&lt;/td>
 &lt;td>檢查 connector、暫停 release&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>publication table diff&lt;/td>
 &lt;td>CDC scope 與 schema 不一致&lt;/td>
 &lt;td>review publication / table ownership&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Slot 是 production resource。每個 logical slot 都要有 owner、consumer、SLO、drop condition、backfill plan 與 alert。&lt;/p>
&lt;h2 id="event-contract">Event Contract&lt;/h2>
&lt;p>Event contract 的核心責任是讓 downstream 知道每個變更代表什麼。CDC 事件至少要說明 key、before/after image、operation、commit timestamp、transaction ordering、schema version 與 delete representation。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL logical decoding plugins 的核心責任是把 WAL 中的變更轉成外部消費者可理解的事件格式。PostgreSQL 官方 logical decoding 文件說明，logical decoding 透過 <a href="/blog/backend/knowledge-cards/replication-slot/" data-link-title="Replication Slot" data-link-desc="說明邏輯複製如何用 slot 追蹤消費進度，並對來源端造成保留壓力">replication slot</a> 將 WAL 變更解碼成 plugin output；output plugin 決定外部看到的是 PostgreSQL protocol、JSON、測試文字或自訂格式。</p>
<p>本文的判讀錨點是：plugin 選型是 CDC contract 決策。它影響 schema evolution、事件欄位、delete 表示、transaction boundary、consumer compatibility、slot lag 與故障復原。</p>
<h2 id="plugin-boundary">Plugin Boundary</h2>
<p>Plugin boundary 的核心責任是定義 database 變更如何離開 PostgreSQL。常見選項包含內建 <code>pgoutput</code>、測試用 <code>test_decoding</code>、JSON-oriented plugin，以及 Debezium connector 支援的 plugin / protocol。</p>
<table>
  <thead>
      <tr>
          <th>Plugin / path</th>
          <th>主要責任</th>
          <th>適合情境</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pgoutput</code></td>
          <td>PostgreSQL logical replication protocol</td>
          <td>built-in logical replication、Debezium 常見路線</td>
      </tr>
      <tr>
          <td><code>test_decoding</code></td>
          <td>人類可讀測試 output</td>
          <td>lab、debug、教育用途</td>
      </tr>
      <tr>
          <td><code>wal2json</code></td>
          <td>JSON change event</td>
          <td>自訂 consumer、legacy CDC</td>
      </tr>
      <tr>
          <td>decoderbufs</td>
          <td>Protobuf event</td>
          <td>強 schema contract 的 pipeline</td>
      </tr>
      <tr>
          <td>Native subscription</td>
          <td>DB-to-DB replication</td>
          <td>PostgreSQL 之間 table replication</td>
      </tr>
  </tbody>
</table>
<p><code>pgoutput</code> 適合標準化 CDC。它與 publication / subscription model 對齊，能保留 PostgreSQL logical replication 的主路線。</p>
<p><code>test_decoding</code> 適合教學與排錯。它讓人看到 transaction 裡發生的 insert / update / delete，但它的定位是測試與理解，不應作為正式 event contract。</p>
<h2 id="replication-slot-responsibility">Replication Slot Responsibility</h2>
<p>Replication slot responsibility 的核心責任是保護 consumer 進度，同時管理 WAL retention。Logical slot 會讓 PostgreSQL 保留尚未被 consumer 確認的 WAL；consumer 停住時，slot lag 會轉成 disk pressure。</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>意義</th>
          <th>操作反應</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>confirmed_flush_lsn</code></td>
          <td>consumer 已確認的位置</td>
          <td>用來判斷 CDC 進度</td>
      </tr>
      <tr>
          <td>retained WAL size</td>
          <td>slot 造成的 WAL 保留量</td>
          <td>alert、調整 consumer、drop / advance</td>
      </tr>
      <tr>
          <td>inactive slot</td>
          <td>consumer 離線</td>
          <td>檢查 connector、暫停 release</td>
      </tr>
      <tr>
          <td>publication table diff</td>
          <td>CDC scope 與 schema 不一致</td>
          <td>review publication / table ownership</td>
      </tr>
  </tbody>
</table>
<p>Slot 是 production resource。每個 logical slot 都要有 owner、consumer、SLO、drop condition、backfill plan 與 alert。</p>
<h2 id="event-contract">Event Contract</h2>
<p>Event contract 的核心責任是讓 downstream 知道每個變更代表什麼。CDC 事件至少要說明 key、before/after image、operation、commit timestamp、transaction ordering、schema version 與 delete representation。</p>
<table>
  <thead>
      <tr>
          <th>Contract 面向</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Key</td>
          <td>table 是否有 replica identity / primary key</td>
      </tr>
      <tr>
          <td>Update image</td>
          <td>是否需要 before value</td>
      </tr>
      <tr>
          <td>Delete</td>
          <td>tombstone、key-only delete、soft delete</td>
      </tr>
      <tr>
          <td>Ordering</td>
          <td>transaction order 是否要保留</td>
      </tr>
      <tr>
          <td>Schema evolution</td>
          <td>新欄位、rename、drop 欄位如何通知</td>
      </tr>
      <tr>
          <td>Backfill</td>
          <td>initial snapshot 與 streaming 如何銜接</td>
      </tr>
  </tbody>
</table>
<p><a href="/blog/backend/knowledge-cards/replica-identity/" data-link-title="Replica Identity" data-link-desc="說明 row-level 變更事件如何帶穩定 key，讓下游能正確套用 update 與 delete">Replica identity</a> 是 CDC 的核心設定。沒有穩定 key 的 table 會讓 update / delete event 難以被 downstream 正確套用；這類 table 要先補 primary key 或明確設定 replica identity。</p>
<h2 id="connector-patterns">Connector Patterns</h2>
<p>Connector patterns 的核心責任是把 plugin output 接到實際 pipeline。Debezium、custom consumer、DB native subscription 的維運責任不同。</p>
<table>
  <thead>
      <tr>
          <th>Pattern</th>
          <th>優點</th>
          <th>風險</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Debezium connector</td>
          <td>成熟 snapshot + streaming workflow</td>
          <td>connector state、Kafka / offset operation</td>
      </tr>
      <tr>
          <td>Native subscription</td>
          <td>PostgreSQL 原生 DB-to-DB</td>
          <td>schema drift、DDL coordination</td>
      </tr>
      <tr>
          <td>Custom consumer</td>
          <td>可客製 event contract</td>
          <td>slot management 與 error handling 自行負責</td>
      </tr>
      <tr>
          <td>Batch export + CDC</td>
          <td>backfill 與 streaming 分開</td>
          <td>cutover LSN 與 duplication handling</td>
      </tr>
  </tbody>
</table>
<p>Connector 要定義 backfill 與 streaming 的接點。最常見的事故是 snapshot 還沒完成就開始消費、或 cutover LSN 沒有被記錄，導致 downstream 重複或漏資料。</p>
<h2 id="failure-modes">Failure Modes</h2>
<p>Failure modes 的核心責任是把 CDC 事故分成 database、connector、schema 與 downstream 四層。</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>判讀訊號</th>
          <th>第一反應</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Slot lag growth</td>
          <td>retained WAL 持續增加</td>
          <td>暫停重型寫入、修 connector、評估 drop</td>
      </tr>
      <tr>
          <td>Schema break</td>
          <td>connector 解析失敗</td>
          <td>停止 DDL rollout、補 schema evolution</td>
      </tr>
      <tr>
          <td>Missing key</td>
          <td>update / delete 缺少可套用 key</td>
          <td>修 replica identity / key contract</td>
      </tr>
      <tr>
          <td>Duplicate event</td>
          <td>consumer 重啟或 offset 回退</td>
          <td>idempotent consumer</td>
      </tr>
      <tr>
          <td>Downstream slow</td>
          <td>Kafka / sink lag 增加</td>
          <td>擴 sink、調 batch、保護 slot</td>
      </tr>
  </tbody>
</table>
<p>Slot lag 是最高優先訊號，因為它會占用 PostgreSQL WAL storage。Runbook 要有「何時暫停 producer」、「何時 drop slot」、「如何重建 snapshot」的明確門檻。</p>
<h2 id="selection-checklist">Selection Checklist</h2>
<p>Selection checklist 的核心責任是讓 plugin 選型可審查。</p>
<ol>
<li>Downstream 需要 DB-to-DB replication、JSON event、Protobuf event 還是 connector-managed event。</li>
<li>每張 table 是否有 stable key 與 replica identity。</li>
<li>Initial snapshot 如何銜接 streaming。</li>
<li>Schema evolution 如何通知 consumer。</li>
<li>Slot lag、connector lag、sink lag 如何告警。</li>
<li>Consumer 是否 idempotent。</li>
<li>Disaster recovery 後 slot / offset 如何重建。</li>
</ol>
<p>完成這份 checklist 後，再決定 plugin 與 connector。CDC 的成功標準是 downstream 能長期維持正確資料，而不只是成功建立 slot。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Logical decoding plugins 完成後，實作 CDC pipeline 讀 <a href="../logical-replication-debezium/">Logical Replication / Debezium</a>；slot 維運讀 <a href="../replication-slot-management/">Replication Slot Management</a>；跨資料庫搬遷讀 <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL pg_partman Advanced</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pg-partman-advanced/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pg-partman-advanced/</guid><description>&lt;p>PostgreSQL pg_partman advanced 的核心責任是把 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/table-partitioning/" data-link-title="Table Partitioning" data-link-desc="說明單一資料庫內如何把大表拆成多個分區，並由查詢規劃器只掃相關片段">declarative partitioning&lt;/a> 的日常維護自動化。pg_partman 可以協助建立未來 partition、管理 retention、執行 maintenance job，讓 time-based 或 serial-based partition 不再依賴人工 DDL。&lt;/p>
&lt;p>本文的判讀錨點是：pg_partman 解決的是 partition lifecycle operation，而非 partition strategy 本身。Partition key、query pattern、retention、index、foreign key 與 migration 仍要先在 &lt;a href="../declarative-partitioning/">Declarative Partitioning&lt;/a> 與 &lt;a href="../partition-redesign/">Partition Redesign&lt;/a> 做對。&lt;/p>
&lt;h2 id="responsibility-boundary">Responsibility Boundary&lt;/h2>
&lt;p>Responsibility boundary 的核心責任是區分 PostgreSQL 原生 partition 和 pg_partman。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>層級&lt;/th>
 &lt;th>責任&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PostgreSQL declarative partitioning&lt;/td>
 &lt;td>partition table、constraint、planner pruning&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>pg_partman&lt;/td>
 &lt;td>future partition premake、retention、maintenance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Scheduler / job runner&lt;/td>
 &lt;td>定期執行 maintenance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DBA / platform&lt;/td>
 &lt;td>monitoring、backup、DDL review&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application&lt;/td>
 &lt;td>query pattern、partition key 使用&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>pg_partman 的價值在於減少重複 DDL。它不會替 application 選出正確 partition key，也不會自動修復跨 partition query 設計。&lt;/p>
&lt;h2 id="core-concepts">Core Concepts&lt;/h2>
&lt;p>Core concepts 的核心責任是理解 pg_partman operation vocabulary。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>概念&lt;/th>
 &lt;th>意義&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Parent table&lt;/td>
 &lt;td>partitioned table 的入口&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Child table&lt;/td>
 &lt;td>實際存放資料的 partition&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Premake&lt;/td>
 &lt;td>預先建立未來 partition&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Retention&lt;/td>
 &lt;td>自動 detach / drop 舊 partition&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Maintenance&lt;/td>
 &lt;td>建立新 partition、處理 retention 的 job&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Template&lt;/td>
 &lt;td>child partition 繼承 index / constraint 的模板&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Premake 是防止 insert 打到不存在 partition 的保護。若 partition 建立落後於時間，application insert 會失敗或落到 default partition；production 要對 future partition count 設 alert。&lt;/p>
&lt;p>Retention 是資料生命週期操作。Drop 舊 partition 速度快，但要先確認 legal retention、backup、analytics dependency 與 downstream CDC。&lt;/p>
&lt;h2 id="setup-pattern">Setup Pattern&lt;/h2>
&lt;p>Setup pattern 的核心責任是把 pg_partman 導入流程放進 migration gate。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">EXISTS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_partman&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">bigserial&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">tenant_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">uuid&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">timestamptz&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">payload&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">jsonb&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">RANGE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>實際建立 partman config 要依 pg_partman 版本與 provider 支援文件執行。Managed PostgreSQL 可能限制 extension version、background worker 或 scheduler，因此 setup 前要先確認 provider boundary。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL pg_partman advanced 的核心責任是把 <a href="/blog/backend/knowledge-cards/table-partitioning/" data-link-title="Table Partitioning" data-link-desc="說明單一資料庫內如何把大表拆成多個分區，並由查詢規劃器只掃相關片段">declarative partitioning</a> 的日常維護自動化。pg_partman 可以協助建立未來 partition、管理 retention、執行 maintenance job，讓 time-based 或 serial-based partition 不再依賴人工 DDL。</p>
<p>本文的判讀錨點是：pg_partman 解決的是 partition lifecycle operation，而非 partition strategy 本身。Partition key、query pattern、retention、index、foreign key 與 migration 仍要先在 <a href="../declarative-partitioning/">Declarative Partitioning</a> 與 <a href="../partition-redesign/">Partition Redesign</a> 做對。</p>
<h2 id="responsibility-boundary">Responsibility Boundary</h2>
<p>Responsibility boundary 的核心責任是區分 PostgreSQL 原生 partition 和 pg_partman。</p>
<table>
  <thead>
      <tr>
          <th>層級</th>
          <th>責任</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PostgreSQL declarative partitioning</td>
          <td>partition table、constraint、planner pruning</td>
      </tr>
      <tr>
          <td>pg_partman</td>
          <td>future partition premake、retention、maintenance</td>
      </tr>
      <tr>
          <td>Scheduler / job runner</td>
          <td>定期執行 maintenance</td>
      </tr>
      <tr>
          <td>DBA / platform</td>
          <td>monitoring、backup、DDL review</td>
      </tr>
      <tr>
          <td>Application</td>
          <td>query pattern、partition key 使用</td>
      </tr>
  </tbody>
</table>
<p>pg_partman 的價值在於減少重複 DDL。它不會替 application 選出正確 partition key，也不會自動修復跨 partition query 設計。</p>
<h2 id="core-concepts">Core Concepts</h2>
<p>Core concepts 的核心責任是理解 pg_partman operation vocabulary。</p>
<table>
  <thead>
      <tr>
          <th>概念</th>
          <th>意義</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Parent table</td>
          <td>partitioned table 的入口</td>
      </tr>
      <tr>
          <td>Child table</td>
          <td>實際存放資料的 partition</td>
      </tr>
      <tr>
          <td>Premake</td>
          <td>預先建立未來 partition</td>
      </tr>
      <tr>
          <td>Retention</td>
          <td>自動 detach / drop 舊 partition</td>
      </tr>
      <tr>
          <td>Maintenance</td>
          <td>建立新 partition、處理 retention 的 job</td>
      </tr>
      <tr>
          <td>Template</td>
          <td>child partition 繼承 index / constraint 的模板</td>
      </tr>
  </tbody>
</table>
<p>Premake 是防止 insert 打到不存在 partition 的保護。若 partition 建立落後於時間，application insert 會失敗或落到 default partition；production 要對 future partition count 設 alert。</p>
<p>Retention 是資料生命週期操作。Drop 舊 partition 速度快，但要先確認 legal retention、backup、analytics dependency 與 downstream CDC。</p>
<h2 id="setup-pattern">Setup Pattern</h2>
<p>Setup pattern 的核心責任是把 pg_partman 導入流程放進 migration gate。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">pg_partman</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="n">bigserial</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">  </span><span class="n">tenant_id</span><span class="w"> </span><span class="n">uuid</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="n">created_at</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">);</span></span></span></code></pre></div><p>實際建立 partman config 要依 pg_partman 版本與 provider 支援文件執行。Managed PostgreSQL 可能限制 extension version、background worker 或 scheduler，因此 setup 前要先確認 provider boundary。</p>
<p>最小 setup evidence：</p>
<ol>
<li>Extension version。</li>
<li>Parent table DDL。</li>
<li>Partition key 與 interval。</li>
<li>Premake 數量。</li>
<li>Retention policy。</li>
<li>Maintenance job schedule。</li>
<li>Test insert 到 current / future partition。</li>
</ol>
<h2 id="maintenance-runbook">Maintenance Runbook</h2>
<p>Maintenance runbook 的核心責任是讓 partition lifecycle 可觀測。</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>意義</th>
          <th>反應</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>future partition count</td>
          <td>premake 是否足夠</td>
          <td>手動跑 maintenance、修 scheduler</td>
      </tr>
      <tr>
          <td>default partition rows</td>
          <td>routing 失敗或 partition 缺漏</td>
          <td>建 partition、搬資料、修 app timestamp</td>
      </tr>
      <tr>
          <td>old partition count</td>
          <td>retention 是否執行</td>
          <td>檢查 policy、legal hold、job error</td>
      </tr>
      <tr>
          <td>maintenance duration</td>
          <td>DDL / lock / catalog 壓力</td>
          <td>調整 schedule、拆 table</td>
      </tr>
      <tr>
          <td>index build time</td>
          <td>child index 建立成本</td>
          <td>template / concurrent strategy review</td>
      </tr>
  </tbody>
</table>
<p>Maintenance job 要有 owner。Cron、pg_cron、background worker、Kubernetes job 或 managed scheduler 都可以；重點是 job failure 會告警，並且有人處理。</p>
<h2 id="migration-and-backfill">Migration and Backfill</h2>
<p>Migration and backfill 的核心責任是把既有大表轉成 partman-managed partition。這通常比新表導入更高風險。</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Audit</td>
          <td>table size、query pattern、write rate</td>
      </tr>
      <tr>
          <td>New schema</td>
          <td>parent table、child partition、index</td>
      </tr>
      <tr>
          <td>Backfill</td>
          <td>batch size、lag、lock、checksum</td>
      </tr>
      <tr>
          <td>Dual write</td>
          <td>app compatibility</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>rename / view / routing switch</td>
      </tr>
      <tr>
          <td>Cleanup</td>
          <td>old table retention、rollback</td>
      </tr>
  </tbody>
</table>
<p>Backfill 要控制 WAL、replica lag、autovacuum、index bloat 與 lock。大型 table 應先用 shadow table 或 partition redesign playbook，避開 peak traffic 直接重建。</p>
<h2 id="failure-modes">Failure Modes</h2>
<p>Failure modes 的核心責任是列出 pg_partman 常見事故。</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>判讀訊號</th>
          <th>修正方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>未建立未來 partition</td>
          <td>insert 失敗或 default partition 增長</td>
          <td>補 partition、修 maintenance schedule</td>
      </tr>
      <tr>
          <td>retention drop 過早</td>
          <td>查詢缺歷史資料</td>
          <td>restore backup、調 policy、legal review</td>
      </tr>
      <tr>
          <td>managed provider 不支援</td>
          <td>extension / worker 限制</td>
          <td>改 manual partition job 或 provider</td>
      </tr>
      <tr>
          <td>index / constraint 漂移</td>
          <td>child partition schema 不一致</td>
          <td>template review、schema diff</td>
      </tr>
      <tr>
          <td>planner pruning 失效</td>
          <td>query 未帶 partition key</td>
          <td>query rewrite、index review</td>
      </tr>
  </tbody>
</table>
<p>pg_partman 事故通常是 lifecycle 事故。Runbook 要先看 maintenance job，再看 partition metadata 與 application query。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>pg_partman advanced 完成後，partition 設計讀 <a href="../declarative-partitioning/">Declarative Partitioning</a>；重排策略讀 <a href="../partition-redesign/">Partition Redesign</a>；migration gate 讀 <a href="../online-schema-change/">Online Schema Change</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL PITR Restore Drill</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/pitr-restore-drill/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/pitr-restore-drill/</guid><description>&lt;p>PostgreSQL PITR restore drill 的核心責任是證明 backup 可以還原到指定時間點。這篇承接 &lt;a href="../../pitr-wal-archiving/">PITR + WAL Archiving&lt;/a>，把備份從存在狀態推進到可恢復證據。&lt;/p>
&lt;p>本文的驗收標準是：你能記錄 base backup 時間、target time、restore duration、validation query 與 RPO / RTO note。實際命令會依 pgBackRest、Barman、cloud snapshot 或 managed service 而變；本文提供 vendor-neutral drill frame。&lt;/p>
&lt;h2 id="prepare-recovery-point">Prepare Recovery Point&lt;/h2>
&lt;p>Prepare recovery point 的核心責任是建立可辨識 transaction。先寫入一筆 marker，記錄時間。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;SQL&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="s">CREATE TABLE IF NOT EXISTS restore_markers (
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="s"> id bigserial PRIMARY KEY,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="s"> marker text NOT NULL,
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="s"> created_at timestamptz NOT NULL DEFAULT clock_timestamp()
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="s">);
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="s">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="s">INSERT INTO restore_markers(marker) VALUES (&amp;#39;before-bad-change&amp;#39;);
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="s">SELECT id, marker, created_at FROM restore_markers ORDER BY id DESC LIMIT 1;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="s">SQL&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>把 &lt;code>created_at&lt;/code> 記為 target time。正式 drill 要用 UTC，並記錄 timezone、operator、backup set 與 WAL archive status。&lt;/p>
&lt;h2 id="create-bad-change">Create Bad Change&lt;/h2>
&lt;p>Create bad change 的核心責任是模擬需要 PITR 的錯誤。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;SQL&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="s">INSERT INTO restore_markers(marker) VALUES (&amp;#39;bad-change-after-target&amp;#39;);
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="s">UPDATE accounts SET status = &amp;#39;closed&amp;#39;;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="s">SELECT status, count(*) FROM accounts GROUP BY status;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="s">SQL&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>這一步在 lab 中代表誤操作。Production 事故中，bad change 可能是誤刪、錯誤 batch、壞 migration 或 application bug。&lt;/p>
&lt;h2 id="restore-workflow">Restore Workflow&lt;/h2>
&lt;p>Restore workflow 的核心責任是把 backup tool 的操作轉成固定 evidence。不同工具命令不同，但流程一致：&lt;/p>
&lt;ol>
&lt;li>選定 base backup。&lt;/li>
&lt;li>設定 recovery target time。&lt;/li>
&lt;li>套用 WAL 到 target time。&lt;/li>
&lt;li>Promote restored instance。&lt;/li>
&lt;li>跑 validation query。&lt;/li>
&lt;li>啟動 application smoke test。&lt;/li>
&lt;/ol>
&lt;p>Example pseudo-runbook：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">restore_target_time = 2026-05-21T10:15:30Z
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">base_backup = latest backup before target
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">wal_archive = available through target
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">restore_path = isolated environment&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Restore 必須在隔離環境先完成。直接覆蓋 production 會讓 evidence 與 rollback 空間消失。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL PITR restore drill 的核心責任是證明 backup 可以還原到指定時間點。這篇承接 <a href="../../pitr-wal-archiving/">PITR + WAL Archiving</a>，把備份從存在狀態推進到可恢復證據。</p>
<p>本文的驗收標準是：你能記錄 base backup 時間、target time、restore duration、validation query 與 RPO / RTO note。實際命令會依 pgBackRest、Barman、cloud snapshot 或 managed service 而變；本文提供 vendor-neutral drill frame。</p>
<h2 id="prepare-recovery-point">Prepare Recovery Point</h2>
<p>Prepare recovery point 的核心責任是建立可辨識 transaction。先寫入一筆 marker，記錄時間。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="s">CREATE TABLE IF NOT EXISTS restore_markers (
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="s">  id bigserial PRIMARY KEY,
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="s">  marker text NOT NULL,
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="s">  created_at timestamptz NOT NULL DEFAULT clock_timestamp()
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="s">);
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="s">INSERT INTO restore_markers(marker) VALUES (&#39;before-bad-change&#39;);
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="s">SELECT id, marker, created_at FROM restore_markers ORDER BY id DESC LIMIT 1;
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>把 <code>created_at</code> 記為 target time。正式 drill 要用 UTC，並記錄 timezone、operator、backup set 與 WAL archive status。</p>
<h2 id="create-bad-change">Create Bad Change</h2>
<p>Create bad change 的核心責任是模擬需要 PITR 的錯誤。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s">INSERT INTO restore_markers(marker) VALUES (&#39;bad-change-after-target&#39;);
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s">UPDATE accounts SET status = &#39;closed&#39;;
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s">SELECT status, count(*) FROM accounts GROUP BY status;
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>這一步在 lab 中代表誤操作。Production 事故中，bad change 可能是誤刪、錯誤 batch、壞 migration 或 application bug。</p>
<h2 id="restore-workflow">Restore Workflow</h2>
<p>Restore workflow 的核心責任是把 backup tool 的操作轉成固定 evidence。不同工具命令不同，但流程一致：</p>
<ol>
<li>選定 base backup。</li>
<li>設定 recovery target time。</li>
<li>套用 WAL 到 target time。</li>
<li>Promote restored instance。</li>
<li>跑 validation query。</li>
<li>啟動 application smoke test。</li>
</ol>
<p>Example pseudo-runbook：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">restore_target_time = 2026-05-21T10:15:30Z
</span></span><span class="line"><span class="ln">2</span><span class="cl">base_backup = latest backup before target
</span></span><span class="line"><span class="ln">3</span><span class="cl">wal_archive = available through target
</span></span><span class="line"><span class="ln">4</span><span class="cl">restore_path = isolated environment</span></span></code></pre></div><p>Restore 必須在隔離環境先完成。直接覆蓋 production 會讓 evidence 與 rollback 空間消失。</p>
<h2 id="validation-query">Validation Query</h2>
<p>Validation query 的核心責任是確認 restore 到正確時間點。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$RESTORED_DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s">SELECT marker, created_at
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s">FROM restore_markers
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s">ORDER BY id;
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s">SELECT status, count(*)
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="s">FROM accounts
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="s">GROUP BY status;
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>預期結果是存在 <code>before-bad-change</code>，且 <code>bad-change-after-target</code> 尚未出現。<code>accounts</code> 狀態應維持 target time 前的分布。</p>
<h2 id="rpo--rto-evidence">RPO / RTO Evidence</h2>
<p>RPO / RTO evidence 的核心責任是把 drill 結果轉成服務語言。</p>
<table>
  <thead>
      <tr>
          <th>Evidence</th>
          <th>記錄內容</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Backup timestamp</td>
          <td>使用哪份 base backup</td>
      </tr>
      <tr>
          <td>Target time</td>
          <td>要恢復到哪一秒</td>
      </tr>
      <tr>
          <td>WAL availability</td>
          <td>target time 前後 WAL 是否完整</td>
      </tr>
      <tr>
          <td>Restore duration</td>
          <td>從開始 restore 到 validation 成功</td>
      </tr>
      <tr>
          <td>Data gap</td>
          <td>target time 後需補償的 transaction</td>
      </tr>
      <tr>
          <td>Smoke test</td>
          <td>application 核心 workflow 是否可用</td>
      </tr>
  </tbody>
</table>
<p>PITR 的成功標準是資料與 application 都可用。只讓 PostgreSQL 啟動成功，還不足以交付服務。</p>
<h2 id="drill-retrospective">Drill Retrospective</h2>
<p>Drill retrospective 的核心責任是把演練缺口轉成下一步。</p>
<p>常見缺口：</p>
<ol>
<li>找不到正確 base backup。</li>
<li>WAL archive 缺段。</li>
<li>target time timezone 混亂。</li>
<li>Restore 太慢，超過 RTO。</li>
<li>Application secret / config 指不到 restored DB。</li>
<li>Validation query 缺少 business invariant。</li>
</ol>
<p>完成本篇後，跨區恢復讀 <a href="../../cross-region-dr/">Cross-region DR</a>；備份策略讀 <a href="../../pitr-wal-archiving/">PITR + WAL Archiving</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Schema Migration Evidence Lab</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/schema-migration-evidence-lab/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/hands-on/schema-migration-evidence-lab/</guid><description>&lt;p>PostgreSQL schema migration evidence lab 的核心責任是把 schema change 轉成 release gate 可使用的 evidence。這篇承接 &lt;a href="../../online-schema-change/">Online Schema Change&lt;/a> 與 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook&lt;/a>。&lt;/p>
&lt;p>本文的驗收標準是：你能設計 expand migration、量測 lock、跑 backfill validation、建立 contract migration 的 fail-forward / rollback 判準。&lt;/p>
&lt;h2 id="expand-migration">Expand Migration&lt;/h2>
&lt;p>Expand migration 的核心責任是先加入向後相容 schema。以下範例新增 &lt;code>accounts.email&lt;/code>，先允許 null。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;SQL&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="s">\timing on
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="s">BEGIN;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="s">ALTER TABLE accounts ADD COLUMN email text;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="s">COMMIT;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="s">SQL&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>新增 nullable column 通常是低風險操作，但仍要記錄 timing 與 lock。正式服務要在低流量窗口或 staging 上先測。&lt;/p>
&lt;h2 id="lock-evidence">Lock Evidence&lt;/h2>
&lt;p>Lock evidence 的核心責任是讓 migration 的阻塞風險可見。開另一個 terminal，在 migration 前後查 lock。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;SQL&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="s">SELECT locktype, relation::regclass, mode, granted, pid
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="s">FROM pg_locks
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="s">WHERE relation IN (&amp;#39;accounts&amp;#39;::regclass, &amp;#39;ledger_entries&amp;#39;::regclass)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="s">ORDER BY granted, mode;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="s">SQL&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Release gate 要保存 lock mode、duration、blocked session 與 application impact。高風險 DDL 要先改成 expand / backfill / contract。&lt;/p>
&lt;h2 id="backfill-and-validation">Backfill and Validation&lt;/h2>
&lt;p>Backfill and validation 的核心責任是把資料補齊並證明結果符合 domain。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;SQL&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="s">UPDATE accounts
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="s">SET email = lower(owner_name) || &amp;#39;@example.test&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="s">WHERE email IS NULL;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="s">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="s">SELECT count(*) AS missing_email
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="s">FROM accounts
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="s">WHERE email IS NULL;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">9&lt;/span>&lt;span class="cl">&lt;span class="s">SQL&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>大型表要分 batch backfill，避免 WAL、replica lag、autovacuum 與 lock 壓力。每個 batch 要記錄 row count、duration、error 與 lag。&lt;/p>
&lt;h2 id="add-constraint-safely">Add Constraint Safely&lt;/h2>
&lt;p>Add constraint safely 的核心責任是把資料驗證和 constraint 生效拆開。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">psql &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$DATABASE_URL&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;SQL&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="s">ALTER TABLE accounts
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="s">ADD CONSTRAINT accounts_email_present
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="s">CHECK (email IS NOT NULL) NOT VALID;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="s">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="s">ALTER TABLE accounts
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="s">VALIDATE CONSTRAINT accounts_email_present;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="s">SQL&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>NOT VALID&lt;/code> 讓 constraint 先約束新資料，再用 validation 掃既有資料。這是 PostgreSQL online migration 常用技巧。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL schema migration evidence lab 的核心責任是把 schema change 轉成 release gate 可使用的 evidence。這篇承接 <a href="../../online-schema-change/">Online Schema Change</a> 與 <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>。</p>
<p>本文的驗收標準是：你能設計 expand migration、量測 lock、跑 backfill validation、建立 contract migration 的 fail-forward / rollback 判準。</p>
<h2 id="expand-migration">Expand Migration</h2>
<p>Expand migration 的核心責任是先加入向後相容 schema。以下範例新增 <code>accounts.email</code>，先允許 null。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s">\timing on
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s">BEGIN;
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s">ALTER TABLE accounts ADD COLUMN email text;
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s">COMMIT;
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>新增 nullable column 通常是低風險操作，但仍要記錄 timing 與 lock。正式服務要在低流量窗口或 staging 上先測。</p>
<h2 id="lock-evidence">Lock Evidence</h2>
<p>Lock evidence 的核心責任是讓 migration 的阻塞風險可見。開另一個 terminal，在 migration 前後查 lock。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s">SELECT locktype, relation::regclass, mode, granted, pid
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s">FROM pg_locks
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s">WHERE relation IN (&#39;accounts&#39;::regclass, &#39;ledger_entries&#39;::regclass)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s">ORDER BY granted, mode;
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>Release gate 要保存 lock mode、duration、blocked session 與 application impact。高風險 DDL 要先改成 expand / backfill / contract。</p>
<h2 id="backfill-and-validation">Backfill and Validation</h2>
<p>Backfill and validation 的核心責任是把資料補齊並證明結果符合 domain。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s">UPDATE accounts
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s">SET email = lower(owner_name) || &#39;@example.test&#39;
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s">WHERE email IS NULL;
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s">SELECT count(*) AS missing_email
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="s">FROM accounts
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="s">WHERE email IS NULL;
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>大型表要分 batch backfill，避免 WAL、replica lag、autovacuum 與 lock 壓力。每個 batch 要記錄 row count、duration、error 與 lag。</p>
<h2 id="add-constraint-safely">Add Constraint Safely</h2>
<p>Add constraint safely 的核心責任是把資料驗證和 constraint 生效拆開。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s">ALTER TABLE accounts
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s">ADD CONSTRAINT accounts_email_present
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s">CHECK (email IS NOT NULL) NOT VALID;
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s">ALTER TABLE accounts
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="s">VALIDATE CONSTRAINT accounts_email_present;
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p><code>NOT VALID</code> 讓 constraint 先約束新資料，再用 validation 掃既有資料。這是 PostgreSQL online migration 常用技巧。</p>
<h2 id="query-plan-evidence">Query Plan Evidence</h2>
<p>Query plan evidence 的核心責任是確認 migration 後 query 仍走正確路徑。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> <span class="s">&lt;&lt;&#39;SQL&#39;
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="s">EXPLAIN (ANALYZE, BUFFERS)
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="s">SELECT *
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="s">FROM accounts
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="s">WHERE email = &#39;ada@example.test&#39;;
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="s">SQL</span></span></span></code></pre></div><p>若 email 查詢成為正式 path，要新增 index，並用 <code>CREATE INDEX CONCURRENTLY</code> 評估 lock 與時間。</p>
<h2 id="contract-migration">Contract Migration</h2>
<p>Contract migration 的核心責任是在 application 都改用新欄位後，收斂舊欄位或舊 constraint。Contract migration 要比 expand 更謹慎，因為 rollback 空間更小。</p>
<p>Contract release gate：</p>
<ol>
<li>所有 app version 已停止讀舊欄位 / 舊行為。</li>
<li>Backfill validation 為零缺口。</li>
<li>Query plan 與 index evidence 已保存。</li>
<li>Rollback path 是 fail-forward 或 restore，兩者擇一寫清楚。</li>
<li>PITR / backup window 符合風險。</li>
</ol>
<h2 id="release-gate-note">Release Gate Note</h2>
<p>Release gate note 的核心責任是形成可交付 artifact。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Migration: add accounts.email
</span></span><span class="line"><span class="ln">2</span><span class="cl">Expand DDL duration:
</span></span><span class="line"><span class="ln">3</span><span class="cl">Backfill rows:
</span></span><span class="line"><span class="ln">4</span><span class="cl">Validation query:
</span></span><span class="line"><span class="ln">5</span><span class="cl">Lock evidence:
</span></span><span class="line"><span class="ln">6</span><span class="cl">Query plan:
</span></span><span class="line"><span class="ln">7</span><span class="cl">Rollback / fail-forward:
</span></span><span class="line"><span class="ln">8</span><span class="cl">Owner:</span></span></code></pre></div><p>完成本篇後，複雜 migration 回到 <a href="../../online-schema-change/">Online Schema Change</a>；需要跨 DB 遷移則讀 <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL Security / RLS / Audit Logging</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/security-rls-audit-logging/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/security-rls-audit-logging/</guid><description>&lt;p>PostgreSQL security / RLS / audit logging 的核心責任是把資料庫安全拆成存取邊界、資料列可見性與操作證據。PostgreSQL role / grant 決定誰能連線與操作 schema；&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/row-level-security/" data-link-title="Row-Level Security" data-link-desc="說明資料庫如何用 policy 限制同一張表中哪些 row 對某個角色可見或可寫">Row Level Security&lt;/a> 決定同一張表中哪些 row 對某個 role 可見；audit logging 則把敏感操作轉成可查詢、可保留、可告警的證據。&lt;/p>
&lt;p>本文的判讀錨點是：資料庫安全是 application auth 的下游防線。Application 仍要負責身份、session、租戶與 workflow；PostgreSQL security layer 負責在資料邊界補上 least privilege、tenant isolation 與 forensic evidence。&lt;/p>
&lt;h2 id="role-and-grant-baseline">Role and Grant Baseline&lt;/h2>
&lt;p>Role and grant baseline 的核心責任是把人、服務、migration 與分析查詢分開。Production database 至少要區分 application role、migration role、read-only role、admin role 與 replication / CDC role。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Role 類型&lt;/th>
 &lt;th>權限責任&lt;/th>
 &lt;th>常見風險&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Application&lt;/td>
 &lt;td>執行產品讀寫&lt;/td>
 &lt;td>權限過大、可 DDL、可讀所有 schema&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Migration&lt;/td>
 &lt;td>變更 schema&lt;/td>
 &lt;td>和 app 共用 role，事故難以追蹤&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Read-only&lt;/td>
 &lt;td>分析、debug、support&lt;/td>
 &lt;td>讀到 PII 或跨 tenant 資料&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replication / CDC&lt;/td>
 &lt;td>logical replication、slot access&lt;/td>
 &lt;td>權限與 WAL retention 風險&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Admin&lt;/td>
 &lt;td>emergency operation&lt;/td>
 &lt;td>日常使用 admin role&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Grant review 要以 schema ownership 開始。Tables、sequences、functions、views、extensions 都有權限面；只管 table grant 會漏掉 sequence update、function execution 與 extension 使用。&lt;/p>
&lt;h2 id="row-level-security">Row Level Security&lt;/h2>
&lt;p>Row Level Security 的核心責任是在資料庫層 enforce row visibility。PostgreSQL 官方 RLS 文件描述 policy 可限制 normal query 返回、insert、update、delete 的 row；這讓 tenant boundary 可以在 database 層多一道 guard。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>RLS 使用情境&lt;/th>
 &lt;th>適合條件&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Multi-tenant SaaS&lt;/td>
 &lt;td>tenant_id 明確且每個 query 都可帶入&lt;/td>
 &lt;td>policy 是否覆蓋 SELECT / INSERT / UPDATE&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Support access&lt;/td>
 &lt;td>support role 需受限查詢&lt;/td>
 &lt;td>break-glass 是否有 audit&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Regional data&lt;/td>
 &lt;td>row 上有 region / residency&lt;/td>
 &lt;td>policy 是否和 GDPR / residency 對齊&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sensitive subset&lt;/td>
 &lt;td>PII row 需特別隔離&lt;/td>
 &lt;td>masking / tokenization 是否仍需存在&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>RLS policy 要有 positive allow rule。每張啟用 RLS 的 table 都要有測試：同 tenant 可讀、跨 tenant 隔離、insert tenant mismatch 被擋、admin / support 例外被記錄。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL security / RLS / audit logging 的核心責任是把資料庫安全拆成存取邊界、資料列可見性與操作證據。PostgreSQL role / grant 決定誰能連線與操作 schema；<a href="/blog/backend/knowledge-cards/row-level-security/" data-link-title="Row-Level Security" data-link-desc="說明資料庫如何用 policy 限制同一張表中哪些 row 對某個角色可見或可寫">Row Level Security</a> 決定同一張表中哪些 row 對某個 role 可見；audit logging 則把敏感操作轉成可查詢、可保留、可告警的證據。</p>
<p>本文的判讀錨點是：資料庫安全是 application auth 的下游防線。Application 仍要負責身份、session、租戶與 workflow；PostgreSQL security layer 負責在資料邊界補上 least privilege、tenant isolation 與 forensic evidence。</p>
<h2 id="role-and-grant-baseline">Role and Grant Baseline</h2>
<p>Role and grant baseline 的核心責任是把人、服務、migration 與分析查詢分開。Production database 至少要區分 application role、migration role、read-only role、admin role 與 replication / CDC role。</p>
<table>
  <thead>
      <tr>
          <th>Role 類型</th>
          <th>權限責任</th>
          <th>常見風險</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application</td>
          <td>執行產品讀寫</td>
          <td>權限過大、可 DDL、可讀所有 schema</td>
      </tr>
      <tr>
          <td>Migration</td>
          <td>變更 schema</td>
          <td>和 app 共用 role，事故難以追蹤</td>
      </tr>
      <tr>
          <td>Read-only</td>
          <td>分析、debug、support</td>
          <td>讀到 PII 或跨 tenant 資料</td>
      </tr>
      <tr>
          <td>Replication / CDC</td>
          <td>logical replication、slot access</td>
          <td>權限與 WAL retention 風險</td>
      </tr>
      <tr>
          <td>Admin</td>
          <td>emergency operation</td>
          <td>日常使用 admin role</td>
      </tr>
  </tbody>
</table>
<p>Grant review 要以 schema ownership 開始。Tables、sequences、functions、views、extensions 都有權限面；只管 table grant 會漏掉 sequence update、function execution 與 extension 使用。</p>
<h2 id="row-level-security">Row Level Security</h2>
<p>Row Level Security 的核心責任是在資料庫層 enforce row visibility。PostgreSQL 官方 RLS 文件描述 policy 可限制 normal query 返回、insert、update、delete 的 row；這讓 tenant boundary 可以在 database 層多一道 guard。</p>
<table>
  <thead>
      <tr>
          <th>RLS 使用情境</th>
          <th>適合條件</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-tenant SaaS</td>
          <td>tenant_id 明確且每個 query 都可帶入</td>
          <td>policy 是否覆蓋 SELECT / INSERT / UPDATE</td>
      </tr>
      <tr>
          <td>Support access</td>
          <td>support role 需受限查詢</td>
          <td>break-glass 是否有 audit</td>
      </tr>
      <tr>
          <td>Regional data</td>
          <td>row 上有 region / residency</td>
          <td>policy 是否和 GDPR / residency 對齊</td>
      </tr>
      <tr>
          <td>Sensitive subset</td>
          <td>PII row 需特別隔離</td>
          <td>masking / tokenization 是否仍需存在</td>
      </tr>
  </tbody>
</table>
<p>RLS policy 要有 positive allow rule。每張啟用 RLS 的 table 都要有測試：同 tenant 可讀、跨 tenant 隔離、insert tenant mismatch 被擋、admin / support 例外被記錄。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">invoices</span><span class="w"> </span><span class="n">ENABLE</span><span class="w"> </span><span class="k">ROW</span><span class="w"> </span><span class="k">LEVEL</span><span class="w"> </span><span class="k">SECURITY</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">POLICY</span><span class="w"> </span><span class="n">tenant_isolation</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">invoices</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">USING</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">current_setting</span><span class="p">(</span><span class="s1">&#39;app.tenant_id&#39;</span><span class="p">)::</span><span class="n">uuid</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">current_setting</span><span class="p">(</span><span class="s1">&#39;app.tenant_id&#39;</span><span class="p">)::</span><span class="n">uuid</span><span class="p">);</span></span></span></code></pre></div><p>這段 policy 依賴 application 在 transaction 內設定 <code>app.tenant_id</code>。使用 connection pooler 時，設定必須跟 transaction boundary 對齊，避免 session state 漂移。</p>
<h2 id="audit-logging">Audit Logging</h2>
<p>Audit logging 的核心責任是把敏感資料操作轉成可查詢證據。PostgreSQL 原生日誌可以記錄連線、DDL、錯誤與慢查詢；pgAudit 這類 extension 則補強 session / object audit。</p>
<table>
  <thead>
      <tr>
          <th>Audit 類型</th>
          <th>目的</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DDL audit</td>
          <td>schema 變更追蹤</td>
          <td>migration id、role、statement、timestamp</td>
      </tr>
      <tr>
          <td>Sensitive read</td>
          <td>PII / payment / health data 查詢</td>
          <td>role、tenant、operation、reason</td>
      </tr>
      <tr>
          <td>Privilege change</td>
          <td>grant / revoke / role 變更</td>
          <td>actor、target role、approval</td>
      </tr>
      <tr>
          <td>Failed access</td>
          <td>權限錯誤與 RLS block</td>
          <td>error code、role、relation</td>
      </tr>
      <tr>
          <td>Break-glass</td>
          <td>emergency admin access</td>
          <td>ticket id、duration、review result</td>
      </tr>
  </tbody>
</table>
<p>Audit log 要能進入 SIEM 或集中 log。只留在 database host 上，事故後查詢成本高；正式 runbook 要定義 retention、masking、access control 與 alert。</p>
<h2 id="pii-and-data-protection-boundary">PII and Data Protection Boundary</h2>
<p>PII and data protection boundary 的核心責任是把 database 權限和資料保護策略接起來。RLS 可以限制 row visibility，但 PII 的保護還需要 masking、tokenization、encryption、retention 與 deletion evidence。</p>
<table>
  <thead>
      <tr>
          <th>資料類型</th>
          <th>Database control</th>
          <th>跨模組路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tenant data</td>
          <td>RLS、tenant-scoped role</td>
          <td>data access review</td>
      </tr>
      <tr>
          <td>PII</td>
          <td>column grant、masking view</td>
          <td><a href="/blog/backend/07-security-data-protection/data-protection-and-masking-governance/" data-link-title="7.4 資料保護與遮罩治理" data-link-desc="以問題驅動方式整理資料分級、遮罩、匯出與備份治理">Data Protection</a></td>
      </tr>
      <tr>
          <td>Audit log</td>
          <td>append-only storage、retention</td>
          <td>SIEM / incident evidence</td>
      </tr>
      <tr>
          <td>Deletion request</td>
          <td>tombstone、cascade review</td>
          <td>retention policy、legal hold</td>
      </tr>
  </tbody>
</table>
<p>Column-level grant 和 masking view 適合 read-only analyst。Application role 通常需要明文處理 workflow；analyst / support role 則應走 restricted view。</p>
<h2 id="operational-evidence">Operational Evidence</h2>
<p>Operational evidence 的核心責任是讓安全設定可驗證。每次 release 或權限變更後，要跑固定檢查。</p>
<ol>
<li>Role matrix：每個 role 的 schema / table / sequence / function grant。</li>
<li>RLS test：tenant A / tenant B / support / admin 的可見性測試。</li>
<li>Audit sample：DDL、sensitive read、failed access 是否進 log。</li>
<li>Pooler compatibility：<code>SET LOCAL app.tenant_id</code> 是否跟 transaction 對齊。</li>
<li><a href="/blog/backend/knowledge-cards/break-glass-access/" data-link-title="Break-Glass Access" data-link-desc="說明緊急情況下臨時授予的高權限存取，如何用工單、時限與事後審查治理">Break-glass</a> drill：emergency access 是否可申請、可回收、可審查。</li>
</ol>
<p>Evidence 要保存在 release artifact。Security 設定只有文件描述時，incident 後難以證明它真的生效。</p>
<h2 id="failure-modes">Failure Modes</h2>
<p>Failure modes 的核心責任是把 database security 常見事故提前列出。</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>判讀訊號</th>
          <th>修正方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>App role 權限過大</td>
          <td>app 可 DDL / drop / grant</td>
          <td>role split + least privilege</td>
      </tr>
      <tr>
          <td>RLS bypass</td>
          <td>owner / superuser / policy 漏洞</td>
          <td>dedicated app role + RLS test</td>
      </tr>
      <tr>
          <td>Pooler state drift</td>
          <td>tenant setting 漂到下個 request</td>
          <td><code>SET LOCAL</code> + transaction pooling review</td>
      </tr>
      <tr>
          <td>Audit gap</td>
          <td>敏感操作查不到 actor</td>
          <td>pgAudit / log schema / SIEM route</td>
      </tr>
      <tr>
          <td>Support overread</td>
          <td>support role 可讀全 tenant</td>
          <td>masking view + ticket-scoped access</td>
      </tr>
  </tbody>
</table>
<p>RLS bypass 要特別審查 table owner 與 superuser path。正式 application 連線應使用 dedicated role，並避免使用 table owner role 執行一般 request。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Security / RLS / audit logging 完成後，權限與 PII 治理讀 <a href="/blog/backend/07-security-data-protection/data-protection-and-masking-governance/" data-link-title="7.4 資料保護與遮罩治理" data-link-desc="以問題驅動方式整理資料分級、遮罩、匯出與備份治理">Data Protection</a>；connection state 風險讀 <a href="../connection-pooler-comparison/">Connection Pooler Comparison</a>；實作演練可放進 <a href="../hands-on/schema-migration-evidence-lab/">Schema Migration Evidence Lab</a> 的 release gate。</p>
]]></content:encoded></item><item><title>PostgreSQL to YugabyteDB / TiDB Migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-yugabytedb-tidb/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-yugabytedb-tidb/</guid><description>&lt;p>PostgreSQL to YugabyteDB / TiDB migration 的核心責任是處理從 single-primary PostgreSQL 走向 distributed SQL 的資料拓撲變更。這條路線通常由 multi-region write、horizontal scale、tenant sharding、availability 或 single-node capacity ceiling 觸發；其中 YugabyteDB 走 PostgreSQL-compatible YSQL 路線，TiDB 走 MySQL-compatible distributed SQL 路線，兩者的 application diff audit 不同。&lt;/p>
&lt;p>本文的判讀錨點是：API compatibility 只解決入口語法的一部分。YugabyteDB 要審查 PostgreSQL 相容與 distributed operation 差異；TiDB 要額外處理 PostgreSQL → MySQL dialect / driver / tooling 轉換。Distributed SQL 會改變 transaction latency、placement、index cost、DDL、sequence、lock、backup、observability 與 incident route。&lt;/p>
&lt;h2 id="official-documentation-route">Official Documentation Route&lt;/h2>
&lt;p>Official documentation route 的核心責任是把 compatibility claim 固定到可回查來源。YugabyteDB compatibility 先查 &lt;a href="https://docs.yugabyte.com/stable/reference/configuration/postgresql-compatibility/">YugabyteDB PostgreSQL compatibility&lt;/a>；TiDB compatibility 先查 &lt;a href="https://docs.pingcap.com/tidb/stable/mysql-compatibility/">TiDB MySQL compatibility&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="driver-check">Driver Check&lt;/h2>
&lt;p>Driver check 的核心責任是確認 distributed SQL 解決的是核心問題。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>代表需求&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Multi-region write&lt;/td>
 &lt;td>多地使用者都要低延遲寫入&lt;/td>
 &lt;td>consistency level、latency budget&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Horizontal write scaling&lt;/td>
 &lt;td>單 primary CPU / I/O 到頂&lt;/td>
 &lt;td>shard key、hot key、cross-shard txn&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tenant distribution&lt;/td>
 &lt;td>tenant 可依 region / size 分布&lt;/td>
 &lt;td>tenant placement、rebalance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Availability&lt;/td>
 &lt;td>節點 / zone failure 容忍&lt;/td>
 &lt;td>quorum、failover、RPO / RTO&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational consolidation&lt;/td>
 &lt;td>多 PG shard 想收斂&lt;/td>
 &lt;td>migration complexity、cost&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>若主要問題是 read scaling、connection 數或 query index，先評估 read replica、pooler、partition、Citus 或 Aurora；distributed SQL 適合資料拓撲問題。&lt;/p>
&lt;h2 id="compatibility-audit">Compatibility Audit&lt;/h2>
&lt;p>Compatibility audit 的核心責任是把 PostgreSQL behavior 逐項對照 target。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Protocol / API&lt;/td>
 &lt;td>YugabyteDB YSQL vs TiDB MySQL protocol&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>SQL dialect&lt;/td>
 &lt;td>function、extension、type、DDL support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Transaction&lt;/td>
 &lt;td>isolation、lock、deadlock、retry&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sequence / ID&lt;/td>
 &lt;td>global sequence latency、UUID policy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index&lt;/td>
 &lt;td>secondary index placement、write cost&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Foreign key&lt;/td>
 &lt;td>distributed FK cost / support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Extension&lt;/td>
 &lt;td>PostGIS、pgvector、custom extension；TiDB 路線需改寫或拆出&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tooling&lt;/td>
 &lt;td>migration tool、CDC、backup、monitoring&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Compatibility audit 要用 application query suite。只看 schema import 會漏掉 transaction retry、query planner、distributed index、dialect rewrite 與 latency。TiDB 路線還要加 PostgreSQL driver / SQL / type / migration tool 轉 MySQL ecosystem 的審查。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL to YugabyteDB / TiDB migration 的核心責任是處理從 single-primary PostgreSQL 走向 distributed SQL 的資料拓撲變更。這條路線通常由 multi-region write、horizontal scale、tenant sharding、availability 或 single-node capacity ceiling 觸發；其中 YugabyteDB 走 PostgreSQL-compatible YSQL 路線，TiDB 走 MySQL-compatible distributed SQL 路線，兩者的 application diff audit 不同。</p>
<p>本文的判讀錨點是：API compatibility 只解決入口語法的一部分。YugabyteDB 要審查 PostgreSQL 相容與 distributed operation 差異；TiDB 要額外處理 PostgreSQL → MySQL dialect / driver / tooling 轉換。Distributed SQL 會改變 transaction latency、placement、index cost、DDL、sequence、lock、backup、observability 與 incident route。</p>
<h2 id="official-documentation-route">Official Documentation Route</h2>
<p>Official documentation route 的核心責任是把 compatibility claim 固定到可回查來源。YugabyteDB compatibility 先查 <a href="https://docs.yugabyte.com/stable/reference/configuration/postgresql-compatibility/">YugabyteDB PostgreSQL compatibility</a>；TiDB compatibility 先查 <a href="https://docs.pingcap.com/tidb/stable/mysql-compatibility/">TiDB MySQL compatibility</a>；本文最後檢查日是 2026-05-22。</p>
<h2 id="driver-check">Driver Check</h2>
<p>Driver check 的核心責任是確認 distributed SQL 解決的是核心問題。</p>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>代表需求</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-region write</td>
          <td>多地使用者都要低延遲寫入</td>
          <td>consistency level、latency budget</td>
      </tr>
      <tr>
          <td>Horizontal write scaling</td>
          <td>單 primary CPU / I/O 到頂</td>
          <td>shard key、hot key、cross-shard txn</td>
      </tr>
      <tr>
          <td>Tenant distribution</td>
          <td>tenant 可依 region / size 分布</td>
          <td>tenant placement、rebalance</td>
      </tr>
      <tr>
          <td>Availability</td>
          <td>節點 / zone failure 容忍</td>
          <td>quorum、failover、RPO / RTO</td>
      </tr>
      <tr>
          <td>Operational consolidation</td>
          <td>多 PG shard 想收斂</td>
          <td>migration complexity、cost</td>
      </tr>
  </tbody>
</table>
<p>若主要問題是 read scaling、connection 數或 query index，先評估 read replica、pooler、partition、Citus 或 Aurora；distributed SQL 適合資料拓撲問題。</p>
<h2 id="compatibility-audit">Compatibility Audit</h2>
<p>Compatibility audit 的核心責任是把 PostgreSQL behavior 逐項對照 target。</p>
<table>
  <thead>
      <tr>
          <th>面向</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Protocol / API</td>
          <td>YugabyteDB YSQL vs TiDB MySQL protocol</td>
      </tr>
      <tr>
          <td>SQL dialect</td>
          <td>function、extension、type、DDL support</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>isolation、lock、deadlock、retry</td>
      </tr>
      <tr>
          <td>Sequence / ID</td>
          <td>global sequence latency、UUID policy</td>
      </tr>
      <tr>
          <td>Index</td>
          <td>secondary index placement、write cost</td>
      </tr>
      <tr>
          <td>Foreign key</td>
          <td>distributed FK cost / support</td>
      </tr>
      <tr>
          <td>Extension</td>
          <td>PostGIS、pgvector、custom extension；TiDB 路線需改寫或拆出</td>
      </tr>
      <tr>
          <td>Tooling</td>
          <td>migration tool、CDC、backup、monitoring</td>
      </tr>
  </tbody>
</table>
<p>Compatibility audit 要用 application query suite。只看 schema import 會漏掉 transaction retry、query planner、distributed index、dialect rewrite 與 latency。TiDB 路線還要加 PostgreSQL driver / SQL / type / migration tool 轉 MySQL ecosystem 的審查。</p>
<h2 id="data-topology">Data Topology</h2>
<p>Data topology 的核心責任是決定資料如何分布。Distributed SQL 的成敗常取決於 primary key、tenant key、region placement 與 hot key 控制。</p>
<table>
  <thead>
      <tr>
          <th>拓撲決策</th>
          <th>判讀問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution key</td>
          <td>query 是否能 co-locate data</td>
      </tr>
      <tr>
          <td>Region placement</td>
          <td>資料是否需要 residency / low latency</td>
      </tr>
      <tr>
          <td>Hot key</td>
          <td>high-write tenant / account 是否集中</td>
      </tr>
      <tr>
          <td>Secondary index</td>
          <td>index write 是否跨 shard / region</td>
      </tr>
      <tr>
          <td>Transaction span</td>
          <td>交易是否常跨 tenant / region</td>
      </tr>
  </tbody>
</table>
<p>Topology 設計要從最高頻 workflow 開始。若核心交易每次都跨 shard，distributed SQL 的 latency 與 conflict cost 會很高。</p>
<h2 id="migration-phases">Migration Phases</h2>
<p>Migration phases 的核心責任是降低跨拓撲遷移風險。</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lab import</td>
          <td>schema import、query suite、driver test</td>
      </tr>
      <tr>
          <td>Topology design</td>
          <td>key、placement、region、index review</td>
      </tr>
      <tr>
          <td>Backfill</td>
          <td>snapshot、batch、checksum</td>
      </tr>
      <tr>
          <td>CDC catch-up</td>
          <td>LSN / change stream、lag、idempotency</td>
      </tr>
      <tr>
          <td>Shadow read</td>
          <td>result diff、latency profile</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>freeze、final sync、traffic switch</td>
      </tr>
      <tr>
          <td>Rollback</td>
          <td>source PG snapshot、write replay plan</td>
      </tr>
  </tbody>
</table>
<p>CDC catch-up 要有 clear cutover LSN。Distributed SQL migration 最怕 source / target 同時有寫入後，缺少 reconciliation plan。</p>
<h2 id="application-changes">Application Changes</h2>
<p>Application changes 的核心責任是讓程式接受 distributed system 的錯誤模式。</p>
<ol>
<li>Transaction retry：serialization / conflict error 要可重試。</li>
<li>Idempotency：critical write 要有 natural key 或 idempotency key。</li>
<li>Latency budget：跨 region transaction 要進 SLO。</li>
<li>Pagination / ordering：distributed query 的排序成本要審查。</li>
<li>Connection / driver：target driver、TLS、pooling、load balancing 要測。</li>
</ol>
<p>Application 若假設 single-node low-latency transaction，遷移後會在 tail latency 與 retry 行為上出現落差。TiDB 路線還會出現 driver、placeholder、SQL function、type mapping 與 error code 的轉換成本；這些要在 staging failure injection 先看到。</p>
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go conditions 的核心責任是阻止把 distributed SQL 當成萬用擴容。</p>
<table>
  <thead>
      <tr>
          <th>No-go 訊號</th>
          <th>替代路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>主要瓶頸是少數 slow query</td>
          <td>query optimization / index</td>
      </tr>
      <tr>
          <td>多數交易跨全局資料</td>
          <td>重設 bounded context 或保持 single primary</td>
      </tr>
      <tr>
          <td>Team 缺少 distributed operation 能力</td>
          <td>managed provider / simpler topology</td>
      </tr>
      <tr>
          <td>PostgreSQL extension 依賴重</td>
          <td>保留 PG 或拆出 specialized service</td>
      </tr>
      <tr>
          <td>RPO / rollback 沒有演練</td>
          <td>先完成 migration playbook</td>
      </tr>
      <tr>
          <td>想保留 PostgreSQL driver / SQL surface</td>
          <td>優先評估 YugabyteDB / CockroachDB / Citus</td>
      </tr>
  </tbody>
</table>
<p>Distributed SQL 的價值來自拓撲匹配。若 workload 缺少自然分布邊界，導入後只是把單點瓶頸換成分散式複雜度。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>PostgreSQL to YugabyteDB / TiDB migration 完成後，先讀 <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">Global Distributed OLTP</a>；若需求是 PostgreSQL 內分散式 table，讀 <a href="../citus-distributed/">Citus Distributed</a>；跨 vendor 流程讀 <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>。</p>
]]></content:encoded></item><item><title>Specialized PostgreSQL Variants</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/</guid><description>&lt;p>Specialized PostgreSQL variants 的核心責任是把 PostgreSQL ecosystem 裡的 specialized engines、extensions 與 managed variants 放到正確服務位置。PostgreSQL 的擴充性讓它能支援 geospatial、time-series、vector search、distributed table、serverless branch 與 managed acceleration；但每個變體都改變 operation、migration、cost 與 lock-in。&lt;/p>
&lt;p>本文的判讀錨點是：PostgreSQL compatibility 是入口，不等於相同責任。選 variant 前，要先說清楚新增能力解決哪個 workload，並確認 exit route。&lt;/p>
&lt;h2 id="variant-taxonomy">Variant Taxonomy&lt;/h2>
&lt;p>Variant taxonomy 的核心責任是把變體按資料模型與操作責任分類。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>類型&lt;/th>
 &lt;th>代表&lt;/th>
 &lt;th>主要解決問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Extension domain&lt;/td>
 &lt;td>PostGIS、pgvector、TimescaleDB&lt;/td>
 &lt;td>geospatial、vector、time-series&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Distributed PG&lt;/td>
 &lt;td>Citus、Cosmos DB for PostgreSQL&lt;/td>
 &lt;td>sharding、distributed query&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Managed accelerated PG&lt;/td>
 &lt;td>AlloyDB、Aurora PG&lt;/td>
 &lt;td>managed performance / HA / platform&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Serverless / branching&lt;/td>
 &lt;td>Neon、Supabase workflow&lt;/td>
 &lt;td>preview、branch、稀疏 workload&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Compatibility layer&lt;/td>
 &lt;td>YugabyteDB、部分 distributed SQL&lt;/td>
 &lt;td>PostgreSQL-like API + distributed storage&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>分類的重點是避免把不同變體視為同一種升級。Extension domain 強化單一資料模型；distributed PG 改變資料拓撲；managed accelerated PG 改變操作邊界；serverless PG 改變 lifecycle。&lt;/p>
&lt;h2 id="workload-fit">Workload Fit&lt;/h2>
&lt;p>Workload fit 的核心責任是判斷 variant 是否匹配資料形狀。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Workload&lt;/th>
 &lt;th>合適路線&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Geospatial query&lt;/td>
 &lt;td>PostGIS&lt;/td>
 &lt;td>index、SRID、資料量、query latency&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Time-series retention&lt;/td>
 &lt;td>TimescaleDB / partition strategy&lt;/td>
 &lt;td>compression、chunk、retention&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Vector search&lt;/td>
 &lt;td>pgvector / pgvectorscale&lt;/td>
 &lt;td>recall、latency、index build、hybrid search&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tenant sharding&lt;/td>
 &lt;td>Citus / distributed PG&lt;/td>
 &lt;td>distribution key、co-location、rebalance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Preview environment&lt;/td>
 &lt;td>serverless / branching PG&lt;/td>
 &lt;td>data privacy、branch lifecycle&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cloud-managed acceleration&lt;/td>
 &lt;td>AlloyDB / Aurora&lt;/td>
 &lt;td>compatibility、cost、exit&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Variant 要先證明普通 PostgreSQL 加 index / partition / read replica 已到邊界。若基礎 query design 還沒成熟，導入 variant 會把複雜度提前。&lt;/p>
&lt;h2 id="migration-gap">Migration Gap&lt;/h2>
&lt;p>Migration gap 的核心責任是列出從 vanilla PostgreSQL 進入 variant 的差異。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>差異面&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>DDL&lt;/td>
 &lt;td>extension object、distributed table、chunk&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query&lt;/td>
 &lt;td>planner、function、operator、pushdown&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data movement&lt;/td>
 &lt;td>backfill、reshard、index build&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operation&lt;/td>
 &lt;td>backup、restore、upgrade、failover&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tooling&lt;/td>
 &lt;td>ORM、migration tool、CDC、monitoring&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Exit&lt;/td>
 &lt;td>dump / restore 是否回到 vanilla PG&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Migration 要有 compatibility test。每個核心 query 在 variant 上跑 explain、latency、result correctness；每個 migration step 都要有 rollback 或 rebuild path。&lt;/p></description><content:encoded><![CDATA[<p>Specialized PostgreSQL variants 的核心責任是把 PostgreSQL ecosystem 裡的 specialized engines、extensions 與 managed variants 放到正確服務位置。PostgreSQL 的擴充性讓它能支援 geospatial、time-series、vector search、distributed table、serverless branch 與 managed acceleration；但每個變體都改變 operation、migration、cost 與 lock-in。</p>
<p>本文的判讀錨點是：PostgreSQL compatibility 是入口，不等於相同責任。選 variant 前，要先說清楚新增能力解決哪個 workload，並確認 exit route。</p>
<h2 id="variant-taxonomy">Variant Taxonomy</h2>
<p>Variant taxonomy 的核心責任是把變體按資料模型與操作責任分類。</p>
<table>
  <thead>
      <tr>
          <th>類型</th>
          <th>代表</th>
          <th>主要解決問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Extension domain</td>
          <td>PostGIS、pgvector、TimescaleDB</td>
          <td>geospatial、vector、time-series</td>
      </tr>
      <tr>
          <td>Distributed PG</td>
          <td>Citus、Cosmos DB for PostgreSQL</td>
          <td>sharding、distributed query</td>
      </tr>
      <tr>
          <td>Managed accelerated PG</td>
          <td>AlloyDB、Aurora PG</td>
          <td>managed performance / HA / platform</td>
      </tr>
      <tr>
          <td>Serverless / branching</td>
          <td>Neon、Supabase workflow</td>
          <td>preview、branch、稀疏 workload</td>
      </tr>
      <tr>
          <td>Compatibility layer</td>
          <td>YugabyteDB、部分 distributed SQL</td>
          <td>PostgreSQL-like API + distributed storage</td>
      </tr>
  </tbody>
</table>
<p>分類的重點是避免把不同變體視為同一種升級。Extension domain 強化單一資料模型；distributed PG 改變資料拓撲；managed accelerated PG 改變操作邊界；serverless PG 改變 lifecycle。</p>
<h2 id="workload-fit">Workload Fit</h2>
<p>Workload fit 的核心責任是判斷 variant 是否匹配資料形狀。</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>合適路線</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Geospatial query</td>
          <td>PostGIS</td>
          <td>index、SRID、資料量、query latency</td>
      </tr>
      <tr>
          <td>Time-series retention</td>
          <td>TimescaleDB / partition strategy</td>
          <td>compression、chunk、retention</td>
      </tr>
      <tr>
          <td>Vector search</td>
          <td>pgvector / pgvectorscale</td>
          <td>recall、latency、index build、hybrid search</td>
      </tr>
      <tr>
          <td>Tenant sharding</td>
          <td>Citus / distributed PG</td>
          <td>distribution key、co-location、rebalance</td>
      </tr>
      <tr>
          <td>Preview environment</td>
          <td>serverless / branching PG</td>
          <td>data privacy、branch lifecycle</td>
      </tr>
      <tr>
          <td>Cloud-managed acceleration</td>
          <td>AlloyDB / Aurora</td>
          <td>compatibility、cost、exit</td>
      </tr>
  </tbody>
</table>
<p>Variant 要先證明普通 PostgreSQL 加 index / partition / read replica 已到邊界。若基礎 query design 還沒成熟，導入 variant 會把複雜度提前。</p>
<h2 id="migration-gap">Migration Gap</h2>
<p>Migration gap 的核心責任是列出從 vanilla PostgreSQL 進入 variant 的差異。</p>
<table>
  <thead>
      <tr>
          <th>差異面</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DDL</td>
          <td>extension object、distributed table、chunk</td>
      </tr>
      <tr>
          <td>Query</td>
          <td>planner、function、operator、pushdown</td>
      </tr>
      <tr>
          <td>Data movement</td>
          <td>backfill、reshard、index build</td>
      </tr>
      <tr>
          <td>Operation</td>
          <td>backup、restore、upgrade、failover</td>
      </tr>
      <tr>
          <td>Tooling</td>
          <td>ORM、migration tool、CDC、monitoring</td>
      </tr>
      <tr>
          <td>Exit</td>
          <td>dump / restore 是否回到 vanilla PG</td>
      </tr>
  </tbody>
</table>
<p>Migration 要有 compatibility test。每個核心 query 在 variant 上跑 explain、latency、result correctness；每個 migration step 都要有 rollback 或 rebuild path。</p>
<h2 id="lock-in-and-exit">Lock-In and Exit</h2>
<p>Lock-in and exit 的核心責任是把 variant-specific 能力和可攜性分開。</p>
<table>
  <thead>
      <tr>
          <th>Lock-in 來源</th>
          <th>控制方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Extension-specific type</td>
          <td>adapter layer、domain boundary</td>
      </tr>
      <tr>
          <td>Managed-only feature</td>
          <td>decision record、exit test</td>
      </tr>
      <tr>
          <td>Distributed table DDL</td>
          <td>topology doc、reshard runbook</td>
      </tr>
      <tr>
          <td>Serverless branch API</td>
          <td>dev workflow boundary</td>
      </tr>
      <tr>
          <td>Proprietary index / function</td>
          <td>fallback query / export strategy</td>
      </tr>
  </tbody>
</table>
<p>Lock-in 可以接受，但要被命名。若 variant 能顯著降低成本或提高能力，採用是合理決策；工程責任是保留 exit evidence 與 migration plan。</p>
<h2 id="decision-matrix">Decision Matrix</h2>
<p>Decision matrix 的核心責任是把 variant 路由接到 PostgreSQL 主章。</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>下一步</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>地理查詢是核心產品能力</td>
          <td><a href="../postgis-deep-dive/">PostGIS Deep Dive</a></td>
      </tr>
      <tr>
          <td>時序資料與 retention 是主壓力</td>
          <td><a href="../timescaledb-deep-dive/">TimescaleDB Deep Dive</a></td>
      </tr>
      <tr>
          <td>向量搜尋在 PG 內整合</td>
          <td><a href="../pgvector-deep-dive/">pgvector Deep Dive</a></td>
      </tr>
      <tr>
          <td>tenant sharding / distributed query</td>
          <td><a href="../citus-distributed/">Citus Distributed</a></td>
      </tr>
      <tr>
          <td>managed provider 選型</td>
          <td><a href="../managed-pg-comparison/">Managed PostgreSQL Comparison</a></td>
      </tr>
      <tr>
          <td>分散式 SQL API 相容評估</td>
          <td><a href="../migrate-to-yugabytedb-tidb/">PostgreSQL to YugabyteDB / TiDB</a></td>
      </tr>
  </tbody>
</table>
<p>Decision matrix 要隨案例更新。Variant 選型最需要實際 workload：資料量、query pattern、SLO、team skill、合規與 exit 成本。</p>
<h2 id="review-checklist">Review Checklist</h2>
<p>Review checklist 的核心責任是避免 specialized variant 只被功能吸引。</p>
<ol>
<li>Workload 是否真的需要 specialized capability。</li>
<li>Vanilla PostgreSQL 的 index / partition / replica 是否已評估。</li>
<li>Extension / managed feature 的版本與支援政策。</li>
<li>Backup / restore / upgrade runbook。</li>
<li>Migration tool、CDC、observability 是否支援。</li>
<li>Exit route 是否至少在 staging 演練。</li>
<li>成本模型是否包含 storage、compute、I/O、support、operation。</li>
</ol>
<p>完成 checklist 後，variant 才能進入正式 proposal。這樣可以保留 PostgreSQL ecosystem 的彈性，也避免變體變成隱形平台遷移。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Specialized variants 完成後，回到 <a href="../">PostgreSQL overview</a> 做服務定位；需要 managed provider 比較讀 <a href="../managed-pg-comparison/">Managed PostgreSQL Comparison</a>；需要跨 vendor migration 讀 <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>。</p>
]]></content:encoded></item><item><title>PostgreSQL to SQLite Simplification</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/sqlite/migrate-from-postgresql-simplification/</link><pubDate>Thu, 21 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/sqlite/migrate-from-postgresql-simplification/</guid><description>&lt;p>PostgreSQL to SQLite simplification 的核心責任是處理反向路線：服務責任縮小後，評估 SQLite 是否能降低操作成本。這條路線適合 single-user app、CLI、desktop app、內部工具、read-mostly artifact store、demo environment、local-first prototype 或 edge-local utility。&lt;/p>
&lt;p>本文的判讀錨點是：降級到 SQLite 是責任縮小，也是讓資料模型回到 single-process / file-owned / local-state 的工程選擇。只要正式需求從 multi-user server DB 回到這個範圍，SQLite 可以提供更低元件數、更容易搬移與更低維護成本。&lt;/p>
&lt;h2 id="simplification-drivers">Simplification Drivers&lt;/h2>
&lt;p>Simplification drivers 的核心責任是確認 PostgreSQL 的能力已超過服務需求。若 server DB 的 HA、role、replica、pool、vacuum、PITR、schema governance 都變成維運負擔，而產品只需要單一 process 持有資料，就可以評估 SQLite。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>代表情境&lt;/th>
 &lt;th>SQLite 帶來的收益&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Single-user app&lt;/td>
 &lt;td>desktop、CLI、local admin tool&lt;/td>
 &lt;td>file portability、offline use&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Read-mostly artifact&lt;/td>
 &lt;td>build metadata、catalog snapshot&lt;/td>
 &lt;td>deployment simple、低 runtime dependency&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Internal tool&lt;/td>
 &lt;td>小團隊使用、資料量小、低寫入&lt;/td>
 &lt;td>降低 DB server operation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Demo / fixture&lt;/td>
 &lt;td>每個 environment 一份可重建資料&lt;/td>
 &lt;td>quick reset、deterministic seed&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Edge-local utility&lt;/td>
 &lt;td>request-local / device-local state&lt;/td>
 &lt;td>low latency、local ownership&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Driver 要連到 ownership。SQLite 適合「這份資料由某個 process / device / artifact 明確持有」；若資料仍屬於多服務共同真相，保留 PostgreSQL 或改成 managed SQL 會更穩定。&lt;/p>
&lt;h2 id="no-go-conditions">No-Go Conditions&lt;/h2>
&lt;p>No-go condition 的核心責任是保護仍需要 server DB 的服務。若 PostgreSQL 的核心能力仍被業務依賴，遷到 SQLite 會把風險轉移到 application code、file backup 與人工流程。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>No-go 訊號&lt;/th>
 &lt;th>代表責任&lt;/th>
 &lt;th>保留路由&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>多 tenant 與 centralized permission&lt;/td>
 &lt;td>DB role、grant、audit 仍有價值&lt;/td>
 &lt;td>PostgreSQL&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>多 instance concurrent writer&lt;/td>
 &lt;td>SQLite writer boundary 壓力過高&lt;/td>
 &lt;td>PostgreSQL / MySQL&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>PITR / HA 是合約要求&lt;/td>
 &lt;td>server DB operation 是正式責任&lt;/td>
 &lt;td>Managed PostgreSQL / Aurora&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Analyst / job 直接查 DB&lt;/td>
 &lt;td>access control 與 query isolation&lt;/td>
 &lt;td>PostgreSQL read replica / warehouse&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-service source of truth&lt;/td>
 &lt;td>單檔 ownership 與服務邊界衝突&lt;/td>
 &lt;td>保留 server DB 或拆 bounded context&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>No-go 條件要寫進 migration proposal。Simplification 的目標是降低操作成本；若降級後要用大量自製機制補回 role、audit、HA 與 concurrent write，成本會回到系統裡。&lt;/p>
&lt;h2 id="diff-audit">Diff Audit&lt;/h2>
&lt;p>Diff audit 的核心責任是把 PostgreSQL 語意縮到 SQLite 可以清楚承擔的範圍。PostgreSQL extension、function、type、index、constraint、sequence、view、trigger、role 與 transaction behavior 都要盤點。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL to SQLite simplification 的核心責任是處理反向路線：服務責任縮小後，評估 SQLite 是否能降低操作成本。這條路線適合 single-user app、CLI、desktop app、內部工具、read-mostly artifact store、demo environment、local-first prototype 或 edge-local utility。</p>
<p>本文的判讀錨點是：降級到 SQLite 是責任縮小，也是讓資料模型回到 single-process / file-owned / local-state 的工程選擇。只要正式需求從 multi-user server DB 回到這個範圍，SQLite 可以提供更低元件數、更容易搬移與更低維護成本。</p>
<h2 id="simplification-drivers">Simplification Drivers</h2>
<p>Simplification drivers 的核心責任是確認 PostgreSQL 的能力已超過服務需求。若 server DB 的 HA、role、replica、pool、vacuum、PITR、schema governance 都變成維運負擔，而產品只需要單一 process 持有資料，就可以評估 SQLite。</p>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>代表情境</th>
          <th>SQLite 帶來的收益</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-user app</td>
          <td>desktop、CLI、local admin tool</td>
          <td>file portability、offline use</td>
      </tr>
      <tr>
          <td>Read-mostly artifact</td>
          <td>build metadata、catalog snapshot</td>
          <td>deployment simple、低 runtime dependency</td>
      </tr>
      <tr>
          <td>Internal tool</td>
          <td>小團隊使用、資料量小、低寫入</td>
          <td>降低 DB server operation</td>
      </tr>
      <tr>
          <td>Demo / fixture</td>
          <td>每個 environment 一份可重建資料</td>
          <td>quick reset、deterministic seed</td>
      </tr>
      <tr>
          <td>Edge-local utility</td>
          <td>request-local / device-local state</td>
          <td>low latency、local ownership</td>
      </tr>
  </tbody>
</table>
<p>Driver 要連到 ownership。SQLite 適合「這份資料由某個 process / device / artifact 明確持有」；若資料仍屬於多服務共同真相，保留 PostgreSQL 或改成 managed SQL 會更穩定。</p>
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go condition 的核心責任是保護仍需要 server DB 的服務。若 PostgreSQL 的核心能力仍被業務依賴，遷到 SQLite 會把風險轉移到 application code、file backup 與人工流程。</p>
<table>
  <thead>
      <tr>
          <th>No-go 訊號</th>
          <th>代表責任</th>
          <th>保留路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>多 tenant 與 centralized permission</td>
          <td>DB role、grant、audit 仍有價值</td>
          <td>PostgreSQL</td>
      </tr>
      <tr>
          <td>多 instance concurrent writer</td>
          <td>SQLite writer boundary 壓力過高</td>
          <td>PostgreSQL / MySQL</td>
      </tr>
      <tr>
          <td>PITR / HA 是合約要求</td>
          <td>server DB operation 是正式責任</td>
          <td>Managed PostgreSQL / Aurora</td>
      </tr>
      <tr>
          <td>Analyst / job 直接查 DB</td>
          <td>access control 與 query isolation</td>
          <td>PostgreSQL read replica / warehouse</td>
      </tr>
      <tr>
          <td>Cross-service source of truth</td>
          <td>單檔 ownership 與服務邊界衝突</td>
          <td>保留 server DB 或拆 bounded context</td>
      </tr>
  </tbody>
</table>
<p>No-go 條件要寫進 migration proposal。Simplification 的目標是降低操作成本；若降級後要用大量自製機制補回 role、audit、HA 與 concurrent write，成本會回到系統裡。</p>
<h2 id="diff-audit">Diff Audit</h2>
<p>Diff audit 的核心責任是把 PostgreSQL 語意縮到 SQLite 可以清楚承擔的範圍。PostgreSQL extension、function、type、index、constraint、sequence、view、trigger、role 與 transaction behavior 都要盤點。</p>
<table>
  <thead>
      <tr>
          <th>PostgreSQL feature</th>
          <th>SQLite 轉換策略</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>timestamptz</code></td>
          <td>UTC ISO text 或 integer epoch</td>
          <td>timezone policy 是否固定</td>
      </tr>
      <tr>
          <td><code>jsonb</code> + GIN</td>
          <td>JSON text + limited query / app filter</td>
          <td>query 是否仍需 index</td>
      </tr>
      <tr>
          <td>Sequence / identity</td>
          <td>INTEGER PRIMARY KEY 或 app ID</td>
          <td>id stability 與 import collision</td>
      </tr>
      <tr>
          <td>Partial index</td>
          <td>SQLite partial index</td>
          <td>predicate 與 query planner 是否對齊</td>
      </tr>
      <tr>
          <td>Role / grant</td>
          <td>filesystem permission + app auth</td>
          <td>權限是否可移到 application boundary</td>
      </tr>
      <tr>
          <td>Extension</td>
          <td>application logic 或放棄 feature</td>
          <td>feature 是否仍是正式需求</td>
      </tr>
  </tbody>
</table>
<p>Diff audit 的輸出是一份保留 / 移除 / 改寫清單。每個 PostgreSQL feature 都要回答：這是正式需求、歷史殘留，還是可以移到 application layer 的便利功能。</p>
<h2 id="phase-plan">Phase Plan</h2>
<p>Phase plan 的核心責任是把 server DB 退場變成可回復流程。反向 migration 要超過一次性 dump：先收斂寫入、建立 SQLite schema、匯入資料、跑 adapter test、演練 backup，再退役 PostgreSQL。</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>目的</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Scope reduction</td>
          <td>確認資料責任已縮小</td>
          <td>ownership doc、no-go review</td>
      </tr>
      <tr>
          <td>Schema rewrite</td>
          <td>建立 SQLite schema</td>
          <td>migration dry run、STRICT / constraint</td>
      </tr>
      <tr>
          <td>Data export</td>
          <td>從 PostgreSQL 匯出 snapshot</td>
          <td>row count、checksum、dump metadata</td>
      </tr>
      <tr>
          <td>Data import</td>
          <td>寫入 SQLite file</td>
          <td>integrity check、foreign key check</td>
      </tr>
      <tr>
          <td>Adapter switch</td>
          <td>app 改用 SQLite repository</td>
          <td>contract test、error mapping</td>
      </tr>
      <tr>
          <td>Backup runbook</td>
          <td>建立 file lifecycle evidence</td>
          <td>backup restore drill</td>
      </tr>
      <tr>
          <td>Server retirement</td>
          <td>關閉 PostgreSQL 寫入與 credential</td>
          <td>retention、credential removal、incident route</td>
      </tr>
  </tbody>
</table>
<p>Scope reduction 是第一關。若資料仍被多個服務寫入，應先拆出 bounded context 或建立 event / export boundary；SQLite file 才能成為明確 owned artifact。</p>
<h2 id="data-movement">Data Movement</h2>
<p>Data movement 的核心責任是把 PostgreSQL snapshot 轉成 SQLite file 並保留驗證。可用 <code>COPY</code> / CSV、application ETL 或 dedicated migration tool；選擇取決於 type conversion 與資料量。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;\\copy orders TO &#39;orders.csv&#39; CSV HEADER&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">sqlite3 app.db <span class="s2">&#34;.mode csv&#34;</span> <span class="s2">&#34;.import --skip 1 orders.csv orders&#34;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl">sqlite3 app.db <span class="s2">&#34;PRAGMA integrity_check;&#34;</span></span></span></code></pre></div><p>這段命令是教學骨架。正式流程要處理 NULL、delimiter、timezone、numeric precision、FK order、transaction、temporary disk、sensitive data 與 import log。</p>
<p>Import 後要跑三種 evidence：database integrity、row count / checksum、business invariant。Business invariant 例如 active user count、total balance、latest event id、pending job count；這些比單純 row count 更能抓到語意錯誤。</p>
<h2 id="runbook-shift">Runbook Shift</h2>
<p>Runbook shift 的核心責任是把 PostgreSQL operation 移轉成 SQLite file operation。Server DB 的 backup / role / monitoring 退場後，要補上 SQLite 的 backup、restore、file permission、WAL、migration 與 disk 觀測。</p>
<p>最小 SQLite runbook 包含：</p>
<ol>
<li>Database file path、owner process、filesystem permission。</li>
<li>Journal mode、busy timeout、foreign key、schema version。</li>
<li>Backup command、restore drill、retention、checksum。</li>
<li>Migration command、pre-migration snapshot、rollback path。</li>
<li>Observability：busy、WAL size、disk free、backup age。</li>
<li>Incident route：disk full、bad migration、corruption signal。</li>
</ol>
<p>Runbook shift 要同步移除 PostgreSQL credential。Server database 退役時，保留 read-only archive、刪除 application secret、關閉 scheduled job、更新 dashboard 與 incident routing。</p>
<h2 id="cleanup-and-retention">Cleanup and Retention</h2>
<p>Cleanup and retention 的核心責任是讓舊 PostgreSQL 不再成為影子真相。Migration 後若舊 DB 長期可寫，團隊會在事故中分不清哪份資料有效。</p>
<table>
  <thead>
      <tr>
          <th>Cleanup 項目</th>
          <th>操作</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Write disable</td>
          <td>PostgreSQL role 改 read-only 或關閉 app access</td>
      </tr>
      <tr>
          <td>Archive snapshot</td>
          <td>保存最後 dump、checksum、schema</td>
      </tr>
      <tr>
          <td>Credential removal</td>
          <td>移除 app secret、CI secret、admin token</td>
      </tr>
      <tr>
          <td>Dashboard update</td>
          <td>停用 PostgreSQL alert、啟用 SQLite alert</td>
      </tr>
      <tr>
          <td>Documentation</td>
          <td>更新 source-of-truth 與 restore route</td>
      </tr>
  </tbody>
</table>
<p>Retention 要和 data protection 對齊。若 PostgreSQL 內有 PII、audit log 或 legal retention，退役流程要依 retention policy 保存或銷毀，而非直接刪除。</p>
<h2 id="decision-route">Decision Route</h2>
<p>Decision route 的核心責任是讓 simplification 保持可逆。若未來 concurrent writer、central audit、PITR 或 multi-service source-of-truth 回來，系統要能沿 <a href="/blog/backend/01-database/vendors/sqlite/migrate-to-postgresql/" data-link-title="SQLite to PostgreSQL Migration" data-link-desc="SQLite 升級到 PostgreSQL 的 driver、schema diff、data copy、dual run、cutover、rollback 與 cleanup">SQLite to PostgreSQL migration</a> 重新升級。</p>
<table>
  <thead>
      <tr>
          <th>現況</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-user / local artifact</td>
          <td>SQLite simplification</td>
      </tr>
      <tr>
          <td>Small internal tool + low write</td>
          <td>SQLite + restore drill</td>
      </tr>
      <tr>
          <td>Read-mostly dataset for app bundle</td>
          <td>SQLite artifact + release version</td>
      </tr>
      <tr>
          <td>Multi-user SaaS</td>
          <td>保留 PostgreSQL</td>
      </tr>
      <tr>
          <td>Audit / HA / role 是正式要求</td>
          <td>保留 managed PostgreSQL</td>
      </tr>
  </tbody>
</table>
<p>Simplification 的完成標準是：SQLite file 可以被重建、備份、恢復、升級與交接。只要這些 evidence 完整，從 PostgreSQL 退到 SQLite 是清楚的工程決策。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>PostgreSQL to SQLite simplification 完成後，先讀 <a href="/blog/backend/01-database/vendors/sqlite/file-lifecycle-backup-boundary/" data-link-title="SQLite file lifecycle 與 backup boundary" data-link-desc="把 SQLite 單檔案正式狀態拆成 WAL、backup API、restore drill、corruption recovery 與操作責任邊界">file lifecycle / backup boundary</a> 建立 file operation；再讀 <a href="/blog/backend/01-database/vendors/sqlite/observability-runbook/" data-link-title="SQLite Observability and Runbook" data-link-desc="SQLite production runbook、backup evidence、WAL growth、busy errors、disk usage、restore drill 與 incident route">SQLite observability / runbook</a> 補 evidence；若之後需求再成長，回到 <a href="/blog/backend/01-database/vendors/sqlite/migrate-to-postgresql/" data-link-title="SQLite to PostgreSQL Migration" data-link-desc="SQLite 升級到 PostgreSQL 的 driver、schema diff、data copy、dual run、cutover、rollback 與 cleanup">SQLite to PostgreSQL migration</a>。</p>
]]></content:encoded></item><item><title>SQLite to PostgreSQL Migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/sqlite/migrate-to-postgresql/</link><pubDate>Thu, 21 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/sqlite/migrate-to-postgresql/</guid><description>&lt;p>SQLite to PostgreSQL migration 的核心責任是把 embedded single-file state 升級成 server SQL operational model。這條路線通常由 multi-user access、HA、central audit、permission、online schema governance、write concurrency 或 team handoff 壓力觸發。&lt;/p>
&lt;p>本文的判讀錨點是：升級到 PostgreSQL 是服務責任擴大，而非單純換 driver。Migration 要同時處理 schema 語意、資料搬遷、application adapter、backup / PITR、role、observability、cutover 與 rollback。&lt;/p>
&lt;h2 id="migration-drivers">Migration Drivers&lt;/h2>
&lt;p>Migration drivers 的核心責任是確認 PostgreSQL 真的承擔新增責任。SQLite 在 single-node、single-file、low-concurrency 場景很強；PostgreSQL 的價值出現在 server database governance。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>代表需求&lt;/th>
 &lt;th>PostgreSQL 承擔的責任&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Concurrent writers&lt;/td>
 &lt;td>多 instance / 多使用者同時寫入&lt;/td>
 &lt;td>MVCC、connection management、lock insight&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>HA / PITR&lt;/td>
 &lt;td>需要時間點恢復與 managed backup&lt;/td>
 &lt;td>WAL archiving、replica、restore drill&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Central audit&lt;/td>
 &lt;td>需要查詢與變更證據&lt;/td>
 &lt;td>role、log、extension、SIEM integration&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Permission boundary&lt;/td>
 &lt;td>app / analyst / job 權限分離&lt;/td>
 &lt;td>DB role、grant、row / schema boundary&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema governance&lt;/td>
 &lt;td>migration 要 online 且可審查&lt;/td>
 &lt;td>migration tool、lock review、rollback&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Shared data platform&lt;/td>
 &lt;td>多服務共用正式資料&lt;/td>
 &lt;td>connection pool、capacity、ownership&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Driver 要被量化。若問題只是單一 CLI 檔案變大，先改善 backup、VACUUM、index 與 WAL runbook；若問題是多 instance 同時寫、權限分離、audit 與 PITR，PostgreSQL 才是正確路由。&lt;/p>
&lt;h2 id="diff-audit">Diff Audit&lt;/h2>
&lt;p>Diff audit 的核心責任是把 SQLite 語意轉成 PostgreSQL 語意。SQLite 的 type affinity、date / time convention、auto-increment、foreign key、index、JSON、transaction 與 extension 都要逐項審查。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>SQLite source 問題&lt;/th>
 &lt;th>PostgreSQL target 決策&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Type&lt;/td>
 &lt;td>dynamic typing、STRICT usage&lt;/td>
 &lt;td>integer / bigint / numeric / timestamptz&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Primary key&lt;/td>
 &lt;td>rowid、INTEGER PRIMARY KEY&lt;/td>
 &lt;td>identity、sequence、UUID&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Date/time&lt;/td>
 &lt;td>TEXT / INTEGER convention&lt;/td>
 &lt;td>timestamptz、timezone policy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>JSON&lt;/td>
 &lt;td>JSON text / function usage&lt;/td>
 &lt;td>jsonb、GIN index、query rewrite&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Constraint&lt;/td>
 &lt;td>FK pragma、check、unique collation&lt;/td>
 &lt;td>enforced FK、deferrable、collation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index&lt;/td>
 &lt;td>partial / expression / covering index&lt;/td>
 &lt;td>equivalent index + explain&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Transaction&lt;/td>
 &lt;td>single writer、savepoint&lt;/td>
 &lt;td>isolation level、deadlock retry&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Type mapping 要先保護 domain invariant。金額欄位用 integer cents 或 numeric、時間欄位用 timestamptz 或明確 UTC text、boolean 用 boolean；每個轉換都要有 invalid sample 與 round-trip test。&lt;/p></description><content:encoded><![CDATA[<p>SQLite to PostgreSQL migration 的核心責任是把 embedded single-file state 升級成 server SQL operational model。這條路線通常由 multi-user access、HA、central audit、permission、online schema governance、write concurrency 或 team handoff 壓力觸發。</p>
<p>本文的判讀錨點是：升級到 PostgreSQL 是服務責任擴大，而非單純換 driver。Migration 要同時處理 schema 語意、資料搬遷、application adapter、backup / PITR、role、observability、cutover 與 rollback。</p>
<h2 id="migration-drivers">Migration Drivers</h2>
<p>Migration drivers 的核心責任是確認 PostgreSQL 真的承擔新增責任。SQLite 在 single-node、single-file、low-concurrency 場景很強；PostgreSQL 的價值出現在 server database governance。</p>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>代表需求</th>
          <th>PostgreSQL 承擔的責任</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Concurrent writers</td>
          <td>多 instance / 多使用者同時寫入</td>
          <td>MVCC、connection management、lock insight</td>
      </tr>
      <tr>
          <td>HA / PITR</td>
          <td>需要時間點恢復與 managed backup</td>
          <td>WAL archiving、replica、restore drill</td>
      </tr>
      <tr>
          <td>Central audit</td>
          <td>需要查詢與變更證據</td>
          <td>role、log、extension、SIEM integration</td>
      </tr>
      <tr>
          <td>Permission boundary</td>
          <td>app / analyst / job 權限分離</td>
          <td>DB role、grant、row / schema boundary</td>
      </tr>
      <tr>
          <td>Schema governance</td>
          <td>migration 要 online 且可審查</td>
          <td>migration tool、lock review、rollback</td>
      </tr>
      <tr>
          <td>Shared data platform</td>
          <td>多服務共用正式資料</td>
          <td>connection pool、capacity、ownership</td>
      </tr>
  </tbody>
</table>
<p>Driver 要被量化。若問題只是單一 CLI 檔案變大，先改善 backup、VACUUM、index 與 WAL runbook；若問題是多 instance 同時寫、權限分離、audit 與 PITR，PostgreSQL 才是正確路由。</p>
<h2 id="diff-audit">Diff Audit</h2>
<p>Diff audit 的核心責任是把 SQLite 語意轉成 PostgreSQL 語意。SQLite 的 type affinity、date / time convention、auto-increment、foreign key、index、JSON、transaction 與 extension 都要逐項審查。</p>
<table>
  <thead>
      <tr>
          <th>面向</th>
          <th>SQLite source 問題</th>
          <th>PostgreSQL target 決策</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Type</td>
          <td>dynamic typing、STRICT usage</td>
          <td>integer / bigint / numeric / timestamptz</td>
      </tr>
      <tr>
          <td>Primary key</td>
          <td>rowid、INTEGER PRIMARY KEY</td>
          <td>identity、sequence、UUID</td>
      </tr>
      <tr>
          <td>Date/time</td>
          <td>TEXT / INTEGER convention</td>
          <td>timestamptz、timezone policy</td>
      </tr>
      <tr>
          <td>JSON</td>
          <td>JSON text / function usage</td>
          <td>jsonb、GIN index、query rewrite</td>
      </tr>
      <tr>
          <td>Constraint</td>
          <td>FK pragma、check、unique collation</td>
          <td>enforced FK、deferrable、collation</td>
      </tr>
      <tr>
          <td>Index</td>
          <td>partial / expression / covering index</td>
          <td>equivalent index + explain</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>single writer、savepoint</td>
          <td>isolation level、deadlock retry</td>
      </tr>
  </tbody>
</table>
<p>Type mapping 要先保護 domain invariant。金額欄位用 integer cents 或 numeric、時間欄位用 timestamptz 或明確 UTC text、boolean 用 boolean；每個轉換都要有 invalid sample 與 round-trip test。</p>
<p>Index mapping 要用 production query 重跑 explain。SQLite 的 <code>EXPLAIN QUERY PLAN</code> 只能說明 SQLite planner；PostgreSQL 需要自己的 <code>EXPLAIN (ANALYZE, BUFFERS)</code>，並使用接近真實分布的資料量。</p>
<h2 id="phase-plan">Phase Plan</h2>
<p>Phase plan 的核心責任是降低一次性 cutover 風險。SQLite to PostgreSQL migration 通常可以分成 schema 建模、資料匯出、adapter 切換、shadow read、freeze / cutover 與 cleanup。</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>目的</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema rewrite</td>
          <td>建立 PostgreSQL target schema</td>
          <td>migration dry run、schema review</td>
      </tr>
      <tr>
          <td>Data export</td>
          <td>從 SQLite 取出穩定 snapshot</td>
          <td>source checksum、row count、export log</td>
      </tr>
      <tr>
          <td>Data import</td>
          <td>寫入 PostgreSQL</td>
          <td>target checksum、constraint validation</td>
      </tr>
      <tr>
          <td>Adapter layer</td>
          <td>將 repository 改為可切換</td>
          <td>dual test suite、error mapping</td>
      </tr>
      <tr>
          <td>Shadow read</td>
          <td>比對新舊 query result</td>
          <td>mismatch report、latency profile</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>切正式寫入</td>
          <td>freeze window、rollback snapshot</td>
      </tr>
      <tr>
          <td>Cleanup</td>
          <td>退役 SQLite write path</td>
          <td>retention、credential、runbook update</td>
      </tr>
  </tbody>
</table>
<p>Adapter layer 是風險控制點。Repository 應把 SQLite 與 PostgreSQL driver 差異藏在 infrastructure layer，domain 不直接依賴 vendor-specific SQL exception 或 connection object。</p>
<p>Shadow read 適合先驗證 read contract。正式寫入仍留在 SQLite 時，background job 可以把相同 query 跑到 PostgreSQL mirror，記錄 row count、field diff、排序差異與 latency。</p>
<h2 id="data-movement">Data Movement</h2>
<p>Data movement 的核心責任是讓搬遷結果可驗證。SQLite database file 可以透過 <code>.dump</code>、CSV export、application-level export 或 custom ETL 搬入 PostgreSQL；選擇取決於資料量、型別轉換、FK order 與 downtime window。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">sqlite3 app.db <span class="s2">&#34;.mode csv&#34;</span> <span class="s2">&#34;.headers on&#34;</span> <span class="s2">&#34;.once orders.csv&#34;</span> <span class="s2">&#34;SELECT * FROM orders ORDER BY id;&#34;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">psql <span class="s2">&#34;</span><span class="nv">$DATABASE_URL</span><span class="s2">&#34;</span> -c <span class="s2">&#34;\\copy orders FROM &#39;orders.csv&#39; CSV HEADER&#34;</span></span></span></code></pre></div><p>這段命令是教學骨架。正式 migration 要處理 quoting、NULL、timezone、large object、FK order、batch size、transaction size、retry、import log 與 sensitive data handling。</p>
<p>Row count 是基本證據，checksum 是更強證據。可以針對每張表計算穩定排序後的 hash，或在 application layer 對 domain key 與重要欄位做 checksum。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">total_cents</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span></span></span></code></pre></div><p>Aggregate checksum 適合快速抓大錯。正式驗證還要補抽樣 row diff、edge case row、foreign key check 與 business invariant。</p>
<h2 id="cutover">Cutover</h2>
<p>Cutover 的核心責任是控制最後一次寫入切換。SQLite source 在 cutover 前應進入 read-only 或 writer freeze，確保最後 snapshot、import 與 validation 對齊。</p>
<table>
  <thead>
      <tr>
          <th>Cutover step</th>
          <th>操作</th>
          <th>Rollback 條件</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Freeze writers</td>
          <td>停止背景 job、API write、admin tool</td>
          <td>source 寫入仍持續或 freeze 失敗</td>
      </tr>
      <tr>
          <td>Final snapshot</td>
          <td>SQLite backup / export</td>
          <td>checksum 失敗</td>
      </tr>
      <tr>
          <td>Final import</td>
          <td>PostgreSQL transaction / batch import</td>
          <td>constraint error、row mismatch</td>
      </tr>
      <tr>
          <td>Smoke test</td>
          <td>核心 read/write workflow</td>
          <td>error rate、latency、permission failure</td>
      </tr>
      <tr>
          <td>Switch traffic</td>
          <td>更新 config / secret / deployment</td>
          <td>application error rate 超過 tripwire</td>
      </tr>
      <tr>
          <td>Monitor</td>
          <td>query latency、lock、connection pool</td>
          <td>pool exhaustion、deadlock spike、data diff</td>
      </tr>
  </tbody>
</table>
<p>Rollback 要保存 source snapshot。若 cutover 後發現 PostgreSQL error mapping、permission 或 performance 問題，可以切回 SQLite read/write snapshot；前提是 cutover window 內所有新寫入都能回放或被阻擋。</p>
<h2 id="postgresql-operation-gate">PostgreSQL Operation Gate</h2>
<p>PostgreSQL operation gate 的核心責任是確認團隊準備好接手 server DB。Migration 成功要包含資料進入 target 與 operation readiness；PostgreSQL 需要 connection pool、backup / PITR、vacuum、index bloat、role、migration lock review 與 alert。</p>
<p>最小 operation checklist：</p>
<ol>
<li>Connection pool 設計：max connections、pool size、timeout、transaction pooling policy。</li>
<li>Backup / PITR：restore drill、retention、RPO / RTO。</li>
<li>Role / grant：application role、migration role、read-only role。</li>
<li>Migration lock review：DDL impact、online migration strategy。</li>
<li>Observability：slow query、lock wait、deadlock、replica lag、disk。</li>
<li>Incident route：rollback、restore、read-only mode、on-call owner。</li>
</ol>
<p>這個 gate 要在 cutover 前完成。SQLite 讓 operation surface 很小；PostgreSQL 擴大能力的同時，也擴大維護責任。</p>
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go condition 的核心責任是阻止過早升級。若服務仍是 single-user、local-first、low-write、可用簡單 backup 解決，PostgreSQL 可能引入比問題更大的 operation cost。</p>
<table>
  <thead>
      <tr>
          <th>No-go 訊號</th>
          <th>更合適路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-user app 或 desktop app</td>
          <td>保留 SQLite + backup / migration runbook</td>
      </tr>
      <tr>
          <td>主要壓力是備份</td>
          <td><a href="/blog/backend/01-database/vendors/sqlite/litestream-litefs-replication/" data-link-title="SQLite Litestream / LiteFS Replication" data-link-desc="Litestream、LiteFS、SQLite backup replication、read replica、failover 與 restore route">Litestream / LiteFS</a></td>
      </tr>
      <tr>
          <td>主要壓力是 edge locality</td>
          <td><a href="/blog/backend/01-database/vendors/sqlite/migrate-to-d1-turso/" data-link-title="SQLite to D1 / Turso Migration" data-link-desc="SQLite 轉向 Cloudflare D1、Turso / libSQL 的 edge driver、compatibility audit、data movement 與 rollback">D1 / Turso route</a></td>
      </tr>
      <tr>
          <td>Team 尚未準備 server DB operation</td>
          <td>先補 observability / restore drill</td>
      </tr>
      <tr>
          <td>Schema / query 還在快速探索</td>
          <td>先穩定 domain model，再做正式 migration</td>
      </tr>
  </tbody>
</table>
<p>No-go 條件要轉成 tripwire。當 writer concurrency、audit、PITR、role 或 HA 需求跨過明確門檻，再啟動 migration。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>SQLite to PostgreSQL migration 完成後，下一步要看 target operation。PostgreSQL 能力讀 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a>；migration 方法讀 <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>；若需求只是 edge platform，改讀 <a href="/blog/backend/01-database/vendors/sqlite/migrate-to-d1-turso/" data-link-title="SQLite to D1 / Turso Migration" data-link-desc="SQLite 轉向 Cloudflare D1、Turso / libSQL 的 edge driver、compatibility audit、data movement 與 rollback">SQLite to D1 / Turso migration</a>。</p>
]]></content:encoded></item></channel></rss>