<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>PostgreSQL on Tarragon</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/</link><description>Recent content in PostgreSQL on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Wed, 13 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL Patroni HA: the five-stage failover lifecycle from leader loss to client reconnect</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PostgreSQL sits in the OLTP landscape; this article focuses on the lifecycle design of &lt;em>Patroni-based HA&lt;/em>: the five stages from normal operation to a completed failover, with the configuration, failure modes, and recovery for each.&lt;/p>&lt;/blockquote>
&lt;h2 id="failover-lifecycle5-段不是一條曲線">Failover lifecycle: five stages, not a single curve&lt;/h2>
&lt;p>PostgreSQL has no built-in automatic failover. When the primary dies, applications stall until an SRE manually promotes a standby, which typically takes 5-30 minutes end to end. Patroni breaks that chain into an &lt;em>automated five-stage lifecycle&lt;/em>, and each stage has its own trigger, configuration, and failure modes:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Stage&lt;/th>
 &lt;th>Trigger&lt;/th>
 &lt;th>Action&lt;/th>
 &lt;th>Failure mode&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>1. Detection&lt;/strong>&lt;/td>
 &lt;td>The leader heartbeat in the DCS (etcd / Consul) stops&lt;/td>
 &lt;td>Standbys watch the leader key until the silence reaches the TTL&lt;/td>
 &lt;td>The DCS itself partitions → false detection triggers an unnecessary failover&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>2. Election&lt;/strong>&lt;/td>
 &lt;td>The TTL expires and the DCS releases the leader lock&lt;/td>
 &lt;td>Standbys race to write the leader key (DCS quorum-based)&lt;/td>
 &lt;td>Network partition → both sides believe they are the leader (split-brain)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>3. Promotion&lt;/strong>&lt;/td>
 &lt;td>The new leader writes the DCS key successfully&lt;/td>
 &lt;td>Runs &lt;code>pg_ctl promote&lt;/code>, stops streaming replication, starts accepting writes&lt;/td>
 &lt;td>Standby too far behind → promotion is refused, or it takes over with data missing&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>4. Reconfiguration&lt;/strong>&lt;/td>
 &lt;td>The routing layer polls the Patroni REST API and sees the new leader&lt;/td>
 &lt;td>HAProxy / PgBouncer switch traffic to the new leader&lt;/td>
 &lt;td>Slow routing-layer health checks → traffic keeps hitting the old leader&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>5. Recovery&lt;/strong>&lt;/td>
 &lt;td>The old leader comes back (manually or automatically)&lt;/td>
 &lt;td>Runs &lt;code>pg_rewind&lt;/code>, then rejoins streaming replication as a standby&lt;/td>
 &lt;td>WAL divergence too large → a fresh base backup is required&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Each stage has its own configuration; it is not a matter of setting one timeout. The sections below walk through the stages one by one.&lt;/p>
&lt;h2 id="stage-1detection--dcs-heartbeat-跟-ttl">Stage 1: Detection (DCS heartbeat and TTL)&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c"># patroni.yml 核心配置&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">scope&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">myapp-pg-cluster&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">namespace&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">/db/&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">name&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">pg-node-1 &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># 跟 hostname 一致&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">etcd&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">hosts&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">etcd1:2379,etcd2:2379,etcd3:2379 &lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># DCS quorum&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">protocol&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">https&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">bootstrap&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">dcs&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">ttl&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">30&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># leader lock TTL&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">loop_wait&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">10&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># patroni 主循環間隔&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">retry_timeout&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">10&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># DCS retry 上限&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">maximum_lag_on_failover&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">1048576&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># standby 落後 1MB 內才能 promote&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nt">synchronous_mode&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="kc">false&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c"># async / sync 取捨&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>關鍵直覺：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PostgreSQL sits in the OLTP landscape; this article focuses on the lifecycle design of <em>Patroni-based HA</em>: the five stages from normal operation to a completed failover, with the configuration, failure modes, and recovery for each.</p></blockquote>
<h2 id="failover-lifecycle5-段不是一條曲線">Failover lifecycle: five stages, not a single curve</h2>
<p>PostgreSQL has no built-in automatic failover. When the primary dies, applications stall until an SRE manually promotes a standby, which typically takes 5-30 minutes end to end. Patroni breaks that chain into an <em>automated five-stage lifecycle</em>, and each stage has its own trigger, configuration, and failure modes:</p>
<table>
  <thead>
      <tr>
          <th>Stage</th>
          <th>Trigger</th>
          <th>Action</th>
          <th>Failure mode</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>1. Detection</strong></td>
          <td>The leader heartbeat in the DCS (etcd / Consul) stops</td>
          <td>Standbys watch the leader key until the silence reaches the TTL</td>
          <td>The DCS itself partitions → false detection triggers an unnecessary failover</td>
      </tr>
      <tr>
          <td><strong>2. Election</strong></td>
          <td>The TTL expires and the DCS releases the leader lock</td>
          <td>Standbys race to write the leader key (DCS quorum-based)</td>
          <td>Network partition → both sides believe they are the leader (split-brain)</td>
      </tr>
      <tr>
          <td><strong>3. Promotion</strong></td>
          <td>The new leader writes the DCS key successfully</td>
          <td>Runs <code>pg_ctl promote</code>, stops streaming replication, starts accepting writes</td>
          <td>Standby too far behind → promotion is refused, or it takes over with data missing</td>
      </tr>
      <tr>
          <td><strong>4. Reconfiguration</strong></td>
          <td>The routing layer polls the Patroni REST API and sees the new leader</td>
          <td>HAProxy / PgBouncer switch traffic to the new leader</td>
          <td>Slow routing-layer health checks → traffic keeps hitting the old leader</td>
      </tr>
      <tr>
          <td><strong>5. Recovery</strong></td>
          <td>The old leader comes back (manually or automatically)</td>
          <td>Runs <code>pg_rewind</code>, then rejoins streaming replication as a standby</td>
          <td>WAL divergence too large → a fresh base backup is required</td>
      </tr>
  </tbody>
</table>
<p>Each stage has its own configuration; it is not a matter of setting one timeout. The sections below walk through the stages one by one.</p>
<h2 id="stage-1detection--dcs-heartbeat-跟-ttl">Stage 1: Detection (DCS heartbeat and TTL)</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># patroni.yml core settings</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">scope</span><span class="p">:</span><span class="w"> </span><span class="l">myapp-pg-cluster</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">/db/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">pg-node-1                               </span><span class="w"> </span><span class="c"># matches the hostname</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">etcd</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">hosts</span><span class="p">:</span><span class="w"> </span><span class="l">etcd1:2379,etcd2:2379,etcd3:2379      </span><span class="w"> </span><span class="c"># DCS quorum</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">protocol</span><span class="p">:</span><span class="w"> </span><span class="l">https</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">bootstrap</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">dcs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ttl</span><span class="p">:</span><span class="w"> </span><span class="m">30</span><span class="w">                                     </span><span class="c"># leader lock TTL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">loop_wait</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">                               </span><span class="c"># Patroni main loop interval</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">retry_timeout</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">                           </span><span class="c"># DCS retry timeout (seconds)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">maximum_lag_on_failover</span><span class="p">:</span><span class="w"> </span><span class="m">1048576</span><span class="w">            </span><span class="c"># a standby may be promoted only within 1MB of lag</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">synchronous_mode</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">                     </span><span class="c"># async / sync trade-off</span></span></span></code></pre></div><p>Key intuitions:</p>
<ul>
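<li><strong>TTL (30s) = how long the leader may stay silent before it is considered dead</strong>. Too short (&lt; 15s) and transient network jitter looks like a dead leader; too long (&gt; 60s) and the unavailability window stretches out.</li>
<li><strong>loop_wait + 2 × retry_timeout ≤ TTL</strong>: Patroni has to renew the leader key inside the TTL even when a DCS call fails and is retried. With <code>loop_wait=10</code> and <code>retry_timeout=10</code> the sum is 10 + 2 × 10 = 30s, which exactly fits <code>ttl=30</code> with no spare margin.</li>
<li><strong>maximum_lag_on_failover</strong>: a standby whose WAL lag exceeds this threshold <em>does not take part in the election</em>, which prevents the data loss of promoting a standby that is, say, five minutes behind.</li>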
</ul>
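<p>The timing rules above can be checked mechanically. The following is a minimal sketch rather than anything Patroni ships: the function name <code>check_patroni_timing</code> is hypothetical, the first rule follows the relationship given in the Patroni documentation (loop_wait + 2 × retry_timeout ≤ ttl), and the 15s / 60s band is simply this article's rule of thumb.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def check_patroni_timing(ttl: int, loop_wait: int, retry_timeout: int) -> list[str]:
    """Return human-readable problems with a ttl / loop_wait / retry_timeout triple."""
    problems = []
    # Documented Patroni rule: loop_wait + 2 * retry_timeout must not exceed ttl.
    needed = loop_wait + 2 * retry_timeout
    if needed > ttl:
        problems.append(f"ttl={ttl} is below loop_wait + 2*retry_timeout = {needed}")
    # Sensitivity band suggested in this article (an assumption, not a Patroni check).
    if 15 > ttl:
        problems.append("ttl under 15s: transient jitter may look like a dead leader")
    if ttl > 60:
        problems.append("ttl over 60s: the unavailability window grows")
    return problems


print(check_patroni_timing(ttl=30, loop_wait=10, retry_timeout=10))
print(check_patroni_timing(ttl=20, loop_wait=10, retry_timeout=10))
</code></pre></div>
<p>The first call reports no problems for the configuration shown above; the second flags a TTL that is too small for the chosen loop and retry settings.</p>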
<h2 id="stage-2election--dcs-quorum--watchdog-防-split-brain">Stage 2: Election (DCS quorum and watchdog against split-brain)</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">watchdog</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">required                               </span><span class="w"> </span><span class="c"># required / automatic / off</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">device</span><span class="p">:</span><span class="w"> </span><span class="l">/dev/watchdog</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">safety_margin</span><span class="p">:</span><span class="w"> </span><span class="m">5</span></span></span></code></pre></div><p>The biggest risk during election is <em>split-brain</em>: under a network partition the old leader is still alive but cut off from the DCS, a standby is promoted to new leader, and applications write to two PostgreSQL instances at once. Once the data has diverged it <em>cannot be reconciled automatically</em>.</p>
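<p>There are two layers of protection:</p>
<ol>
<li><strong>DCS quorum</strong>: etcd / Consul runs on at least 3 nodes, and the leader key can only be written with a majority quorum, so a minority partition cannot elect a new leader.</li>
<li><strong>Watchdog (Linux kernel)</strong>: in <code>required</code> mode Patroni must periodically <em>poke</em> <code>/dev/watchdog</code>. If Patroni itself dies or is frozen by the OS, the kernel reboots the whole machine, which stops the old leader from accepting writes after it has lost the DCS.</li>
</ol>
<p>Watchdog <code>required</code> is a hard requirement for a production-grade setup: <code>off</code> gives no protection at all, and <code>automatic</code> silently runs without a watchdog when the device is unavailable, so neither can be relied on in a split-brain scenario.</p>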
<h2 id="stage-3promotion--pg_ctl--replication-slot-切換">Stage 3: Promotion (pg_ctl and replication slot switchover)</h2>
<p>Once the new leader has written the DCS key, Patroni automatically runs:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Run by Patroni internally; do not run this by hand</span>
</span></span><span class="line"><span class="cl">pg_ctl promote -D /var/lib/postgresql/data
</span></span><span class="line"><span class="cl"><span class="c1"># primary_conninfo is removed from postgresql.auto.conf</span>
</span></span><span class="line"><span class="cl"><span class="c1"># PostgreSQL switches to a new timeline ID and writes a timeline history file</span>
</span></span><span class="line"><span class="cl"><span class="c1"># The node starts accepting writes</span></span></span></code></pre></div><p>Key issues during promotion:</p>
<ul>
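<li><strong>Timeline divergence</strong>: the new leader opens a new timeline ID that forks at the LSN where it was promoted. Other standbys follow the new timeline; any node whose WAL has already advanced past the fork point needs <code>pg_rewind</code> to get back to it first.</li>
<li><strong>Replication slot handling</strong>: the slots on the old leader are stale as far as the DCS is concerned, and the new leader recreates them. Logical replication consumers that are not idempotent will see some messages replayed.</li>
<li><strong>Promotion latency</strong>: typically 3-10 seconds (<code>pg_ctl</code> itself takes &lt; 5s, plus the DCS write confirmation).</li>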
</ul>
<h2 id="stage-4reconfiguration--client-routing-切換">Stage 4: Reconfiguration (client routing switchover)</h2>
<p>Promoting a PostgreSQL node is not enough, because applications do not know it happened; a routing layer in front has to redirect them. A typical layout:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">[client] → [HAProxy / PgBouncer] → [pg-node-1 (leader)]
</span></span><span class="line"><span class="cl">                                 → [pg-node-2 (standby, read)]
</span></span><span class="line"><span class="cl">                                 → [pg-node-3 (standby, read)]</span></span></code></pre></div><p>The Patroni REST API exposes the <code>/leader</code>, <code>/replica</code>, and <code>/health</code> endpoints, and HAProxy points its <em>health checks</em> at them:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl"># haproxy.cfg
</span></span><span class="line"><span class="cl">backend pg-write
</span></span><span class="line"><span class="cl">  mode tcp                        # PostgreSQL traffic is TCP; only the health check speaks HTTP
</span></span><span class="line"><span class="cl">  option httpchk OPTIONS /leader  # returns 200 only on the node holding the leader lock
</span></span><span class="line"><span class="cl">  http-check expect status 200
</span></span><span class="line"><span class="cl">  server pg-node-1 pg-node-1:5432 check port 8008
</span></span><span class="line"><span class="cl">  server pg-node-2 pg-node-2:5432 check port 8008 backup
</span></span><span class="line"><span class="cl">  server pg-node-3 pg-node-3:5432 check port 8008 backup</span></span></code></pre></div><p>Key delays during reconfiguration:</p>
<ul>
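<li>HAProxy health check interval (default 2s) × failure threshold (default 3 failures) = roughly 6s before the switch is noticed</li>
<li>PgBouncer does not run health checks itself; reconnection is triggered by application-side retries and dropped connections</li>
<li>End-to-end reconfiguration usually takes 10-20s (including the PostgreSQL promotion time)</li>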
</ul>
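<h2 id="stage-5recovery--pg_rewind-跟-base-backup-取捨">Stage 5: Recovery (pg_rewind versus base backup)</h2>
<p>When the old leader comes back it becomes a standby, but its WAL has already diverged, so one of two recovery paths has to be chosen:</p>
<ul>
<li><strong><code>pg_rewind</code></strong>: rewinds the old leader's WAL to the divergence point and rejoins streaming replication. It fits when the divergent WAL is small (&lt; a few GB) and the timelines can be aligned; PostgreSQL also requires <code>wal_log_hints = on</code> or data checksums for it to work.</li>
<li><strong>Fresh base backup</strong>: uses <code>pg_basebackup</code> to pull a complete base plus WAL from the new leader. It always works but takes a long time (1-4 hours at TB scale).</li>
</ul>
<p>With <code>use_pg_rewind</code> enabled, Patroni tries <code>pg_rewind</code> first and falls back to a base backup only when it is allowed to remove the data directory. Production configuration:</p>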





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">postgresql</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">use_pg_rewind</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">remove_data_directory_on_rewind_failure</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">   </span><span class="c"># if rewind fails, wipe the data dir and take a fresh base backup</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">remove_data_directory_on_diverged_timelines</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span></span></span></code></pre></div><h2 id="production-故障演練">Production failure drills</h2>
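<h3 id="case-1split-brain-due-to-dcs-partition">Case 1: Split-brain due to DCS partition</h3>
<p><strong>Symptom</strong>: two PostgreSQL nodes accept writes at the same time, and applications see a flood of write conflicts and unique constraint violations.</p>
<p><strong>Root cause</strong>: the DCS (etcd) partitioned. Two etcd nodes ended up on opposite sides of the partition and each believed it had quorum. This was really a split vote in which neither side should have been able to write, yet Patroni elected a leader on each side.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Run the DCS on an odd number of nodes (3 / 5 / 7) and strictly enforce majority quorum</li>
<li>When the DCS spans AZs or regions, size the quorum with partition probability in mind (one node in each of 3 AZs is the production minimum)</li>
<li>Watchdog <code>required</code> mode is the last line of defence: when a DCS partition coincides with a quorum failure, the watchdog force-reboots the disconnected node</li>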
</ol>
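<p>The odd-node recommendation follows from majority arithmetic. As a small illustration (the helper names <code>quorum</code> and <code>tolerated_failures</code> are made up for this sketch), the code below shows that adding a fourth node raises the quorum without raising the number of failures the DCS can survive:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum(members)


for members in (2, 3, 4, 5, 7):
    print(members, quorum(members), tolerated_failures(members))
</code></pre></div>
<p>A 2-node DCS tolerates no failure at all, and 4 nodes tolerate the same single failure as 3, which is why even sizes add cost without adding safety.</p>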
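<h3 id="case-2standby-落後太多無法-failover">Case 2: Standbys too far behind, failover impossible</h3>
<p><strong>Symptom</strong>: after the primary becomes unreachable, the Patroni log shows <code>Following members have lag greater than maximum_lag_on_failover</code>, every standby is refused promotion, and the cluster is unavailable.</p>
<p><strong>Root cause</strong>: <code>maximum_lag_on_failover</code> is set to 1MB, but standby replication lag has grown to 50MB (write-heavy workload plus slow disks on the standby). The safety mechanism fires as designed, but the price is that <em>no standby can be promoted</em>; someone has to lower the threshold by hand or wait for a standby to catch up.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Prevention</strong>: size standby capacity and I/O to match the primary so lag does not build up; a Prometheus alert on <code>pg_replication_lag_bytes &gt; 10MB</code> catches the problem before a failover does</li>
<li><strong>Stopgap</strong>: use <code>patronictl edit-config</code> to raise <code>maximum_lag_on_failover</code> to 50MB temporarily, accepting the possible loss of up to 50MB of writes in exchange for availability</li>
<li><strong>Long term</strong>: synchronous replication (one standby forced to be in sync), which guarantees at least one zero-lag standby</li>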
</ol>
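<p>To make the 1MB versus 50MB comparison concrete, lag can be computed from two LSNs. This is only a sketch of the arithmetic: the function names are hypothetical, and Patroni's own check works from the leader position it last saw in the DCS rather than from a live leader LSN.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash counts units of 4 GiB (2**32 bytes).
    return int(high, 16) * 2**32 + int(low, 16)


def can_promote(leader_lsn: str, standby_lsn: str,
                maximum_lag_on_failover: int = 1048576) -> bool:
    """True when the standby's lag in bytes is within the configured threshold."""
    lag = lsn_to_bytes(leader_lsn) - lsn_to_bytes(standby_lsn)
    return maximum_lag_on_failover >= lag


print(can_promote("0/3000000", "0/2F80000"))  # 512 KiB behind
print(can_promote("0/6000000", "0/3000000"))  # 48 MiB behind
</code></pre></div>
<p>Raising the third argument mirrors the stopgap in step 2: the second standby becomes eligible only once the threshold covers its lag.</p>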
<h3 id="case-3promotion-後-application-connection-storm">Case 3: Application connection storm after promotion</h3>
<p><strong>Symptom</strong>: for 30-120 seconds after failover completes, application logs fill with <code>connection refused</code> / <code>password authentication failed</code>, and the applications create their own retry storm.</p>
<p><strong>Root cause</strong>: the new leader has only just been promoted and is still cold (shared memory and caches not yet primed) while applications open a flood of connections, all competing for <code>max_connections</code>, at the same moment. Application retries lack jitter, so the queue piles up.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Applications use <em>exponential backoff with jitter</em> instead of retrying immediately</li>
<li>PgBouncer or a connection pool caps the number of connections each application instance opens to PG; applications do not connect to PG directly</li>
<li>Run <code>pg_prewarm</code> on standbys ahead of time to warm the cache for hot tables, so cache misses do not explode after promotion</li>
</ol>
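<p>Step 1 can be sketched as follows. This is one common variant (capped exponential backoff with "full jitter"), not a prescription from Patroni or PostgreSQL; the function name, the 0.5s base, and the 30s cap are illustrative assumptions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    for attempt in range(attempts):
        # The ceiling doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * 2**attempt)
        # Full jitter: pick uniformly between 0 and the ceiling so clients spread out.
        yield rng() * ceiling


for delay in backoff_delays(5):
    print(round(delay, 2))
</code></pre></div>
<p>An application would sleep for each yielded delay between reconnect attempts. Because every client draws a different random delay, the reconnects after a failover are spread out instead of arriving in synchronized waves.</p>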
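<h3 id="case-4pg_rewind-失敗退到-base-backup-沒做">Case 4: pg_rewind fails and the base backup fallback never runs</h3>
<p><strong>Symptom</strong>: after the old leader comes back, the Patroni log shows <code>pg_rewind failed</code>, the old leader stays in STARTING and cannot rejoin the cluster, and it only recovers after an SRE runs <code>pg_basebackup</code> by hand.</p>
<p><strong>Root cause</strong>: <code>remove_data_directory_on_rewind_failure: false</code> (the default). When rewind fails, Patroni does not clear the data directory on its own and an SRE has to step in; without a runbook, operations stayed stuck at this step for hours.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>In production set <code>remove_data_directory_on_rewind_failure: true</code> and <code>remove_data_directory_on_diverged_timelines: true</code> so that Patroni falls back automatically</li>
<li>Keep the data directory on its own PV / disk so that wiping it is a contained risk (never on the root disk)</li>
<li>Capacity planning: include the base backup duration in the RTO estimate (a TB-scale base backup takes 1-4 hours, which a 30-minute RTO cannot absorb)</li>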
</ol>
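<h3 id="case-5watchdog-觸發整機-reboot誤殺">Case 5: Watchdog reboots a healthy machine</h3>
<p><strong>Symptom</strong>: a production server reboots unexpectedly with no underlying fault, and <code>dmesg</code> shows <code>watchdog: BUG: soft lockup</code>.</p>
<p><strong>Root cause</strong>: the Patroni main loop stalled for 60+ seconds on a briefly slow etcd, and the kernel watchdog triggered a reboot. PostgreSQL itself never hung; the Patroni-to-watchdog chain was oversensitive.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Widen the window the Patroni loop has before the watchdog fires. Patroni sets the watchdog timeout to <code>ttl - safety_margin</code>, so raise <code>ttl</code> or lower <code>loop_wait</code>; raising <code>safety_margin</code> (for example to 10-15) shortens the window rather than adding slack</li>
<li>Deploy etcd and Patroni on a low-latency network (same AZ, &lt; 5ms); cross-region etcd is not recommended</li>
<li>Choose the watchdog device deliberately: softdog (software emulation) versus a hardware watchdog; softdog is easier to observe while debugging</li>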
</ol>
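<p>The watchdog window is simple arithmetic. The sketch below follows the relationship described in the Patroni watchdog documentation (watchdog timeout = ttl - safety_margin, which leaves the HA loop ttl - safety_margin - loop_wait seconds); the function name is hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def watchdog_window(ttl: int, loop_wait: int, safety_margin: int = 5) -> tuple:
    """Return (watchdog_timeout, loop_slack) in seconds for a Patroni configuration."""
    # The watchdog is armed to fire safety_margin seconds before the leader key expires.
    watchdog_timeout = ttl - safety_margin
    # Time the HA loop has to finish one cycle before the machine is reset.
    loop_slack = watchdog_timeout - loop_wait
    return watchdog_timeout, loop_slack


print(watchdog_window(ttl=30, loop_wait=10, safety_margin=5))
print(watchdog_window(ttl=30, loop_wait=10, safety_margin=15))
print(watchdog_window(ttl=60, loop_wait=10, safety_margin=5))
</code></pre></div>
<p>With the defaults the loop has 15 seconds of slack; a larger <code>safety_margin</code> shrinks that to 5 seconds, while a larger <code>ttl</code> grows it at the cost of slower failure detection.</p>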
<h2 id="容量規劃">Capacity planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Estimate</th>
          <th>Warning threshold</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster size</td>
          <td>3-5 nodes (leader plus 2-4 standbys)</td>
          <td>&lt; 3 is not real HA (losing the only standby leaves no failover target)</td>
      </tr>
      <tr>
          <td>DCS size</td>
          <td>3 / 5 / 7 nodes (odd-sized quorum)</td>
          <td>5-node etcd is the production standard</td>
      </tr>
      <tr>
          <td>TTL</td>
          <td>30s (default 30; production 20-60)</td>
          <td>&lt; 15s is oversensitive, &gt; 60s is too sluggish</td>
      </tr>
      <tr>
          <td>maximum_lag_on_failover</td>
          <td>1MB (default)</td>
          <td>Large, write-heavy tables can go to 10-100MB</td>
      </tr>
      <tr>
          <td>Synchronous standby</td>
          <td>1 sync + N async is the usual production choice</td>
          <td>All async loses data easily; all sync blows up write latency</td>
      </tr>
      <tr>
          <td>RTO</td>
          <td>10-45 seconds (detection up to 30s + promotion 5-10s + reconfiguration about 5s)</td>
          <td>&gt; 60s means the chain needs an audit</td>
      </tr>
      <tr>
          <td>RPO</td>
          <td>Close to 0 in sync mode; same order of magnitude as the lag in async mode</td>
          <td>In async mode, lag can reach MB to GB when disk I/O is slow</td>
      </tr>
  </tbody>
</table>
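<p>The RTO row is a sum of per-stage budgets, which is easy to keep honest in code. The numbers below restate this article's estimates (detection up to 30s, promotion 5-10s, reconfiguration about 5s); the 0s best case for detection is an assumption for a leader that releases its key cleanly, and <code>rto_range</code> is a hypothetical helper.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># (best case, worst case) in seconds for each failover stage.
STAGES = {
    "detection": (0, 30),
    "promotion": (5, 10),
    "reconfiguration": (5, 5),
}


def rto_range(stages: dict) -> tuple:
    """Sum the best-case and worst-case seconds across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


print(rto_range(STAGES))
</code></pre></div>
<p>Swapping in measured values from a failover drill shows which stage dominates the budget and whether the 60-second audit threshold is at risk.</p>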
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-pgbouncer-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer</a></h3>
<p>PgBouncer does not detect a Patroni failover on its own; it relies on:</p>
<ol>
<li><strong>HAProxy in front of PgBouncer</strong>: HAProxy runs the Patroni health checks and PgBouncer connections are rerouted</li>
<li><strong>PgBouncer restart</strong>: after failover an SRE or automation runs <code>pgbouncer -R</code> (online restart) to force backend reconnects</li>
<li><strong>Connection pool drain</strong>: the application-side connection pool sets <code>pool_lifetime_max=5min</code>, so old connections age out naturally</li>
</ol>
<h3 id="跟-cert-managertls-rotation">Integration with cert-manager (TLS rotation)</h3>
<p>The Patroni REST API and PostgreSQL streaming replication both use TLS, and certificate rotation must not interrupt service:</p>
<ol>
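<li>After cert-manager rotates a certificate, both Patroni and PostgreSQL need a reload (not a restart)</li>
<li><code>patronictl reload &lt;cluster&gt;</code> does not trigger a failover; it only reloads the configuration</li>
<li>PostgreSQL <code>pg_ctl reload</code> sends SIGHUP and picks up the new certificate without dropping connections</li>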
</ol>
<h3 id="跟-backup--pitr">Integration with backup / PITR</h3>
<p>Patroni does not manage backups, but after a standby is promoted the WAL archive has to follow the new leader's timeline:</p>
<ol>
<li><code>archive_command</code> only has the <code>%p</code> (path) and <code>%f</code> (file name) placeholders; the timeline ID is carried in the first 8 hex digits of the WAL file name, so <code>archive_command = 'wal-g wal-push %p'</code> already preserves it</li>
<li>Backup tools (pgBackRest / WAL-G) handle timeline switches, so archiving is not interrupted</li>
<li>See the <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL archiving deep article</a> for details</li>
</ol>
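<p>Because the timeline lives in the WAL file name, an archive can be inspected without any PostgreSQL tooling. A minimal sketch, assuming standard 24-character WAL segment names (the function name <code>wal_timeline</code> is made up here):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def wal_timeline(segment_name: str) -> int:
    """Extract the timeline ID from a 24-hex-digit WAL segment file name."""
    if len(segment_name) != 24:
        raise ValueError(f"not a WAL segment name: {segment_name!r}")
    # The first 8 hex digits are the timeline ID; the rest locate the segment.
    return int(segment_name[:8], 16)


print(wal_timeline("000000010000000000000003"))  # written before the failover
print(wal_timeline("000000020000000000000003"))  # written by the promoted leader
</code></pre></div>
<p>After a failover the archive holds segments from both timelines; seeing the timeline number step up is a quick confirmation that archiving continued on the new leader.</p>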
<h3 id="下一步議題">Topics for next steps</h3>
<ul>
<li><strong>Multi-region Patroni</strong>: DCS quorum design for cross-region deployments, where the trade-offs are completely different from a single region</li>
<li><strong>Replication slot persistence across failover</strong>: logical slot synchronization to standbys (available from PostgreSQL 17) simplifies reconnecting logical consumers after a standby is promoted</li>
<li><strong>Integration with Kubernetes operators</strong>: when Patroni runs on K8s, the deployment model for StatefulSets, pod identity, and the DCS</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Upstream chapter: <a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">High Concurrency Access</a>, covering the full connection / replication / HA chain</li>
<li>Sibling deep articles: <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer configuration</a> / <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/dynamic-credential/" data-link-title="HashiCorp Vault Dynamic Credential：lease 治理跟 application 整合的實作層" data-link-desc="Vault database secrets engine 怎麼配、application 怎麼 renew lease、production 五大踩雷（lease 過期 race、DB max_connections 撞牆、Vault sealed、token expire、scope 過寬）、容量規劃跟 vault-agent injector 整合">Vault Dynamic Credential</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive articles</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Replication Topology: async / sync / quorum modes combined with LSN and replication slots across three axes</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-topology/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-topology/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on &lt;em>streaming replication topology&lt;/em>: the three trade-off axes of going from a single primary to a multi-standby deployment, plus the LSN and replication slot mechanisms.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="replication-的-3-個-trade-off-軸--mode-選擇">The three trade-off axes of replication and choosing a mode&lt;/h2>
&lt;p>Choosing a PG streaming replication mode looks like "async or sync", but it is really a combination of three independent trade-off axes; async, sync, and quorum-based sync are just the &lt;em>names&lt;/em> of common combinations along those axes:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Axis&lt;/th>
 &lt;th>End A&lt;/th>
 &lt;th>End B&lt;/th>
 &lt;th>PG knob&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>Durability&lt;/strong>&lt;/td>
 &lt;td>commit as soon as the primary has written the WAL&lt;/td>
 &lt;td>commit only after at least one standby has received it&lt;/td>
 &lt;td>&lt;code>synchronous_commit&lt;/code> / &lt;code>synchronous_standby_names&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Latency&lt;/strong>&lt;/td>
 &lt;td>client waits only for the primary's write&lt;/td>
 &lt;td>client waits for the standby ack (extra RTT)&lt;/td>
 &lt;td>same as above&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Consistency&lt;/strong>&lt;/td>
 &lt;td>standby may be stale at any time&lt;/td>
 &lt;td>reads on the standby are guaranteed consistent with the primary&lt;/td>
 &lt;td>application read routing rule (not a replication knob)&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
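&lt;p>Independent of these three axes is the &lt;em>maintainability of the replication mechanism itself&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>LSN (Log Sequence Number)&lt;/strong>: PG marks WAL progress with a single global byte offset, and every standby aligns on it, unlike early MySQL's two-field (binlog file, position) coordinate&lt;/li>
&lt;li>&lt;strong>Replication slot&lt;/strong>: the primary records how far each standby has consumed the WAL, so WAL is not removed while a standby is disconnected; it is the &lt;em>persistent progress tracking&lt;/em> of streaming replication&lt;/li>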
&lt;/ul>
&lt;p>Compared with &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology&lt;/a>, PG's LSN + replication slot provide built-in &lt;em>standby progress tracking&lt;/em>, unlike MySQL 5.7 and earlier, which rely on the dual mechanism of binlog position + GTID. But a slot is a &lt;em>record kept on the primary&lt;/em>, so orphan slots are a PG-specific problem (a slot retains WAL until its standby reconnects; if the standby is gone for good, the primary's disk fills up).&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on <em>streaming replication topology</em>: the 3 trade-off axes that appear when moving from a single primary to a multi-standby deployment, plus the LSN and replication slot mechanisms.</p></blockquote>
<hr>
<h2 id="replication-的-3-個-trade-off-軸--mode-選擇">The 3 trade-off axes of replication + choosing a mode</h2>
<p>Choosing a PG streaming replication mode looks like "async or sync", but it is really a combination of 3 independent trade-off axes; async / sync / quorum-based sync are just the common <em>names</em> for combinations on those axes:</p>
<table>
  <thead>
      <tr>
          <th>Axis</th>
          <th>End A</th>
          <th>End B</th>
          <th>PG knob</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Durability</strong></td>
          <td>commit as soon as the primary has written the WAL</td>
          <td>commit only after at least one standby has received it</td>
          <td><code>synchronous_commit</code> / <code>synchronous_standby_names</code></td>
      </tr>
      <tr>
          <td><strong>Latency</strong></td>
          <td>client waits only for the primary's write</td>
          <td>client waits for the standby ack (extra RTT)</td>
          <td>same as above</td>
      </tr>
      <tr>
          <td><strong>Consistency</strong></td>
          <td>standby may be stale at any time</td>
          <td>reads on the standby are guaranteed consistent with the primary</td>
          <td>application read routing rule (not a replication knob)</td>
      </tr>
  </tbody>
</table>
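<p>Independent of these three axes is the <em>maintainability of the replication mechanism itself</em>:</p>
<ul>
<li><strong>LSN (Log Sequence Number)</strong>: PG marks WAL progress with a single global byte offset, and every standby aligns on it, unlike early MySQL's two-field (binlog file, position) coordinate</li>
<li><strong>Replication slot</strong>: the primary records how far each standby has consumed the WAL, so WAL is not removed while a standby is disconnected; it is the <em>persistent progress tracking</em> of streaming replication</li>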
</ul>
<p>Compared with <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>, PG's LSN + replication slot provide built-in <em>standby progress tracking</em>, unlike MySQL 5.7 and earlier, which rely on the dual mechanism of binlog position + GTID. But a slot is a <em>record kept on the primary</em>, so orphan slots are a PG-specific problem (a slot retains WAL until its standby reconnects; if the standby is gone for good, the primary's disk fills up).</p>
<h2 id="async-streamingdefault--高-throughput-的代價">Async streaming: the default, and the price of high throughput</h2>
<p>Async is the PG default. It behaves as follows:</p>
<ol>
<li>The primary writes WAL into the <code>pg_wal/</code> directory, commits, and replies OK to the client</li>
<li>The WAL sender process streams the WAL to the standby</li>
<li>The standby's WAL receiver writes it into the standby's <code>pg_wal/</code>, and the startup process replays it</li>
</ol>
<p><strong>Trade-off</strong>:</p>
<ul>
<li>Durability: if the primary fails permanently after a commit that the standby has not yet received, the result is <em>data loss</em> (a committed transaction that does not exist on the standby)</li>
<li>Latency: client write latency = the time the primary takes to fsync its own WAL (with the defaults <code>fsync=on</code> + <code>synchronous_commit=on</code>, typically &lt; 1ms on SSD / NVMe)</li>
<li>Consistency: the standby may lag, so reads from it can be stale; check <code>pg_stat_replication.write_lag / flush_lag / replay_lag</code></li>
</ul>
<p><strong>Configuration</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf on primary</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">wal_level</span> <span class="o">=</span> <span class="s">replica          # at least replica (logical is a superset)</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">max_wal_senders</span> <span class="o">=</span> <span class="s">10         # number of concurrent WAL sender processes (size to the standby count)</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">wal_keep_size</span> <span class="o">=</span> <span class="s">1024MB       # WAL retention floor (slots do the main work; this is a backup buffer)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">synchronous_commit</span> <span class="o">=</span> <span class="s">on      # default; the primary fsyncs its own WAL</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># synchronous_standby_names left empty = async</span></span></span></code></pre></div><p><strong>When to use</strong>:</p>
<ul>
<li>The mainstream choice (roughly 90% of scenarios)</li>
<li>Failover loss is within tolerance (most web applications can tolerate 1-2 seconds of data loss)</li>
<li>Read scaling is the main driver and absolute durability is not the top priority</li>
</ul>
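<h2 id="sync-streaming至少一個-standby-flush-wal-才-commit">Sync streaming: commit only after at least one standby has flushed the WAL</h2>
<p>Sync mode builds on async by making <em>the primary wait for the designated standby to flush the WAL before replying to the client</em>:</p>
<ol>
<li>The primary writes WAL and sends it to the standby</li>
<li>The standby receives the WAL, writes it into <code>pg_wal/</code>, fsyncs, and returns an ack</li>
<li><em>The primary waits for the ack</em> before reporting the commit to the client</li>
</ol>
<p><code>synchronous_commit</code> has 5 levels; it is not binary. The standby-related levels only take effect when <code>synchronous_standby_names</code> is non-empty; with it empty, <code>on</code> behaves like <code>local</code>:</p>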
<table>
  <thead>
      <tr>
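          <th>Level</th>
          <th>Behavior</th>
          <th>Latency impact</th>
          <th>Data loss on crash</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>off</code></td>
          <td>primary does not wait for its own fsync; WAL is flushed in the background</td>
          <td>+0</td>
          <td>a primary crash can lose the most recent commits (up to about 3 × <code>wal_writer_delay</code>, well under 1 second at defaults)</td>
      </tr>
      <tr>
          <td><code>local</code></td>
          <td>primary fsyncs its own WAL (does not wait for the standby)</td>
          <td>baseline</td>
          <td>nothing lost on a primary crash; the standby may be missing the latest commits</td>
      </tr>
      <tr>
          <td><code>remote_write</code></td>
          <td>primary fsync + standby has received the WAL and written it to its OS (standby fsync not required)</td>
          <td>roughly +1 RTT</td>
          <td>the standby copy can be lost if the standby's OS crashes before flushing</td>
      </tr>
      <tr>
          <td><code>on</code> (default)</td>
          <td>primary fsync + standby fsync (the WAL is on the standby's disk)</td>
          <td>+1 RTT + fsync</td>
          <td>nothing lost on a crash of either node</td>
      </tr>
      <tr>
          <td><code>remote_apply</code></td>
          <td>primary fsync + standby fsync + standby has <em>replayed</em> the WAL (visible to reads)</td>
          <td>+1 RTT + fsync + replay</td>
          <td>nothing lost on a crash of either node + the replica is immediately readable</td>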
  </tbody>
</table>
<p><strong>Configuration (synchronous)</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">synchronous_commit</span> <span class="o">=</span> <span class="s">on</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">synchronous_standby_names</span> <span class="o">=</span> <span class="s">&#39;FIRST 1 (standby1, standby2)&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># &#39;FIRST 1&#39; = wait for the highest-priority connected standby in the list</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># &#39;ANY 2 (s1, s2, s3)&#39; = any 2 acks suffice (quorum-based)</span></span></span></code></pre></div><p><strong>Quorum-based sync</strong>: with the <code>ANY N</code> syntax the commit completes once N acks arrive, which stabilizes latency (no dependence on one particular standby):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">synchronous_standby_names</span> <span class="o">=</span> <span class="s">&#39;ANY 2 (standby1, standby2, standby3)&#39;</span>
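</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># commit as soon as any 2 of the 3 standbys ack</span></span></span></code></pre></div><p><strong>When to use</strong>:</p>
<ul>
<li>Financial transactions / orders / payment ledgers (no data loss allowed)</li>
<li>An existing multi-AZ deployment where the replicas are physically reliable</li>
<li>An extra 1-3ms of write latency (cross-AZ) is acceptable</li>
</ul>
<p><strong>When not to use</strong>:</p>
<ul>
<li>Cross-region sync (RTT 50-200ms): every commit pays the full round trip and write throughput drops sharply; use <em>region-local sync + cross-region async</em> instead</li>
<li>Write throughput &gt; 50K WPS with tolerance for sub-second loss: async is enough</li>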
</ul>
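<p>To make the difference between <code>FIRST N</code> and <code>ANY N</code> concrete, here is a minimal Python sketch of the release rule. It is a simplified model for illustration, not PostgreSQL's implementation; the function name <code>commit_can_proceed</code> and its parameters are hypothetical.</p>

```python
def commit_can_proceed(method, num_sync, names, acked, connected):
    """Simplified model of synchronous_standby_names (illustration only).

    method    -- "FIRST" (priority-based) or "ANY" (quorum-based)
    num_sync  -- how many standbys must confirm
    names     -- standby names in the order they are listed
    acked     -- set of standbys that confirmed this commit's WAL
    connected -- set of standbys currently streaming
    """
    if method == "ANY":
        # Quorum: any num_sync of the listed standbys may confirm.
        return len(acked.intersection(names)) >= num_sync
    # FIRST: the sync set is the num_sync highest-priority connected standbys,
    # and every member of that set must confirm.
    sync_set = [n for n in names if n in connected][:num_sync]
    return len(sync_set) == num_sync and all(n in acked for n in sync_set)


both = {"standby1", "standby2"}
# FIRST 1: standby1 has priority, so an ack from standby2 alone is not enough.
print(commit_can_proceed("FIRST", 1, ["standby1", "standby2"], {"standby2"}, both))
# ANY 2 of 3: two acks release the commit regardless of which standbys sent them.
print(commit_can_proceed("ANY", 2, ["s1", "s2", "s3"], {"s1", "s3"}, {"s1", "s2", "s3"}))
```

<p>With <code>FIRST</code>, a slow high-priority standby sets the commit latency; with <code>ANY</code>, the fastest N standbys do, which is why quorum mode gives more stable latency.</p>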
<h2 id="lsn--replication-slotpg-的進度追蹤機制">LSN + Replication Slot: PG's progress-tracking mechanism</h2>
<p>PG tags every WAL write with an <em>LSN</em> (a 64-bit byte offset). Each standby tracks the LSN it has <em>received / flushed / replayed</em>, and the primary learns each standby's progress through the streaming protocol.</p>
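<p>Because an LSN is just a 64-bit byte offset printed as two hexadecimal halves, lag in bytes is plain subtraction. The Python sketch below mirrors what <code>pg_wal_lsn_diff()</code> computes on the server; the helper names and the sample LSN values are hypothetical.</p>

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a 64-bit byte offset."""
    high, low = lsn.split("/")
    # Text before the slash is the upper 32 bits, text after it the lower 32 bits (both hex).
    return int(high, 16) * 2**32 + int(low, 16)


def lsn_diff(newer: str, older: str) -> int:
    """Bytes of WAL between two LSNs, the same quantity pg_wal_lsn_diff() returns."""
    return lsn_to_bytes(newer) - lsn_to_bytes(older)


# Hypothetical sample values, not taken from a real cluster.
primary_lsn = "0/3000148"
standby_replay_lsn = "0/3000060"
print(lsn_diff(primary_lsn, standby_replay_lsn), "bytes behind")
```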
<p>A <strong>replication slot</strong> is the <em>primary-side record of a standby's progress</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Create a physical replication slot (for streaming replication)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby1_slot&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Check slot status (confirmed_flush_lsn is NULL for physical slots)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">,</span><span class="w"> </span><span class="n">confirmed_flush_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">lag</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="p">;</span></span></span></code></pre></div><p><strong>Core responsibilities of a slot</strong>:</p>
<ul>
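<li><em>Prevents premature WAL deletion</em>: when a standby drops off (restart / network blip), the primary keeps the WAL from the slot's LSN onward, so the standby can resume streaming once it reconnects</li>
<li><em>No base backup rebuild</em>: unlike a standby without a slot, a standby with a slot can reconnect after an outage without being rebuilt</li>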
</ul>
<p><strong>Slot vs. <code>wal_keep_size</code></strong>:</p>
<ul>
<li><code>wal_keep_size</code> (PG 13+) / <code>wal_keep_segments</code> (&lt; 13): a minimum amount of WAL to retain, independent of slots</li>
<li>A slot is <em>dynamic retention</em>: the corresponding WAL is released only once the slot's standby advances its LSN</li>
<li>Combined: <code>wal_keep_size</code> is the floor, and the slot adds standby-specific dynamic retention</li>
</ul>
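<p>The interaction between the two retention rules can be sketched as a single <code>min()</code>. This is a simplified model that ignores segment boundaries and checkpoints; the function name and the byte positions are hypothetical.</p>

```python
GiB = 1024**3


def oldest_wal_to_keep(current_lsn, wal_keep_size, slot_restart_lsns):
    """Byte position below which WAL may be recycled (simplified model)."""
    # wal_keep_size keeps a fixed window behind the current write position.
    keep_floor = max(current_lsn - wal_keep_size, 0)
    # Each slot pins WAL from its restart_lsn onward; the slowest slot wins.
    return min([keep_floor] + list(slot_restart_lsns))


current = 100 * GiB
# One healthy standby close to the head, one orphan slot stuck at 40 GiB.
oldest = oldest_wal_to_keep(current, 1 * GiB, [99 * GiB, 40 * GiB])
print((current - oldest) // GiB, "GiB of WAL retained")
```

<p>In this model a single stalled slot, not <code>wal_keep_size</code>, decides how much WAL piles up in <code>pg_wal/</code>.</p>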
<p><strong>Standby configuration (with a slot)</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># standby1 postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">primary_conninfo</span> <span class="o">=</span> <span class="s">&#39;host=primary.example.com port=5432 user=replication password=...&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">primary_slot_name</span> <span class="o">=</span> <span class="s">&#39;standby1_slot&#39;   # use the slot created in advance on the primary</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">hot_standby</span> <span class="o">=</span> <span class="s">on                       # let the standby accept read queries</span></span></span></code></pre></div><p>An empty <code>standby.signal</code> file in the data directory (<code>PGDATA</code>) tells PG that this is a standby and makes it enter recovery mode.</p>
<h2 id="配置-step-by-stepsync-streaming--slot">Configuration step by step (sync streaming + slot)</h2>
<p>The most common combination in practice: sync streaming + replication slot + cross-AZ replica.</p>
<h3 id="step-1primary-配置">Step 1: Primary configuration</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="na">wal_level</span> <span class="o">=</span> <span class="s">replica</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="na">max_wal_senders</span> <span class="o">=</span> <span class="s">10</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">max_replication_slots</span> <span class="o">=</span> <span class="s">10</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">synchronous_commit</span> <span class="o">=</span> <span class="s">on</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">synchronous_standby_names</span> <span class="o">=</span> <span class="s">&#39;FIRST 1 (standby1, standby2)&#39;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">wal_keep_size</span> <span class="o">=</span> <span class="s">1024MB</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># pg_hba.conf: allow replication connections</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="na">host replication replication 10.0.0.0/16 scram-sha-256</span></span></span></code></pre></div><p>Restart the primary to apply (<code>wal_level</code>, <code>max_wal_senders</code> and <code>max_replication_slots</code> only change on restart).</p>
<h3 id="step-2建-replication-user--slot">Step 2: Create the replication user + slots</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">USER</span><span class="w"> </span><span class="n">replication</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="n">REPLICATION</span><span class="w"> </span><span class="n">PASSWORD</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby1_slot&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby2_slot&#39;</span><span class="p">);</span></span></span></code></pre></div><h3 id="step-3standby-base-backup">Step 3: Standby base backup</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Run on the standby</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pg_basebackup -h primary.example.com -D /var/lib/postgresql/data <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  -U replication -P -X stream <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  -S standby1_slot -R
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># -R: auto-generates standby.signal + primary_conninfo</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># -X stream: stream incremental WAL while the backup runs (avoids a WAL gap during the backup)</span></span></span></code></pre></div><h3 id="step-4standby-啟動">Step 4: Start the standby</h3>





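<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.auto.conf on the standby should contain the lines below; add application_name=standby1 by hand if -R did not write it (it must match an entry in synchronous_standby_names):</span>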
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># primary_conninfo = &#39;host=primary.example.com user=replication password=... application_name=standby1&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># primary_slot_name = &#39;standby1_slot&#39;</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl">pg_ctl -D /var/lib/postgresql/data start</span></span></code></pre></div><h3 id="step-5驗證">Step 5: Verify</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Primary: confirm the standby is connected
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">application_name</span><span class="p">,</span><span class="w"> </span><span class="k">state</span><span class="p">,</span><span class="w"> </span><span class="n">sync_state</span><span class="p">,</span><span class="w"> </span><span class="n">write_lag</span><span class="p">,</span><span class="w"> </span><span class="n">flush_lag</span><span class="p">,</span><span class="w"> </span><span class="n">replay_lag</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_replication</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Expect standby1 / streaming / sync / the lag columns
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Standby: confirm it is in recovery and receiving WAL
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_is_in_recovery</span><span class="p">(),</span><span class="w"> </span><span class="n">pg_last_wal_receive_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">pg_last_wal_replay_lsn</span><span class="p">();</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 production pitfalls</h2>
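<h3 id="1-standby-lag-暴衝--single-replay-process-bottleneck">1. Standby lag spikes: single replay process bottleneck</h3>
<p>A PG standby applies WAL with a <em>single startup process</em> (unlike MySQL's multi-threaded replication). Under highly concurrent writes on the primary the standby cannot keep up, and lag jumps from &lt; 100ms to minutes. Common triggers: batch UPDATE / DELETE, large transactions, index builds, and autovacuum cleaning up large numbers of dead tuples.</p>
<p>Fixes:</p>
<ul>
<li><em>WAL prefetch</em> (PG 15+): <code>recovery_prefetch</code> lets the startup process prefetch the blocks referenced by upcoming WAL records, which reduces I/O stalls. PG has no parallel physical WAL apply, so replay is still driven by the single startup process (<code>max_parallel_workers_per_gather</code> only affects parallel queries and does not speed up replay)</li>
<li>For <em>read scaling</em> workloads, accept standby lag and have the application send <em>latency-critical queries that need fresh data to the primary</em></li>
<li><em>Cascading replication</em> relieves the sender CPU bottleneck in high-fan-out setups, but standby replay is still single-threaded</li>
</ul>
<p>Monitoring: <code>pg_stat_replication.replay_lag</code> is the <em>time between the primary flushing recent WAL and the standby reporting it as replayed</em>; alert when it exceeds a threshold.</p>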
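<p>A threshold check on <code>replay_lag</code> might look like the Python sketch below. The function name, the labels, and the 1 s / 30 s thresholds are assumptions chosen for illustration; tune them per workload.</p>

```python
from datetime import timedelta


def classify_replay_lag(replay_lag, warn=timedelta(seconds=1), crit=timedelta(seconds=30)):
    """Map a replay_lag value (timedelta or None) to an alert level."""
    if replay_lag is None:
        # The view can report NULL, for example on a caught-up, idle system.
        return "idle-or-unknown"
    if replay_lag >= crit:
        return "critical"
    if replay_lag >= warn:
        return "warning"
    return "ok"


print(classify_replay_lag(timedelta(milliseconds=80)))
print(classify_replay_lag(timedelta(minutes=2)))
```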
<h3 id="2-sync-standby-失聯時-primary-commit-卡住">2. Primary commits hang when the sync standby is unreachable</h3>
<p><code>synchronous_standby_names = 'FIRST 1 (standby1)'</code> + standby1 down → commits on the primary <em>wait forever</em>. Every application request times out.</p>
<p>Fixes:</p>
<ul>
<li>Use an <code>ANY N</code> quorum: <code>synchronous_standby_names = 'ANY 1 (standby1, standby2)'</code>, so an ack from either standby is enough</li>
<li>Run multiple standbys so that a single outage cannot block commits</li>
<li>Monitor sync standby health and move the sync role to another standby automatically (Patroni does this)</li>
<li>Emergency: on the primary, run <code>ALTER SYSTEM SET synchronous_standby_names = ''; SELECT pg_reload_conf();</code> to fall back to async temporarily (accepting the data loss risk)</li>
</ul>
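<h3 id="3-orphan-replication-slot--primary-disk-爆">3. Orphan replication slot: the primary's disk fills up</h3>
<p>When a standby is gone (permanent failure, or decommissioned without dropping its slot), the slot on the primary keeps retaining WAL, <code>pg_wal/</code> grows until the disk is full, and the primary goes down too.</p>
<p>Fixes:</p>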
<ul>
<li>
<p>Monitor <code>pg_replication_slots.active</code>: <code>false</code> for more than N hours is a warning sign</p>
</li>
<li>
<p>Monitor slot lag:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">retained_wal</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">pg_size_bytes</span><span class="p">(</span><span class="s1">&#39;10GB&#39;</span><span class="p">);</span></span></span></code></pre></div></li>
<li>
<p>Set <code>max_slot_wal_keep_size</code> (PG 13+): once the WAL retained for a slot exceeds the limit, the slot is invalidated automatically (the standby then needs a fresh base backup)</p>
</li>
<li>
<p>The DR runbook's <em>standby decommission procedure</em> must include <code>pg_drop_replication_slot('xxx')</code></p>
</li>
</ul>
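<p>The two monitoring rules above can be combined into one check. The Python sketch below assumes the monitoring system already collects one row per slot; the <code>SlotStatus</code> fields, in particular <code>inactive_hours</code>, and the thresholds are hypothetical.</p>

```python
from dataclasses import dataclass

GiB = 1024**3


@dataclass
class SlotStatus:
    slot_name: str
    active: bool
    retained_bytes: int    # pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    inactive_hours: float  # hypothetical: tracked by the monitoring system itself


def find_suspect_slots(slots, max_retained_bytes, max_inactive_hours):
    """Return names of slots that retain too much WAL or have been inactive too long."""
    suspects = []
    for s in slots:
        too_much_wal = s.retained_bytes > max_retained_bytes
        idle_too_long = (not s.active) and s.inactive_hours > max_inactive_hours
        if too_much_wal or idle_too_long:
            suspects.append(s.slot_name)
    return suspects


slots = [
    SlotStatus("standby1_slot", True, 64 * 1024**2, 0.0),
    SlotStatus("standby2_slot", False, 42 * GiB, 36.0),
]
print(find_suspect_slots(slots, max_retained_bytes=10 * GiB, max_inactive_hours=6))
```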
<h3 id="4-cascading-replication-雪崩">4. Cascading replication avalanche</h3>
<p>Topology <code>primary → standby1 → standby2 → ...</code> (each tier streams from the one above it). If standby1 stalls or loses its upstream connection, every standby behind it is blocked and the whole chain collapses.</p>
<p>Fixes:</p>
<ul>
<li>Avoid more than 2 cascade tiers (primary → tier1 → tier2 is the ceiling)</li>
<li>Across regions, use <em>region-local tier1 + cross-region tier2</em> rather than a long chain</li>
<li>At truly large scale, switch to a <em>binlog server</em> style relay (a term borrowed from MySQL): an intermediary such as <a href="https://github.com/postgresml/PgCat">Citus / PgCat</a>, or decouple with logical replication</li>
</ul>
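<p>The 2-tier ceiling is easy to enforce mechanically. The Python sketch below assumes the topology is available as a map from each standby to its upstream; the node names and the <code>cascade_depth</code> helper are hypothetical.</p>

```python
def cascade_depth(upstream, node):
    """Number of replication hops between a node and the primary."""
    depth = 0
    seen = set()
    while node in upstream:
        if node in seen:
            raise ValueError("cycle in replication topology")
        seen.add(node)
        node = upstream[node]
        depth += 1
    return depth


# standby -> the node it streams from (the primary has no entry)
upstream = {"standby1": "primary", "standby2": "standby1", "standby3": "standby2"}
too_deep = [n for n in upstream if cascade_depth(upstream, n) > 2]
print(too_deep)
```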
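<h3 id="5-failover-後-timeline-分歧">5. Timeline divergence after failover</h3>
<p>The primary fails and standby1 is promoted to be the new primary. The other standbys (standby2 / 3) were connected to the old primary and must now follow standby1. PG uses <em>timelines</em> (incremented by 1 on each promotion) to mark WAL branches, so a standby that holds WAL past the branch point is on a history the new primary does not share. On reconnect it sees a timeline mismatch and reports an error.</p>
<p>Fixes:</p>
<ul>
<li>The <em>pg_rewind</em> tool: finds the point where the old standby's timeline diverged from the new primary's, undoes the changes <em>the new primary does not have</em> by copying the affected blocks from the new primary, and then resyncs from the divergence point (requires <code>wal_log_hints = on</code> or data checksums)</li>
<li><em>Base backup rebuild</em>: rebuild the old standby from scratch; slow but guaranteed clean</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni</a> chooses between pg_rewind and a base backup automatically</li>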
</ul>
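<p>Whether a node has actually diverged comes down to comparing its WAL position with the point where the new timeline branched off. The Python sketch below illustrates that comparison only; the function name and byte positions are hypothetical, and real tooling reads the switch point from the timeline history.</p>

```python
def needs_rewind(node_flush_lsn, switch_lsn):
    """True when the node holds WAL past the point where the new timeline branched off."""
    return node_flush_lsn > switch_lsn


switch_lsn = 5_000_000      # hypothetical byte position of the promotion
print(needs_rewind(4_900_000, switch_lsn))  # node was behind the branch point
print(needs_rewind(5_200_000, switch_lsn))  # node has WAL the new primary never saw
```

<p>A node that was behind the branch point can simply follow the new timeline; one that is ahead of it needs pg_rewind or a rebuild.</p>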
<h2 id="容量--cost-對照">Capacity / cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Write throughput impact</th>
          <th>Standby overhead</th>
          <th>Suitable workload</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Async streaming + slot</td>
          <td>baseline</td>
          <td>low (WAL receive + startup)</td>
          <td>high throughput, tolerates sub-second loss</td>
      </tr>
      <tr>
          <td>Sync <code>remote_write</code> + 1 standby</td>
          <td>-5% ~ -10%</td>
          <td>same as above + RTT</td>
          <td>general production, can accept loss on a standby OS crash</td>
      </tr>
      <tr>
          <td>Sync <code>on</code> + 1 standby</td>
          <td>-10% ~ -20%</td>
          <td>same as above + fsync</td>
          <td>finance, orders, no tolerance for data loss</td>
      </tr>
      <tr>
          <td>Sync <code>on</code> + ANY 2 quorum</td>
          <td>-15% ~ -30%</td>
          <td>same as above, cross-AZ</td>
          <td>strong durability + multi-AZ HA</td>
      </tr>
      <tr>
          <td>Sync <code>remote_apply</code> + 1 standby</td>
          <td>-20% ~ -40%</td>
          <td>same as above + replay</td>
          <td>strongly consistent reads on the standby (rarely used, expensive)</td>
  </tbody>
</table>
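<p>Cross-AZ sync typically adds 1-3ms and cross-region sync adds 50-200ms. For write-heavy workloads cross-region sync is usually not worth it; use <em>region-local sync + cross-region async chain</em> instead. The throughput figures in the table are rough and workload-dependent.</p>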
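<p>A back-of-the-envelope calculation shows why the added wait matters. The Python sketch below assumes a 1 ms local WAL fsync and picks wait times from the ranges quoted in the text; all numbers and the function name are illustrative assumptions, not measurements.</p>

```python
def max_serial_commits_per_second(local_fsync_ms, sync_wait_ms):
    """Upper bound on back-to-back commits per second for ONE session."""
    return 1000.0 / (local_fsync_ms + sync_wait_ms)


# Hypothetical numbers picked from the ranges quoted in the text.
scenarios = {
    "async": 0.0,
    "sync, cross-AZ (+2 ms)": 2.0,
    "sync, cross-region (+100 ms)": 100.0,
}
for name, wait_ms in scenarios.items():
    rate = max_serial_commits_per_second(local_fsync_ms=1.0, sync_wait_ms=wait_ms)
    print(f"{name}: about {rate:.0f} commits/s per session")
```

<p>This bounds a single session only. Concurrent sessions overlap their waits, so aggregate throughput degrades less than the per-session figure suggests.</p>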
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="patroni-ha">Patroni HA</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni</a> is the standard for automatic PG HA failover. It depends on a DCS (etcd / Consul) plus the replication topology described in this article. Patroni automatically:</p>
<ul>
<li>Detects an unreachable primary and promotes a suitable standby</li>
<li>Handles timeline divergence (pg_rewind)</li>
<li>Reconfigures the sync standby (so an unreachable sync standby does not block the primary)</li>
</ul>
<h3 id="logical-replication--debezium">Logical Replication + Debezium</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical replication + Debezium</a> <em>shares the WAL with streaming replication</em> but is a different abstraction: logical decoding outputs events, while streaming replication outputs physical bytes. Logical replication slots coexist with physical slots, each with its own independent retention.</p>
<h3 id="pitr--wal-archiving">PITR + WAL Archiving</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> uses <em>archive_command</em> to ship WAL to S3, running in parallel with streaming replication:</p>
<ul>
<li>Streaming: for <em>live standbys</em> (real-time read scaling / HA)</li>
<li>Archive: for <em>PITR + the base backup source for new standbys</em></li>
</ul>
<p>Both consume the same WAL stream and do not conflict.</p>
<h3 id="connection-路由pgbouncer--readwrite-split">Connection routing (PgBouncer + read/write split)</h3>
<p><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer</a> does not do read/write split (transaction pooling does not inspect SQL). Read replica routing is usually done at the <em>application level</em> or with <em>HAProxy health checks on the standbys</em>.</p>
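<p>Application-level routing can be as small as one function. The Python sketch below assumes the application periodically refreshes a map of healthy standbys and their replay lag; the function name, node names, and staleness budget are hypothetical.</p>

```python
from datetime import timedelta


def pick_read_target(standby_lags, max_staleness, needs_fresh_data):
    """Choose where to send a read.

    standby_lags -- dict of standby name -> replay lag (timedelta), healthy standbys only
    """
    if needs_fresh_data:
        return "primary"
    # Keep only standbys whose lag fits inside the staleness budget.
    fresh = {name: lag for name, lag in standby_lags.items() if max_staleness >= lag}
    if not fresh:
        return "primary"
    # Prefer the least-lagged standby.
    return min(fresh, key=fresh.get)


lags = {"standby1": timedelta(milliseconds=40), "standby2": timedelta(seconds=12)}
print(pick_read_target(lags, timedelta(seconds=1), needs_fresh_data=False))
print(pick_read_target(lags, timedelta(seconds=1), needs_fresh_data=True))
```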
<h3 id="跟-mysql-replication-topology-對比">Comparison with MySQL Replication Topology</h3>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PG streaming replication</th>
          <th>MySQL replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Progress tracking</td>
          <td>LSN (single byte offset)</td>
          <td>GTID or binlog (file, position)</td>
      </tr>
      <tr>
          <td>Standard tooling</td>
          <td>streaming replication (physical) + logical</td>
          <td>binlog ROW format</td>
      </tr>
      <tr>
          <td>Sync mechanism</td>
          <td><code>synchronous_commit</code> + standby names</td>
          <td>semi-sync plugin</td>
      </tr>
      <tr>
          <td>Quorum</td>
          <td><code>ANY N</code> syntax</td>
          <td><code>rpl_semi_sync_master_wait_for_slave_count</code></td>
      </tr>
      <tr>
          <td>Replay parallelism</td>
          <td>Single startup process</td>
          <td>Multi-thread (logical clock / writeset)</td>
      </tr>
      <tr>
          <td>Replica routing</td>
          <td>PgBouncer does not inspect SQL; needs an external router</td>
          <td>ProxySQL has built-in query routing</td>
      </tr>
  </tbody>
</table>
<p>兩者 high-level 對等、低層機制有顯著差異。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>。</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PG Patroni HA</a> (HA failover; builds on the replication topology described in this article)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a> (a different abstraction over the same WAL)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PG PITR + WAL Archiving</a> (streaming and archiving run side by side)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PG PgBouncer</a> (connection pool; does not do read/write splitting)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a> (sibling article; different mechanisms)</li>
<li><a href="/blog/backend/knowledge-cards/quorum/" data-link-title="Quorum" data-link-desc="分散式系統以多數節點同意作為提交或讀取有效性的門檻">quorum card</a> / <a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">stale-read card</a> / <a href="/blog/backend/knowledge-cards/eventual-consistency/" data-link-title="Eventual Consistency" data-link-desc="允許短暫不一致、最終收斂到同一資料狀態的一致性語意">eventual-consistency card</a></li>
<li>Official docs: <a href="https://www.postgresql.org/docs/current/warm-standby.html">PG Streaming Replication</a> / <a href="https://www.postgresql.org/docs/current/app-pgbasebackup.html">pg_basebackup</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Online Schema Change: use ALTER's built-in fast paths first, reach for pg_repack / pg-osc only when they are not enough</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/online-schema-change/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/online-schema-change/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive that sits under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on &lt;em>online schema change&lt;/em>: first which PG ALTER operations are already fast and catalog-only, then when pg_repack / pg-osc are actually needed.&lt;/p>&lt;/blockquote>
&lt;hr>
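&lt;p>Unlike MySQL, PG has fast catalog-only behavior &lt;em>built in&lt;/em> for a large share of schema changes, so there is no need to go through a ghost-table tool. What gh-ost / pt-online-schema-change are to MySQL is, on PG, &lt;em>an escape hatch for a minority of cases&lt;/em>, not standard practice.&lt;/p>
&lt;p>When planning an online schema change (OSC), &lt;em>check PG's own ALTER behavior first&lt;/em> and bring in pg_repack / pg-osc only once you have confirmed they are needed; otherwise you are just adding complexity.&lt;/p>
&lt;h2 id="pg-alter-table-的-fast--slow-分類">Fast / slow classification of PG ALTER TABLE&lt;/h2>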





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- ALTER TABLE 的操作大致三類&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="類-afast-catalog-only-1-秒metadata-改">類 A：Fast catalog-only（&amp;lt; 1 秒、metadata 改）&lt;/h3>
&lt;p>PG 9.4+ / 11+ 多數 ALTER 已 catalog-only：&lt;/p>
&lt;ul>
&lt;li>&lt;code>ADD COLUMN col TYPE NULL DEFAULT NULL&lt;/code> — 直接 metadata、不 rewrite&lt;/li>
&lt;li>&lt;code>ADD COLUMN col TYPE NOT NULL DEFAULT &amp;lt;constant&amp;gt;&lt;/code>（PG 11+）— optimizer 把 default 存在 metadata、舊 row read 時動態返回 default、不 rewrite&lt;/li>
&lt;li>&lt;code>DROP COLUMN&lt;/code> — metadata 標 dropped、實際 row 不 rewrite（VACUUM 之後逐步清理）&lt;/li>
&lt;li>&lt;code>ALTER COLUMN ... SET DEFAULT &amp;lt;constant&amp;gt;&lt;/code> — metadata&lt;/li>
&lt;li>&lt;code>RENAME COLUMN&lt;/code> / &lt;code>RENAME TABLE&lt;/code> — metadata&lt;/li>
&lt;li>&lt;code>ADD CONSTRAINT ... NOT VALID&lt;/code> — 標記 constraint 不 validate、之後 &lt;code>VALIDATE CONSTRAINT&lt;/code> 才 scan&lt;/li>
&lt;li>&lt;code>ALTER COLUMN ... TYPE&lt;/code> 同 binary-compat 類型（&lt;code>VARCHAR(10) → VARCHAR(20)&lt;/code>、&lt;code>TEXT → VARCHAR&lt;/code> 等）— catalog-only&lt;/li>
&lt;/ul>
&lt;p>這類 ALTER &lt;em>直接跑、不必任何工具&lt;/em>。&lt;/p>
&lt;h3 id="類-block-heavyrewrites-tableproduction-慎用">類 B：Lock heavy（rewrites table、production 慎用）&lt;/h3>
&lt;p>需要 &lt;em>rewrite 整張 table&lt;/em>、ACCESS EXCLUSIVE lock 整個 ALTER 期間：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive that sits under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on <em>online schema change</em>: first which PG ALTER operations are already fast and catalog-only, then when pg_repack / pg-osc are actually needed.</p></blockquote>
<hr>
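<p>Unlike MySQL, PG has fast catalog-only behavior <em>built in</em> for a large share of schema changes, so there is no need to go through a ghost-table tool. What gh-ost / pt-online-schema-change are to MySQL is, on PG, <em>an escape hatch for a minority of cases</em>, not standard practice.</p>
<p>When planning an online schema change (OSC), <em>check PG's own ALTER behavior first</em> and bring in pg_repack / pg-osc only once you have confirmed they are needed; otherwise you are just adding complexity.</p>
<h2 id="pg-alter-table-的-fast--slow-分類">Fast / slow classification of PG ALTER TABLE</h2>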





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- ALTER TABLE operations fall roughly into three classes</span></span></span></code></pre></div><h3 id="類-afast-catalog-only-1-秒metadata-改">Class A: Fast catalog-only (&lt; 1 second, metadata change)</h3>
<p>On PG 9.4+ (and more so on 11+), most everyday ALTERs are already catalog-only:</p>
<ul>
<li><code>ADD COLUMN col TYPE NULL DEFAULT NULL</code>: metadata only, no rewrite</li>
<li><code>ADD COLUMN col TYPE NOT NULL DEFAULT &lt;constant&gt;</code> (PG 11+): the default is stored in the catalog (<code>pg_attribute.attmissingval</code>) and returned at read time for pre-existing rows, so nothing is rewritten. This only holds for a non-volatile default; a volatile one such as <code>clock_timestamp()</code> still rewrites the table</li>
<li><code>DROP COLUMN</code>: the column is marked dropped in the catalog and rows are not rewritten; the space is reclaimed gradually as rows are later updated, or all at once when the table is rewritten</li>
<li><code>ALTER COLUMN ... SET DEFAULT &lt;constant&gt;</code>: metadata only, affects future inserts</li>
<li><code>RENAME COLUMN</code> / <code>RENAME TO</code> (table rename): metadata only</li>
<li><code>ADD CONSTRAINT ... NOT VALID</code>: records the constraint without checking existing rows; the scan happens later, in <code>VALIDATE CONSTRAINT</code></li>
<li><code>ALTER COLUMN ... TYPE</code> between binary-compatible types (<code>VARCHAR(10) → VARCHAR(20)</code>, <code>TEXT → VARCHAR</code>, and so on): catalog-only</li>
</ul>
<p>ALTERs in this class <em>can be run directly, with no tooling</em>.</p>
<h3 id="類-block-heavyrewrites-tableproduction-慎用">Class B: Lock heavy (rewrites the table; use with care in production)</h3>
<p>These <em>rewrite or scan the whole table</em> and hold an ACCESS EXCLUSIVE lock for the entire ALTER:</p>
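<p>A minimal sketch of the catalog-only statements listed above, assuming a hypothetical <code>orders</code> table with a <code>qty</code> column (the column and constraint names are illustrative, not from a real schema):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Nullable column: metadata only on any supported version.
ALTER TABLE orders ADD COLUMN note TEXT;

-- NOT NULL with a constant default: metadata only on PG 11+.
ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';

-- NOT VALID records the constraint without scanning existing rows.
ALTER TABLE orders
  ADD CONSTRAINT orders_qty_positive CHECK (qty &gt; 0) NOT VALID;

-- The scan happens here, under SHARE UPDATE EXCLUSIVE,
-- which does not block normal reads and writes.
ALTER TABLE orders VALIDATE CONSTRAINT orders_qty_positive;</code></pre></div>
<p>Splitting the constraint into <code>NOT VALID</code> plus <code>VALIDATE CONSTRAINT</code> keeps the ACCESS EXCLUSIVE window short, because the full-table check runs under the weaker lock.</p>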
<ul>
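<li><code>ALTER COLUMN ... TYPE</code> between binary-incompatible types (<code>INT → BIGINT</code> always rewrites, and so does <code>TEXT → INT</code>): even though the change is semantically a "widening", the 4-byte and 8-byte storage formats differ, so the full-table rewrite under ACCESS EXCLUSIVE cannot be avoided</li>
<li><code>ALTER COLUMN ... SET NOT NULL</code> on an existing nullable column (scans the whole table but does not rewrite it; on PG 12+ the scan is skipped when a valid <code>CHECK (col IS NOT NULL)</code> constraint already exists)</li>
<li><code>ALTER COLUMN ... DROP IDENTITY</code> (does not rewrite the table by itself, but still takes ACCESS EXCLUSIVE, so it queues behind long-running transactions)</li>
<li><code>ALTER TABLE ... SET TABLESPACE</code> (copies the whole table into the new tablespace)</li>
</ul>
<p>On large tables these <em>should not be run directly in production</em>; use a ghost-table tool or a staged migration instead.</p>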
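<p>One way to confirm on a staging copy whether a given ALTER rewrites the table is to compare the table's filenode before and after. This is a sketch that assumes a hypothetical <code>orders.qty</code> column of type <code>INT</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on staging, not production.
SELECT pg_relation_filenode('orders') AS filenode_before;

-- Assumed example: qty is currently INT, so this forces a rewrite.
ALTER TABLE orders ALTER COLUMN qty TYPE BIGINT;

-- A different value here means the table was rewritten into a new file.
SELECT pg_relation_filenode('orders') AS filenode_after;</code></pre></div>
<p>If the two values are equal the ALTER was catalog-only (or scan-only); if they differ, treat the statement as lock heavy and plan accordingly.</p>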
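<h3 id="類-cconcurrent-index--online-operation無-table-lock">Class C: Concurrent index / online operation (no blocking table lock)</h3>
<ul>
<li><code>CREATE INDEX CONCURRENTLY</code>: does not block writes, builds in the background, slower but safe</li>
<li><code>REINDEX INDEX CONCURRENTLY</code> (PG 12+): same as above</li>
<li><code>DROP INDEX CONCURRENTLY</code>: does not take the ACCESS EXCLUSIVE lock on the table that a plain <code>DROP INDEX</code> does; it waits for conflicting transactions to finish instead</li>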
</ul>
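<p>A short sketch of a concurrent build with progress and validity checks, assuming a hypothetical index <code>idx_orders_status</code> on <code>orders (status)</code>; the progress view requires PG 12+:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Must run outside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

-- From another session: watch the build (PG 12+).
SELECT pid, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;

-- Afterwards: confirm every index on the table is valid.
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indrelid = 'orders'::regclass;</code></pre></div>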
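<h2 id="何時需要-ghost-table-tool">When you need a ghost-table tool</h2>
<p>Only the following scenarios call for pg_repack / pg-osc:</p>
<ol>
<li><strong>Rewrite-required type change</strong> (Class B <code>ALTER COLUMN TYPE</code>) on a large table. pg_repack rewrites a table as-is and does not apply DDL, so this case needs pg-osc (which takes an ALTER statement) or a new-column + backfill migration</li>
<li><strong>VACUUM FULL replacement</strong>: pg_repack is safer than VACUUM FULL (it holds ACCESS EXCLUSIVE only briefly, not for the whole rewrite)</li>
<li><strong>Bloat reorganization</strong>: a large table has accumulated dead tuples and you want a full rewrite</li>
</ol>
<p>For "add column", "drop column", "create index" and similar cases, <em>PG's built-in fast paths are enough</em>; no ghost-table tool is needed.</p>
<h2 id="tool-1pg_repack--trigger-based--雙-table-swap">Tool 1: pg_repack (trigger-based, two-table swap)</h2>
<p>pg_repack is the PG community's standard tool for online table rewrites:</p>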
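<p>For the bloat case, a rough first signal of which tables might be worth repacking can be read from the statistics views. This is only a sketch: dead-tuple counts are an approximation of bloat, not a measurement of it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Tables with the most dead tuples, with their share and total size.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;</code></pre></div>
<p>A high <code>dead_pct</code> together with an old <code>last_autovacuum</code> usually points at autovacuum tuning first; a full rewrite is the fallback when space has to be handed back to the operating system.</p>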





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">pg_repack -h primary.example.com -p <span class="m">5432</span> -d production -U postgres <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --table<span class="o">=</span>orders --no-superuser-check</span></span></code></pre></div><p><strong>Mechanism</strong>:</p>
<ol>
<li>Create a log table <code>repack.log_&lt;oid&gt;</code> to record changes made to the original table</li>
<li>Add a trigger on the original table that logs every INSERT / UPDATE / DELETE into the log table</li>
<li>Create a new table <code>repack.table_&lt;oid&gt;</code> containing all rows of the original, then build its indexes</li>
<li>Apply the changes that accumulated in the log table to the new table</li>
<li>Swap: under a short ACCESS EXCLUSIVE lock, exchange the two tables (including indexes and TOAST) through the system catalogs</li>
<li>Drop the old copy, the trigger, and the log table</li>
</ol>
<p><strong>Trade-off</strong>:</p>
<ul>
<li><em>Trigger overhead</em>: every write on the primary also fires the trigger (the estimate here is a 10-30% drop in write throughput, depending on workload)</li>
<li><em>FK handling</em>: foreign keys that reference the original table have to survive the swap; pg_repack handles this automatically, but there is a lock window</li>
<li><em>Version binding</em>: the pg_repack client and extension are built for a specific PG major version, so a build for PG 13 cannot run against a PG 14 cluster</li>
</ul>
<p><strong>Setup</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Install on the primary
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_repack</span><span class="p">;</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Repack orders</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pg_repack -d production --table<span class="o">=</span>orders
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># Monitor locks: run SELECT * FROM pg_stat_activity in another session</span></span></span></code></pre></div><h2 id="tool-2pg-osc--pg-online-schema-change--wal-shipping-style">Tool 2: pg-osc / pg-online-schema-change (shadow table + audit trigger)</h2>
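<p><a href="https://github.com/shayonj/pg-osc">pg-osc</a> (Shayon Mukherjee, 2023) is a newer tool. Its README describes it as inspired by pg_repack and pt-online-schema-change: it is also trigger-based, but unlike pg_repack it applies an arbitrary ALTER statement to a shadow table.</p>
<p><strong>Mechanism</strong>:</p>
<ol>
<li>Create an audit table and add a trigger on the original table that records INSERT / UPDATE / DELETE into it</li>
<li>CREATE a shadow table and apply the ALTER to it</li>
<li>Copy the existing rows into the shadow table, then replay the changes accumulated in the audit table</li>
<li>Once the remaining delta is small, swap the table names under a short ACCESS EXCLUSIVE lock</li>
</ol>
<p><strong>Trade-off</strong>:</p>
<ul>
<li><em>Write overhead on the primary</em>: comparable to pg_repack, because every write also fires the audit trigger</li>
<li>Newer than pg_repack, so less battle-tested by the community</li>
<li>A good fit when the change <em>requires a rewrite that pg_repack cannot express</em>, such as a column type change</li>
</ul>
<p><strong>Setup</strong>:</p>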





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Install via RubyGems</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">gem install pg_online_schema_change
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Run</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">pg-online-schema-change perform <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  --alter-statement<span class="o">=</span><span class="s2">&#34;ALTER TABLE orders ADD COLUMN status VARCHAR(20)&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="se"></span>  --schema<span class="o">=</span>public <span class="se">\
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="se"></span>  --dbname<span class="o">=</span>production <span class="se">\
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="se"></span>  --host<span class="o">=</span>primary.example.com</span></span></code></pre></div><h2 id="配置-step-by-steppg_repack-為主">Setup step-by-step (mainly pg_repack)</h2>
<p>In practice most online table rewrites on PG use pg_repack. pg-osc is the escape hatch for ALTERs that require a rewrite.</p>
<h3 id="step-1安裝--確認版本">Step 1: Install and confirm the version</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Install pg_repack (versioned)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_repack</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_available_extensions</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pg_repack&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Confirm installed_version matches the pg_repack client binary (pg_repack --version)</span></span></span></code></pre></div><h3 id="step-2跑-pg_repack">Step 2: Run pg_repack</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># --jobs: extra connections used to rebuild indexes in parallel</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># --wait-timeout: seconds to wait for the lock</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># --no-kill-backend: skip the table instead of cancelling queries that hold the lock</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">pg_repack -h primary -d production -U postgres <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  --table<span class="o">=</span>orders <span class="se">\
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="se"></span>  --jobs<span class="o">=</span><span class="m">4</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="se"></span>  --wait-timeout<span class="o">=</span><span class="m">60</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="se"></span>  --no-kill-backend</span></span></code></pre></div><h3 id="step-3監控">Step 3: Monitor</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Check pg_repack activity (excluding this monitoring session itself)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pid</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">,</span><span class="w"> </span><span class="k">state</span><span class="p">,</span><span class="w"> </span><span class="n">wait_event_type</span><span class="p">,</span><span class="w"> </span><span class="n">wait_event</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_activity</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">query</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;%repack%&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">pid</span><span class="w"> </span><span class="o">&lt;&gt;</span><span class="w"> </span><span class="n">pg_backend_pid</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Check locks on orders and on the pg_repack working objects (schema repack)
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_locks</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">relation</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">  </span><span class="k">SELECT</span><span class="w"> </span><span class="n">oid</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">relname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;orders&#39;</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="n">relnamespace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;repack&#39;</span><span class="p">::</span><span class="n">regnamespace</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><h3 id="step-4驗證">Step 4: Verify</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- After the run, compare row counts and spot-check a few queries
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Compare with the count taken before pg_repack</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-alter-直接跑沒看是不是-fast-變-lock-heavy">1. Running ALTER without checking whether it is fast, and ending up lock heavy</h3>
<p><code>ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending'</code> is expected to be catalog-only (PG 11+), but on PG 10 the same statement rewrites the whole table and holds ACCESS EXCLUSIVE throughout, which can mean hours on a large table.</p>
<p>Fix:</p>
<ul>
<li><em>Confirm the PG version</em> before writing the schema migration</li>
<li>Read the <a href="https://www.postgresql.org/docs/current/sql-altertable.html">PG ALTER doc</a>; its <em>Notes</em> section spells out which subcommands rewrite or scan the table</li>
<li>Test on staging before production, and monitor lock waits in <code>pg_stat_activity</code></li>
</ul>
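<p>A possible pre-flight wrapper for the checks above. It is a sketch: the timeout values are assumptions to tune, and the ALTER is the example from this pitfall.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 110000 or higher means PG 11+, where the ALTER below is catalog-only.
SHOW server_version_num;

-- Give up quickly instead of queueing behind a long transaction
-- (a queued ACCESS EXCLUSIVE request blocks every later query on the table).
SET lock_timeout = '3s';

-- Safety net: if the ALTER unexpectedly rewrites the table, abort it.
SET statement_timeout = '10s';

ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';

-- From another session: who is waiting on a lock, and who is blocking them.
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) &gt; 0;</code></pre></div>
<p>If the ALTER fails on either timeout nothing has changed, so the migration can simply be retried at a quieter moment.</p>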
<h3 id="2-vacuum-full-誤用--production-downtime">2. VACUUM FULL misuse: production downtime</h3>
<p><code>VACUUM FULL</code> amounts to "rewrite the whole table under an ACCESS EXCLUSIVE lock". Running it in production makes the table unavailable for minutes to hours.</p>
<p>Fix:</p>
<ul>
<li><em>Use pg_repack</em> instead of VACUUM FULL, unless you have a maintenance window</li>
<li>For recurring bloat, run pg_repack on a schedule</li>
<li>Tune autovacuum first (see <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a> for details)</li>
</ul>
<h3 id="3-pg_repack-version-mismatch">3. pg_repack version mismatch</h3>
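<p>The PG cluster is upgraded to 14, but the <code>pg_repack</code> extension in the database is still the build installed for 13. Running the <code>pg_repack</code> command then fails with a version-mismatch error; the exact text varies by release, but it reports that the client program does not match the installed extension or database library.</p>
<p>Fix:</p>
<ul>
<li>Right after upgrading the cluster, <em>install the pg_repack build for the new PG major version and bring the extension in each database to the same version</em> (the pg_repack docs describe dropping and re-creating the extension rather than relying on <code>ALTER EXTENSION pg_repack UPDATE</code>)</li>
<li>If pg_repack has not yet released a build for the new PG version (early upgrades), use pg-osc in the meantime or wait</li>
<li>Record in the upgrade runbook that pg_repack is <em>an extension that must be upgraded in lockstep</em></li>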
</ul>
<h3 id="4-create-index-concurrently-失敗清理">4. Cleaning up after a failed CREATE INDEX CONCURRENTLY</h3>
<p>When <code>CREATE INDEX CONCURRENTLY</code> is cancelled partway through (user Ctrl-C, dropped connection), it leaves behind an <em>invalid index</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">indexrelid</span><span class="p">::</span><span class="n">regclass</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_index</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">indisvalid</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- Returns, for example, idx_orders_status_invalid</span></span></span></code></pre></div><p>An invalid index still takes up disk space and is still maintained on every write, but the planner will not use it.</p>
<p>Fix:</p>
<ul>
<li>Run <code>DROP INDEX CONCURRENTLY idx_orders_status_invalid</code></li>
<li>Then re-run <code>CREATE INDEX CONCURRENTLY</code></li>
<li>Avoid running a long CREATE INDEX CONCURRENTLY from a session on an unstable connection; run it from cron or a deploy pipeline instead</li>
</ul>
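<p>The cleanup can be sketched as follows, reusing the example index name above; <code>idx_orders_status</code> and the <code>status</code> column are assumed names for the index that was originally intended:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- List invalid indexes with their table and the space they still occupy.
SELECT i.indexrelid::regclass AS index_name,
       i.indrelid::regclass   AS table_name,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM pg_index i
WHERE NOT i.indisvalid;

-- Remove the leftover without blocking writes, then rebuild.
-- Both statements must run outside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status_invalid;
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);</code></pre></div>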
<h3 id="5-generated-stored-column-不能-online-add">5. A stored generated column cannot be added online</h3>
<p><code>ADD COLUMN total NUMERIC GENERATED ALWAYS AS (price * qty) STORED</code>: a <em>stored</em> generated column has to rewrite the whole table to compute the column value, so it is not catalog-only.</p>
<p>Fix:</p>
<ul>
<li>
<p>Use <code>GENERATED ALWAYS AS (...) VIRTUAL</code> (PG 18+): no value is stored, so it is catalog-only</p>
</li>
<li>
<p>Or <em>add a nullable column first, backfill it, then add the NOT NULL constraint</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="nb">NUMERIC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">price</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">qty</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="p">...;</span><span class="w">  </span><span class="c1">-- chunked
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">ALTER</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
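</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Afterwards, keep total up to date with a trigger or in the application layer</span></span></span></code></pre></div></li>
<li>
<p>Or apply the <code>ADD ... GENERATED ... STORED</code> through a shadow-table tool that accepts an ALTER statement, such as pg-osc (pg_repack rewrites a table as-is and cannot add a column)</p>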
</li>
</ul>
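<p>The "nullable + backfill + NOT NULL" route can be sketched in more detail as below. It assumes <code>orders</code> has a dense-enough integer primary key <code>id</code> and that the nullable <code>total</code> column has already been added; the batch size and constraint name are illustrative. <code>COMMIT</code> inside a <code>DO</code> block needs PG 11+ and must not be run inside an outer transaction.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Batched backfill: each batch commits on its own, so row locks stay short.
DO $$
DECLARE
  batch_size CONSTANT bigint := 10000;  -- assumption: tune for your workload
  lo bigint;
  hi bigint;
BEGIN
  SELECT min(id), max(id) INTO lo, hi FROM orders;
  WHILE lo &lt;= hi LOOP
    UPDATE orders
       SET total = price * qty
     WHERE id &gt;= lo AND id &lt; lo + batch_size
       AND total IS NULL;
    COMMIT;
    lo := lo + batch_size;
  END LOOP;
END $$;

-- PG 12+: prove non-nullness under a weak lock, then SET NOT NULL skips its scan.
ALTER TABLE orders
  ADD CONSTRAINT orders_total_not_null CHECK (total IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_total_not_null;
ALTER TABLE orders ALTER COLUMN total SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_total_not_null;</code></pre></div>
<p>Rows written while the backfill runs still need <code>total</code> filled in by the application or a trigger; otherwise the validation step fails on the first NULL it finds.</p>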
<h2 id="容量--時間估算">Capacity / time estimates</h2>
<p>Taking a 100 GB table as an example (rough orders of magnitude; actual times depend on hardware and write load):</p>
<table>
  <thead>
      <tr>
          <th>Operation</th>
          <th>Time</th>
          <th>Lock impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>ADD COLUMN col TYPE NULL</code></td>
          <td>&lt; 1 second</td>
          <td>ACCESS EXCLUSIVE (milliseconds)</td>
      </tr>
      <tr>
          <td><code>ADD COLUMN col TYPE NOT NULL DEFAULT 0</code> (PG 11+)</td>
          <td>&lt; 1 second</td>
          <td>ACCESS EXCLUSIVE (milliseconds)</td>
      </tr>
      <tr>
          <td><code>CREATE INDEX CONCURRENTLY</code></td>
          <td>2-6 hours</td>
          <td>No write-blocking table lock</td>
      </tr>
      <tr>
          <td><code>pg_repack table</code></td>
          <td>4-8 hours</td>
          <td>Short ACCESS EXCLUSIVE (swap)</td>
      </tr>
      <tr>
          <td><code>ALTER COLUMN TYPE</code> rewrite</td>
          <td>4-8 hours</td>
          <td>ACCESS EXCLUSIVE throughout</td>
      </tr>
      <tr>
          <td><code>VACUUM FULL</code></td>
          <td>Same as pg_repack</td>
          <td>ACCESS EXCLUSIVE throughout (do not run)</td>
      </tr>
  </tbody>
</table>
<h2 id="跟-mysql-gh-ost--pt-osc-對照">Comparison with MySQL gh-ost / pt-osc</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PG pg_repack</th>
          <th>PG pg-osc</th>
          <th>MySQL gh-ost</th>
          <th>MySQL pt-osc</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mechanism</td>
          <td>Trigger + log table</td>
          <td>Trigger + audit table</td>
          <td>Binlog stream</td>
          <td>Triggers writing to the new table</td>
      </tr>
      <tr>
          <td>Write overhead on primary</td>
          <td>Medium (trigger)</td>
          <td>Medium (trigger)</td>
          <td>0 (binlog already exists)</td>
          <td>Medium (trigger)</td>
      </tr>
      <tr>
          <td>Throttle support</td>
          <td>Partial</td>
          <td>Supported</td>
          <td>Strong</td>
          <td>Partial</td>
      </tr>
      <tr>
          <td>Pause / Resume</td>
          <td>Not supported</td>
          <td>Not supported</td>
          <td>Supported</td>
          <td>Not supported</td>
      </tr>
      <tr>
          <td>Tool maturity</td>
          <td>High</td>
          <td>Medium (2023+)</td>
          <td>High</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Typical use</td>
          <td>PG mainstream (90% of cases)</td>
          <td>Escape hatch for rewrite-required ALTERs</td>
          <td>MySQL mainstream</td>
          <td>MySQL legacy + FK</td>
      </tr>
  </tbody>
</table>
<p>PG OSC tools are used less often than their MySQL counterparts, because PG's built-in fast ALTER already covers roughly 90% of schema changes; a ghost-table tool is only for the <em>few rewrite-required</em> cases.</p>
<p>See <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a> for the sibling article, which has a different use-case mix.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-replication-topology">With replication topology</h3>
<p>ALTER TABLE / pg_repack / pg-osc all generate WAL, which is replicated to the standbys. A long-running query on a standby can conflict with the replayed ALTER, and with <code>hot_standby_feedback</code> enabled such queries also hold back autovacuum on the primary. See <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>.</p>
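<p>Because a full-table rewrite produces a large burst of WAL, it can help to watch standby lag from the primary while the operation runs. A minimal sketch (PG 10+):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on the primary: how far each standby is behind in replay.
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
       replay_lag
FROM pg_stat_replication;</code></pre></div>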
<h3 id="跟-autovacuum-tuning">With autovacuum tuning</h3>
<p>A schema change often leaves dead tuples behind, which autovacuum then has to catch up on. See <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>.</p>
<h3 id="跟-logical-replication">With logical replication</h3>
<p>Logical replication synchronizes through publication / subscription. DDL is <em>not</em> replicated by logical replication (this is still the case in current releases), so it has to be <em>run separately on the publisher and on each subscriber</em>. See <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>.</p>
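<p>For an additive change, the PG documentation suggests applying the DDL on the subscriber first, so that incoming rows never reference a column the subscriber lacks. A sketch with a hypothetical <code>note</code> column:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Step 1: on every subscriber. The extra column is simply left NULL
-- until the publisher starts sending it.
ALTER TABLE orders ADD COLUMN note TEXT;

-- Step 2: on the publisher, once all subscribers have the column.
ALTER TABLE orders ADD COLUMN note TEXT;</code></pre></div>
<p>For a removal the order is reversed: drop the column on the publisher first, then on the subscribers.</p>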
<h3 id="跟-patroni-ha">With Patroni HA</h3>
<p>After Patroni promotes a new primary, the pg_repack extension and the objects it created are ordinary catalog state, so they are present on the new primary and pg_repack can be run there. A run that was in flight during the failover is driven by the client connection and does not survive it: clean up any leftover <code>repack</code> objects and triggers, then re-run. See <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a>.</p>
<h2 id="何時用哪個">When to use which</h2>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Choice</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ADD COLUMN nullable / DROP COLUMN / RENAME and similar</td>
          <td>Plain ALTER (fast, catalog-only)</td>
      </tr>
      <tr>
          <td>CREATE INDEX on a large table</td>
          <td><code>CREATE INDEX CONCURRENTLY</code></td>
      </tr>
      <tr>
          <td>ALTER COLUMN TYPE with rewrite (large table)</td>
          <td>pg_repack</td>
      </tr>
      <tr>
          <td>Bloat reorganization</td>
          <td>pg_repack</td>
      </tr>
      <tr>
          <td>High throughput where trigger overhead is unacceptable</td>
          <td>pg-osc</td>
      </tr>
      <tr>
          <td>ADD GENERATED STORED column</td>
          <td>nullable column + backfill + constraint</td>
      </tr>
      <tr>
          <td>Managed cloud cluster (RDS / Aurora)</td>
          <td>Built-in fast DDL on RDS / Aurora covers most cases; pg_repack availability depends on the vendor</td>
      </tr>
  </tbody>
</table>
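<p>The first two rows of the table can be sketched as follows. This is a minimal illustration, assuming a hypothetical table <code>orders</code> and column <code>note</code>; the 3-second <code>lock_timeout</code> is an arbitrary example value, not a recommendation from the article.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Catalog-only change. It is fast, but it still needs a brief exclusive lock,
-- so cap the lock wait to avoid queueing behind a long transaction.
BEGIN;
SET LOCAL lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN note text;
COMMIT;

-- Index on a large table: build it without blocking writes.
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_note ON orders (note);</code></pre></div>
<p>If the ALTER times out, it fails cleanly and can simply be retried, which is usually preferable to stalling every query that queues behind it.</p>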
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a> (how ALTER interacts with streaming replication)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a> (vacuum concerns after schema changes)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a> (DDL is not replicated)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PG Patroni HA</a> (pg_repack under HA)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a> (sibling article; different tool ecosystem)</li>
<li><a href="/blog/backend/knowledge-cards/expand-contract/" data-link-title="Expand / Contract" data-link-desc="說明先擴充相容面、再收斂舊路徑的遷移做法">Expand / Contract card</a> (schema migration design principle)</li>
<li>Official: <a href="https://www.postgresql.org/docs/current/sql-altertable.html">ALTER TABLE</a> / <a href="https://github.com/reorg/pg_repack">pg_repack GitHub</a> / <a href="https://github.com/shayonj/pg-osc">pg-osc GitHub</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Connection Scaling：process-per-connection model 跟為什麼 pooler 是必裝</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-scaling/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-scaling/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>connection scaling 的根因&lt;/em> — 為什麼 PG 比多數 DB 更需要 pooler、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &amp;#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config&lt;/a> 是 &lt;em>根因 vs 配置&lt;/em> 的關係。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="connection-per-process-model-是-pg-的結構性選擇">Connection-per-Process Model 是 PG 的結構性選擇&lt;/h2>
&lt;p>PG 接受 client connection 時的行為跟多數現代 DB 不同：每個 connection 由 postmaster &lt;code>fork()&lt;/code> 一個獨立的 OS process（backend）來服務。這個 process 在 connection lifetime 內專屬該 client、不跟其他 client 共享。&lt;/p>
&lt;p>對比常見 DB 的 connection model：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Vendor&lt;/th>
 &lt;th>Connection model&lt;/th>
 &lt;th>每 connection 資源&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PostgreSQL&lt;/td>
 &lt;td>Process-per-connection（fork）&lt;/td>
 &lt;td>5-15MB RAM、獨立 PID&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>MySQL&lt;/td>
 &lt;td>Thread-per-connection&lt;/td>
 &lt;td>256KB-2MB RAM、共享 process&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Oracle&lt;/td>
 &lt;td>Shared server / dedicated 可選&lt;/td>
 &lt;td>配置決定&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>SQL Server&lt;/td>
 &lt;td>Thread-per-connection（pooled）&lt;/td>
 &lt;td>~512KB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>MongoDB&lt;/td>
 &lt;td>Thread-per-connection&lt;/td>
 &lt;td>~1MB&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>PG 選 process 不選 thread 是 1990s 設計決定 — 當時 thread library 在多 UNIX 平台不穩定、process 隔離性更好（一個 backend crash 不會帶倒整個 DB）。這個 trade-off 一路保留到今天、是 PG 在 high-connection-count workload 的 &lt;em>結構性負擔&lt;/em>。&lt;/p>
&lt;h2 id="量化connection-數量對-ram-跟-cpu-的壓力">量化：connection 數量對 RAM 跟 CPU 的壓力&lt;/h2>
&lt;p>一個 PG backend process 的 RAM footprint 由三部分組成：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">backend_rss ≈ shared_buffers_attach + process_private + work_mem 高水位&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>shared_buffers&lt;/code> 是所有 backend 共享的、不重複計、但 &lt;code>process_private&lt;/code>（catalog cache / plan cache / temp buffer）跟 &lt;code>work_mem&lt;/code> 是 per-backend：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Workload 類型&lt;/th>
 &lt;th>process_private&lt;/th>
 &lt;th>work_mem 高水位&lt;/th>
 &lt;th>單 backend RAM&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Idle / 簡單 OLTP&lt;/td>
 &lt;td>3-5MB&lt;/td>
 &lt;td>4MB&lt;/td>
 &lt;td>7-9MB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>中等 query（join / sort）&lt;/td>
 &lt;td>5-8MB&lt;/td>
 &lt;td>16-64MB&lt;/td>
 &lt;td>21-72MB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Heavy analytical（CTE / window）&lt;/td>
 &lt;td>8-15MB&lt;/td>
 &lt;td>256MB+&lt;/td>
 &lt;td>264MB+&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>500 個 connection、平均 30MB 各 ≈ 15GB RAM 給 backend processes（還沒算 shared_buffers）。這是 PG 在 cloud instance 上很快撞到 RAM ceiling 的根因。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is an implementation-layer deep-dive companion to the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on <em>the root cause of connection scaling problems</em>, that is, why PG needs a pooler more than most databases. Its relationship to <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a> is <em>root cause vs configuration</em>.</p></blockquote>
<hr>
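<h2 id="connection-per-process-model-是-pg-的結構性選擇">The Connection-per-Process Model Is a Structural Choice in PG</h2>
<p>PG handles an incoming client connection differently from most modern databases: for every connection, the postmaster calls <code>fork()</code> to create a separate OS process (a backend) to serve it. That process belongs to that one client for the lifetime of the connection and is not shared with any other client.</p>
<p>Compared with the connection models of other common databases:</p>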
<table>
  <thead>
      <tr>
          <th>Vendor</th>
          <th>Connection model</th>
          <th>Resources per connection</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PostgreSQL</td>
          <td>Process-per-connection (fork)</td>
          <td>5-15 MB RAM, own PID</td>
      </tr>
      <tr>
          <td>MySQL</td>
          <td>Thread-per-connection</td>
          <td>256 KB-2 MB RAM, shared process</td>
      </tr>
      <tr>
          <td>Oracle</td>
          <td>Shared server or dedicated (selectable)</td>
          <td>Depends on configuration</td>
      </tr>
      <tr>
          <td>SQL Server</td>
          <td>Thread-per-connection (pooled)</td>
          <td>~512 KB</td>
      </tr>
      <tr>
          <td>MongoDB</td>
          <td>Thread-per-connection</td>
          <td>~1 MB</td>
      </tr>
  </tbody>
</table>
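<p>Choosing processes over threads was a 1990s design decision: thread libraries were unreliable on many UNIX platforms at the time, and processes offered stronger isolation (a crashing backend cannot overwrite another backend's private memory, and the postmaster survives to reset shared state and accept connections again). That trade-off has been kept to this day, and it is PG's <em>structural burden</em> in high-connection-count workloads.</p>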
<h2 id="量化connection-數量對-ram-跟-cpu-的壓力">Quantifying the Pressure: Connection Count vs RAM and CPU</h2>
<p>The RAM footprint of one PG backend process has three parts:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">backend_rss ≈ shared_buffers_attach + process_private + work_mem high-water mark</span></span></code></pre></div><p><code>shared_buffers</code> is shared by all backends and is counted only once, but <code>process_private</code> (catalog cache / plan cache / temp buffers) and <code>work_mem</code> are per-backend:</p>
<table>
  <thead>
      <tr>
          <th>Workload type</th>
          <th>process_private</th>
          <th>work_mem high-water mark</th>
          <th>RAM per backend</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Idle / simple OLTP</td>
          <td>3-5 MB</td>
          <td>4 MB</td>
          <td>7-9 MB</td>
      </tr>
      <tr>
          <td>Medium queries (join / sort)</td>
          <td>5-8 MB</td>
          <td>16-64 MB</td>
          <td>21-72 MB</td>
      </tr>
      <tr>
          <td>Heavy analytical (CTE / window)</td>
          <td>8-15 MB</td>
          <td>256 MB+</td>
          <td>264 MB+</td>
      </tr>
  </tbody>
</table>
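<p>500 connections at an average of 30 MB each come to about 15 GB of RAM for backend processes alone, before counting shared_buffers. This is the root cause of PG hitting the RAM ceiling so quickly on cloud instances.</p>
<p>On the CPU side, a <code>fork()</code> system call on Linux typically takes 1-3 ms, and a context switch about 3-5 μs. When a burst of 100 connections arrives within one second, the accumulated fork cost is 100-300 ms; add the CPU of the queries themselves and scheduler latency, and average query latency can jump 2-5x.</p>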
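<p>The same back-of-the-envelope arithmetic can be run against a live server. This is a rough sketch only: the 30 MB per backend is the assumed average from the paragraph above, not a measured value, so the result is an order-of-magnitude estimate.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Count client backends and multiply by an ASSUMED 30 MB average footprint.
SELECT count(*) AS client_backends,
       pg_size_pretty(count(*) * 30 * 1024 * 1024) AS est_backend_ram
FROM pg_stat_activity
WHERE backend_type = 'client backend';</code></pre></div>
<p>For real numbers, measure per-process memory at the OS level for a sample of backends and substitute the observed average.</p>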
<h2 id="三個-guc-互動max_connections--shared_buffers--work_mem">Three Interacting GUCs: max_connections / shared_buffers / work_mem</h2>
<p>PG memory planning is determined by how these three GUCs interact; they cannot be tuned independently:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">total_RAM ≈ shared_buffers + max_connections × (process_private + work_mem high-water mark) + OS overhead</span></span></code></pre></div><p>Practical sizing rules (16 GB instance, OLTP workload):</p>
<table>
  <thead>
      <tr>
          <th>GUC</th>
          <th>Suggested value</th>
          <th>Rationale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>shared_buffers</code></td>
          <td>25% of RAM (4 GB)</td>
          <td>Much larger gives diminishing returns because the OS file cache already helps; much smaller leaves RAM under-used by PG's own cache</td>
      </tr>
      <tr>
          <td><code>work_mem</code></td>
          <td>8-32 MB</td>
          <td>One allocation per query operation (sort / hash), not one per connection</td>
      </tr>
      <tr>
          <td><code>max_connections</code></td>
          <td>100-200</td>
          <td>Beyond 200, add a pooler instead of raising the value</td>
      </tr>
      <tr>
          <td><code>effective_cache_size</code></td>
          <td>50-75% of RAM</td>
          <td>Planner cost-estimation hint, not an actual allocation</td>
      </tr>
      <tr>
          <td><code>maintenance_work_mem</code></td>
          <td>64-512 MB</td>
          <td>Used by VACUUM / CREATE INDEX</td>
      </tr>
  </tbody>
</table>
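<p><code>max_connections = 1000</code> is a common anti-pattern: the number of genuinely active queries may be only 50-100 and the rest are idle, yet every idle connection still holds RAM and a process slot, and the context-switch overhead remains.</p>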
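<p>A quick way to see how the GUCs above are currently set, plus a simplified worst-case estimate. The second query is a sketch under a stated assumption: it counts one <code>work_mem</code> allocation per connection and ignores <code>process_private</code>, so it understates queries that use several sort or hash nodes at once.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Current values of the memory-related settings discussed above.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'max_connections',
               'effective_cache_size', 'maintenance_work_mem');

-- Simplified estimate: shared_buffers + max_connections x work_mem.
SELECT pg_size_pretty(
         pg_size_bytes(current_setting('shared_buffers'))
         + current_setting('max_connections')::bigint
           * pg_size_bytes(current_setting('work_mem'))
       ) AS simplified_worst_case;</code></pre></div>
<p>If the estimate already approaches the instance RAM, lowering <code>max_connections</code> and adding a pooler is the lever to reach for, rather than trimming <code>work_mem</code> alone.</p>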
<h2 id="pooler-為什麼是-production-prerequisite">Why a Pooler Is a <em>Production Prerequisite</em></h2>
<blockquote>
<p>This section explains why a pooler is mandatory; for actual PgBouncer configuration, see <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a>.</p></blockquote>
<p>The core job of a pooler is to <em>multiplex N application connections onto M PG backends (M ≪ N)</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Application (3000 connection)
</span></span><span class="line"><span class="ln">2</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">Pooler（PgBouncer / PgCat）
</span></span><span class="line"><span class="ln">4</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">5</span><span class="cl">PostgreSQL (50 backend process)</span></span></code></pre></div><p>The application sees what looks like <em>an unlimited connection pool</em>, while PG sees <em>a stable 50 backends</em>. The benefit shows up at three levels:</p>
<ol>
<li><strong>RAM savings</strong>: 3000 connections × 30 MB = 90 GB becomes 50 backends × 30 MB = 1.5 GB</li>
<li><strong>Amortized fork() cost</strong>: backends are reused instead of being forked for every client</li>
<li><strong>Connection storm buffering</strong>: application restarts and scaling events do not hit PG directly</li>
</ol>
<p>Poolers offer three pool modes, each with its own application-compatibility trade-off:</p>
<table>
  <thead>
      <tr>
          <th>Pool mode</th>
          <th>Session isolation</th>
          <th>Suitable applications</th>
          <th>PG feature limits</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Session</td>
          <td>Each client holds one backend exclusively</td>
          <td>Apps that use prepared statements, SET, temp tables</td>
          <td>Effectively no pooling; only saves fork cost</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>Backend can change on every transaction</td>
          <td>Most stateless APIs (the most common choice)</td>
          <td>No session-level state</td>
      </tr>
      <tr>
          <td>Statement</td>
          <td>Backend can change on every statement</td>
          <td>Read-only / analytical</td>
          <td>No multi-statement transactions</td>
      </tr>
  </tbody>
</table>
<p>Most production setups choose transaction pooling: it saves RAM while keeping transaction semantics. The price is that the application cannot rely on session-level <code>SET</code>, <code>LISTEN/NOTIFY</code>, or prepared statements (some poolers now support the latter).</p>
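<p>The session-state restriction can be illustrated with <code>statement_timeout</code>. This is a minimal sketch, assuming a hypothetical table <code>orders</code>; the point is the difference between session-scoped <code>SET</code> and transaction-scoped <code>SET LOCAL</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Session-level: unreliable under transaction pooling. The setting sticks to
-- whichever backend happened to run it, not to this client.
SET statement_timeout = '5s';

-- Transaction-scoped: safe under transaction pooling. The value applies only
-- inside this transaction and is reset at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL statement_timeout = '5s';
SELECT count(*) FROM orders;
COMMIT;</code></pre></div>
<p>Auditing an application for session-level state like this is usually the first step before switching it to transaction pooling.</p>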
<h2 id="application-side-pool-vs-middleware-pool-vs-rds-proxy">Application-side Pool vs Middleware Pool vs RDS Proxy</h2>
<p>All three pooling layers address connection problems, but each solves a different one:</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>Examples</th>
          <th>Problem solved</th>
          <th>Limits</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application-side (driver)</td>
          <td>HikariCP (Java) / pgx pool (Go) / asyncpg / Sequelize</td>
          <td>Connection reuse + lifecycle management</td>
          <td>Each app instance still opens N connections to PG; the total does not converge</td>
      </tr>
      <tr>
          <td>Middleware pooler</td>
          <td>PgBouncer / PgCat</td>
          <td>Multiplexes all application instances onto a small number of backends</td>
          <td>One extra hop (0.1-1 ms latency); HA is self-managed</td>
      </tr>
      <tr>
          <td>Cloud-managed proxy</td>
          <td>RDS Proxy / Cloud SQL Proxy</td>
          <td>Multiplex + IAM auth + Secrets Manager integration</td>
          <td>1-3 ms latency, cost premium, restricted PG features</td>
      </tr>
  </tbody>
</table>
<p><strong>Typical production topology</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Application (HikariCP pool 10/instance × 50 instance = 500)
</span></span><span class="line"><span class="ln">2</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">PgBouncer transaction pool（50 backend）
</span></span><span class="line"><span class="ln">4</span><span class="cl">   ↓
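</span></span><span class="line"><span class="ln">5</span><span class="cl">PostgreSQL primary</span></span></code></pre></div><p>The application pool avoids per-request connection setup (and the fork behind it), while PgBouncer caps the total number of backends; the two layers do different jobs and do not conflict.</p>
<p><strong>Two-layer pool configuration is easy to get wrong</strong>: with an application pool size of 5, a PgBouncer default_pool_size of 50, and 100 app instances, the applications are willing to open 500 connections while PgBouncer provides only 50 backends. When all of them are busy, the other 450 application connections wait, which looks like "the DB is slow" but is really an undersized pool.</p>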
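<h2 id="5-個-production-踩雷">5 Production Pitfalls</h2>
<h3 id="case-1connection-storm重啟--autoscale-同時打進來">Case 1: Connection storm (restart / autoscale arriving at once)</h3>
<p><strong>Scenario</strong>: a Kubernetes rolling restart makes 200 pods reconnect at the same time; each pod opens 20 connections, so 4000 connections try to reach PG in an instant.</p>
<p>With <code>max_connections = 500</code>, PG rejects 3500 of them outright, the application sees <code>FATAL: sorry, too many clients already</code>, and the resulting retry storm makes things worse.</p>
<p>Fix:</p>
<ul>
<li>Put PgBouncer in front; the application connects to PgBouncer, never directly to PG</li>
<li>Set <code>reserve_pool_size = 5</code> so a pool can borrow a few extra server connections when clients have waited longer than <code>reserve_pool_timeout</code>; admin access on the PG side is protected separately by <code>superuser_reserved_connections</code></li>
<li>Add jittered exponential backoff on the application side so retries do not synchronize</li>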
</ul>
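<p>To see how close a server is to the rejection point described in this case, the current usage can be compared with the configured limits. A minimal sketch, run directly against PG rather than through the pooler:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Client connections in use vs the configured ceiling and the slots
-- held back for superusers.
SELECT count(*) AS in_use,
       current_setting('max_connections')::int AS max_connections,
       current_setting('superuser_reserved_connections')::int AS reserved_for_superuser
FROM pg_stat_activity
WHERE backend_type = 'client backend';</code></pre></div>
<p>Ordinary clients start getting rejected once <code>in_use</code> reaches roughly <code>max_connections</code> minus the reserved slots, so alerting well below that threshold leaves room to react.</p>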
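<h3 id="case-2fork-cost-在-burst-流量">Case 2: fork() cost under burst traffic</h3>
<p><strong>Scenario</strong>: a cron job fires on the minute; 500 workers simultaneously open short-lived connections, run a 30 ms query, and disconnect.</p>
<p>That is 500 <code>fork()</code> calls plus 500 <code>exit()</code> calls every minute, a fork cost of 500-1500 ms, a CPU spike, and a latency surge for the other OLTP queries.</p>
<p>Fix:</p>
<ul>
<li>Point the workers at a PgBouncer transaction pool so backends are reused; forks happen only when PgBouncer first grows the pool</li>
<li>Or turn the workers into long-lived processes with an internal task queue, so nothing is re-forked every minute</li>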
</ul>
<h3 id="case-3shared_buffers-跟-max_connections-互相壓縮">Case 3: shared_buffers and max_connections squeezing each other</h3>
<p><strong>Scenario</strong>: 16 GB instance, <code>shared_buffers = 8GB</code> (50%), <code>max_connections = 800</code>, <code>work_mem = 16MB</code>.</p>
<p>Estimated RAM: 8 GB + 800 × ~30 MB ≈ 32 GB, far above the 16 GB instance, so the OOM killer shows up.</p>
<p>Fix (rebalance):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">shared_buffers</span> <span class="o">=</span> <span class="s">4GB           # 25%</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">max_connections</span> <span class="o">=</span> <span class="s">200          # multiplexed through PgBouncer</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">work_mem</span> <span class="o">=</span> <span class="s">16MB</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">effective_cache_size</span> <span class="o">=</span> <span class="s">12GB</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">maintenance_work_mem</span> <span class="o">=</span> <span class="s">512MB</span></span></span></code></pre></div><p>Key point: when connections run short, the answer is not a larger <code>max_connections</code>; it is tuning the <em>PgBouncer pool size</em> to expand the capacity the application sees.</p>
<h3 id="case-4double-pool-配置失敗">Case 4: Misconfigured double pool</h3>
<p><strong>Scenario</strong>: application HikariCP pool size = 50, 50 instances, PgBouncer <code>default_pool_size = 20</code>, PG <code>max_connections = 100</code>.</p>
<p>The applications are willing to open 2500 connections, PgBouncer provides only 20 backends, and large numbers of application threads block at PgBouncer waiting for a backend to be released.</p>
<p>Fix:</p>
<ul>
<li>Work out <em>the concurrency the application wants</em> vs <em>the backends PgBouncer allows</em> vs <em>PG max_connections</em>, and make the three layers match</li>
<li>As a rule, keep <code>application_total_connection ≤ pgbouncer_max_client_conn</code> and <code>pgbouncer_default_pool_size + reserve &lt; pg_max_connections</code> (the pool size applies per database/user pair)</li>
<li>Monitor <code>cl_waiting</code> in PgBouncer <code>SHOW POOLS</code>; a value that stays &gt; 0 means the pool is too small</li>
</ul>
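<p>The monitoring step above uses PgBouncer's admin console rather than PG itself. A minimal sketch, assuming the console is reached with something like <code>psql -p 6432 -U pgbouncer pgbouncer</code> (port, user, and database name depend on your own PgBouncer configuration):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">SHOW POOLS;
SHOW CONFIG;</code></pre></div>
<p>In the <code>SHOW POOLS</code> output, <code>cl_waiting</code> is the number of clients queued for a server connection and <code>maxwait</code> is how long the oldest of them has waited; <code>SHOW CONFIG</code> confirms the effective <code>default_pool_size</code>, <code>reserve_pool_size</code>, and <code>max_client_conn</code>.</p>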
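<h3 id="case-5max_connections-設太大反而慢">Case 5: max_connections set too high makes things slower</h3>
<p><strong>Scenario</strong>: the team sees connections being rejected (<code>FATAL: sorry, too many clients already</code>) and raises <code>max_connections</code> from 200 to 2000, reasoning that "more connections should be better".</p>
<p>After the change, throughput drops by 30%: context-switch overhead, snapshot computation, and lock manager contention all grow with the number of connections.</p>
<p>Fix:</p>
<ul>
<li>Keep <code>max_connections</code> at roughly 200-500 at most; anything beyond that should go through a pooler</li>
<li>Use <code>pg_stat_activity</code> to see the truly active connections (<code>state != 'idle'</code>), which is usually &lt; 100</li>
<li>The real ceiling is the active high-water mark × a safety factor of 1.5, not "the number we might need someday"</li>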
</ul>
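<p>The <code>pg_stat_activity</code> check mentioned above can be sketched as a breakdown by state, which shows at a glance how many connections are doing work and how many are idle:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Client connections grouped by state (active, idle, idle in transaction, ...).
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY connections DESC;</code></pre></div>
<p>Sampling this during peak traffic gives the active high-water mark to plug into the sizing rule; a large <code>idle in transaction</code> count is a separate problem worth investigating on its own.</p>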
<h2 id="跟-mysql-connection-model-對比">Comparison with the MySQL Connection Model</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PostgreSQL</th>
          <th>MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Connection model</td>
          <td>Process-per-connection (fork)</td>
          <td>Thread-per-connection</td>
      </tr>
      <tr>
          <td>RAM per connection</td>
          <td>5-15 MB (idle) / 30-200 MB (heavy)</td>
          <td>256 KB-2 MB</td>
      </tr>
      <tr>
          <td>Fork / spawn cost</td>
          <td>1-3 ms</td>
          <td>&lt; 100 μs</td>
      </tr>
      <tr>
          <td>Need for a pooler</td>
          <td><strong>Strongly needed</strong> (mandatory at 300+ connections)</td>
          <td>Moderate (ProxySQL helps in specific cases)</td>
      </tr>
      <tr>
          <td>Mainstream poolers</td>
          <td>PgBouncer / PgCat</td>
          <td>ProxySQL / MySQL Router</td>
      </tr>
  </tbody>
</table>
<p>The thread-per-connection model makes MySQL <em>look</em> cheaper in high-connection-count workloads, but the capacity an application sees from PG behind PgBouncer is the same as from a direct MySQL connection; there is simply one more layer of indirection.</p>
<p>Practical impact:</p>
<ul>
<li>MySQL with 1000 direct connections is still fine; PG with 1000 direct connections usually runs out of memory</li>
<li>PG + PgBouncer with 1000 application connections and 50 backends performs on par with MySQL at 1000 direct connections</li>
<li>The conclusion is not that <em>PG inherently uses more RAM</em>, but that <em>PG does not multiplex by default and needs an external multiplexing layer</em></li>
</ul>
<h2 id="pg-17-的-connection-進展">Connection-Related Progress in PG 17+</h2>
<p>PG 17 (2024) still uses process-per-connection, but it ships with several pressure-relief improvements:</p>
<ul>
<li><strong>Lower per-process memory</strong>: catalog cache moved to a generational allocator, cutting idle backend RAM by about 20%</li>
<li><strong>Subscriber-side parallel apply</strong> (available since PG 16): reduces connection overhead for logical replication</li>
<li><strong><code>io_combine_limit</code></strong>: combines buffered reads, lowering syscall overhead</li>
</ul>
<p>The <em>process-per-connection model itself</em> has not changed, so PG still needs a pooler for the foreseeable future. The long-term direction (under discussion for PG 18+) may introduce thread-based backends, but for now that exists only as experimental patches.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a>: PgBouncer operational configuration + 5 cases</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">replication-topology</a>: read replicas + splitting connections across them</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>: how <code>work_mem</code> affects plans</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">mvcc-lock-model</a>: connections left idle in transaction block vacuum</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a>: autovacuum workers are extra backend processes too (budgeted by <code>autovacuum_max_workers</code>, separate from <code>max_connections</code>)</li>
</ul>
<h2 id="下一步">Next steps</h2>
<ul>
<li>Continue to <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgbouncer-config</a> for the configuration details</li>
<li>Go back to the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> for the full picture</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/index-selection/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/index-selection/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>index 選型&lt;/em> — 何時用哪種 index、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization&lt;/a> 的「為什麼這個 plan 慢」互補。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="6-種-index-method-對應-workload">6 種 Index Method 對應 Workload&lt;/h2>
&lt;p>PG 有 6 種 index access method、各有自己擅長的 query pattern：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Index method&lt;/th>
 &lt;th>適用 query pattern&lt;/th>
 &lt;th>典型 column type&lt;/th>
 &lt;th>儲存成本&lt;/th>
 &lt;th>&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>B-tree&lt;/td>
 &lt;td>&lt;code>=&lt;/code> / &lt;code>&amp;lt;&lt;/code> / &lt;code>&amp;gt;&lt;/code> / &lt;code>BETWEEN&lt;/code> / &lt;code>IS NULL&lt;/code> / &lt;code>LIKE 'prefix%'&lt;/code>&lt;/td>
 &lt;td>任何 scalar、最常用&lt;/td>
 &lt;td>中&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Hash&lt;/td>
 &lt;td>純 &lt;code>=&lt;/code> 比對&lt;/td>
 &lt;td>scalar、不常用&lt;/td>
 &lt;td>低&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>GIN&lt;/td>
 &lt;td>&lt;code>@&amp;gt;&lt;/code> / &lt;code>?&lt;/code> / &lt;code>?|&lt;/code> / FTS / array 包含&lt;/td>
 &lt;td>JSONB / tsvector / array&lt;/td>
 &lt;td>高（write 慢）&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>GiST&lt;/td>
 &lt;td>範圍 / 空間 / 自訂 operator&lt;/td>
 &lt;td>geometry / tsvector / range&lt;/td>
 &lt;td>中&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>SP-GiST&lt;/td>
 &lt;td>Non-balanced 樹結構&lt;/td>
 &lt;td>IP / phone prefix / quad-tree&lt;/td>
 &lt;td>中&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>BRIN&lt;/td>
 &lt;td>大表的 range scan、physical order 跟 logical order 相關&lt;/td>
 &lt;td>timestamp / id（append-only）&lt;/td>
 &lt;td>極低&lt;/td>
 &lt;td>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>選錯 index 的代價：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is an implementation-layer deep-dive companion to the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on <em>index selection</em>, that is, which index type to use when. It complements the "why is this plan slow" angle of <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>.</p></blockquote>
<hr>
<h2 id="6-種-index-method-對應-workload">6 Index Methods Mapped to Workloads</h2>
<p>PG has 6 index access methods, each suited to its own query patterns:</p>
<table>
  <thead>
      <tr>
          <th>Index method</th>
          <th>Query patterns it fits</th>
          <th>Typical column types</th>
          <th>Storage cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B-tree</td>
          <td><code>=</code> / <code>&lt;</code> / <code>&gt;</code> / <code>BETWEEN</code> / <code>IS NULL</code> / <code>LIKE 'prefix%'</code></td>
          <td>Any scalar; the most common choice</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Hash</td>
          <td>Pure <code>=</code> comparison</td>
          <td>Scalar; rarely used</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>GIN</td>
          <td><code>@&gt;</code> / <code>?</code> / <code>?|</code> / FTS / array containment</td>
          <td>JSONB / tsvector / array</td>
          <td>High (slow writes)</td>
      </tr>
      <tr>
          <td>GiST</td>
          <td>Range / spatial / custom operators</td>
          <td>geometry / tsvector / range</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>SP-GiST</td>
          <td>Non-balanced tree structures</td>
          <td>IP / phone prefix / quad-tree</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>BRIN</td>
          <td>Range scans on large tables where physical order correlates with logical order</td>
          <td>timestamp / id (append-only)</td>
          <td>Very low</td>
      </tr>
  </tbody>
</table>
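<p>The cost of choosing the wrong index:</p>
<ul>
<li><strong>Write workload</strong>: every write has to update all affected indexes, so 5 unused indexes add roughly 5 extra index writes per row change</li>
<li><strong>Storage</strong>: a GIN index on a JSONB column can end up larger than the table itself</li>
<li><strong>Plan misjudge</strong>: the planner does not necessarily use an index just because it exists; only <code>EXPLAIN</code> confirms it</li>
</ul>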
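<p>A rough way to see the storage cost on a live database is to compare each table's heap size with the total size of its indexes. The sketch below uses only built-in size functions; the schema filter and the <code>LIMIT</code> are illustrative assumptions, not part of the original article:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Tables ranked by total index size (the filter and LIMIT are illustrative).
SELECT
    n.nspname                              AS schema_name,
    c.relname                              AS table_name,
    pg_size_pretty(pg_table_size(c.oid))   AS table_size,
    pg_size_pretty(pg_indexes_size(c.oid)) AS indexes_size
FROM pg_class AS c
JOIN pg_namespace AS n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'                      -- ordinary tables only
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_indexes_size(c.oid) DESC
LIMIT 20;</code></pre></div>
<p>A table whose <code>indexes_size</code> is well above its <code>table_size</code> is worth a closer look.</p>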
<h2 id="b-tree預設選擇95-workload-適用">B-tree: The Default Choice, Fits 95% of Workloads</h2>
<p>B-tree is PG's default index type; <code>CREATE INDEX</code> without an explicit method creates a B-tree:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_user_id</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_created_at</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">);</span></span></span></code></pre></div><p>Queries that B-tree handles well:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Equality
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">42</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Range
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2025-01-31&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- IS NULL
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">shipped_at</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Prefix LIKE
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">sku</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;ABC%&#39;</span><span class="p">;</span></span></span></code></pre></div><p>Where B-tree falls short:</p>
<ul>
<li><code>LIKE '%suffix'</code> (leading wildcard) → use trigram + GIN instead</li>
<li><code>column @&gt; array</code> (containment) → use GIN instead</li>
<li>Path queries inside JSON → use GIN on JSONB instead</li>
</ul>
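<p>One caveat on the prefix <code>LIKE</code> case: a plain B-tree serves <code>LIKE 'ABC%'</code> only when the column uses the C collation. Under other collations, a B-tree built with the <code>text_pattern_ops</code> operator class is the usual workaround. A minimal sketch, assuming <code>products.sku</code> is a <code>text</code> column and using a hypothetical index name:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical index name; assumes products.sku is text in a non-C collation.
CREATE INDEX idx_products_sku_pattern ON products (sku text_pattern_ops);

-- Check the plan: the prefix match should now be able to use the index above.
EXPLAIN SELECT * FROM products WHERE sku LIKE 'ABC%';</code></pre></div>
<p>An index built this way serves pattern matches and equality, but not ordinary <code>&lt;</code> / <code>&gt;</code> comparisons or <code>ORDER BY</code> under the column's collation, so it may need to sit alongside a regular B-tree.</p>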
<p>Column order matters in a <strong>multi-column B-tree</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Assume the common query is: WHERE user_id = ? AND status = ?
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_user_status</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">status</span><span class="p">);</span><span class="w">  </span><span class="c1">-- good
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_status_user</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">status</span><span class="p">,</span><span class="w"> </span><span class="n">user_id</span><span class="p">);</span><span class="w">  </span><span class="c1">-- poor (status has low selectivity)</span></span></span></code></pre></div><p>Ordering principles:</p>
<ol>
<li><strong>Equality columns first</strong> (high selectivity)</li>
<li><strong>Range columns last</strong> (the B-tree leftmost-prefix rule)</li>
<li><strong>More selective columns first</strong> (they filter out more rows)</li>
</ol>
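<p>The first two rules combine naturally when a query has one equality filter and one range filter. A minimal sketch, assuming the <code>orders</code> table from the earlier examples; the index name is hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical index: equality column (user_id) first, range column (created_at) last.
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

-- The equality on user_id narrows the scan to one contiguous slice of the index,
-- and the range on created_at is then read in order inside that slice.
EXPLAIN
SELECT *
FROM orders
WHERE user_id = 42
  AND created_at &gt;= DATE '2025-01-01'
  AND created_at &lt;  DATE '2025-02-01';</code></pre></div>
<p>With the columns reversed, the range on <code>created_at</code> would lead, and <code>user_id</code> could only be checked entry by entry across that whole date range.</p>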
<h2 id="ginjsonb--fts--array-的標配">GIN: The Standard Choice for JSONB / FTS / Array</h2>
<p>GIN (Generalized Inverted Index) is efficient for columns where a single value contains multiple sub-elements:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- JSONB
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Array
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_tags</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">tags</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Full-text search
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_content</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">content</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Trigram (fuzzy match)
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_trgm</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_name_trgm</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="n">gin_trgm_ops</span><span class="p">);</span></span></span></code></pre></div><p>The costs of GIN:</p>
<ul>
<li><strong>Writes 2-10x slower</strong>: every sub-element has to be updated in the inverted index</li>
<li><strong>Large storage</strong>: the index can be larger than the table</li>
<li><strong>Heavy vacuum</strong>: bloat accumulates quickly</li>
</ul>
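<p>For reference, these are the kinds of predicates each of the GIN indexes above is meant to serve. The literal values are made up for illustration, and <code>tags</code> is assumed to be <code>text[]</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- JSONB containment (idx_products_metadata)
SELECT * FROM products WHERE metadata @&gt; '{"category": "shoes"}';

-- Array containment (idx_articles_tags); assumes tags is text[]
SELECT * FROM articles WHERE tags @&gt; ARRAY['postgres'];

-- Full-text search (idx_articles_content); the expression must match the indexed one
SELECT *
FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'index &amp; scan');

-- Substring match (idx_products_name_trgm)
SELECT * FROM products WHERE name LIKE '%shoe%';</code></pre></div>
<p>Running each statement under <code>EXPLAIN</code> shows whether the planner picks the GIN index or falls back to a sequential scan.</p>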
<p>The choice of <strong>operator class</strong> has a large impact:</p>
<table>
  <thead>
      <tr>
          <th>Op class</th>
          <th>Use case</th>
          <th>Index size</th>
          <th>Supported operators</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>jsonb_ops</code> (default)</td>
          <td>General purpose</td>
          <td>Large</td>
          <td><code>@&gt;</code> / <code>?</code> / <code>?|</code> / <code>?&amp;</code></td>
      </tr>
      <tr>
          <td><code>jsonb_path_ops</code></td>
          <td><code>@&gt;</code> containment only</td>
          <td>About 1/3 to 1/2 of <code>jsonb_ops</code></td>
          <td><code>@&gt;</code> only</td>
      </tr>
  </tbody>
</table>
<p>When queries only use <code>@&gt;</code>, <code>jsonb_path_ops</code> saves a lot of storage.</p>
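<p>As a sketch of what that looks like in practice, assuming the <code>products.metadata</code> column from the earlier examples and a hypothetical index name:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical index name; jsonb_path_ops is a built-in GIN operator class.
CREATE INDEX idx_products_metadata_path
    ON products USING GIN (metadata jsonb_path_ops);

-- Containment can use the index.
SELECT * FROM products WHERE metadata @&gt; '{"category": "shoes"}';

-- Key existence is not supported by jsonb_path_ops, so this query cannot use it.
SELECT * FROM products WHERE metadata ? 'category';</code></pre></div>
<p>If the application later needs <code>?</code>, <code>?|</code>, or <code>?&amp;</code>, the index has to be rebuilt with the default <code>jsonb_ops</code>.</p>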
<h2 id="gist範圍--空間--自訂">GiST: Ranges / Spatial / Custom</h2>
<p>GiST (Generalized Search Tree) is good at ranges and spatial data:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Range types (PostgreSQL built-ins such as int4range / tsrange)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_bookings_period</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">bookings</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GiST</span><span class="w"> </span><span class="p">(</span><span class="n">period</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Spatial (PostGIS)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_locations_geom</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">locations</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GiST</span><span class="w"> </span><span class="p">(</span><span class="n">geom</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Exclusion constraint (no overlapping ranges); room_id WITH = on GiST needs the btree_gist extension
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">bookings</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">no_overlap</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="n">EXCLUDE</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GiST</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">=</span><span class="p">,</span><span class="w"> </span><span class="n">period</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="p">);</span></span></span></code></pre></div><p>Choosing between GiST and GIN for FTS:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>GIN</th>
          <th>GiST</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lookup speed</td>
          <td>About 3x faster</td>
          <td>Slower</td>
      </tr>
      <tr>
          <td>Update speed</td>
          <td>About 3x slower</td>
          <td>Faster</td>
      </tr>
      <tr>
          <td>Index size</td>
          <td>Larger</td>
          <td>Smaller</td>
      </tr>
      <tr>
          <td>Best fit</td>
          <td>Read-heavy FTS</td>
          <td>Write-heavy / frequently updated</td>
      </tr>
  </tbody>
</table>
<p>Most FTS workloads choose GIN: reads dominate, and trading index size for query latency pays off.</p>
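<p>For the write-heavy side of the comparison, the same FTS expression can be indexed with GiST instead. The sketch below uses a hypothetical index name, and also shows the kind of overlap query the range index on <code>bookings.period</code> is meant to serve, assuming <code>period</code> is a <code>tsrange</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical GiST alternative to the GIN FTS index, for update-heavy text.
CREATE INDEX idx_articles_content_gist
    ON articles USING GiST (to_tsvector('english', content));

-- Overlap query on the range column; assumes bookings.period is tsrange.
SELECT *
FROM bookings
WHERE period &amp;&amp; tsrange('2025-01-10 14:00', '2025-01-10 16:00');</code></pre></div>
<p>Keeping both a GIN and a GiST index on the same FTS expression rarely makes sense, since every write pays for both.</p>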
<h2 id="brin大表--physical-order-correlated">BRIN: Large Tables + Physical Order Correlated</h2>
<p>BRIN (Block Range Index) is efficient for columns whose <em>physical storage order strongly correlates with logical order</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- timestamp column (append-only inserts, physical order = time order)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_events_created_at</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">BRIN</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">);</span></span></span></code></pre></div><p>How BRIN works: each block range (128 pages by default) stores the column's min/max, and a query skips every block range whose min/max cannot match.</p>
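<p>Good fits:</p>
<ul>
<li><strong>Append-only tables</strong>: logs, metrics, events</li>
<li><strong>Large tables</strong> (10GB+): a B-tree gets expensive, while a BRIN can be on the order of 1/1000 the size</li>
<li><strong>Column physical order matches the query</strong>: time columns, auto-increment ids</li>
</ul>
<p><strong>When BRIN stops working</strong>:</p>
<ul>
<li>UPDATEs break physical order (the new row version is written to whichever block has free space, not next to its neighbours), so the per-range min/max widen and BRIN loses its pruning power</li>
<li>Random inserts (uuid / hash id): BRIN ranges have no selectivity at all</li>
</ul>
<p><strong>When not to use BRIN</strong>: tables under 1GB (no meaningful storage saving), or columns without physical order correlation (running <code>CLUSTER</code> may improve this).</p>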
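<p>The block range size is tunable through the <code>pages_per_range</code> storage parameter: smaller ranges prune more precisely at the cost of a somewhat larger index. A minimal sketch, assuming the <code>events</code> table above; the index name and the value 32 are illustrative:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical index name; 32 is an illustrative value, the default is 128.
CREATE INDEX idx_events_created_at_fine
    ON events USING BRIN (created_at) WITH (pages_per_range = 32);

-- Summarize block ranges that were filled after the index was built
-- (otherwise this is left to VACUUM).
SELECT brin_summarize_new_values('idx_events_created_at_fine');</code></pre></div>
<p>Until new block ranges are summarized, queries have to read them in full, so on a fast-growing table it is worth checking how often summarization actually runs.</p>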
<h2 id="partial-index條件式-index-救-storage">Partial Index: Conditional Indexes That Save Storage</h2>
<p>When queries only touch <em>a subset of rows</em>, a partial index saves a lot of storage:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Index only unshipped orders
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_unshipped</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">shipped_at</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- Index only active users
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_active</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">email</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- Index only high-value transactions
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_high_value</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">total</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span></span></span></code></pre></div><p>A partial index is used only when the planner can prove that the query's <code>WHERE</code> clause <em>implies the index predicate</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Can use the partial index
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">shipped_at</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Cannot use it (the WHERE clause does not imply the partial predicate)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="p">;</span></span></span></code></pre></div><p>Size savings in practice: if unshipped orders are only 1% of all rows, the partial index is roughly 1/100 the size of a full index.</p>
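<p>The query does not have to repeat the predicate word for word: the planner can prove simple implications between inequalities. A sketch against the <code>idx_orders_high_value</code> index defined above (<code>WHERE total &gt; 1000</code>); the constants are illustrative:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- total &gt; 2000 implies total &gt; 1000, so the partial index is a candidate.
EXPLAIN SELECT * FROM orders WHERE user_id = 42 AND total &gt; 2000;

-- total &gt; 500 does not imply total &gt; 1000, so the partial index cannot be used.
EXPLAIN SELECT * FROM orders WHERE user_id = 42 AND total &gt; 500;</code></pre></div>
<p>The proof happens at plan time, so it only works when the compared value is visible to the planner as a constant.</p>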
<h2 id="expression-index對函式結果-index">Expression Index: Indexing a Function Result</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Index on lowercased email (case-insensitive search)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_email_lower</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="k">lower</span><span class="p">(</span><span class="n">email</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">lower</span><span class="p">(</span><span class="n">email</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">lower</span><span class="p">(</span><span class="s1">&#39;USER@example.com&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- Index on a field inside JSONB
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_category</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">((</span><span class="n">metadata</span><span class="o">-&gt;&gt;</span><span class="s1">&#39;category&#39;</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="o">-&gt;&gt;</span><span class="s1">&#39;category&#39;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;shoes&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- Index on a truncated date
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_day</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">date_trunc</span><span class="p">(</span><span class="s1">&#39;day&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">));</span></span></span></code></pre></div><p>The expression must be IMMUTABLE: <code>now()</code> / <code>random()</code> cannot be used, while <code>timezone('UTC', ts)</code> can. Note that <code>date_trunc('day', created_at)</code> is IMMUTABLE only when <code>created_at</code> is a <code>timestamp</code> without time zone; on <code>timestamptz</code> it is STABLE and the index cannot be created.</p>
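<p>If <code>created_at</code> is a <code>timestamptz</code>, <code>date_trunc('day', created_at)</code> depends on the session time zone, so it cannot be indexed directly. One common workaround is to pin the time zone inside the expression. A sketch, assuming <code>created_at</code> is <code>timestamptz</code> and that UTC day boundaries are what the queries want; the index name is hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Assumes orders.created_at is timestamptz; hypothetical index name.
CREATE INDEX idx_orders_day_utc
    ON orders (date_trunc('day', created_at AT TIME ZONE 'UTC'));

-- The query must repeat the same expression for the index to be considered.
SELECT count(*)
FROM orders
WHERE date_trunc('day', created_at AT TIME ZONE 'UTC') = TIMESTAMP '2025-01-15 00:00:00';</code></pre></div>
<p>For plain day-range filters, a regular B-tree on <code>created_at</code> with a <code>&gt;=</code> / <code>&lt;</code> range is often simpler than an expression index.</p>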
<h2 id="covering-indexinclude避免回表">Covering Index (INCLUDE): Avoiding Heap Lookups</h2>
<p>PG 11+ supports <code>INCLUDE</code> columns:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Index only user_id, but queries often need email too
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_user_id_covering</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w"> </span><span class="n">INCLUDE</span><span class="w"> </span><span class="p">(</span><span class="n">email</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Index-only scan: no heap lookup needed
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">email</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">42</span><span class="p">;</span></span></span></code></pre></div><p><code>INCLUDE</code> columns take no part in sorting or equality matching; they are stored only in leaf nodes and save IO.</p>
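<p>Whether the scan really skips the heap also depends on the visibility map, which <code>VACUUM</code> maintains. A sketch of how one might check, using the index and query above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Refresh the visibility map and planner statistics first.
VACUUM (ANALYZE) users;

-- In the output, look for an "Index Only Scan" node on the covering index
-- and a low "Heap Fetches" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT email FROM users WHERE user_id = 42;</code></pre></div>
<p>A high <code>Heap Fetches</code> number on a heavily updated table usually means the pages are not yet marked all-visible, so the covering index saves less IO than expected.</p>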
<h2 id="index-選擇決策樹">Index Selection Decision Tree</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">What is the query pattern?
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">├─ Equality / range / prefix LIKE / IS NULL
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│  └─ B-tree (90% of cases)
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│     ├─ Querying only a subset of rows? → Partial B-tree
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">│     ├─ Filtering on a function result? → Expression B-tree
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│     └─ Need extra columns without a heap lookup? → Covering (INCLUDE)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">├─ Queries inside JSONB / array containment / FTS
</span></span><span class="line"><span class="ln">10</span><span class="cl">│  └─ GIN
</span></span><span class="line"><span class="ln">11</span><span class="cl">│     ├─ Only using @&gt;? → jsonb_path_ops saves storage
</span></span><span class="line"><span class="ln">12</span><span class="cl">│     └─ Write-heavy FTS? → switch to GiST
</span></span><span class="line"><span class="ln">13</span><span class="cl">│
</span></span><span class="line"><span class="ln">14</span><span class="cl">├─ Range types (int4range / tsrange) / spatial
</span></span><span class="line"><span class="ln">15</span><span class="cl">│  └─ GiST
</span></span><span class="line"><span class="ln">16</span><span class="cl">│
</span></span><span class="line"><span class="ln">17</span><span class="cl">├─ Large table + append-only + physical order correlated
</span></span><span class="line"><span class="ln">18</span><span class="cl">│  └─ BRIN
</span></span><span class="line"><span class="ln">19</span><span class="cl">│
</span></span><span class="line"><span class="ln">20</span><span class="cl">├─ Pure equality + simple column
</span></span><span class="line"><span class="ln">21</span><span class="cl">│  └─ Hash (rarely used; B-tree is usually better)
</span></span><span class="line"><span class="ln">22</span><span class="cl">│
</span></span><span class="line"><span class="ln">23</span><span class="cl">└─ Non-balanced trees (IP prefix / quad-tree)
</span></span><span class="line"><span class="ln">24</span><span class="cl">   └─ SP-GiST (rare)</span></span></code></pre></div><h2 id="5-個-production-踩雷">5 Production Pitfalls</h2>
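<h3 id="case-1過度-indexwrite-放大">Case 1: Over-indexing (write amplification)</h3>
<p><strong>Scenario</strong>: "to make queries fast", the team builds an index on each of 20 columns; under heavy write load INSERTs become 10x slower.</p>
<p>Every INSERT has to update 20 indexes, WAL volume grows with it, and replication lag gets longer.</p>
<p>Fixes:</p>
<ul>
<li>Use <code>pg_stat_user_indexes</code> to find indexes with <em>idx_scan = 0</em>; they may not be used at all</li>
<li>Use <code>pg_stat_statements</code> to find the queries that actually run, then work backwards to the indexes that are really needed</li>
<li>When several indexes share a leading column (a single-column <code>user_id</code> index plus a multi-column <code>(user_id, status)</code> index), the single-column one can usually be dropped</li>
</ul>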
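<p>The first fix can be sketched as a query over <code>pg_stat_user_indexes</code>. Excluding unique indexes is an assumption added here, on the grounds that they enforce constraints even when never scanned:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Indexes with zero scans since the statistics were last reset, largest first.
SELECT
    s.schemaname,
    s.relname                                      AS table_name,
    s.indexrelname                                 AS index_name,
    s.idx_scan,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes AS s
JOIN pg_index AS i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique   -- keep unique / primary key indexes out of the drop list
ORDER BY pg_relation_size(s.indexrelid) DESC;</code></pre></div>
<p>The counters are kept per server and start from zero after a statistics reset, so an index that looks unused on the primary may still serve reads on a replica or a monthly job. Check both before dropping anything.</p>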
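<h3 id="case-2partial-index-條件跟-query-不匹配">Case 2: Partial index predicate does not match the query</h3>
<p><strong>Scenario</strong>: a partial index is built with <code>WHERE status = 'active'</code>, but the application writes the condition in a form the planner cannot prove equivalent at plan time, so the index is skipped. A single-element <code>IN ('active')</code> is typically rewritten to <code>=</code> and still matches; the usual offenders are a bind parameter (<code>status = $1</code>) planned as a generic plan, or <code>= ANY($1)</code> with an array parameter.</p>
<p>Fixes:</p>
<ul>
<li>Write the partial predicate in the most generic form (avoid differences between IN / OR and =)</li>
<li>After writing the query, verify with <code>EXPLAIN</code> that it really uses the partial index</li>
<li>Standardize how the application writes the query; do not mix <code>=</code>, <code>IN</code>, and <code>ANY</code></li>
</ul>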
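<p>One way the match can fail even when the query text looks right is bind parameters: the predicate is proven at plan time, so a generic plan for <code>status = $1</code> cannot assume the value is <code>'active'</code>. A sketch against the <code>idx_users_active</code> index from the partial index section; the prepared statement name is hypothetical and <code>plan_cache_mode</code> requires PG 12+:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical prepared statement; assumes users.status and users.email are text.
PREPARE active_user_by_email(text, text) AS
    SELECT * FROM users WHERE status = $1 AND email = $2;

-- Force a generic plan to see what happens when status is unknown at plan time.
SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE active_user_by_email('active', 'user@example.com');
RESET plan_cache_mode;

-- With the literal in the query text, the predicate can be proven
-- and the partial index is a candidate again.
EXPLAIN SELECT * FROM users WHERE status = 'active' AND email = 'user@example.com';

DEALLOCATE active_user_by_email;</code></pre></div>
<p>If the first plan avoids the partial index and the second uses it, hard-coding the predicate value in the SQL text (and parameterizing only <code>email</code>) is one possible fix.</p>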
<h3 id="case-3b-tree-對-jsonb-內部欄位無效">Case 3: B-tree is ineffective for fields inside JSONB</h3>
<p><strong>Scenario</strong>: a B-tree is built on the <code>metadata</code> JSONB column, and the query <code>metadata-&gt;&gt;'category' = 'shoes'</code> does not use the index.</p>
<p>The B-tree orders <em>the whole JSONB value</em>, but a path query is not a comparison against the whole JSONB value.</p>
<p>Fix:</p>
<ul>
<li>For a fixed path, build an expression index: <code>CREATE INDEX ... ON products ((metadata-&gt;&gt;'category'))</code></li>
<li>For dynamic paths, build a GIN index: <code>CREATE INDEX ... USING GIN (metadata)</code></li>
<li>Both can coexist; use <code>EXPLAIN</code> to see which one the planner picks</li>
</ul>
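<p>A short sketch of both options, with index names that are illustrative assumptions. One detail worth showing: the default GIN operator class on JSONB serves containment and key-existence operators such as <code>@&gt;</code>, not <code>-&gt;&gt;</code> equality, so the query has to be written to match the index it is meant to use.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Fixed path: the expression index serves equality on -&gt;&gt;
CREATE INDEX idx_products_category ON products ((metadata-&gt;&gt;'category'));
EXPLAIN SELECT * FROM products WHERE metadata-&gt;&gt;'category' = 'shoes';

-- Dynamic paths: the GIN index serves containment (@&gt;)
CREATE INDEX idx_products_metadata_gin ON products USING GIN (metadata);
EXPLAIN SELECT * FROM products WHERE metadata @&gt; '{"category": "shoes"}';</code></pre></div>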
<h3 id="case-4brin-對非-correlated-資料無效">Case 4: BRIN is ineffective on non-correlated data</h3>
<p><strong>Scenario</strong>: a BRIN index is built on <code>user_id</code> (a random UUID), and queries still run a full seq scan.</p>
<p>A UUID has no physical-order correlation, so the min/max of every block range covers the whole ID space and BRIN prunes nothing.</p>
<p>Fix:</p>
<ul>
<li>Use BRIN only on <code>timestamp</code>, auto-increment <code>id</code>, or other naturally correlated columns</li>
<li>Check the <code>correlation</code> value in <code>pg_stats</code>; an absolute value below 0.1 means the column is a poor fit for BRIN</li>
<li>If a random column really needs an index, go back to B-tree</li>
</ul>
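<p>The correlation check can be sketched like this. The <code>events</code> table and its <code>created_at</code> column are hypothetical; <code>pg_stats</code> and its <code>correlation</code> column are standard, and the values only reflect the most recent <code>ANALYZE</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Refresh statistics, then read the physical-order correlation per column
ANALYZE events;

SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'events';

-- BRIN on a column whose values follow insertion order
CREATE INDEX idx_events_created_brin ON events USING BRIN (created_at);</code></pre></div>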
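<h3 id="case-5multi-column-index-順序錯">Case 5: Wrong column order in a multi-column index</h3>
<p><strong>Scenario</strong>: a common query is <code>WHERE status = 'pending' AND user_id = 42</code>, the index is built as <code>(status, user_id)</code>, and performance is poor.</p>
<p><code>status</code> has only 5 distinct values (selectivity 1/5); <code>user_id</code> has 1M distinct values (selectivity 1/1M). With status as the leftmost column, any query that does not pin both columns with equality (for example, one that filters on <code>user_id</code> alone) has to scan a very large index range.</p>
<p>Fix:</p>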





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- split into two indexes, or reorder the columns
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_user_status</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">status</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- or add a partial index that pins the low-selectivity column
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_orders_pending</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pending&#39;</span><span class="p">;</span></span></span></code></pre></div><h2 id="跟-mysql-index-差異">Differences from MySQL Indexes</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PostgreSQL</th>
          <th>MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Index method</td>
          <td>6 built in (B-tree / Hash / GIN / GiST / SP-GiST / BRIN)</td>
          <td>Mainly B-tree; R-tree separately for spatial</td>
      </tr>
      <tr>
          <td>Default</td>
          <td>B-tree</td>
          <td>B-tree (InnoDB clustered)</td>
      </tr>
      <tr>
          <td>Clustered index</td>
          <td>No native support (CLUSTER is a one-time reorder)</td>
          <td>InnoDB primary key is always clustered</td>
      </tr>
      <tr>
          <td>Covering</td>
          <td>INCLUDE (PG 11+)</td>
          <td>Supported naturally (secondary index carries the PK)</td>
      </tr>
      <tr>
          <td>JSON index</td>
          <td>GIN on JSONB (strong)</td>
          <td>functional index on JSON (weak)</td>
      </tr>
      <tr>
          <td>Partial index</td>
          <td>Native support</td>
          <td>No true partial index (prefix indexes only)</td>
      </tr>
      <tr>
          <td>Expression index</td>
          <td>Native support</td>
          <td>8.0.13+ functional key parts (5.7: index on a generated column)</td>
      </tr>
      <tr>
          <td>BRIN-like</td>
          <td>Native</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Spatial</td>
          <td>GiST / PostGIS</td>
          <td>R-tree (basic)</td>
      </tr>
  </tbody>
</table>
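<p>PG's index system is more expressive than MySQL's, but the cost is that <em>choosing the right index method is the application's responsibility</em>; MySQL's default B-tree is good enough for most scenarios.</p>
<h2 id="相關連結">Related Links</h2>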
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>: use EXPLAIN to see whether an index is used</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>: JSONB + GIN details</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/full-text-search/" data-link-title="PostgreSQL Full-Text Search：tsvector / tsquery / GIN index 跟 pg_trgm fuzzy 三層搜尋" data-link-desc="PG 內建 full-text search 用 *tsvector / tsquery / GIN index* 三件組、適合中小規模搜尋（&lt; 100M 文件）；pg_trgm 提供 fuzzy match。本文走 FTS 機制（tsvector 是 lexeme &#43; position 的 vector）、3 種 query（match / ranking / weighted）、multi-language support、跟 pg_trgm fuzzy match 互補、5 production 踩雷（dictionary 選錯 / GIN 跟 GiST 取捨 / ranking 評分權重 / multi-language column 處理 / 何時不該用 PG FTS 改 Elasticsearch）">full-text-search</a>: FTS + GIN</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a>: index bloat</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">online-schema-change</a>: CREATE INDEX CONCURRENTLY</li>
</ul>
<h2 id="下一步">Next Steps</h2>
<ul>
<li>See <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a> to verify whether an index is actually used by the plan</li>
<li>Return to the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> for the full picture</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Citus Distributed: Turning PG into a Sharded Cluster with an Extension</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/citus-distributed/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/citus-distributed/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already covers where PG sits in the OLTP spectrum; this article focuses on the &lt;em>Citus distributed extension&lt;/em>, a way to turn PG into a sharded cluster.&lt;/p>&lt;/blockquote>
&lt;hr>
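&lt;p>When PG single-primary write throughput hits the single-node ceiling (roughly 50K-100K writes per second, depending on hardware and workload), there are three options:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Application-level sharding&lt;/strong>: the application manages shard routing itself&lt;/li>
&lt;li>&lt;strong>Citus&lt;/strong>: a PG extension with automatic routing + cross-shard queries&lt;/li>
&lt;li>&lt;strong>Distributed SQL&lt;/strong> (CockroachDB / Aurora DSQL / Spanner): a different engine&lt;/li>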
&lt;/ol>
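&lt;p>The core driver for choosing Citus: &lt;em>keeping PG SQL syntax + the extension ecosystem&lt;/em>. But "the application barely needs to change" is the optimistic version. In practice the application has to be redesigned around the distribution column (add the filter to queries, keep transactions within one shard, keep reference table volume under control), and cross-shard query automation is weaker than in Vitess. The cost is &lt;em>coordinator / worker deployment complexity + cross-shard query limits + the work of reshaping the application schema&lt;/em>.&lt;/p>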
&lt;p>Before reading, it helps to align on the shard key, routing, resharding, and cross-shard query semantics in &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding&lt;/a>; for capacity imbalance, continue with &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition&lt;/a>.&lt;/p>
&lt;p>The core difference from &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess sharding&lt;/a>: Citus is a &lt;em>PG extension&lt;/em> (it runs inside PG), while Vitess is a &lt;em>standalone proxy + tablet system&lt;/em> (wrapping MySQL). Citus uses PG-native mechanisms (planner / executor extension hooks), while Vitess is an &lt;em>external wrapper&lt;/em>.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already covers where PG sits in the OLTP spectrum; this article focuses on the <em>Citus distributed extension</em>, a way to turn PG into a sharded cluster.</p></blockquote>
<hr>
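<p>When PG single-primary write throughput hits the single-node ceiling (roughly 50K-100K writes per second, depending on hardware and workload), there are three options:</p>
<ol>
<li><strong>Application-level sharding</strong>: the application manages shard routing itself</li>
<li><strong>Citus</strong>: a PG extension with automatic routing + cross-shard queries</li>
<li><strong>Distributed SQL</strong> (CockroachDB / Aurora DSQL / Spanner): a different engine</li>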
</ol>
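<p>The core driver for choosing Citus: <em>keeping PG SQL syntax + the extension ecosystem</em>. But "the application barely needs to change" is the optimistic version. In practice the application has to be redesigned around the distribution column (add the filter to queries, keep transactions within one shard, keep reference table volume under control), and cross-shard query automation is weaker than in Vitess. The cost is <em>coordinator / worker deployment complexity + cross-shard query limits + the work of reshaping the application schema</em>.</p>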
<p>Before reading, it helps to align on the shard key, routing, resharding, and cross-shard query semantics in <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding</a>; for capacity imbalance, continue with <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a>.</p>
<p>The core difference from <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess sharding</a>: Citus is a <em>PG extension</em> (it runs inside PG), while Vitess is a <em>standalone proxy + tablet system</em> (wrapping MySQL). Citus uses PG-native mechanisms (planner / executor extension hooks), while Vitess is an <em>external wrapper</em>.</p>
<h2 id="citus-架構coordinator--worker">Citus Architecture: Coordinator + Worker</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">                ┌─────────────────┐
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   Application  │   Coordinator   │  ← external PG wire protocol, planner, routing
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">                │   (Citus + PG)  │
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">                └────┬─────┬──────┘
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                     │     │
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">              ┌──────┘     └──────┐
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">              ▼                   ▼
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        ┌──────────┐         ┌──────────┐
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        │ Worker 1 │         │ Worker 2 │  ← each runs PG + Citus extension
</span></span><span class="line"><span class="ln">10</span><span class="cl">        │  (PG)    │         │  (PG)    │
</span></span><span class="line"><span class="ln">11</span><span class="cl">        │ shard 1,3│         │ shard 2,4│
</span></span><span class="line"><span class="ln">12</span><span class="cl">        └──────────┘         └──────────┘</span></span></code></pre></div><p><strong>Coordinator</strong>:</p>
<ul>
<li>Looks like PG to the application (same port / same wire protocol)</li>
<li>Receives SQL → the Citus planner decomposes the query and routes it to workers</li>
<li>Holds no data for distributed tables (their shards live on the workers)</li>
<li>Stores <em>metadata</em> (which shard is on which worker)</li>
</ul>
<p><strong>Worker</strong>:</p>
<ul>
<li>A standard PG instance + Citus extension</li>
<li>Each stores a number of shards</li>
<li>Receives queries from the coordinator, executes them locally, and returns the results</li>
</ul>
<p><strong>Shard</strong>:</p>
<ul>
<li>A distributed table is split into N shards (32 by default)</li>
<li>Each shard is a <em>physical PG table</em> on a worker (with a <code>_&lt;shardid&gt;</code> suffix)</li>
<li>It behaves like an ordinary PG table; you can connect to the worker directly and access it with PG tools</li>
</ul>
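<p>The shard-to-worker mapping described above can be inspected from the coordinator. This is a sketch that assumes Citus 10 or later (where the <code>citus_shards</code> view is available) and a distributed table named <code>orders</code>, like the one created in the next section.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on the coordinator: where each shard of "orders" lives
SELECT shardid,
       shard_name,
       nodename,
       nodeport,
       pg_size_pretty(shard_size) AS size
FROM citus_shards
WHERE table_name = 'orders'::regclass
ORDER BY shardid;</code></pre></div>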
<h2 id="3-種-table-type">3 Table Types</h2>
<h3 id="distributed-table--跨-shard-切分">Distributed table: split across shards</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- create an ordinary PG table
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">BIGSERIAL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="k">TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w">  </span><span class="c1">-- the PK must include the distribution column
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- use Citus to make it distributed
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span></span></span></code></pre></div><p><code>user_id</code> is the <em>distribution column</em>: Citus uses its hash to decide which shard a row belongs to. The PK must include the distribution column (the same requirement as MySQL partitioning).</p>
<p>Compared with Vitess Vindex:</p>
<ul>
<li>Citus: hash of the distribution column → shard (a single hash function; the algorithm is not selectable)</li>
<li>Vitess: several Vindex types to choose from (hash / lookup_hash / xxhash / null)</li>
</ul>
<h3 id="reference-table--全-shard-共有">Reference table: shared by all shards</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">price</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_reference_table</span><span class="p">(</span><span class="s1">&#39;products&#39;</span><span class="p">);</span></span></span></code></pre></div><p><code>products</code> has <em>a full copy on every worker</em>; the coordinator broadcasts each write to all workers.</p>
<p>Use cases:</p>
<ul>
<li>Small lookup tables (country codes, product categories, and so on)</li>
<li>When joining against distributed tables, the reference table is present on every worker, so no cross-shard step is needed</li>
<li>Low write frequency (broadcast cost grows linearly with the number of workers)</li>
</ul>
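<p>A sketch of the JOIN case from the list above. <code>order_items</code> is a hypothetical distributed table (distribution column <code>user_id</code>, with <code>product_id</code> and <code>quantity</code> columns) introduced only for illustration; <code>products</code> is the reference table defined earlier. Because every worker holds a full copy of <code>products</code>, the join can run locally next to each shard.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- order_items is hypothetical: distributed by user_id
SELECT p.name,
       sum(oi.quantity) AS units
FROM order_items oi
JOIN products p ON p.id = oi.product_id   -- reference table, local on every worker
WHERE oi.user_id = 12345                  -- distribution column filter
GROUP BY p.name;</code></pre></div>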
<h3 id="local-table--coordinator-上的-pg-table">Local table: a PG table on the coordinator</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">audit_log</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">event</span><span class="w"> </span><span class="n">JSONB</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- no Citus function is called; the table stays on the coordinator by default</span></span></span></code></pre></div><p>It behaves like an ordinary PG table. Use it for tables that <em>do not need to be distributed</em> (such as admin metadata).</p>
<h2 id="colocation跨-distributed-table-同-shard-對齊">Colocation: aligning shards across distributed tables</h2>
<p>When two distributed tables use the <em>same distribution column</em> (for example <code>user_id</code>) and the same shard count, Citus colocates them automatically:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;user_addresses&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">colocate_with</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;orders&#39;</span><span class="p">);</span></span></span></code></pre></div><p>After colocation:</p>
<ul>
<li>The orders and user_addresses rows for <code>user_id = 100</code> sit in shards on <em>the same worker</em></li>
<li>JOINs do not cross workers, so they are efficient</li>
<li>PG-native FK constraints can be used (across tables, but within the same shard)</li>
</ul>
<p>Colocation is Citus's core mechanism for <em>cross-table consistency</em>. Without colocation, cross-table queries become cross-worker queries and efficiency drops sharply.</p>
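<p>To confirm that two tables really are colocated, and to see the kind of join that benefits, a sketch along these lines may help. It assumes Citus 10 or later for the <code>citus_tables</code> view; the <code>city</code> column on <code>user_addresses</code> is a hypothetical column used only for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Tables that share a colocation_id are colocated
SELECT table_name, distribution_column, colocation_id, shard_count
FROM citus_tables
WHERE table_name IN ('orders'::regclass, 'user_addresses'::regclass);

-- Colocated join: joining and filtering on the distribution column
-- keeps the work on a single worker
SELECT o.id, o.amount, a.city
FROM orders o
JOIN user_addresses a ON a.user_id = o.user_id
WHERE o.user_id = 100;</code></pre></div>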
<h2 id="配置-step-by-steplocal-cluster">Configuration step-by-step (local cluster)</h2>
<p>For production, the usual choice is a managed service: Citus Cloud (Microsoft-managed) or Azure Cosmos DB for PostgreSQL (same engine). For self-hosting:</p>
<h3 id="step-1coordinator--worker-都裝-pg--citus">Step 1: Install PG + Citus on the coordinator and the workers</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># on every node (coordinator + 2 workers)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">apt install postgresql-14
</span></span><span class="line"><span class="ln">3</span><span class="cl">apt install postgresql-14-citus-12.0
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="nv">shared_preload_libraries</span> <span class="o">=</span> <span class="s1">&#39;citus&#39;</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl">systemctl restart postgresql</span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- run on every node
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">citus</span><span class="p">;</span></span></span></code></pre></div><h3 id="step-2coordinator-註冊-worker">Step 2: Register the workers on the coordinator</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- run on the coordinator
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">citus_add_node</span><span class="p">(</span><span class="s1">&#39;worker1.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">5432</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">citus_add_node</span><span class="p">(</span><span class="s1">&#39;worker2.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">5432</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- verify
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">citus_get_active_worker_nodes</span><span class="p">();</span></span></span></code></pre></div><h3 id="step-3建-distributed-table">Step 3: Create a distributed table</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">BIGSERIAL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="k">TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Citus automatically splits <code>orders</code> into 32 shards (<code>orders_102008</code> and so on) and assigns them to the workers.</p>
<h3 id="step-4application-連-coordinator">Step 4: Connect the application to the coordinator</h3>
<p>The application connection string points at the coordinator IP / port (the application does not need to know the workers exist).</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- run queries from the application; Citus routes them transparently
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">12345</span><span class="p">,</span><span class="w"> </span><span class="mi">50</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- → Citus hashes user_id=12345, finds it belongs to (say) shard 17, and routes to that worker
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">12345</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- → Single-shard query、極快
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="c1">-- → Cross-shard aggregation、Citus 並行跑、合併結果</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-distribution-column-選錯--cross-shard-query-變主流">1. Wrong distribution column: cross-shard queries become the norm</h3>
<p>Choosing <code>created_at</code> or an auto-increment <code>id</code> as the distribution column looks evenly spread, but if <em>most application queries filter on user_id</em>, then <em>every query fans out across shards</em> and performance collapses.</p>
<p>Fixes:</p>
<ul>
<li><em>Pick the column the application filters and joins on most often</em> (usually <code>tenant_id</code> or <code>user_id</code>)</li>
<li>Audit the application's top queries and confirm the distribution column matches the query pattern</li>
<li>Changing the distribution column later means <em>rewriting every shard</em>, which is effectively a resharding project</li>
</ul>
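<p>The audit step above can be sketched roughly as follows. This assumes the <code>pg_stat_statements</code> extension is installed on the coordinator with PG 13+ column names (<code>total_exec_time</code>), and that <code>orders</code> was originally distributed on some other column such as <code>created_at</code>. <code>alter_distributed_table</code> is the Citus 10+ function for changing the distribution column; treat the call as illustrative, since co-located tables may need additional options.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Top statements by call count: check which column they filter on
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;

-- If the audit shows user_id dominates, redistribute on it.
-- This rewrites every shard, so plan it like a resharding.
SELECT alter_distributed_table('orders', distribution_column := 'user_id');</code></pre></div>
<p>Running the audit before the first <code>create_distributed_table</code> call is far cheaper than running the second statement afterwards.</p>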
<h3 id="2-cross-shard-transaction-限制">2. Cross-shard transaction limits</h3>
<p>For a transaction that spans shards (for example, an UPDATE touching rows with two different user_id values), Citus uses <em>2PC</em> (two-phase commit), with limits:</p>
<ul>
<li>Multi-statement transactions that span shards may need <code>SET citus.multi_shard_modify_mode = 'sequential'</code> set explicitly</li>
<li>Isolation guarantees hold per shard; serializability across shards is not guaranteed</li>
<li>DDL is applied across shards sequentially</li>
</ul>
<p>Fixes:</p>
<ul>
<li>Design the schema to avoid cross-shard transactions (transactions that stay within one colocation group are fine)</li>
<li>Where a cross-shard transaction is unavoidable, set the multi-shard mode explicitly</li>
<li>For <em>strict cross-shard consistency</em>, consider a distributed SQL database (CockroachDB / Aurora DSQL)</li>
</ul>
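<p>A minimal sketch of the "set the mode explicitly" fix, using the <code>orders(user_id, amount)</code> table from earlier. The second user_id (67890) is a made-up value assumed to hash to a different shard; whether the setting is actually required depends on the statements involved, so this is one cautious pattern, not the only one.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;
-- Scope the setting to this transaction only
SET LOCAL citus.multi_shard_modify_mode TO 'sequential';

-- The two rows live on different shards, so the commit spans workers
UPDATE orders SET amount = amount - 10 WHERE user_id = 12345;
UPDATE orders SET amount = amount + 10 WHERE user_id = 67890;
COMMIT;</code></pre></div>
<p><code>SET LOCAL</code> keeps the slower sequential mode from leaking into other transactions on the same pooled connection.</p>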
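<h3 id="3-reference-table-過大--寫入廣播-cost-爆">3. Oversized reference table: broadcast write cost explodes</h3>
<p>A reference table has a full copy on every worker, and each write is <em>broadcast to all workers</em>. A reference table with 100K rows and frequent writes means each write lands on N workers, so write cost scales N×.</p>
<p>Fixes:</p>
<ul>
<li>Keep reference tables to <em>small, rarely written</em> lookup data</li>
<li>A very large table should not be a reference table; consider distributing it</li>
<li>Monitor the write rate on reference tables and re-evaluate when it crosses a threshold</li>
</ul>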
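<p>For contrast, here is what a well-suited reference table might look like. The <code>countries</code> table and the <code>orders.country_code</code> column are hypothetical and only serve the illustration; <code>create_reference_table</code> is the Citus function that replicates a table to every worker.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Small, rarely written lookup data: a good reference-table candidate
CREATE TABLE countries (
    code char(2) PRIMARY KEY,
    name text NOT NULL
);
SELECT create_reference_table('countries');

-- Every worker holds a full copy, so this join needs no cross-worker traffic
SELECT c.name, count(*) AS order_count
FROM orders o
JOIN countries c ON c.code = o.country_code
GROUP BY c.name;</code></pre></div>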
<h3 id="4-colocate-沒對齊--隱性-cross-shard-join">4. Misaligned colocation: hidden cross-shard JOINs</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Looks fine, but runs as a slow cross-shard join
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">user_addresses</span><span class="w"> </span><span class="n">ua</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ua</span><span class="p">.</span><span class="n">user_id</span><span class="p">;</span></span></span></code></pre></div><p>If <code>user_addresses</code> is not co-located with <code>orders</code> (for example, it was created without <code>colocate_with =&gt; 'orders'</code> and ended up in a different colocation group), the two tables' shards are placed independently and the JOIN crosses workers.</p>
<p>Fixes:</p>
<ul>
<li>Align <code>colocate_with</code> when creating related tables</li>
<li>Check colocation_id with <code>SELECT * FROM citus_tables</code> to confirm alignment</li>
<li>For JOINs across tables that are not co-located, use a <em>materialized view</em> or split the query in the application layer</li>
</ul>
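<p>The first two fixes can be sketched together. This assumes <code>user_addresses</code> has not been distributed yet and has a <code>user_id</code> column of the same type as <code>orders.user_id</code>; <code>citus_tables</code> is the metadata view available from Citus 10.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Distribute on the same column and co-locate explicitly with orders
SELECT create_distributed_table('user_addresses', 'user_id',
                                colocate_with =&gt; 'orders');

-- Tables meant to join locally should report the same colocation_id
SELECT table_name, distribution_column, colocation_id
FROM citus_tables
WHERE table_name IN ('orders'::regclass, 'user_addresses'::regclass);</code></pre></div>
<p>If the two rows show different <code>colocation_id</code> values, the JOIN in the example above will move data between workers.</p>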
<h3 id="5-worker-failover--coordinator-必須知道">5. Worker failover: the coordinator has to know</h3>
<p>When a worker fails, by default <em>the coordinator only sees queries fail; Citus does not fail over automatically</em>.</p>
<p>Fixes (Citus 11+):</p>
<ul>
<li>Use <em>shard replication</em> (<code>citus.shard_replication_factor = 2</code>) so each shard has a copy on two workers</li>
<li>Run PG streaming replication at the worker layer, with Patroni managing failover</li>
<li>If the coordinator fails, the whole cluster is unavailable, so the coordinator needs HA too (Patroni)</li>
</ul>
<p>Compared with Vitess, Citus's HA story is weaker, so production deployments have to plan it deliberately.</p>
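<p>"The coordinator has to know" translates into checking and updating the coordinator's node metadata. The sketch below uses Citus 10+ function names; the node id, hostname, and port are placeholders. A setup that fronts each worker with a stable DNS name or virtual IP may not need the last step at all.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Which workers does the coordinator currently consider active?
SELECT * FROM citus_get_active_worker_nodes();

-- Node ids and addresses as recorded in the coordinator's metadata
SELECT nodeid, nodename, nodeport, isactive
FROM pg_dist_node;

-- After a standby has been promoted, point the existing node entry at it
-- (2, the hostname, and the port are placeholder values)
SELECT citus_update_node(2, 'worker-1-standby.internal', 5432);</code></pre></div>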
<h2 id="何時用-citus">When to use Citus</h2>
<table>
  <thead>
      <tr>
          <th>Condition</th>
          <th>Recommendation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-tenant SaaS where tenant_id is the natural distribution column</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Write throughput &gt; 50K writes/s, beyond what a single PG node can sustain</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Need to keep PG SQL and extensions (pgvector / TimescaleDB)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>About 80% of application queries filter on the same distribution column</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Heavy ad-hoc cross-tenant aggregation</td>
          <td>No (cross-shard queries are slow)</td>
      </tr>
      <tr>
          <td>Strong cross-shard consistency required</td>
          <td>No (use CockroachDB)</td>
      </tr>
      <tr>
          <td>Want a zero-ops managed service</td>
          <td>Azure Cosmos DB for PostgreSQL (same engine)</td>
      </tr>
  </tbody>
</table>
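<h2 id="容量規劃">Capacity planning</h2>
<ul>
<li>Coordinator: moderate CPU and RAM; its metadata is small and, in the usual setup, it holds no shard data</li>
<li>Worker: size each worker like a standalone production PG server</li>
<li>Shard count: default 32; in practice often set to 4-8× the worker count</li>
<li>Replication factor: at least 2 in production</li>
</ul>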
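<p>As a worked example of the shard-count rule of thumb, assume 8 workers and a target of 8 shards per worker, which gives 8 × 8 = 64. The <code>events</code> table is hypothetical; <code>citus.shard_count</code> is the Citus setting that controls how many shards a newly distributed table gets.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 8 workers x 8 shards per worker = 64 (assumed numbers)
SET citus.shard_count = 64;

-- The setting applies to tables distributed after it is set
CREATE TABLE events (tenant_id bigint NOT NULL, payload jsonb);
SELECT create_distributed_table('events', 'tenant_id');

-- Confirm the resulting shard count
SELECT table_name, shard_count
FROM citus_tables
WHERE table_name = 'events'::regclass;</code></pre></div>
<p>Starting with more shards than workers leaves room to add workers later and rebalance without splitting shards.</p>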
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-replication-topology">With Replication Topology</h3>
<p>The coordinator and each worker run their own PG streaming replication; Citus does not replace PG replication. Worker failover is handled by Patroni on top of streaming replication. See <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>.</p>
<h3 id="跟-pg-extensions">With PG Extensions</h3>
<p>Citus is compatible with most other PG extensions (pgvector / TimescaleDB / pg_stat_statements): it remains an <em>extension</em> itself and so keeps PostgreSQL's ecosystem integration points. See the <em>PG Extension Ecosystem</em> article (not yet written).</p>
<h3 id="跟-mysql-vitess">With MySQL Vitess</h3>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Citus</th>
          <th>Vitess</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deployment model</td>
          <td>PG extension</td>
          <td>Standalone proxy + tablets</td>
      </tr>
      <tr>
          <td>Primary use case</td>
          <td>Multi-tenant SaaS</td>
          <td>Very large-scale sharding</td>
      </tr>
      <tr>
          <td>Cross-shard JOIN</td>
          <td>Aligned colocation + reference tables</td>
          <td>VTGate splits and aggregates automatically</td>
      </tr>
      <tr>
          <td>FK</td>
          <td>Supported within a colocation group</td>
          <td>Supported in Vitess 18+, with cross-shard limits</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Relies on Patroni + replication factor</td>
          <td>VTOrc + replication</td>
      </tr>
      <tr>
          <td>Learning curve</td>
          <td>Medium (PG ops experience is enough)</td>
          <td>High (4 components)</td>
      </tr>
  </tbody>
</table>
<p>Citus is the smoother fit for <em>PG-native</em> workloads and Vitess for <em>MySQL-native</em> ones; the two do not compete directly. See <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess Sharding</a>.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（per-worker replication）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a>（cross-shard transaction lock 行為）</li>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 Globally Distributed OLTP</a> (Citus vs CockroachDB vs Spanner)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess Sharding</a> (sibling, different implementation)</li>
<li><a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor</a>（Azure Cosmos DB for PostgreSQL = managed Citus）</li>
<li>Official: <a href="https://docs.citusdata.com/">Citus Documentation</a> / <a href="https://github.com/citusdata/citus">Citus on GitHub</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL SQL Features: what PG has long had, what MySQL 8.0 only recently added, and where PG still leads</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/sql-features-baseline/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/sql-features-baseline/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on the &lt;em>SQL features baseline&lt;/em>: what PG has had since early releases, what MySQL 8.0 only recently added, and where PG still leads, as a reference for readers evaluating PG from a MySQL background.&lt;/p>&lt;/blockquote>
&lt;hr>
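&lt;h2 id="pg-sql-工程深度的歷史錨點">Historical anchors for PG's SQL engineering depth&lt;/h2>
&lt;p>PG has long been ahead of MySQL on SQL features:&lt;/p>
&lt;ul>
&lt;li>2009 (PG 8.4): CTEs / window functions / recursive queries&lt;/li>
&lt;li>2013 (PG 9.3): LATERAL derived tables / materialized views&lt;/li>
&lt;li>2014 (PG 9.4): JSONB (partial indexes and GIN indexes were already available in earlier releases)&lt;/li>
&lt;li>2016 (PG 9.5): UPSERT (&lt;code>ON CONFLICT&lt;/code>)&lt;/li>
&lt;li>2017 (PG 10): declarative partitioning / logical replication / multi-column statistics&lt;/li>
&lt;/ul>
&lt;p>MySQL 8.0 (GA in 2018) was the first MySQL release line to add CTEs / window functions / LATERAL / JSON_TABLE / hash join (LATERAL and hash join arrived in later 8.0.x point releases), about &lt;em>nine years after PG 8.4&lt;/em>.&lt;/p>
&lt;p>For readers &lt;em>evaluating PG from MySQL&lt;/em>, PG's SQL depth is not just "everything you would expect is there". The useful breakdown is: features where PG leads structurally, what MySQL 8.0 has caught up on, and where PG is still ahead.&lt;/p>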
&lt;p>Two perspectives, compared with &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features&lt;/a>:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on the <em>SQL features baseline</em>: what PG has had since early releases, what MySQL 8.0 only recently added, and where PG still leads, as a reference for readers evaluating PG from a MySQL background.</p></blockquote>
<hr>
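<h2 id="pg-sql-工程深度的歷史錨點">Historical anchors for PG's SQL engineering depth</h2>
<p>PG has long been ahead of MySQL on SQL features:</p>
<ul>
<li>2009 (PG 8.4): CTEs / window functions / recursive queries</li>
<li>2013 (PG 9.3): LATERAL derived tables / materialized views</li>
<li>2014 (PG 9.4): JSONB (partial indexes and GIN indexes were already available in earlier releases)</li>
<li>2016 (PG 9.5): UPSERT (<code>ON CONFLICT</code>)</li>
<li>2017 (PG 10): declarative partitioning / logical replication / multi-column statistics</li>
</ul>
<p>MySQL 8.0 (GA in 2018) was the first MySQL release line to add CTEs / window functions / LATERAL / JSON_TABLE / hash join (LATERAL and hash join arrived in later 8.0.x point releases), about <em>nine years after PG 8.4</em>.</p>
<p>For readers <em>evaluating PG from MySQL</em>, PG's SQL depth is not just "everything you would expect is there". The useful breakdown is: features where PG leads structurally, what MySQL 8.0 has caught up on, and where PG is still ahead.</p>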
<p>Two perspectives, compared with <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>:</p>
<ul>
<li>MySQL 8.0's view: "I have finally filled the gaps; here is how I compare with PG"</li>
<li>PG's view: "I have led for a long time; MySQL 8.0 has caught up on some features, and I am still ahead on the rest"</li>
</ul>
<h2 id="pg-結構性領先特性mysql-沒對應--弱對應">Features where PG leads structurally (no or weak MySQL counterpart)</h2>
<h3 id="1-materialized-view">1. Materialized View</h3>
<p>PG 9.3+ has built-in materialized views:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">orders_summary</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">order_count</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">total</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">user_id</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- Manual refresh
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="n">REFRESH</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">orders_summary</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Or a concurrent refresh (PG 9.4+, does not block reads)
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="n">REFRESH</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">CONCURRENTLY</span><span class="w"> </span><span class="n">orders_summary</span><span class="p">;</span></span></span></code></pre></div><p>Uses:</p>
<ul>
<li>Precompute expensive aggregations so reads are very fast</li>
<li>Concurrent refresh does not block reads (it requires a unique index on the materialized view)</li>
<li>Indexes can be created on a materialized view</li>
</ul>
<p><strong>MySQL counterpart</strong>: no native materialized view. Common substitutes:</p>
<ul>
<li>Trigger + summary table (maintained by hand)</li>
<li>A caching layer in the application</li>
<li>A plain view plus a cache layer (not true materialization)</li>
</ul>
<p>MySQL 8.0 and later still have no native materialized view.</p>
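<p>On the PG side, the concurrent refresh only works once the view has a unique index. A small sketch using the <code>orders_summary</code> view from the example above; the index name is made up, and scheduling the refresh (cron, <code>pg_cron</code>, or an application job) is left out.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- CONCURRENTLY needs a unique index that covers every row of the view
CREATE UNIQUE INDEX idx_orders_summary_user ON orders_summary (user_id);

-- Readers keep seeing the old contents until the refresh finishes
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_summary;

-- Reads hit the precomputed rows through the index
SELECT order_count, total
FROM orders_summary
WHERE user_id = 12345;</code></pre></div>
<p><code>user_id</code> is a safe choice here because the view groups by it, so each value appears exactly once.</p>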
<h3 id="2-partial-index">2. Partial Index</h3>
<p>PG supports partial indexes natively: the index covers only <em>rows that satisfy a predicate</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Index only active users
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_active_email</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="p">(</span><span class="n">email</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Much smaller than a full index; queries that match the predicate perform like a full-index lookup
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">email</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;x@y.com&#39;</span><span class="p">;</span></span></span></code></pre></div><p>Uses:</p>
<ul>
<li><em>Soft delete</em>: a partial index on <code>deleted_at IS NULL</code></li>
<li><em>Hot subset</em>: a partial index on hot rows such as <code>status = 'pending'</code></li>
<li>Index size and write cost drop substantially</li>
</ul>
<p><strong>MySQL counterpart</strong>: MySQL has no native partial index. MySQL 8.0+ has <em>functional indexes</em>, but those are a different feature. MySQL substitutes:</p>
<ul>
<li>Generated column + index (close, but harder to maintain)</li>
<li>Or accept the cost of a full index</li>
</ul>
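<p>The soft-delete use can be sketched on the PG side as follows. The <code>deleted_at</code> column on <code>orders</code> is an assumption made for the example, and the index names are made up.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Soft delete: index only live rows
CREATE INDEX idx_orders_live_user ON orders (user_id)
WHERE deleted_at IS NULL;

-- Partial unique index: email must be unique among active users only
CREATE UNIQUE INDEX uq_users_active_email ON users (email)
WHERE status = 'active';

-- The planner considers a partial index only when the query's WHERE
-- clause implies the index predicate
EXPLAIN SELECT * FROM orders
WHERE user_id = 12345 AND deleted_at IS NULL;</code></pre></div>
<p>A query that omits <code>deleted_at IS NULL</code> cannot use <code>idx_orders_live_user</code>, which is the usual surprise when adopting partial indexes.</p>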
<h3 id="3-foreign-data-wrapper-fdw">3. Foreign Data Wrapper (FDW)</h3>
<p>PG FDW lets a query reach external data sources:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">postgres_fdw</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SERVER</span><span class="w"> </span><span class="n">remote_db</span><span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">DATA</span><span class="w"> </span><span class="n">WRAPPER</span><span class="w"> </span><span class="n">postgres_fdw</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">OPTIONS</span><span class="w"> </span><span class="p">(</span><span class="k">host</span><span class="w"> </span><span class="s1">&#39;remote.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dbname</span><span class="w"> </span><span class="s1">&#39;analytics&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">USER</span><span class="w"> </span><span class="n">MAPPING</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="n">localuser</span><span class="w"> </span><span class="n">SERVER</span><span class="w"> </span><span class="n">remote_db</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">OPTIONS</span><span class="w"> </span><span class="p">(</span><span class="k">user</span><span class="w"> </span><span class="s1">&#39;remoteuser&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">remote_orders</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span><span class="w"> </span><span class="n">SERVER</span><span class="w"> </span><span class="n">remote_db</span><span class="w"> </span><span class="k">OPTIONS</span><span class="w"> </span><span class="p">(</span><span class="k">table_name</span><span class="w"> </span><span class="s1">&#39;orders&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- Query the remote table from the local PG
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">remote_orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span></span></span></code></pre></div><p>Available FDWs include <code>postgres_fdw</code> / <code>mysql_fdw</code> / <code>oracle_fdw</code> / <code>mongo_fdw</code> / <code>file_fdw</code> / <code>redis_fdw</code>, among others.</p>
<p><strong>MySQL counterpart</strong>: MySQL has the FEDERATED storage engine (limited, disabled by default, and not recommended). In practice, cross-database queries in MySQL are handled in the application layer.</p>
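<p>With <code>postgres_fdw</code>, the column list does not have to be typed by hand. The sketch below reuses the <code>remote_db</code> server from the example above and assumes the remote table lives in the remote <code>public</code> schema; the local schema name <code>remote</code> is made up.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Import the remote table definition (PG 9.5+)
CREATE SCHEMA IF NOT EXISTS remote;
IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
    FROM SERVER remote_db INTO remote;

-- The "Remote SQL" line in the plan shows what was pushed down
EXPLAIN (VERBOSE)
SELECT id FROM remote.orders WHERE id = 100;</code></pre></div>
<p>Checking the plan matters because a filter that is not pushed down forces the local server to fetch every remote row first.</p>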
<h3 id="4-jsonb--gin-indexpg-結構性優勢">4. JSONB + GIN Index (a structural PG advantage)</h3>
<p>PG JSONB is <em>stored in a binary format</em> and can be <em>indexed directly with GIN</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">metadata</span><span class="w"> </span><span class="n">JSONB</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- GIN index over JSONB
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- Fast queries served by the GIN index (the jsonpath operator @? needs PG 12+)
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@?</span><span class="w"> </span><span class="s1">&#39;$.variants[*].price ? (@ &gt; 100)&#39;</span><span class="p">;</span></span></span></code></pre></div><p><strong>MySQL counterpart</strong>: MySQL 8.0's JSON_TABLE follows the SQL standard, but <em>indexing JSON requires a generated-column workaround</em> (or, since 8.0.17, a multi-valued index for arrays); there is no GIN-style index over a whole JSON document.</p>
<p>See the JSON_TABLE vs PG JSONB comparison section in <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>.</p>
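<p>Two index variants are worth knowing alongside the default GIN index above. This is a sketch against the same <code>products</code> table; the index names are made up, and which variant pays off depends on the actual query mix.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- jsonb_path_ops: usually smaller; supports @&gt;, @? and @@
-- but not the key-existence operators (?, ?|, ?&amp;)
CREATE INDEX idx_products_metadata_path
    ON products USING GIN (metadata jsonb_path_ops);

-- B-tree expression index for one frequently filtered key
CREATE INDEX idx_products_category
    ON products ((metadata-&gt;&gt;'category'));

-- Containment query: can use either GIN index
SELECT id FROM products WHERE metadata @&gt; '{"category": "shoes"}';

-- Equality on the extracted key: can use the expression index
SELECT id FROM products WHERE metadata-&gt;&gt;'category' = 'shoes';</code></pre></div>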
<h3 id="5-range-types--exclusion-constraints">5. Range Types + Exclusion Constraints</h3>
<p>PG range types combined with exclusion constraints prevent <em>overlapping time ranges</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">reservations</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">room_id</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">during</span><span class="w"> </span><span class="n">TSRANGE</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">EXCLUDE</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">=</span><span class="p">,</span><span class="w"> </span><span class="n">during</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- An INSERT that overlaps an existing booking is rejected automatically
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">reservations</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="p">,</span><span class="w"> </span><span class="n">during</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;[2026-05-19 10:00, 2026-05-19 12:00)&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">reservations</span><span class="w"> </span><span class="p">(</span><span class="n">room_id</span><span class="p">,</span><span class="w"> </span><span class="n">during</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;[2026-05-19 11:00, 2026-05-19 13:00)&#39;</span><span class="p">);</span><span class="w">
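</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- ERROR: conflicting key value violates exclusion constraint</span></span></span></code></pre></div><p>The <code>room_id WITH =</code> part of the constraint needs the <code>btree_gist</code> extension (<code>CREATE EXTENSION btree_gist;</code>), because core GiST ships no equality operator class for <code>INT</code>; without it the <code>CREATE TABLE</code> fails.</p><p><strong>MySQL equivalent</strong>: none at all; overlap checks have to be enforced in the application layer.</p>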
<h3 id="6-check-constraint--domain-type">6. CHECK Constraint + Domain Type</h3>
<p>PG actually enforces <code>CHECK</code> constraints (MySQL only started enforcing them in 8.0.16) and also supports user-defined <code>DOMAIN</code> types:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">DOMAIN</span><span class="w"> </span><span class="n">positive_int</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">INT</span><span class="w"> </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">VALUE</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">quantity</span><span class="w"> </span><span class="n">positive_int</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="w"> </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">amount</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p><strong>MySQL equivalent</strong>: CHECK constraints are enforced from 8.0.16 onward (earlier versions, including 5.7, accept the clause but do not enforce it). There is no user-defined DOMAIN.</p>
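<p>To see how the two mechanisms behave at write time, here is a minimal sketch. The names <code>positive_qty</code> and <code>demo_orders</code> are hypothetical, chosen so the statements do not collide with the objects above; the behavior relies only on core PostgreSQL.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical names, for illustration only
CREATE DOMAIN positive_qty AS INT CHECK (VALUE &gt; 0);

CREATE TABLE demo_orders (
    id SERIAL PRIMARY KEY,
    quantity positive_qty NOT NULL,
    amount DECIMAL CHECK (amount &gt;= 0)
);

-- Accepted: both constraints hold
INSERT INTO demo_orders (quantity, amount) VALUES (3, 19.90);

-- Rejected by the domain's CHECK (VALUE &gt; 0)
INSERT INTO demo_orders (quantity, amount) VALUES (0, 19.90);

-- Rejected by the column-level CHECK (amount &gt;= 0)
INSERT INTO demo_orders (quantity, amount) VALUES (3, -1);

-- Tightening the rule once applies to every column typed with the domain
ALTER DOMAIN positive_qty
    ADD CONSTRAINT positive_qty_max CHECK (VALUE &lt;= 1000000);</code></pre></div>
<p>The design point is reuse: a column-level <code>CHECK</code> has to be repeated on every table, while a domain carries its rule to every column that uses it.</p>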
<h3 id="7-extension-ecosystem">7. Extension Ecosystem</h3>
<p>PG extensions are a <em>structural advantage</em>:</p>
<ul>
<li><code>pg_partman</code>: automated partition lifecycle</li>
<li><code>pg_repack</code>: online table rewrite</li>
<li><code>pg_stat_statements</code>: query statistics</li>
<li><code>pgvector</code>: vector similarity search</li>
<li><code>pg_cron</code>: scheduled jobs</li>
<li><code>PostGIS</code>: GIS</li>
<li><code>TimescaleDB</code>: time-series</li>
<li><code>Citus</code>: sharding</li>
</ul>
<p><strong>MySQL equivalent</strong>: MySQL has a plugin mechanism, but its ecosystem is far smaller. See the <em>PG Extension Ecosystem</em> article (not yet written).</p>
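<p>As a rough sketch of what "using an extension" looks like in practice, the statements below list what the server can load and enable one extension in the current database. It assumes the standard contrib package is installed (which is where <code>pg_trgm</code> comes from); third-party extensions such as pgvector or TimescaleDB have to be installed on the host first, and some (for example <code>pg_stat_statements</code>, <code>pg_cron</code>, TimescaleDB) also need an entry in <code>shared_preload_libraries</code> and a restart.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Extensions whose files are present on this server
SELECT name, default_version, installed_version
FROM pg_available_extensions
ORDER BY name;

-- Extensions are enabled per database
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Confirm what is enabled in the current database
SELECT extname, extversion
FROM pg_extension
ORDER BY extname;</code></pre></div>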
<h2 id="mysql-80-補齊的-pg-既有特性">PG features that MySQL 8.0 caught up on</h2>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Introduced in PG</th>
          <th>Introduced in MySQL</th>
          <th>Notes on differences</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CTE</td>
          <td>8.4 (2009)</td>
          <td>8.0 (2018)</td>
          <td>MySQL added the syntax; since PG 12 inlines non-recursive CTEs by default, behavior is now close to MySQL</td>
      </tr>
      <tr>
          <td>Window function</td>
          <td>8.4 (2009)</td>
          <td>8.0 (2018)</td>
          <td>Both follow the standard; frame specification details differ</td>
      </tr>
      <tr>
          <td>Lateral derived table</td>
          <td>9.3 (2013)</td>
          <td>8.0.14 (2019)</td>
          <td>Added later in MySQL; planner support is less mature than PG</td>
      </tr>
      <tr>
          <td>Hash join</td>
          <td>Long-standing</td>
          <td>8.0.18 (2019)</td>
          <td>MySQL 8.0.18 only covers equi-joins that cannot use an index; 8.0.20 widened the coverage</td>
      </tr>
      <tr>
          <td>JSON_TABLE</td>
          <td>17 (2024)</td>
          <td>8.0 (2018)</td>
          <td>MySQL shipped it first; PG added it in 17, having long had its own JSONB route</td>
      </tr>
      <tr>
          <td>CHECK constraint</td>
          <td>Long-standing</td>
          <td>8.0.16 (2019)</td>
          <td>MySQL 5.7 accepted the syntax but did not enforce it</td>
      </tr>
      <tr>
          <td>Role-based auth</td>
          <td>Long-standing</td>
          <td>8.0 (2018)</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Atomic DDL</td>
          <td>Long-standing</td>
          <td>8.0 (2018)</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Common SQL keywords</td>
          <td>Complete</td>
          <td>Added in 8.0</td>
          <td>MySQL 5.7 lacked many (window / rank / lateral, etc.)</td>
      </tr>
  </tbody>
</table>
<p>MySQL 8.0 <em>closes a 9-year gap in SQL-standard support</em>; it does not <em>newly overtake PG</em>.</p>
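<p>For readers who have not used these features yet, the following sketch combines a CTE with a window function to pick the largest order per customer. It is written in PostgreSQL syntax, and the inline <code>VALUES</code> rows are made-up sample data rather than anything from a real schema.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Sample data is hypothetical and defined inline, so no table is needed
WITH orders (customer_id, order_id, amount) AS (
    VALUES (1, 101, 50.0),
           (1, 102, 75.0),
           (2, 201, 20.0)
),
ranked AS (
    -- Number each customer's orders from largest to smallest amount
    SELECT customer_id, order_id, amount,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY amount DESC
           ) AS rn
    FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn = 1
ORDER BY customer_id;</code></pre></div>
<p>With the sample rows above this returns order 102 for customer 1 and order 201 for customer 2. Before MySQL 8.0 the same result needed a self-join or a correlated subquery.</p>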
<h2 id="pg-仍領先的特性">Features where PG still leads</h2>
<p>This is the view from the other side: MySQL 8.0 caught up, yet PG still holds its ground. Of the 14 items below, the ones with the <em>largest production impact</em> are Materialized view / Partial index / JSONB GIN / Full-text search, plus Range / Exclusion constraints (schema-level expressiveness); <em>secondary but commonly used</em> are Multi-column statistics and Procedural language; <em>atypical but important in niches</em> are User-defined DOMAIN / Generic table inheritance (readers may not know them, but ORMs and schema migration tools use them):</p>
<table>
  <thead>
      <tr>
          <th>Feature where PG leads</th>
          <th>MySQL status</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Materialized view</td>
          <td>No native support</td>
          <td>Recomputing on the application side is expensive</td>
      </tr>
      <tr>
          <td>Partial index</td>
          <td>None (functional indexes are not equivalent)</td>
          <td>Saves storage on boolean / status columns</td>
      </tr>
      <tr>
          <td>FDW</td>
          <td>Weak (the FEDERATED engine is not recommended)</td>
          <td>Escape hatch for cross-database queries</td>
      </tr>
      <tr>
          <td>JSONB GIN index</td>
          <td>None (generated-column workaround)</td>
          <td>Structural gap for JSON workloads</td>
      </tr>
      <tr>
          <td>Range types</td>
          <td>None</td>
          <td>A lifesaver for booking / availability schemas</td>
      </tr>
      <tr>
          <td>Exclusion constraints</td>
          <td>None</td>
          <td>Guards against range overlap</td>
      </tr>
      <tr>
          <td>User-defined DOMAIN</td>
          <td>None</td>
          <td>Column-level type constraint</td>
      </tr>
      <tr>
          <td>Extension ecosystem</td>
          <td>Weak</td>
          <td>pgvector / TimescaleDB / PostGIS</td>
      </tr>
      <tr>
          <td>Mature full-text search</td>
          <td>InnoDB FTS is weaker</td>
          <td>Three layers: tsvector + GIN + pg_trgm</td>
      </tr>
      <tr>
          <td>Multi-column statistics</td>
          <td>8.0 histograms cover part of it; PG goes further</td>
          <td>More accurate planner estimates</td>
      </tr>
      <tr>
          <td>Procedural language</td>
          <td>Stored procedures only (no pluggable languages)</td>
          <td>PG: PL/pgSQL plus other languages (PL/Python / PL/Perl, etc.)</td>
      </tr>
      <tr>
          <td>Recursive CTE depth</td>
          <td>1000 by default (cte_max_recursion_depth)</td>
          <td>PG: no built-in limit</td>
      </tr>
      <tr>
          <td>LSN-based replication</td>
          <td>binlog file + position (GTID mitigates this)</td>
          <td>PG: a single LSN keeps progress tracking simple</td>
      </tr>
      <tr>
          <td>Generic table inheritance</td>
          <td>None</td>
          <td>PG: long-standing; used to structure multi-tenant schemas</td>
      </tr>
  </tbody>
</table>
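<p>Two of the high-impact items in the table, partial indexes and materialized views, can be sketched in a few statements. The <code>jobs</code> table and all object names here are hypothetical; only core PostgreSQL syntax is used.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical queue table
CREATE TABLE jobs (
    id         bigserial   PRIMARY KEY,
    status     text        NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Partial index: only rows still pending are indexed, so the index
-- stays small even when most rows are in a finished state
CREATE INDEX jobs_pending_idx
    ON jobs (created_at)
    WHERE status = 'pending';

-- Materialized view: the aggregate is stored, not recomputed per query
CREATE MATERIALIZED VIEW job_counts AS
SELECT status, count(*) AS n
FROM jobs
GROUP BY status;

-- A unique index is required before CONCURRENTLY can be used
CREATE UNIQUE INDEX job_counts_status_idx ON job_counts (status);

-- Refresh without blocking readers of the view
REFRESH MATERIALIZED VIEW CONCURRENTLY job_counts;</code></pre></div>
<p>A query must repeat the index predicate (<code>WHERE status = 'pending'</code>) for the planner to consider the partial index.</p>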
<h2 id="對從-mysql-評估-pg的讀者">For readers evaluating PG from MySQL</h2>
<p>Readers usually arrive from MySQL 8.0, and the question is <em>"Where is PG stronger than MySQL, and where is it weaker?"</em>:</p>
<h3 id="pg-比-mysql-強">Where PG is stronger than MySQL</h3>
<ul>
<li><em>SQL engineering depth</em>: the 7 structural advantages listed above</li>
<li><em>Extension ecosystem</em>: pgvector / TimescaleDB / Citus / pg_partman, etc.</li>
<li><em>Optimizer</em>: the planner is more mature on complex queries</li>
<li><em>Concurrency model</em>: MVCC with fewer locks (<a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>)</li>
</ul>
<h3 id="pg-比-mysql-弱">Where PG is weaker than MySQL</h3>
<ul>
<li><em>Replication simplicity</em>: MySQL GTID is simpler to configure than PG WAL + replication slots (<a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>)</li>
<li><em>Sharding ecosystem</em>: Vitess / PlanetScale are proven at larger scale than Citus</li>
<li><em>Breadth of operational tooling</em>: Percona Toolkit (pt-*) / gh-ost / Orchestrator, etc.</li>
<li><em>VACUUM maintenance</em>: PG MVCC requires VACUUM, and misconfigured autovacuum is a frequent source of trouble (<a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>)</li>
</ul>
<h3 id="選-pg-的核心-driver">Core drivers for choosing PG</h3>
<p>For workloads that need SQL engineering depth, extensions, or complex query / OLAP-style processing, PG remains the first choice. For plain simple OLTP with large-scale sharding, MySQL + Vitess remains competitive.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>: PG MVCC is the concurrency-control foundation underneath these SQL features</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>: the PG planner is mature for window / CTE / hash join</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">Citus Distributed</a>: one of the extensions, and a showcase of the extension ecosystem</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>: the cost of MVCC, tied to the concurrency control behind these SQL features</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a> (concurrency foundations)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a> (planner maturity)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">PG Citus Distributed</a> (extension example)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a> (MVCC maintenance)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a> (sibling, reverse perspective)</li>
<li>Official: <a href="https://www.postgresql.org/about/featurematrix/">PostgreSQL Features</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/bdr-multi-master/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/bdr-multi-master/</guid><description>&lt;blockquote>
&lt;p>This is an implementation-layer deep article under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on &lt;em>multi-master / active-active replication&lt;/em>, which is not a PG default and requires an extension.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-預設沒-multi-master得用-extension">PG has no built-in multi-master; it takes an extension&lt;/h2>
&lt;p>PG core is &lt;em>single-primary streaming replication&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>Writes can only go to the primary&lt;/li>
&lt;li>Standbys accept reads (hot_standby) but reject writes&lt;/li>
&lt;li>After failover a new primary takes over; there is still only one write entry point&lt;/li>
&lt;/ul>
&lt;p>For scenarios that need &lt;em>active-active&lt;/em> (multiple regions each accepting local writes), the PG ecosystem offers 3 extension-based paths:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Option&lt;/th>
 &lt;th>Source&lt;/th>
 &lt;th>Mechanism&lt;/th>
 &lt;th>License&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>BDR&lt;/strong>&lt;/td>
 &lt;td>EDB (Enterprise)&lt;/td>
 &lt;td>Logical-replication-based, bidirectional&lt;/td>
 &lt;td>Commercial (EDB subscription)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>pgEdge&lt;/strong>&lt;/td>
 &lt;td>pgEdge Inc.&lt;/td>
 &lt;td>BDR lineage, open source, built on the Spock extension&lt;/td>
 &lt;td>Open source (Spock)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Bucardo&lt;/strong>&lt;/td>
 &lt;td>community&lt;/td>
 &lt;td>Trigger-based, async, written in Perl&lt;/td>
 &lt;td>Open source (BSD)&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Each path has different trade-offs. For 99% of PG production cases, multi-master is &lt;em>not needed&lt;/em>: single-primary streaming replication plus read-replica scaling is enough. Multi-master is only worth adopting for &lt;em>special requirements&lt;/em> (cross-region active-active writes / maintenance without interruption).&lt;/p>
&lt;p>Compared with &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &amp;#43; certification* 整個機制不同。本文走 GR 機制（GCE &amp;#43; certification &amp;#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication&lt;/a>: MySQL GR is &lt;em>officially built in&lt;/em> (5.7+), while PG has no built-in counterpart. MySQL users can adopt GR / InnoDB Cluster directly; PG users have to pick an extension and weigh the license trade-off.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is an implementation-layer deep article under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on <em>multi-master / active-active replication</em>, which is not a PG default and requires an extension.</p></blockquote>
<hr>
<h2 id="pg-預設沒-multi-master得用-extension">PG has no built-in multi-master; it takes an extension</h2>
<p>PG core is <em>single-primary streaming replication</em>:</p>
<ul>
<li>Writes can only go to the primary</li>
<li>Standbys accept reads (hot_standby) but reject writes</li>
<li>After failover a new primary takes over; there is still only one write entry point</li>
</ul>
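<p>The single-primary rule is easy to observe from a client session. The two statements below are a small sketch using built-in PostgreSQL facilities; run them against each node to tell which role it currently holds.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- false on the primary, true on a standby that is replaying WAL
SELECT pg_is_in_recovery();

-- Reports 'on' on a hot standby: every transaction there is read-only
SHOW transaction_read_only;</code></pre></div>
<p>On a hot standby, any <code>INSERT</code> / <code>UPDATE</code> / <code>DELETE</code> fails with a "cannot execute ... in a read-only transaction" error, which is why active-active writes need something beyond core replication.</p>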
<p>For scenarios that need <em>active-active</em> (multiple regions each accepting local writes), the PG ecosystem offers 3 extension-based paths:</p>
<table>
  <thead>
      <tr>
          <th>Option</th>
          <th>Source</th>
          <th>Mechanism</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BDR</strong></td>
          <td>EDB (Enterprise)</td>
          <td>Logical-replication-based, bidirectional</td>
          <td>Commercial (EDB subscription)</td>
      </tr>
      <tr>
          <td><strong>pgEdge</strong></td>
          <td>pgEdge Inc.</td>
          <td>BDR lineage, open source, built on the Spock extension</td>
          <td>Open source (Spock)</td>
      </tr>
      <tr>
          <td><strong>Bucardo</strong></td>
          <td>community</td>
          <td>Trigger-based, async, written in Perl</td>
          <td>Open source (BSD)</td>
      </tr>
  </tbody>
</table>
<p>Each path has different trade-offs. For 99% of PG production cases, multi-master is <em>not needed</em>: single-primary streaming replication plus read-replica scaling is enough. Multi-master is only worth adopting for <em>special requirements</em> (cross-region active-active writes / maintenance without interruption).</p>
<p>Compared with <a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a>: MySQL GR is <em>officially built in</em> (5.7+), while PG has no built-in counterpart. MySQL users can adopt GR / InnoDB Cluster directly; PG users have to pick an extension and weigh the license trade-off.</p>
<h2 id="multi-master-三方案對比">Comparing the three multi-master options</h2>
<h3 id="方案-1bdr-edb-postgres-distributed">Option 1: BDR (EDB Postgres Distributed)</h3>
<p>EDB's commercial distributed offering; it runs on EDB Postgres Advanced Server or on community PG.</p>
<p><strong>Features</strong>:</p>
<ul>
<li>Bidirectional logical replication, N-way active-active</li>
<li>Built-in conflict detection + resolution (LWW / column-level / user-defined)</li>
<li>Two modes: eager (sync) and async</li>
<li>Tightly integrated with EDB tooling</li>
</ul>
<p><strong>Trade-off</strong>:</p>
<ul>
<li>Commercial license; requires an EDB subscription</li>
<li>Mature for cross-region multi-master (widely used by North American enterprises)</li>
<li>Support for <em>new PG versions</em> usually lags by a few months</li>
</ul>
<h3 id="方案-2pgedge基於-spock-extension">Option 2: pgEdge (based on the Spock extension)</h3>
<p>pgEdge is open-source multi-master built on the <em>Spock</em> extension (derived from BDR):</p>
<p><strong>Features</strong>:</p>
<ul>
<li>Open source, can be self-managed</li>
<li>Architecture close to BDR, with no license fee</li>
<li>Conflict resolution uses LWW + column-level</li>
<li>Designed for <em>edge / geographically distributed</em> scenarios</li>
</ul>
<p><strong>Trade-off</strong>:</p>
<ul>
<li>Newer (2023+), with less community validation than BDR</li>
<li>Conflict resolution policies are simpler than BDR's</li>
<li>Some EDB commercial features have no counterpart</li>
</ul>
<h3 id="方案-3bucardo">Option 3: Bucardo</h3>
<p>Community-maintained async multi-master for PG, written in Perl and trigger-based:</p>
<p><strong>Features</strong>:</p>
<ul>
<li>Fully open source</li>
<li>Trigger-based (does not depend on logical replication)</li>
<li>Supports multi-source replication (fan-in / fan-out)</li>
</ul>
<p><strong>Trade-off</strong>:</p>
<ul>
<li>Async only: <em>conflicts surface with higher latency</em></li>
<li>Trigger overhead (reduces write throughput on the primary)</li>
<li>Maintaining Perl and its toolchain is an uncommon skill set</li>
<li>Not suitable when <em>synchronous consistency</em> is required</li>
</ul>
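<p>To make the "trigger overhead" point concrete, here is a generic sketch of trigger-based change capture. It is not Bucardo's actual schema or code; the <code>accounts</code> and <code>accounts_delta</code> tables and the function name are hypothetical, and it assumes PostgreSQL 11+ for <code>EXECUTE FUNCTION</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical replicated table
CREATE TABLE accounts (
    id      int     PRIMARY KEY,
    balance numeric NOT NULL
);

-- Hypothetical change log that a replication daemon would poll
CREATE TABLE accounts_delta (
    id         int         NOT NULL,
    op         text        NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION track_accounts_change() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- DELETE has no NEW row, so record the key from OLD
    IF TG_OP = 'DELETE' THEN
        INSERT INTO accounts_delta (id, op) VALUES (OLD.id, TG_OP);
        RETURN OLD;
    END IF;
    INSERT INTO accounts_delta (id, op) VALUES (NEW.id, TG_OP);
    RETURN NEW;
END;
$$;

CREATE TRIGGER accounts_track
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION track_accounts_change();</code></pre></div>
<p>Every write to <code>accounts</code> now costs one extra insert in the same transaction, which is the throughput penalty that logical-replication-based options avoid by reading the WAL instead.</p>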
<h2 id="multi-master-conflict-model">Multi-Master Conflict Model</h2>
<p>Every multi-master option has to resolve the conflict that arises when <em>the same row is modified in two places at once</em>:</p>
<h3 id="conflict-來源">Where conflicts come from</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Region A (primary 1)          Region B (primary 2)
</span></span><span class="line"><span class="ln">2</span><span class="cl">UPDATE orders                 UPDATE orders
</span></span><span class="line"><span class="ln">3</span><span class="cl">SET status=&#39;shipped&#39;          SET status=&#39;cancelled&#39;
</span></span><span class="line"><span class="ln">4</span><span class="cl">WHERE id=100                  WHERE id=100
</span></span><span class="line"><span class="ln">5</span><span class="cl">     ↓                              ↓
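</span></span><span class="line"><span class="ln">6</span><span class="cl">   Merge? Which one wins?</span></span></code></pre></div><p>Each region commits locally, and the conflict is only discovered after the replication lag has passed, so it has to be <em>resolved automatically</em> (it cannot be handed back to the application).</p>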
<h3 id="conflict-resolution-strategies">Conflict Resolution Strategies</h3>
<p><strong>1. Last-Write-Wins (LWW)</strong>, the most common:</p>
<ul>
<li>Compare transaction commit timestamps; the later one wins</li>
<li>Simple, but it causes <em>data loss</em> (the earlier commit's change is overwritten)</li>
<li>Requires <em>clock synchronization</em> (NTP); clock skew makes the outcome unpredictable</li>
</ul>
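<p>The LWW rule can be sketched in plain SQL as a timestamp-guarded upsert. This only illustrates the comparison logic, not how BDR or Spock implement it internally; the <code>orders_lww</code> table and the timestamps are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical table: updated_at stands in for the commit timestamp
CREATE TABLE orders_lww (
    id         int         PRIMARY KEY,
    status     text        NOT NULL,
    updated_at timestamptz NOT NULL
);

-- Local write committed in region A
INSERT INTO orders_lww
VALUES (100, 'shipped', '2026-05-19 10:00:00.200+00');

-- Applying the change that arrived from region B:
-- it only overwrites when its timestamp is later
INSERT INTO orders_lww AS t (id, status, updated_at)
VALUES (100, 'cancelled', '2026-05-19 10:00:00.150+00')
ON CONFLICT (id) DO UPDATE
SET status     = EXCLUDED.status,
    updated_at = EXCLUDED.updated_at
WHERE EXCLUDED.updated_at &gt; t.updated_at;

SELECT id, status FROM orders_lww;</code></pre></div>
<p>Region B's timestamp is 50 ms earlier, so the row stays <code>shipped</code> and the cancellation is silently dropped. If region B's clock were skewed forward by 100 ms, the outcome would flip, which is why NTP matters here.</p>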
<p><strong>2. Column-level conflict resolution</strong>:</p>
<ul>
<li>Each column is resolved by LWW independently (the status column and the amount column are resolved separately)</li>
<li>Finer-grained than row-level LWW, but the application's semantics must tolerate it</li>
</ul>
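<p>A column-level variant can be sketched the same way by tracking one timestamp per column. Again this is an illustration of the rule under assumed names (<code>orders_col</code>, <code>status_ts</code>, <code>amount_ts</code>), not the mechanism any particular extension uses.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical table with a last-modified timestamp per column
CREATE TABLE orders_col (
    id        int PRIMARY KEY,
    status    text,
    status_ts timestamptz,
    amount    numeric,
    amount_ts timestamptz
);

-- Local state: status changed at 10:00:02, amount at 10:00:00
INSERT INTO orders_col
VALUES (100, 'shipped', '2026-05-19 10:00:02+00',
             50,        '2026-05-19 10:00:00+00');

-- Remote row: older status (10:00:01), newer amount (10:00:03)
INSERT INTO orders_col AS t
VALUES (100, 'paid', '2026-05-19 10:00:01+00',
             45,     '2026-05-19 10:00:03+00')
ON CONFLICT (id) DO UPDATE SET
    status    = CASE WHEN EXCLUDED.status_ts &gt; t.status_ts
                     THEN EXCLUDED.status ELSE t.status END,
    status_ts = GREATEST(t.status_ts, EXCLUDED.status_ts),
    amount    = CASE WHEN EXCLUDED.amount_ts &gt; t.amount_ts
                     THEN EXCLUDED.amount ELSE t.amount END,
    amount_ts = GREATEST(t.amount_ts, EXCLUDED.amount_ts);

SELECT status, amount FROM orders_col WHERE id = 100;</code></pre></div>
<p>The merged row keeps the local <code>shipped</code> status and takes the remote amount of 45. Whether such a mixed row is valid is exactly the application-semantics question raised above.</p>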
<p><strong>3. User-defined trigger</strong>:</p>
<ul>
<li>Write a PG function that resolves the conflict</li>
<li>Useful for <em>special business logic</em> (e.g. amounts are summed instead of overwritten)</li>
<li>High maintenance cost</li>
</ul>
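<p>The "sum instead of overwrite" example can be expressed as a small merge function. This sketch shows only the merge arithmetic; the function name and its three-argument shape are assumptions, and the hook signature that BDR or Spock actually expects for custom handlers is different and not shown here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical helper: combine two concurrent edits of a numeric value
-- by applying both deltas relative to the value they started from
CREATE FUNCTION merge_additive(
    base       numeric,  -- value before either region wrote
    local_val  numeric,  -- value after the local write
    remote_val numeric   -- value after the remote write
) RETURNS numeric
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT base + (local_val - base) + (remote_val - base);
$$;

-- Region A added 30, region B subtracted 10, both starting from 100
SELECT merge_additive(100, 130, 90);</code></pre></div>
<p>The call returns 120, keeping both changes, where LWW would have kept either 130 or 90. The cost is that the resolver needs the pre-conflict base value, which is part of why this approach is expensive to maintain.</p>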
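<p><strong>4. Manual reconciliation</strong>:</p>
<ul>
<li>Conflicts are written to a log table and handled manually by the application / DBA</li>
<li>For scenarios that <em>cannot be resolved automatically</em> (e.g. finance)</li>
<li>High ops cost</li>
</ul>
<p>For 99% of cases, use LWW, accept a small amount of data loss, and design application operations to be <em>idempotent / commutative</em> so that conflicts are avoided.</p>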
<h3 id="conflict-機率取決於-application-pattern">Conflict probability depends on the application pattern</h3>
<ul>
<li><em>Tenant-isolated</em> applications (each user_id writes only its own rows): essentially no conflicts</li>
<li><em>Shared counter / inventory</em> applications: high conflict rate, a poor fit for multi-master</li>
<li><em>Append-only event logs</em>: low conflict rate, a good fit for multi-master</li>
</ul>
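<p>One way to move a workload from the "shared counter" category into the "isolated rows" category is to give every node its own counter row and sum on read. The sketch below assumes a hypothetical <code>page_views</code> table and node names such as <code>node1</code>; it is a schema pattern, not a feature of any extension.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- One counter row per (page, node) instead of one shared row per page
CREATE TABLE page_views (
    page_id   int    NOT NULL,
    node_name text   NOT NULL,
    views     bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (page_id, node_name)
);

-- On node1: only ever touch the row tagged with this node's own name,
-- so two regions never update the same row
INSERT INTO page_views (page_id, node_name, views)
VALUES (1, 'node1', 1)
ON CONFLICT (page_id, node_name)
DO UPDATE SET views = page_views.views + 1;

-- On any node: the total is the sum over all per-node rows
SELECT page_id, sum(views) AS total_views
FROM page_views
GROUP BY page_id;</code></pre></div>
<p>This works for counters that only grow. Inventory that must never drop below zero still needs a single authoritative writer, because no node can check the global total before committing.</p>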
<h2 id="配置-step-by-steppgedge-為主">Step-by-step configuration (mainly pgEdge)</h2>
<p>pgEdge is open source and the most common self-hosted choice.</p>
<h3 id="step-1在每個-region-node-裝-pgedge">Step 1: Install pgEdge on the node in each region</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Install pgEdge CLI</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">curl -fsSL https://pgedge-upstream.s3.amazonaws.com/REPO/install.py <span class="p">|</span> python3
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Setup PG + Spock + pgEdge</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">./pgedge install pg16
</span></span><span class="line"><span class="ln">6</span><span class="cl">./pgedge install spock</span></span></code></pre></div><h3 id="step-2配置每個-node">Step 2: Configure each node</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Run on node1 (us-east)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">node_create</span><span class="p">(</span><span class="n">node_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;node1&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node1.example.com port=5432 dbname=production&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Run on node2 (eu-west)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">node_create</span><span class="p">(</span><span class="n">node_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;node2&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node2.example.com port=5432 dbname=production&#39;</span><span class="p">);</span></span></span></code></pre></div><h3 id="step-3建-replication-set--subscribe">Step 3: Create the replication set and subscribe</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- On node1, add all tables to the default replication set (the public schema is an assumption)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">repset_add_all_tables</span><span class="p">(</span><span class="s1">&#39;default&#39;</span><span class="p">,</span><span class="w"> </span><span class="k">ARRAY</span><span class="p">[</span><span class="s1">&#39;public&#39;</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- On node1, subscribe to node2
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">sub_create</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">subscription_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sub_n1_n2&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">provider_dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node2.example.com port=5432 dbname=production&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- On node2: subscribe to node1 (completes the bidirectional pair)
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">sub_create</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">subscription_name</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sub_n2_n1&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="n">provider_dsn</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;host=node1.example.com port=5432 dbname=production&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><h3 id="step-4設-conflict-resolution">Step 4: Configure conflict resolution</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Use last-write-wins (LWW), the default
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">conflict_resolution_setting_set</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">conflict_type</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;update_origin_change&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">resolution_setting</span><span class="w"> </span><span class="p">:</span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;apply_remote&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><h3 id="step-5驗證">Step 5: Verify</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Check subscription status
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">spock</span><span class="p">.</span><span class="n">subscription</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Check replication lag
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_replication</span><span class="p">;</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 Production Pitfalls</h2>
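<h3 id="1-lww-data-loss--application-沒設計-commutative">1. LWW data loss: the application was not designed for commutative writes</h3>
<p>With the default LWW policy, when two regions UPDATE the same row concurrently, the later commit wins and the earlier write is silently discarded. The application gets no signal that its write disappeared, which makes the problem hard to debug.</p>
<p>Fixes:</p>
<ul>
<li>Design the application schema to be <em>tenant-isolated</em>, so each user_id writes only its own rows</li>
<li>For <em>shared counters / inventory</em>, use <em>commutative operations</em> (INCREMENT, not SET)</li>
<li>Add an <em>audit log</em> for important writes: the overwritten write is still recorded there, so the application can see that a conflict happened</li>
<li>If you truly need strict consistency, do not use multi-master; use single-primary plus read replicas, or distributed SQL</li>
</ul>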
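<p>One way to make a shared counter commutative regardless of the replication engine is an insert-only ledger. The following is a minimal sketch, not part of the original setup: the table <code>inventory_movements</code> and all of its columns are hypothetical, and <code>gen_random_uuid()</code> is assumed to be available (it is built in from PG 13). Row-based replication ships the resulting row image, so a plain <code>UPDATE ... SET qty = qty - 1</code> on a shared row can still lose under LWW unless the replication layer is configured to merge deltas. Writing each delta as its own row avoids the shared row entirely.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical ledger: every stock change is its own row, so two regions
-- never UPDATE the same row and LWW has nothing to discard.
CREATE TABLE inventory_movements (
    movement_id uuid        PRIMARY KEY DEFAULT gen_random_uuid(), -- no cross-node key collision
    sku         text        NOT NULL,
    delta       integer     NOT NULL,  -- positive = restock, negative = sale
    origin_node text        NOT NULL,  -- which region wrote the row
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Region A sells one unit, Region B restocks five; arrival order does not matter.
INSERT INTO inventory_movements (sku, delta, origin_node) VALUES (&#39;SKU-1&#39;, -1, &#39;node1&#39;);
INSERT INTO inventory_movements (sku, delta, origin_node) VALUES (&#39;SKU-1&#39;,  5, &#39;node2&#39;);

-- Current stock is the sum of all deltas, the same on every node once replication catches up.
SELECT sku, sum(delta) AS on_hand
FROM inventory_movements
GROUP BY sku;
</code></pre></div>
<p>The trade-off is that reads aggregate over the ledger, so a long-lived ledger usually needs a periodic rollup into a summary row.</p>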
<h3 id="2-sequence-collision--two-region-各自-next-同號">2. Sequence collision: two regions draw the same next value</h3>
<p><code>SERIAL</code> / <code>IDENTITY</code> columns are backed by a sequence, and each node advances its own copy. Two regions calling nextval independently can get the same number, and the INSERT then conflicts (duplicate PK).</p>
<p>Fixes:</p>
<ul>
<li>Use <em>staggered sequence ranges</em>: node1 takes 1 to 1M, node2 takes 1M+1 to 2M (set with <code>setval</code>)</li>
<li>Or use a <em>UUID</em> (v4 / v7) as the PK, which cannot collide across nodes</li>
<li>Or give each node its own <em>sequence namespace</em>: <code>CREATE SEQUENCE orders_id_node1 START 1 INCREMENT 2</code> (odd on one node, even on the other)</li>
</ul>
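<p>The odd/even variant can be sketched as follows. This assumes a two-node cluster and an <code>orders.id</code> column that is a plain <code>bigint</code> with a default (not an <code>IDENTITY</code> column); the name <code>orders_id_node2</code> is hypothetical and simply mirrors <code>orders_id_node1</code> from the list above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on node1 only: hands out 1, 3, 5, ...
CREATE SEQUENCE orders_id_node1 START 1 INCREMENT 2;
ALTER TABLE orders ALTER COLUMN id SET DEFAULT nextval(&#39;orders_id_node1&#39;);

-- Run on node2 only: hands out 2, 4, 6, ...
CREATE SEQUENCE orders_id_node2 START 2 INCREMENT 2;
ALTER TABLE orders ALTER COLUMN id SET DEFAULT nextval(&#39;orders_id_node2&#39;);
</code></pre></div>
<p>With more than two nodes the increment has to equal the node count, and adding a node later means re-striding every sequence, which is one reason UUID keys are often the simpler choice.</p>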
<h3 id="3-ddl-replication-不自動">3. DDL is not replicated automatically</h3>
<p>PG logical replication (the foundation of pgEdge / BDR) <em>does not replicate DDL automatically</em>. Every <code>CREATE TABLE</code> / <code>ALTER TABLE</code> must be <em>run separately</em> on each node.</p>
<p>Fixes:</p>
<ul>
<li>Use <em>deployment automation</em> (Ansible / Terraform) to run DDL against all nodes together</li>
<li>pgEdge provides <code>spock.replicate_ddl(...)</code>, which turns DDL into a replicable event</li>
<li>BDR Enterprise includes <em>DDL replication</em> (a commercial feature)</li>
<li>Before a DDL change, confirm that <em>all nodes are healthy</em> to reduce the chance of partial state</li>
</ul>
<h3 id="4-conflict-log-治理--log-table-爆滿">4. Conflict log governance: the log table fills up</h3>
<p>Every conflict is written to a table such as <code>spock.conflict_log</code> or <code>bdr.conflict_history</code>. Left unmanaged, the log keeps growing until it fills the disk.</p>
<p>Fixes:</p>
<ul>
<li>Set <em>log retention</em>: a cron job periodically archives and deletes old conflict log rows</li>
<li>Monitor the conflict rate; a high conflict rate is an application design problem, not an ops problem</li>
<li>For <em>business-critical</em> conflicts, also write to an application-level audit table rather than relying only on the system log</li>
</ul>
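<p>The retention and monitoring steps can be sketched against an application-level audit table. Everything named here is hypothetical (<code>app_conflict_audit</code>, its columns, the archive table, the 90-day window); the extension's own conflict table is deliberately not touched because its column layout depends on the product and version in use.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical application-level audit table for conflicts that matter to the business.
CREATE TABLE app_conflict_audit (
    id          uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    table_name  text        NOT NULL,
    row_key     text        NOT NULL,
    detail      jsonb       NOT NULL,
    detected_at timestamptz NOT NULL DEFAULT now()
);

-- Same columns, no defaults: rows are copied in as-is.
CREATE TABLE app_conflict_audit_archive (LIKE app_conflict_audit);

-- Retention job (run from cron): move rows older than 90 days in one statement.
WITH moved AS (
    DELETE FROM app_conflict_audit
    WHERE detected_at &lt; now() - interval &#39;90 days&#39;
    RETURNING *
)
INSERT INTO app_conflict_audit_archive
SELECT * FROM moved;

-- Monitoring: conflicts per hour over the last day.
SELECT date_trunc(&#39;hour&#39;, detected_at) AS hour, count(*) AS conflicts
FROM app_conflict_audit
WHERE detected_at &gt;= now() - interval &#39;24 hours&#39;
GROUP BY 1
ORDER BY 1;
</code></pre></div>
<p>Doing the delete and the archive insert in a single statement keeps the move atomic, so a crash cannot drop rows without archiving them.</p>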
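<h3 id="5-failover-後-timeline-分歧">5. Timeline divergence after failover</h3>
<p>By design, <em>every region is a primary</em> in a multi-master setup. When Region A goes down, Region B takes over, but when Region A comes back it <em>still considers itself a primary</em>. If Region A accepted writes before the outage that were never replicated out, those stale writes surface on recovery and are resolved by LWW against newer data, which may not match what the application expects.</p>
<p>Fixes:</p>
<ul>
<li><em>Fence Region A on recovery</em>: a physical fence (network firewall) plus a manual unfence procedure</li>
<li>Integrate <em>etcd / Consul</em> with BDR / Spock for leader election (to avoid split-brain)</li>
<li>For cross-region multi-master, keep a <em>runbook</em> for the region recovery procedure and do not rely on automation</li>
</ul>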
<h2 id="何時用-multi-master-vs-不用">When to use multi-master, and when not to</h2>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Recommendation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Genuine cross-region active-active write requirement</td>
          <td>BDR / pgEdge</td>
      </tr>
      <tr>
          <td>Maintenance that cannot interrupt service (zero-downtime upgrade)</td>
          <td>BDR / pgEdge</td>
      </tr>
      <tr>
          <td>High conflict rate (shared counter / inventory)</td>
          <td>Avoid multi-master; use distributed SQL</td>
      </tr>
      <tr>
          <td>Mainly read scaling, stale reads acceptable</td>
          <td>streaming replication + read replica (simpler)</td>
      </tr>
      <tr>
          <td>Strict consistency requirement</td>
          <td>single-primary + sync replication, or Aurora DSQL / Spanner</td>
      </tr>
      <tr>
          <td>Budget-sensitive and unwilling to staff BDR / pgEdge operations</td>
          <td>Avoid multi-master; use managed distributed SQL</td>
      </tr>
  </tbody>
</table>
<h2 id="跟-mysql-group-replication-對比">Comparison with MySQL Group Replication</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PG Multi-Master</th>
          <th>MySQL Group Replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Built in?</td>
          <td>No, requires an extension</td>
          <td>Yes, ships with MySQL since 5.7.17 (as a plugin)</td>
      </tr>
      <tr>
          <td>Commercial vs open source</td>
          <td>BDR commercial / pgEdge open source</td>
          <td>Oracle commercial or community edition</td>
      </tr>
      <tr>
          <td>Sync mode</td>
          <td>Available (BDR eager)</td>
          <td>Yes (certification-based)</td>
      </tr>
      <tr>
          <td>Conflict resolution</td>
          <td>LWW / column / user-defined</td>
          <td>Certification-based (distributed transaction)</td>
      </tr>
      <tr>
          <td>Production maturity</td>
          <td>BDR high, pgEdge medium</td>
          <td>High (backed by Oracle)</td>
      </tr>
      <tr>
          <td>Share of deployments</td>
          <td>Small (PG mostly runs single-primary)</td>
          <td>Larger (MySQL promotes InnoDB Cluster)</td>
      </tr>
  </tbody>
</table>
<p>MySQL GR is built in and backed by Oracle; PG has no built-in equivalent. For organizations with heavy multi-master requirements, the MySQL GR path is more direct.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-replication-topology">With Replication Topology</h3>
<p>Multi-master is <em>bidirectional logical replication layered on top of streaming replication</em>; it does not replace streaming. Streaming still serves standby / failover, while multi-master serves active-active writes. See <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>.</p>
<h3 id="跟-logical-replication">With Logical Replication</h3>
<p>Both pgEdge and BDR are built on logical replication slots and share PG's logical decoding infrastructure with <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>, but the <em>configuration and tooling</em> differ.</p>
<h3 id="跟-mvcc">With MVCC</h3>
<p>Multi-master conflicts are detected <em>after commit</em> (asynchronously), not inside the transaction. That is a different layer from single-cluster MVCC (transaction snapshots within one cluster). See <a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a> (streaming and multi-master side by side)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a> (logical decoding foundation)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a> (multi-master conflicts vs single-cluster MVCC)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PG Patroni HA</a> (the single-primary HA alternative)</li>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 Globally Distributed OLTP</a> (multi-master vs distributed SQL)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a> (sibling article, different implementation)</li>
<li>Official: <a href="https://www.enterprisedb.com/products/edb-postgres-distributed-bdr">EDB BDR</a> / <a href="https://www.pgedge.com/">pgEdge</a> / <a href="https://github.com/pgEdge/spock">Spock GitHub</a> / <a href="https://bucardo.org/">Bucardo</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Query Optimization: three tool layers (EXPLAIN ANALYZE / pg_hint_plan / auto_explain) and 4 cases</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/query-optimization/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/query-optimization/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on &lt;em>query optimization&lt;/em>: the three tool layers (EXPLAIN ANALYZE / auto_explain / pg_hint_plan) and 4 real cases.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="4-個常見-production-case">4 common production cases&lt;/h2>
&lt;p>When a PG query is slow, the root cause is most often &lt;em>the planner choosing the wrong plan&lt;/em>. The 4 cases below are an entry point into query optimization:&lt;/p>
&lt;h3 id="case-15-秒--50ms--seq-scan-vs-index">Case 1: 5 s → 50 ms, seq scan vs index&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- Slow (5 s)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">amount&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">JOIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">customer_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">region&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;TW&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AND&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2026-05-01&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>EXPLAIN (ANALYZE, BUFFERS)&lt;/code>：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Hash Join (cost=20000..50000 rows=100 width=...) (actual time=4900..5000 rows=10000)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> -&amp;gt; Seq Scan on customers c (cost=0..20000 rows=1000000 width=...)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> Filter: (region = &amp;#39;TW&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> Rows Removed by Filter: 900000
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> -&amp;gt; Hash (cost=...)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl"> -&amp;gt; Index Scan on orders_created_idx&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Problem: &lt;code>customers.region&lt;/code> has no index, so the planner falls back to a seq scan even though only 10% of rows have region=TW. Fix:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CONCURRENTLY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">idx_customers_region&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">region&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ANALYZE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- refresh table statistics so the planner&amp;#39;s row estimates are current&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>After adding the index, the query drops from 5 s to 50 ms.&lt;/p>
&lt;h3 id="case-230-秒--200ms--hash-join-沒觸發用-nested-loop">Case 2: 30 s → 200 ms, nested loop chosen instead of hash join&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">count&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LEFT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">JOIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">GROUP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>EXPLAIN ANALYZE shows a &lt;em>Nested Loop&lt;/em> running the inner loop 1M times and taking 30 seconds. The planner misestimated the row count and chose a nested loop; a hash join should finish in under 200 ms.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on <em>query optimization</em>: the three tool layers (EXPLAIN ANALYZE / auto_explain / pg_hint_plan) and 4 real cases.</p></blockquote>
<hr>
<h2 id="4-個常見-production-case">4 common production cases</h2>
<p>When a PG query is slow, the root cause is most often <em>the planner choosing the wrong plan</em>. The 4 cases below are an entry point into query optimization:</p>
<h3 id="case-15-秒--50ms--seq-scan-vs-index">Case 1: 5 s → 50 ms, seq scan vs index</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Slow (5 s)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">amount</span><span class="p">,</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">name</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">c</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">;</span></span></span></code></pre></div><p><code>EXPLAIN (ANALYZE, BUFFERS)</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Hash Join  (cost=20000..50000 rows=100 width=...) (actual time=4900..5000 rows=10000)
</span></span><span class="line"><span class="ln">2</span><span class="cl">  -&gt;  Seq Scan on customers c  (cost=0..20000 rows=1000000 width=...)
</span></span><span class="line"><span class="ln">3</span><span class="cl">      Filter: (region = &#39;TW&#39;)
</span></span><span class="line"><span class="ln">4</span><span class="cl">      Rows Removed by Filter: 900000
</span></span><span class="line"><span class="ln">5</span><span class="cl">  -&gt;  Hash  (cost=...)
</span></span><span class="line"><span class="ln">6</span><span class="cl">      -&gt;  Index Scan on orders_created_idx</span></span></code></pre></div><p>Problem: <code>customers.region</code> has no index, so the planner falls back to a seq scan even though only 10% of rows have region=TW. Fix:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">CONCURRENTLY</span><span class="w"> </span><span class="n">idx_customers_region</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">customers</span><span class="p">(</span><span class="n">region</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">customers</span><span class="p">;</span><span class="w">  </span><span class="c1">-- refresh table statistics so the planner&#39;s row estimates are current</span></span></span></code></pre></div><p>After adding the index, the query drops from 5 s to 50 ms.</p>
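<p>Before trusting the new timing, it is worth confirming that the index build actually succeeded and that the plan changed. This is a sketch using the table and query from this case; a <code>CREATE INDEX CONCURRENTLY</code> that fails part-way leaves an invalid index behind, which the planner ignores.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- List indexes on customers and whether each one is usable by the planner.
SELECT c.relname AS index_name, i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = &#39;customers&#39;::regclass;

-- Re-run the plan and check that the Seq Scan on customers is gone.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.amount, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE c.region = &#39;TW&#39; AND o.created_at &gt; &#39;2026-05-01&#39;;
</code></pre></div>
<p>If <code>indisvalid</code> is false, drop the index and build it again before re-measuring.</p>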
<h3 id="case-230-秒--200ms--hash-join-沒觸發用-nested-loop">Case 2: 30 s → 200 ms, nested loop chosen instead of hash join</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="n">o</span><span class="p">.</span><span class="n">id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="n">u</span><span class="w"> </span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">name</span><span class="p">;</span></span></span></code></pre></div><p>EXPLAIN ANALYZE shows a <em>Nested Loop</em> running the inner loop 1M times and taking 30 seconds. The planner misestimated the row count and chose a nested loop; a hash join should finish in under 200 ms.</p>
<p>Fix:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ANALYZE</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Raise the statistics target for a critical column (per-column override of default_statistics_target)
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">ALTER</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="k">STATISTICS</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span></span></span></code></pre></div><p>With finer-grained statistics the planner estimates row counts accurately and switches to a hash join on its own.</p>
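<p>To check that the new statistics landed and that the estimate improved, one option is to look at <code>pg_stats</code> and then compare estimated and actual row counts in the plan. The sketch below assumes the tables live in the <code>public</code> schema.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- What the planner now believes about orders.user_id.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE schemaname = &#39;public&#39;
  AND tablename  = &#39;orders&#39;
  AND attname    = &#39;user_id&#39;;

-- In the output, compare the estimated rows= with the actual rows= on each node,
-- and confirm the join node is now a Hash Join rather than a Nested Loop.
EXPLAIN (ANALYZE)
SELECT u.name, count(o.id)
FROM users u LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.name;
</code></pre></div>
<p>A large remaining gap between estimated and actual rows suggests the misestimate comes from something other than single-column statistics.</p>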
<h3 id="case-38-秒--100ms--multi-column-統計缺">Case 3: 8 s → 100 ms, missing multi-column statistics</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pending&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="p">;</span></span></span></code></pre></div><p><code>status = 'pending'</code> 5% row、<code>region = 'TW'</code> 10% row。Planner 假設兩 column 獨立、估 0.5% (5K row)。實際 status=&lsquo;pending&rsquo; 跟 region=&lsquo;TW&rsquo; 強相關（TW 訂單多 pending）、實際 4% (40K row)。Planner 估錯 8x、選錯 plan。</p>
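<p>The 8x figure can be reproduced with plain arithmetic. The table size of 1,000,000 rows is an assumption, inferred from the text equating 0.5% with 5K rows.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Assumption: orders has 1,000,000 rows (inferred from 0.5% = 5K rows).
SELECT 0.05 * 0.10                                AS estimated_selectivity, -- independence: P(status) * P(region)
       1000000 * 0.05 * 0.10                      AS estimated_rows,        -- what the planner plans for
       1000000 * 0.04                             AS actual_rows,           -- what the query really returns
       (1000000 * 0.04) / (1000000 * 0.05 * 0.10) AS underestimate_factor;  -- actual divided by estimated
</code></pre></div>
<p>An underestimate of this size is often enough to tip the planner toward a plan that only suits small row counts.</p>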
<p>Fix (<code>CREATE STATISTICS</code> needs PG 10+; the <code>mcv</code> kind needs PG 12+):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">STATISTICS</span><span class="w"> </span><span class="n">stats_orders_status_region</span><span class="w"> </span><span class="p">(</span><span class="n">dependencies</span><span class="p">,</span><span class="w"> </span><span class="n">ndistinct</span><span class="p">,</span><span class="w"> </span><span class="n">mcv</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">status</span><span class="p">,</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- the planner now knows how status and region correlate and estimates accurately</span></span></span></code></pre></div><h3 id="case-420-秒--5-秒--parallel-query-沒觸發">Case 4: 20 s → 5 s, parallel query not triggered</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">region</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">),</span><span class="w"> </span><span class="k">sum</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">region</span><span class="p">;</span></span></span></code></pre></div><p><code>orders</code> has 100M rows. The expectation was a parallel scan plus a parallel aggregate, but the query ran in a single process and took 20 seconds.</p>
<p>EXPLAIN shows that no parallel workers were planned (<code>Workers Planned: 0</code>, or no <code>Gather</code> node at all).</p>
<p>Fix:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">max_parallel_workers_per_gather</span> <span class="o">=</span> <span class="s">4</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">max_parallel_workers</span> <span class="o">=</span> <span class="s">8</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">max_worker_processes</span> <span class="o">=</span> <span class="s">16</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">parallel_setup_cost</span> <span class="o">=</span> <span class="s">100        # default 1000; lower it so the planner is more willing to go parallel</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">parallel_tuple_cost</span> <span class="o">=</span> <span class="s">0.01       # default 0.1</span></span></span></code></pre></div><p>With parallelism the query takes 5 seconds.</p>
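<p>Before editing <code>postgresql.conf</code>, the same knobs can be tried in a single session. This is a sketch under the assumption that the <code>orders</code> query above is the one being tuned; the three parameters below can be changed per session, while <code>max_worker_processes</code> only takes effect at server start.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Session-level trial: affects only the current connection.
SET max_parallel_workers_per_gather = 4;
SET parallel_setup_cost = 100;
SET parallel_tuple_cost = 0.01;

-- Look for a Gather node and a "Workers Planned" line in the output.
EXPLAIN
SELECT region, count(*), sum(amount) FROM orders GROUP BY region;

-- Go back to the server-wide values for this session.
RESET max_parallel_workers_per_gather;
RESET parallel_setup_cost;
RESET parallel_tuple_cost;
</code></pre></div>
<p>If the plan only turns parallel with these values, that is a hint the cluster-wide defaults are the limiting factor rather than the query itself.</p>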
<h2 id="explain-三層工具">Three layers of EXPLAIN tooling</h2>
<h3 id="tool-1explain--plan-preview">Tool 1: EXPLAIN, plan preview</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>Prints the <em>estimated</em> cost, row count, and width of every plan node. <strong>Use it for a quick plan check</strong>.</p>
<p>Key fields:</p>
<ul>
<li><code>Plan node type</code>: for selective lookups the usual preference is <code>Seq Scan</code> &lt; <code>Index Scan</code> &lt; <code>Index Only Scan</code>; the warning sign is an <em>unexpected</em> node type</li>
<li><code>cost=START..END</code>: the planner's estimated cost in arbitrary units; START is the startup cost, END is the total cost</li>
<li><code>rows</code>: estimated number of output rows</li>
<li><code>width</code>: average bytes per row (affects sort / hash memory)</li>
</ul>
<h3 id="tool-2explain-analyze--實際執行--對比-estimate">Tool 2: EXPLAIN ANALYZE, real execution compared against the estimate</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">,</span><span class="w"> </span><span class="k">VERBOSE</span><span class="p">)</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>The difference: it <em>actually runs the query</em> and prints real row counts and timings next to the estimates:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Hash Join  (cost=20000..50000 rows=100) (actual time=400..500 rows=10000 loops=1)</span></span></code></pre></div><p><code>rows=100 (estimate)</code> vs <code>rows=10000 (actual)</code>: the estimate is off by 100x, so the planner may have picked the wrong plan. <code>BUFFERS</code> shows shared-buffer hits versus blocks read from outside shared buffers (disk or OS page cache).</p>
<p><strong>Note</strong>: EXPLAIN ANALYZE <em>actually runs the query</em>, so data-modifying statements (INSERT / UPDATE / DELETE) really change data. Read-only queries are safe. Wrap data-modifying statements in a transaction and roll it back:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">EXPLAIN</span><span class="w"> </span><span class="k">ANALYZE</span><span class="w"> </span><span class="k">UPDATE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;x&#39;</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ROLLBACK</span><span class="p">;</span></span></span></code></pre></div><h3 id="tool-3auto_explain--production-query-自動-capture">Tool 3: auto_explain, automatic capture of production queries</h3>
<p>The <code>auto_explain</code> module automatically logs the plans of slow queries:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">shared_preload_libraries</span> <span class="o">=</span> <span class="s">&#39;auto_explain&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">auto_explain.log_min_duration</span> <span class="o">=</span> <span class="s">&#39;1s&#39;    # log the plan of any statement slower than 1 s</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">auto_explain.log_analyze</span> <span class="o">=</span> <span class="s">on            # include actual run-time statistics (adds timing overhead to every statement)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">auto_explain.log_buffers</span> <span class="o">=</span> <span class="s">on</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">auto_explain.log_format</span> <span class="o">=</span> <span class="s">&#39;json&#39;         # JSON output for tooling</span></span></span></code></pre></div><p>Slow production queries land in the log automatically, with no manual EXPLAIN needed. Combining pg_stat_statements with auto_explain is the standard query-observability setup for PG.</p>
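<p>To try auto_explain without touching <code>shared_preload_libraries</code> or restarting, the module can be loaded into a single session. A minimal sketch, assuming a superuser connection and the <code>orders</code> table used earlier as a stand-in for the statement under investigation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Load the module for this session only (requires superuser).
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = on;

-- Any statement in this session that runs longer than 1 s
-- now has its plan written to the server log.
SELECT region, count(*) FROM orders GROUP BY region;
</code></pre></div>
<p>This keeps the instrumentation overhead of <code>log_analyze</code> confined to one connection while the settings are being evaluated.</p>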
<h2 id="pg_hint_plan-vs-planner-guc">pg_hint_plan vs Planner GUC</h2>
<p>PG offers two ways to nudge the planner:</p>
<h3 id="planner-gucglobal">Planner GUC (global)</h3>
<p>In <code>postgresql.conf</code>:</p>
<ul>
<li><code>enable_seqscan = off</code>: discourages seq scans by giving them a very high cost, which pushes the planner toward indexes (a seq scan is still used when no other path exists)</li>
<li><code>enable_nestloop = off</code>: discourages nested loops (pushes toward hash / merge join)</li>
<li><code>random_page_cost = 1.1</code>: set low for SSD (the default of 4 is an HDD assumption)</li>
<li><code>effective_cache_size = '16GB'</code>: estimate of shared_buffers plus OS cache; influences the planner's cost model</li>
</ul>
<p>Set in <code>postgresql.conf</code>, a GUC is <em>global</em>: it affects every query. For <em>a single query, use a hint</em>:</p>
<h3 id="pg_hint_plan-extensionper-query-hint">pg_hint_plan extension (per-query hint)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- force a specific plan
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="cm">/*+ IndexScan(orders idx_orders_status) NestLoop(orders customers) */</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>Hint forms:</p>
<ul>
<li><code>IndexScan(t1 idx_name)</code>: forces an index scan</li>
<li><code>SeqScan(t1)</code>: forces a seq scan</li>
<li><code>HashJoin(t1 t2)</code> / <code>NestLoop(t1 t2)</code> / <code>MergeJoin(t1 t2)</code>: force the join method</li>
<li><code>Leading(t1 t2 t3)</code>: forces the join order</li>
<li><code>Rows(t1 t2 #100)</code>: overrides the estimated row count of the t1-t2 join</li>
</ul>
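<p>Several hints can be combined in one comment. The sketch below assumes pg_hint_plan is installed on the server and that <code>orders</code> has a <code>customer_id</code> column referencing <code>customers.id</code>; those column names and the <code>name</code> column are hypothetical. When a query uses aliases, the hints refer to the aliases.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Enable pg_hint_plan for this session (assumes the library is installed).
LOAD 'pg_hint_plan';

/*+ Leading(c o) HashJoin(c o) Rows(c o #100) */
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;
</code></pre></div>
<p>Running it under <code>EXPLAIN</code> first shows whether the hint was picked up before the query is executed for real.</p>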
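<p><strong>Recommendation</strong>:</p>
<ul>
<li>Cluster-wide behaviour: use a GUC (e.g. <code>random_page_cost</code>)</li>
<li>Single-query behaviour: use pg_hint_plan (it does not affect other queries)</li>
<li>Do not over-hint: the planner <em>is right</em> most of the time, and hints are a last resort</li>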
</ul>
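<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-statistics-過時--planner-估錯-row-count">1. Stale statistics: the planner misestimates row counts</h3>
<p><code>ANALYZE</code> runs as part of autovacuum, and the default <em>autovacuum_analyze_scale_factor=0.1</em> means a table is re-analyzed only after about 10% of its rows have changed (plus <code>autovacuum_analyze_threshold</code>, default 50 rows). On <em>fast-growing tables</em> (logs / events), ANALYZE falls behind and the planner works from stale statistics.</p>
<p>Fix:</p>
<ul>
<li>
<p>For critical tables, set a <em>more aggressive autovacuum_analyze_scale_factor</em>:</p>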





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="n">autovacuum_analyze_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">02</span><span class="p">);</span></span></span></code></pre></div></li>
<li>
<p>After a <em>bulk load</em>, run <code>ANALYZE events;</code> manually</p>
</li>
<li>
<p>Monitor <code>pg_stat_user_tables.last_analyze</code> and <code>last_autoanalyze</code> against row-count growth to decide whether a manual trigger is needed</p>
</li>
</ul>
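<p>The monitoring step can be sketched as a single query against <code>pg_stat_user_tables</code>. This is one possible shape, not the only one; the <code>mod_ratio</code> column name and the limit of 20 are arbitrary choices.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Tables with the most row changes since their last ANALYZE.
SELECT relname,
       n_live_tup,
       n_mod_since_analyze,
       round(n_mod_since_analyze::numeric / greatest(n_live_tup, 1), 3) AS mod_ratio,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
</code></pre></div>
<p>A table whose <code>mod_ratio</code> stays well below the configured scale factor while its plans keep degrading is a candidate for a lower per-table <code>autovacuum_analyze_scale_factor</code>.</p>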
<h3 id="2-multi-column-statistics--planner-假設-column-獨立">2. Multi-column statistics: the planner assumes columns are independent</h3>
<p>As in Case 3, single-column statistics misestimate <em>correlated columns</em>.</p>
<p>Fix:</p>
<ul>
<li>For <em>column combinations that are often queried together</em>, create extended statistics with <code>CREATE STATISTICS</code> (PG 10+)</li>
<li>3 kinds: <code>dependencies</code> (functional dependency), <code>ndistinct</code> (multi-column distinct count), <code>mcv</code> (most common value combinations, PG 12+)</li>
<li>After creating them, <em>ANALYZE must run</em> before they take effect</li>
</ul>
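<p>Whether the extended statistics were actually built can be checked in the <code>pg_stats_ext</code> view (available from PG 12). A minimal sketch, assuming the statistics object was created on <code>orders</code> as in Case 3:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- One row per extended statistics object defined on the table.
SELECT statistics_name,
       attnames,       -- columns covered by the object
       kinds,          -- which statistic kinds are enabled
       n_distinct,
       dependencies
FROM pg_stats_ext
WHERE tablename = 'orders';
</code></pre></div>
<p>If the row exists but the data columns are still empty, the most likely explanation is that <code>ANALYZE</code> has not run since the object was created.</p>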
<h3 id="3-cost-base-setting-不對齊硬體--planner-偏-seq-scan">3. Cost settings not matched to hardware: the planner leans toward seq scans</h3>
<p>The defaults <code>random_page_cost = 4</code> and <code>seq_page_cost = 1</code> encode an <em>HDD assumption</em> (random I/O costed at 4x sequential). On SSD / NVMe the gap between random and sequential I/O is small, so the planner should not penalize random access by 4x.</p>
<p>Fix:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Option A: SSD (apply only one of the two options)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">random_page_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Option B: NVMe (a later ALTER SYSTEM overrides an earlier one)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">random_page_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_reload_conf</span><span class="p">();</span></span></span></code></pre></div><p>With <code>random_page_cost</code> lowered, the planner's cost estimate for index scans is more accurate and it chooses indexes more readily.</p>
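<p>After the reload it is worth confirming which value is in effect and where it came from. The sketch below also shows a per-tablespace override for clusters with mixed storage; <code>fast_nvme</code> is a hypothetical tablespace name.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Value in effect and its source (default, configuration file, session, ...).
SELECT name, setting, source
FROM pg_settings
WHERE name IN ('random_page_cost', 'seq_page_cost');

-- Mixed storage: override the cost for one tablespace instead of the whole cluster.
-- "fast_nvme" is a hypothetical tablespace name.
ALTER TABLESPACE fast_nvme SET (random_page_cost = 1.0);
</code></pre></div>
<p>The tablespace form keeps the HDD-oriented default for slower volumes while letting tables on fast storage be costed realistically.</p>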
<h3 id="4-effective_cache_size-不對齊實際-ram">4. <code>effective_cache_size</code> not matched to actual RAM</h3>
<p><code>effective_cache_size</code> defaults to 4 GB, so the planner assumes shared_buffers plus OS cache total 4 GB. Take a server with 64 GB RAM, <code>shared_buffers = 16GB</code>, and roughly 30 GB of OS page cache: the cache actually available is about 46 GB.</p>
<p>Fix:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">effective_cache_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;46GB&#39;</span><span class="p">;</span><span class="w">  </span><span class="c1">-- estimate: shared_buffers + OS cache</span></span></span></code></pre></div><p>After raising it, the planner assumes most of a query's pages are already cached, lowers its <em>estimated random I/O cost</em>, and chooses indexes more readily.</p>
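<p><code>ALTER SYSTEM</code> only writes the new value to <code>postgresql.auto.conf</code>; it takes effect after a configuration reload. A short sketch of the follow-up steps, using only built-in functions and views:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Make the server re-read its configuration files.
SELECT pg_reload_conf();

-- Compare the planner's cache assumption with the configured shared_buffers.
SELECT name, current_setting(name) AS value
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size');
</code></pre></div>
<p><code>effective_cache_size</code> does not allocate memory; it is only an input to the cost model, so raising it carries no memory risk by itself.</p>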
<h3 id="5-parallel-query-不觸發">5. Parallel query does not kick in</h3>
<p>The default <code>max_parallel_workers_per_gather = 2</code> is not enough for some workloads. Or the <em>table is too small</em>: with the default <code>min_parallel_table_scan_size = 8MB</code>, smaller tables are not scanned in parallel.</p>
<p>Fix:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">max_parallel_workers_per_gather</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">4</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">parallel_setup_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">parallel_tuple_cost</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">01</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">min_parallel_table_scan_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;0&#39;</span><span class="p">;</span><span class="w">  </span><span class="c1">-- consider parallel scans for tables of any size</span></span></span></code></pre></div><p>Check <code>Workers Planned</code> in <code>EXPLAIN</code> (and <code>Workers Launched</code> under <code>EXPLAIN ANALYZE</code>) to confirm the query really runs in parallel.</p>
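<p>A narrower alternative to cluster-wide settings is the per-table <code>parallel_workers</code> storage parameter. The sketch below assumes the <code>orders</code> table from Case 4 and a worker count of 4 chosen only for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Reload so that values written by ALTER SYSTEM take effect.
SELECT pg_reload_conf();

-- Per-table override: let the planner consider up to 4 workers when scanning this table.
-- The number actually used is still capped by max_parallel_workers_per_gather.
ALTER TABLE orders SET (parallel_workers = 4);

-- Compare "Workers Planned" with "Workers Launched" in the output.
EXPLAIN (ANALYZE)
SELECT region, count(*), sum(amount) FROM orders GROUP BY region;
</code></pre></div>
<p>If fewer workers are launched than planned, the worker pool (<code>max_parallel_workers</code> / <code>max_worker_processes</code>) is probably exhausted by other sessions.</p>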
<h2 id="觀測-metric">Metrics to watch</h2>
<p>Monitor continuously in production:</p>
<ul>
<li><code>pg_stat_statements</code>: cumulative calls / time / rows / I/O per query digest (normalized statement)</li>
<li><code>auto_explain</code> log: the actual plan of each slow query, with ANALYZE statistics</li>
<li><code>pg_stat_user_tables.last_analyze</code> / <code>last_autoanalyze</code>: statistics freshness</li>
<li><code>pg_stat_user_indexes.idx_scan</code>: usage count per index; 0 means unused since the last statistics reset, so the index is a candidate for dropping</li>
</ul>
<p>Ship these to Datadog / Prometheus (via <code>postgres_exporter</code> / <code>pg_exporter</code>) for trend analysis.</p>
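<p>Two of these metrics can be read ad hoc with plain SQL. The sketch assumes the <code>pg_stat_statements</code> extension is installed and preloaded; the column names are the PG 13+ ones (older releases use <code>total_time</code> / <code>mean_time</code>).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Top statements by cumulative execution time.
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 80)                    AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Indexes never scanned since the last statistics reset, largest first.
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
</code></pre></div>
<p>Before dropping an index from the second list, check that it does not back a primary key or unique constraint and that it is not used only on a standby, whose scans are not counted on the primary.</p>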
<h2 id="跟-mysql-query-optimization-對照">Comparison with MySQL Query Optimization</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PG</th>
          <th>MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Query plan preview</td>
          <td><code>EXPLAIN</code></td>
          <td><code>EXPLAIN</code></td>
      </tr>
      <tr>
          <td>Actual execution statistics</td>
          <td><code>EXPLAIN ANALYZE</code></td>
          <td><code>EXPLAIN ANALYZE</code> (8.0.18+)</td>
      </tr>
      <tr>
          <td>Auto-capture</td>
          <td><code>auto_explain</code> extension</td>
          <td><code>slow_query_log</code> + <code>pt-query-digest</code></td>
      </tr>
      <tr>
          <td>Optimizer trace</td>
          <td>log_planner_stats / log_executor_stats</td>
          <td><code>optimizer_trace</code> (JSON)</td>
      </tr>
      <tr>
          <td>Per-query hint</td>
          <td><code>pg_hint_plan</code> extension</td>
          <td>optimizer hint comment (<code>/*+ */</code>)</td>
      </tr>
      <tr>
          <td>Multi-column statistics</td>
          <td><code>CREATE STATISTICS</code></td>
          <td>None natively (relies on index statistics)</td>
      </tr>
      <tr>
          <td>Parallel query</td>
          <td>Full (scan / agg / join, PG 9.6+)</td>
          <td>Limited (8.0.14+ parallel clustered-index reads for <code>COUNT(*)</code> / <code>CHECK TABLE</code>)</td>
      </tr>
      <tr>
          <td>Cost-base setting</td>
          <td>random_page_cost / effective_cache_size</td>
          <td>Mostly implicit; optimizer defaults</td>
      </tr>
  </tbody>
</table>
<p>The PG planner is mature overall and handles complex OLAP-style queries better. MySQL 8.0 closed much of the gap (histograms / hash join) but is still weaker than PG on complex queries. See <a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">MySQL Query Optimization</a>.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-autovacuum-tuning">With Autovacuum Tuning</h3>
<p>ANALYZE runs as part of autovacuum: when autovacuum falls behind, statistics go stale and the planner misestimates. See <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>.</p>
<h3 id="跟-replication-topology">With Replication Topology</h3>
<p>Queries on a standby use the same statistics (streaming replication copies the entire system catalog, including <code>pg_statistic</code>), so planner behaviour is consistent. However, <em>hot_standby_feedback on a standby</em> holds back autovacuum cleanup on the primary. See <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>.</p>
<h3 id="跟-partitioning">With Partitioning</h3>
<p>Partition pruning is tightly tied to the query plan: use <code>EXPLAIN</code> to check that the right partitions are pruned. See <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>.</p>
<h2 id="何時用-pg_hint_plan-vs-guc">When to use pg_hint_plan vs GUC</h2>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Choice</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster-wide behaviour (e.g. random_page_cost on SSD)</td>
          <td>GUC</td>
      </tr>
      <tr>
          <td>Forcing a specific plan for a single critical query</td>
          <td>pg_hint_plan</td>
      </tr>
      <tr>
          <td>Temporarily disabling a plan type for debugging</td>
          <td><code>SET enable_xxx=off</code> per-session</td>
      </tr>
      <tr>
          <td>Production stable use</td>
          <td>Primarily GUC + multi-column statistics; hints as a last resort</td>
      </tr>
  </tbody>
</table>
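<p>Between a cluster-wide GUC and a hint there is a middle ground: <code>SET LOCAL</code> scopes a planner GUC to one transaction. A minimal sketch for the debugging row above, assuming hypothetical <code>orders.customer_id</code> and <code>customers.name</code> columns:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- The setting is discarded here; other sessions were never affected.
ROLLBACK;
</code></pre></div>
<p>This makes it cheap to test whether a different join method would help before reaching for pg_hint_plan or changing a server-wide setting.</p>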
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a> (ANALYZE and statistics freshness)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a> (planner behaviour on standbys)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">PG Declarative Partitioning</a> (partition pruning)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">MySQL Query Optimization</a> (sibling article; different optimizer maturity)</li>
<li>Official docs: <a href="https://www.postgresql.org/docs/current/sql-explain.html">EXPLAIN</a> / <a href="https://github.com/ossc-db/pg_hint_plan">pg_hint_plan</a> / <a href="https://www.postgresql.org/docs/current/auto-explain.html">auto_explain</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL MVCC + Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on the &lt;em>MVCC + lock model&lt;/em>: how PG's concurrency control differs from MySQL's lock-based approach.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-mvcc每次更新都-新增-tuple不改舊版">PG MVCC: every update &lt;em>adds a new tuple&lt;/em> and never overwrites the old version&lt;/h2>
&lt;p>The core of PG's concurrency control is &lt;em>Multi-Version Concurrency Control&lt;/em>: an UPDATE does not modify the original row; it &lt;em>adds&lt;/em> a new tuple version, and the old version stays in the table until VACUUM cleans it up:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">old row: (id=1, status=&amp;#39;pending&amp;#39;, xmin=100, xmax=NULL)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓ UPDATE status=&amp;#39;shipped&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">new tuple: (id=1, status=&amp;#39;shipped&amp;#39;, xmin=200, xmax=NULL)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">old tuple is marked xmax=200 (not deleted; other transactions can still read the old version)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>xmin&lt;/code> / &lt;code>xmax&lt;/code> are the &lt;em>creator transaction id&lt;/em> / &lt;em>destroyer transaction id&lt;/em>. Every SELECT uses a &lt;em>snapshot&lt;/em> (which includes the list of transactions active at that moment) to decide which tuples are visible to it:&lt;/p>
&lt;ul>
&lt;li>Simplified rule: your transaction id &amp;gt; tuple.xmin and (tuple.xmax is unset, or your transaction id &amp;lt; tuple.xmax) → visible; the real check also requires the xmin transaction to have committed and consults the snapshot's active-transaction list&lt;/li>
&lt;li>Otherwise → not visible (a past or future version)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Result&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;em>Readers do not block writers&lt;/em>: SELECT reads from its snapshot and does not block UPDATE&lt;/li>
&lt;li>&lt;em>Writers do not block readers&lt;/em>: UPDATE writes a new tuple and does not disturb the snapshot of a running SELECT&lt;/li>
&lt;li>&lt;em>Writers only block writers of the same row&lt;/em>: two UPDATEs conflict only when they touch the same row&lt;/li>
&lt;/ul>
&lt;p>Compared with MySQL InnoDB's &lt;em>lock-based&lt;/em> approach (&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">Lock Contention&lt;/a>):&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on the <em>MVCC + lock model</em>: how PG's concurrency control differs from MySQL's lock-based approach.</p></blockquote>
<hr>
<h2 id="pg-mvcc每次更新都-新增-tuple不改舊版">PG MVCC: every update <em>adds a new tuple</em> and never overwrites the old version</h2>
<p>The core of PG's concurrency control is <em>Multi-Version Concurrency Control</em>: an UPDATE does not modify the original row; it <em>adds</em> a new tuple version, and the old version stays in the table until VACUUM cleans it up:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">old row:    (id=1, status=&#39;pending&#39;, xmin=100, xmax=NULL)
</span></span><span class="line"><span class="ln">2</span><span class="cl">                 ↓ UPDATE status=&#39;shipped&#39;
</span></span><span class="line"><span class="ln">3</span><span class="cl">new tuple:  (id=1, status=&#39;shipped&#39;, xmin=200, xmax=NULL)
</span></span><span class="line"><span class="ln">4</span><span class="cl">old tuple is marked xmax=200 (not deleted; other transactions can still read the old version)</span></span></code></pre></div><p><code>xmin</code> / <code>xmax</code> are the <em>creator transaction id</em> / <em>destroyer transaction id</em>. Every SELECT uses a <em>snapshot</em> (which includes the list of transactions active at that moment) to decide which tuples are visible to it:</p>
<ul>
<li>Simplified rule: your transaction id &gt; tuple.xmin and (tuple.xmax is unset, or your transaction id &lt; tuple.xmax) → visible; the real check also requires the xmin transaction to have committed and consults the snapshot's active-transaction list</li>
<li>Otherwise → not visible (a past or future version)</li>
</ul>
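<p>The version chain can be observed through the hidden system columns. The following is a small self-contained sketch using a throwaway temp table (<code>mvcc_demo</code> is a made-up name), run in autocommit mode so that each statement is its own transaction.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">CREATE TEMP TABLE mvcc_demo (id int PRIMARY KEY, status text);
INSERT INTO mvcc_demo VALUES (1, 'pending');

-- ctid is the tuple's physical location; xmin is the transaction that created it.
SELECT ctid, xmin, xmax, status FROM mvcc_demo WHERE id = 1;

UPDATE mvcc_demo SET status = 'shipped' WHERE id = 1;

-- The visible version now sits at a different ctid with a newer xmin:
-- the UPDATE wrote a new tuple instead of overwriting the old one.
SELECT ctid, xmin, xmax, status FROM mvcc_demo WHERE id = 1;
</code></pre></div>
<p>In query output a live tuple shows <code>xmax</code> as 0, which corresponds to the <code>xmax=NULL</code> notation used in the diagram above.</p>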
<p><strong>結果</strong>：</p>
<ul>
<li><em>Readers 不 lock writers</em>：SELECT 看 snapshot、不 block UPDATE</li>
<li><em>Writers 不 lock readers</em>：UPDATE 寫新 tuple、不影響正在跑的 SELECT snapshot</li>
<li><em>Writers 只 lock 同一 row 的 writers</em>：兩個 UPDATE 同 row 才 conflict</li>
</ul>
<p>跟 MySQL InnoDB <em>lock-based</em>（<a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">Lock Contention</a>）對比：</p>
<ul>
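<li>MySQL: under the default REPEATABLE READ, locking reads such as SELECT ... FOR UPDATE take gap / next-key locks to prevent phantoms, which is a common source of deadlocks</li>
<li>PG: there are no gap locks. Under REPEATABLE READ the transaction snapshot hides rows inserted later, so plain reads see no phantoms; under the default READ COMMITTED every statement takes a fresh snapshot, so a repeated query can return newly committed rows. Deadlocks are typically rarer</li>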
</ul>
<p>The price PG pays is <em>VACUUM management</em>: dead tuples that are not cleaned up take disk space and slow queries down. See <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>.</p>
<h2 id="pg-4-種-lock">Four kinds of lock in PG</h2>
<p>PG still has locks, but they appear in different situations than in MySQL:</p>
<h3 id="1-row-level-lock--主要由-update--delete--select-for-update-取">1. Row-level lock: taken mainly by UPDATE / DELETE / SELECT FOR UPDATE</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">UPDATE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Takes a row-level FOR UPDATE lock on the id=100 row (plus a table-level ROW SHARE lock)
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1">-- Other transactions that try to UPDATE / DELETE / SELECT ... FOR UPDATE id=100 must wait</span></span></span></code></pre></div><p>Row-level locks <em>do not block readers</em>: a plain SELECT reads from its snapshot and never checks row locks.</p>
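<p>The behavior can be observed with two psql sessions. This is a sketch that assumes the <code>orders</code> table has a <code>status</code> column; <code>lock_timeout</code> is a standard PG setting that bounds how long a statement waits for a lock:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Session A: lock the row and keep the transaction open
BEGIN;
SELECT * FROM orders WHERE id = 100 FOR UPDATE;

-- Session B: a plain read is not blocked by the row lock
SELECT * FROM orders WHERE id = 100;

-- Session B: a write to the same row has to wait for Session A
BEGIN;
SET LOCAL lock_timeout = &#39;2s&#39;;
-- Fails with a lock timeout error if Session A still holds the row after 2 seconds
UPDATE orders SET status = &#39;shipped&#39; WHERE id = 100;
ROLLBACK;

-- Session A: end the transaction, which releases the row lock
COMMIT;
</code></pre></div>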
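<h3 id="2-table-level-lock--ddl-跟少數-select-for-場景">2. Table-level lock: mostly a DDL concern, plus a few SELECT FOR cases</h3>
<p>Every statement takes some table-level lock. PG has 8 table lock modes, listed here from weakest to strongest:</p>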
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Typically taken by</th>
          <th>Conflicts with</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ACCESS SHARE</td>
          <td>SELECT</td>
          <td>ACCESS EXCLUSIVE only</td>
      </tr>
      <tr>
          <td>ROW SHARE</td>
          <td>SELECT FOR UPDATE / FOR SHARE</td>
          <td>EXCLUSIVE, ACCESS EXCLUSIVE</td>
      </tr>
      <tr>
          <td>ROW EXCLUSIVE</td>
          <td>UPDATE / DELETE / INSERT</td>
          <td>SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE</td>
      </tr>
      <tr>
          <td>SHARE UPDATE EXCLUSIVE</td>
          <td>VACUUM (without FULL) / ANALYZE / CREATE INDEX CONCURRENTLY</td>
          <td>Itself and every stronger mode</td>
      </tr>
      <tr>
          <td>SHARE</td>
          <td>CREATE INDEX (non-concurrent)</td>
          <td>ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE (not itself)</td>
      </tr>
      <tr>
          <td>SHARE ROW EXCLUSIVE</td>
          <td>CREATE TRIGGER / some ALTER TABLE forms</td>
          <td>ROW EXCLUSIVE and every stronger mode, including itself</td>
      </tr>
      <tr>
          <td>EXCLUSIVE</td>
          <td>REFRESH MATERIALIZED VIEW CONCURRENTLY</td>
          <td>Every mode except ACCESS SHARE</td>
      </tr>
      <tr>
          <td>ACCESS EXCLUSIVE</td>
          <td>DROP TABLE / TRUNCATE / VACUUM FULL / most ALTER TABLE forms</td>
          <td>Every mode</td>
      </tr>
  </tbody>
</table>
<p>Most DDL (DROP and many ALTER TABLE forms) takes ACCESS EXCLUSIVE, which conflicts with every other mode, including plain SELECT. In production an ALTER must either finish quickly or go through <a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">Online Schema Change</a>.</p>
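<p>One common way to keep a DDL statement from stalling traffic is to bound its lock wait: a pending ACCESS EXCLUSIVE request queues behind open transactions, and later queries on the table queue behind it. The sketch below assumes the <code>orders</code> table from the earlier example; the <code>note</code> column is hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;
-- Give up after 3 seconds instead of queueing behind a long-running transaction
SET LOCAL lock_timeout = &#39;3s&#39;;
ALTER TABLE orders ADD COLUMN note text;  -- hypothetical column
COMMIT;

-- Who currently holds, or waits for, a table-level lock on orders
SELECT l.pid, l.mode, l.granted, a.state, left(a.query, 80) AS query
FROM pg_locks AS l
JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.locktype = &#39;relation&#39;
  AND l.relation = &#39;orders&#39;::regclass;
</code></pre></div>
<p>If the ALTER times out, the transaction fails cleanly and can be retried later, which is usually preferable to blocking every reader of the table.</p>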
<h3 id="3-advisory-lock--application-自己控">3. Advisory lock: controlled by the application</h3>
<p>PG offers <em>advisory locks</em> for application use. They are not tied to any row or table structure:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Session 1
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_advisory_lock</span><span class="p">(</span><span class="mi">12345</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- run the critical section
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_advisory_unlock</span><span class="p">(</span><span class="mi">12345</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Session 2
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_try_advisory_lock</span><span class="p">(</span><span class="mi">12345</span><span class="p">);</span><span class="w">  </span><span class="c1">-- non-blocking attempt; returns false while Session 1 still holds the lock</span></span></span></code></pre></div><p>Uses:</p>
<ul>
<li>Application-level mutual exclusion (for example, only one instance of a cron job runs at a time)</li>
<li>Synchronization across connections (a PG-managed mutex)</li>
<li>A lightweight coordinator for distributed workers, as long as they all connect to the same PG primary</li>
</ul>
<p>Unlike a row lock, an advisory lock is not attached to any row; the application defines what each lock ID means.</p>
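<p>The cron-job case can be sketched with the transaction-scoped, non-blocking variant. The key <code>42</code> is an arbitrary ID that the application would reserve for this job (an assumption made for illustration):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;
-- 42 is a hypothetical lock ID reserved by the application for this job
SELECT pg_try_advisory_xact_lock(42) AS got_lock;
-- got_lock = false: another worker is already running the job, so ROLLBACK and exit
-- got_lock = true:  do the job&#39;s work inside this transaction
COMMIT;  -- the advisory lock is released automatically at commit or rollback
</code></pre></div>
<p>Because the lock lives only as long as the transaction, a crashed worker or a forgotten unlock cannot leave it held.</p>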
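<h3 id="4-predicate-lock--serializable-isolation-才用">4. Predicate lock: used only under SERIALIZABLE isolation</h3>
<p>PG implements SERIALIZABLE with <em>Serializable Snapshot Isolation (SSI)</em>. Instead of taking blocking row locks, it records what each transaction read, approximating the query <em>predicate</em> with non-blocking SIREAD locks at tuple, page, or relation granularity:</p>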





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- SET TRANSACTION only takes effect inside a transaction block, before its first query
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="k">TRANSACTION</span><span class="w"> </span><span class="k">ISOLATION</span><span class="w"> </span><span class="k">LEVEL</span><span class="w"> </span><span class="k">SERIALIZABLE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- A predicate (SIREAD) lock records what this query read
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;pending&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Meanwhile another transaction INSERTs a pending order
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1">-- If PG detects a serialization anomaly, one of the transactions is rolled back (at COMMIT or earlier)
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">COMMIT</span><span class="p">;</span></span></span></code></pre></div><p>How this differs from MySQL gap locks:</p>
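<ul>
<li>MySQL gap lock: <em>pre-lock</em>. It blocks conflicting inserts into the range for the rest of the transaction, which prevents phantoms up front</li>
<li>PG predicate lock: <em>detect after the fact</em>. Nothing is blocked; when a dangerous pattern of read/write dependencies shows up (at COMMIT or at an earlier statement), one transaction is rolled back with a serialization failure</li>
</ul>
<p>PG SSI has <em>little impact on write throughput</em> (no pre-locking), but the <em>chance of a transaction being rolled back is higher</em>, so the application must retry.</p>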
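<p>A two-session sketch of the kind of interleaving SSI is designed to catch, assuming <code>orders</code> can be inserted into with just <code>id</code> and <code>status</code> (hypothetical values). Each transaction reads the set of pending orders and then adds to it, so neither serial order explains what both of them saw:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Session A
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM orders WHERE status = &#39;pending&#39;;

-- Session B
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM orders WHERE status = &#39;pending&#39;;
INSERT INTO orders (id, status) VALUES (201, &#39;pending&#39;);
COMMIT;

-- Session A: writes into the set that Session B read, and vice versa
INSERT INTO orders (id, status) VALUES (202, &#39;pending&#39;);
-- The INSERT or the COMMIT may fail with SQLSTATE 40001 (serialization_failure)
COMMIT;
</code></pre></div>
<p>Neither session waits on a lock at any point. The cost shows up only as a possible rollback of one transaction, which the application then reruns from the start.</p>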
<h2 id="pg-預設-isolationread-committed">PG default isolation: READ COMMITTED</h2>
<p>PG defaults to READ COMMITTED, unlike MySQL InnoDB, which defaults to REPEATABLE READ:</p>
<table>
  <thead>
      <tr>
          <th>Isolation</th>
          <th>PG behavior</th>
          <th>MySQL InnoDB counterpart</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>READ UNCOMMITTED</td>
          <td>Treated as READ COMMITTED (PG never allows dirty reads)</td>
          <td>Dirty reads are actually allowed</td>
      </tr>
      <tr>
          <td>READ COMMITTED</td>
          <td>Each statement sees a snapshot of what was committed when it started (PG default)</td>
          <td>Same behavior for plain reads</td>
      </tr>
      <tr>
          <td>REPEATABLE READ</td>
          <td>One fixed snapshot for the whole transaction (pure MVCC)</td>
          <td>MVCC snapshot + gap locks against phantoms (both engines use MVCC; the difference is the phantom mechanism: PG relies on snapshot visibility, InnoDB adds gap locks that pre-lock the range)</td>
      </tr>
      <tr>
          <td>SERIALIZABLE</td>
          <td>SSI; anomalies are detected at or before commit</td>
          <td>Lock-based: reads take shared locks, plus gap locks</td>
      </tr>
  </tbody>
</table>
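<p><strong>What this means for application code</strong>:</p>
<ul>
<li>PG REPEATABLE READ has little impact on <em>write throughput</em> (no pre-locking, only retries)</li>
<li>No gap locks, so INSERTs are not blocked by range locks</li>
<li>Deadlocks tend to be less frequent than in MySQL, where gap locks are a major source</li>
</ul>
<p>In practice, PG production systems can stay on the default READ COMMITTED and reserve SERIALIZABLE for workloads with <em>strict consistency requirements</em> (finance, orders) that can accept retries.</p>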
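<p>A short sketch of how the isolation level can be inspected and chosen per transaction or per role. <code>billing_app</code> is a hypothetical role name; the settings themselves are standard PG parameters:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Server or session default (read committed unless it has been reconfigured)
SHOW default_transaction_isolation;

-- Choose a stricter level for one transaction only
BEGIN ISOLATION LEVEL REPEATABLE READ;
SHOW transaction_isolation;
COMMIT;

-- Make SERIALIZABLE the default for one strict-consistency service (hypothetical role)
ALTER ROLE billing_app SET default_transaction_isolation = &#39;serializable&#39;;
</code></pre></div>
<p>Scoping the stricter level to a role or to individual transactions keeps the retry burden limited to the code paths that actually need it.</p>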
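<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-idle-transaction-卡-vacuum--bloat-暴增">1. Idle transactions stall VACUUM: bloat explodes</h3>
<p>PG MVCC depends on <em>VACUUM removing dead tuples</em>, and VACUUM can only remove dead tuples that <em>no open snapshot can still see</em>. An <em>idle in transaction</em> session that still holds a snapshot or an assigned transaction ID (typically a pooled connection whose application forgot to end the transaction) pins the xmin horizon. VACUUM cannot remove <em>any tuple that became dead after that point</em>, so bloat accumulates.</p>
<p>Fixes:</p>
<ul>
<li>Monitor <code>pg_stat_activity</code> for how long sessions stay in <code>state = 'idle in transaction'</code></li>
<li>Set <code>idle_in_transaction_session_timeout = '5min'</code>: PG terminates the session once the timeout is exceeded</li>
<li>Configure the application and its connection pool so that <em>no transaction is left open</em>: always commit or roll back before returning a connection. pgBouncer does not commit on the application's behalf, but its <code>idle_transaction_timeout</code> setting can disconnect clients that sit idle in a transaction</li>
</ul>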
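<p>The monitoring and timeout items above could look like the following sketch. <code>app_user</code> is a hypothetical role; the columns are standard <code>pg_stat_activity</code> columns:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Sessions sitting idle inside an open transaction, oldest transaction first
SELECT pid,
       usename,
       application_name,
       now() - xact_start   AS xact_age,
       now() - state_change AS idle_for,
       left(query, 80)      AS last_query
FROM pg_stat_activity
WHERE state = &#39;idle in transaction&#39;
ORDER BY xact_start;

-- Apply the timeout to one application role only (hypothetical role name)
ALTER ROLE app_user SET idle_in_transaction_session_timeout = &#39;5min&#39;;
</code></pre></div>
<p>Setting the timeout per role rather than globally leaves room for maintenance sessions that legitimately keep a transaction open for longer.</p>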
<h3 id="2-select-for-update-跨-transaction--application-retry-麻煩">2. SELECT FOR UPDATE held across user interaction: painful waits and retries</h3>
<p>As in InnoDB, a PG SELECT FOR UPDATE does not <em>block plain SELECTs</em> (reads continue), but it does <em>block other UPDATE / DELETE / FOR UPDATE</em> on the same rows until the locking transaction ends.</p>
<p>If the application <em>holds the lock across user interaction</em> (take the lock, return to the UI, wait for the user, then commit), it runs straight into idle-in-transaction sessions and other transactions piling up in lock waits.</p>
<p>Fixes:</p>
<ul>
<li><em>Keep transactions short</em>: take FOR UPDATE, process immediately, commit; never span a user interaction</li>
<li>For flows that do span user interaction, use an <em>advisory lock</em> or an application-level state machine instead of relying on a row lock</li>
</ul>
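<p>One way to build the application-level alternative is optimistic concurrency with a version column. This is a swapped-in technique, not something the article prescribes; the <code>version</code> column and the value <code>7</code> are hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Step 1 (short transaction): read the row and remember its version, e.g. 7
SELECT id, status, version FROM orders WHERE id = 100;

-- ... the user edits in the UI; no transaction or row lock is held meanwhile ...

-- Step 2 (short transaction): write only if nobody changed the row in between
UPDATE orders
SET status = &#39;shipped&#39;,
    version = version + 1
WHERE id = 100
  AND version = 7;
-- If 0 rows were updated, the row changed concurrently: reload it and ask the user again
</code></pre></div>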
<h3 id="3-advisory-lock-沒釋放--session-結束才自動釋放">3. Advisory lock not released: it is only freed when the session ends</h3>
<p>A <code>pg_advisory_lock()</code> without a matching <code>pg_advisory_unlock()</code> stays held until the <em>session ends</em>. A connection pool reuses the same connection, so the next borrower may inherit a lock that an earlier one left behind.</p>
<p>Fixes:</p>
<ul>
<li>Always pair <code>pg_advisory_lock</code> with <code>pg_advisory_unlock</code> in a <code>try/finally</code></li>
<li>Or replace the <em>session-level</em> lock with a transaction-scoped one: <code>pg_advisory_xact_lock()</code> is released automatically at commit / rollback</li>
<li>Monitor the advisory lock count in <code>pg_locks</code>; a count that keeps growing is a warning sign</li>
</ul>
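<p>A sketch of the monitoring item, plus a cleanup call that a pool could run before handing a connection to the next borrower. Both use standard catalog views and functions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Advisory locks currently held or awaited, with the owning session
SELECT l.pid, l.classid, l.objid, l.objsubid, l.granted,
       a.state, a.application_name
FROM pg_locks AS l
JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.locktype = &#39;advisory&#39;;

-- Release every session-level advisory lock held by the current session
SELECT pg_advisory_unlock_all();
</code></pre></div>
<p><code>DISCARD ALL</code> also releases session-level advisory locks, so a pool that issues it when recycling connections gets this cleanup as a side effect.</p>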
<h3 id="4-bloat-不只是-vacuum-沒跑是-active-transaction-阻擋-vacuum">4. Bloat is not only VACUUM not running; <em>open transactions can block VACUUM</em></h3>
<p>This extends pitfall #1: VACUUM has run, yet bloat keeps growing. The cause is not too little vacuuming; it is <em>open transactions preventing VACUUM from removing dead tuples</em>.</p>
<p>Fixes:</p>
<ul>
<li>Do not look only at <code>last_vacuum</code>; check <em>how little a VACUUM run actually reclaimed</em> (<code>VACUUM VERBOSE</code> reports dead tuples that cannot be removed yet)</li>
<li><code>SELECT * FROM pg_stat_progress_vacuum</code> shows VACUUM progress</li>
<li><code>SELECT * FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin</code> shows who is holding VACUUM back</li>
<li>See <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a></li>
</ul>
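<p>Sessions are not the only thing that can pin the xmin horizon. The sketch below extends the <code>backend_xmin</code> query with two other usual suspects, replication slots and prepared transactions, using standard catalog views:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Sessions holding the oldest snapshots
SELECT pid, state,
       age(backend_xmin)  AS xmin_age,
       now() - xact_start AS xact_age,
       left(query, 80)    AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Replication slots that hold back cleanup
SELECT slot_name, active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;

-- Prepared (two-phase) transactions that were never committed or rolled back
SELECT gid, prepared, age(transaction) AS xid_age
FROM pg_prepared_xacts;
</code></pre></div>
<p>A large age in any of the three results points at whatever is keeping dead tuples from being reclaimed.</p>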
<h3 id="5-serializable-下-transaction-rollback--application-必須-retry">5. Transaction rollback under SERIALIZABLE: the application must retry</h3>
<p>After <code>SET TRANSACTION ISOLATION LEVEL SERIALIZABLE</code>, PG SSI <em>rolls the transaction back</em> when it detects an anomaly. The application sees a <code>serialization failure</code> and must retry.</p>
<p>For an application that <em>does not know it has to retry</em>, SERIALIZABLE turns into a production bug.</p>
<p>Fixes:</p>
<ul>
<li>Add <em>retry middleware</em> in application code: catch <code>SQLSTATE 40001 (serialization_failure)</code>, roll back, and rerun the whole transaction with exponential backoff</li>
<li>Not every transaction needs SERIALIZABLE; set it only where <em>strict consistency</em> is required</li>
<li>Highly concurrent SERIALIZABLE workloads are prone to rollback storms; consider splitting transactions to shorten them</li>
</ul>
<h2 id="觀測-metric">Metrics to observe</h2>
<p>Production monitoring:</p>
<ul>
<li><code>pg_stat_activity</code>: active sessions / idle in transaction / wait_event</li>
<li><code>pg_locks</code>: the current lock list; join it with <code>pg_stat_activity</code> to see who blocks whom</li>
<li><code>pg_stat_database.deadlocks</code>: deadlock count (lower in PG, but still worth monitoring)</li>
<li><code>pg_stat_user_tables.n_dead_tup</code> / <code>n_live_tup</code>: dead tuple ratio, an indicator of bloat</li>
<li><code>pg_stat_progress_vacuum</code>: VACUUM progress</li>
</ul>
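<p>Two of these metrics as ready-to-adapt queries. The first uses <code>pg_blocking_pids()</code>, a built-in function, as a shortcut for the manual <code>pg_locks</code> join; the 20-row limit is an arbitrary choice:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Who is blocked, and by whom
SELECT blocked.pid              AS blocked_pid,
       left(blocked.query, 80)  AS blocked_query,
       blocking.pid             AS blocking_pid,
       blocking.state           AS blocking_state,
       left(blocking.query, 80) AS blocking_query
FROM pg_stat_activity AS blocked
CROSS JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(blocking_pid)
JOIN pg_stat_activity AS blocking ON blocking.pid = b.blocking_pid;

-- Tables with the most dead tuples and their dead-tuple ratio
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0), 3) AS dead_ratio,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
</code></pre></div>
<p>A blocking session whose state is <code>idle in transaction</code> is usually the same problem as pitfall #1 seen from the lock side.</p>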
<h2 id="跟-mysql-lock-model-對比">Comparison with the MySQL lock model</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PG MVCC</th>
          <th>MySQL InnoDB Lock</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Primary mechanism</td>
          <td>MVCC + snapshot</td>
          <td>Lock-based + MVCC mixed</td>
      </tr>
      <tr>
          <td>Readers vs Writers</td>
          <td>Do not block each other</td>
          <td>Plain reads do not block; under the default RR, gap locks affect locking reads and inserts</td>
      </tr>
      <tr>
          <td>Deadlock likelihood</td>
          <td>Low (no gap locks)</td>
          <td>Medium to high (gap locks are a major source)</td>
      </tr>
      <tr>
          <td>Phantom protection</td>
          <td>Snapshot under REPEATABLE READ + SSI predicate locks under SERIALIZABLE</td>
          <td>Gap locks taken up front</td>
      </tr>
      <tr>
          <td>Default isolation</td>
          <td>READ COMMITTED</td>
          <td>REPEATABLE READ</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>Dead tuples + VACUUM management</td>
          <td>Lock contention management</td>
      </tr>
      <tr>
          <td>Application code</td>
          <td>SERIALIZABLE requires retry logic</td>
          <td>Mostly fine when transactions are well written; deadlock victims still need a retry</td>
      </tr>
  </tbody>
</table>
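<p>Both solve the same problem (concurrency control) with different strategies. PG <em>trades space for time</em>: it keeps multiple tuple versions in the table itself, so reads and writes do not block each other, but VACUUM has to clean up afterwards. InnoDB leans toward <em>trading time for space</em>: old row versions live in undo logs that a background purge removes, so the table itself does not accumulate dead versions in the same way, while range protection comes from lock waits.</p>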
<p><strong>How to choose</strong>:</p>
<ul>
<li>Highly concurrent OLTP that is heavy on both reads and writes: PG MVCC is usually the better fit (reads do not block writes)</li>
<li>Simple OLTP where nobody wants to manage VACUUM: MySQL InnoDB is simpler to operate</li>
<li>Strong consistency with SERIALIZABLE: PG SSI has little impact on write throughput</li>
<li>An existing MySQL ecosystem / toolchain: MySQL lock knowledge carries over</li>
</ul>
<p>See <a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">MySQL Lock Contention</a> for the full MySQL lock mechanism.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-autovacuum-tuning">With Autovacuum Tuning</h3>
<p>MVCC depends on VACUUM, so autovacuum is the <em>maintenance cost</em> of PG concurrency control. VACUUM running slowly or not at all leads to bloat, and bloat leads to slow queries. See <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a>.</p>
<h3 id="跟-replication-topology">With Replication Topology</h3>
<p><code>hot_standby_feedback = on</code> keeps long-running queries on a standby from being canceled when VACUUM cleanup is replayed, but the <em>standby reports its oldest xmin back to the primary</em>. The primary then cannot remove dead tuples the standby still needs, which increases bloat. See <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>.</p>
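<p>A small sketch for checking whether a standby is the one holding cleanup back. Both statements use standard views and settings; which node each runs on is noted in the comments:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- On the primary: the xmin each standby reports through hot_standby_feedback
SELECT application_name, state,
       backend_xmin,
       age(backend_xmin) AS xmin_age
FROM pg_stat_replication;

-- On the standby: confirm whether feedback is enabled
SHOW hot_standby_feedback;
</code></pre></div>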
<h3 id="跟-connection-pool">With Connection Pool</h3>
<p>Under pgBouncer transaction pooling, anything that relies on session state across transactions is <em>broken</em>, because consecutive transactions from one client may run on different backend connections. Session-level advisory locks are the typical casualty. A SELECT FOR UPDATE only holds within its own transaction in any case, and <code>pg_advisory_xact_lock()</code> remains safe. See <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer Config</a>.</p>
<h3 id="跟-query-optimization">With Query Optimization</h3>
<p>While a long transaction runs a slow query it holds VACUUM back, so other transactions have to scan tables bloated with dead tuples, and stale statistics on those tables can push the planner toward worse row and cost estimates. See <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PG Autovacuum Tuning</a>（VACUUM 是 MVCC 必要成本）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（hot_standby_feedback 影響）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PG pgBouncer</a>（transaction pooling 跟 lock 互動）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">PG Online Schema Change</a>（DDL lock 議題）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（snapshot bloat 影響 planner）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">MySQL Lock Contention</a>（sibling、不同模型）</li>
<li><a href="/blog/backend/knowledge-cards/isolation-level/" data-link-title="Isolation Level" data-link-desc="說明資料庫交易隔離級別如何影響並發讀寫結果">Isolation Level 卡片</a></li>
<li>Official docs: <a href="https://www.postgresql.org/docs/current/mvcc.html">Concurrency Control (MVCC)</a> / <a href="https://www.postgresql.org/docs/current/transaction-iso.html">Transaction Isolation</a> / <a href="https://www.postgresql.org/docs/current/explicit-locking.html">Explicit Locking</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL JSONB Deep Dive：Binary Storage + GIN Index 為什麼是結構性優勢</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>JSONB deep dive&lt;/em> — binary storage + GIN index 的結構性優勢。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="json-vs-jsonb選-jsonb">JSON vs JSONB：選 JSONB&lt;/h2>
&lt;p>PG 9.2 加 &lt;code>JSON&lt;/code> type、9.4 加 &lt;code>JSONB&lt;/code>。99% 場景用 JSONB：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>JSON&lt;/th>
 &lt;th>JSONB&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>儲存&lt;/td>
 &lt;td>純文字（原樣保存）&lt;/td>
 &lt;td>Binary decomposed format&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Parse cost&lt;/td>
 &lt;td>每次 query parse&lt;/td>
 &lt;td>Insert 時 parse 一次&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index 支援&lt;/td>
 &lt;td>Limited（functional index）&lt;/td>
 &lt;td>GIN / functional / partial 都行&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operator 支援&lt;/td>
 &lt;td>Limited (-&amp;gt; / -&amp;gt;&amp;gt;)&lt;/td>
 &lt;td>Full (@&amp;gt; / ? / ?| / @? / @@, etc.)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Duplicate key&lt;/td>
 &lt;td>保留（原樣）&lt;/td>
 &lt;td>只保留最後一個（normalize）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Key order&lt;/td>
 &lt;td>保留&lt;/td>
 &lt;td>不保留&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Whitespace&lt;/td>
 &lt;td>保留&lt;/td>
 &lt;td>不保留&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>JSONB 唯一缺點是 &lt;em>binary 儲存（不保留 key order / whitespace / duplicate）&lt;/em>。99% application 不在意這些。&lt;/p>
&lt;p>從 &lt;em>application semantics&lt;/em> 視角、JSONB 是 PG JSON 的 &lt;em>the right type&lt;/em>、JSON 是 &lt;em>legacy / niche&lt;/em>。&lt;/p>
&lt;h2 id="jsonb-gin-index核心結構性優勢">JSONB GIN Index：核心結構性優勢&lt;/h2>
&lt;p>PG GIN（Generalized Inverted Index）可以對 JSONB 內所有 key/value pair 建 inverted index：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">JSONB&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- GIN index
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">idx_products_metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">USING&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">GIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>加完後、JSONB query 用 GIN index 加速：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- @&amp;gt; (contains) 用 GIN
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">@&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;{&amp;#34;category&amp;#34;: &amp;#34;shoes&amp;#34;}&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- ? (has key) 用 GIN
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">?&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;discount&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- ?| (has any of these keys) 用 GIN
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">products&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">metadata&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">?|&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">array&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;discount&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;promotion&amp;#39;&lt;/span>&lt;span class="p">];&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>跟 MongoDB index 對比、PG 不必 &lt;em>預先 define&lt;/em> JSON path index、&lt;code>USING GIN (metadata)&lt;/code> 對 &lt;em>整個 JSONB document 任意 path&lt;/em> 都有效。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>JSONB deep dive</em> — binary storage + GIN index 的結構性優勢。</p></blockquote>
<hr>
<h2 id="json-vs-jsonb選-jsonb">JSON vs JSONB：選 JSONB</h2>
<p>PG 9.2 加 <code>JSON</code> type、9.4 加 <code>JSONB</code>。99% 場景用 JSONB：</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>JSON</th>
          <th>JSONB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage</td>
          <td>Plain text (stored verbatim)</td>
          <td>Binary decomposed format</td>
      </tr>
      <tr>
          <td>Parse cost</td>
          <td>Parsed on every query</td>
          <td>Parsed once at insert</td>
      </tr>
      <tr>
          <td>Index support</td>
          <td>Limited (functional index only)</td>
          <td>GIN / functional / partial all work</td>
      </tr>
      <tr>
          <td>Operator support</td>
          <td>Limited (extraction such as -&gt; / -&gt;&gt;)</td>
          <td>Full (@&gt; / ? / ?| / ?&amp; / @? / @@ etc.)</td>
      </tr>
      <tr>
          <td>Duplicate key</td>
          <td>Preserved (verbatim)</td>
          <td>Only the last one is kept (normalized)</td>
      </tr>
      <tr>
          <td>Key order</td>
          <td>Preserved</td>
          <td>Not preserved</td>
      </tr>
      <tr>
          <td>Whitespace</td>
          <td>Preserved</td>
          <td>Not preserved</td>
      </tr>
  </tbody>
</table>
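<p>JSONB's only real drawback follows from its <em>binary storage: it does not preserve key order, whitespace, or duplicate keys</em>. Most applications do not depend on any of these.</p>
<p>From the viewpoint of <em>application semantics</em>, JSONB is <em>the right type</em> for JSON data in PG, and JSON is a <em>legacy / niche</em> choice.</p>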
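<p>A minimal sketch of the normalization difference described above, runnable in any psql session. The literal value is made up for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- json stores the input text verbatim, including the duplicate key and extra spaces
SELECT '{"b": 1,   "a": 2, "a": 3}'::json;

-- jsonb parses once on input: the duplicate key "a" collapses to its last value,
-- insignificant whitespace is dropped, and the original key order is not kept
SELECT '{"b": 1,   "a": 2, "a": 3}'::jsonb;
</code></pre></div>
<p>If an application needs to round-trip the exact original text (for example to verify a signature over the raw payload), that is the niche where the <code>JSON</code> type still fits.</p>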
<h2 id="jsonb-gin-index核心結構性優勢">JSONB GIN Index: the core structural advantage</h2>
<p>PG's GIN (Generalized Inverted Index) can build an inverted index over the keys and values inside a JSONB column:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">metadata</span><span class="w"> </span><span class="n">JSONB</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- GIN index
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_products_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span></span></span></code></pre></div><p>Once the index exists, JSONB queries can use it:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- @&gt; (contains) uses GIN
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- ? (has key) uses GIN
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="s1">&#39;discount&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- ?| (has any of these keys) uses GIN
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?|</span><span class="w"> </span><span class="nb">array</span><span class="p">[</span><span class="s1">&#39;discount&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;promotion&#39;</span><span class="p">];</span></span></span></code></pre></div><p>Compared with MongoDB indexes, PG does not require you to <em>define</em> an index per JSON path in advance: <code>USING GIN (metadata)</code> covers <em>any path in the whole JSONB document</em>.</p>
<h3 id="jsonb_ops-vs-jsonb_path_ops"><code>jsonb_ops</code> vs <code>jsonb_path_ops</code></h3>
<p>GIN offers two <em>operator classes</em> for JSONB:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th><code>jsonb_ops</code> (default)</th>
          <th><code>jsonb_path_ops</code></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Indexed content</td>
          <td>Each key and each value as separate index items</td>
          <td>Only a hash of each path → value pair</td>
      </tr>
      <tr>
          <td>Index size</td>
          <td>Larger</td>
          <td>Smaller (often around half, data-dependent)</td>
      </tr>
      <tr>
          <td>Supported operators</td>
          <td><code>@&gt; / @? / @@ / ? / ?| / ?&amp;</code></td>
          <td><code>@&gt; / @? / @@</code> only (no key existence)</td>
      </tr>
      <tr>
          <td>Best for</td>
          <td>Mixed query patterns</td>
          <td>Workloads that only use <code>@&gt;</code> containment or jsonpath match</td>
      </tr>
  </tbody>
</table>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- jsonb_ops (default)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_meta_default</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- jsonb_path_ops (smaller and faster, but no ? / ?| / ?&amp; support)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_meta_path</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="w"> </span><span class="n">jsonb_path_ops</span><span class="p">);</span></span></span></code></pre></div><p><strong>How to choose</strong>:</p>
<ul>
<li>Only <code>@&gt;</code> containment queries → <code>jsonb_path_ops</code> (smaller, faster index)</li>
<li><code>?</code> / <code>?|</code> / <code>?&amp;</code> key-existence queries → <code>jsonb_ops</code> (default)</li>
</ul>
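<p>The size difference is easy to measure on your own data rather than taken on faith. A small sketch, assuming both example indexes above (<code>idx_meta_default</code> and <code>idx_meta_path</code>) have been created on <code>products</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Compare the on-disk size of the two operator classes over the same column
SELECT
    pg_size_pretty(pg_relation_size('idx_meta_default')) AS jsonb_ops_size,
    pg_size_pretty(pg_relation_size('idx_meta_path'))    AS jsonb_path_ops_size;
</code></pre></div>
<p>The ratio depends on the shape of the documents, so it is worth running against a production-like sample before dropping either index.</p>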
<h2 id="operator--path-query">Operator + Path Query</h2>
<p>JSONB provides a rich set of operators plus jsonpath:</p>
<h3 id="operator">Operator</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Extract value（returns jsonb）
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Extract text（returns text）
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">-&gt;&gt;</span><span class="w"> </span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Path extract
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">#&gt;</span><span class="w"> </span><span class="s1">&#39;{variants, 0, price}&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">#&gt;&gt;</span><span class="w"> </span><span class="s1">&#39;{variants, 0, price}&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span><span class="w">  </span><span class="c1">-- returns text
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- Containment (uses the GIN index)
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;, &#34;active&#34;: true}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- Reverse containment
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="s1">&#39;{&#34;sub&#34;: &#34;value&#34;}&#39;</span><span class="w"> </span><span class="o">&lt;@</span><span class="w"> </span><span class="n">metadata</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="c1">-- Key existence
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="s1">&#39;discount&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?|</span><span class="w"> </span><span class="nb">array</span><span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;b&#39;</span><span class="p">];</span><span class="w">  </span><span class="c1">-- any of the keys
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">?&amp;</span><span class="w"> </span><span class="nb">array</span><span class="p">[</span><span class="s1">&#39;a&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;b&#39;</span><span class="p">];</span><span class="w">  </span><span class="c1">-- all of the keys</span></span></span></code></pre></div><h3 id="jsonpathpg-12">jsonpath (PG 12+)</h3>
<p>The SQL/JSON path language is part of the SQL standard and is supported from PG 12:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- jsonb_path_query: expands every match into its own row
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">jsonb_path_query</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*].price&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- jsonb_path_exists: returns boolean
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">jsonb_path_exists</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*] ? (@.price &gt; 100)&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- jsonb_path_query_array: returns all matches as one jsonb array
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">jsonb_path_query_array</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.tags[*]&#39;</span><span class="p">)</span><span class="w">
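</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span></span></span></code></pre></div><p>The jsonpath language itself is standardized, so path expressions are more portable across vendors than PG-specific operators. The <code>jsonb_path_*</code> functions that evaluate them are still PG-specific.</p>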
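<p>jsonpath also has operator forms, <code>@?</code> and <code>@@</code>. A short sketch, assuming the same <code>products</code> table and a hypothetical <code>tags</code> array inside <code>metadata</code>. According to the PostgreSQL documentation, these operator forms are what a JSONB GIN index can support, and mainly for equality-style path conditions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- @? : does the path return any item?
SELECT * FROM products
WHERE metadata @? '$.tags[*] ? (@ == "sale")';

-- @@ : does the path predicate evaluate to true?
SELECT * FROM products
WHERE metadata @@ '$.tags[*] == "sale"';
</code></pre></div>
<p>A range filter such as <code>@.price &gt; 100</code> is still valid jsonpath, but it should not be expected to benefit from the GIN index in the same way; check the plan with EXPLAIN.</p>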
<h2 id="partial-jsonb-index">Partial JSONB Index</h2>
<p>When queries only ever touch <em>a subset of rows</em>, build a partial index:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Index metadata only for active products (assumes products has a status column)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_active_products_metadata</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Query active products + JSONB filter
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="c1">-- → the planner can use the partial GIN index</span></span></span></code></pre></div><p>A partial index is smaller than a full GIN index in proportion to the rows it excludes, which lowers write cost and lets more of the index stay in cache.</p>
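<h2 id="5-個-production-踩雷">5 Production Pitfalls</h2>
<h3 id="1-大-jsonb--toast--性能崩潰">1. Large JSONB + TOAST: performance collapse</h3>
<p>Once a row grows past roughly 2 KB, PG moves large values such as JSONB into TOAST (out-of-line storage, usually compressed). Every query that reads the column from that row then has to <em>de-TOAST</em> it: fetch the chunks from the TOAST table, decompress, and reassemble them. For large JSONB (&gt; 50 KB) this can make queries 10-100x slower.</p>
<p>Fixes:</p>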
<ul>
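<li>Move <em>large attributes into their own columns</em> (for example <code>description TEXT</code> instead of a key in metadata)</li>
<li>Add an <em>expression index</em> on hot paths so filters on those keys do not have to read the whole JSONB</li>
<li>Use <code>pg_column_size(metadata)</code> to monitor the JSONB size distribution and find outliers</li>
<li>For truly large documents (&gt; 1 MB), consider a separate table or object storage</li>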
</ul>
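<p>The monitoring and hot-path indexing fixes can be sketched as follows. This assumes the <code>products</code> table from earlier; <code>category</code> is a hypothetical hot key and <code>idx_products_category</code> is an illustrative index name:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Largest metadata values first; pg_column_size reports the stored
-- (possibly compressed) size in bytes
SELECT id, pg_column_size(metadata) AS stored_bytes
FROM products
ORDER BY stored_bytes DESC
LIMIT 20;

-- Expression (B-tree) index on one hot key
CREATE INDEX idx_products_category ON products ((metadata -&gt;&gt; 'category'));

-- A filter written with the same expression can use that index
SELECT id FROM products WHERE metadata -&gt;&gt; 'category' = 'shoes';
</code></pre></div>
<p>Selecting only the columns you need (here <code>id</code>) matters as much as the index: <code>SELECT *</code> still has to de-TOAST <code>metadata</code> for every returned row.</p>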
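<h3 id="2-nested-update--整個-jsonb-重寫">2. Nested update: the whole JSONB is rewritten</h3>
<p>PG has no <em>in-place partial update</em> for JSONB. Changing a nested key means reading the whole value, modifying it, and writing the whole value back (the statement itself is still atomic):</p>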





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">UPDATE</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">jsonb_set</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;{discount}&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;0.2&#39;</span><span class="p">::</span><span class="n">jsonb</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Equivalent to: read metadata, change discount, write the whole metadata back</span></span></span></code></pre></div><p>For <em>large JSONB + frequent updates</em>, this caps write throughput. MongoDB's <code>$set</code> operator, by contrast, expresses a <em>partial document update</em>.</p>
<p>Fixes:</p>
<ul>
<li>Move <em>frequently updated nested keys</em> into their own columns</li>
<li>Batch updates in the application layer (accumulate changes, then apply them in one UPDATE)</li>
<li>Adopt the mental model that PG JSONB is <em>immutable-replace</em>, not <em>mutable in-place</em></li>
</ul>
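<p>A sketch of the first fix, promoting a hot key out of the document. The <code>discount</code> column is a hypothetical addition, and the backfill assumes every stored <code>discount</code> value is numeric:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical migration: promote the frequently updated key to a real column
ALTER TABLE products ADD COLUMN discount numeric;

-- Backfill from the existing JSONB values and remove the key from metadata
-- (both SET expressions read the row's old metadata value)
UPDATE products
SET discount = (metadata -&gt;&gt; 'discount')::numeric,
    metadata = metadata - 'discount'
WHERE metadata ? 'discount';

-- Later updates no longer rebuild the JSONB value with jsonb_set
UPDATE products SET discount = 0.2 WHERE id = 100;
</code></pre></div>
<p>On a large table the backfill is itself a heavy write, so running it in batches by <code>id</code> range is usually safer than one statement.</p>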
<h3 id="3-index-選錯-op-class---query-走-full-scan">3. Wrong index op class: <code>?</code> queries fall back to a full scan</h3>
<p>With a <code>jsonb_path_ops</code> index, a <code>?</code> key-existence query cannot use the index and falls back to a <em>full scan</em>. The application sees a slow query, and only EXPLAIN reveals that the index is unused.</p>
<p>Fixes:</p>
<ul>
<li>At design time, pin down the <em>application query pattern</em>: only <code>@&gt;</code>, or also <code>?</code></li>
<li>Mixed query patterns → <code>jsonb_ops</code> (default)</li>
<li>Pure containment → <code>jsonb_path_ops</code> (saves index size)</li>
<li>When unsure, start with the default and optimize after observing production</li>
</ul>
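<p>A quick way to catch this before it reaches production is to run both query shapes through EXPLAIN. This sketch assumes <code>products</code> currently has only a <code>jsonb_path_ops</code> index on <code>metadata</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run against a representative data set: on a tiny table the planner may
-- choose a sequential scan no matter which indexes exist

-- Key existence: a jsonb_path_ops index cannot serve this operator
EXPLAIN SELECT * FROM products WHERE metadata ? 'discount';

-- Containment: a jsonb_path_ops index is eligible for this operator
EXPLAIN SELECT * FROM products WHERE metadata @&gt; '{"category": "shoes"}';
</code></pre></div>
<p>In the plan output, a <code>Bitmap Index Scan</code> on the GIN index means the index is in play; a <code>Seq Scan</code> on <code>products</code> for the <code>?</code> query is the symptom described above.</p>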
<h3 id="4-jsonb_path_query-跟-jsonb_path_exists-行為差">4. <code>jsonb_path_query</code> vs <code>jsonb_path_exists</code>: different behavior</h3>
<ul>
<li><code>jsonb_path_query(metadata, '$.variants[*].price')</code>: set-returning, emits one row per match</li>
<li><code>jsonb_path_exists(metadata, '$.variants[*]')</code>: returns boolean (true if any match)</li>
</ul>
<p>An application that only wants to <em>filter rows</em> sometimes reaches for the former:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Wrong: returns one row per variant of every product, so the row count balloons
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">jsonb_path_query</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*].price&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">;</span></span></span></code></pre></div><p>Instead:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Right: only filters products
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">jsonb_path_exists</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*] ? (@.price &gt; 100)&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Fixes:</p>
<ul>
<li>Separate <em>exists, which filters rows,</em> from <em>query, which expands rows</em></li>
<li>To filter, use <code>jsonb_path_exists</code> or the <code>@&gt;</code> operator</li>
<li>To expand, use <code>jsonb_path_query</code> together with <code>LATERAL</code> or a subquery</li>
</ul>
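<p>When expansion really is the goal, putting the set-returning function in the <code>FROM</code> clause makes the row multiplication explicit. A minimal sketch against the same <code>products</code> table; the aliases <code>p</code> and <code>v</code> are illustrative:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- One output row per (product, variant price) pair
SELECT p.id, v.price
FROM products AS p
CROSS JOIN LATERAL jsonb_path_query(p.metadata, '$.variants[*].price') AS v(price);
</code></pre></div>
<p>Products with no variants drop out of a <code>CROSS JOIN LATERAL</code>; use <code>LEFT JOIN LATERAL ... ON true</code> if they should appear with a NULL price.</p>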
<h3 id="5-partial-index-條件不對齊-query">5. Partial index predicate does not match the query</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_active_metadata</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Application query that leaves status implicit
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- → partial index is not used (the planner cannot prove status=&#39;active&#39;)</span></span></span></code></pre></div><p>Fixes:</p>
<ul>
<li>
<p>Application queries <em>must include the partial index's WHERE predicate</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;active&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p>Confirm that the planner uses the partial index: in <code>EXPLAIN</code>, look for <code>Bitmap Index Scan on idx_active_metadata</code> (GIN indexes are only scanned through bitmap scans)</p>
</li>
<li>
<p>A partial index that does not match the query pattern is pure waste: it costs space and writes and is never read</p>
</li>
</ul>
<h2 id="何時用-jsonb-vs-拆-column">When to use JSONB vs separate columns</h2>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Choice</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Irregular schema (user-generated metadata / customization)</td>
          <td>JSONB</td>
      </tr>
      <tr>
          <td>Semi-structured + 5-10 frequently queried keys</td>
          <td>JSONB + GIN partial index</td>
      </tr>
      <tr>
          <td>Regular schema, stable set of columns</td>
          <td>Separate columns (faster, easier to index)</td>
      </tr>
      <tr>
          <td>Nested structure that often needs to be expanded in queries</td>
          <td>JSONB + jsonb_path_query</td>
      </tr>
      <tr>
          <td>Large documents (&gt; 1 KB) + frequent updates</td>
          <td>Separate columns or a separate table</td>
      </tr>
      <tr>
          <td>Fully schemaless workload</td>
          <td>Consider MongoDB instead of PG</td>
      </tr>
  </tbody>
</table>
<p>JSONB is PG's tool for <em>semi-structured data</em>, not a <em>MongoDB replacement</em>. For <em>mostly structured data plus a small amount of JSON</em>, JSONB fits very well; for <em>mostly JSON / complex nested aggregation</em> workloads, MongoDB remains the specialized choice.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<h3 id="跟-query-optimization">With Query Optimization</h3>
<p>Planner behavior for JSONB queries:</p>
<ul>
<li><code>@&gt;</code> containment can use GIN with either jsonb_ops or jsonb_path_ops</li>
<li><code>?</code> can use GIN only with jsonb_ops</li>
<li>The <code>jsonb_path_exists()</code> function form is not matched to a GIN index; use the <code>@?</code> / <code>@@</code> operator forms for GIN support, or an <em>expression index</em> on a fixed path</li>
<li>Check EXPLAIN to confirm the right index is used; see <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a></li>
</ul>
<h3 id="跟-sql-features-baseline">With SQL Features Baseline</h3>
<p>JSONB is one of the features where PG holds a structural lead; see <a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">SQL Features Baseline</a>.</p>
<h3 id="跟-mvcc--lock-model">With MVCC + Lock Model</h3>
<p>A JSONB UPDATE rewrites the whole column value and creates a new tuple version, the same MVCC behavior as any other row update. See <a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a>.</p>
<h3 id="跟-mysql-json_table">With MySQL JSON_TABLE</h3>
<p>MySQL 8.0's JSON_TABLE and PG's jsonpath are similar (both come from the SQL standard), but the <em>indexing mechanism</em> is completely different:</p>
<ul>
<li>PG: JSONB + a GIN index over the whole column</li>
<li>MySQL: JSON column + generated column + an index on the generated column</li>
</ul>
<p>PG's JSONB GIN index is a <em>structural advantage</em> that MySQL is unlikely to match in the near term. See <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>.</p>
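<p>The two index paths can be sketched side by side. This is a minimal illustration, assuming a <code>products</code> table with an <code>id</code> column and a JSON <code>metadata</code> column; the <code>brand</code> key and the index names are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- PostgreSQL: one GIN index serves containment queries on any key in the column
CREATE INDEX idx_products_metadata ON products USING GIN (metadata);
SELECT id FROM products WHERE metadata @&gt; '{"brand": "acme"}';

-- MySQL 8.0: extract one path into a generated column, then index that column
ALTER TABLE products
    ADD COLUMN brand VARCHAR(64)
        GENERATED ALWAYS AS (metadata-&gt;&gt;'$.brand') STORED,
    ADD INDEX idx_products_brand (brand);
SELECT id FROM products WHERE brand = 'acme';</code></pre></div>
<p>With this approach the MySQL index only helps queries on that one path, so each additional key needs its own generated column and index; the PG GIN index covers new keys without a schema change.</p>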
<h2 id="觀測-metric">Observability metrics</h2>
<ul>
<li><code>pg_column_size(metadata)</code>: distribution of per-row JSONB size</li>
<li><code>pg_relation_size('idx_name')</code>: size of the JSONB GIN index</li>
<li><code>pg_stat_user_indexes.idx_scan</code>: how many times the JSONB index has been used</li>
<li>TOAST table size: <code>SELECT pg_relation_size(reltoastrelid) FROM pg_class WHERE relname='products'</code></li>
</ul>
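<p>The metrics above can be collected with a few catalog queries. A sketch, assuming the <code>products</code> table and <code>metadata</code> column used in this article; the column aliases are illustrative.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Per-row JSONB size distribution (bytes as stored, i.e. after any compression)
SELECT min(pg_column_size(metadata)) AS min_bytes,
       percentile_disc(0.5)  WITHIN GROUP (ORDER BY pg_column_size(metadata)) AS p50_bytes,
       percentile_disc(0.99) WITHIN GROUP (ORDER BY pg_column_size(metadata)) AS p99_bytes,
       max(pg_column_size(metadata)) AS max_bytes
FROM products;

-- Size and scan count of every index on the table
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'products';

-- TOAST size including the TOAST index (reltoastrelid is 0 when there is no TOAST table)
SELECT pg_size_pretty(pg_total_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'products' AND reltoastrelid &lt;&gt; 0;</code></pre></div>
<p>An index whose <code>idx_scan</code> stays at 0 over a representative period is a candidate for removal, since a GIN index still adds cost to every write.</p>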
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">PG SQL Features Baseline</a> (JSONB is one of PG's structural advantages)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a> (making sure the JSONB index is used)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a> (JSONB updates and MVCC)</li>
<li><a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a> (JSON_TABLE vs JSONB comparison)</li>
<li><a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB vendor</a> (the alternative for pure document workloads)</li>
<li>Official docs: <a href="https://www.postgresql.org/docs/current/functions-json.html">PG JSON Functions</a> / <a href="https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING">JSONB Indexing</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Extension Ecosystem: the plugin ecosystem that turns PG into a vector DB / time-series DB / sharded cluster</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on the &lt;em>extension ecosystem&lt;/em>, the mechanism behind PG's structural product-line expansion.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="extension-不只是-plugin是產品線擴張">Extensions are not just plugins, they are product-line expansion&lt;/h2>
&lt;p>The PG extension mechanism lets &lt;em>third parties add new types / functions / operators / index access methods / planner hooks&lt;/em> that integrate deeply with PG core. Compared with the plugin models of other databases (MySQL plugin / MongoDB plugin), PG extensions expose a &lt;em>deeper SPI&lt;/em>.&lt;/p>
&lt;p>The result:&lt;/p>
&lt;ul>
&lt;li>pgvector → PG becomes a vector similarity search DB (replacing Pinecone / Weaviate)&lt;/li>
&lt;li>TimescaleDB → PG becomes a time-series DB (replacing InfluxDB)&lt;/li>
&lt;li>Citus → PG becomes a sharded cluster&lt;/li>
&lt;li>PostGIS → PG becomes a GIS DB&lt;/li>
&lt;li>pg_cron → PG becomes a scheduled job runner&lt;/li>
&lt;li>pgvectorscale → large-scale vector index&lt;/li>
&lt;/ul>
&lt;p>For orgs that are &lt;em>sensitive to vendor lock-in&lt;/em> or &lt;em>want a unified stack&lt;/em>, PG extensions make it possible to &lt;em>replace several specialized DBs with PG&lt;/em>.&lt;/p>
&lt;p>But &lt;em>a unified stack has a cost&lt;/em>: ops risk concentrates on the primary PG (one PG outage takes down vector / time-series / GIS / cron together), the extension-to-PG-version compatibility matrix adds one more upgrade concern, and the scale ceiling is usually lower than a specialized DB's (as rough orders of magnitude: pgvector 100M+ vectors vs Pinecone 10B+ / TimescaleDB 100K rows/s vs InfluxDB 500K+). Decision framework: &lt;em>small-to-medium scale + already on PG + no appetite for operating more systems&lt;/em> → extension; &lt;em>large scale + dedicated to that workload + a specialist team&lt;/em> → specialized DB.&lt;/p>
&lt;h2 id="extension-lifecycle">Extension Lifecycle&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- List available extensions
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_available_extensions&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Install at the OS level first (the matching package must be present)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1">-- e.g. apt install postgresql-contrib (pg_stat_statements ships with the contrib modules)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Enable in DB
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_stat_statements&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Verify
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_extension&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Upgrade the extension
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">ALTER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_stat_statements&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">UPDATE&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Remove
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">DROP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_stat_statements&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Each extension has:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on the <em>extension ecosystem</em>, the mechanism behind PG's structural product-line expansion.</p></blockquote>
<hr>
<h2 id="extension-不只是-plugin是產品線擴張">Extensions are not just plugins, they are product-line expansion</h2>
<p>The PG extension mechanism lets <em>third parties add new types / functions / operators / index access methods / planner hooks</em> that integrate deeply with PG core. Compared with the plugin models of other databases (MySQL plugin / MongoDB plugin), PG extensions expose a <em>deeper SPI</em>.</p>
<p>The result:</p>
<ul>
<li>pgvector → PG becomes a vector similarity search DB (replacing Pinecone / Weaviate)</li>
<li>TimescaleDB → PG becomes a time-series DB (replacing InfluxDB)</li>
<li>Citus → PG becomes a sharded cluster</li>
<li>PostGIS → PG becomes a GIS DB</li>
<li>pg_cron → PG becomes a scheduled job runner</li>
<li>pgvectorscale → large-scale vector index</li>
</ul>
<p>For orgs that are <em>sensitive to vendor lock-in</em> or <em>want a unified stack</em>, PG extensions make it possible to <em>replace several specialized DBs with PG</em>.</p>
<p>But <em>a unified stack has a cost</em>: ops risk concentrates on the primary PG (one PG outage takes down vector / time-series / GIS / cron together), the extension-to-PG-version compatibility matrix adds one more upgrade concern, and the scale ceiling is usually lower than a specialized DB's (as rough orders of magnitude: pgvector 100M+ vectors vs Pinecone 10B+ / TimescaleDB 100K rows/s vs InfluxDB 500K+). Decision framework: <em>small-to-medium scale + already on PG + no appetite for operating more systems</em> → extension; <em>large scale + dedicated to that workload + a specialist team</em> → specialized DB.</p>
<h2 id="extension-lifecycle">Extension Lifecycle</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- List available extensions
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_available_extensions</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Install at the OS level first (the matching package must be present)
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1">-- e.g. apt install postgresql-contrib (pg_stat_statements ships with the contrib modules)
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Enable in DB
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Verify
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- Upgrade the extension
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="w"> </span><span class="k">UPDATE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- Remove
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"></span><span class="k">DROP</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="p">;</span></span></span></code></pre></div><p>Each extension has:</p>
<ul>
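<li><em>Version</em>: each extension carries its own version number (see <code>pg_extension.extversion</code>), while its compiled binaries are built against a specific PG major version (14 / 15 / 16)</li>
<li><em>Schema</em>: installed into <code>public</code> or a dedicated schema</li>
<li><em>Dependencies</em>: some extensions require others (for example postgis_topology requires postgis, and earthdistance requires cube)</li>
<li><em>Trusted vs untrusted</em>: a trusted extension can be installed by a non-superuser who has CREATE privilege on the database (PG 13+)</li>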
</ul>
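<p>These four properties can be inspected from the catalogs. A minimal sketch using standard catalog views; the <code>ext</code> schema name is hypothetical, and <code>earthdistance</code> is used only because it has a declared dependency.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Version: what is available on disk vs what is installed in this database
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_stat_statements';

-- Schema: where each installed extension lives
SELECT e.extname, e.extversion, n.nspname AS schema_name
FROM pg_extension e
JOIN pg_namespace n ON n.oid = e.extnamespace;

-- Dependencies and trusted flag per available version (trusted column: PG 13+)
SELECT name, version, requires, trusted
FROM pg_available_extension_versions
WHERE name IN ('earthdistance', 'pg_trgm');

-- Install into a dedicated schema; CASCADE also installs required extensions
CREATE SCHEMA IF NOT EXISTS ext;
CREATE EXTENSION IF NOT EXISTS earthdistance SCHEMA ext CASCADE;</code></pre></div>
<p>Checking <code>requires</code> and <code>trusted</code> before a major-version upgrade shows which extensions need a superuser and which must be upgraded together.</p>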
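<h2 id="6-個-production-critical-extension">6 Production-Critical Extensions</h2>
<h3 id="1-pg_stat_statements--query-stats必裝">1. pg_stat_statements: Query stats (must-have)</h3>
<p>Every production PG cluster should have this installed. Changing <code>shared_preload_libraries</code> requires a server restart:</p>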





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">shared_preload_libraries</span> <span class="o">=</span> <span class="s">&#39;pg_stat_statements&#39;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">pg_stat_statements.max</span> <span class="o">=</span> <span class="s">5000</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">pg_stat_statements.track</span> <span class="o">=</span> <span class="s">all</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Top 10 query by total time
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">query</span><span class="p">,</span><span class="w"> </span><span class="n">calls</span><span class="p">,</span><span class="w"> </span><span class="n">total_exec_time</span><span class="p">,</span><span class="w"> </span><span class="n">mean_exec_time</span><span class="p">,</span><span class="w"> </span><span class="k">rows</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_statements</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">total_exec_time</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>This is the counterpart of MySQL's <code>events_statements_summary_by_digest</code>. The <code>total_exec_time</code> / <code>mean_exec_time</code> column names apply to PG 13+; earlier versions use <code>total_time</code> / <code>mean_time</code>. See <a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>.</p>
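<p>Total time is only one view of the same data. The sketch below shows two other common cuts plus a counter reset, assuming PG 13+ column names; the <code>calls &gt;= 100</code> threshold is an arbitrary illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Slowest queries by mean time, ignoring rarely-run statements
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE calls &gt;= 100
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Buffer cache hit ratio per query (low ratios point to I/O-heavy statements)
SELECT query,
       shared_blks_hit,
       shared_blks_read,
       round(100.0 * shared_blks_hit
             / nullif(shared_blks_hit + shared_blks_read, 0), 2) AS hit_pct
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;

-- Reset counters, e.g. after a deploy, to measure from a clean baseline
SELECT pg_stat_statements_reset();</code></pre></div>
<p>Sorting by total time finds the queries that cost the cluster most overall; sorting by mean time finds the ones that hurt individual requests most.</p>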
<h3 id="2-pg_partman--自動-partition-lifecycle">2. pg_partman: automated partition lifecycle</h3>
<p>PG declarative partitioning requires partitions to be <em>created and dropped manually</em>. pg_partman automates this:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_partman</span><span class="w"> </span><span class="k">SCHEMA</span><span class="w"> </span><span class="n">partman</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- Set up automatic monthly partitions for the events table
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">partman</span><span class="p">.</span><span class="n">create_parent</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">p_parent_table</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;public.events&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">p_control</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;created_at&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">p_type</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;range&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="n">p_interval</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;1 month&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">p_premake</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="mi">6</span><span class="w">  </span><span class="c1">-- pre-create 6 future partitions
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- Run maintenance (creates future partitions; drops old ones only if a retention policy is configured)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">partman</span><span class="p">.</span><span class="n">run_maintenance</span><span class="p">(</span><span class="n">p_analyze</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="k">false</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- Schedule it with pg_partman's background worker (pg_partman_bgw), pg_cron, or OS cron</span></span></span></code></pre></div><p>A must-have for <em>time-series partition</em> workloads. See <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>.</p>
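<p>Two pieces sit around the <code>create_parent</code> call: the parent table must already be declaratively partitioned before it, and retention plus scheduling are configured after it. A sketch under those assumptions; the <code>id</code> and <code>payload</code> columns, the 12-month retention, and the job name are hypothetical, and the last statement assumes pg_cron is installed in this database.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Before create_parent: the parent must be a declaratively partitioned table
CREATE TABLE public.events (
    id         BIGINT GENERATED ALWAYS AS IDENTITY,
    created_at TIMESTAMPTZ NOT NULL,
    payload    JSONB
) PARTITION BY RANGE (created_at);

-- After create_parent: without a retention value, maintenance never removes old partitions
UPDATE partman.part_config
SET retention = '12 months',
    retention_keep_table = false   -- false = drop the partition; true = only detach it
WHERE parent_table = 'public.events';

-- Hourly maintenance through pg_cron
SELECT cron.schedule(
    'partman-maintenance',
    '0 * * * *',
    $$CALL partman.run_maintenance_proc()$$
);</code></pre></div>
<p>Keeping <code>retention_keep_table = true</code> is the safer starting point, since detached partitions can still be archived or re-attached before they are dropped by hand.</p>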
<h3 id="3-pg_repack--online-table-rewrite">3. pg_repack: Online table rewrite</h3>
<p>pg_repack rebuilds a bloated table or index online and holds heavy locks only briefly, unlike <code>VACUUM FULL</code>. See <a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">Online Schema Change</a>.</p>
<h3 id="4-pgvector--vector-similarity-search">4. pgvector: Vector similarity search</h3>
<p>A must-have for LLM embedding / semantic search workloads:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">vector</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">content</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">embedding</span><span class="w"> </span><span class="n">VECTOR</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span><span class="w">  </span><span class="c1">-- OpenAI text-embedding-3-small 1536-dim
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- HNSW index (pgvector 0.5+)
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">HNSW</span><span class="w"> </span><span class="p">(</span><span class="n">embedding</span><span class="w"> </span><span class="n">vector_cosine_ops</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- Find the 5 most similar rows (&lt;=&gt; is cosine distance)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="p">::</span><span class="n">vector</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span></span></span></code></pre></div><p>For <em>small-to-medium RAG / semantic search</em> workloads, pgvector runs inside PG, so there is no need to cross over to a separate service such as Pinecone / Weaviate / Qdrant.</p>
<p>For <em>very large</em> vector workloads (&gt; 100 million vectors), consider pgvectorscale (a companion extension that adds a StreamingDiskANN index on top of pgvector) or a dedicated vector DB.</p>
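<p>Recall and latency for the HNSW index are mostly governed by three knobs. A sketch with explicit parameters, assuming the <code>documents</code> table above; the index name is hypothetical, the values shown for <code>m</code> and <code>ef_construction</code> are pgvector's documented defaults, and the scalar subquery stands in for a real query embedding supplied by the application.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Build-time parameters: larger values improve recall at the cost of build time and index size
-- (use this in place of the default-parameter index, not in addition to it)
CREATE INDEX documents_embedding_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time candidate list size (default 40); raise it when recall is too low
SET hnsw.ef_search = 100;

-- Return similarity alongside the rows: &lt;=&gt; is cosine distance, so similarity = 1 - distance
SELECT id,
       content,
       1 - (embedding &lt;=&gt; (SELECT embedding FROM documents WHERE id = 1)) AS cosine_similarity
FROM documents
WHERE id != 1
ORDER BY embedding &lt;=&gt; (SELECT embedding FROM documents WHERE id = 1)
LIMIT 5;</code></pre></div>
<p>The operator class on the index has to match the operator in <code>ORDER BY</code> (<code>vector_cosine_ops</code> with <code>&lt;=&gt;</code>); otherwise the planner falls back to a sequential scan.</p>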
<h3 id="5-timescaledb--time-series-擴展">5. TimescaleDB: time-series extension</h3>
<p>Turns PG into a time-series DB:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">timescaledb</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">metrics</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">time</span><span class="w"> </span><span class="n">TIMESTAMPTZ</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">device_id</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">value</span><span class="w"> </span><span class="n">DOUBLE</span><span class="w"> </span><span class="k">PRECISION</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- Convert to a hypertable (auto-partitioned by time)
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_hypertable</span><span class="p">(</span><span class="s1">&#39;metrics&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;time&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- Continuous aggregate (an incrementally refreshed materialized view; a refresh policy automates it)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">metrics_5min</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">continuous</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">time_bucket</span><span class="p">(</span><span class="s1">&#39;5 minutes&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">bucket</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">       </span><span class="n">device_id</span><span class="p">,</span><span class="w"> </span><span class="k">avg</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">metrics</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">bucket</span><span class="p">,</span><span class="w"> </span><span class="n">device_id</span><span class="p">;</span></span></span></code></pre></div><p>對 IoT / monitoring / financial tick data 場景、TimescaleDB 比純 PG 寫吞吐高 10x+。</p>
<h3 id="6-postgis--gis-extension">6. PostGIS — GIS extension</h3>
<p>The industry standard for geographic / spatial queries:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">postgis</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">stores</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="k">location</span><span class="w"> </span><span class="n">GEOGRAPHY</span><span class="p">(</span><span class="n">POINT</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">stores</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="k">location</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- 找 1 km 內的 store
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">stores</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">ST_DWithin</span><span class="p">(</span><span class="k">location</span><span class="p">,</span><span class="w"> </span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">05</span><span class="p">)::</span><span class="n">geography</span><span class="p">,</span><span class="w"> </span><span class="mi">1000</span><span class="p">);</span></span></span></code></pre></div><p>PostGIS 是 GIS workload 業界標準、其他 DB GIS 能力都對標 PostGIS。</p>
<h2 id="其他常用-extension">Other commonly used extensions</h2>
<p>Beyond the 6 production-critical extensions, the following are <em>commonly used in specific scenarios</em>. Most fall into four groups: scheduling and utilities (<code>pg_cron</code> / <code>pg_trgm</code> / <code>uuid-ossp</code>), type extensions (<code>hstore</code> / <code>citext</code> / <code>pgcrypto</code>), cross-DB integration (<code>postgres_fdw</code> / <code>mysql_fdw</code>), and observability / debugging tools (<code>pg_buffercache</code> / <code>pg_visibility</code> / <code>auto_explain</code>):</p>
<table>
  <thead>
      <tr>
          <th>Extension</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pg_cron</code></td>
          <td>Schedule SQL jobs (no external cron needed)</td>
      </tr>
      <tr>
          <td><code>pg_trgm</code></td>
          <td>Fuzzy string match / similarity</td>
      </tr>
      <tr>
          <td><code>uuid-ossp</code></td>
          <td>UUID generation</td>
      </tr>
      <tr>
          <td><code>hstore</code></td>
          <td>Key-value pair type</td>
      </tr>
      <tr>
          <td><code>citext</code></td>
          <td>Case-insensitive text type</td>
      </tr>
      <tr>
          <td><code>pgcrypto</code></td>
          <td>Encryption / hash functions</td>
      </tr>
      <tr>
          <td><code>postgres_fdw</code></td>
          <td>PG → PG foreign table</td>
      </tr>
      <tr>
          <td><code>mysql_fdw</code></td>
          <td>PG → MySQL foreign table</td>
      </tr>
      <tr>
          <td><code>pg_buffercache</code></td>
          <td>Inspect buffer pool contents</td>
      </tr>
      <tr>
          <td><code>pg_visibility</code></td>
          <td>Inspect the visibility map (for debugging bloat)</td>
      </tr>
      <tr>
          <td><code>auto_explain</code></td>
          <td>Automatically log the plans of slow queries</td>
      </tr>
      <tr>
          <td><code>wal2json</code></td>
          <td>Logical decoding output as JSON</td>
      </tr>
      <tr>
          <td><code>Citus</code></td>
          <td>Distributed PG</td>
      </tr>
      <tr>
          <td><code>pgvector</code></td>
          <td>Vector similarity</td>
      </tr>
      <tr>
          <td><code>pglogical</code></td>
          <td>Logical replication (more features than native)</td>
      </tr>
      <tr>
          <td><code>pg_squeeze</code></td>
          <td>Alternative to pg_repack</td>
      </tr>
  </tbody>
</table>
<p>Practical combinations: the observability trio (<code>pg_stat_statements</code> + <code>auto_explain</code> + <code>pg_buffercache</code>) is close to standard equipment in production. FDW is the escape hatch for cross-DB queries, but cross-DB query performance is poor, so it suits reporting rather than OLTP.</p>
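<p>To make the FDW escape hatch concrete, here is a minimal <code>postgres_fdw</code> sketch for a reporting-style query. The server name, host, database, credentials, the <code>orders</code> table, and its <code>created_at</code> column are all hypothetical placeholders.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Hypothetical remote server, database, and credentials.
CREATE SERVER reporting_src
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'orders-db.internal', port '5432', dbname 'orders');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER reporting_src
    OPTIONS (user 'report_reader', password 'change-me');

-- Expose one remote table locally under a dedicated schema.
CREATE SCHEMA IF NOT EXISTS remote_orders;
IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
    FROM SERVER reporting_src INTO remote_orders;

-- Inspect the "Remote SQL" line to see what is pushed down
-- before relying on the query.
EXPLAIN (VERBOSE)
SELECT count(*)
FROM remote_orders.orders
WHERE created_at &gt;= DATE '2026-01-01';
</code></pre></div>
<p>Checking the plan first is the cheap way to find out whether a filter or aggregate runs remotely or pulls rows across the network, which is usually where the reporting-only advice comes from.</p>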
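<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-extension-version-跟-pg-version-對齊">1. Keep extension versions aligned with the PG version</h3>
<p>After a PG cluster is upgraded from 14 to 15, every extension (pg_stat_statements / pg_partman / pgvector, etc.) must have a build for PG 15. If you upgrade early or depend on a niche extension, that build may not have been released yet.</p>
<p>Fixes:</p>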
<ul>
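<li>Before upgrading the PG cluster, <em>confirm that every extension has a release for the target PG version</em></li>
<li>After upgrading the PG cluster, <em>run <code>ALTER EXTENSION xxx UPDATE</code> right away</em></li>
<li>Record each extension's version-compatibility status in the upgrade runbook</li>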
</ul>
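<p>A small sketch of the inventory and update steps above, using only the standard catalogs. <code>pg_stat_statements</code> stands in for whichever extension is being updated.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- On the old cluster, in each database: record what is installed
-- and at which version (input for the upgrade runbook).
SELECT current_database() AS db, extname, extversion
FROM pg_extension
ORDER BY extname;

-- On the upgraded cluster: update one extension and confirm the result.
ALTER EXTENSION pg_stat_statements UPDATE;

SELECT extversion
FROM pg_extension
WHERE extname = 'pg_stat_statements';
</code></pre></div>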
<h3 id="2-managed-pg-限制-extension-列表">2. Managed PG restricts the extension list</h3>
<p>AWS RDS / Aurora PG / Cloud SQL / Azure DB for PostgreSQL each maintain their own <em>allowlist of supported extensions</em>:</p>
<ul>
<li>Extensions not on the allowlist cannot be installed</li>
<li>Some extensions are limited to specific PG versions</li>
<li>Untrusted extensions are usually not allowed</li>
</ul>
<p>Extensions that managed services <em>commonly do not support, or support only partially</em>:</p>
<ul>
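<li><code>pg_repack</code> (limited support on Aurora; supported on some RDS versions)</li>
<li><code>pglogical</code> (not supported on some clouds)</li>
<li><code>pg_cron</code> (availability varies by vendor; some clouds steer you to a managed scheduler instead)</li>
<li>Custom extensions (a self-written .so)</li>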
</ul>
<p>Fixes:</p>
<ul>
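<li>When evaluating a managed PG, first check the <em>vendor's list of supported extensions</em></li>
<li>Self-hosted vs managed raises a <em>cross-cloud portability</em> question: extensions are a source of lock-in</li>
<li>If the application depends heavily on a particular extension (such as PostGIS), confirm that the cloud supports it</li>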
</ul>
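<p>One way to double-check a vendor's published list is to query a trial instance directly. The sketch below compares a hypothetical list of required extensions against <code>pg_available_extensions</code>; the extension names in the <code>VALUES</code> list are examples, and the vendor documentation remains the authoritative source.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical list of extensions the application depends on.
-- Note: pgvector's extension name is "vector".
WITH required(name) AS (
    VALUES ('postgis'), ('vector'), ('pg_partman'), ('pg_cron')
)
SELECT r.name,
       a.default_version,
       a.name IS NOT NULL AS available
FROM required r
LEFT JOIN pg_available_extensions a ON a.name = r.name
ORDER BY r.name;
</code></pre></div>
<p>Rows with <code>available = false</code> are the ones to raise before committing to that platform.</p>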
<h3 id="3-extension-upgrade-order">3. Extension upgrade order</h3>
<p>After <code>pg_upgrade</code> moves PG to a new major version, the extensions need upgrading too. The order:</p>
<ol>
<li>Run <em>pg_upgrade</em> on the PG binaries + cluster</li>
<li>In every database, run <code>ALTER EXTENSION xxx UPDATE</code></li>
<li>Some extensions (such as PostGIS) need a <em>special upgrade procedure</em> (<code>SELECT postgis_extensions_upgrade()</code>)</li>
</ol>
<p>Fixes:</p>
<ul>
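<li>Rehearse the PG upgrade on a <em>staging cluster first</em> to confirm the extension upgrade flow</li>
<li>PostGIS / TimescaleDB / Citus have their own upgrade procedures; follow the vendor docs</li>
<li>After upgrading, run <code>\dx</code> to check each extension's version</li>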
</ul>
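<p>Step 2 of the order above can be scripted in <code>psql</code>. This is a sketch rather than a drop-in runbook step: it generates one <code>ALTER EXTENSION</code> statement per outdated extension and runs them with <code>\gexec</code>, while skipping the extensions the text says follow their own procedure.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run in psql, once per database of the upgraded cluster.
-- Each generated row is executed as its own statement by \gexec.
-- postgis / timescaledb / citus are excluded because they follow
-- their own documented upgrade procedures.
SELECT format('ALTER EXTENSION %I UPDATE;', name)
FROM pg_available_extensions
WHERE installed_version IS NOT NULL
  AND installed_version IS DISTINCT FROM default_version
  AND name NOT IN ('postgis', 'timescaledb', 'citus')
\gexec

-- Verify afterwards (the same information \dx shows).
SELECT extname, extversion
FROM pg_extension
ORDER BY extname;
</code></pre></div>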
<h3 id="4-shared_preload_libraries-衝突">4. <code>shared_preload_libraries</code> conflicts</h3>
<p>Some extensions (pg_stat_statements / auto_explain / TimescaleDB / Citus / pg_cron) are normally loaded through <code>shared_preload_libraries</code>, and changing that setting requires a <em>PG restart</em>.</p>
<p>Conflict scenarios:</p>
<ul>
<li>pg_partman + TimescaleDB both use background workers, so the worker limit runs out</li>
<li><code>max_worker_processes</code> defaults to 8; when that is not enough, some extensions fail to start their workers</li>
</ul>
<p>Fixes:</p>
<ul>
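<li>List every shared_preload extension and confirm the order (some have dependencies)</li>
<li>Raise the limits, for example <code>max_worker_processes = 16</code> / <code>max_parallel_workers = 8</code></li>
<li><code>shared_preload_libraries</code> and <code>max_worker_processes</code> take effect only after a PG restart; plan the change into a maintenance window</li>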
</ul>
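<p>A sketch of applying these fixes on a self-hosted instance (managed services typically expose the same settings through parameter groups or flags instead of <code>ALTER SYSTEM</code>). The library list is a hypothetical example, and the numbers are the ones from the list above, not sizing advice.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Current values, and whether a changed value is still waiting for a restart.
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('shared_preload_libraries',
               'max_worker_processes',
               'max_parallel_workers');

-- Hypothetical library list; each library is passed as its own list item.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements', 'auto_explain', 'pg_cron';
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 8;

-- After the restart, confirm what was actually loaded.
SHOW shared_preload_libraries;
</code></pre></div>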
<h3 id="5-extension-跟-logical-replication-互動">5. How extensions interact with logical replication</h3>
<p>Logical replication (pglogical / native) does not automatically replicate extension state (function / type definitions). If the subscriber lacks the corresponding extension, applying the replicated events fails.</p>
<p>Fixes:</p>
<ul>
<li>The subscriber must <em>first install</em> the extensions used by the publisher</li>
<li>Keep extension versions <em>aligned between publisher and subscriber</em></li>
<li>For extension-heavy schemas, consider <em>streaming replication</em> (physical) instead of logical</li>
</ul>
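<p>The ordering matters more than the commands, so here is a minimal sketch with native logical replication. The extension names, connection string, subscription name, and publication name are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- On the publisher: record extension versions to compare against.
SELECT extname, extversion FROM pg_extension ORDER BY extname;

-- On the subscriber, before creating the subscription.
-- Example extensions only; install whatever the replicated tables depend on.
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS citext;

-- Table definitions must already exist on the subscriber as well:
-- native logical replication does not copy DDL.

-- Hypothetical connection string and publication name.
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=publisher.internal dbname=app user=replicator'
    PUBLICATION app_pub;
</code></pre></div>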
<h2 id="cloud-vendor-對-extension-的支援">Cloud vendor support for extensions</h2>
<table>
  <thead>
      <tr>
          <th>Vendor</th>
          <th>Commonly supported extensions</th>
          <th>Limitations</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AWS RDS PostgreSQL</td>
          <td>pg_stat_statements / pg_partman / pgvector / pg_repack</td>
          <td>Some versions restricted / cannot install custom extensions</td>
      </tr>
      <tr>
          <td>AWS Aurora PostgreSQL</td>
          <td>Same as RDS, plus Aurora-specific extensions</td>
          <td>pg_repack limited to certain versions</td>
      </tr>
      <tr>
          <td>GCP Cloud SQL</td>
          <td>Broad support for standard extensions</td>
          <td>pg_cron / pgvector OK</td>
      </tr>
      <tr>
          <td>Azure DB for PostgreSQL</td>
          <td>Broad support + Azure integration</td>
          <td>Citus (the managed offering is Cosmos DB for PG)</td>
      </tr>
      <tr>
          <td>Self-hosted</td>
          <td>All</td>
          <td>You maintain everything yourself</td>
      </tr>
  </tbody>
</table>
<p>For <em>extension-heavy</em> applications, self-hosted PG is often still the necessary choice. Managed PG suits workloads that stick to <em>standard extensions</em>.</p>
<h2 id="何時用-pg-extension-取代專業-db">When to use a PG extension instead of a specialized DB</h2>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Extension or specialized DB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>&lt; 100M vector + RAG / semantic search</td>
          <td>pgvector (a single stack saves ops effort)</td>
      </tr>
      <tr>
          <td>Large-scale vector search (&gt; 10M vectors) with high QPS</td>
          <td>Specialized vector DB (Pinecone / Qdrant)</td>
      </tr>
      <tr>
          <td>Time-series &lt; 100 TB</td>
          <td>TimescaleDB</td>
      </tr>
      <tr>
          <td>Time-series &gt; 100 TB + high cardinality</td>
          <td>Specialized TS DB (InfluxDB / VictoriaMetrics)</td>
      </tr>
      <tr>
          <td>GIS</td>
          <td>PostGIS (industry standard)</td>
      </tr>
      <tr>
          <td>Sharded &lt; 10 TB + multi-tenant</td>
          <td>Citus</td>
      </tr>
      <tr>
          <td>Sharded &gt; 100 TB</td>
          <td>distributed SQL（CockroachDB / TiDB）</td>
      </tr>
      <tr>
          <td>Scheduled job</td>
          <td>pg_cron (simple) / Airflow (complex)</td>
      </tr>
  </tbody>
</table>
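<p>At small to medium scale, PG + extensions is an effective way to <em>simplify the stack</em>. Beyond that scale, a specialized DB is usually still the first choice.</p>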
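<p>For the simple end of the scheduled-job row, a <code>pg_cron</code> job is a few lines of SQL. This sketch assumes pg_cron 1.3 or later (named jobs), that it is listed in <code>shared_preload_libraries</code>, and that it is created in the database pg_cron is configured for; the <code>events</code> table, its <code>created_at</code> column, and the job name are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Hypothetical job: purge rows older than 90 days from a hypothetical
-- events table, every night at 03:00.
SELECT cron.schedule(
    'purge-old-events',
    '0 3 * * *',
    $$DELETE FROM events WHERE created_at &lt; now() - INTERVAL '90 days'$$
);

-- Inspect the registered jobs, then remove this one by name.
SELECT jobid, jobname, schedule, command FROM cron.job;
SELECT cron.unschedule('purge-old-events');
</code></pre></div>
<p>Once jobs need dependencies, retries, or cross-system steps, that is roughly the point where the table's Airflow column starts to apply.</p>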
<h2 id="跟其他模組整合">Integration with other modules</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">Citus Distributed</a>: one example of an extension, useful for seeing the extension model in practice</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>: pg_stat_statements + auto_explain are essential there</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">Online Schema Change</a>: pg_repack is an extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>: pg_partman is an extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">SQL Features Baseline</a>: extensions are one of PG's structural advantages</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">PG SQL Features Baseline</a>（extensions are a structural advantage）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">PG Citus Distributed</a>（extension example）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">PG Online Schema Change</a>（pg_repack）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">PG Declarative Partitioning</a>（pg_partman）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（pg_stat_statements + auto_explain）</li>
<li>Official: <a href="https://www.postgresql.org/docs/current/extend-extensions.html">PG Extensions</a> / <a href="https://github.com/pgvector/pgvector">pgvector</a> / <a href="https://docs.timescale.com/">TimescaleDB</a> / <a href="https://postgis.net/">PostGIS</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Full-Text Search: three search layers with tsvector / tsquery, GIN index, and pg_trgm fuzzy matching</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/full-text-search/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/full-text-search/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive that accompanies the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on &lt;em>full-text search&lt;/em>: the built-in tsvector / tsquery plus pg_trgm fuzzy matching.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pg-fts-機制tsvector--tsquery--gin-index">PG FTS mechanics: tsvector + tsquery + GIN index&lt;/h2>
&lt;p>PG's built-in full-text search has three building blocks:&lt;/p>
&lt;ul>
&lt;li>&lt;code>tsvector&lt;/code>: the document converted into a vector of &lt;em>lexemes&lt;/em> (word stem + position), stored in normalized form&lt;/li>
&lt;li>&lt;code>tsquery&lt;/code>: the search string parsed into query form&lt;/li>
&lt;li>GIN index: an inverted index built over the tsvector&lt;/li>
&lt;/ul>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- Document
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;The quick brown fox jumps over the lazy dog&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 結果：&amp;#39;brown&amp;#39;:3 &amp;#39;dog&amp;#39;:9 &amp;#39;fox&amp;#39;:4 &amp;#39;jump&amp;#39;:5 &amp;#39;lazi&amp;#39;:8 &amp;#39;quick&amp;#39;:2
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="c1">-- The/over 是 stop word 被過濾、jumps/lazy 轉字根、保留 position
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Query
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsquery&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;fox &amp;amp; dog&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 結果：&amp;#39;fox&amp;#39; &amp;amp; &amp;#39;dog&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Match
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;The quick brown fox&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">@@&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsquery&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;fox &amp;amp; quick&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- → true&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Index&lt;/strong>：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">articles&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">body&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- GIN index over tsvector (動態 cast)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">idx_articles_fts&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">articles&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">USING&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">GIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">body&lt;/span>&lt;span class="p">));&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Query 用 index
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">articles&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsvector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">body&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">@@&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">to_tsquery&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;english&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;postgres &amp;amp; index&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &amp;#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&amp;#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &amp;#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">JSONB GIN index&lt;/a> 同 GIN access method、不同 indexed expression。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive that accompanies the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on <em>full-text search</em>: the built-in tsvector / tsquery plus pg_trgm fuzzy matching.</p></blockquote>
<hr>
<h2 id="pg-fts-機制tsvector--tsquery--gin-index">PG FTS mechanics: tsvector + tsquery + GIN index</h2>
<p>PG's built-in full-text search has three building blocks:</p>
<ul>
<li><code>tsvector</code>: the document converted into a vector of <em>lexemes</em> (word stem + position), stored in normalized form</li>
<li><code>tsquery</code>: the search string parsed into query form</li>
<li>GIN index: an inverted index built over the tsvector</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Document
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;The quick brown fox jumps over the lazy dog&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- 結果：&#39;brown&#39;:3 &#39;dog&#39;:9 &#39;fox&#39;:4 &#39;jump&#39;:5 &#39;lazi&#39;:8 &#39;quick&#39;:2
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1">-- The/over 是 stop word 被過濾、jumps/lazy 轉字根、保留 position
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- Query
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;fox &amp; dog&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- Result: &#39;fox&#39; &amp; &#39;dog&#39;
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Match
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;The quick brown fox&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;fox &amp; quick&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- → true</span></span></span></code></pre></div><p><strong>Index</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">title</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">body</span><span class="w"> </span><span class="nb">TEXT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- GIN index over a tsvector expression (computed on the fly)
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_fts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">body</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- The query must repeat the same expression to use the index
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">body</span><span class="p">)</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres &amp; index&#39;</span><span class="p">);</span></span></span></code></pre></div><p>This uses the same GIN access method as the <a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">JSONB GIN index</a>; only the indexed expression (and its operator class) differs.</p>
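<p>User-typed search strings rarely follow <code>to_tsquery</code> syntax, and <code>to_tsquery</code> raises a syntax error on plain input such as <code>postgres index</code>. A minimal sketch of the built-in alternatives, assuming the <code>articles</code> table and expression index above; the search strings are made up for illustration, and <code>websearch_to_tsquery</code> requires PG 11+:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- plainto_tsquery: ANDs all words together and ignores operators and punctuation
SELECT plainto_tsquery('english', 'postgres index');

-- websearch_to_tsquery (PG 11+): understands quoted phrases, "or", and -exclusion,
-- and does not raise syntax errors on malformed input
SELECT websearch_to_tsquery('english', '"full text" search -mysql');

-- The WHERE expression is identical to the indexed expression,
-- so the planner can still use idx_articles_fts
SELECT id, title
FROM articles
WHERE to_tsvector('english', title || ' ' || body)
      @@ websearch_to_tsquery('english', 'postgres index');
</code></pre></div>
<p>Passing the user string as a bind parameter to one of these functions avoids hand-building <code>tsquery</code> syntax in the application.</p>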
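<h2 id="generated-column-加速">Speeding up with a generated column</h2>
<p>Calling <code>to_tsvector(...)</code> in every query wastes CPU: the expression has to be repeated verbatim to hit the index, and rechecks and ranking recompute it per row. A <em>generated column</em> stores the result once:</p>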





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="n">tsvector</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)))</span><span class="w"> </span><span class="n">STORED</span><span class="p">;</span><span class="w">
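</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">DROP</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">idx_articles_fts</span><span class="p">;</span><span class="w"> </span><span class="c1">-- the earlier expression index used the same name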
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_articles_fts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">fts</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- The query gets simpler
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Stored generated columns require PG 12+ and are recomputed automatically whenever the row is updated.</p>
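<p>Whether the query is actually served by the GIN index can be checked from the plan. A small sketch, assuming the <code>fts</code> column and <code>idx_articles_fts</code> from above; on a small table the planner may legitimately prefer a sequential scan, so the <code>enable_seqscan</code> toggle is only for experimentation in a test session:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title
FROM articles
WHERE fts @@ to_tsquery('english', 'postgres');
-- Look for a "Bitmap Index Scan on idx_articles_fts" node in the output

-- Experiment only: discourage sequential scans to see whether the index is usable at all
SET enable_seqscan = off;
EXPLAIN
SELECT id, title
FROM articles
WHERE fts @@ to_tsquery('english', 'postgres');
RESET enable_seqscan;
</code></pre></div>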
<h2 id="ranking--加權">Ranking + weighting</h2>
<p>PG FTS provides <code>ts_rank</code> and <code>ts_rank_cd</code> for ordering results:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Simple ranking
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">ts_rank</span><span class="p">(</span><span class="n">fts</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rank</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="p">,</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres &amp; index&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">query</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">query</span><span class="w">
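</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>Weights (A &gt; B &gt; C &gt; D) let matches in one field outrank matches in another. The <code>UPDATE</code> below assumes <code>fts</code> is a plain <code>tsvector</code> column; a stored generated column rejects direct assignment, so in that case the same <code>setweight(...)</code> expression belongs in the column definition:</p>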





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Title matters more than body
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">=</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)),</span><span class="w"> </span><span class="s1">&#39;A&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)),</span><span class="w"> </span><span class="s1">&#39;B&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- Query with weighted ranking
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">       </span><span class="n">ts_rank</span><span class="p">(</span><span class="n">fts</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">,</span><span class="w"> </span><span class="mi">32</span><span class="w"> </span><span class="cm">/* rank / (rank + 1), scales into 0..1 */</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rank</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">articles</span><span class="p">,</span><span class="w"> </span><span class="n">to_tsquery</span><span class="p">(</span><span class="s1">&#39;english&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;postgres&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">query</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="o">@@</span><span class="w"> </span><span class="n">query</span><span class="w">
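</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span></span></span></code></pre></div><p>The third argument of <code>ts_rank</code> is a normalization bitmask; flags can be combined with <code>|</code>:</p>
<ul>
<li>0 (default): ignore document length</li>
<li>1: divide the rank by 1 + log(document length)</li>
<li>2: divide the rank by document length</li>
<li>8: divide the rank by the number of unique words in the document</li>
<li>32: divide the rank by itself + 1, which scales it into the 0..1 range (cosmetic; it does not change the ordering)</li>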
</ul>
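<p>When <code>fts</code> is a stored generated column, the weights have to be part of its definition. A sketch under that assumption, reusing the <code>articles</code> table from above; dropping the column also drops the GIN index on it, so the index is recreated, and the table is rewritten in the process:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Redefine the generated column with per-field weights (title = A, body = B)
ALTER TABLE articles DROP COLUMN IF EXISTS fts;
ALTER TABLE articles ADD COLUMN fts tsvector
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body, '')), 'B')
) STORED;
CREATE INDEX idx_articles_fts ON articles USING GIN (fts);

-- Flags combine as a bitmask: 2 | 32 divides by document length,
-- then maps the result into the 0..1 range
SELECT id, title, ts_rank(fts, query, 2 | 32) AS rank
FROM articles, to_tsquery('english', 'postgres') AS query
WHERE fts @@ query
ORDER BY rank DESC
LIMIT 10;
</code></pre></div>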
<h2 id="multi-language-support">Multi-language Support</h2>
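<p>PG ships text search configurations for many languages: <code>english</code> / <code>french</code> / <code>german</code> / <code>spanish</code>, plus <code>simple</code> (no stemming), among others.</p>
<p>For <em>Chinese / Japanese / Korean</em>, PG has no built-in word segmentation, so an extension is needed:</p>
<ul>
<li><code>zhparser</code> (Chinese, segments words with SCWS)</li>
<li><code>pgroonga</code> (multilingual, supports CJK)</li>
<li><code>RUM index</code> (an extension index access method that stores positions in the index to speed up ranking and phrase search; it reuses PG's parsers and dictionaries, so CJK text still needs a segmenter such as <code>zhparser</code>)</li>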
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Chinese with zhparser
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">zhparser</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">SEARCH</span><span class="w"> </span><span class="n">CONFIGURATION</span><span class="w"> </span><span class="n">chinese</span><span class="w"> </span><span class="p">(</span><span class="n">PARSER</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">zhparser</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">SEARCH</span><span class="w"> </span><span class="n">CONFIGURATION</span><span class="w"> </span><span class="n">chinese</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">ADD</span><span class="w"> </span><span class="n">MAPPING</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="n">v</span><span class="p">,</span><span class="n">a</span><span class="p">,</span><span class="n">i</span><span class="p">,</span><span class="n">e</span><span class="p">,</span><span class="n">l</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="k">simple</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Usage
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">&#39;chinese&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;我愛 PostgreSQL 資料庫&#39;</span><span class="p">);</span></span></span></code></pre></div><p>For <em>primarily English</em> search the built-in configurations are enough; for <em>primarily CJK</em> search an extension is required.</p>
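<p>The effect of the chosen configuration is easiest to see side by side. A short sketch using only built-in configurations (the sample sentence is made up for illustration):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- english: stop words removed, plural stemmed
SELECT to_tsvector('english', 'The quick brown foxes');
-- 'brown':3 'fox':4 'quick':2

-- simple: tokens are only lowercased, nothing is removed or stemmed
SELECT to_tsvector('simple', 'The quick brown foxes');
-- 'brown':3 'foxes':4 'quick':2 'the':1

-- ts_debug shows, per token, its type and which dictionary produced the lexeme
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'The quick brown foxes');

-- List the configurations installed in this database
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;
</code></pre></div>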
<h2 id="pg_trgm--fuzzy-string-match">pg_trgm — Fuzzy String Match</h2>
<p>PG FTS is strong at <em>exact stem matching</em> and weak at <em>misspellings / similar strings</em>. The <code>pg_trgm</code> extension provides trigram-based fuzzy matching:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">pg_trgm</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- Build a GIN trigram index on the column
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_users_name_trgm</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIN</span><span class="w"> </span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="n">gin_trgm_ops</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- Fuzzy match (pg_trgm.similarity_threshold defaults to 0.3)
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">%</span><span class="w"> </span><span class="s1">&#39;jhon&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- → rows with similarity(name, &#39;jhon&#39;) &gt;= threshold; &#39;John&#39; scores only about 0.11, so short strings need a lower threshold
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Explicit similarity score
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">similarity</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;jhon&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">similarity</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;jhon&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span></span></span></code></pre></div><p>Use cases:</p>
<ul>
<li>Autocomplete / typeahead suggestion</li>
<li>Typo tolerance for user input</li>
<li>Faster ILIKE: <code>name ILIKE '%jhon%'</code> can use the GIN trigram index</li>
</ul>
<p>It complements FTS:</p>
<ul>
<li>FTS: full document search, with tokenizing / stemming / ranking</li>
<li>pg_trgm: short string similarity, typo tolerance</li>
</ul>
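<p>Trigram similarity is the number of shared trigrams divided by the number of distinct trigrams across both strings, which is why short strings with a transposed letter score low. A sketch for inspecting this and adjusting the threshold, assuming the <code>users</code> table from the example above; the threshold value 0.1 is an arbitrary illustration, not a recommendation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- pg_trgm lowercases the input, pads each word (two spaces before, one after),
-- then takes every 3-character window
SELECT show_trgm('john');   -- five trigrams, including "  j", " jo", "joh"
SELECT show_trgm('jhon');   -- five trigrams, only "  j" is shared with 'john'

-- 1 shared trigram out of 9 distinct ones: roughly 0.11, below the 0.3 default
SELECT similarity('John', 'jhon');

-- Lower the threshold for this session so the % operator accepts weaker matches
SET pg_trgm.similarity_threshold = 0.1;
SELECT name, similarity(name, 'jhon') AS sim
FROM users
WHERE name % 'jhon'
ORDER BY sim DESC
LIMIT 5;
RESET pg_trgm.similarity_threshold;
</code></pre></div>
<p>A lower threshold returns more candidates from the index, so ordering by <code>similarity</code> with a <code>LIMIT</code> keeps the result set useful.</p>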
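<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-dictionary-選錯--中文搜不到">1. Wrong dictionary: Chinese text is unsearchable</h3>
<p>Running <code>to_tsvector('english', text)</code> on a Chinese column does no word segmentation: a whole run of characters becomes a single token, so searches for individual words find nothing.</p>
<p>Fixes:</p>
<ul>
<li>Use <code>zhparser</code> / <code>pgroonga</code> for Chinese</li>
<li>Split multi-language content into <em>per-language columns</em>, or fall back to the <code>simple</code> configuration (no stemming; tokens are only lowercased and matched exactly)</li>
<li>Verify the configuration: run <code>SELECT to_tsvector('chinese', '...')</code> and inspect the segmentation</li>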
</ul>
<h3 id="2-gin-vs-gist-取捨選錯">2. Picking the wrong side of the GIN vs GiST trade-off</h3>
<p>PG FTS supports two index access methods:</p>
<ul>
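<li><em>GIN</em>: fast reads, slower writes, larger on disk; suits <em>read-heavy</em> workloads</li>
<li><em>GiST</em>: slower reads (the index is lossy, so matches are rechecked against the row), faster writes, smaller on disk; suits <em>write-heavy workloads or small documents</em></li>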
</ul>
<p>Default to GIN, which suits the large majority of search workloads. Consider GiST for <em>write-heavy tables with small documents</em>.</p>
<p>Fixes:</p>
<ul>
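<li>Default to GIN</li>
<li>For write throughput &gt; 10K writes/s, consider GiST or a <em>bulk index</em> pattern (PG cannot disable an index, so drop it, bulk insert, then recreate it)</li>
<li>GIN has a <code>fastupdate</code> storage option (on by default) that buffers new entries in a pending list to speed up writes; the trade-off is slower reads until the list is merged into the index</li>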
</ul>
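<p>The write-side options above map to a handful of statements. A sketch, assuming the <code>fts</code> column and <code>idx_articles_fts</code> index from earlier sections; the <code>maintenance_work_mem</code> value and the GiST index name are illustrative assumptions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Bulk load pattern: drop the index, load, rebuild once
DROP INDEX IF EXISTS idx_articles_fts;
-- ... COPY / bulk INSERT into articles here ...
SET maintenance_work_mem = '1GB';   -- illustrative; more memory speeds up the GIN build
CREATE INDEX idx_articles_fts ON articles USING GIN (fts);
RESET maintenance_work_mem;

-- Trade write speed for predictable read latency by turning the pending list off
ALTER INDEX idx_articles_fts SET (fastupdate = off);

-- Or keep fastupdate on and flush the pending list explicitly (PG 9.6+)
SELECT gin_clean_pending_list('idx_articles_fts'::regclass);

-- GiST alternative for write-heavy tables with small documents
CREATE INDEX idx_articles_fts_gist ON articles USING GiST (fts);
</code></pre></div>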
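<h3 id="3-ranking-評分權重不對齊-business">3. Ranking weights that do not match business priorities</h3>
<p><code>ts_rank</code> scores by how often the query lexemes occur, while <code>ts_rank_cd</code> scores by cover density (how close the matches sit to each other), so the two order results differently. Both apply the default weights {0.1, 0.2, 0.4, 1.0} for D / C / B / A, and lexemes that were never labeled with <code>setweight</code> all count as D. If the application does not know <em>which rank function and weights its query uses</em>, the ordering looks arbitrary.</p>
<p>Fixes:</p>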
<ul>
<li>Choose the ranking function explicitly: <code>ts_rank</code> for general use, <code>ts_rank_cd</code> when <em>proximity matters</em></li>
<li>Set <em>field weights</em> (A &gt; B &gt; C &gt; D) that reflect business priority (title &gt; body &gt; tags)</li>
<li>Evaluate ranking quality on <em>real search results</em> with A/B tests rather than intuition</li>
</ul>
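<p>Comparing both functions on the same query makes the difference visible before committing to one. A sketch, assuming the weighted <code>fts</code> column on <code>articles</code>; the weight array is ordered {D, C, B, A}, the first array is simply the default spelled out, and the second array is a hypothetical alternative that pushes A-labeled (title) matches further ahead:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">SELECT id, title,
       -- frequency-based rank with the default weights written explicitly
       ts_rank('{0.1, 0.2, 0.4, 1.0}'::float4[], fts, query)     AS rank_freq,
       -- cover-density rank with hypothetical weights favoring A over B
       ts_rank_cd('{0.05, 0.1, 0.2, 1.0}'::float4[], fts, query) AS rank_cover
FROM articles, to_tsquery('english', 'postgres &amp; index') AS query
WHERE fts @@ query
ORDER BY rank_freq DESC
LIMIT 10;
</code></pre></div>
<p>Rows where the two columns disagree sharply are good candidates for a manual relevance review.</p>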
<h3 id="4-multi-language-column-處理">4. Handling multi-language columns</h3>
<p>When one table stores rows in several languages (user-generated content, for example), a single <code>to_tsvector('english', ...)</code> makes Chinese rows unsearchable and stems French rows incorrectly.</p>
<p>Fixes:</p>
<ul>
<li>
<p>Add a <code>language</code> column that records each row's language</p>
</li>
<li>
<p>Pick the configuration dynamically per row:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">articles</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">fts</span><span class="w"> </span><span class="n">tsvector</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">to_tsvector</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">        </span><span class="k">CASE</span><span class="w"> </span><span class="k">WHEN</span><span class="w"> </span><span class="k">language</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;zh&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="s1">&#39;chinese&#39;</span><span class="p">::</span><span class="n">regconfig</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">             </span><span class="k">WHEN</span><span class="w"> </span><span class="k">language</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;fr&#39;</span><span class="w"> </span><span class="k">THEN</span><span class="w"> </span><span class="s1">&#39;french&#39;</span><span class="p">::</span><span class="n">regconfig</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">             </span><span class="k">ELSE</span><span class="w"> </span><span class="s1">&#39;english&#39;</span><span class="p">::</span><span class="n">regconfig</span><span class="w"> </span><span class="k">END</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">        </span><span class="n">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">    </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">STORED</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p>At query time, build the <code>to_tsquery</code> with the matching language configuration</p>
</li>
</ul>
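<p>On the query side, the same configuration has to be used for the rows being searched. A sketch, assuming the <code>language</code> column holds the codes <code>'fr'</code> / <code>'zh'</code> as in the column definition above and that a <code>chinese</code> configuration has been created with zhparser; the search terms are made up for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- French rows: stem the query with the french configuration
SELECT id, title
FROM articles
WHERE language = 'fr'
  AND fts @@ to_tsquery('french', 'recherche &amp; texte');

-- Chinese rows: segment the query with the same parser that built the tsvector
SELECT id, title
FROM articles
WHERE language = 'zh'
  AND fts @@ plainto_tsquery('chinese', '資料庫');
</code></pre></div>
<p>Filtering on <code>language</code> first keeps a query stemmed for one language from being matched against lexemes produced by another.</p>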
<h3 id="5-何時不該用-pg-fts--應該換-elasticsearch--opensearch">5. When not to use PG FTS: switch to Elasticsearch / OpenSearch</h3>
<p>PG FTS fits <em>small to medium-scale search</em>. It is a poor fit for:</p>
<ul>
<li><em>&gt; 100M documents</em> with high-QPS search</li>
<li>Need for <em>complex aggregation</em> (faceted search)</li>
<li>Need for <em>advanced ranking</em> (BM25 / learning to rank)</li>
<li>Need for <em>distributed search</em> (PG FTS runs on a single node)</li>
<li>Need for <em>near-real-time indexing</em> under heavy write load (GIN updates are relatively slow)</li>
</ul>
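<p>For these cases, use a dedicated search engine such as Elasticsearch / OpenSearch / Meilisearch / Typesense.</p>
<p>The <em>advantage</em> of PG FTS is that search lives in the <em>same transaction as the OLTP data</em>: no ETL is needed to sync a search index, and rows written to PG are searchable as soon as the transaction commits. When application data and search share the <em>same source</em>, PG FTS is the better fit.</p>
<h2 id="何時用-pg-fts">When to use PG FTS</h2>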
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Choice</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application internal search (admin / dashboard)</td>
          <td>PG FTS</td>
      </tr>
      <tr>
          <td>&lt; 10M documents, low QPS (&lt; 100/s)</td>
          <td>PG FTS</td>
      </tr>
      <tr>
          <td>Search must be in the same transaction as OLTP data</td>
          <td>PG FTS</td>
      </tr>
      <tr>
          <td>Fuzzy / typo tolerance</td>
          <td>PG FTS + pg_trgm</td>
      </tr>
      <tr>
          <td>&gt; 100M document + high QPS</td>
          <td>Elasticsearch / OpenSearch</td>
      </tr>
      <tr>
          <td>Faceted aggregation</td>
          <td>Elasticsearch / OpenSearch</td>
      </tr>
      <tr>
          <td>Vector similarity (semantic search)</td>
          <td>pgvector (same PG instance)</td>
      </tr>
  </tbody>
</table>
<p>Combining PG FTS with pgvector is a strong choice for <em>small to medium-scale hybrid keyword + semantic search</em>.</p>
<h2 id="跟其他模組整合">Integration with other modules</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">JSONB Deep Dive</a>: both JSONB and FTS use GIN</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">Extension Ecosystem</a>: pg_trgm / pgroonga / zhparser are all extensions</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">Query Optimization</a>: EXPLAIN for FTS queries</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>: FTS GIN indexes are replicated to standbys automatically</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">PG Extension Ecosystem</a>（pg_trgm / pgroonga）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">PG JSONB Deep Dive</a> (shares the GIN index type)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PG Query Optimization</a>（FTS query plan）</li>
<li>Official docs: <a href="https://www.postgresql.org/docs/current/textsearch.html">PG Full-Text Search</a> / <a href="https://www.postgresql.org/docs/current/pgtrgm.html">pg_trgm</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Replication Slot Management: Governing Physical, Logical, and Failover Slots</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-slot-management/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-slot-management/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive that accompanies the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on &lt;em>replication slot management&lt;/em>: how to govern physical, logical, and failover slots.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="replication-slot-兩大類">Replication Slot 兩大類&lt;/h2>
&lt;p>PG 兩種 replication slot：&lt;/p>
&lt;h3 id="physical-replication-slot">Physical Replication Slot&lt;/h3>
&lt;p>對應 &lt;em>streaming replication&lt;/em>（physical WAL byte-level）：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_create_physical_replication_slot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;standby1_slot&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用於：&lt;/p>
&lt;ul>
&lt;li>Streaming replication standby（&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &amp;#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &amp;#43; LSN-based 進度追蹤 &amp;#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &amp;#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &amp;#43; logical replication 整合">Replication Topology&lt;/a>）&lt;/li>
&lt;li>pg_basebackup 用 slot 防 WAL 清理&lt;/li>
&lt;li>高 lag standby 防 WAL premature deletion&lt;/li>
&lt;/ul>
&lt;h3 id="logical-replication-slot">Logical Replication Slot&lt;/h3>
&lt;p>Used for &lt;em>logical replication and logical decoding&lt;/em>:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_create_logical_replication_slot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;my_slot&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;pgoutput&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 或用 wal2json plugin
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_create_logical_replication_slot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;debezium_slot&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;wal2json&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用於：&lt;/p>
&lt;ul>
&lt;li>PG-to-PG logical replication（publication / subscription）&lt;/li>
&lt;li>CDC（Debezium / Maxwell / pg_logical_emitter）&lt;/li>
&lt;li>Multi-master replication（BDR / pgEdge / Spock）&lt;/li>
&lt;/ul>
&lt;p>logical slot 跟 physical slot 共存、各自獨立 retention。&lt;/p>
&lt;h2 id="slot-lifecycle">Slot Lifecycle&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">建立 → active（有 consumer）→ inactive（consumer 失聯）→ drop
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> WAL 持續累積（直到推進 LSN 或 drop）&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>狀態查詢&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive that accompanies the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PG sits in the OLTP landscape; this article focuses on <em>replication slot management</em>: how to govern physical, logical, and failover slots.</p></blockquote>
<hr>
<h2 id="replication-slot-兩大類">The two kinds of replication slot</h2>
<p>PG has two kinds of replication slot:</p>
<h3 id="physical-replication-slot">Physical Replication Slot</h3>
<p>Used for <em>streaming replication</em> (physical, byte-level WAL shipping):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_physical_replication_slot</span><span class="p">(</span><span class="s1">&#39;standby1_slot&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Typical uses:</p>
<ul>
<li>Streaming replication standbys (see <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>)</li>
<li>pg_basebackup, which can use a slot so the WAL it needs is not recycled</li>
<li>High-lag standbys, to prevent premature WAL removal</li>
</ul>
<h3 id="logical-replication-slot">Logical Replication Slot</h3>
<p>Used for <em>logical replication and logical decoding</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_logical_replication_slot</span><span class="p">(</span><span class="s1">&#39;my_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;pgoutput&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- Or use the wal2json output plugin (installed separately)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_logical_replication_slot</span><span class="p">(</span><span class="s1">&#39;debezium_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;wal2json&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Typical uses:</p>
<ul>
<li>PG-to-PG logical replication (publication / subscription)</li>
<li>CDC (Debezium / pg_recvlogical / pg_logical_emitter)</li>
<li>Multi-master replication (BDR / pgEdge / Spock)</li>
</ul>
<p>Logical and physical slots can coexist on the same server; each slot retains WAL independently.</p>
<h2 id="slot-lifecycle">Slot Lifecycle</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">create → active (consumer connected) → inactive (consumer disconnected) → drop
</span></span><span class="line"><span class="cl">                                           ↓
</span></span><span class="line"><span class="cl">                              WAL keeps accumulating (until the LSN advances or the slot is dropped)</span></span></code></pre></div><p><strong>Checking slot state</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">slot_type</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">active</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">restart_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">confirmed_flush_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">retained_wal</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="p">;</span></span></span></code></pre></div><p>Key columns:</p>
<ul>
<li><code>slot_type</code>: <code>physical</code> / <code>logical</code></li>
<li><code>active</code>: true / false (whether a consumer is currently connected)</li>
<li><code>restart_lsn</code>: the oldest LSN the slot's consumer may still need; the primary must keep all WAL from this point onward</li>
<li><code>confirmed_flush_lsn</code>: for a logical slot, the LSN up to which the consumer has confirmed receipt (NULL for physical slots)</li>
<li><code>retained_wal</code>: the amount of WAL currently retained because of this slot (the computed column in the query)</li>
</ul>
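<p>To see which connection is holding each slot, the slot view can be joined to <code>pg_stat_replication</code> on the walsender PID. This is a minimal sketch using standard catalog columns only; the <code>LEFT JOIN</code> is a choice made here so that inactive slots still appear, with NULLs on the connection side.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- One row per slot, with the attached walsender (if any)
SELECT s.slot_name,
       s.slot_type,
       s.active_pid,
       r.application_name,
       r.client_addr,
       r.state
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_replication AS r ON r.pid = s.active_pid
ORDER BY s.slot_name;
</code></pre></div><p>A slot with no matching <code>application_name</code> has no walsender attached, which is the first thing to check when <code>retained_wal</code> keeps growing.</p>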
<h2 id="failover-slot-synchronization-pg-17">Failover Slot Synchronization (PG 17+)</h2>
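<p>The <em>pain point</em> before PG 17: a logical replication slot is <em>state that lives on the primary</em>. After a failover the <em>new primary does not have the slot</em>, CDC consumers lose their position, and the pipeline has to be rebuilt, which is a large job.</p>
<p>PG 17 adds <em>failover slot synchronization</em>:</p>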





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- PG 17+: mark the slot as a failover slot
</span></span></span><span class="line"><span class="cl"><span class="c1">-- signature: pg_create_logical_replication_slot(slot_name, plugin, temporary, two_phase, failover)
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_create_logical_replication_slot</span><span class="p">(</span><span class="s1">&#39;my_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;pgoutput&#39;</span><span class="p">,</span><span class="w"> </span><span class="k">false</span><span class="p">,</span><span class="w"> </span><span class="k">false</span><span class="p">,</span><span class="w"> </span><span class="k">true</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">--                                                                          ↑
</span></span></span><span class="line"><span class="cl"><span class="c1">--                                                                     failover = true (5th argument)
</span></span></span><span class="line"><span class="cl"><span class="c1">-- Note: the 4th argument is two_phase (false here); failover is the 5th
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- On the standby: enable sync_replication_slots
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">sync_replication_slots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">on</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_reload_conf</span><span class="p">();</span></span></span></code></pre></div><p>With <code>sync_replication_slots = on</code>, the standby's slot sync worker periodically copies the state of failover-enabled logical slots from the primary. The standby also needs <code>hot_standby_feedback = on</code>, a <code>primary_slot_name</code>, and a <code>dbname</code> in <code>primary_conninfo</code>; on the primary, listing the standby's physical slot in <code>synchronized_standby_slots</code> keeps logical consumers from getting ahead of the standby. After the standby is promoted, the synced logical slots remain usable and CDC consumers only need to reconnect.</p>
<p>Before PG 17, extensions such as <a href="https://www.pgedge.com/">pgEdge</a> or <em>pglogical</em> were used to get similar behavior; it is now built into PG core.</p>
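<p>One way to check that synchronization is working, sketched under the assumption of a PG 17 standby that is already streaming from the primary: the <code>failover</code> and <code>synced</code> columns of <code>pg_replication_slots</code> show which logical slots are failover-enabled and which were copied from the primary.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on the standby (PG 17+): which logical slots have been synced?
SELECT slot_name, plugin, failover, synced
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Standby-side settings that slot synchronization depends on
SELECT name, setting
FROM pg_settings
WHERE name IN ('sync_replication_slots', 'hot_standby_feedback', 'primary_slot_name');
</code></pre></div><p>If a slot created with <code>failover = true</code> on the primary never appears here with <code>synced = true</code>, the standby-side settings are the first place to look.</p>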
<h2 id="orphan-slot-治理">Governing orphan slots</h2>
<p>A slot with <code>active = false</code> keeps retaining WAL; filling the disk this way is a classic PG production incident.</p>
<h3 id="監控-orphan-slot">Monitoring orphan slots</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Find inactive slots that retain more than 1 GB of WAL
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">       </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">retained_wal</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">active</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="k">AND</span><span class="w"> </span><span class="n">pg_wal_lsn_diff</span><span class="p">(</span><span class="n">pg_current_wal_lsn</span><span class="p">(),</span><span class="w"> </span><span class="n">restart_lsn</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">1024</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">1024</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">1024</span><span class="p">;</span><span class="w">  </span><span class="c1">-- &gt; 1 GB</span></span></span></code></pre></div><h3 id="自動-invalidate-slotpg-13">Automatic slot invalidation (PG 13+)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- ALTER SYSTEM writes to postgresql.auto.conf; apply with SELECT pg_reload_conf()
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">SYSTEM</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">max_slot_wal_keep_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;50GB&#39;</span><span class="p">;</span><span class="w">  </span><span class="c1">-- slots retaining more than 50 GB are invalidated</span></span></span></code></pre></div><p>When the WAL retained for a slot exceeds <code>max_slot_wal_keep_size</code>, PG invalidates the slot at the next checkpoint: the WAL it needed is removed and its <code>wal_status</code> in <code>pg_replication_slots</code> becomes <code>lost</code>. The consumer can no longer resume from that slot and has to be rebuilt (base backup + new slot).</p>
<p>This is a <em>trade-off</em>:</p>
<ul>
<li>Set a limit: the disk is protected, but a consumer that stays disconnected too long needs a full rebuild</li>
<li>Set no limit: a disconnected consumer can always catch up, but the disk can fill</li>
</ul>
<p>A common starting point is to set <code>max_slot_wal_keep_size</code> to roughly <em>50% of the capacity</em> of the volume that holds <code>pg_wal/</code>, which leaves headroom before the disk is completely full; tune it to the WAL generation rate and to how long consumers are allowed to stay offline.</p>
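<p>Before a slot is invalidated, <code>pg_replication_slots</code> already shows how close it is to the limit. The sketch below assumes PG 13+ (where <code>wal_status</code> and <code>safe_wal_size</code> exist); the 5 GB threshold is an arbitrary illustrative value, not a recommendation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Slots that are at risk ('unreserved'), already invalidated ('lost'),
-- or within 5 GB (illustrative threshold) of max_slot_wal_keep_size
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
WHERE wal_status IN ('unreserved', 'lost')
   OR safe_wal_size &lt; 5::bigint * 1024 * 1024 * 1024;
</code></pre></div><p><code>safe_wal_size</code> is NULL when <code>max_slot_wal_keep_size</code> is unlimited (<code>-1</code>), so this check only becomes meaningful once a limit is set.</p>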
<h3 id="手動-drop-orphan-slot">Manually dropping an orphan slot</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="c1">-- Confirm the slot is really no longer needed
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">slot_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;old_standby_slot&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">-- Drop
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_drop_replication_slot</span><span class="p">(</span><span class="s1">&#39;old_standby_slot&#39;</span><span class="p">);</span></span></span></code></pre></div><p>The DR runbook must include a <em>standby decommissioning procedure</em>: fence the standby first, then drop its slot on the primary.</p>
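<p>The check and the drop can also be combined so that nothing is attempted while a consumer is still attached. A minimal sketch, reusing the <code>old_standby_slot</code> name from the example above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Drops the slot only if it exists and is inactive; returns zero rows otherwise
SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'old_standby_slot'
  AND NOT active;
</code></pre></div><p><code>pg_drop_replication_slot</code> raises an error for an active slot anyway; the filter only turns that case into an empty result, which is easier to handle in a scripted runbook.</p>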
<h2 id="5-個-production-踩雷">5 production pitfalls</h2>
<h3 id="1-orphan-slot-disk-爆">1. An orphan slot fills the disk</h3>
<p>The most classic PG incident: a standby is decommissioned without dropping its slot, the primary keeps retaining WAL, <code>pg_wal/</code> grows until the disk is full, and the primary goes down as well.</p>
<p>Fixes:</p>
<ul>
<li>Monitor <code>pg_replication_slots</code>, in particular <code>pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))</code> as the retained WAL</li>
<li>Set <code>max_slot_wal_keep_size</code> (PG 13+) as a hard limit</li>
<li>Make the standby decommissioning runbook enforce <em>fence first, then drop the slot</em></li>
<li>Run a cron job that alerts on orphan slots</li>
</ul>
<h3 id="2-logical-slot-lag--cdc-consumer-跟不上">2. Logical slot lag: the CDC consumer cannot keep up</h3>
<p>Logical decoding is slower than physical replication because changes are reassembled into per-transaction logical events. When a CDC consumer such as Debezium cannot keep up, slot lag accumulates.</p>
<p>Fixes:</p>
<ul>
<li>Monitor <code>pg_replication_slots.confirmed_flush_lsn</code> against <code>pg_current_wal_lsn()</code> on the primary</li>
<li>Tune the CDC consumer (throughput, batch size)</li>
<li>Throttle writes at the source if the consumer cannot be scaled up</li>
<li>Split hot tables into separate publications / subscriptions so that a single slot does not process every change</li>
</ul>
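<p>The lag comparison in the first bullet can be written as a single query. This is a sketch meant to run on the primary, since <code>pg_current_wal_lsn()</code> cannot be called on a server in recovery; the column alias <code>unconfirmed_wal</code> is a name chosen here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- WAL generated on the primary that each logical consumer has not yet confirmed
SELECT slot_name,
       plugin,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS unconfirmed_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) DESC;
</code></pre></div><p>Tracking this value over time shows whether a consumer is temporarily behind or falling behind steadily, which decides between tuning the consumer and throttling the source.</p>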
<p>See <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> for details.</p>
<h3 id="3-failover-後-logical-slot-丟pg-16-之前">3. Logical slots lost after failover (PG 16 and earlier)</h3>
<p>On PG 16 and earlier, when a standby is promoted during failover, the new primary does not have the original logical slots. The CDC consumer tries to connect and gets ERROR: <code>replication slot &quot;xxx&quot; does not exist</code>.</p>
<p>Fixes (PG 17+):</p>
<ul>
<li>Use <em>failover slot synchronization</em> (described above)</li>
<li>Create the slot with <code>pg_create_logical_replication_slot(..., failover := true)</code></li>
<li>Set <code>sync_replication_slots = on</code> on the standby</li>
</ul>
<p>Fixes (PG 16 and earlier):</p>
<ul>
<li>Use the <a href="https://www.2ndquadrant.com/en/resources/pglogical/">pglogical</a> or <a href="https://www.pgedge.com/">pgEdge</a> extension</li>
<li>Include <em>recreating logical slots on the new primary</em> in the failover runbook (CDC consumers take a new snapshot)</li>
<li>Pre-create the slot on the standby and sync it manually (an early workaround)</li>
</ul>
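<h3 id="4-wal_keep_size-跟-slot-衝突">4. <code>wal_keep_size</code> overlapping with slots</h3>
<p><code>wal_keep_size</code> (PG 13+) / <code>wal_keep_segments</code> (&lt; 13) and slots both retain WAL:</p>
<ul>
<li><code>wal_keep_size</code>: a fixed minimum amount of WAL to retain</li>
<li>Slot: dynamic retention until the consumer advances</li>
</ul>
<p>When both are set, the WAL actually retained is <code>max(wal_keep_size, what the slots need)</code>, with the slot side capped by <code>max_slot_wal_keep_size</code> if that is set.</p>
<p>Fixes:</p>
<ul>
<li>Keep <code>wal_keep_size</code> small (for example 1-2 GB) as a <em>minimum safety margin</em></li>
<li>Rely mainly on slots for dynamic retention on behalf of active consumers</li>
<li>Monitor the size of <code>pg_wal/</code> and break down the retention source (how much is due to <code>wal_keep_size</code> versus slots)</li>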
</ul>
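<p>A rough way to break down the retention sources, assuming PG 13+ setting names: read the configured floor and ceiling from <code>pg_settings</code>, then compare them with the largest amount any single slot is holding back. The alias <code>largest_slot_retention</code> is a name chosen for this sketch.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Configured WAL retention settings (values are in the unit shown)
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_keep_size', 'max_slot_wal_keep_size', 'max_wal_size');

-- Largest amount of WAL held back by any single slot (run on the primary)
SELECT pg_size_pretty(max(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))) AS largest_slot_retention
FROM pg_replication_slots;
</code></pre></div><p>If <code>largest_slot_retention</code> is well above <code>wal_keep_size</code>, slots are what is driving the size of <code>pg_wal/</code>; if it is below, the fixed floor is.</p>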
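<h3 id="5-slot-數量上限">5. Slot count limit</h3>
<p><code>max_replication_slots</code> defaults to 10; once the limit is reached, creating a new slot fails with an error. Changing the setting requires a server restart.</p>
<p>Fixes:</p>
<ul>
<li>On large production clusters, set <code>max_replication_slots = 50</code> or more</li>
<li>When <em>standbys + logical replication + CDC consumers</em> run at the same time, calculate the number of slots needed</li>
<li>Alert when <code>SELECT count(*) FROM pg_replication_slots</code> approaches the limit</li>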
</ul>
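<p>The count-versus-limit check in the last bullet can be sketched as one query. The column aliases are names chosen here, and <code>NULLIF</code> only guards against a limit of 0.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Slots in use compared with max_replication_slots
SELECT count(*) AS slots_in_use,
       current_setting('max_replication_slots')::int AS slots_limit,
       round(100.0 * count(*) / NULLIF(current_setting('max_replication_slots')::int, 0), 1) AS pct_used
FROM pg_replication_slots;
</code></pre></div><p>Alerting on <code>pct_used</code> rather than on an absolute count keeps the rule valid after the limit is raised.</p>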
<h2 id="slot-naming-convention">Slot Naming Convention</h2>
<p>Large production clusters carry many slots, so a naming convention matters:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">&lt;consumer-type&gt;_&lt;consumer-name&gt;_&lt;purpose&gt;
</span></span><span class="line"><span class="cl">Examples:
</span></span><span class="line"><span class="cl">- physical_standby1_replication
</span></span><span class="line"><span class="cl">- physical_standby2_replication
</span></span><span class="line"><span class="cl">- logical_debezium_orders_cdc
</span></span><span class="line"><span class="cl">- logical_pgedge_node2_subscription
</span></span><span class="line"><span class="cl">- physical_pgbasebackup_temp (used for a base backup; dropped after it completes)</span></span></code></pre></div><p>Clear names make it possible to tell <em>from the slot name alone</em> what the slot is for, who owns it, and whether it can be dropped.</p>
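<p>A convention like this can also be checked mechanically. The sketch below flags slots whose names do not follow the pattern or whose prefix disagrees with the actual <code>slot_type</code>; the regular expression is a hypothetical encoding of the convention above, not something PG enforces.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Slots that break the &lt;consumer-type&gt;_&lt;consumer-name&gt;_&lt;purpose&gt; convention
SELECT slot_name, slot_type, active
FROM pg_replication_slots
WHERE slot_name::text !~ '^(physical|logical)_[a-z0-9]+_[a-z0-9_]+$'
   OR split_part(slot_name::text, '_', 1) &lt;&gt; slot_type;
</code></pre></div><p>Slots returned here are the ones an on-call engineer cannot classify from the name alone, so they are good candidates for an ownership review.</p>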
<h2 id="跟其他模組整合">Integration with other modules</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>: physical slots serve streaming replication</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>: logical slots serve CDC</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/bdr-multi-master/" data-link-title="PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理" data-link-desc="PG 預設是 single-primary、active-active 多寫入入口需要 *BDR (EDB)* / *pgEdge* / *Bucardo* 等 extension。本文走 3 種 multi-master 方案對比、conflict detection &#43; resolution model、async vs sync 取捨、配置 step-by-step（pgEdge 為主）、5 production 踩雷（last-write-wins data loss / sequence collision / DDL replication / conflict log 治理 / failover 後 timeline 分歧）、跟 MySQL Group Replication sibling 對比">BDR / Multi-Master</a>: multi-master setups rely heavily on logical slots</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a>: WAL archiving and slots are two separate WAL retention mechanisms that can run side by side</li>
</ul>
<h2 id="監控-metric">Monitoring metrics</h2>
<p>Monitor these continuously in production:</p>
<ul>
<li><code>pg_replication_slots.active</code>: disconnected slots</li>
<li><code>pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)</code>: WAL retained per slot</li>
<li><code>pg_replication_slots.confirmed_flush_lsn</code> vs <code>pg_current_wal_lsn()</code>: logical slot lag</li>
<li><code>pg_ls_waldir()</code> for the size of the <code>pg_wal/</code> directory</li>
<li><code>count(*) FROM pg_replication_slots</code> as a ratio of <code>max_replication_slots</code></li>
</ul>
<p>Export these to Datadog or Prometheus and alert on them.</p>
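<p>For the <code>pg_wal/</code> size metric, <code>pg_ls_waldir()</code> can be aggregated directly in SQL. A minimal sketch, assuming the monitoring role is a superuser or a member of <code>pg_monitor</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Number of files and total size of the pg_wal/ directory
SELECT count(*) AS wal_files,
       pg_size_pretty(sum(size)) AS wal_dir_size
FROM pg_ls_waldir();
</code></pre></div><p>Comparing this total with the per-slot retained WAL shows whether slots explain the directory size or whether something else, such as a failing <code>archive_command</code>, is holding WAL back.</p>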
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（physical slot 用途）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PG Logical Replication + Debezium</a>（logical slot 用途）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/bdr-multi-master/" data-link-title="PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理" data-link-desc="PG 預設是 single-primary、active-active 多寫入入口需要 *BDR (EDB)* / *pgEdge* / *Bucardo* 等 extension。本文走 3 種 multi-master 方案對比、conflict detection &#43; resolution model、async vs sync 取捨、配置 step-by-step（pgEdge 為主）、5 production 踩雷（last-write-wins data loss / sequence collision / DDL replication / conflict log 治理 / failover 後 timeline 分歧）、跟 MySQL Group Replication sibling 對比">PG BDR / Multi-Master</a>（multi-master 大量 slot）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PG PITR + WAL Archiving</a>（WAL retention 兩種機制）</li>
<li>Official docs: <a href="https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS">PG Replication Slots</a> / <a href="https://www.postgresql.org/docs/current/logicaldecoding.html">Logical Replication Slot</a></li>
</ul>
]]></content:encoded></item><item><title>TimescaleDB Deep Dive: Hypertable / Continuous Aggregate / Compression Turn PG into a Time-Series DB</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/timescaledb-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/timescaledb-deep-dive/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>TimescaleDB extension&lt;/em> — 用 PG 解 time-series workload 的路徑、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem&lt;/a> 是 &lt;em>單一 extension 細節 vs ecosystem 全景&lt;/em> 的關係。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="timescaledb-是-pg-的-time-series-specialization">TimescaleDB Is PG's &lt;em>Time-Series Specialization&lt;/em>&lt;/h2>
&lt;p>TimescaleDB is not a standalone database; it is a PG extension:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">timescaledb&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Once it is installed, PG gains three time-series-specific mechanisms:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Hypertable&lt;/strong>: automatic partitioning on the time column; the application still sees a single table&lt;/li>
&lt;li>&lt;strong>Continuous aggregate&lt;/strong>: a materialized view that refreshes incrementally&lt;/li>
&lt;li>&lt;strong>Compression&lt;/strong>: compresses older chunks into a columnar-like format&lt;/li>
&lt;/ol>
&lt;p>Compared with purpose-built time-series databases (InfluxDB / Prometheus / VictoriaMetrics), TimescaleDB's selling point is not "fastest" but "consistent with the PG ecosystem":&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Dimension&lt;/th>
 &lt;th>TimescaleDB&lt;/th>
 &lt;th>InfluxDB&lt;/th>
 &lt;th>Prometheus&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Query language&lt;/td>
 &lt;td>Standard SQL&lt;/td>
 &lt;td>InfluxQL / Flux&lt;/td>
 &lt;td>PromQL&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Write throughput&lt;/td>
 &lt;td>Medium (10K-100K rows/s, indicative)&lt;/td>
 &lt;td>High (500K+ rows/s, indicative)&lt;/td>
 &lt;td>Medium (pull-based scrape)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Compression&lt;/td>
 &lt;td>90%+ (columnar compression)&lt;/td>
 &lt;td>High&lt;/td>
 &lt;td>High&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Join&lt;/td>
 &lt;td>Full SQL joins&lt;/td>
 &lt;td>Weak&lt;/td>
 &lt;td>No SQL joins&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Fit with an existing PG schema&lt;/td>
 &lt;td>Same database, joinable&lt;/td>
 &lt;td>Separate&lt;/td>
 &lt;td>Separate&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Ecosystem&lt;/td>
 &lt;td>Full PG ecosystem&lt;/td>
 &lt;td>Own ecosystem&lt;/td>
 &lt;td>Own ecosystem&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Open source&lt;/td>
 &lt;td>Apache 2.0 (some features under the TSL license)&lt;/td>
 &lt;td>MIT&lt;/td>
 &lt;td>Apache 2.0&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>When to choose TimescaleDB&lt;/strong>:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>TimescaleDB extension</em> — 用 PG 解 time-series workload 的路徑、跟 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 是 <em>單一 extension 細節 vs ecosystem 全景</em> 的關係。</p></blockquote>
<hr>
<h2 id="timescaledb-是-pg-的-time-series-specialization">TimescaleDB Is PG's <em>Time-Series Specialization</em></h2>
<p>TimescaleDB is not a standalone database; it is a PG extension:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">timescaledb</span><span class="p">;</span></span></span></code></pre></div><p>Once it is installed, PG gains three time-series-specific mechanisms:</p>
<ol>
<li><strong>Hypertable</strong>: automatic partitioning on the time column; the application still sees a single table</li>
<li><strong>Continuous aggregate</strong>: a materialized view that refreshes incrementally</li>
<li><strong>Compression</strong>: compresses older chunks into a columnar-like format</li>
</ol>
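<p>A minimal sketch of checking the installation, assuming the <code>timescaledb</code> library has already been added to <code>shared_preload_libraries</code> (which needs a server restart) and that the session has the privileges to create extensions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- The library must be preloaded before the extension can be created
SHOW shared_preload_libraries;

-- Idempotent install in the current database
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Confirm which extension version this database is running
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'timescaledb';</code></pre></div>
<p>The extension is installed per database, so the last query has to be run in each database that uses hypertables.</p>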
<p>Compared with purpose-built time-series databases (InfluxDB / Prometheus / VictoriaMetrics), TimescaleDB's selling point is not "fastest" but "consistent with the PG ecosystem":</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>TimescaleDB</th>
          <th>InfluxDB</th>
          <th>Prometheus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Query language</td>
          <td>Standard SQL</td>
          <td>InfluxQL / Flux</td>
          <td>PromQL</td>
      </tr>
      <tr>
          <td>Write throughput</td>
          <td>Medium (10K-100K rows/s, indicative)</td>
          <td>High (500K+ rows/s, indicative)</td>
          <td>Medium (pull-based scrape)</td>
      </tr>
      <tr>
          <td>Compression</td>
          <td>90%+ (columnar compression)</td>
          <td>High</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Join</td>
          <td>Full SQL joins</td>
          <td>Weak</td>
          <td>No SQL joins</td>
      </tr>
      <tr>
          <td>Fit with an existing PG schema</td>
          <td>Same database, joinable</td>
          <td>Separate</td>
          <td>Separate</td>
      </tr>
      <tr>
          <td>Ecosystem</td>
          <td>Full PG ecosystem</td>
          <td>Own ecosystem</td>
          <td>Own ecosystem</td>
      </tr>
      <tr>
          <td>Open source</td>
          <td>Apache 2.0 (some features under the TSL license)</td>
          <td>MIT</td>
          <td>Apache 2.0</td>
      </tr>
  </tbody>
</table>
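<p><strong>When to choose TimescaleDB</strong>:</p>
<ul>
<li>The application already runs on PG and you do not want to operate a separate time-series database</li>
<li>You need to join time-series data with application tables (user / device metadata)</li>
<li>You do not need InfluxDB-class ingest rates (&lt; 100K rows/s)</li>
<li>The team is fluent in SQL and does not want to pay the learning cost of PromQL / Flux</li>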
</ul>
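<p>The join argument can be made concrete with a short sketch. It uses the <code>sensor_data</code> hypertable defined later in this article; the <code>sensors</code> metadata table and its <code>id</code> / <code>location</code> columns are hypothetical names introduced only for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical metadata table living in the same database
CREATE TABLE sensors (
    id       INTEGER PRIMARY KEY,
    location TEXT NOT NULL
);

-- 15-minute average temperature per location over the last day,
-- joining time-series rows with ordinary application metadata
SELECT s.location,
       time_bucket('15 minutes', d.time) AS bucket,
       avg(d.temperature)                AS avg_temp
FROM sensor_data d
JOIN sensors s ON s.id = d.sensor_id
WHERE d.time &gt;= now() - INTERVAL '1 day'
GROUP BY s.location, bucket
ORDER BY bucket, s.location;</code></pre></div>
<p>With a separate time-series store, the same question usually needs the metadata copied into tags or a second query stitched together in the application.</p>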
<p><strong>When to choose InfluxDB / Prometheus (instead of TimescaleDB)</strong>:</p>
<ul>
<li>High-cardinality metrics (10M+ unique series): purpose-built TSDB engines are designed around series cardinality and retention and can be more efficient than hypertables at that scale</li>
<li>The pull-based scrape model (Prometheus) is deeply integrated with the alerting / Grafana ecosystem</li>
<li>PromQL functions (<code>rate()</code> / <code>histogram_quantile()</code>) are more intuitive than SQL for metric queries</li>
<li>The TSL license is not acceptable (some TimescaleDB features are under the Timescale License, not pure Apache 2.0)</li>
<li>The operations team already knows InfluxDB / Prometheus and does not want to take on PG operations as well</li>
</ul>
<h2 id="hypertable自動-time-based-partitioning">Hypertable: Automatic Time-based Partitioning</h2>
<p>Turning a regular PG table into a hypertable:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">sensor_data</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">time</span><span class="w">        </span><span class="n">TIMESTAMPTZ</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">sensor_id</span><span class="w">   </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">temperature</span><span class="w"> </span><span class="n">DOUBLE</span><span class="w"> </span><span class="k">PRECISION</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">humidity</span><span class="w">    </span><span class="n">DOUBLE</span><span class="w"> </span><span class="k">PRECISION</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- Convert to a hypertable, partitioned automatically on time
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_hypertable</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;time&#39;</span><span class="p">);</span></span></span></code></pre></div><p>How a hypertable works:</p>
<ul>
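<li><em>Chunks</em> (child partitions) are created automatically in the background, one per time interval (default 7 days)</li>
<li>The application sees a single <code>sensor_data</code> table; the data actually lives in <code>_timescaledb_internal._hyper_*_chunk</code> tables</li>
<li>Queries get automatic chunk pruning (only chunks that overlap the queried time range are scanned)</li>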
</ul>
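<p>The chunk layout and the pruning behavior can be inspected directly. A sketch, assuming the <code>sensor_data</code> hypertable from the example above already holds some data:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- List the chunks behind the hypertable and the time range each one covers
SELECT chunk_schema, chunk_name, range_start, range_end, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'sensor_data'
ORDER BY range_start;

-- Inspect the plan of a time-bounded query; chunks outside the last day
-- should not be scanned
EXPLAIN
SELECT avg(temperature)
FROM sensor_data
WHERE time &gt;= now() - INTERVAL '1 day';</code></pre></div>
<p>Pruning depends on the query filtering on the partitioning column; a predicate that wraps <code>time</code> in an arbitrary expression may force every chunk to be scanned.</p>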
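<p>Choosing the <strong>chunk interval</strong> matters:</p>
<table>
  <thead>
      <tr>
          <th>Chunk interval</th>
          <th>Suited to</th>
          <th>Caveat</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 hour</td>
          <td>High-frequency metrics (100+ rows per second)</td>
          <td>Too many chunks, catalog bloat</td>
      </tr>
      <tr>
          <td>1 day</td>
          <td>Medium-high frequency (10-100 rows per second)</td>
          <td>OK</td>
      </tr>
      <tr>
          <td>7 days (default)</td>
          <td>Medium frequency (rows every minute)</td>
          <td>OK</td>
      </tr>
      <tr>
          <td>30 days</td>
          <td>Low frequency (rows every hour)</td>
          <td>OK</td>
      </tr>
  </tbody>
</table>
<p>Rule of thumb: pick the interval so that the chunks currently being written, indexes included, fit in about <em>25% of RAM</em>; beyond that, inserts and recent-data queries degrade into disk I/O. In production, track recent chunk sizes against available memory (for example <code>shared_buffers</code>) and adjust the interval when the ratio drifts.</p>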
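<p>One way to watch that ratio is to compare the size of the newest chunks with the memory settings. A sketch against the <code>sensor_data</code> hypertable; the limit of five chunks is an arbitrary choice for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Size of the five most recent chunks, table and indexes together
SELECT c.chunk_name,
       c.range_start,
       pg_size_pretty(s.total_bytes) AS total_size,
       pg_size_pretty(s.index_bytes) AS index_size
FROM timescaledb_information.chunks c
JOIN chunks_detailed_size('sensor_data') s
  ON s.chunk_schema = c.chunk_schema
 AND s.chunk_name   = c.chunk_name
WHERE c.hypertable_name = 'sensor_data'
ORDER BY c.range_start DESC
LIMIT 5;

-- Memory setting to compare against
SHOW shared_buffers;</code></pre></div>
<p>If the newest chunks are consistently much larger than the memory budget, a shorter interval is worth testing; if they are tiny, a longer interval keeps the chunk count down.</p>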
<p><strong>Multi-dimensional hypertable</strong> (time + space partitioning), used in place of the single-dimension call above:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Partition on two dimensions: time + sensor_id
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_hypertable</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;time&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">partitioning_column</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;sensor_id&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">number_partitions</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="mi">16</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>This suits IoT workloads with 1,000+ sensors: when a single chunk grows too large, space partitioning splits it further.</p>
<h2 id="continuous-aggregatecaggincremental-materialized-view">Continuous Aggregate (CAGG): Incremental Materialized View</h2>
<p>A plain PG materialized view is <em>recomputed in full</em>; a TimescaleDB CAGG is <em>refreshed incrementally</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Aggregate at 1-hour granularity
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">sensor_hourly</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">continuous</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">time_bucket</span><span class="p">(</span><span class="s1">&#39;1 hour&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">hour</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">sensor_id</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="k">avg</span><span class="p">(</span><span class="n">temperature</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">avg_temp</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="k">max</span><span class="p">(</span><span class="n">temperature</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">max_temp</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="k">min</span><span class="p">(</span><span class="n">temperature</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">min_temp</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">sample_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">sensor_data</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">hour</span><span class="p">,</span><span class="w"> </span><span class="n">sensor_id</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- Add a refresh policy: every 30 minutes, refresh the window from 1 day ago to 30 minutes ago
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_continuous_aggregate_policy</span><span class="p">(</span><span class="s1">&#39;sensor_hourly&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="n">start_offset</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">    </span><span class="n">end_offset</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;30 minutes&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">    </span><span class="n">schedule_interval</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;30 minutes&#39;</span><span class="w">
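</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>How a CAGG works:</p>
<ul>
<li>It tracks which time buckets are materialized and which are stale</li>
<li>A refresh recomputes only the stale buckets, not the whole view</li>
<li>With real-time aggregation enabled (<code>materialized_only = false</code>; off by default from TimescaleDB 2.13 on), a query against the CAGG also reads the source hypertable to fill in the newest, not-yet-materialized data</li>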
</ul>
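<p>Outside the policy schedule, a window can be refreshed by hand and the view queried like any other relation. A sketch using the <code>sensor_hourly</code> view defined above; the dates are placeholder values:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Manually refresh one week of buckets (for example after a backfill);
-- run it outside an explicit transaction block
CALL refresh_continuous_aggregate('sensor_hourly', '2026-01-01', '2026-01-08');

-- Read the aggregate like a normal table
SELECT hour, sensor_id, avg_temp, sample_count
FROM sensor_hourly
WHERE hour &gt;= now() - INTERVAL '6 hours'
ORDER BY hour DESC, sensor_id;</code></pre></div>
<p>Only buckets inside the given window that are marked stale are recomputed, so repeating the call over an already-fresh window is cheap.</p>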
<p><strong>CAGG vs. plain MV</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>TimescaleDB CAGG</th>
          <th>Plain PG MV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Refresh mode</td>
          <td>Incremental</td>
          <td>Full recompute</td>
      </tr>
      <tr>
          <td>Refresh time</td>
          <td>Seconds</td>
          <td>Tens of minutes on large tables</td>
      </tr>
      <tr>
          <td>Real-time fallback</td>
          <td>Built in (when real-time aggregation is enabled)</td>
          <td>Not supported; needs a manual UNION</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>One extra aggregated copy</td>
          <td>One extra aggregated copy</td>
      </tr>
      <tr>
          <td>Policy</td>
          <td>Built-in scheduler</td>
          <td>Needs pg_cron or an external scheduler</td>
      </tr>
  </tbody>
</table>
<p><strong>CAGG hierarchy</strong> (multi-level aggregation, available from TimescaleDB 2.9 on):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Roll the 1-hour CAGG up to 1 day
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">sensor_daily</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">continuous</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">time_bucket</span><span class="p">(</span><span class="s1">&#39;1 day&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">hour</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">day</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="n">sensor_id</span><span class="p">,</span><span class="w">
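</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">    </span><span class="k">sum</span><span class="p">(</span><span class="n">avg_temp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">sample_count</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="k">sum</span><span class="p">(</span><span class="n">sample_count</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">daily_avg</span><span class="w">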
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">sensor_hourly</span><span class="w">
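</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">day</span><span class="p">,</span><span class="w"> </span><span class="n">sensor_id</span><span class="p">;</span></span></span></code></pre></div><p>The application then queries the view whose granularity matches the time range (<code>sensor_hourly</code> for recent windows, <code>sensor_daily</code> for long ranges) instead of scanning raw data every time. TimescaleDB does not route a query to the right CAGG on its own; the choice is made in the application.</p>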
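<p>A sketch of picking the level by hand, using the two views defined above; the 24-hour and 90-day windows are arbitrary examples of a "recent" and a "long" range:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Recent window: hourly granularity
SELECT hour, sensor_id, avg_temp
FROM sensor_hourly
WHERE hour &gt;= now() - INTERVAL '24 hours'
ORDER BY hour, sensor_id;

-- Long range: daily granularity, far fewer rows to read
SELECT day, sensor_id, daily_avg
FROM sensor_daily
WHERE day &gt;= now() - INTERVAL '90 days'
ORDER BY day, sensor_id;</code></pre></div>
<p>A dashboard backend would typically switch between the two based on the requested time span.</p>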
<h2 id="compression把舊-chunk-壓-90">Compression: Shrinking Old Chunks by 90%+</h2>
<p>Compression can be enabled for old chunks:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Enable compression (segmentby is optional but strongly recommended)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">sensor_data</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">timescaledb</span><span class="p">.</span><span class="n">compress</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">timescaledb</span><span class="p">.</span><span class="n">compress_segmentby</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;sensor_id&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">timescaledb</span><span class="p">.</span><span class="n">compress_orderby</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;time DESC&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- Compression policy: compress chunks older than 7 days
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_compression_policy</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;7 days&#39;</span><span class="p">);</span></span></span></code></pre></div><p>How compression works:</p>
<ul>
<li>Rows within a chunk are grouped by <code>segmentby</code></li>
<li>Within each group, rows are sorted by <code>orderby</code>, then each column is turned into a <em>columnar array</em></li>
<li>Each array gets type-specific compression (Gorilla for floats / delta-of-delta for timestamps / dictionary for strings)</li>
</ul>
<p>Indicative space savings by workload:</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>Space saved</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>IoT sensor (many repeated values)</td>
          <td>95-98%</td>
      </tr>
      <tr>
          <td>Application metrics</td>
          <td>90-95%</td>
      </tr>
      <tr>
          <td>Trade tick (random floats)</td>
          <td>70-85%</td>
      </tr>
      <tr>
          <td>Log line (high-cardinality strings)</td>
          <td>50-70%</td>
      </tr>
  </tbody>
</table>
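<p>The figures above are ranges, so it is worth measuring the saving on real data. A sketch for the <code>sensor_data</code> hypertable, assuming at least one chunk has already been compressed:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Before/after size and percentage saved across all compressed chunks
SELECT pg_size_pretty(before_compression_total_bytes) AS size_before,
       pg_size_pretty(after_compression_total_bytes)  AS size_after,
       round(100 * (1 - after_compression_total_bytes::numeric
                        / nullif(before_compression_total_bytes, 0)), 1) AS pct_saved
FROM hypertable_compression_stats('sensor_data');</code></pre></div>
<p>A low percentage is often a hint that the <code>segmentby</code> / <code>orderby</code> choice does not match how the data repeats.</p>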
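<p><strong>Compression limitations</strong> (important):</p>
<ul>
<li>On TimescaleDB releases before 2.11, a compressed chunk <strong>rejects single-row UPDATE / DELETE</strong>; the chunk has to be decompressed first. From 2.11 on the statements are accepted, but the affected data is decompressed behind the scenes, so they cost far more than on uncompressed chunks</li>
<li><strong>Adding a column</strong> to a compressed hypertable is restricted: depending on the version, either only limited forms of <code>ADD COLUMN</code> are accepted or the chunks must be decompressed first</li>
<li>Compressed chunks are built for <em>appending new rows</em>, not for rewriting old ones</li>
<li>Other DDL changes (such as index changes) may require decompression; check the release notes of the version in use</li>
</ul>
<p>In practice, compression is a tool for <em>write-once cold data</em>; leave actively written chunks uncompressed.</p>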
<h2 id="retention-policy自動刪舊資料">Retention Policy: Dropping Old Data Automatically</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Automatically drop chunks older than 1 year
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_retention_policy</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;1 year&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Retention drops whole chunks instead of running DELETE on rows, so the cost does not grow with the row count and no bloat is left behind.</p>
<p>A CAGG has its own, independent retention:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Keep raw data for 30 days, aggregates for 5 years
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_retention_policy</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;30 days&#39;</span><span class="p">);</span><span class="w">
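</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">add_retention_policy</span><span class="p">(</span><span class="s1">&#39;sensor_hourly&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;5 years&#39;</span><span class="p">);</span></span></span></code></pre></div><p>This is the biggest practical difference from plain PG partitioning: plain PG needs a hand-written cron job (or a tool such as pg_partman) to drop partitions, while TimescaleDB ships the policy built in. Keep the CAGG refresh window (<code>start_offset</code>) shorter than the raw-data retention; otherwise a refresh over raw data that has already been dropped removes the matching aggregates as well.</p>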
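<p>The policies created so far are ordinary background jobs and can be listed in one place. A sketch; the one-off <code>drop_chunks</code> call at the end is destructive and is shown only as the manual counterpart of the retention policy:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Retention, compression, and CAGG refresh policies with their schedules
SELECT job_id, proc_name, hypertable_name, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name IN ('policy_retention',
                    'policy_compression',
                    'policy_refresh_continuous_aggregate')
ORDER BY job_id;

-- Manual, one-off equivalent of the retention policy (drops data!)
SELECT drop_chunks('sensor_data', older_than =&gt; INTERVAL '1 year');</code></pre></div>
<p>For a CAGG policy, <code>hypertable_name</code> shows the internal materialization hypertable rather than the view name.</p>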
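<h2 id="5-個-production-踩雷">5 Production Pitfalls</h2>
<h3 id="case-1chunk-size-不對catalog-膨脹">Case 1: Wrong chunk size, catalog bloat</h3>
<p><strong>Scenario</strong>: sensors write 10 rows per second with chunk_interval set to 1 hour, which yields 24 × 365 = 8,760 chunks per year. Each chunk also brings its own index and TOAST relations, <code>pg_class</code> swells to 2 million rows, and the planner slows down.</p>
<p>Fix:</p>
<ul>
<li>Treat roughly 10,000 chunks as a ceiling; beyond that, catalog overhead becomes noticeable</li>
<li>Reset chunk_interval: <code>SELECT set_chunk_time_interval('sensor_data', INTERVAL '1 day');</code> (this applies to newly created chunks only)</li>
<li>Existing chunks are not merged automatically; they age out as retention drops them</li>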
</ul>
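<p>Whether a database is heading toward this problem can be checked from the informational views. A sketch that counts chunks per hypertable and shows the interval currently in force:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Chunk count per hypertable, largest first
SELECT hypertable_name, count(*) AS chunk_count
FROM timescaledb_information.chunks
GROUP BY hypertable_name
ORDER BY chunk_count DESC;

-- Current chunk interval of each time dimension
SELECT hypertable_name, column_name, time_interval
FROM timescaledb_information.dimensions
WHERE dimension_type = 'Time';

-- Overall catalog size, the number the planner actually has to cope with
SELECT count(*) AS pg_class_rows FROM pg_class;</code></pre></div>
<p>Tracking these three numbers over time gives an early warning long before planning latency becomes visible.</p>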
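<h3 id="case-2cagg-refresh-落後-real-time">Case 2: CAGG refresh lags behind real time</h3>
<p><strong>Scenario</strong>: the CAGG refresh policy runs every hour, the application expects a "live dashboard", and the numbers it shows are up to an hour behind.</p>
<p>Fix:</p>
<ul>
<li>Shorten <code>schedule_interval</code> (for example to 5 minutes)</li>
<li>Use <code>real-time aggregation</code>, where the CAGG unions in raw data that has not been materialized yet. It was on by default before TimescaleDB 2.13 and is off by default from 2.13 on</li>
<li>Confirm <code>materialized_only = false</code>, the setting that turns real-time aggregation on</li>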
</ul>
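<p>Before changing anything, the current setting and the health of the refresh job can be read from the informational views. A sketch for the <code>sensor_hourly</code> view:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Is the view materialized-only, or does it include real-time data?
SELECT view_name, materialized_only
FROM timescaledb_information.continuous_aggregates
WHERE view_name = 'sensor_hourly';

-- When did the refresh jobs last succeed, and when do they run next?
SELECT j.job_id,
       j.schedule_interval,
       s.last_run_status,
       s.last_successful_finish,
       s.next_start
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate';</code></pre></div>
<p>If <code>materialized_only</code> comes back as true, the following statement switches real-time aggregation on:</p>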





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="n">MATERIALIZED</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="n">sensor_hourly</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="n">timescaledb</span><span class="p">.</span><span class="n">materialized_only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">false</span><span class="p">);</span></span></span></code></pre></div><h3 id="case-3compression-後想-update">Case 3: UPDATE after compression</h3>
<p><strong>Scenario</strong>: a historical row turns out to hold a wrong value, and the UPDATE fails with an error like <em>cannot update/delete from compressed chunk</em> (the behavior on releases before TimescaleDB 2.11).</p>
<p>Fix:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Find the affected chunk and decompress it
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">decompress_chunk</span><span class="p">(</span><span class="k">c</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">show_chunks</span><span class="p">(</span><span class="s1">&#39;sensor_data&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">older_than</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="nb">INTERVAL</span><span class="w"> </span><span class="s1">&#39;7 days&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">c</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">c</span><span class="p">::</span><span class="nb">text</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;%_5_chunk&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- Run the UPDATE, then compress the chunk again
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">sensor_data</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">temperature</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">22</span><span class="p">.</span><span class="mi">5</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">compress_chunk</span><span class="p">(...);</span></span></span></code></pre></div><p>Alternatively, avoid the problem at design time: compress only <em>immutable data</em> and leave chunks that may still be modified uncompressed.</p>
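<h3 id="case-4hypertable-不能加-fk-到-non-hypertable">Case 4: Foreign keys involving a hypertable work in only some directions</h3>
<p><strong>Scenario</strong>: you try to add a foreign key between the <code>sensor_data</code> hypertable and the <code>sensors</code> table, and it fails with an error such as <em>foreign key constraints with hypertables are not supported</em>.</p>
<p>Fixes:</p>
<ul>
<li>Maintain referential integrity in the application layer</li>
<li>Check the direction: a foreign key declared on the hypertable that references a regular table (<code>sensor_data</code> → <code>sensors</code>) is the supported direction; a regular table referencing a hypertable is what older releases reject</li>
<li>TimescaleDB 2.16+ adds foreign keys from regular tables to hypertables, with restrictions; hypertable-to-hypertable foreign keys remain unsupported. Verify against the release notes for the version you run</li>
</ul>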
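<p>When referential integrity is maintained in the application layer, a periodic orphan check helps catch drift. The query below is a minimal sketch, not the author's method: the columns <code>sensor_data.time</code>, <code>sensor_data.sensor_id</code> and <code>sensors.id</code> are assumptions made for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical schema: sensor_data(time, sensor_id, temperature), sensors(id).
-- List sensor_id values in the last day of data that have no row in sensors.
SELECT sd.sensor_id, count(*) AS orphan_rows
FROM sensor_data AS sd
LEFT JOIN sensors AS s ON s.id = sd.sensor_id
WHERE sd.time &gt; now() - INTERVAL '1 day'
  AND s.id IS NULL
GROUP BY sd.sensor_id
ORDER BY orphan_rows DESC;
</code></pre></div>
<p>Restricting the check to a recent time window keeps it on a small number of chunks instead of scanning the whole hypertable.</p>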
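<h3 id="case-5timescaledb-跟-pg-主版本對齊">Case 5: Keeping TimescaleDB aligned with the PG major version</h3>
<p><strong>Scenario</strong>: PG is upgraded from 14 to 16, the TimescaleDB extension is not upgraded to a matching release, and PG fails to start.</p>
<p>TimescaleDB and PG version compatibility matrix:</p>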
<table>
  <thead>
      <tr>
          <th>TimescaleDB</th>
          <th>Supported PG versions</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2.11+</td>
          <td>13, 14, 15</td>
          <td></td>
      </tr>
      <tr>
          <td>2.13+</td>
          <td>13, 14, 15, 16</td>
          <td>Adds PG 16 support</td>
      </tr>
      <tr>
          <td>2.15.x</td>
          <td>13, 14, 15, 16</td>
          <td>Last minor series that supports PG 13</td>
      </tr>
      <tr>
          <td>2.16+</td>
          <td>14, 15, 16</td>
          <td>Drops PG 13</td>
      </tr>
      <tr>
          <td>2.17+</td>
          <td>14, 15, 16, 17</td>
          <td>Adds PG 17 (requires binaries aligned with PG 17.2+)</td>
      </tr>
      <tr>
          <td>2.18+</td>
          <td>14, 15, 16, 17</td>
          <td>Full PG 17 support</td>
      </tr>
      <tr>
          <td>2.23+</td>
          <td>14, 15, 16, 17, 18</td>
          <td>Adds PG 18</td>
      </tr>
  </tbody>
</table>
<p>Fixes:</p>
<ul>
<li>Before upgrading PG, upgrade the TimescaleDB extension to a release that supports the target PG version</li>
<li>Production upgrade order: TimescaleDB minor upgrade → PG major upgrade → final TimescaleDB upgrade</li>
<li>Managed cloud (Timescale Cloud) handles this automatically</li>
</ul>
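<p>Before and after each step of that order, it helps to confirm which versions are actually in play. A minimal sketch using standard PG catalog views; it only reads state, except for the final statement, which performs the extension update:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- PG server version currently running.
SHOW server_version;

-- TimescaleDB version installed in this database.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'timescaledb';

-- Version shipped by the packages on disk, i.e. what an update would move to.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'timescaledb';

-- Update the extension; run this as the first command in a fresh session.
ALTER EXTENSION timescaledb UPDATE;
</code></pre></div>
<p>Comparing <code>installed_version</code> with <code>default_version</code> shows whether the new binaries are installed but the database has not yet been updated to them.</p>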
<h2 id="跟-pg-原生-partitioning-對比">Comparison with native PG partitioning</h2>
<p>PG 10+ ships declarative partitioning, so TimescaleDB is not always required:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>TimescaleDB hypertable</th>
          <th>PG declarative partitioning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Automatic chunk creation</td>
          <td>Yes</td>
          <td>No (manual or pg_partman)</td>
      </tr>
      <tr>
          <td>Chunk pruning</td>
          <td>Automatic</td>
          <td>Automatic (the query must filter on the partition key)</td>
      </tr>
      <tr>
          <td>Built-in retention</td>
          <td>Yes</td>
          <td>No (pg_partman or a custom cron job)</td>
      </tr>
      <tr>
          <td>Compression</td>
          <td>Built-in columnar</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Continuous aggregate</td>
          <td>Built-in</td>
          <td>No (hand-written incremental refresh)</td>
      </tr>
      <tr>
          <td>Cross-chunk indexes</td>
          <td>Managed as one</td>
          <td>Per-partition index</td>
      </tr>
      <tr>
          <td>Cardinality limit</td>
          <td>10000+ chunks OK</td>
          <td>Slows down past 1000+ partitions</td>
      </tr>
  </tbody>
</table>
<p>When to use native partitioning (without TimescaleDB):</p>
<ul>
<li>No need for compression / CAGG</li>
<li>Partition count &lt; 1000</li>
<li>Already on pg_partman and unwilling to switch</li>
<li>Company policy prohibits the TSL license (which gates some TimescaleDB features)</li>
</ul>
<p>When to use TimescaleDB:</p>
<ul>
<li>High-frequency time-series (compression is a must)</li>
<li>CAGG is needed (hand-written incremental MVs are costly)</li>
<li>Partition count &gt; 1000</li>
<li>IoT / metrics / observability workloads</li>
</ul>
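<p>The difference in day-to-day handling is easiest to see side by side. The sketch below uses hypothetical tables (<code>metrics_native</code>, <code>metrics_ts</code>) that do not appear elsewhere in this article; the one-day chunk interval is an arbitrary illustration, not a recommendation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Native declarative partitioning: each partition is created explicitly
-- (or by a tool such as pg_partman).
CREATE TABLE metrics_native (
    ts        timestamptz NOT NULL,
    device_id integer     NOT NULL,
    value     double precision
) PARTITION BY RANGE (ts);

CREATE TABLE metrics_native_2026_05 PARTITION OF metrics_native
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

-- TimescaleDB hypertable: chunks are created automatically as rows arrive.
CREATE TABLE metrics_ts (
    ts        timestamptz NOT NULL,
    device_id integer     NOT NULL,
    value     double precision
);

SELECT create_hypertable('metrics_ts', 'ts',
                         chunk_time_interval =&gt; INTERVAL '1 day');
</code></pre></div>
<p>With the native table, an insert for a month that has no partition fails until that partition exists, which is the gap pg_partman or a scheduled job has to cover.</p>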
<p>For the partitioning mechanics in detail, see <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">declarative-partitioning</a>.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>: the full PG extension landscape</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">declarative-partitioning</a>: native partitioning</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>: storing IoT payloads as JSONB</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum-tuning</a>: autovacuum behavior on hypertables</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">major-version-upgrade</a>: TimescaleDB + PG upgrade order</li>
</ul>
<h2 id="下一步">Next steps</h2>
<ul>
<li>Read <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> for other PG extension options</li>
<li>Go back to the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> for the full picture</li>
</ul>
]]></content:encoded></item><item><title>pgvector Deep Dive: HNSW / IVFFlat trade-offs and comparison with dedicated vector DBs</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgvector-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgvector-deep-dive/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep article under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on the &lt;em>pgvector extension&lt;/em>, the path for handling vector search workloads with PG and the most closely watched extension in &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem&lt;/a>.&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="pgvector-是-pg-變-vector-db-的最短路徑">pgvector is the shortest path to turning PG into a vector DB&lt;/h2>
&lt;p>pgvector adds two things, a &lt;code>vector&lt;/code> column type and distance operators:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">vector&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Add a vector column (the dimension must be fixed up front)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">content&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">TEXT&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">vector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1536&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- OpenAI ada-002 dimension
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- Three distance operators
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;lt;-&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;[0.1, 0.2, ...]&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LIMIT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- L2
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;lt;#&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;[0.1, 0.2, ...]&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LIMIT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- inner product
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">documents&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">embedding&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;lt;=&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;[0.1, 0.2, ...]&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">LIMIT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- cosine&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Operator mapping:&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep article under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already covers where PG sits in the OLTP landscape; this article focuses on the <em>pgvector extension</em>, the path for handling vector search workloads with PG and the most closely watched extension in <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>.</p></blockquote>
<hr>
<h2 id="pgvector-是-pg-變-vector-db-的最短路徑">pgvector is the shortest path to turning PG into a vector DB</h2>
<p>pgvector adds two things, a <code>vector</code> column type and distance operators:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">vector</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="c1">-- Add a vector column (the dimension must be fixed up front)
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">content</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">embedding</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span><span class="w">  </span><span class="c1">-- OpenAI ada-002 dimension
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Three distance operators
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">  </span><span class="c1">-- L2
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;#&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">  </span><span class="c1">-- inner product
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">  </span><span class="c1">-- cosine</span></span></span></code></pre></div><p>Operator mapping:</p>
<table>
  <thead>
      <tr>
          <th>Operator</th>
          <th>Meaning</th>
          <th>Use when</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>&lt;-&gt;</code></td>
          <td>L2 distance</td>
          <td>General purpose, spatial distance</td>
      </tr>
      <tr>
          <td><code>&lt;#&gt;</code></td>
          <td>Negative inner product</td>
          <td>Normalized vectors, where it ranks the same as cosine</td>
      </tr>
      <tr>
          <td><code>&lt;=&gt;</code></td>
          <td>Cosine distance</td>
          <td>Most common for comparing embeddings</td>
      </tr>
  </tbody>
</table>
<p>For OpenAI / Cohere / sentence-transformers embeddings, <code>&lt;=&gt;</code> (cosine) is the usual choice, because these embedding models are typically trained with a cosine-similarity objective.</p>
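<p>A small worked example makes the relationship between the three operators concrete. The two vectors below are made up for illustration and are both unit-length, which is the case where negative inner product and cosine distance rank results identically. The values in the comments are approximate because pgvector stores single-precision floats.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Both inputs are unit-length, so for them &lt;=&gt; equals 1 + (&lt;#&gt;).
SELECT
    '[0.6, 0.8]'::vector &lt;-&gt; '[1, 0]'::vector AS l2_distance,       -- about 0.894
    '[0.6, 0.8]'::vector &lt;#&gt; '[1, 0]'::vector AS neg_inner_product, -- about -0.6
    '[0.6, 0.8]'::vector &lt;=&gt; '[1, 0]'::vector AS cosine_distance;   -- about 0.4
</code></pre></div>
<p>Because <code>&lt;#&gt;</code> returns the negated inner product, ascending order still puts the most similar vectors first.</p>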
<h2 id="ann-index-是-vector-search-的核心">ANN indexes are the core of vector search</h2>
<p>Without an index, <code>ORDER BY embedding &lt;=&gt; ?</code> is a <em>full scan</em>:</p>
<ul>
<li>100K rows at 1536 dims: roughly 2-5s per query (unusable)</li>
<li>1M rows: queries simply time out</li>
</ul>
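<p>Whether a given query is doing that full scan can be read from the plan. A minimal sketch: it reuses the <code>documents</code> table from above and, as an assumption, borrows the embedding of the row with <code>id = 1</code> as the query vector so the statement runs without a client-side parameter.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- "Seq Scan" followed by "Sort" in the plan means an exact full scan;
-- "Index Scan using ..." means an ANN index is being used.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM documents
ORDER BY embedding &lt;=&gt; (SELECT embedding FROM documents WHERE id = 1)
LIMIT 10;
</code></pre></div>
<p>Running the same statement before and after creating an index gives a like-for-like latency comparison on your own data, which is more reliable than any generic figure.</p>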
<p>pgvector provides two <em>Approximate Nearest Neighbor</em> (ANN) index types:</p>
<table>
  <thead>
      <tr>
          <th>Index</th>
          <th>Build time</th>
          <th>Query time</th>
          <th>Recall@10</th>
          <th>Memory cost</th>
          <th>Update behavior</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>IVFFlat</td>
          <td>Fast (minutes)</td>
          <td>Medium (10-100ms)</td>
          <td>90-95%</td>
          <td>Medium (depends on the number of lists)</td>
          <td>Inserts OK; needs rebuilds to keep recall</td>
      </tr>
      <tr>
          <td>HNSW</td>
          <td>Slow (hours)</td>
          <td>Fast (1-10ms)</td>
          <td>95-99%</td>
          <td>High (2-4x the data)</td>
          <td>Inserts OK; graph is maintained incrementally</td>
      </tr>
  </tbody>
</table>
<p><strong>When to choose IVFFlat</strong>:</p>
<ul>
<li>Embedding count &lt; 1M</li>
<li>Build time matters (CI / batch environments)</li>
<li>Memory is tight</li>
<li>Periodic rebuild cost is acceptable (monthly / quarterly)</li>
</ul>
<p><strong>When to choose HNSW</strong>:</p>
<ul>
<li>Embedding count 1M-100M</li>
<li>Query latency must stay &lt; 50ms</li>
<li>Plenty of memory</li>
<li>Insert volume is steady (no explosive growth)</li>
</ul>
<h2 id="ivfflat分-cluster-找鄰居">IVFFlat: finding neighbors by cluster</h2>
<p>How IVFFlat works:</p>
<ol>
<li><strong>Build</strong>: run k-means to split all vectors into <code>lists</code> clusters</li>
<li><strong>Query</strong>: find the nearest <code>probes</code> clusters first, then search for nearest neighbors inside those clusters</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Build (the number of lists matters)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">ivfflat</span><span class="w"> </span><span class="p">(</span><span class="n">embedding</span><span class="w"> </span><span class="n">vector_cosine_ops</span><span class="p">)</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">lists</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- At query time, tune probes to trade recall against latency
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">ivfflat</span><span class="p">.</span><span class="n">probes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p><strong>Sizing rules for lists and probes</strong> (pgvector's official recommendations):</p>
<table>
  <thead>
      <tr>
          <th>Row count</th>
          <th>Recommended lists</th>
          <th>Recommended probes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>&lt; 1M</td>
          <td><code>rows / 1000</code></td>
          <td><code>sqrt(lists)</code></td>
      </tr>
      <tr>
          <td>&gt; 1M</td>
          <td><code>sqrt(rows)</code></td>
          <td><code>sqrt(lists)</code></td>
      </tr>
  </tbody>
</table>
<p>In practice: 100K rows → lists=100 / probes=10; 1M rows → lists=1000 / probes=32 (<code>sqrt(1000)</code> ≈ 31.6, rounded).</p>
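<p>The sizing rules can be turned into a query that derives starting values from the current row count. This is a sketch of the arithmetic only; the thresholds come from the table above, and the results are starting points to benchmark rather than fixed answers.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">WITH n AS (
    SELECT count(*) AS row_count FROM documents
),
sized AS (
    SELECT row_count,
           CASE
               -- Up to 1M rows: rows / 1000, with a floor of 1 list.
               WHEN row_count &lt;= 1000000 THEN greatest(row_count / 1000, 1)
               -- Over 1M rows: sqrt(rows).
               ELSE round(sqrt(row_count::double precision))::bigint
           END AS lists
    FROM n
)
SELECT row_count,
       lists,
       round(sqrt(lists::double precision))::int AS probes
FROM sized;
</code></pre></div>
<p>On a very large table, <code>count(*)</code> is itself a full scan, so an estimate such as <code>pg_class.reltuples</code> may be a cheaper input.</p>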
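<p><strong>IVFFlat recall drift</strong>: cluster centers are fixed at build time. Newly inserted vectors are assigned to the nearest existing cluster, but as the data distribution shifts the centers may stop being representative, and recall declines over time.</p>
<p>Fix: run <code>REINDEX INDEX CONCURRENTLY ...</code> periodically (monthly, or after every 100K new rows).</p>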
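<p>A rebuild along those lines might look like the sketch below. The index name is an assumption: <code>documents_embedding_idx</code> is what PG assigns by default when <code>CREATE INDEX</code> is run without a name on <code>documents (embedding)</code>, so check <code>\d documents</code> for the real one.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Rebuild the index, re-running k-means on the current data,
-- without blocking reads or writes on the table.
REINDEX INDEX CONCURRENTLY documents_embedding_idx;

-- From another session: watch the phase and progress of the rebuild.
SELECT phase, tuples_done, tuples_total
FROM pg_stat_progress_create_index;
</code></pre></div>
<p><code>REINDEX ... CONCURRENTLY</code> requires PG 12+ and temporarily needs disk space for both the old and the new copy of the index.</p>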
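<h2 id="hnswmulti-level-graph-找鄰居">HNSW: finding neighbors through a multi-level graph</h2>
<p>How HNSW (Hierarchical Navigable Small World) works:</p>
<ol>
<li>A multi-layer graph: sparse upper layers, dense lower layers</li>
<li>A query starts from an entry point in the top layer, moves toward nearer neighbors layer by layer, and finishes with a fine-grained search in the bottom layer</li>
<li>Inserts maintain the graph incrementally, so no rebuild is needed</li>
</ol>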





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Build (two key parameters)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">hnsw</span><span class="w"> </span><span class="p">(</span><span class="n">embedding</span><span class="w"> </span><span class="n">vector_cosine_ops</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">16</span><span class="p">,</span><span class="w"> </span><span class="n">ef_construction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">64</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- At query time, tune ef_search
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">hnsw</span><span class="p">.</span><span class="n">ef_search</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p><strong>What the parameters mean</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Meaning</th>
          <th>Default</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>m</code></td>
          <td>Max number of connections per node in each layer</td>
          <td>16</td>
          <td>Larger → higher recall, more memory</td>
      </tr>
      <tr>
          <td><code>ef_construction</code></td>
          <td>Size of the candidate list while building the graph (graph quality)</td>
          <td>64</td>
          <td>Larger → slower build, better graph</td>
      </tr>
      <tr>
          <td><code>ef_search</code></td>
          <td>Size of the candidate list while searching</td>
          <td>40</td>
          <td>Larger → higher recall, higher latency</td>
      </tr>
  </tbody>
</table>
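<p>Since <code>ef_search</code> is a session setting, it can be raised for one query without affecting others on the same connection. A minimal sketch, again borrowing the embedding of the row with <code>id = 1</code> as a stand-in query vector (an assumption for illustration):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;
-- SET LOCAL applies only until the end of this transaction.
SET LOCAL hnsw.ef_search = 100;

SELECT id
FROM documents
ORDER BY embedding &lt;=&gt; (SELECT embedding FROM documents WHERE id = 1)
LIMIT 10;
COMMIT;
</code></pre></div>
<p>This matters with connection poolers, where a plain <code>SET</code> would leak the value to whichever client reuses the connection next.</p>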
<p><strong>Realistic build cost</strong> (1M vectors × 1536 dims):</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Build time</th>
          <th>Memory</th>
          <th>Recall@10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>m=8, ef_construction=32</td>
          <td>30 min</td>
          <td>4GB</td>
          <td>92%</td>
      </tr>
      <tr>
          <td>m=16, ef_construction=64</td>
          <td>2 hour</td>
          <td>8GB</td>
          <td>96%</td>
      </tr>
      <tr>
          <td>m=32, ef_construction=200</td>
          <td>8 hour</td>
          <td>16GB</td>
          <td>98%</td>
      </tr>
  </tbody>
</table>
<p>Most production setups pick the middle configuration, <code>m=16, ef_construction=64</code>, as the balance between recall and cost.</p>
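<p>Build time depends heavily on two session settings as well as on the index parameters. The sketch below shows the middle configuration with those settings; the memory size, worker count and index name are placeholders to adapt to the machine, not recommendations.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Let the graph fit in memory during the build; size this to the machine.
SET maintenance_work_mem = '8GB';
-- Extra parallel workers for the build (pgvector 0.6+).
SET max_parallel_maintenance_workers = 7;

-- CONCURRENTLY avoids blocking writes during a long build.
-- documents_embedding_hnsw_idx is a hypothetical name.
CREATE INDEX CONCURRENTLY documents_embedding_hnsw_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
</code></pre></div>
<p>Loading the data first and creating the index afterwards is generally faster than inserting into an existing HNSW index.</p>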
<h2 id="hybrid-searchvector--filter-一起">Hybrid search: vector plus filter together</h2>
<p>Combining vector search with SQL filters is a scenario where pgvector is stronger than dedicated vector DBs:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Vector + metadata filter
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">documents</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;tech&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">embedding</span><span class="w"> </span><span class="o">&lt;=&gt;</span><span class="w"> </span><span class="s1">&#39;[0.1, 0.2, ...]&#39;</span><span class="w">
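</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>There is a <em>pgvector pitfall</em> here, though: the planner can combine the filter with the ANN index in two ways:</p>
<ol>
<li><strong>Pre-filter</strong>: select the rows that match the filter first, then order that subset by exact vector distance. The ANN index is not used, so this is slow when the subset is large.</li>
<li><strong>Post-filter</strong>: take the nearest candidates from the ANN index, then apply the filter. If too few candidates survive, the query returns fewer rows than the <code>LIMIT</code>.</li>
</ol>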
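<p>pgvector 0.8.0 (released 2024-10) added <em>iterative index scans</em>: when a filtered HNSW / IVFFlat scan yields too few rows, it keeps scanning the index until the <code>LIMIT</code> is satisfied or a configurable cap is reached. This fixes the under-filled results of plain post-filtering and, in favourable cases, runs 5-10x faster than pre-filtering. Earlier releases supplied the other building blocks used below: 0.6.0 (2024-01) added parallel HNSW builds, and 0.7.0 (2024-04) added <code>halfvec</code> and binary quantization.</p>
<p>In practice: when the filter is highly selective (matches fewer than about 10% of rows), index the filter column and let the planner pre-filter; when it is weakly selective (matches more than about 50%), use the ANN index with iterative scan. In between, compare both plans with <code>EXPLAIN ANALYZE</code>.</p>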
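<p>A minimal sketch of the iterative-scan route, assuming pgvector 0.8+ and a <code>documents</code> table with an <code>id</code> column (hypothetical) and an HNSW index on <code>embedding</code>. <code>hnsw.iterative_scan</code> and <code>hnsw.max_scan_tuples</code> are pgvector settings; the values here are illustrative, not tuned recommendations, and the query vector is elided as in the query above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Assumes pgvector 0.8+; iterative scan is off by default
BEGIN;
-- strict_order keeps results in exact distance order;
-- relaxed_order allows slight reordering in exchange for better recall
SET LOCAL hnsw.iterative_scan = strict_order;
-- Cap on tuples visited before the scan stops (illustrative value)
SET LOCAL hnsw.max_scan_tuples = 20000;

SELECT id
FROM documents
WHERE category = &#39;tech&#39;
ORDER BY embedding &lt;=&gt; &#39;[0.1, 0.2, ...]&#39;
LIMIT 10;
COMMIT;</code></pre></div>
<p><code>SET LOCAL</code> scopes the settings to one transaction, so other queries on the same connection keep their default behaviour.</p>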
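<h2 id="quantization-跟-dimension-reduction">Quantization and Dimension Reduction</h2>
<p>A 1536-dim float32 vector takes about 6 KB per row, so 1M rows is about 6 GB of raw vectors, and roughly 20 GB in total once an HNSW index is added. When memory is tight, these are the ways to shrink it:</p>
<h3 id="half-precisionpgvector-07">Half-precision (pgvector 0.7+)</h3>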





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">embedding</span><span class="w"> </span><span class="n">halfvec</span><span class="p">(</span><span class="mi">1536</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p><code>halfvec</code> stores each dimension as float16 (2 bytes instead of 4), which halves storage; the recall loss is typically under 1%.</p>
<h3 id="binary-quantization">Binary quantization</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Compress each dimension to 1 bit via an expression index; the column itself stays full precision
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">hnsw</span><span class="w"> </span><span class="p">((</span><span class="n">binary_quantize</span><span class="p">(</span><span class="n">embedding</span><span class="p">)::</span><span class="nb">bit</span><span class="p">(</span><span class="mi">1536</span><span class="p">))</span><span class="w"> </span><span class="n">bit_hamming_ops</span><span class="p">);</span></span></span></code></pre></div><p>Recall drops noticeably (to roughly 85-90%), but the indexed representation is 1/32 the size of float32, which suits a "coarse filter, then rerank" hybrid pipeline.</p>
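<p>The "coarse filter, then rerank" pipeline can be sketched as a two-stage query. This assumes a 1536-dim <code>embedding</code> column, an <code>id</code> column (hypothetical), and an expression index on <code>binary_quantize(embedding)::bit(1536)</code>; the candidate count of 100 is an illustrative choice, and the query vector is elided.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Stage 1 (inner): coarse candidates by Hamming distance on the 1-bit codes
-- Stage 2 (outer): rerank those candidates by full-precision cosine distance
SELECT id
FROM (
    SELECT id, embedding
    FROM documents
    ORDER BY binary_quantize(embedding)::bit(1536)
             &lt;~&gt; binary_quantize(&#39;[0.1, 0.2, ...]&#39;::vector(1536))
    LIMIT 100
) AS candidates
ORDER BY embedding &lt;=&gt; &#39;[0.1, 0.2, ...]&#39;
LIMIT 10;</code></pre></div>
<p>Fetching more candidates in the inner query raises recall at the cost of more full-precision distance computations in the outer one.</p>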
<h3 id="dimension-reduction">Dimension reduction</h3>
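<p>Fit a PCA projection, or use a Matryoshka-trained embedding model, to cut 1536 dims down to 256-512 dims; recall typically drops by less than 3% and storage shrinks to between 1/3 and 1/6.</p>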
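<p>For Matryoshka-style embeddings, one possible approach is to index only a prefix of the stored vector with pgvector's <code>subvector</code> function (0.7+). The sketch assumes <code>embedding</code> is <code>vector(1536)</code>, that the model was trained so a 256-dim prefix is meaningful on its own, and that an <code>id</code> column exists; none of this comes from the original schema.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Index only the first 256 dims; the full vector stays in the table
CREATE INDEX ON documents
USING hnsw ((subvector(embedding, 1, 256)::vector(256)) vector_cosine_ops);

-- The ORDER BY expression must match the indexed expression
SELECT id
FROM documents
ORDER BY subvector(embedding, 1, 256)::vector(256)
         &lt;=&gt; subvector(&#39;[0.1, 0.2, ...]&#39;::vector, 1, 256)
LIMIT 10;</code></pre></div>
<p>Cosine distance is used here because it is insensitive to the change in vector length that truncation causes; with L2 or inner product the truncated vectors would need re-normalizing first.</p>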
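<h2 id="5-個-production-踩雷">5 Production Pitfalls</h2>
<h3 id="case-1dimension-超-2000-限制">Case 1: Dimension exceeds the 2,000 index limit</h3>
<p><strong>Scenario</strong>: you want to use OpenAI text-embedding-3-large (3072 dim). <code>CREATE TABLE ... embedding vector(3072)</code> succeeds, but building an HNSW or IVFFlat index on the column fails.</p>
<p>The pgvector <code>vector</code> type can store up to 16,000 dims, but HNSW and IVFFlat can only index <code>vector</code> values of up to 2,000 dims.</p>
<p>Fixes:</p>
<ul>
<li>Index as <code>halfvec</code> (pgvector 0.7+, indexable up to 4,000 dims)</li>
<li>Truncate a Matryoshka embedding to 2,000 dims or fewer</li>
<li>Switch embedding model (OpenAI text-embedding-3-small is 1536 dim and can be shortened to 256-1024)</li>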
</ul>
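<p>The <code>halfvec</code> route does not require changing the column type: a hedged sketch is to keep the full-precision column and index a half-precision cast of it. The table name <code>documents_large</code> is hypothetical; the cast-in-index pattern assumes pgvector 0.7+.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Storing 3072 dims is fine; only indexing vector(3072) directly is not
CREATE TABLE documents_large (
    id bigserial PRIMARY KEY,
    embedding vector(3072)
);

-- Index a float16 cast, which is indexable up to 4,000 dims
CREATE INDEX ON documents_large
USING hnsw ((embedding::halfvec(3072)) halfvec_cosine_ops);

-- Queries must use the same cast expression to hit the index
SELECT id
FROM documents_large
ORDER BY embedding::halfvec(3072) &lt;=&gt; &#39;[0.1, 0.2, ...]&#39;
LIMIT 10;</code></pre></div>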
<h3 id="case-2hnsw-build-太慢">Case 2: HNSW build is too slow</h3>
<p><strong>Scenario</strong>: building HNSW over 1M rows takes 8 hours and blocks production writes.</p>
<p>Fixes:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Set these before the build: give the graph room in memory
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">maintenance_work_mem</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;8GB&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Enable parallel build workers
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="n">max_parallel_maintenance_workers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">7</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- CONCURRENTLY keeps writes unblocked during the build
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">CONCURRENTLY</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">documents</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">hnsw</span><span class="w"> </span><span class="p">(...);</span></span></span></code></pre></div><p>If it is still too slow, consider:</p>
<ul>
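<li>Split the load into batches, inserting and indexing each batch in turn (suits read-heavy workloads)</li>
<li>Ship with IVFFlat in the short term, then switch to HNSW later</li>
<li>Move to a cloud-managed pgvector offering (which can provide larger instances)</li>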
</ul>
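<p>To judge whether a build is progressing or stuck, PostgreSQL's standard <code>pg_stat_progress_create_index</code> view can be polled from a second session while the build runs. A small sketch (the <code>pct_done</code> alias is just a label chosen here):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run in a separate session while CREATE INDEX is executing
SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;</code></pre></div>
<p>The view returns no rows when no index build is in progress, and <code>nullif</code> guards against division by zero before the block total is known.</p>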
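<h3 id="case-3ivfflat-不重建-recall-漂移">Case 3: IVFFlat recall drifts without rebuilds</h3>
<p><strong>Scenario</strong>: the IVFFlat index was built at 100K rows and the table now holds 500K. Recall on new data has fallen from 92% to 75%, and users complain that they "cannot find relevant documents". IVFFlat assigns rows to centroids computed at build time, so the lists stop fitting the data as its distribution shifts.</p>
<p>Fixes:</p>
<ul>
<li>Monitor recall: run a ground-truth evaluation on a schedule (compare against brute-force results)</li>
<li>Set a reindex policy: rebuild after every 100K new rows, or monthly</li>
<li>Switch to HNSW: the graph is maintained incrementally on insert and needs no periodic reindex (trade-off: slower builds)</li>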
</ul>
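<p>A ground-truth check can be sketched by running the same query twice, once through the index and once as an exact scan, and comparing the returned ids. The <code>id</code> column and the index name <code>documents_embedding_idx</code> are hypothetical; the query vector is elided.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Approximate result via the ANN index
SELECT id FROM documents
ORDER BY embedding &lt;=&gt; &#39;[0.1, 0.2, ...]&#39; LIMIT 10;

-- Exact result: disable index scans for this transaction only
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM documents
ORDER BY embedding &lt;=&gt; &#39;[0.1, 0.2, ...]&#39; LIMIT 10;
COMMIT;

-- If recall has drifted, rebuild the IVFFlat lists without blocking writes
REINDEX INDEX CONCURRENTLY documents_embedding_idx;</code></pre></div>
<p>Recall for one query is the number of ids that appear in both result sets divided by 10; averaging over a sample of real queries gives the figure to track.</p>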
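<h3 id="case-4hybrid-search-filter-selectivity-沒設計">Case 4: Hybrid search designed without regard to filter selectivity</h3>
<p><strong>Scenario</strong>: the query is <code>WHERE user_id = ? ORDER BY embedding &lt;=&gt; ?</code>, <code>user_id</code> is highly selective (1 in 1M), and the planner chooses the vector index scan. None of the top-K candidates belong to that user, so the query returns too few rows or none; with iterative scan enabled it instead keeps fetching until it hits the scan cap.</p>
<p>Fixes:</p>
<ul>
<li>Use <code>EXPLAIN</code> to see whether the planner chose pre-filter or vector-first</li>
<li>Add a B-tree index on <code>user_id</code> to steer the planner toward pre-filtering (PostgreSQL has no built-in hints, so rely on up-to-date statistics)</li>
<li>On pgvector 0.8+, enable iterative scan so under-filled results are topped up automatically</li>
<li>Design the schema with this in mind: highly selective filters (<code>user_id</code>) should pre-filter; weakly selective ones (<code>category</code>) should use iterative scan</li>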
</ul>
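<p>The first two fixes can be sketched together. The index name <code>documents_user_id_idx</code> and the literal <code>42</code> are placeholders, and the plan shapes in the comments describe what to look for rather than guaranteed output.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Give the planner a cheap way to pre-filter, then refresh statistics
CREATE INDEX IF NOT EXISTS documents_user_id_idx ON documents (user_id);
ANALYZE documents;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM documents
WHERE user_id = 42
ORDER BY embedding &lt;=&gt; &#39;[0.1, 0.2, ...]&#39;
LIMIT 10;
-- Pre-filter plan: a scan on documents_user_id_idx feeding a Sort
-- Vector-first plan: an Index Scan on the HNSW index with a Filter on user_id</code></pre></div>
<p>In the vector-first plan, a large "Rows Removed by Filter" count relative to the rows returned is the signal that the filter is too selective for post-filtering.</p>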
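<h3 id="case-5memory-budget-沒抓">Case 5: Memory budget not planned</h3>
<p><strong>Scenario</strong>: 1M vectors × 1536 dim with HNSW (m=16) gives an index of about 12 GB, but <code>shared_buffers</code> is 8 GB. The index does not fit in cache, every query hits disk, and latency exceeds 100 ms.</p>
<p>Fixes:</p>
<ul>
<li>Estimate vector plus index memory: <code>rows × dim × 4 bytes × (1 + index_overhead)</code></li>
<li>Size <code>shared_buffers</code> so that at least the hot portion of the index fits</li>
<li>If that is not possible, shrink the vectors (fewer dims or <code>halfvec</code>), move to a larger instance, or shard</li>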
</ul>
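<p>The estimate can be compared against what is actually on disk. In this sketch the first query is just the raw-vector term of the formula for the 1M × 1536 scenario; the index name <code>documents_embedding_idx</code> is hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Raw vector bytes: rows × dims × 4 bytes (float32)
SELECT pg_size_pretty(1000000::bigint * 1536 * 4) AS raw_vectors;

-- Measured sizes: table (including TOAST) and the vector index
SELECT pg_size_pretty(pg_table_size(&#39;documents&#39;)) AS table_size,
       pg_size_pretty(pg_relation_size(&#39;documents_embedding_idx&#39;)) AS index_size;

-- The cache budget to compare against
SHOW shared_buffers;</code></pre></div>
<p>If <code>index_size</code> alone exceeds <code>shared_buffers</code>, queries will depend on the OS page cache or disk, which matches the latency pattern in the scenario.</p>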
<h2 id="跟專業-vector-db-對比">Comparison with Dedicated Vector Databases</h2>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>pgvector</th>
          <th>Pinecone</th>
          <th>Weaviate</th>
          <th>Milvus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Query interface</td>
          <td>SQL</td>
          <td>REST/gRPC API</td>
          <td>GraphQL / REST</td>
          <td>gRPC</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>95-99% (HNSW)</td>
          <td>95-99%</td>
          <td>95-99%</td>
          <td>95-99%</td>
      </tr>
      <tr>
          <td>Throughput</td>
          <td>Medium (bounded by PG)</td>
          <td>High</td>
          <td>High</td>
          <td>High</td>
      </tr>
      <tr>
          <td>Hybrid search</td>
          <td>Strong (full SQL)</td>
          <td>Medium (metadata filter)</td>
          <td>Medium</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Integration with existing PG</td>
          <td>Native (joins in the same DB)</td>
          <td>Requires sync</td>
          <td>Requires sync</td>
          <td>Requires sync</td>
      </tr>
      <tr>
          <td>Multi-tenant</td>
          <td>Row-level (standard PG mechanisms)</td>
          <td>Built-in</td>
          <td>Built-in</td>
          <td>Partition</td>
      </tr>
      <tr>
          <td>Open source</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Operational cost</td>
          <td>Same as PG (just operate PG)</td>
          <td>Managed-only</td>
          <td>Self-managed or cloud</td>
          <td>Self-managed or cloud</td>
      </tr>
      <tr>
          <td>Practical scale ceiling</td>
          <td>10M-100M vector</td>
          <td>10B+</td>
          <td>1B+</td>
          <td>10B+</td>
      </tr>
  </tbody>
</table>
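<p><strong>When to choose pgvector</strong>:</p>
<ul>
<li>The application already runs on PG and you do not want another system to operate</li>
<li>Fewer than 100M vectors</li>
<li>You need to join vector and relational data</li>
<li>The team knows SQL well and does not want to learn another API or SDK</li>
<li>Cost-sensitive (managed Pinecone runs about $70+ per month for 1M vectors)</li>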
</ul>
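<p><strong>When to choose a dedicated vector DB</strong>:</p>
<ul>
<li>More than roughly 5-20M vectors (depending on dim, QPS and recall targets; pgvector already starts to hurt at this scale under high QPS, so there is no need to hold out until 100M before switching)</li>
<li>Pure vector workload (no relational integration)</li>
<li>Multi-tenant SaaS is a requirement</li>
<li>Very high throughput requirements (&gt; 10K QPS)</li>
<li>You do not want to manage HNSW builds, memory budgets and recall drift yourself (managed Pinecone takes over that ops layer, trading cost for ops time)</li>
<li>You need indexed vectors above 2,000 dims (the pgvector index limit for <code>vector</code>; <code>halfvec</code> indexes go up to 4,000; beyond that you need dimension reduction)</li>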
</ul>
<h2 id="相關連結">Related Links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>：其他 PG extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>：embedding 通常配 metadata JSONB</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/index-selection/" data-link-title="PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹" data-link-desc="PG 有 6 種 index method（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）跟 partial / expression / covering 三種變體、不是「都用 B-tree 就好」。每種 index 有自己的 query pattern、儲存代價、write amplification 跟 maintenance 成本。本文走 6 種 index 的適用 workload 對照、決策樹、partial / expression / covering / multi-column 變體、5 production 踩雷（過度 index / partial 條件不對 / B-tree 對 JSON 無效 / BRIN 對非 correlated 資料無效 / multi-column 順序錯）、跟 query-optimization 的 EXPLAIN 互補">index-selection</a>：B-tree / GIN / HNSW 整體比較</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>：vector query 的 EXPLAIN</li>
</ul>
<h2 id="下一步">Next Steps</h2>
<ul>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 探索其他 PG 擴展可能</li>
<li>回 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 看全圖</li>
</ul>
]]></content:encoded></item><item><title>PostGIS Deep Dive：Geometry / Geography 型別、GiST 空間索引跟 ST_* 函式生態</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/postgis-deep-dive/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/postgis-deep-dive/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>PostGIS extension&lt;/em> — PG 變 GIS DB 的標配、跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem&lt;/a> 是 &lt;em>單一 extension 細節 vs ecosystem 全景&lt;/em> 的關係。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="postgis-是-pg-的-gis-specialization">PostGIS 是 PG 的 &lt;em>GIS Specialization&lt;/em>&lt;/h2>
&lt;p>PostGIS 是 PG 最成熟的 extension 之一（2001 年起、25 年歷史）、產業地位等同 OracleSpatial / SQL Server geography：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">postgis&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>加完後 PG 多兩件事：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>空間型別&lt;/strong>：&lt;code>geometry&lt;/code>（平面）/ &lt;code>geography&lt;/code>（地球曲面）/ &lt;code>raster&lt;/code>（柵格）&lt;/li>
&lt;li>&lt;strong>1000+ 函式&lt;/strong>：&lt;code>ST_Distance&lt;/code> / &lt;code>ST_Within&lt;/code> / &lt;code>ST_Buffer&lt;/code> / &lt;code>ST_Intersects&lt;/code> 等&lt;/li>
&lt;/ol>
&lt;p>用 PostGIS 解的典型 workload：&lt;/p>
&lt;ul>
&lt;li>「離我最近的 N 家店」（k-NN）&lt;/li>
&lt;li>「半徑 1km 內的所有 POI」（radius query）&lt;/li>
&lt;li>「兩個 polygon 是否重疊」（intersection）&lt;/li>
&lt;li>「polyline 總長度」（measurement）&lt;/li>
&lt;li>「行政區包含哪些 point」（containment）&lt;/li>
&lt;/ul>
&lt;h2 id="geometry-vs-geography選錯付學費">Geometry vs Geography：選錯付學費&lt;/h2>
&lt;p>PostGIS 提供兩種空間型別、用途完全不同：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>&lt;code>geometry&lt;/code>&lt;/th>
 &lt;th>&lt;code>geography&lt;/code>&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>座標系統&lt;/td>
 &lt;td>平面（笛卡兒）&lt;/td>
 &lt;td>地球曲面（spheroid）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>距離單位&lt;/td>
 &lt;td>座標系統決定（meter / degree）&lt;/td>
 &lt;td>永遠 meter&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>跨經度 180°&lt;/td>
 &lt;td>不處理&lt;/td>
 &lt;td>自動處理&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>適用範圍&lt;/td>
 &lt;td>小區域（單一城市 / 國家）&lt;/td>
 &lt;td>全球&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>函式覆蓋&lt;/td>
 &lt;td>1000+ 函式&lt;/td>
 &lt;td>約 300 函式&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>效能&lt;/td>
 &lt;td>快（平面計算）&lt;/td>
 &lt;td>慢 2-5x（球面計算）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index 行為&lt;/td>
 &lt;td>GiST 直接&lt;/td>
 &lt;td>GiST 直接&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>選 &lt;code>geography&lt;/code> 的場景&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>PostGIS extension</em> — PG 變 GIS DB 的標配、跟 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 是 <em>單一 extension 細節 vs ecosystem 全景</em> 的關係。</p></blockquote>
<hr>
<h2 id="postgis-是-pg-的-gis-specialization">PostGIS Is PG's <em>GIS Specialization</em></h2>
<p>PostGIS is one of PG's most mature extensions (first released in 2001, about 25 years of history) and holds the same industry standing as Oracle Spatial or SQL Server's geography type:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">postgis</span><span class="p">;</span></span></span></code></pre></div><p>Once installed, PG gains two things:</p>
<ol>
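<li><strong>Spatial types</strong>: <code>geometry</code> (planar) / <code>geography</code> (on the Earth's curved surface) / <code>raster</code> (gridded data; since PostGIS 3.0 it ships in the separate <code>postgis_raster</code> extension)</li>
<li><strong>1000+ functions</strong>: <code>ST_Distance</code> / <code>ST_Within</code> / <code>ST_Buffer</code> / <code>ST_Intersects</code> and so on</li>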
</ol>
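<p>Typical workloads PostGIS solves:</p>
<ul>
<li>"The N stores nearest to me" (k-NN)</li>
<li>"All POIs within a 1 km radius" (radius query)</li>
<li>"Do these two polygons overlap?" (intersection)</li>
<li>"Total length of a polyline" (measurement)</li>
<li>"Which points fall inside an administrative district?" (containment)</li>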
</ul>
<h2 id="geometry-vs-geography選錯付學費">Geometry vs Geography: Choosing Wrong Is Expensive</h2>
<p>PostGIS provides two spatial types for vector data, and they serve very different purposes:</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th><code>geometry</code></th>
          <th><code>geography</code></th>
      </tr>
  </thead>
  <tbody>
      <tr>
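          <td>Coordinate system</td>
          <td>Planar (Cartesian)</td>
          <td>Earth's surface (spheroid)</td>
      </tr>
      <tr>
          <td>Distance unit</td>
          <td>Set by the SRID (metres, feet or degrees)</td>
          <td>Always metres</td>
      </tr>
      <tr>
          <td>Crossing the 180° meridian</td>
          <td>Not handled</td>
          <td>Handled automatically</td>
      </tr>
      <tr>
          <td>Best suited to</td>
          <td>Small regions (a single city or country)</td>
          <td>Global</td>
      </tr>
      <tr>
          <td>Function coverage</td>
          <td>1000+ functions</td>
          <td>A much smaller subset (dozens rather than hundreds)</td>
      </tr>
      <tr>
          <td>Performance</td>
          <td>Fast (planar math)</td>
          <td>Typically 2-5x slower (spheroidal math)</td>
      </tr>
      <tr>
          <td>Index behaviour</td>
          <td>Direct GiST</td>
          <td>Direct GiST</td>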
      </tr>
  </tbody>
</table>
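<p><strong>When to choose <code>geography</code></strong>:</p>
<ul>
<li>Applications with global coverage (cross-country / cross-continent)</li>
<li>High distance accuracy is required (spheroidal calculation has less error than planar over long distances)</li>
<li>No complex spatial operations are needed (geography supports fewer functions)</li>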
</ul>
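<p>As a minimal sketch of the <code>geography</code> route, assuming PostGIS is installed: the <code>airports</code> table, its columns and the 50 km radius are hypothetical examples, not part of the schema used elsewhere in this article.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical table storing longitude/latitude as geography
CREATE TABLE airports (
    id bigserial PRIMARY KEY,
    name text,
    location geography(Point, 4326)
);
CREATE INDEX idx_airports_location ON airports USING GIST (location);

-- Distances on geography are in metres: everything within 50 km of a point
SELECT name
FROM airports
WHERE ST_DWithin(
    location,
    ST_SetSRID(ST_MakePoint(121.5654, 25.0330), 4326)::geography,
    50000
);</code></pre></div>
<p>No projection has to be chosen up front, which is the main convenience of <code>geography</code> for data that spans continents.</p>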
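<p><strong>When to choose <code>geometry</code></strong>:</p>
<ul>
<li>Applications confined to a single city or country</li>
<li>The full set of ST_* functions is needed (roughly 90% of functions support geometry only)</li>
<li>Performance-sensitive workloads</li>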
</ul>
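<p>In practice most production systems choose <code>geometry</code> with a suitable SRID (a local projection), which is both fast and accurate within the area the projection covers.</p>
<h2 id="srid-跟-projection為什麼-4326-vs-3857-是-gis-第一課">SRID and Projection: Why 4326 vs 3857 Is GIS Lesson One</h2>
<p>The SRID (Spatial Reference System Identifier) defines how coordinate numbers are to be interpreted:</p>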
<table>
  <thead>
      <tr>
          <th>SRID</th>
          <th>Name</th>
          <th>Used for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>4326</td>
          <td>WGS 84 (GPS)</td>
          <td>Longitude/latitude; the most common; what the Google Maps API uses</td>
      </tr>
      <tr>
          <td>3857</td>
          <td>Web Mercator</td>
          <td>Web tile maps (OpenStreetMap)</td>
      </tr>
      <tr>
          <td>3826</td>
          <td>TWD97 / TM2 zone 121</td>
          <td>Taiwan local projection, in metres</td>
      </tr>
      <tr>
          <td>2272</td>
          <td>NAD83 / Pennsylvania South (ftUS)</td>
          <td>US State Plane (each state has its own zones; this one is in US survey feet)</td>
      </tr>
  </tbody>
</table>
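<p>The definitions behind these SRIDs can be inspected directly: PostGIS ships them in the <code>spatial_ref_sys</code> table. A small sketch, assuming a default PostGIS install where that table is populated:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Look up the projection definitions for the SRIDs in the table above
SELECT srid, auth_name, auth_srid, proj4text
FROM spatial_ref_sys
WHERE srid IN (4326, 3857, 3826, 2272);

-- Check which SRID a geometry value carries
SELECT ST_SRID(ST_SetSRID(ST_MakePoint(121.5654, 25.0330), 4326));</code></pre></div>
<p>The <code>+units=</code> term in <code>proj4text</code>, where present, shows whether a projected system measures in metres or feet, which is worth checking before trusting any distance result.</p>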
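<p><strong>Why choose a local projection (3826) over longitude/latitude (4326)</strong>:</p>
<ul>
<li>Longitude/latitude units are <em>degrees</em>, not distance: <code>ST_Distance</code> on 4326 geometry returns degrees, not metres</li>
<li>Getting metres requires <code>ST_DistanceSphere</code> or a <code>geography</code> cast, both of which cost more to compute</li>
<li>A local projection is planar: <code>ST_Distance</code> returns metres directly and <code>ST_Area</code> returns square metres</li>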
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Computing directly on 4326 longitude/latitude: the result is not in metres
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_Distance</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5654</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0330</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w">  </span><span class="c1">-- Taipei 101
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5170</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0478</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">)</span><span class="w">   </span><span class="c1">-- Taipei Main Station
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">  </span><span class="c1">-- ~0.05 (this is in degrees)
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Transform to 3826 (Taiwan local projection) to get metres
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_Distance</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">ST_Transform</span><span class="p">(</span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5654</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0330</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="n">ST_Transform</span><span class="p">(</span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5170</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0478</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">  </span><span class="c1">-- ~5150 (metres)
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- Or use a geography cast
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_Distance</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5654</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0330</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">)::</span><span class="n">geography</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">121</span><span class="p">.</span><span class="mi">5170</span><span class="p">,</span><span class="w"> </span><span class="mi">25</span><span class="p">.</span><span class="mi">0478</span><span class="p">),</span><span class="w"> </span><span class="mi">4326</span><span class="p">)::</span><span class="n">geography</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">  </span><span class="c1">-- ~5150 (metres)</span></span></span></code></pre></div><p><strong>Typical schema design</strong> (for a Taiwan-based application):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="c1">-- Store 4326 (matches the Google Maps API)
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">location_4326</span><span class="w"> </span><span class="n">geometry</span><span class="p">(</span><span class="n">Point</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="c1">-- Precompute 3826 (for distance / area queries)
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">location_3826</span><span class="w"> </span><span class="n">geometry</span><span class="p">(</span><span class="n">Point</span><span class="p">,</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w"> </span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">        </span><span class="p">(</span><span class="n">ST_Transform</span><span class="p">(</span><span class="n">location_4326</span><span class="p">,</span><span class="w"> </span><span class="mi">3826</span><span class="p">))</span><span class="w"> </span><span class="n">STORED</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_pois_location_3826</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="n">location_3826</span><span class="p">);</span></span></span></code></pre></div><h2 id="gist-空間索引r-tree-的-pg-實作">GiST Spatial Index: PG's R-tree Implementation</h2>
<p>PostGIS builds its spatial indexes on PG's built-in GiST framework (internally an R-tree variant):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_pois_geom</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">USING</span><span class="w"> </span><span class="n">GIST</span><span class="w"> </span><span class="p">(</span><span class="n">location_3826</span><span class="p">);</span></span></span></code></pre></div><p>Spatial queries that GiST speeds up:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Range query (bounding-box overlap)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">location_3826</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">ST_MakeEnvelope</span><span class="p">(</span><span class="mi">290000</span><span class="p">,</span><span class="w"> </span><span class="mi">2760000</span><span class="p">,</span><span class="w"> </span><span class="mi">305000</span><span class="p">,</span><span class="w"> </span><span class="mi">2775000</span><span class="p">,</span><span class="w"> </span><span class="mi">3826</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- 半徑 query（用 ST_DWithin 才走 index）
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">ST_DWithin</span><span class="p">(</span><span class="n">location_3826</span><span class="p">,</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">300000</span><span class="p">,</span><span class="w"> </span><span class="mi">2770000</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">),</span><span class="w"> </span><span class="mi">1000</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- k-NN（PostGIS 2.0+ &lt;-&gt; operator）
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">location_3826</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">300000</span><span class="p">,</span><span class="w"> </span><span class="mi">2770000</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dist</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">location_3826</span><span class="w"> </span><span class="o">&lt;-&gt;</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="mi">300000</span><span class="p">,</span><span class="w"> </span><span class="mi">2770000</span><span class="p">),</span><span class="w"> </span><span class="mi">3826</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p><strong>index 用沒用到的關鍵</strong>：</p>
<table>
  <thead>
      <tr>
          <th>Query form</th>
          <th>Uses the index?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>ST_DWithin(a, b, dist)</code></td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><code>ST_Distance(a, b) &lt; dist</code></td>
          <td>No (always a full scan)</td>
      </tr>
      <tr>
          <td><code>a &amp;&amp; bbox</code></td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><code>ST_Intersects(a, bbox)</code></td>
          <td>Yes</td>
      </tr>
      <tr>
          <td><code>a &lt;-&gt; b ORDER BY ... LIMIT n</code></td>
          <td>Yes (k-NN)</td>
      </tr>
      <tr>
          <td><code>ST_Equals(a, b)</code></td>
          <td>Yes (bounding-box prefilter, then an exact comparison)</td>
      </tr>
  </tbody>
</table>
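<p>Production rule of thumb: whenever <code>ST_DWithin</code> can express the filter, use it instead of <code>ST_Distance(...) &lt; ?</code>. Both express the same distance filter, but the index behaviour is very different.</p>
<h2 id="st_-函式生態產業級全套">The ST_* function ecosystem: an industrial-strength toolkit</h2>
<p>PostGIS ships 1000+ functions; the categories you typically reach for:</p>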
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Representative functions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Construction</td>
          <td><code>ST_MakePoint</code> / <code>ST_MakeLine</code> / <code>ST_MakePolygon</code></td>
      </tr>
      <tr>
          <td>Relationship tests</td>
          <td><code>ST_Intersects</code> / <code>ST_Within</code> / <code>ST_Contains</code> / <code>ST_Touches</code></td>
      </tr>
      <tr>
          <td>Distance / size</td>
          <td><code>ST_Distance</code> / <code>ST_DWithin</code> / <code>ST_Length</code> / <code>ST_Area</code></td>
      </tr>
      <tr>
          <td>Geometry processing / overlay</td>
          <td><code>ST_Buffer</code> / <code>ST_Union</code> / <code>ST_Difference</code> / <code>ST_Intersection</code></td>
      </tr>
      <tr>
          <td>Projection</td>
          <td><code>ST_Transform</code> / <code>ST_SetSRID</code></td>
      </tr>
      <tr>
          <td>Format conversion</td>
          <td><code>ST_AsGeoJSON</code> / <code>ST_AsKML</code> / <code>ST_AsText</code> / <code>ST_GeomFromGeoJSON</code></td>
      </tr>
      <tr>
          <td>Paths / topology</td>
          <td><code>ST_ShortestLine</code> / <code>ST_LineMerge</code></td>
      </tr>
      <tr>
          <td>Aggregation / derived geometry</td>
          <td><code>ST_Collect</code> / <code>ST_ConvexHull</code> / <code>ST_Centroid</code></td>
      </tr>
      <tr>
          <td>Simplification</td>
          <td><code>ST_Simplify</code> / <code>ST_SimplifyPreserveTopology</code></td>
      </tr>
  </tbody>
</table>
<p>A typical query for the <strong>web tile scenario</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 給定 z/x/y tile、找這個 tile 內的所有 POI
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">ST_AsMVTGeom</span><span class="p">(</span><span class="n">location_3857</span><span class="p">,</span><span class="w"> </span><span class="n">ST_TileEnvelope</span><span class="p">(</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">geom</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">location_3857</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">ST_TileEnvelope</span><span class="p">(</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">);</span></span></span></code></pre></div><p><code>ST_AsMVTGeom</code> + <code>ST_AsMVT</code> 直接產 Mapbox Vector Tile binary、給前端 Leaflet / Mapbox GL JS 用。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="case-1geometry-用錯-srid">Case 1：Geometry 用錯 SRID</h3>
<p><strong>情境</strong>：app 寫入時用 4326、query 時用 3826 ST_Transform、忘記給某個 column 設 SRID、index 失效。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 確認 SRID
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_SRID</span><span class="p">(</span><span class="k">location</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- 強 type 約束（column type 寫死 SRID）
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">ALTER</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="k">location</span><span class="w"> </span><span class="k">TYPE</span><span class="w"> </span><span class="n">geometry</span><span class="p">(</span><span class="n">Point</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="k">USING</span><span class="w"> </span><span class="n">ST_SetSRID</span><span class="p">(</span><span class="k">location</span><span class="p">,</span><span class="w"> </span><span class="mi">4326</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- Check constraint 防錯
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">pois</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">chk_location_srid</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">ST_SRID</span><span class="p">(</span><span class="k">location</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">4326</span><span class="p">);</span></span></span></code></pre></div><h3 id="case-2geography-不能用所有-st_-函式">Case 2：Geography 不能用所有 ST_* 函式</h3>
<p><strong>情境</strong>：用 <code>geography</code> 想跑 <code>ST_Buffer</code>、報錯或結果不對。</p>
<p><code>ST_Buffer</code> 對 geography 走 spheroid 近似、邊界 case 結果跟 geometry 不一致；很多函式（<code>ST_Voronoi</code> / <code>ST_Delaunay</code> 等）只支援 geometry。</p>
<p>修法：</p>
<ul>
<li>簡單距離 query 用 geography</li>
<li>複雜空間運算用 geometry + 適合 projection</li>
<li>不確定哪些函式支援 geography、看 PostGIS docs <em>Geography Support Functions</em> 清單</li>
</ul>
<h3 id="case-3gist-index-不對-st_distance-生效">Case 3：GiST index 不對 ST_Distance 生效</h3>
<p><strong>情境</strong>：query <code>ST_Distance(location, ?) &lt; 1000</code>、<code>EXPLAIN</code> 顯示 full scan、加 index 也沒用。</p>
<p><code>ST_Distance</code> 算完才 filter、planner 沒辦法用 GiST。</p>
<p>修法：</p>
<ul>
<li>改 <code>ST_DWithin(location, ?, 1000)</code> — 語意一樣、會走 GiST</li>
<li>確認 index 是對 <em>被 query 的 column</em> 建的（不是 transform 後的 expression）</li>
</ul>
<h3 id="case-4cluster-on-geom-後-brin-失效">Case 4：CLUSTER on geom 後 BRIN 失效</h3>
<p><strong>情境</strong>：對 <code>pois</code> 跑 <code>CLUSTER pois USING idx_pois_geom</code> 想加速空間查、但同時對 <code>created_at</code> 用 BRIN index、BRIN 完全失效。</p>
<p>CLUSTER 重組 physical order 跟 GiST 對齊、<code>created_at</code> physical order correlation 從 1.0 變 0.0、BRIN range 沒選擇性。</p>
<p>修法：</p>
<ul>
<li>不要 CLUSTER 大表（一次性、影響其他 column）</li>
<li>換 partition by time + GiST per-partition（取兩者）</li>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/index-selection/" data-link-title="PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹" data-link-desc="PG 有 6 種 index method（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）跟 partial / expression / covering 三種變體、不是「都用 B-tree 就好」。每種 index 有自己的 query pattern、儲存代價、write amplification 跟 maintenance 成本。本文走 6 種 index 的適用 workload 對照、決策樹、partial / expression / covering / multi-column 變體、5 production 踩雷（過度 index / partial 條件不對 / B-tree 對 JSON 無效 / BRIN 對非 correlated 資料無效 / multi-column 順序錯）、跟 query-optimization 的 EXPLAIN 互補">index-selection</a> 的 BRIN 段</li>
</ul>
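<p>To make the correlation collapse in Case 4 concrete, the toy model below reorders rows by a stand-in spatial key and measures how well heap position still tracks <code>created_at</code>, which is what <code>pg_stats.correlation</code> reports. It is a Python sketch under stated assumptions: the <code>spatial_key</code> permutation is invented to imitate an ordering unrelated to time, and the helper names are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def physical_correlation(rows, column):
    """Correlation between heap position and the values of one column."""
    positions = list(range(len(rows)))
    return pearson(positions, [row[column] for row in rows])


# Toy table: rows arrive in created_at order. spatial_key stands in for
# whatever order the GiST index imposes (a scrambling permutation here).
N = 1000
rows = [{"created_at": i, "spatial_key": (i * 919) % N} for i in range(N)]

before = physical_correlation(rows, "created_at")

# CLUSTER rewrites the heap in index order.
clustered = sorted(rows, key=lambda row: row["spatial_key"])
after = physical_correlation(clustered, "created_at")

if __name__ == "__main__":
    print(f"before CLUSTER: {before:.3f}")
    print(f"after CLUSTER:  {after:.3f}")
</code></pre></div>
<p>Once the correlation is near zero, almost every BRIN block range spans the full <code>created_at</code> interval, so a time filter can no longer skip ranges.</p>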
<h3 id="case-5ewkb-vs-wkb-跨工具相容">Case 5：EWKB vs WKB 跨工具相容</h3>
<p><strong>情境</strong>：用 PostGIS export 給其他 GIS 工具（QGIS / Shapely / ogr2ogr）、resort 抱怨格式不對。</p>
<p>PostGIS 內部用 EWKB（Extended Well-Known Binary）— 多帶 SRID。多數 GIS 工具讀 WKB（標準）。</p>
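<p>The difference between the two encodings is only a few bytes, which the following Python sketch makes visible for a 2D point. It is an illustration rather than PostGIS code: the helper names are hypothetical, and big-endian byte order is chosen for readability (PostGIS usually emits little-endian; both are valid).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import struct

WKB_POINT = 1                # OGC geometry type code for Point
EWKB_SRID_FLAG = 0x20000000  # PostGIS extension bit: "an SRID follows"


def wkb_point(x, y):
    """Standard OGC WKB for a 2D point (byte-order flag 0 = big-endian)."""
    return struct.pack("!BIdd", 0, WKB_POINT, x, y)


def ewkb_point(x, y, srid):
    """EWKB: same layout, plus the SRID flag in the type word and the SRID."""
    return struct.pack("!BIIdd", 0, WKB_POINT | EWKB_SRID_FLAG, srid, x, y)


if __name__ == "__main__":
    plain = wkb_point(300000.0, 2770000.0)
    extended = ewkb_point(300000.0, 2770000.0, 3826)
    # Length, then the header bytes in front of the coordinates.
    print(len(plain), plain[:5].hex())
    print(len(extended), extended[:9].hex())
</code></pre></div>
<p>A reader that only knows the OGC type codes sees <code>0x20000001</code> where it expects <code>1</code>, which is the usual source of the "unknown geometry type" style of complaint.</p>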
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Export 標準 WKB
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_AsBinary</span><span class="p">(</span><span class="n">geom</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 或 GeoJSON（跨工具最相容）
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">ST_AsGeoJSON</span><span class="p">(</span><span class="n">geom</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pois</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- 或 Shapefile via ogr2ogr
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1">-- ogr2ogr -f &#34;ESRI Shapefile&#34; output.shp PG:&#34;...&#34; -sql &#34;SELECT * FROM pois&#34;</span></span></span></code></pre></div><h2 id="跟專業-gis-db-對比">跟專業 GIS DB 對比</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PostGIS</th>
          <th>Oracle Spatial</th>
          <th>SQL Server geography</th>
          <th>MongoDB GeoJSON</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Function coverage</td>
          <td>1000+</td>
          <td>800+</td>
          <td>200+</td>
          <td>~20</td>
      </tr>
      <tr>
          <td>Raster support</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Yes (PostGIS Topology)</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>3D support</td>
          <td>Yes (PostGIS SFCGAL)</td>
          <td>Yes</td>
          <td>Partial</td>
          <td>No</td>
      </tr>
      <tr>
          <td>License</td>
          <td>GPL</td>
          <td>Commercial</td>
          <td>Commercial</td>
          <td>SSPL (source-available)</td>
      </tr>
      <tr>
          <td>Tile generation</td>
          <td>Built in (ST_AsMVT)</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Database integration</td>
          <td>Native to PG</td>
          <td>Part of Oracle</td>
          <td>Part of SQL Server</td>
          <td>Standalone</td>
      </tr>
      <tr>
          <td>Typical adopters</td>
          <td>OpenStreetMap / national mapping agencies</td>
          <td>Large enterprises</td>
          <td>Microsoft ecosystem</td>
          <td>Simple location apps</td>
      </tr>
  </tbody>
</table>
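<p><strong>When to choose PostGIS</strong> (roughly 90% of GIS workloads):</p>
<ul>
<li>The application already runs on PG</li>
<li>You need the full GIS function ecosystem (road networks / contour lines / watershed analysis)</li>
<li>Open source or cost sensitivity matters</li>
<li>You need interoperability with OGR / GDAL / QGIS</li>
</ul>
<p><strong>When to choose a dedicated GIS database</strong>:</p>
<ul>
<li>You are already tied to an Oracle / SQL Server license</li>
<li>Highly specialised GIS work (3D city models / LIDAR / GPU acceleration)</li>
<li>A pure location app with no relational needs (MongoDB GeoJSON is enough)</li>
</ul>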
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>：其他 PG extension</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/index-selection/" data-link-title="PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹" data-link-desc="PG 有 6 種 index method（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）跟 partial / expression / covering 三種變體、不是「都用 B-tree 就好」。每種 index 有自己的 query pattern、儲存代價、write amplification 跟 maintenance 成本。本文走 6 種 index 的適用 workload 對照、決策樹、partial / expression / covering / multi-column 變體、5 production 踩雷（過度 index / partial 條件不對 / B-tree 對 JSON 無效 / BRIN 對非 correlated 資料無效 / multi-column 順序錯）、跟 query-optimization 的 EXPLAIN 互補">index-selection</a>：GiST 跟其他 index 對比</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">query-optimization</a>：空間 query 的 EXPLAIN</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">jsonb-deep-dive</a>：POI metadata 用 JSONB 儲存</li>
</ul>
<h2 id="下一步">下一步</h2>
<ul>
<li>看 <a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a> 探索其他 PG 擴展可能</li>
<li>回 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> 看全圖</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL autovacuum tuning: why your autovacuum never catches up with bloat</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive that builds on the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview explains why PostgreSQL's MVCC needs vacuum; this article focuses on the root causes of &lt;em>autovacuum falling behind on write-heavy production workloads&lt;/em> and on tuning each dimension.&lt;/p>&lt;/blockquote>
&lt;h2 id="你的-autovacuum-永遠追不上-bloat--為什麼">Why your autovacuum never catches up with bloat&lt;/h2>
&lt;p>A familiar story for write-heavy tables: 10GB at launch, 30GB three months later, 80GB after six. The DBA checks &lt;code>pg_stat_user_tables&lt;/code> and finds &lt;code>n_dead_tup&lt;/code> higher than &lt;code>n_live_tup&lt;/code>; &lt;code>pg_stat_progress_vacuum&lt;/code> shows autovacuum running constantly, yet the dead tuples are never fully cleaned up. The table holds only 5M rows but occupies 80GB on disk.&lt;/p>
&lt;p>This is not a PostgreSQL bug. It follows from a design intent: autovacuum's &lt;em>cost-based throttling is conservative by default&lt;/em>, because autovacuum should not hurt OLTP query performance, so it sleeps after each slice of work. With the defaults (an effective cost limit of 200, since &lt;code>autovacuum_vacuum_cost_limit=-1&lt;/code> inherits &lt;code>vacuum_cost_limit=200&lt;/code>, plus &lt;code>autovacuum_vacuum_cost_delay=2ms&lt;/code>), a write-heavy table taking thousands of UPDATEs per second produces dead tuples &lt;em>faster than&lt;/em> autovacuum removes them. The defaults suit read-heavy / write-light workloads; write-heavy OLTP has to be tuned.&lt;/p>
&lt;h2 id="mvcc-跟-dead-tuplevacuum-在解什麼">MVCC and dead tuples: what vacuum solves&lt;/h2>
&lt;p>Under PostgreSQL's MVCC, every UPDATE is &lt;em>insert a new row version + mark the old one as deleted&lt;/em>, and DELETE only &lt;em>marks the row as deleted without freeing the space right away&lt;/em>. Dead tuples occupy disk space but are invisible to queries. Autovacuum is responsible for:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Reclaiming dead tuple space&lt;/strong> for reuse by new rows (tracked in the free space map; the table file does not shrink)&lt;/li>
&lt;li>&lt;strong>Updating the visibility map&lt;/strong> so index-only scans can skip heap fetches&lt;/li>
&lt;li>&lt;strong>Freezing the xids of old rows&lt;/strong> to avoid the xid wraparound disaster&lt;/li>
&lt;li>&lt;strong>Cleaning up indexes&lt;/strong> by removing entries that point to dead tuples (indexes are not rebuilt)&lt;/li>
&lt;/ol>
&lt;p>Vacuum does not shrink the table. To actually shrink it you need &lt;code>VACUUM FULL&lt;/code> (takes an ACCESS EXCLUSIVE lock on the whole table, so it is not an option on a live production table) or &lt;code>pg_repack&lt;/code> (an online repack tool). Expect vacuum to &lt;em>stop the table from growing&lt;/em>, not to &lt;em>make it smaller&lt;/em>.&lt;/p>
&lt;h2 id="tuningcost-based-throttle-跟-trigger-threshold">Tuning: cost-based throttle and trigger threshold&lt;/h2>
&lt;h3 id="cost-based-throttle全-instance">Cost-based throttle (instance-wide)&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-ini" data-lang="ini">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1"># postgresql.conf&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="na">autovacuum_vacuum_cost_limit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">2000 # 預設 200、production 拉 5-10 倍&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="na">autovacuum_vacuum_cost_delay&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">2ms # 預設 2ms、不太需要動&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="na">autovacuum_max_workers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">6 # 預設 3、CPU 多時拉到 6-10&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="na">maintenance_work_mem&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s">1GB # 預設 64MB、單一 vacuum 用的記憶體&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>直覺：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PostgreSQL MVCC 的 vacuum 必要性、本文聚焦 <em>autovacuum 在 production write-heavy workload 為什麼追不上</em> 的根因 + 各維度 tuning。</p></blockquote>
<h2 id="你的-autovacuum-永遠追不上-bloat--為什麼">你的 autovacuum 永遠追不上 bloat — 為什麼</h2>
<p>write-heavy table 的常見故事：上線時表 10GB、3 個月後 30GB、6 個月 80GB；DBA 看 <code>pg_stat_user_tables</code> 發現 <code>n_dead_tup</code> 比 <code>n_live_tup</code> 還多、<code>pg_stat_progress_vacuum</code> 顯示 autovacuum 一直在跑、但 dead tuple 從沒清乾淨。表本身才 5M row、實際磁碟卻佔 80GB。</p>
<p>這不是 PostgreSQL bug、是 autovacuum <em>cost-based throttling 預設保守</em> 的設計意圖 — autovacuum 不該影響 OLTP query 性能、所以每跑一段就 sleep。預設 <code>autovacuum_vacuum_cost_limit=200</code> + <code>autovacuum_vacuum_cost_delay=2ms</code> 在 write-heavy 表（每秒幾千 UPDATE）下、清理速度 <em>永遠慢於</em> dead tuple 產生速度。預設配置適合 read-heavy / write-light workload；OLTP write-heavy 必須調。</p>
<h2 id="mvcc-跟-dead-tuplevacuum-在解什麼">MVCC 跟 dead tuple：vacuum 在解什麼</h2>
<p>PostgreSQL MVCC：每次 UPDATE 都是 <em>insert new row + mark old row as deleted</em>；DELETE 是 <em>mark as deleted、不立刻釋放空間</em>。dead tuple 在 disk 上佔位、但不能被 query 讀到。autovacuum 的責任：</p>
<ol>
<li><strong>回收 dead tuple 空間</strong> 供新 row reuse（不縮 table 大小、是 free space map）</li>
<li><strong>更新 visibility map</strong> 讓 index-only scan 跳過 heap fetch</li>
<li><strong>凍結老 row 的 xid</strong>（freeze）避免 xid wraparound 災難</li>
<li><strong>重整 index B-tree</strong> 標記 dead pointer（不刪 index page）</li>
</ol>
<p>Vacuum 不縮表 — 真要縮要跑 <code>VACUUM FULL</code>（全表 exclusive lock、production 不能跑）或 <code>pg_repack</code>（online repack tool）。預期 vacuum 只能 <em>讓表停止長大</em>、不能 <em>讓表變小</em>。</p>
<h2 id="tuningcost-based-throttle-跟-trigger-threshold">Tuning：cost-based throttle 跟 trigger threshold</h2>
<h3 id="cost-based-throttle全-instance">Cost-based throttle（全 instance）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">autovacuum_vacuum_cost_limit</span> <span class="o">=</span> <span class="s">2000          # 預設 200、production 拉 5-10 倍</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">autovacuum_vacuum_cost_delay</span> <span class="o">=</span> <span class="s">2ms            # 預設 2ms、不太需要動</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">autovacuum_max_workers</span> <span class="o">=</span> <span class="s">6                    # 預設 3、CPU 多時拉到 6-10</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">maintenance_work_mem</span> <span class="o">=</span> <span class="s">1GB                    # 預設 64MB、單一 vacuum 用的記憶體</span></span></span></code></pre></div><p>直覺：</p>
<ul>
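<li><code>cost_limit</code> is how much "cost" each cycle may spend; cost accumulates from page hits, page reads, and dirtied pages. Raising it means more pages processed per cycle</li>
<li>Raising <code>cost_limit</code> is more direct than lowering <code>cost_delay</code>: once the delay drops below 1ms, OS scheduler jitter makes it ineffective</li>
<li><code>max_workers</code> caps how many vacuums run at the same time; with many partitions the slots fill up quickly, so raise it</li>
<li><code>maintenance_work_mem</code> bounds how many dead-tuple references a vacuum can hold before it must scan the indexes, so more memory means fewer index passes; 1-2GB is the sweet spot on SSDs (releases before PostgreSQL 17 cap this at 1GB)</li>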
</ul>
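<p>The effect of <code>cost_limit</code> and <code>cost_delay</code> can be estimated with back-of-the-envelope arithmetic. The sketch below is a rough upper bound, not a model of the real scheduler: the function name is hypothetical, it assumes every page is dirtied (the documented default <code>vacuum_cost_page_dirty</code> is 20), it assumes the default 8 kB page size, and it ignores the time spent on the I/O itself.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def max_pages_per_second(cost_limit, cost_delay_ms, page_cost):
    """Upper bound on pages a throttled vacuum can process per second.

    Each cycle spends up to cost_limit points, then sleeps cost_delay_ms.
    Time spent doing the actual work is ignored, so real throughput is lower.
    """
    pages_per_cycle = cost_limit / page_cost
    cycles_per_second = 1000 / cost_delay_ms
    return pages_per_cycle * cycles_per_second


# Documented default for vacuum_cost_page_dirty; 8 kB is the default page size.
PAGE_DIRTY_COST = 20
PAGE_SIZE_KB = 8

if __name__ == "__main__":
    for limit in (200, 2000):
        pages = max_pages_per_second(limit, 2, PAGE_DIRTY_COST)
        mb = pages * PAGE_SIZE_KB / 1024
        print(f"cost_limit={limit}: about {pages:.0f} dirtied pages/s ({mb:.0f} MB/s)")
</code></pre></div>
<p>The budget is shared across the active autovacuum workers, so adding workers without raising <code>cost_limit</code> only splits the same budget into more pieces.</p>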
<h3 id="per-table-override精準到-hot-table">Per-table override（精準到 hot table）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 對 hot write-heavy 表加強
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">autovacuum_vacuum_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">05</span><span class="p">,</span><span class="w">      </span><span class="c1">-- 預設 0.2、5% dead 就觸發
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_vacuum_threshold</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w">          </span><span class="c1">-- 預設 50、絕對值底線
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_vacuum_cost_limit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">5000</span><span class="p">,</span><span class="w">         </span><span class="c1">-- 該表獨立 cost_limit
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_analyze_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">05</span><span class="p">,</span><span class="w">      </span><span class="c1">-- analyze 也跟著
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_freeze_max_age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100000000</span><span class="w">        </span><span class="c1">-- anti-wraparound 提前
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- lower the frequency for append-only tables (log tables)
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">audit_log</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">  </span><span class="n">autovacuum_vacuum_scale_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span><span class="w">        </span><span class="c1">-- trigger only at 50% dead (very few UPDATE / DELETE)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="w">  </span><span class="n">autovacuum_freeze_max_age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000000</span><span class="w">       </span><span class="c1">-- defer freezing; ignored if above the global autovacuum_freeze_max_age
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="p">);</span></span></span></code></pre></div><p>Key point: <em>tune hot tables tighter than the default and cold tables looser</em>; do not apply one configuration to every table. A production cluster typically has 5-20 hot tables that need per-table tuning.</p>
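<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1write-heavy-hot-tableautovacuum-永遠跑不完">Case 1: write-heavy hot table, autovacuum never finishes</h3>
<p><strong>Symptoms</strong>: <code>pg_stat_user_tables.n_dead_tup</code> stays above <code>n_live_tup</code>; <code>pg_stat_progress_vacuum</code> shows a vacuum on one table still in <code>scanning heap</code> after 6+ hours; the table keeps growing.</p>
<p><strong>Root cause</strong>: with the default <code>cost_limit=200</code> and this table's write rate (~5000 UPDATE/s), vacuum processes dead tuples more slowly than they are produced. A single autovacuum pass over the whole table takes 12 hours, and by the time it finishes the table has crossed the 5% threshold again, so the next pass starts right away.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Run <code>ALTER TABLE ... SET (autovacuum_vacuum_cost_limit = 10000)</code> on the table. A table with its own cost setting is left out of the instance-wide cost balancing across workers.</li>
<li>Raise <code>maintenance_work_mem</code> (or <code>autovacuum_work_mem</code>, which autovacuum workers use when it is set) so that a single vacuum needs fewer index passes. Before PostgreSQL 17, vacuum uses at most 1GB for dead-tuple storage, so 2GB only pays off on 17+.</li>
<li>Short term: run a manual <code>VACUUM (VERBOSE, ANALYZE) events;</code> in a maintenance window to catch up.</li>
<li>Long term: consider partitioning. After partitioning, vacuum only touches the recent partitions instead of scanning the whole table.</li>
</ol>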
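<p>A minimal sketch of the per-table override and a progress check for this case. The table name <code>events</code> is taken from the manual VACUUM in the list above; the value is a starting point under the assumptions of this case, not a measured recommendation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Give the hot table its own cost budget (10000 is the maximum allowed value)
ALTER TABLE events SET (autovacuum_vacuum_cost_limit = 10000);

-- Watch whether the running vacuum is actually advancing
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 1) AS scanned_pct
FROM pg_stat_progress_vacuum AS p
JOIN pg_class AS c ON c.oid = p.relid;
</code></pre></div>
<p>Sampling the second query a few minutes apart gives a rough scan rate. If <code>scanned_pct</code> barely moves, one likely explanation is that the cost budget is still too low for the write rate.</p>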
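<h3 id="case-2長-transaction-卡住-vacuum-的-xmin-horizon">Case 2: a long transaction pins vacuum's xmin horizon</h3>
<p><strong>Symptoms</strong>: autovacuum appears to run, but <code>n_dead_tup</code> does not drop; <code>pg_stat_activity</code> shows a transaction that has been open for 8 hours (a long report query, or a session sitting idle in transaction).</p>
<p><strong>Root cause</strong>: vacuum can only reclaim dead tuples that no active transaction could still see. The xmin of the long transaction pins the range vacuum is allowed to reclaim, so autovacuum can run nonstop and still reclaim nothing that died after that horizon.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Prevention</strong>: set <code>statement_timeout</code> plus <code>idle_in_transaction_session_timeout</code> (30 minutes) for application roles so long transactions are terminated. <code>statement_timeout</code> bounds a single statement; the idle timeout covers sessions that stop sending statements.</li>
<li><strong>Detection</strong>: run <code>SELECT pid, now() - xact_start FROM pg_stat_activity WHERE state = 'idle in transaction'</code> periodically.</li>
<li><strong>Stopgap</strong>: kill the long transaction. <code>pg_cancel_backend(pid)</code> cancels a running query; an idle-in-transaction session has no query to cancel and needs <code>pg_terminate_backend(pid)</code>. The next autovacuum run can then reclaim the rows.</li>
<li><strong>Architecture</strong>: run report queries on a standby; do not open long transactions on the primary. With <code>hot_standby_feedback = on</code>, a long standby query still holds back the primary's xmin.</li>
</ol>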
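<p>The detection query in the list above only catches idle sessions. A broader sketch, assuming the goal is to find whatever holds the oldest xmin, also checks running queries, replication slots, and prepared transactions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Sessions holding back the xmin horizon, oldest first
SELECT pid,
       state,
       age(backend_xmin) AS xmin_age,
       now() - xact_start AS xact_duration,
       left(query, 60) AS query_head
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

-- Replication slots can pin xmin as well
SELECT slot_name, active, age(xmin) AS xmin_age, age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;

-- So can forgotten two-phase transactions
SELECT gid, prepared, age(transaction) AS xid_age
FROM pg_prepared_xacts;
</code></pre></div>
<p>Whichever row shows the largest age is the first candidate to investigate before blaming autovacuum itself.</p>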
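<h3 id="case-3anti-wraparound-vacuum-在-peak-觸發">Case 3: anti-wraparound vacuum triggers at peak</h3>
<p><strong>Symptoms</strong>: during a production traffic peak PostgreSQL CPU hits 100%, <code>pg_stat_progress_vacuum</code> shows a vacuum in progress, <code>pg_stat_activity</code> labels that worker's query with <code>(to prevent wraparound)</code>, and application latency spikes. If the log also shows <code>database &quot;myapp&quot; must be vacuumed within X transactions</code>, the database is already close to the hard wraparound limit (about 2 billion), far beyond <code>autovacuum_freeze_max_age</code>.</p>
<p><strong>Root cause</strong>: once a table's <code>relfrozenxid</code> age reaches <code>autovacuum_freeze_max_age</code> (default 200M), PostgreSQL <em>forces</em> an anti-wraparound vacuum, even at peak and even when autovacuum is disabled for the table. This vacuum is still throttled by the cost settings, but it <em>does not yield to conflicting lock requests</em> the way a normal autovacuum does, and it scans every page that is not yet all-frozen. On a large table it runs for hours and competes with OLTP queries for I/O. Only the failsafe (PostgreSQL 14+, <code>vacuum_failsafe_age</code>) drops the cost delay.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Prevention</strong>: raise <code>autovacuum_freeze_max_age</code> to 1B (one billion; changing it requires a restart) to give freezing more time to happen naturally off-peak.</li>
<li><strong>Per-table freeze</strong>: set <code>autovacuum_freeze_max_age</code> to 100M on hot tables (freeze early, off-peak) and to 800M on cold tables (avoid unnecessary freezes). The storage parameter takes a plain integer such as <code>100000000</code>, and a per-table value above the global setting is ignored.</li>
<li><strong>Emergency</strong>: run <code>VACUUM (FREEZE, VERBOSE) table_name;</code> manually in a maintenance window to freeze ahead of time.</li>
<li><strong>Monitoring</strong>: <code>SELECT relname, age(relfrozenxid) FROM pg_class WHERE relkind = 'r' ORDER BY age(relfrozenxid) DESC LIMIT 20;</code> shows which tables are closest to wraparound.</li>
</ol>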
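<p>To tell an anti-wraparound run from a routine one and to see how close each table is to the trigger, a sketch along these lines can help. It relies on the <code>(to prevent wraparound)</code> suffix that autovacuum workers report in <code>pg_stat_activity.query</code>, and it compares against the global setting only, so per-table overrides are not reflected:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Autovacuum workers currently running in anti-wraparound mode
SELECT pid, now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
  AND query LIKE '%(to prevent wraparound)%';

-- How far each table is toward autovacuum_freeze_max_age (100 = forced vacuum)
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       round(100.0 * age(c.relfrozenxid)
             / current_setting('autovacuum_freeze_max_age')::numeric, 1) AS pct_of_trigger
FROM pg_class AS c
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
</code></pre></div>
<p>Tables that sit above roughly 80% are reasonable candidates for a manual freeze in the next maintenance window; the 80% figure is an illustrative threshold, not a PostgreSQL default.</p>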
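<h3 id="case-4partition-table-把-autovacuum_max_workers-跑滿">Case 4: a partitioned table saturates autovacuum_max_workers</h3>
<p><strong>Symptoms</strong>: after partitioning (time-based, 12 monthly partitions), autovacuum falls behind; <code>pg_stat_activity</code> shows all 3 autovacuum workers busy on partitions while other hot tables wait in the queue for a long time.</p>
<p><strong>Root cause</strong>: <code>autovacuum_max_workers</code> defaults to 3 and every partition counts as an independent table. With, say, 100 partitions of which 50 need vacuum, the workers are all taken and other tables queue up.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Raise <code>autovacuum_max_workers</code> to 6-10 (depending on CPU core count). The cost limit is shared across all running workers, so raise <code>autovacuum_vacuum_cost_limit</code> as well; otherwise each worker just gets a smaller share.</li>
<li>Set <code>autovacuum_enabled = false</code> on cold partitions (old partitions that are no longer written) to reduce worker contention. Anti-wraparound vacuum still runs on them.</li>
<li>Keep the partition count itself in check: 100+ partitions is a signal to re-evaluate the partition strategy.</li>
</ol>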
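<p>A sketch of the second fix, using a hypothetical cold partition named <code>events_2025_01</code> (the name is an assumption; substitute a real partition that no longer receives writes):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Freeze the cold partition once so it stays quiet afterwards
VACUUM (FREEZE, ANALYZE) events_2025_01;

-- Then take it out of routine autovacuum scheduling
ALTER TABLE events_2025_01 SET (autovacuum_enabled = false);

-- Check how many autovacuum workers are busy right now
SELECT count(*) AS busy_workers,
       current_setting('autovacuum_max_workers') AS max_workers
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
</code></pre></div>
<p>Freezing before disabling is a design choice: a fully frozen partition gives a later anti-wraparound pass almost nothing to do.</p>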
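<h3 id="case-5index-bloat-沒被-vacuum-處理">Case 5: index bloat is not handled by vacuum</h3>
<p><strong>Symptoms</strong>: the table vacuum has finished and <code>n_dead_tup</code> is 0, but the index keeps growing; queries through that index get slower and slower until they are close to a sequential scan.</p>
<p><strong>Root cause</strong>: vacuum removes dead entries from B-tree leaf pages and recycles pages that become completely empty, but it never merges half-empty pages or shrinks the index file. After heavy UPDATE / DELETE churn the index is left with many sparse pages, so a scan traverses more pages and does more I/O for the same number of rows.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><code>REINDEX CONCURRENTLY</code> rebuilds the index online (PG 12+) without blocking writes.</li>
<li>Monitor index bloat with <code>pgstatindex</code> from the <code>pgstattuple</code> extension (<code>pgstattuple_approx</code> in the same extension covers heap tables only).</li>
<li>Prevention: avoid B-tree indexes on columns that receive heavy UPDATEs (typical case: a status column), because updating an indexed column rules out HOT updates; consider a <em>partial index</em> or a <em>hash index</em> (WAL-logged since PG 10).</li>
<li>For heavily bloated tables and indexes, rebuild with <code>pg_repack</code>. It holds an exclusive lock only briefly at the start and end; by default it requires superuser, and <code>--no-superuser-check</code> exists for managed services.</li>
</ol>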
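<p>A sketch of how the measurement and the rebuild could look with stock tooling. <code>idx_events_status</code> is a hypothetical index name used only for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- pgstatindex lives in the pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Low avg_leaf_density means many sparse leaf pages
SELECT index_size, leaf_pages, deleted_pages, avg_leaf_density, leaf_fragmentation
FROM pgstatindex('idx_events_status');

-- Rebuild online once density has dropped well below the index fillfactor (PG 12+)
REINDEX INDEX CONCURRENTLY idx_events_status;
</code></pre></div>
<p><code>pgstatindex</code> reads the whole index, so on large indexes it is better run off-peak or on a sampled set of indexes.</p>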
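<h2 id="容量規劃">Capacity planning</h2>
<p>Measure vacuum capacity by whether it <em>keeps up with the dead-tuple production rate</em>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>How to estimate</th>
          <th>Warning sign</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dead-tuple production rate</td>
          <td><code>UPDATE/s + DELETE/s + ~10% INSERT/s</code> (the INSERT term is an allowance for rolled-back inserts, since a committed INSERT leaves no dead tuple; UPDATEs that miss HOT also leave dead index entries)</td>
          <td>Compare with the vacuum rate</td>
      </tr>
      <tr>
          <td>Vacuum processing rate</td>
          <td><code>cost_limit / cost_delay / cost_per_page × page_size</code>, on the order of MB/s to tens of MB/s</td>
          <td>Compare with the dead-tuple rate</td>
      </tr>
      <tr>
          <td>autovacuum_max_workers</td>
          <td>(partition count + hot table count) / 3-5</td>
          <td>100+ partitions require more workers</td>
      </tr>
      <tr>
          <td>maintenance_work_mem</td>
          <td>1-2GB per vacuum worker (at most 1GB is used for dead tuples before PostgreSQL 17)</td>
          <td>Size the memory ceiling for when all workers run at once</td>
      </tr>
      <tr>
          <td>Anti-wraparound trigger frequency</td>
          <td>Default 200M xid; write-heavy workloads hit it about every 1-2 weeks</td>
          <td>After raising to 1B, about every 2-3 months</td>
      </tr>
      <tr>
          <td>Dead-tuple ratio (bloat proxy)</td>
          <td><code>pg_stat_user_tables.n_dead_tup / n_live_tup</code></td>
          <td>&gt; 50% means vacuum cannot keep up</td>
      </tr>
  </tbody>
</table>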
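<p>As a worked example of the vacuum-rate row: with PostgreSQL 12+ defaults (<code>vacuum_cost_limit = 200</code>, <code>autovacuum_vacuum_cost_delay = 2ms</code>, <code>vacuum_cost_page_dirty = 20</code>) and the pessimistic assumption that every page vacuum touches gets dirtied, the throughput ceiling can be estimated as follows. Treat it as a back-of-the-envelope bound, not a benchmark:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 200 cost units per 2 ms sleep cycle, 20 units per dirtied 8 kB page
SELECT round(200 / 0.002)                               AS cost_units_per_sec,   -- 100000
       round(200 / 0.002 / 20)                          AS dirty_pages_per_sec,  -- 5000
       round(200 / 0.002 / 20 * 8192 / 1024 / 1024, 1)  AS mib_per_sec;          -- 39.1
</code></pre></div>
<p>Pages that are only read cost less per page, so real throughput is usually higher than this bound; raising <code>cost_limit</code> to 2000 scales all three numbers by ten.</p>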
<p>Practical starting points:</p>
<ul>
<li>OLTP write-heavy (events / orders): cost_limit 2000-5000, scale_factor 0.05, freeze_max_age 100M</li>
<li>OLTP read-heavy (users / config): the defaults are fine</li>
<li>Append-only log: scale_factor 0.5, freeze_max_age 800M (needs a global setting at least that high), <code>autovacuum_enabled = false</code> for cold partitions</li>
</ul>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-partitioning-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">partitioning</a></h3>
<p>Partitioning is the long-term answer to vacuum problems:</p>
<ul>
<li>On a large table (&gt; 100GB) vacuum time grows linearly with size; after partitioning, vacuum only touches the recent partitions</li>
<li>Cold partitions can be taken out of routine autovacuum with <code>autovacuum_enabled = false</code>, since new data lands only in hot partitions</li>
<li>Downside: when the partition count explodes, autovacuum_max_workers has to go up as well</li>
</ul>
<h3 id="跟-monitoring-整合">Integration with monitoring</h3>
<p>Key metrics:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- dead-tuple ratio
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">relname</span><span class="p">,</span><span class="w"> </span><span class="n">n_dead_tup</span><span class="p">,</span><span class="w"> </span><span class="n">n_live_tup</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">       </span><span class="n">round</span><span class="p">(</span><span class="n">n_dead_tup</span><span class="p">::</span><span class="nb">numeric</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="k">nullif</span><span class="p">(</span><span class="n">n_live_tup</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">dead_pct</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_user_tables</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">n_live_tup</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">1000</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">n_dead_tup</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">20</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- vacuum progress
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_progress_vacuum</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- distance to xid wraparound
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">datname</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">(</span><span class="n">datfrozenxid</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_database</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span></span></span></code></pre></div><p>Three Prometheus alerts: <code>dead_pct &gt; 30</code>, <code>vacuum_running_seconds &gt; 3600</code>, <code>xid_age &gt; 500000000</code>.</p>
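<h3 id="跟-backup-window">Coordinating with the backup window</h3>
<p>Running VACUUM FREEZE ahead of the backup window keeps freeze work off the peak. It does not shrink a physical backup by itself: freezing rewrites pages and generates WAL, so schedule it to finish before the backup starts rather than overlap with it:</p>
<ol>
<li>In the weekly maintenance window, run <code>VACUUM (FREEZE, ANALYZE) hot_table</code> to freeze ahead of time and refresh statistics</li>
<li>Avoid long transactions before the backup so that vacuum can make progress</li>
</ol>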
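<h3 id="下一步議題">Topics for next steps</h3>
<ul>
<li><strong>HOT updates and fillfactor</strong>: an UPDATE can reuse free space in the same page; fillfactor 80 leaves a 20% buffer per page on hot tables</li>
<li><strong><code>pg_repack</code> vs <code>VACUUM FULL</code></strong>: online vs offline; choosing a long-term maintenance tool</li>
<li><strong>PostgreSQL 13+ parallel vacuum</strong>: index vacuuming runs in parallel for manual <code>VACUUM (PARALLEL n)</code> (autovacuum does not use it); large tables with several indexes benefit most</li>
</ul>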
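<p>For the first topic, a small sketch of how the HOT ratio could be checked before and after lowering fillfactor. <code>events</code> stands in for a hot table here, and 80 is the value mentioned in the list above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Share of updates that were HOT (no index maintenance needed)
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 20;

-- Leave 20% of each heap page free for same-page updates.
-- Only newly written pages honor the setting; existing pages keep their layout
-- until the table is rewritten (for example by VACUUM FULL or pg_repack).
ALTER TABLE events SET (fillfactor = 80);
</code></pre></div>
<p>A low <code>hot_pct</code> on a heavily updated table suggests either full pages or updates to indexed columns; fillfactor only helps with the former.</p>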
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Upstream chapter: <a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">High Concurrency Access</a> (vacuum is one part of concurrency governance)</li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a> / <a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">MVCC + Lock Model</a> (why dead tuples exist and how they interact with locks)</li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/declarative-partitioning/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/declarative-partitioning/</guid><description>&lt;blockquote>
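&lt;p>This is an implementation-layer deep article under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already explains that large tables (&amp;gt; 1TB) need partitioning; this article focuses on &lt;em>where the real value of partitioning lies and why most people get their first partitioning wrong&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="partition-不是把大表切小是讓-planner-pruning--縮小-maintenance-scope">Partitioning is not "cutting a big table into small ones"; it is "enabling planner pruning + shrinking the maintenance scope"&lt;/h2>
&lt;p>People new to partitioning mostly start from the intuition "the table is too big, cut it smaller". After cutting, they find that &lt;em>queries got slower&lt;/em> (the planner still looks at every partition), &lt;em>INSERTs got slower&lt;/em> (trigger / partition routing overhead), and &lt;em>backups did not get shorter&lt;/em> (the total data volume is unchanged). The intuition is wrong: the engineering value of partitioning comes from two mechanisms that have no direct relation to "cutting smaller":&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Query planner pruning&lt;/strong>: at planning time the planner &lt;em>skips&lt;/em> partitions that cannot match the condition on the partition key, so the query scans only the relevant partitions. The prerequisite is that &lt;em>the WHERE clause includes the partition key&lt;/em>; otherwise the planner has to consider every partition and performance ends up worse than with a single table&lt;/li>
&lt;li>&lt;strong>Smaller maintenance scope&lt;/strong>: vacuum / index rebuild / DROP / archive touch a single partition instead of scanning the whole table; a 12-hour vacuum becomes 30 minutes and dropping old data takes about 0.01 seconds. This is where partitioning really pays for itself&lt;/li>
&lt;/ol>
&lt;p>Partitioning is designed &lt;em>for maintenance and planner pruning&lt;/em>, not for "making the table smaller". Miss this framing and the partition configuration will be wrong.&lt;/p>
&lt;h2 id="range--list--hashpartition-策略對應業務形狀">RANGE / LIST / HASH: partition strategies map to the shape of the business data&lt;/h2>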





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- RANGE: time series, logs, events (most common)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">bigint&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">event_time&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">timestamptz&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">payload&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">jsonb&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">RANGE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">event_time&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events_2026_05&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">OF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FOR&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;2026-05-01&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;2026-06-01&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- LIST: tenant ID / region / status enum
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">bigint&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">tenant_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">int&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">...&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">LIST&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tenant_id&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders_tenant_premium&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">OF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FOR&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1001&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">1002&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">1003&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- HASH: even distribution (no natural partition key)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">23&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">bigint&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">24&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">...&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">25&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">HASH&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">26&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">27&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users_0&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">OF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">28&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FOR&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WITH&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">MODULUS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">REMAINDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Keys to choosing a strategy:&lt;/p></description><content:encoded><![CDATA[<blockquote>
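<p>This is an implementation-layer deep article under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already explains that large tables (&gt; 1TB) need partitioning; this article focuses on <em>where the real value of partitioning lies and why most people get their first partitioning wrong</em>.</p></blockquote>
<h2 id="partition-不是把大表切小是讓-planner-pruning--縮小-maintenance-scope">Partitioning is not "cutting a big table into small ones"; it is "enabling planner pruning + shrinking the maintenance scope"</h2>
<p>People new to partitioning mostly start from the intuition "the table is too big, cut it smaller". After cutting, they find that <em>queries got slower</em> (the planner still looks at every partition), <em>INSERTs got slower</em> (trigger / partition routing overhead), and <em>backups did not get shorter</em> (the total data volume is unchanged). The intuition is wrong: the engineering value of partitioning comes from two mechanisms that have no direct relation to "cutting smaller":</p>
<ol>
<li><strong>Query planner pruning</strong>: at planning time the planner <em>skips</em> partitions that cannot match the condition on the partition key, so the query scans only the relevant partitions. The prerequisite is that <em>the WHERE clause includes the partition key</em>; otherwise the planner has to consider every partition and performance ends up worse than with a single table</li>
<li><strong>Smaller maintenance scope</strong>: vacuum / index rebuild / DROP / archive touch a single partition instead of scanning the whole table; a 12-hour vacuum becomes 30 minutes and dropping old data takes about 0.01 seconds. This is where partitioning really pays for itself</li>
</ol>
<p>Partitioning is designed <em>for maintenance and planner pruning</em>, not for "making the table smaller". Miss this framing and the partition configuration will be wrong.</p>
<h2 id="range--list--hashpartition-策略對應業務形狀">RANGE / LIST / HASH: partition strategies map to the shape of the business data</h2>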
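<p>Both mechanisms can be checked directly. The sketch below assumes the <code>events</code> table partitioned by <code>RANGE (event_time)</code> that is defined in the next section; the old partition <code>events_2025_05</code> and the <code>payload</code> shape are hypothetical and only serve the illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- WHERE includes the partition key: the plan should touch only events_2026_05
EXPLAIN SELECT count(*) FROM events
WHERE event_time &gt;= '2026-05-10' AND event_time &lt; '2026-05-11';

-- No partition key in WHERE: the plan has to scan every partition
EXPLAIN SELECT count(*) FROM events
WHERE payload @&gt; '{"type": "login"}';

-- Retiring old data is a metadata operation, not a row-by-row DELETE
ALTER TABLE events DETACH PARTITION events_2025_05;
DROP TABLE events_2025_05;
</code></pre></div>
<p>Comparing the two <code>EXPLAIN</code> outputs is a quick way to verify that a query shape actually benefits from pruning before committing to a partition key.</p>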





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- RANGE: time series, logs, events (most common)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">event_time</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2026_05</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-06-01&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- LIST: tenant ID / region / status enum
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">  </span><span class="n">tenant_id</span><span class="w"> </span><span class="nb">int</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">  </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">LIST</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders_tenant_premium</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="mi">1001</span><span class="p">,</span><span class="w"> </span><span class="mi">1002</span><span class="p">,</span><span class="w"> </span><span class="mi">1003</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"></span><span class="c1">-- HASH: even distribution (no natural partition key)
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="n">user_id</span><span class="w"> </span><span class="nb">bigint</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">  </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">HASH</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users_0</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">OF</span><span class="w"> </span><span class="n">users</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">MODULUS</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="n">REMAINDER</span><span class="w"> </span><span class="mi">0</span><span class="p">);</span></span></span></code></pre></div><p>Key points when choosing a strategy:</p>
<ul>
<li><strong>RANGE</strong> fits <em>time or other ordered values</em>: most queries carry <code>WHERE event_time &gt;= X</code>, so pruning is most effective; archiving or dropping old data is a near-instant metadata operation (<code>DETACH PARTITION</code>, or <code>DROP TABLE</code> on the partition) rather than a bulk <code>DELETE</code></li>
<li><strong>LIST</strong> fits <em>discrete enums or tenants</em>: queries with <code>WHERE tenant_id = X</code> prune to a single partition; the drawback is that every new tenant needs a partition added by hand (<code>CREATE TABLE ... PARTITION OF</code> or <code>ATTACH PARTITION</code>)</li>
<li><strong>HASH</strong> fits <em>even distribution when there is no natural key</em>: queries are mostly by-PK lookups and HASH keeps partition sizes uniform; pruning only kicks in for equality predicates such as <code>WHERE hash_key = X</code></li>
</ul>
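<p>A minimal sketch of adding a LIST partition for a new tenant, reusing the <code>orders</code> table from the example above. The partition names <code>orders_tenant_2001</code> and <code>orders_default</code> and the tenant ID 2001 are illustrative assumptions, not part of the original schema; DEFAULT partitions require PG 11+.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- New tenant: create its partition explicitly (hypothetical tenant ID 2001)
CREATE TABLE orders_tenant_2001 PARTITION OF orders
  FOR VALUES IN (2001);

-- Optional catch-all so inserts for unlisted tenants do not fail (PG 11+)
CREATE TABLE orders_default PARTITION OF orders DEFAULT;</code></pre></div>
<p>Without a DEFAULT partition, an insert for a <code>tenant_id</code> that no partition lists is rejected, which is why the tenant onboarding path has to create the partition first.</p>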
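<h3 id="選錯-partition-key-是最常見的錯誤">Choosing the wrong partition key is the most common mistake</h3>
<p>Example: the events table is HASH-partitioned on <code>user_id</code>, but most queries filter on <code>WHERE event_time BETWEEN ...</code> and never mention <code>user_id</code>. The planner cannot prune, scans every partition, and performs worse than a single table would (it pays per-partition planning and <code>Append</code> overhead for no benefit).</p>
<p>The partition key <em>must</em> match the WHERE filter that queries use most often; get it wrong and you end up in the awkward state where <em>maintenance gets easier but queries get slower</em>.</p>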
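<p>The mismatch can be reproduced with a small sketch. The table <code>events_by_user</code> and its four partitions are hypothetical names chosen for illustration; the exact plan shape depends on the PostgreSQL version.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical table: HASH-partitioned on user_id, but queried by time
CREATE TABLE events_by_user (
  user_id bigint NOT NULL,
  event_time timestamptz NOT NULL,
  payload jsonb
) PARTITION BY HASH (user_id);

CREATE TABLE events_by_user_0 PARTITION OF events_by_user
  FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_by_user_1 PARTITION OF events_by_user
  FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_by_user_2 PARTITION OF events_by_user
  FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_by_user_3 PARTITION OF events_by_user
  FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- The filter never mentions user_id, so no partition can be excluded
EXPLAIN SELECT * FROM events_by_user
WHERE event_time BETWEEN &#39;2026-05-01&#39; AND &#39;2026-05-08&#39;;

-- An equality filter on the partition key prunes to a single partition
EXPLAIN SELECT * FROM events_by_user WHERE user_id = 42;</code></pre></div>
<p>The first plan should list all four partitions under <code>Append</code>; the second should touch only one of them.</p>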
<h2 id="partition-pruningplanner-怎麼決定跳過">Partition pruning: how the planner decides what to skip</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-05-15&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- Expected output includes:
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1">--  Append (cost=...)
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1">--    -&gt; Seq Scan on events_2026_05  (cost=...)
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1">-- (only one partition is scanned; the others are pruned)</span></span></span></code></pre></div><p>When pruning kicks in:</p>
<ol>
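<li>The WHERE clause compares the partition key against a <em>constant expression</em>: <code>WHERE x = 5</code> prunes at planning time; <code>WHERE x = some_function()</code> does not when the function is not immutable, although execution-time pruning (PG 11+) can still remove partitions</li>
<li>PG 11+ supports <em>execution-time pruning</em>: the plan references the partition key but the value is only known at run time (prepared-statement parameters, the inner side of a nested-loop join)</li>
<li>When the partition key is absent from WHERE, <em>every partition is scanned</em>; treat this as a warning sign that the partition strategy does not match the workload</li>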
</ol>
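<p>A hedged sketch of how execution-time pruning can be observed on the <code>events</code> table defined earlier. The prepared-statement name <code>recent_events</code> is an assumption for illustration, and <code>plan_cache_mode</code> requires PG 12+.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- now() is STABLE, not IMMUTABLE, so the bound is unknown at planning time
EXPLAIN (ANALYZE)
SELECT * FROM events
WHERE event_time &gt;= now() - interval &#39;7 days&#39;;

-- Generic plan for a prepared statement: the parameter is bound at execution
PREPARE recent_events(timestamptz) AS
  SELECT * FROM events WHERE event_time &gt;= $1;
SET plan_cache_mode = force_generic_plan;  -- PG 12+
EXPLAIN (ANALYZE) EXECUTE recent_events(&#39;2026-05-01&#39;);</code></pre></div>
<p>When partitions are removed at executor startup, the <code>Append</code> node in the output carries a <code>Subplans Removed</code> count instead of listing those partitions.</p>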
<h3 id="partition-wise-join--aggregate-pg-11">Partition-wise join / aggregate (PG 11+)</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SET</span><span class="w"> </span><span class="n">enable_partitionwise_join</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">on</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="n">enable_partitionwise_aggregate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">on</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- When two tables with the same partition strategy are joined, the planner can join them partition by partition
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">events_metadata</span><span class="w"> </span><span class="n">m</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">ON</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">event_time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">event_time</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">;</span></span></span></code></pre></div><p>Both tables need a <em>matching partition strategy</em> (the same partition key and, before PG 13, exactly the same partition bounds). Align this at design time; it is hard to change later.</p>
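<p>A sketch of what an aligned companion table might look like. The columns of <code>events_metadata</code> are assumptions (the article does not define them), and the example presumes <code>events</code> has the same set of monthly partitions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical companion table, partitioned exactly like events
CREATE TABLE events_metadata (
  event_time timestamptz NOT NULL,
  source text
) PARTITION BY RANGE (event_time);

CREATE TABLE events_metadata_2026_05 PARTITION OF events_metadata
  FOR VALUES FROM (&#39;2026-05-01&#39;) TO (&#39;2026-06-01&#39;);
CREATE TABLE events_metadata_2026_06 PARTITION OF events_metadata
  FOR VALUES FROM (&#39;2026-06-01&#39;) TO (&#39;2026-07-01&#39;);

SET enable_partitionwise_join = on;

-- Inspect the plan: per-partition joins appear under Append when it applies
EXPLAIN
SELECT * FROM events e JOIN events_metadata m
  ON e.event_time = m.event_time;</code></pre></div>
<p>If the planner uses a partition-wise join, the plan shows one join per matching partition pair beneath an <code>Append</code>, rather than a single join above two <code>Append</code> nodes.</p>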
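<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1partition-key-選錯query-變慢">Case 1: wrong partition key, queries slow down</h3>
<p><strong>Symptom</strong>: after partitioning, a specific query goes from 200ms to 2000ms; EXPLAIN shows every partition under <code>Append</code> being scanned and none pruned.</p>
<p><strong>Root cause</strong>: the table is HASH-partitioned on <code>user_id</code>, but queries mostly use <code>WHERE created_at BETWEEN X AND Y</code>; the predicate never references the partition key, so the planner cannot exclude any partition and has to scan them all.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Verify first</strong>: before partitioning, use <code>pg_stat_statements</code> to inspect the WHERE patterns of the top 10 queries; the partition key must match the filter behind roughly 80% of that traffic</li>
<li><strong>Correct</strong>: a partition strategy cannot be altered in place, so create a new table partitioned by RANGE on <code>created_at</code> and migrate the data into it, for example by reloading each partition with <code>pg_dump --section=data</code></li>
<li><strong>Prevent</strong>: partitioning is expensive to reverse; do not partition until the query patterns are well understood at design time</li>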
</ol>
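<p>The verification step can be sketched as a query against <code>pg_stat_statements</code>. This assumes the extension is listed in <code>shared_preload_libraries</code>; the column is <code>total_exec_time</code> on PG 13+ and <code>total_time</code> on older releases.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Requires pg_stat_statements in shared_preload_libraries
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by cumulative execution time
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,  -- total_time before PG 13
       left(query, 120) AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;</code></pre></div>
<p>Read the WHERE clauses in <code>query_head</code>: the column that appears in most of the heavy statements is the candidate partition key.</p>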
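<h3 id="case-2cross-partition-unique-constraint-不-enforce">Case 2: cross-partition unique constraint is not enforced</h3>
<p><strong>Symptom</strong>: after partitioning, the application writes duplicate user_email values and the unique constraint does not stop them; the database holds several rows with the same email.</p>
<p><strong>Root cause</strong>: a <code>UNIQUE</code> constraint on a PostgreSQL partitioned table <em>must include the partition key</em>. <code>UNIQUE (email)</code> on a table partitioned by <code>tenant_id</code> <em>cannot be enforced</em> (PostgreSQL refuses to create it). The workaround <code>UNIQUE (email, tenant_id)</code> only guarantees uniqueness within a tenant, while the business rule is "email is globally unique", which PG cannot guarantee here.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Architecture</strong>: cross-partition uniqueness has to be enforced at the <em>application layer</em> (lock-and-check pattern)</li>
<li><strong>Alternative</strong>: keep the uniqueness target in a <em>non-partitioned</em> table (user_email_registry) and look it up before each write</li>
<li><strong>Design-time check</strong>: if you partition by X, every unique constraint must include X; if the business needs uniqueness without X, the partition strategy is wrong</li>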
</ol>
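<p>One way the registry alternative could look, as a sketch only. The <code>accounts</code> table, its partitions, and the registry columns are hypothetical; the article names only <code>user_email_registry</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical non-partitioned registry: its primary key gives global uniqueness
CREATE TABLE user_email_registry (
  email text PRIMARY KEY,
  tenant_id int NOT NULL
);

-- Hypothetical partitioned table; its unique constraint must include tenant_id
CREATE TABLE accounts (
  tenant_id int NOT NULL,
  email text NOT NULL,
  UNIQUE (tenant_id, email)
) PARTITION BY LIST (tenant_id);

CREATE TABLE accounts_t1 PARTITION OF accounts FOR VALUES IN (1);
CREATE TABLE accounts_t2 PARTITION OF accounts FOR VALUES IN (2);

-- Claim the email first; a duplicate fails here and aborts the transaction
BEGIN;
INSERT INTO user_email_registry (email, tenant_id) VALUES (&#39;a@example.com&#39;, 1);
INSERT INTO accounts (tenant_id, email) VALUES (1, &#39;a@example.com&#39;);
COMMIT;</code></pre></div>
<p>A later attempt to register the same email under tenant 2 fails on the registry's primary key before it ever reaches <code>accounts</code>, so the database enforces the rule instead of an application-side check.</p>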
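<h3 id="case-3attach-partition-鎖表太久">Case 3: ATTACH PARTITION holds its lock too long</h3>
<p><strong>Symptom</strong>: <code>ATTACH PARTITION</code> for the new monthly partition runs for 30 seconds; reads on the whole events table block for the duration and the application sees a wave of timeouts.</p>
<p><strong>Root cause</strong>: on PG 11 and earlier, <code>ATTACH PARTITION</code> takes an <code>ACCESS EXCLUSIVE</code> lock on the parent table. PG 12+ reduces this to <code>SHARE UPDATE EXCLUSIVE</code> on the parent, but still takes <code>ACCESS EXCLUSIVE</code> on the table being attached and on the default partition, if one exists. In both cases the command scans the entire new partition to verify the partition bound; a large partition without a pre-validated CHECK constraint makes the lock time blow up.</p>
<p><strong>Fix</strong>:</p>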





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. Add a CHECK constraint to the partition to be attached; NOT VALID skips the scan
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2026_06</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">events_2026_06_range</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-06-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-07-01&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">VALID</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- 2. VALIDATE takes a SHARE UPDATE EXCLUSIVE lock, which still allows reads and writes
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2026_06</span><span class="w"> </span><span class="n">VALIDATE</span><span class="w"> </span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="n">events_2026_06_range</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- 3. ATTACH no longer needs to scan (the CHECK is already validated)
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">ATTACH</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">events_2026_06</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="k">FOR</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-06-01&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;2026-07-01&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- ATTACH becomes near-instant</span></span></span></code></pre></div><h3 id="case-4partition-數爆炸planner-planning-time-爆">Case 4: partition count explodes, planning time balloons</h3>
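<p><strong>Symptom</strong>: partitions accumulate past 500 (daily partitions kept for 1-2 years); EXPLAIN on a simple query shows <code>Planning Time</code> climbing from 1ms to 200ms and application responses slow down.</p>
<p><strong>Root cause</strong>: the more partitions there are, the more the planner has to consider; even with pruning, the planning phase still has to walk the partition list. Past roughly 500 partitions the planning overhead becomes pronounced.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Architecture</strong>: match partition granularity to retention; do not keep daily partitions for 2 years (switch to weekly or monthly)</li>
<li><strong>Archive old partitions</strong>: DETACH them and turn them into cold-storage tables that the planner no longer sees</li>
<li><strong><code>enable_partition_pruning</code></strong> defaults to on; confirm it has not been disabled</li>
<li><strong>PG 12+</strong>: the planner handles large partition lists more efficiently, which raises the ceiling on partition count, but the count still needs to be kept in check</li>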
</ol>
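<p>A small sketch for checking where a table stands against these thresholds, using the <code>events</code> table from earlier. <code>pg_partition_tree</code> is available on PG 12+; the date range in the second query is illustrative.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Count leaf partitions under events (pg_partition_tree needs PG 12+)
SELECT count(*) AS leaf_partitions
FROM pg_partition_tree(&#39;events&#39;)
WHERE isleaf;

-- Compare "Planning Time" against "Execution Time" for a simple pruned query
EXPLAIN (ANALYZE)
SELECT count(*) FROM events
WHERE event_time &gt;= &#39;2026-05-01&#39; AND event_time &lt; &#39;2026-05-02&#39;;</code></pre></div>
<p>If planning time dominates execution time for a query that prunes down to one partition, the partition count is the likely cause.</p>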
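<h3 id="case-5detach-後磁碟空間沒回收">Case 5: disk space not reclaimed after DETACH</h3>
<p><strong>Symptom</strong>: after DETACH PARTITION, <code>pg_database_size</code> does not drop even though 50GB was expected to be freed; the disk is still full.</p>
<p><strong>Root cause</strong>: DETACH only <em>separates</em> the partition from the parent table; the partition lives on as a standalone table. Actually freeing the space requires <code>DROP TABLE detached_partition</code>. The SRE assumed DETACH meant delete.</p>
<p><strong>Fix</strong>:</p>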





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Full procedure
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">DETACH</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">events_2024_01</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- events_2024_01 still exists and still occupies disk
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- After confirming no query still uses it
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2024_01</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- Only now is the disk space released</span></span></span></code></pre></div><h3 id="routinearchive-workflow">Routine: archive workflow</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Run at month end:
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1">-- 1. detach the partition from 13 months ago
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="n">DETACH</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">events_2025_04</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 2. dump to cold storage (\copy is a psql meta-command and is case-sensitive)
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="err">\</span><span class="k">copy</span><span class="w"> </span><span class="n">events_2025_04</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="s1">&#39;/cold/events_2025_04.csv&#39;</span><span class="w"> </span><span class="p">(</span><span class="n">FORMAT</span><span class="w"> </span><span class="n">CSV</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- 3. drop to release disk space
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_2025_04</span><span class="p">;</span></span></span></code></pre></div><h2 id="容量規劃">Capacity planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Estimate</th>
          <th>Warning threshold</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Size per partition</td>
          <td>Align with what a single-table vacuum can comfortably handle (10-100GB sweet spot)</td>
          <td>Above 200GB, consider sub-partitioning or a finer granularity</td>
      </tr>
      <tr>
          <td>Partition count</td>
          <td>Retention × granularity</td>
          <td>Above 200 partitions, planning time starts to show</td>
      </tr>
      <tr>
          <td>Partition key cardinality</td>
          <td>LIST: &lt; 100 / HASH: modulus of your choice / RANGE: time plus a dimension</td>
          <td>With too many distinct partition values, use HASH</td>
      </tr>
      <tr>
          <td>Share of cross-partition queries</td>
          <td>Count the partitions scanned in EXPLAIN</td>
          <td>If more than 30% of queries scan more than 50% of partitions, the key is wrong</td>
      </tr>
      <tr>
          <td>Maintenance window</td>
          <td>DROP / DETACH / ATTACH are managed per partition</td>
          <td>Maintenance on hot partitions still belongs in a maintenance window</td>
      </tr>
  </tbody>
</table>
<p>Practical defaults:</p>
<ul>
<li>Time series (events / logs): monthly RANGE partitions, 12-24 months of retention</li>
<li>Multi-tenant (orders / records): LIST partitions on tenant_id, with each large tenant in its own partition</li>
<li>Even distribution (users / metrics): 8-16 HASH partitions, 50-100GB per partition</li>
</ul>
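<p>To compare real partition sizes against these ranges, a query along the following lines may help. It is a sketch that assumes PG 12+ (for <code>pg_partition_tree</code>) and uses the <code>events</code> table from earlier.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Size of every leaf partition, largest first (includes indexes and TOAST)
SELECT relid AS partition_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_partition_tree(&#39;events&#39;)
WHERE isleaf
ORDER BY pg_total_relation_size(relid) DESC;</code></pre></div>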
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-autovacuum-tuning-整合">Integrating with <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a></h3>
<p>Partitioning is the long-term answer to autovacuum problems:</p>
<ol>
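<li>Tighten autovacuum on hot partitions (scale_factor 0.05, cost_limit 5000)</li>
<li>Set <code>autovacuum_enabled = false</code> on cold partitions (anti-wraparound vacuum still runs)</li>
<li>A large number of partitions can saturate <code>autovacuum_max_workers</code>, which then needs to be raised</li>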
</ol>
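<p>The per-partition settings above could be applied roughly as follows. The choice of <code>events_2026_05</code> as the hot partition and <code>events_2025_06</code> as a cold one is an assumption for illustration; storage parameters are set on the leaf partitions, not on the partitioned parent.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hot partition: vacuum earlier and with a larger cost budget
ALTER TABLE events_2026_05 SET (
  autovacuum_vacuum_scale_factor = 0.05,
  autovacuum_vacuum_cost_limit = 5000
);

-- Cold partition: stop routine autovacuum (anti-wraparound vacuum still runs)
ALTER TABLE events_2025_06 SET (autovacuum_enabled = false);</code></pre></div>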
<h3 id="跟-index-設計整合">Integrating with index design</h3>
<p>Handling indexes on a partitioned table:</p>
<ol>
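<li>PG 11+ partitioned index: <code>CREATE INDEX ON partitioned_table (...)</code> automatically builds a local index on every partition (this is not a true global index; PostgreSQL does not have one)</li>
<li><strong>No cross-partition unique index</strong>: uniqueness is only enforced per partition, so a unique index must include the partition key</li>
<li><strong>Per-partition index scans</strong>: on PG 11+, together with partition-wise join, index lookups on different partitions can run in parallel</li>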
</ol>
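<p>On a busy table, a plain <code>CREATE INDEX</code> on the parent blocks writes while every partition's index is built. A commonly documented alternative, sketched here with hypothetical index names, builds each partition's index concurrently and then attaches it:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 1. Create the parent index only; it stays invalid until all children attach
CREATE INDEX events_event_time_idx ON ONLY events (event_time);

-- 2. Build each partition&#39;s index without blocking writes
CREATE INDEX CONCURRENTLY events_2026_05_event_time_idx
  ON events_2026_05 (event_time);

-- 3. Attach it; the parent index becomes valid once every partition is attached
ALTER INDEX events_event_time_idx
  ATTACH PARTITION events_2026_05_event_time_idx;</code></pre></div>
<p>Steps 2 and 3 are repeated for each existing partition.</p>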
<h3 id="跟-backup--pitr">Integrating with backup / PITR</h3>
<p>Partitioning is no substitute for backups, but it can speed up a <em>partial restore</em>:</p>
<ol>
<li>Restore only the partitions for a given time range instead of the whole table</li>
<li>This maps to the partial recovery scenario in <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL archiving</a></li>
</ol>
<h3 id="下一步議題">Topics for next steps</h3>
<ul>
<li><strong>Sub-partitioning</strong>: partitions within partitions (time plus tenant), a fit for multi-tenant time-series data</li>
<li><strong>pg_partman extension</strong>: creates monthly partitions automatically, with no hand-written cron job</li>
<li><strong>Foreign key to partitioned table</strong> (PG 12+): FKs are enforced across partitions, but cascades come with many restrictions</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Upstream chapter: <a href="/blog/backend/01-database/schema-design/" data-link-title="1.2 Schema Design 與資料建模" data-link-desc="整理 table、index、key、partition、denormalization 與命名規則">Schema Design</a> (partitioning is a schema decision)</li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a> / <a href="/blog/backend/01-database/vendors/postgresql/timescaledb-deep-dive/" data-link-title="TimescaleDB Deep Dive：Hypertable / Continuous Aggregate / Compression 把 PG 變 Time-Series DB" data-link-desc="TimescaleDB 是 PG extension（不是 fork）、用 *hypertable* 自動 partition by time、加 *continuous aggregate* 做 incremental materialized view、加 *compression* 對舊 chunk 壓 90%&#43;、把 PG 變成 InfluxDB / Prometheus 級 time-series DB。本文走 hypertable 機制、continuous aggregate 跟普通 MV 差異、compression policy、retention policy、5 production 踩雷（chunk size 不對 / CAGG refresh 落後 / compression 後 update 限制 / hypertable 不能加 FK / TimescaleDB 跟 PG 主版本對齊）、跟 PG 原生 partitioning 對比">TimescaleDB Deep Dive</a> (a hypertable is automated partitioning)</li>
<li>Follow-up route: <a href="/blog/backend/01-database/vendors/postgresql/partition-redesign/" data-link-title="PostgreSQL Partition Redesign：當 monthly partition 越跑越慢" data-link-desc="PostgreSQL partition redesign 是 Type F「topology re-layout」第 2 個 dogfood — 從 monthly partition 改 daily / 從 range 改 list / 從單軸改 sub-partition；6 維 audit 皆 Low &#43; topology 軸 High；涵蓋 partition 不平衡偵測、ATTACH/DETACH 線上重劃、5 個 production 踩雷、跟 partition_pruning &#43; autovacuum 整合">Partition Redesign</a> (migration playbook for re-laying-out a partition strategy)</li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Logical Replication + Debezium CDC: replication slot × failure × recovery comparison</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/</guid><description>&lt;blockquote>
&lt;p>This is an implementation-layer deep article under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview mentions logical decoding and Debezium CDC; this article focuses on mapping the &lt;em>replication slot lifecycle to 5 production failure modes and their recovery&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="replication-slot--failure--recovery-對照">Replication slot × Failure × Recovery comparison&lt;/h2>
&lt;p>Production issues with logical replication and Debezium CDC center on the &lt;em>replication slot&lt;/em>: it is the anchor point inside PostgreSQL that keeps WAL from being recycled, and a misconfigured slot takes the whole CDC pipeline down. How each failure mode affects the slot, and the recovery path:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Failure mode&lt;/th>
 &lt;th>Effect on slot&lt;/th>
 &lt;th>Symptom on primary&lt;/th>
 &lt;th>Recovery path&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Consumer stuck / lagging&lt;/td>
 &lt;td>slot LSN stops advancing, WAL is retained&lt;/td>
 &lt;td>&lt;code>pg_wal&lt;/code> directory keeps growing until the disk fills&lt;/td>
 &lt;td>Fix the consumer / add throttling / drop the slot if necessary&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Consumer crash without restart&lt;/td>
 &lt;td>slot persists (inactive once the connection is gone) and keeps retaining WAL&lt;/td>
 &lt;td>Same as lag; never cleaned up automatically&lt;/td>
 &lt;td>Manually run &lt;code>SELECT pg_drop_replication_slot('name')&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema change (ADD COLUMN)&lt;/td>
 &lt;td>Most plugins handle it automatically&lt;/td>
 &lt;td>Usually unnoticeable&lt;/td>
 &lt;td>-&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema change (DROP / RENAME COLUMN)&lt;/td>
 &lt;td>Most plugins break outright&lt;/td>
 &lt;td>Consumer logs errors; slot is active but not advancing&lt;/td>
 &lt;td>Rebuild the publication / redo the initial load&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Initial COPY&lt;/td>
 &lt;td>Snapshot runs when the slot is created; long-running tx&lt;/td>
 &lt;td>Locks and WAL are both affected while large tables are copied&lt;/td>
 &lt;td>Stage it with &lt;code>CREATE_REPLICATION_SLOT ... NOEXPORT_SNAPSHOT&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Promotion (failover)&lt;/td>
 &lt;td>Physical and logical slots are handled differently&lt;/td>
 &lt;td>Before PG 17, logical slots do not survive failover&lt;/td>
 &lt;td>PG 17+ failover slots (synchronized to the standby), or the consumer redoes the initial load&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replay storm (offset reset)&lt;/td>
 &lt;td>Slot unchanged; consumer re-reads&lt;/td>
 &lt;td>Kafka traffic spikes; application sees duplicates&lt;/td>
 &lt;td>Idempotent consumer design, or transactional outbox&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>The detailed configuration and recovery steps for each failure mode are covered section by section below.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This is an implementation-layer deep article under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview mentions logical decoding and Debezium CDC; this article focuses on mapping the <em>replication slot lifecycle to 5 production failure modes and their recovery</em>.</p></blockquote>
<h2 id="replication-slot--failure--recovery-對照">Replication slot × Failure × Recovery comparison</h2>
<p>Production issues with logical replication and Debezium CDC center on the <em>replication slot</em>: it is the anchor point inside PostgreSQL that keeps WAL from being recycled, and a misconfigured slot takes the whole CDC pipeline down. How each failure mode affects the slot, and the recovery path:</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>Effect on slot</th>
          <th>Symptom on primary</th>
          <th>Recovery path</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Consumer stuck / lagging</td>
          <td>slot LSN stops advancing, WAL is retained</td>
          <td><code>pg_wal</code> directory keeps growing until the disk fills</td>
          <td>Fix the consumer / add throttling / drop the slot if necessary</td>
      </tr>
      <tr>
          <td>Consumer crash without restart</td>
          <td>slot persists (inactive once the connection is gone) and keeps retaining WAL</td>
          <td>Same as lag; never cleaned up automatically</td>
          <td>Manually run <code>SELECT pg_drop_replication_slot('name')</code></td>
      </tr>
      <tr>
          <td>Schema change (ADD COLUMN)</td>
          <td>Most plugins handle it automatically</td>
          <td>Usually unnoticeable</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Schema change (DROP / RENAME COLUMN)</td>
          <td>Most plugins break outright</td>
          <td>Consumer logs errors; slot is active but not advancing</td>
          <td>Rebuild the publication / redo the initial load</td>
      </tr>
      <tr>
          <td>Initial COPY</td>
          <td>Snapshot runs when the slot is created; long-running tx</td>
          <td>Locks and WAL are both affected while large tables are copied</td>
          <td>Stage it with <code>CREATE_REPLICATION_SLOT ... NOEXPORT_SNAPSHOT</code></td>
      </tr>
      <tr>
          <td>Promotion (failover)</td>
          <td>Physical and logical slots are handled differently</td>
          <td>Before PG 17, logical slots do not survive failover</td>
          <td>PG 17+ failover slots (synchronized to the standby), or the consumer redoes the initial load</td>
      </tr>
      <tr>
          <td>Replay storm (offset reset)</td>
          <td>Slot unchanged; consumer re-reads</td>
          <td>Kafka traffic spikes; application sees duplicates</td>
          <td>Idempotent consumer design, or transactional outbox</td>
      </tr>
  </tbody>
</table>
<p>The sections below walk through the detailed configuration and recovery steps for each failure mode.</p>
<h2 id="logical-replication-基礎publication--subscription--slot">Logical replication basics: publication + subscription + slot</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Primary: create the publication
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="p">,</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Subscriber: create the subscription (this also creates the replication slot on the primary)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SUBSCRIPTION</span><span class="w"> </span><span class="n">app_sub</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">CONNECTION</span><span class="w"> </span><span class="s1">&#39;host=primary user=replicator dbname=app&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">  </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">slot_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;app_sub_slot&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">copy_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">true</span><span class="p">);</span></span></span></code></pre></div><p>Key objects:</p>
<ul>
<li><strong>publication</strong> (on the primary): declares <em>which tables and which operations (INSERT/UPDATE/DELETE/TRUNCATE)</em> are exposed</li>
<li><strong>subscription</strong> (on the subscriber, for PG-to-PG): subscribes, creates the slot automatically, and runs the initial COPY automatically</li>
<li><strong>replication slot</strong>: lives on the primary and guarantees that <em>WAL the consumer has not yet consumed</em> is not recycled</li>
</ul>
<p><code>copy_data = true</code> triggers the initial COPY (snapshot) followed by streaming; <code>copy_data = false</code> streams only, which suits cases where both sides are already in sync.</p>
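<p>To check that the three objects line up after running the statements above, the catalog views can be queried as in this minimal sketch. It assumes the names used in the example (<code>app_changes</code>, <code>app_sub</code>, <code>app_sub_slot</code>) and a role allowed to read these views:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- On the primary: which tables does the publication expose?
SELECT pubname, schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'app_changes';

-- On the primary: the slot created on behalf of the subscription
SELECT slot_name, plugin, slot_type, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'app_sub_slot';

-- On the subscriber: how far the apply worker has received
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription
WHERE subname = 'app_sub';</code></pre></div>
<p>A slot that exists with <code>active = false</code> usually means the subscriber is not connected; the slot still retains WAL in that state.</p>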
<h2 id="debezium-cdc用-logical-replication-slot-但繞過-subscription">Debezium CDC: uses a logical replication slot but bypasses subscriptions</h2>
<p>Debezium is not a PostgreSQL subscriber; it is an external consumer that <em>reads the replication slot directly</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-properties" data-lang="properties"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Debezium PostgreSQL connector</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="na">connector.class</span><span class="o">=</span><span class="s">io.debezium.connector.postgresql.PostgresConnector</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="na">database.hostname</span><span class="o">=</span><span class="s">primary</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">database.dbname</span><span class="o">=</span><span class="s">app</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># Built-in decoder plugin, recommended on PG 10+</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">plugin.name</span><span class="o">=</span><span class="s">pgoutput</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">slot.name</span><span class="o">=</span><span class="s">debezium_app</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="na">publication.name</span><span class="o">=</span><span class="s">app_changes</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Debezium creates the publication for the included tables</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="na">publication.autocreate.mode</span><span class="o">=</span><span class="s">filtered</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="na">table.include.list</span><span class="o">=</span><span class="s">public.orders,public.events</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># Take an initial snapshot, then switch to streaming</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="na">snapshot.mode</span><span class="o">=</span><span class="s">initial</span></span></span></code></pre></div><p>Differences:</p>
<ul>
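<li>Debezium decodes WAL with <code>pgoutput</code> (built in since PG 10) or <code>wal2json</code> (an external plugin; support depends on the Debezium version) and turns it into structured events sent to Kafka</li>
<li>Unlike a PG-to-PG subscription, Debezium has no subscription object; the replication slot is <em>managed by the external consumer itself</em></li>
<li>For failure modes, <em>the consumer is Debezium itself</em>, so lag comes from Debezium's processing speed or Kafka's write speed</li>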
</ul>
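<p>Because Debezium manages the slot from outside, some teams prefer to create the publication and slot themselves so that the connector role needs fewer privileges. The sketch below assumes the names from the connector config above and assumes the connector is then run with <code>publication.autocreate.mode=disabled</code>; treat it as one possible setup rather than the required one:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run on the primary before the connector starts.
CREATE PUBLICATION app_changes FOR TABLE public.orders, public.events;

-- Create the logical slot with the pgoutput plugin.
-- From this moment the slot retains WAL, so start the connector soon after.
SELECT slot_name, lsn
FROM pg_create_logical_replication_slot('debezium_app', 'pgoutput');</code></pre></div>
<p>The trade-off is operational: a pre-created slot that nobody consumes behaves exactly like the stuck-consumer case in the table at the top of this article.</p>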
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1consumer-lagslot-lsn-不前進primary-disk-爆">Case 1: consumer lag, slot LSN stalls, primary disk fills up</h3>
<p><strong>Symptoms</strong>: the <code>pg_wal</code> directory on the primary keeps growing and <code>df -h</code> shows the disk at 90%+; in <code>pg_replication_slots</code>, <code>confirmed_flush_lsn</code> is stuck at one LSN and <code>pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)</code> reports tens of GB.</p>
<p><strong>Root cause</strong>: the consumer (Debezium / subscriber) processes changes more slowly than the primary writes them; the replication slot <em>guarantees WAL is not recycled</em>, so unconsumed WAL accumulates.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Monitoring</strong>: a Prometheus alert on <code>pg_replication_slot_lag_bytes &gt; 5GB</code> catches the problem before the disk fills</li>
<li><strong>Fix the consumer</strong>: throttle writes on the primary, or scale up Debezium / subscriber throughput</li>
<li><strong>Emergency</strong>: <code>SELECT pg_drop_replication_slot('debezium_app')</code> releases the WAL, but the consumer must then redo the initial load because the dropped WAL leaves a gap in its data</li>
<li><strong>Architecture</strong>: set <em>max_slot_wal_keep_size</em> (PG 13+) to cap how much WAL a slot may retain; a slot that exceeds the cap is invalidated automatically, which protects the primary's disk</li>
</ol>
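<p>The monitoring and architecture items above can be sketched as follows. The query must run on the primary; <code>wal_status</code> and <code>safe_wal_size</code> assume PG 13+, and the 100GB value is only illustrative:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Retained WAL and consumer lag per slot, worst first
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS consumer_lag,
       wal_status,
       safe_wal_size
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

-- Cap how much WAL any slot may retain, then reload the configuration
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();</code></pre></div>
<p>Once the cap is in place, a <code>wal_status</code> of <code>unreserved</code> or <code>lost</code> is the signal that a slot is about to be, or already has been, invalidated.</p>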
<h3 id="case-2consumer-crash-後-slot-變-zombie">Case 2: the slot turns into a zombie after a consumer crash</h3>
<p><strong>Symptoms</strong>: the Debezium pod is OOM-killed, and the replacement pod cannot attach because the server reports <code>slot is active for PID X</code>; on the primary, <code>pg_replication_slots.active = true</code> and <code>active_pid</code> points to a backend whose client is already dead.</p>
<p><strong>Root cause</strong>: PostgreSQL marks a slot active based on <em>a connection currently holding it</em>; the consumer crashed, but the server never noticed that the connection was gone (no RST arrived over the network), so the slot stays active.</p>
<p><strong>Fixes</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Terminate the stale backend that still holds the slot
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_terminate_backend</span><span class="p">(</span><span class="n">active_pid</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="n">slot_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;debezium_app&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">active</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- Or drop the slot outright (retained changes are lost; the consumer must re-initialize)
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_drop_replication_slot</span><span class="p">(</span><span class="s1">&#39;debezium_app&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Prevention:</p>
<ol>
<li>Set PostgreSQL <code>tcp_keepalives_idle / interval / count</code> to shorter values (300 / 60 / 6) so that dropped connections are detected sooner</li>
<li>Use <em>graceful shutdown</em> on the consumer side, and call <code>pg_terminate_backend(active_pid)</code> before startup to clear any stale connection</li>
</ol>
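<p>The prevention items can be sketched in SQL as below. The keepalive values are the ones suggested above; the join simply shows which backend holds each active slot so that a stale one is easy to spot. <code>wal_sender_timeout</code> (60 s by default) is a related setting that also ends replication connections that stop responding:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Which backend holds each active slot, and when did it last change state?
SELECT s.slot_name, s.active_pid, a.client_addr, a.state,
       a.backend_start, a.state_change
FROM pg_replication_slots AS s
JOIN pg_stat_activity AS a ON a.pid = s.active_pid
WHERE s.active;

-- Shorter TCP keepalives; picked up by new connections after the reload
ALTER SYSTEM SET tcp_keepalives_idle = 300;
ALTER SYSTEM SET tcp_keepalives_interval = 60;
ALTER SYSTEM SET tcp_keepalives_count = 6;
SELECT pg_reload_conf();</code></pre></div>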
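<h3 id="case-3schema-changedrop--rename-column斷流">Case 3: a schema change (DROP / RENAME COLUMN) breaks the stream</h3>
<p><strong>Symptoms</strong>: the Debezium consumer suddenly stops producing messages and its log reports <code>column XYZ does not exist</code>; on the primary the slot is still active, but <code>confirmed_flush_lsn</code> stops advancing.</p>
<p><strong>Root cause</strong>: logical decoding reads each WAL record against the catalog as it was when the record was written, so pgoutput can still decode rows from before the DROP. The break usually happens on the consuming side: the connector, a schema registry compatibility check, or the subscriber's target table still expects the dropped or renamed column, rejects the change, and stops confirming LSNs while in-flight events in the old shape are still in the WAL.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Prevention</strong>: roll schema changes out with the <em>expand-contract pattern</em>
<ul>
<li>Phase 1: ADD COLUMN new_col (does not disturb logical replication)</li>
<li>Phase 2: the application dual-writes old + new</li>
<li>Phase 3: wait until consumers have caught up on every message that still carries the old column</li>
<li>Phase 4: DROP COLUMN old_col (by now no in-flight WAL carries old_col)</li>
</ul>
</li>
<li><strong>Emergency</strong>: drop the existing slot, recreate the publication and the slot, and have the consumer redo the initial load</li>
<li><strong>Long term</strong>: look for a way to refresh only the schema without reloading data. Debezium's <em>snapshot.mode=schema_only_recovery</em> does this for connectors that keep a schema history topic; confirm in the documentation for your Debezium version whether the PostgreSQL connector supports it</li>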
</ol>
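<p>The expand-contract phases can be sketched in SQL as follows. The column names <code>new_col</code> and <code>old_col</code> and the <code>orders</code> table are placeholders, and the <code>'0/0'</code> literal must be replaced with the LSN returned by the second statement; this is an outline of the waiting step, not a complete migration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Phase 1: expand
ALTER TABLE orders ADD COLUMN new_col text;

-- Phase 3: once the application has stopped writing old_col,
-- record the current WAL position as the cutover point
SELECT pg_current_wal_lsn() AS cutover_lsn;

-- Phase 3 (continued): wait until every logical slot has confirmed past it.
-- Replace '0/0' with the cutover_lsn value returned above.
SELECT slot_name,
       confirmed_flush_lsn,
       confirmed_flush_lsn &gt;= '0/0'::pg_lsn AS caught_up
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Phase 4: contract, only after caught_up is true for every slot
ALTER TABLE orders DROP COLUMN old_col;</code></pre></div>
<p>A caught-up slot only proves that the connector has read the old events; consumers further downstream (for example Kafka consumer groups) may need their own drain check before Phase 4.</p>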
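<h3 id="case-4initial-copy-大表鎖太久">Case 4: the initial COPY ties up a large table for too long</h3>
<p><strong>Symptoms</strong>: after running <code>CREATE SUBSCRIPTION ... WITH (copy_data=true)</code> against a 1TB table, application queries and writes on that table stall for 30+ minutes and application timeouts pile up.</p>
<p><strong>Root cause</strong>: by default the initial COPY of a table runs in a <em>single transaction</em>; the snapshot stays pinned at one LSN for the whole copy and the long transaction conflicts with vacuum, while the target table on the subscriber side is tied up by the load.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Staged initialization</strong>:</li>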
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Primary: create the publication (nothing is copied at this point)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">big_table</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Subscriber: create the subscription without the initial copy
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SUBSCRIPTION</span><span class="w"> </span><span class="n">app_sub</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="k">CONNECTION</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">  </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">app_changes</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">  </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">copy_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">false</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Then load the data manually, partition by partition (for a partitioned table),
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1">-- or take the snapshot with pg_dump / pg_basebackup</span></span></span></code></pre></div><ol start="2">
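<li><strong>Parallel initialization</strong>: <code>max_sync_workers_per_subscription = 4</code> copies several tables in parallel (the setting has existed since PG 10 and defaults to 2; it parallelizes across tables, not within a single table)</li>
<li><strong>Debezium alternative</strong>: use incremental snapshots (Debezium 1.6+), which copy data in the background in small chunks and do not hold one long transaction</li>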
</ol>
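<p>Another way to stage the load, sketched here under the assumption that the small tables can be synchronized first, is to add the large table to the publication later and let a refresh copy only that table. The table names reuse the examples in this article; the <code>f</code> state in the last query assumes PG 14+:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Primary: start with the small tables only
CREATE PUBLICATION app_changes FOR TABLE orders, events;

-- Primary, later: add the large table once the first batch is in sync
ALTER PUBLICATION app_changes ADD TABLE big_table;

-- Subscriber: pick up the new table; already-subscribed tables are not copied again
ALTER SUBSCRIPTION app_sub REFRESH PUBLICATION WITH (copy_data = true);

-- Subscriber: per-table sync state
-- i = initialize, d = data copy, f = copy finished, s = synchronized, r = ready
SELECT srrelid::regclass AS table_name, srsubstate, srsublsn
FROM pg_subscription_rel
ORDER BY srsubstate;</code></pre></div>
<p>This does not shorten the copy of the large table itself, but it lets the copy be scheduled into a quiet window without delaying the other tables.</p>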
<h3 id="case-5replay-storm-後-consumer-offset-reset">Case 5: replay storm after a consumer offset reset</h3>
<p><strong>Symptoms</strong>: after a Debezium bug fix or redeploy, <code>snapshot.mode=initial</code> triggers a full reload of the data; Kafka topic traffic spikes 10x and downstream applications see large numbers of duplicate events.</p>
<p><strong>Root cause</strong>: the Debezium offset store (a Kafka topic or a file) was deleted by mistake or corrupted; on restart the connector does not know which LSN to resume from and falls back to an initial snapshot by default.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Prevention</strong>: <em>back up the Debezium offset store together with</em> the Kafka cluster; do not rely on the Kafka topic alone</li>
<li><strong>Architecture</strong>: make the consumer side <em>idempotent</em> by using the (source LSN + transaction ID) carried in each event as the dedupe key</li>
<li><strong>Transactional outbox pattern</strong>: CDC captures only the outbox table, and the application writes the outbox row and the business data in the same transaction; the application dedupes any duplicates itself</li>
</ol>
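<p>An idempotent consumer can be sketched with a small ledger table keyed on the (source LSN + transaction ID) pair. Everything here is hypothetical: the <code>processed_events</code> and <code>order_audit</code> tables, the literal LSN, and the transaction ID stand in for whatever the real consumer receives:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical ledger and target table on the consuming side
CREATE TABLE IF NOT EXISTS processed_events (
    source_lsn   pg_lsn      NOT NULL,
    source_txid  bigint      NOT NULL,
    processed_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (source_lsn, source_txid)
);
CREATE TABLE IF NOT EXISTS order_audit (
    order_id   bigint NOT NULL,
    event_type text   NOT NULL
);

-- Apply one event: the side effect runs only if the key was not seen before
BEGIN;
WITH claimed AS (
    INSERT INTO processed_events (source_lsn, source_txid)
    VALUES ('0/16B3748', 742)
    ON CONFLICT DO NOTHING
    RETURNING 1
)
INSERT INTO order_audit (order_id, event_type)
SELECT 1001, 'order_created'
WHERE EXISTS (SELECT 1 FROM claimed);
COMMIT;</code></pre></div>
<p>Replaying the same event a second time inserts nothing into either table, which is what makes a replay storm harmless to downstream state.</p>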
<h2 id="容量規劃">Capacity planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Estimate</th>
          <th>Alert threshold</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Replication slot lag</td>
          <td><code>pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)</code></td>
          <td>&gt; 1GB of lag signals that the consumer is falling behind</td>
      </tr>
      <tr>
          <td>Primary <code>pg_wal</code> size</td>
          <td>retention × peak WAL rate</td>
          <td>Reserved disk capacity = max_slot_wal_keep_size + 30% buffer</td>
      </tr>
      <tr>
          <td>Debezium throughput</td>
          <td>~5-10K rows/s per connector; can be raised by parallelizing across tables</td>
          <td>Compare against the primary's write rate</td>
      </tr>
      <tr>
          <td>Initial COPY time</td>
          <td>100GB takes ~10-30 minutes (depends on network + subscriber IO)</td>
          <td>TB-scale loads must be staged</td>
      </tr>
      <tr>
          <td>Slot count</td>
          <td>Each slot pins WAL from its own restart LSN; total retention follows the slot that lags the most</td>
          <td>With 5+ concurrent slots, any single laggard pins WAL for the whole primary</td>
      </tr>
      <tr>
          <td>max_replication_slots</td>
          <td>Default 10; production setups where CDC and standbys each take slots should raise it to 20-50</td>
          <td>New slot creation is rejected once the limit is reached</td>
      </tr>
  </tbody>
</table>
<p>Practical defaults:</p>
<ul>
<li>Debezium in production: 1 connector per source schema; avoid 1 connector spanning 50 tables</li>
<li>Slot retention: <code>max_slot_wal_keep_size = 100GB</code>; a slot that exceeds it is invalidated to protect the primary</li>
<li>Monitoring cadence: sample lag every 1 minute and alert on a 5-minute threshold</li>
</ul>
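<p>Two of the dimensions in the table can be read directly from the server. The sketch below assumes a superuser or a member of <code>pg_monitor</code>, since <code>pg_ls_waldir()</code> is restricted:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Current size of pg_wal and number of segment files
SELECT pg_size_pretty(sum(size)) AS pg_wal_size,
       count(*) AS files
FROM pg_ls_waldir();

-- Slot headroom against max_replication_slots
SELECT count(*) AS slots_in_use,
       current_setting('max_replication_slots')::int AS slot_limit
FROM pg_replication_slots;</code></pre></div>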
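<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-patroni-ha-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>Logical slots that do not survive failover on PG 16 and earlier have been a long-standing pain point:</p>
<ol>
<li><strong>PG 16 and earlier</strong>: after failover, logical consumers must re-initialize unless external tooling has copied the slot, because the slot does not exist on the new leader</li>
<li><strong>PG 17+</strong>: the <code>failover</code> slot property, together with slot synchronization on the standby, keeps logical slots in sync so that consumers reconnect directly after failover</li>
<li>Patroni can also keep permanent logical slots across failover through the slots section of its configuration; use it together with the above and check which versions your Patroni release supports</li>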
</ol>
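<p>On PostgreSQL 17 or later, a failover-enabled slot can be sketched as below. The slot name reuses the Debezium example; the standby-side settings appear only as comments because they live in <code>postgresql.conf</code>, and the exact prerequisites should be checked against the documentation for your version:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Primary (PG 17+): create the logical slot with failover enabled
SELECT slot_name, lsn
FROM pg_create_logical_replication_slot('debezium_app', 'pgoutput', failover =&gt; true);

-- Standby (postgresql.conf), stated here as assumptions to verify:
--   sync_replication_slots = on
--   hot_standby_feedback = on
--   primary_slot_name set, and a dbname included in primary_conninfo

-- Either node: confirm the flags
SELECT slot_name, failover, synced
FROM pg_replication_slots
WHERE slot_type = 'logical';</code></pre></div>
<p>On the standby, <code>synced = true</code> indicates the slot was copied from the primary and should be usable after promotion.</p>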
<h3 id="跟-kafka-outbox-pattern">With the Kafka outbox pattern</h3>
<p>Production-grade CDC reads an <em>outbox table</em> instead of reading the business tables directly:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Application transaction
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(...)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(...);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">outbox</span><span class="w"> </span><span class="p">(</span><span class="n">event_type</span><span class="p">,</span><span class="w"> </span><span class="n">payload</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;order_created&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">now</span><span class="p">());</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">COMMIT</span><span class="p">;</span></span></span></code></pre></div><p>Debezium captures only the outbox table, and the event payload is already application-shaped JSON, so there are no row events to interpret. Benefits:</p>
<ol>
<li>Schema changes do not affect CDC (the outbox table schema is stable)</li>
<li>A cross-table transaction maps to a single event (the outbox sits at the business-semantics layer)</li>
<li>Replay is reliable: the outbox is append-only and can be re-read</li>
</ol>
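<p>A matching outbox table and an insert-only publication might look like the sketch below. The column types, the <code>id</code> column, and the publication name <code>outbox_pub</code> are assumptions added for illustration; only the three column names in the INSERT above come from the article:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Append-only outbox table (types are illustrative)
CREATE TABLE outbox (
    id         bigserial   PRIMARY KEY,
    event_type text        NOT NULL,
    payload    jsonb       NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Publish inserts only; updates and deletes on the outbox are not streamed
CREATE PUBLICATION outbox_pub FOR TABLE outbox WITH (publish = 'insert');</code></pre></div>
<p>Restricting the publication to inserts means that periodic cleanup of old outbox rows does not generate CDC events.</p>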
<h3 id="跟-partitioning-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">partitioning</a></h3>
<p>Logical replication of partitioned tables:</p>
<ol>
<li>PG 13+ <code>publish_via_partition_root = true</code>: changes are published under the parent table's identity rather than per partition</li>
<li>The subscriber may use a different partitioning strategy, or no partitioning at all</li>
<li>Schema changes are more complex on partitioned tables; follow expand-contract strictly</li>
</ol>
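<p>A minimal sketch of the first item, assuming PG 13+ and a hypothetical range-partitioned table <code>events_by_day</code> with a publication named <code>events_pub</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical partitioned table; the primary key must include the partition key
CREATE TABLE events_by_day (
    id          bigint      NOT NULL,
    occurred_at timestamptz NOT NULL,
    payload     jsonb,
    PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_by_day_2026_05 PARTITION OF events_by_day
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

-- Publish changes under the parent table's name instead of each partition's
CREATE PUBLICATION events_pub FOR TABLE events_by_day
    WITH (publish_via_partition_root = true);</code></pre></div>
<p>With this setting the subscriber only needs a table named <code>events_by_day</code>, partitioned or not, to receive the changes.</p>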
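<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Logical replication conflicts</strong>: handling write conflicts on the subscriber side (newer PostgreSQL releases add conflict detection and logging; check the release notes for your version)</li>
<li><strong>Bi-directional replication (pgactive)</strong>: multi-region active-active and the design of conflict resolution</li>
<li><strong>Decoder plugin comparison</strong>: performance and ease of use of pgoutput / wal2json / decoderbufs</li>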
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Upstream chapter: <a href="/blog/backend/01-database/schema-migration-rollout-evidence/" data-link-title="1.7 Schema Migration Rollout 證據（Schema Migration Rollout Evidence）實作示範" data-link-desc="以訂單付款狀態欄位演進示範 schema migration 如何產出 evidence、release gate 與 incident decision log。">Schema Migration Rollout Evidence</a> (how schema changes map to CDC)</li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> / <a href="/blog/backend/01-database/vendors/postgresql/replication-slot-management/" data-link-title="PostgreSQL Replication Slot Management：Physical / Logical / Failover Slot 治理" data-link-desc="PG replication slot 是 *primary 端的 standby 進度紀錄*、防 WAL premature deletion。但 orphan slot 會吃 disk、failover 後 logical slot 不會自動跟新 primary、是 PG 操作的 hidden complexity。本文走 physical / logical slot 差異、slot lifecycle、failover slot synchronization（PG 17&#43; 新特性）、orphan slot 治理、5 production 踩雷（orphan slot disk 爆 / logical slot lag / failover 後 slot 丟 / wal_keep_size 跟 slot 衝突 / connection 同時打 slot 數量限制）">Replication Slot Management</a> (slot lifecycle / orphan / failover sync) / <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a> (streaming + LSN basics)</li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL PITR + WAL archiving: the complete chain from base backup to point-in-time recovery</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive that builds on the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview establishes that backup and recovery are mandatory for OLTP; this article focuses on &lt;em>the two-track data design of PITR (Point-In-Time Recovery) and 5 production failure modes&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">Problem scenario&lt;/h2>
&lt;p>A logical bug ships to production and is only discovered 6 hours later: a batch job has set 500,000 user.email values to NULL. The options at that point:&lt;/p>
&lt;ul>
&lt;li>Restore the latest daily backup (last night) → loses all of today's legitimate writes (orders, sign-ups)&lt;/li>
&lt;li>Promote a standby → the standby has already replicated the bug and is in the same state as the primary&lt;/li>
&lt;li>Rebuild from application logs → some operations are irreversible (emails already sent)&lt;/li>
&lt;/ul>
&lt;p>PITR is the standard answer to this kind of &lt;em>logical disaster&lt;/em>: rather than restoring to the backup's timestamp, you &lt;em>restore to the moment just before the bug hit&lt;/em> (for example, 1 minute earlier). It needs two tracks of data, &lt;em>base backup + WAL archive&lt;/em>: the base backup is a snapshot, and the WAL archive holds every write after that snapshot; recovery replays WAL up to a chosen timestamp, LSN, or transaction ID.&lt;/p>
&lt;h2 id="核心概念base-backup--wal-archive-的雙軌設計">Core concept: the two-track design of base backup + WAL archive&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">[Base backup t0] + [WAL archive t0 → now]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓ ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> 全量 snapshot incremental log
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> ↓ ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> └────── recover to t_target ──→ [restored cluster at t_target]&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>兩個軌道各自獨立但必須對齊：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Base backup&lt;/strong>：某時刻整個 data dir 的 snapshot。&lt;code>pg_basebackup&lt;/code> / &lt;code>pgBackRest&lt;/code> / &lt;code>WAL-G&lt;/code> 都產這個；通常 &lt;em>每天 / 每週&lt;/em> 跑一次&lt;/li>
&lt;li>&lt;strong>WAL archive&lt;/strong>：base backup 之後每段 WAL 都 push 到外部 storage（S3 / GCS / NFS）。&lt;code>archive_command&lt;/code> 觸發、PostgreSQL 等到 archive 成功才 &lt;em>回收&lt;/em> 那段 WAL&lt;/li>
&lt;/ol>
&lt;p>兩者組合決定 RPO（recovery point objective）：&lt;/p>
&lt;ul>
&lt;li>RPO ≈ WAL archive frequency（streaming 即時、&lt;code>archive_timeout&lt;/code> 預設 1 分鐘）&lt;/li>
&lt;li>RPO 不是 base backup frequency — daily base backup + 每分鐘 archive WAL → RPO 1 分鐘&lt;/li>
&lt;/ul>
&lt;p>RTO（recovery time objective）跟 &lt;em>base backup size + WAL replay 量&lt;/em> 相關：&lt;/p></description><content:encoded><![CDATA[<blockquote>
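<p>This article is an implementation-layer deep dive that builds on the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview establishes that backup and recovery are mandatory for OLTP; this article focuses on <em>the two-track data design of PITR (Point-In-Time Recovery) and 5 production failure modes</em>.</p></blockquote>
<h2 id="問題情境">Problem scenario</h2>
<p>A logical bug ships to production and is only discovered 6 hours later: a batch job has set 500,000 user.email values to NULL. The options at that point:</p>
<ul>
<li>Restore the latest daily backup (last night) → loses all of today's legitimate writes (orders, sign-ups)</li>
<li>Promote a standby → the standby has already replicated the bug and is in the same state as the primary</li>
<li>Rebuild from application logs → some operations are irreversible (emails already sent)</li>
</ul>
<p>PITR is the standard answer to this kind of <em>logical disaster</em>: rather than restoring to the backup's timestamp, you <em>restore to the moment just before the bug hit</em> (for example, 1 minute earlier). It needs two tracks of data, <em>base backup + WAL archive</em>: the base backup is a snapshot, and the WAL archive holds every write after that snapshot; recovery replays WAL up to a chosen timestamp, LSN, or transaction ID.</p>
<h2 id="核心概念base-backup--wal-archive-的雙軌設計">Core concept: the two-track design of base backup + WAL archive</h2>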





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[Base backup t0]  +  [WAL archive t0 → now]
</span></span><span class="line"><span class="ln">2</span><span class="cl">     ↓                       ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">  full snapshot        incremental log
</span></span><span class="line"><span class="ln">4</span><span class="cl">     ↓                       ↓
</span></span><span class="line"><span class="ln">5</span><span class="cl">     └────── recover to t_target ──→ [restored cluster at t_target]</span></span></code></pre></div><p>The two tracks are independent but must line up:</p>
<ol>
<li><strong>Base backup</strong>: a snapshot of the whole data directory at one point in time. <code>pg_basebackup</code> / <code>pgBackRest</code> / <code>WAL-G</code> all produce one; it typically runs <em>daily or weekly</em></li>
<li><strong>WAL archive</strong>: every WAL segment after the base backup is pushed to external storage (S3 / GCS / NFS). <code>archive_command</code> drives this, and PostgreSQL only <em>recycles</em> a segment after it has been archived successfully</li>
</ol>
<p>Together they determine the RPO (recovery point objective):</p>
<ul>
<li>RPO ≈ WAL archive frequency: near real time with WAL streaming, otherwise bounded by <code>archive_timeout</code> (PostgreSQL's default is 0, meaning disabled; 60 s is a common setting)</li>
<li>RPO is not the base backup frequency: a daily base backup plus WAL archived every minute gives an RPO of 1 minute</li>
</ul>
<p>RTO (recovery time objective) depends on <em>base backup size + the amount of WAL to replay</em>:</p>
<ul>
<li>Restoring the base backup: ~1-4 hours (TB scale)</li>
<li>WAL replay time ≈ accumulated archive volume / replay throughput</li>
</ul>
<h2 id="step-by-step-配置">Step-by-step configuration</h2>
<h3 id="primaryarchive_command-設好">Primary: set up archive_command</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">wal_level</span> <span class="o">=</span> <span class="s">replica                          # default is replica; required for PITR</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">archive_mode</span> <span class="o">=</span> <span class="s">on                            # enable archiving</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">archive_command</span> <span class="o">=</span> <span class="s">&#39;wal-g wal-push %p&#39;        # or pgBackRest / a custom script</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">archive_timeout</span> <span class="o">=</span> <span class="s">60                         # force a segment switch at least every 60s on a low-traffic server</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">max_wal_size</span> <span class="o">=</span> <span class="s">4GB</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">checkpoint_timeout</span> <span class="o">=</span> <span class="s">15min</span></span></span></code></pre></div><p><code>archive_command</code> must <em>return exit code 0 only when the segment has actually been archived</em>. On a non-zero exit PostgreSQL keeps retrying, and while retries fail, WAL piles up in <code>pg_wal</code> until the disk fills. <strong>Critical: never write an archive_command that can fail silently</strong>, that is, one that exits 0 without having stored the segment, because PostgreSQL will then recycle WAL that was never archived.</p>
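<p>Whether archiving is keeping up can be checked from SQL. The sketch below reads the built-in <code>pg_stat_archiver</code> view and counts segments still waiting for <code>archive_command</code>; the second query assumes PG 12+ and a superuser or <code>pg_monitor</code> member:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Last success and last failure of archive_command
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- WAL segments that are complete but not yet archived
SELECT count(*) AS pending_segments
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';</code></pre></div>
<p>A <code>last_failed_time</code> newer than <code>last_archived_time</code>, or a steadily growing <code>pending_segments</code>, points at a failing or slow archive target. Neither query can detect a command that exits 0 without storing anything, which is why the silent-fail rule above matters.</p>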
<h3 id="用-pgbackrest-取代手寫-script">Use pgBackRest instead of a hand-written script</h3>
<p>Writing your own archive script is strongly discouraged in production; pgBackRest, WAL-G, and Barman already handle the edge cases:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># pgbackrest.conf</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">[global]</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="na">repo1-type</span><span class="o">=</span><span class="s">s3</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">repo1-s3-bucket</span><span class="o">=</span><span class="s">mybucket</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">repo1-s3-region</span><span class="o">=</span><span class="s">us-east-1</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># Keep 4 full backups</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">repo1-retention-full</span><span class="o">=</span><span class="s">4</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># Keep 8 differential backups</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="na">repo1-retention-diff</span><span class="o">=</span><span class="s">8</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># Encrypt at rest</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="na">repo1-cipher-type</span><span class="o">=</span><span class="s">aes-256-cbc</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># Parallel processes for backup and restore</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="na">process-max</span><span class="o">=</span><span class="s">8</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="k">[main]</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="na">pg1-path</span><span class="o">=</span><span class="s">/var/lib/postgresql/16/main</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># run a full backup</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgbackrest --stanza<span class="o">=</span>main backup --type<span class="o">=</span>full
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># postgresql.conf setting (not a shell command): use the archive-push built into pgBackRest</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="nv">archive_command</span> <span class="o">=</span> <span class="s1">&#39;pgbackrest --stanza=main archive-push %p&#39;</span></span></span></code></pre></div><p>pgBackRest handles parallel push, compression, encryption, checksums, archive replay timing, the backup catalog, and automatic retention cleanup.</p>
<h3 id="restorerecovery_target_time">Restore: recovery_target_time</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. restore the base backup from the S3 repo with a point-in-time target</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgbackrest --stanza<span class="o">=</span>main --type<span class="o">=</span><span class="nb">time</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --target<span class="o">=</span><span class="s2">&#34;2026-05-18 14:30:00+00&#34;</span> restore
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 2. start PostgreSQL: it enters recovery mode and replays WAL up to the target time</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># (pgBackRest has already written recovery.signal + postgresql.auto.conf)</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"># 3. once the data at the target timestamp has been verified, promote</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl">pg_ctl promote -D /var/lib/postgresql/16/main</span></span></code></pre></div><p>Three commonly used recovery targets:</p>
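<ul>
<li><strong><code>recovery_target_time</code></strong>: replay up to a given timestamp</li>
<li><strong><code>recovery_target_xid</code></strong>: replay up to a given transaction ID (practical only when the logs record xids)</li>
<li><strong><code>recovery_target_lsn</code></strong>: replay up to a given WAL LSN (the most precise, but the LSN has to be recorded beforehand)</li>
</ul>
<p>Production restores mostly use a timestamp, because application logs carry timestamps that make the target easy to locate.</p>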
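<h2 id="故障演練--邊界-case">Failure drills / edge cases</h2>
<h3 id="case-1archive_command-靜默失敗">Case 1: archive_command fails silently</h3>
<p><strong>Symptom</strong>: during a PITR test the DBA finds that the last 3 days of WAL are missing from S3, yet PostgreSQL raised no alert and <code>pg_wal</code> shows no backlog (the segments were recycled long ago).</p>
<p><strong>Root cause</strong>: archive_command was written so that a failed upload never reaches PostgreSQL, for example <code>aws s3 cp %p s3://bucket/... 2&gt;/dev/null || true</code> or a shell wrapper that does not propagate the exit status. Redirecting stderr only hides the error message; it is the swallowed exit status that makes the command report 0. PostgreSQL then treats the segment as archived and recycles it, while the archive never received it.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Never swallow the exit code</strong>: archive_command must <em>fail loud</em> and return non-zero whenever the segment was not stored</li>
<li><strong>Use pgBackRest / WAL-G</strong> instead of a hand-written shell script</li>
<li><strong>Monitoring</strong>: alert on archive lag</li>
</ol>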





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">last_archived_time</span><span class="p">,</span><span class="w"> </span><span class="n">now</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">last_archived_time</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">lag</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_archiver</span><span class="p">;</span></span></span></code></pre></div><p>Alert if lag &gt; 5 minutes. The value comes from the <code>pg_stat_archiver</code> view; on a completely idle primary no new segment is produced, so pair this check with <code>failed_count</code>.</p>
<ol start="4">
<li><strong>Test restores regularly</strong>: run a PITR drill every month, actually restoring from the archive and verifying the recovered timestamp</li>
</ol>
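<h3 id="case-2wal-archive-lagprimary-disk-壓力">Case 2: WAL archive lag and primary disk pressure</h3>
<p><strong>Symptom</strong>: the <code>pg_wal</code> directory keeps growing and <code>df -h</code> shows 90%+; <code>pg_stat_archiver</code> shows <code>failed_count</code> climbing and a <code>last_failed_time</code> 30 minutes ago; archive_command cannot get segments out (S3 throttling / slow network).</p>
<p><strong>Root cause</strong>: archive_command writes to S3, S3 rate limits or connections time out, and PostgreSQL keeps retrying; unarchived WAL cannot be recycled from <code>pg_wal</code>, so disk usage keeps growing.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Prevention</strong>: retries inside <code>archive_command</code> plus parallel push (pgBackRest provides <code>process-max</code>, used together with <code>archive-async</code>)</li>
<li><strong>Alert</strong>: <code>pg_stat_archiver.failed_count</code> increasing, plus primary disk usage &gt; 80%</li>
<li><strong>Emergency</strong>: temporarily point archive_command at local NFS or other storage (a reload is enough, no restart) and sync to S3 once it recovers; do not simply disable archiving, because the resulting gap in the WAL chain makes PITR past that point impossible</li>
<li><strong>Architecture</strong>: keep at least two copies of the archive across regions, so a single storage failure does not stop archiving</li>
</ol>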
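<h3 id="case-3recovery-跑到-wrong-target-time">Case 3: recovery ran to the wrong target time</h3>
<p><strong>Symptom</strong>: after the PITR the data looks like <em>a chunk is missing</em>; the target time was set 30 minutes too early, the instance has already been promoted, all later changes now live on a new timeline, and the restored instance cannot be rolled forward.</p>
<p><strong>Root cause</strong>: promotion is irreversible. Once promote opens a new timeline, the remaining WAL of the old timeline will not be replayed onto it; reaching a later timestamp means <em>restoring the base backup and replaying WAL again</em>.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong><code>recovery_target_action = pause</code></strong> (the default when <code>hot_standby</code> is on; available since PG 9.5 and set in <code>postgresql.conf</code> from PG 12): recovery <em>pauses</em> at the target time instead of promoting automatically; the DBA queries the data by hand and promotes only once it is confirmed correct</li>
</ol>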





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">recovery_target_time</span> <span class="o">=</span> <span class="s">&#39;2026-05-18 14:30:00+00&#39;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">recovery_target_action</span> <span class="o">=</span> <span class="s">pause</span></span></span></code></pre></div><ol start="2">
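<li><strong>Iterate on PITR safely</strong>: restore into a <em>separate staging cluster</em>, verify that the target time is right, and only then run against production</li>
<li><strong>Record where the target time came from</strong>: cross-check application logs and event timestamps, and avoid timezone mix-ups (<code>+00</code> is UTC, not local time)</li>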
</ol>
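<p>The timezone pitfall can be made concrete with a small conversion helper. This is a sketch under assumptions: <code>to_utc_target</code> is a hypothetical name, and the <code>-d</code> flag is specific to GNU <code>date</code> (BSD / macOS <code>date</code> uses different options).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# Convert a local timestamp plus its UTC offset into the "+00" form used for
# recovery_target_time. to_utc_target is a hypothetical helper; requires GNU date.
to_utc_target() {
  local local_ts="$1" offset="$2"
  # -u prints in UTC; the offset tells date how to interpret the input.
  date -u -d "$local_ts $offset" '+%Y-%m-%d %H:%M:%S+00'
}

# A log line stamped 22:30 at UTC+8 becomes a 14:30 UTC recovery target.
to_utc_target '2026-05-18 22:30:00' '+0800'
</code></pre></div>
<p>Passing the offset explicitly keeps the conversion independent of the timezone configured on the host running the restore.</p>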
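<h3 id="case-4base-backup-過期未清storage-爆">Case 4: expired base backups never cleaned up, storage blows up</h3>
<p><strong>Symptom</strong>: the S3 backup bucket grows from 200GB to 5TB within six months; only then does the DBA notice that retention was never configured and daily base backups have been kept for 180 days.</p>
<p><strong>Root cause</strong>: a hand-written archive script has no retention logic, or pgBackRest was configured with <code>repo1-retention-full=180</code> and nobody noticed; the database itself keeps growing while daily full backups accumulate.</p>
<p><strong>Fix</strong>:</p>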





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># pgBackRest retention: 4 full backups + auto-expired archive</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># keep 4 full backups</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">repo1-retention-full</span><span class="o">=</span><span class="s">4</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># keep 8 differential backups</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">repo1-retention-diff</span><span class="o">=</span><span class="s">8</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># WAL archive retention aligned with the full backups</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">repo1-retention-archive</span><span class="o">=</span><span class="s">4</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="na">repo1-retention-archive-type</span><span class="o">=</span><span class="s">full</span></span></span></code></pre></div><p>Storage budgeting:</p>
<ul>
<li>daily full + diff + WAL archive ≈ 1-2x DB size per day</li>
<li>4-week retention → roughly 30-60x DB size of storage</li>
<li>Cross-region replication → multiply by 2-3x</li>
</ul>
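<p>The budgeting arithmetic above can be written down as a small calculation. This is a rough sketch: <code>storage_budget_gb</code> and its parameters are hypothetical, and the inputs are assumptions to be replaced with measured values.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# Rough archive storage budget in GB (integer arithmetic).
#   db_gb          current database size in GB
#   daily_tenths   backup + WAL volume per day in tenths of the DB size (10 = 1.0x, 20 = 2.0x)
#   retention_days number of days of backups kept
#   copies         number of stored copies (2 = one cross-region replica)
storage_budget_gb() {
  local db_gb="$1" daily_tenths="$2" retention_days="$3" copies="$4"
  echo $(( db_gb * daily_tenths * retention_days * copies / 10 ))
}

# 200 GB database, 1.0x per day, 28 days, 2 copies
storage_budget_gb 200 10 28 2
</code></pre></div>
<p>For the 200GB example this gives 11200GB, which is 56x the database size and sits inside the 30-60x range once the second copy is counted.</p>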
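<h3 id="case-5timeline-分歧後-recovery-模糊">Case 5: ambiguous recovery after timelines diverge</h3>
<p><strong>Symptom</strong>: production went through one failover (Patroni promote) and later one PITR; now another PITR is needed to the moment just before the failover, the archive holds two timelines, and it is unclear which one the recovery target should follow.</p>
<p><strong>Root cause</strong>: every promote opens a new timeline ID (recorded in a <code>.history</code> file); the same LSN in the archive can belong to different timelines; a recovery target time close to the divergence point is ambiguous.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Set <strong><code>recovery_target_timeline</code></strong> explicitly to the timeline that should be followed</li>
</ol>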





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">recovery_target_time</span> <span class="o">=</span> <span class="s">&#39;2026-05-15 10:00:00+00&#39;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">recovery_target_timeline</span> <span class="o">=</span> <span class="s">&#39;3&#39;                 # follow timeline 3</span></span></span></code></pre></div><ol start="2">
<li><strong>Know the <code>.history</code> files</strong>: <code>/wal_archive/000000XX.history</code> records the timeline switch points; read them before starting a PITR</li>
<li><strong>Prevention</strong>: run a new base backup <em>immediately</em> after every promote, which simplifies later PITR (no need to cross timelines)</li>
</ol>
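<p>When sorting out which archived files belong to which timeline, it helps that the first 8 hex digits of every WAL segment and <code>.history</code> file name are the timeline ID. A minimal sketch, with <code>wal_timeline</code> as a hypothetical helper name:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# Print the timeline ID (decimal) encoded in a WAL segment or .history file name.
# wal_timeline is a hypothetical helper; it only inspects the file name.
wal_timeline() {
  local name hex
  name="$(basename "$1")"
  case "$name" in
    [0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F]*) ;;
    *) echo "wal_timeline: not a WAL file name: $name" >/dev/stderr; return 1 ;;
  esac
  # The first 8 characters are the timeline ID in hexadecimal.
  hex="$(printf '%s' "$name" | cut -c1-8)"
  printf '%d\n' "0x$hex"
}
</code></pre></div>
<p>Listing the archive and grouping by this value shows at a glance which timelines exist before <code>recovery_target_timeline</code> is chosen.</p>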
<h2 id="容量--cost-規劃">Capacity / cost planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Estimate</th>
          <th>Watch for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base backup size</td>
          <td>Proportional to the size of the DB data directory (after compression by the backup tool)</td>
          <td>~0.5-1x DB size per backup</td>
      </tr>
      <tr>
          <td>WAL archive size</td>
          <td>~5-50GB / day depending on write volume</td>
          <td>A write-heavy 1TB DB can exceed 100GB / day</td>
      </tr>
      <tr>
          <td>Storage retention</td>
          <td>4-12 weeks is typical</td>
          <td>30-60x DB size budget</td>
      </tr>
      <tr>
          <td>Base backup time</td>
          <td>1-4 hours at TB scale</td>
          <td>Run it in a maintenance window</td>
      </tr>
      <tr>
          <td>Restore time</td>
          <td>base backup restore + WAL replay</td>
          <td>A TB-scale PITR usually takes 2-6 hours</td>
      </tr>
      <tr>
          <td>Network bandwidth</td>
          <td>100-500 Mbps during a full backup</td>
          <td>Mind the egress cost across regions</td>
      </tr>
  </tbody>
</table>
<p>Practical defaults:</p>
<ul>
<li>Daily full backup + 4 weeks retention</li>
<li>WAL archived at least every 60s (<code>archive_timeout = 60</code>)</li>
<li>Cross-region replication (S3 → S3 cross-region)</li>
<li>Monthly restore drill to prove the backups are usable</li>
</ul>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-patroni-ha-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>Patroni does not manage backups, but the timeline switch after a promotion affects the archive:</p>
<ol>
<li>archive_command only substitutes <code>%p</code> (path) and <code>%f</code> (file name); there is no timeline placeholder. The first 8 hex digits of every WAL file name are the timeline ID, so segments from different timelines do not collide as long as the archive path keeps <code>%f</code> intact and existing files are never overwritten</li>
<li>The Patroni <code>recovery_conf</code> section carries <code>restore_command</code>, so a standby being cloned can pull WAL from the archive</li>
<li>Run a <em>full backup</em> after every Patroni failover to simplify later PITR</li>
</ol>
<h3 id="跟-logical-replication-對位">How it relates to <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">logical replication</a></h3>
<p>PITR and logical replication serve different use cases:</p>
<ul>
<li>PITR is <em>disaster recovery</em> (logical bug / corruption): restore the whole cluster to a point in time</li>
<li>Logical replication is <em>continuous sync</em>: near-real-time replication to Kafka or another database</li>
</ul>
<p>Both <em>depend on WAL</em> but serve different goals; they can run side by side on the same PostgreSQL instance without conflict.</p>
<h3 id="跟-monitoring--alert">Monitoring and alerting</h3>
<p>Key metrics:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- archive health
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_archiver</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- archived_count, failed_count, last_archived_wal, last_archived_time
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- WAL segments still waiting to be archived (.ready markers; pg_ls_archive_statusdir() exists since PG 12)
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_ls_archive_statusdir</span><span class="p">()</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;%.ready&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- time of the last base backup
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1">-- (pgBackRest API or backup catalog)</span></span></span></code></pre></div><p>Three Prometheus alerts: archive failed_count increasing, archive lag &gt; 5min, no base backup for more than 25h.</p>
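<p>The three alert conditions can be expressed as one check over numbers that have already been collected. This is a sketch, not a Prometheus rule: <code>backup_alerts</code> and its arguments are hypothetical, and collecting the inputs is left out.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# Evaluate the three alert conditions; prints one line per firing alert.
#   failed_delta   increase of pg_stat_archiver.failed_count since the last check
#   archive_lag_s  seconds since pg_stat_archiver.last_archived_time
#   backup_age_s   seconds since the last successful base backup
# backup_alerts is a hypothetical helper; returns 1 if any alert fires.
backup_alerts() {
  local failed_delta="$1" archive_lag_s="$2" backup_age_s="$3"
  local status=0
  if [ "$failed_delta" -gt 0 ]; then echo "ALERT archive_failed_count_increased"; status=1; fi
  # 300 s = 5 minutes
  if [ "$archive_lag_s" -gt 300 ]; then echo "ALERT archive_lag_over_5min"; status=1; fi
  # 90000 s = 25 hours
  if [ "$backup_age_s" -gt 90000 ]; then echo "ALERT base_backup_older_than_25h"; status=1; fi
  return "$status"
}
</code></pre></div>
<p>The 25-hour threshold leaves one hour of slack on a daily backup schedule before the alert fires.</p>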
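<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Incremental backup (PG 17+)</strong>: <code>pg_basebackup --incremental</code> copies only the blocks changed since an earlier backup, and <code>pg_combinebackup</code> rebuilds a full data directory from the base plus incrementals</li>
<li><strong>Block-level differential</strong>: already supported by pgBackRest</li>
<li><strong>Cloud-native alternatives</strong>: RDS / Aurora rely on storage-layer snapshots instead of a self-managed base backup + WAL archive chain</li>
<li><strong><code>pg_dump</code> vs PITR</strong>: pg_dump is a logical backup (it can be restored into a different major version or architecture); PITR is physical (it needs the same major version and the same architecture)</li>
</ul>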
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Upstream chapter: <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>: PITR is the fallback when a migration fails</li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> / <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a></li>
<li>Methodology：<a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章的寫作方法論</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/major-version-upgrade/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/major-version-upgrade/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。寫作前判讀 &lt;em>不適用&lt;/em> &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> 的 5 type — 本文是該 methodology 「何時不該套」段的第 2 項實證（同 vendor major version upgrade）。&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼這篇不套-5-type-migration">為什麼這篇不套 5 type migration&lt;/h2>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> 對 PostgreSQL 14 → 17：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>評估&lt;/th>
 &lt;th>等級&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>同 PostgreSQL wire protocol、SQL syntax 99%+ 相容&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational model&lt;/td>
 &lt;td>同 PostgreSQL operational stack、tooling 不變&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Abstraction / paradigm&lt;/td>
 &lt;td>同 OLTP RDBMS&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Number of components&lt;/td>
 &lt;td>同 1 個&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>多數 application 不改&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>5 維皆 Low — 對映 Type B drop-in。但 &lt;em>實際工作量&lt;/em> 跟 drop-in 完全不同：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extension 相容性&lt;/strong>：pg14 的 extension 不一定能在 pg17 直接用（API 變動 / ABI break）&lt;/li>
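&lt;li>&lt;strong>Breaking change&lt;/strong>: every major version ships release-specific behavior changes (PG 15 no longer grants &lt;code>CREATE&lt;/code> on the &lt;code>public&lt;/code> schema to &lt;code>PUBLIC&lt;/code>; PG 17 removes &lt;code>old_snapshot_threshold&lt;/code> and the &lt;code>adminpack&lt;/code> extension)&lt;/li>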
&lt;li>&lt;strong>Storage format&lt;/strong>：major version 之間 &lt;em>data dir 不向後相容&lt;/em>、必須 &lt;code>pg_upgrade&lt;/code> 或 dump-restore&lt;/li>
&lt;li>&lt;strong>Statistics 重建&lt;/strong>：upgrade 後 &lt;code>pg_statistic&lt;/code> 失效、必須跑 &lt;code>ANALYZE&lt;/code>、否則 query plan 退化&lt;/li>
&lt;li>&lt;strong>Replication slot&lt;/strong>：logical replication slot 不跨 major version&lt;/li>
&lt;/ul>
&lt;p>5 type 對映 &lt;em>跨 vendor process&lt;/em>、漏了 &lt;em>同 vendor 內升級&lt;/em> 的 upgrade-specific dimension。本文採用 &lt;em>deep article methodology 的 6-section + 額外 upgrade audit 段&lt;/em> 結構、不是 5 type 的任一個。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。寫作前判讀 <em>不適用</em> <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> 的 5 type — 本文是該 methodology 「何時不該套」段的第 2 項實證（同 vendor major version upgrade）。</p></blockquote>
<h2 id="為什麼這篇不套-5-type-migration">Why this article does not use the 5 migration types</h2>
<p>跑 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> 對 PostgreSQL 14 → 17：</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same PostgreSQL wire protocol; SQL syntax 99%+ compatible</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Same PostgreSQL operational stack; tooling unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Same OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>Still 1</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Most applications need no change</td>
          <td>Low</td>
      </tr>
  </tbody>
</table>
<p>All 5 dimensions are Low, which maps to Type B drop-in. The <em>actual work</em>, however, is nothing like a drop-in:</p>
<ul>
<li><strong>Extension compatibility</strong>: an extension built for pg14 will not necessarily work on pg17 as-is (API changes / ABI breaks)</li>
<li><strong>Breaking change</strong>: every major version ships release-specific behavior changes (PG 15 no longer grants <code>CREATE</code> on the <code>public</code> schema to <code>PUBLIC</code>; PG 17 removes <code>old_snapshot_threshold</code> and the <code>adminpack</code> extension)</li>
<li><strong>Storage format</strong>: the <em>data directory is not compatible across major versions</em>; <code>pg_upgrade</code> or dump-restore is required</li>
<li><strong>Statistics rebuild</strong>: after the upgrade the planner statistics in <code>pg_statistic</code> are gone; <code>ANALYZE</code> must be run or query plans degrade</li>
<li><strong>Replication slot</strong>: logical replication slots do not carry across a 14 → 17 upgrade</li>
</ul>
<p>The 5 types map <em>cross-vendor processes</em> and miss the upgrade-specific dimensions of a <em>same-vendor upgrade</em>. This article therefore uses the <em>6 sections of the deep article methodology plus an extra upgrade audit section</em>, not any of the 5 types.</p>
<h2 id="結構-differentiatordeep-article--upgrade-audit">Structural differentiator: deep article + upgrade audit</h2>
<p>跟 single feature deep article（如 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer config</a> / <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a>）對照、本文多一段 <em>upgrade audit</em>；跟 migration playbook 對照、本文 <em>沒 phased translation / parallel run / cutover routing</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Problem context (why upgrade)
</span></span><span class="line"><span class="ln">2</span><span class="cl">→ Upgrade audit (extension / breaking change / dependency)
</span></span><span class="line"><span class="ln">3</span><span class="cl">→ Choosing an upgrade method (pg_upgrade / logical / blue-green)
</span></span><span class="line"><span class="ln">4</span><span class="cl">→ Step-by-step execution
</span></span><span class="line"><span class="ln">5</span><span class="cl">→ Failure drills
</span></span><span class="line"><span class="ln">6</span><span class="cl">→ Capacity / downtime trade-off
</span></span><span class="line"><span class="ln">7</span><span class="cl">→ Integration / next steps</span></span></code></pre></div><p>Seven sections, 220-280 lines. That is one more section (the audit) than a single-feature deep article, and it drops the phased-translation chapters of a migration playbook.</p>
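<h2 id="問題情境major-version-不只是-minor-bump">Problem context: a major version is not a minor bump</h2>
<p>PostgreSQL ships one major version a year (14 / 15 / 16 / 17), and each one contains <em>breaking changes</em>, unlike a minor release. Common upgrade drivers:</p>
<ul>
<li><strong>EOL pressure</strong>: each major version is maintained for 5 years; pg14 reaches EOL in 2026-11 and pg13 already did in 2025-11, so production still running pg13 is a risk</li>
<li><strong>New feature needs</strong>: pg15 MERGE / pg16 parallel FULL and RIGHT hash joins / pg17 incremental backup</li>
<li><strong>Cloud provider enforcement</strong>: Aurora / RDS stop shipping minor patches for EOL versions, so a planned upgrade cannot be postponed indefinitely</li>
</ul>
<p>The cost of not upgrading: security patches stop, new features stay out of reach, and incompatibility with newer clients and extensions keeps growing.</p>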
<h2 id="upgrade-audit">Upgrade audit</h2>
<p>These are hard gates to audit before the upgrade; skipping any one of them is a reliable way to get burned in production:</p>
<h3 id="audit-1extension-相容性">Audit 1: Extension compatibility</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">extname</span><span class="p">,</span><span class="w"> </span><span class="n">extversion</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">extname</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s1">&#39;plpgsql&#39;</span><span class="p">;</span></span></span></code></pre></div><p>For each extension, check:</p>
<ol>
<li>Is there a release for the target version (pg17)?</li>
<li>Is there an ABI break? (for example, each PostGIS major version supports a specific range of PG major versions)</li>
<li>Is the maintainer still shipping updates? (check the support matrix of the extension, for example TimescaleDB, for the pg17 features you rely on)</li>
</ol>
<p>Extensions that commonly need to be <em>upgraded first</em> for pg14 → pg17: PostGIS / TimescaleDB / pgaudit / pg_partman / pg_repack.</p>
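<p>The first check can be partly automated by comparing what is installed on the old cluster with what the target host offers. A minimal sketch under assumptions: <code>missing_extensions</code> and the two file names are hypothetical, and each file holds one extension name per line.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# List extensions installed on the old cluster that the target host does not offer.
# missing_extensions and the file names are hypothetical. The inputs could come from:
#   psql -At -c "SELECT extname FROM pg_extension" > installed.txt          (old cluster)
#   psql -At -c "SELECT name FROM pg_available_extensions" > available.txt  (pg17 host)
missing_extensions() {
  local installed="$1" available="$2" rc=0
  # -F fixed strings, -x whole-line match, -v invert: keep installed names absent on the target.
  grep -Fxv -f "$available" "$installed" || rc=$?
  # Exit status 1 only means "nothing is missing"; anything above 1 is a real error.
  if [ "$rc" -gt 1 ]; then return "$rc"; fi
  return 0
}
</code></pre></div>
<p>An empty result only shows that a package exists on the target; version and ABI compatibility still need the manual checks listed above.</p>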
<h3 id="audit-2breaking-change-pull">Audit 2: Breaking change pull</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># collect the cumulative breaking changes from the release notes (pg14 → pg17 spans 3 majors)</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># pg15: CREATE on the public schema is no longer granted to PUBLIC by default</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># pg16: promote_trigger_file and vacuum_defer_cleanup_age removed</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># pg17: old_snapshot_threshold, db_user_namespace and the adminpack extension removed</span></span></span></code></pre></div><p>For each breaking change:</p>
<ol>
<li>Use SQL grep / static analysis to find the affected application code</li>
<li>Estimate the work (typically 90-95% of the hits are false alarms and 5-10% are real impact)</li>
<li>List what cannot be fixed right away and plan to <em>upgrade one major at a time</em> rather than <em>jumping 3 majors at once</em></li>
</ol>
<h3 id="audit-3replication--logical-slot">Audit 3: Replication / logical slot</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">slot_name</span><span class="p">,</span><span class="w"> </span><span class="n">plugin</span><span class="p">,</span><span class="w"> </span><span class="n">slot_type</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_replication_slots</span><span class="p">;</span></span></span></code></pre></div><p>After a major version upgrade:</p>
<ul>
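<li><strong>Physical replication slot</strong>: physical replication does not work across major versions; standbys must be rebuilt or upgraded to the <em>same major version</em> before they can follow the new primary</li>
<li><strong>Logical replication slot</strong>: <strong>not carried across this upgrade</strong>. pg_upgrade only preserves logical slots when the old cluster is already PG 17 or later, so for 14 → 17 they must be dropped before the upgrade and recreated afterwards (consumers redo their initial load)</li>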
<li>對應 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Debezium CDC</a> consumer 必須重 init</li>
</ul>
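<p>To turn the slot inventory into a drop list, the query output can be filtered for logical slots. A minimal sketch: <code>logical_slots</code> is a hypothetical helper that reads <code>slot_name|slot_type</code> lines, the column layout <code>psql -At</code> would print for those two columns.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# Print the names of logical slots from "slot_name|slot_type" lines on stdin, e.g. from
#   psql -At -c "SELECT slot_name, slot_type FROM pg_replication_slots"
# logical_slots is a hypothetical helper; psql -At separates columns with "|".
logical_slots() {
  local name type
  while IFS='|' read -r name type; do
    if [ "$type" = "logical" ]; then
      echo "$name"
    fi
  done
}
</code></pre></div>
<p>Each printed name is a slot whose consumer needs a re-initialisation plan before the upgrade window.</p>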
<h3 id="audit-4config-參數變更">Audit 4: Config parameter changes</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># diff postgresql.conf default 14 vs 17</span>
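</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># focus on: shared_preload_libraries / autovacuum_* / wal_level / synchronous_commit</span></span></span></code></pre></div><p>Defaults and available parameters change between major versions. From pg14 to pg17, for example, pg15 turned <code>log_checkpoints</code> on by default and removed <code>stats_temp_directory</code>, pg16 removed <code>promote_trigger_file</code> and <code>vacuum_defer_cleanup_age</code>, and pg17 removed <code>old_snapshot_threshold</code>. A removed parameter left in <code>postgresql.conf</code> stops the server from starting, so every custom setting needs an item-by-item review.</p>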
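<h3 id="audit-5statistics-重建計畫">Audit 5: Statistics rebuild plan</h3>
<p>A <code>pg_upgrade</code> to pg17 does not carry <code>pg_statistic</code> over to the new cluster, so the first query plans are built without statistics and production performance collapses. The upgrade plan must include:</p>
<ul>
<li>A full-database <code>ANALYZE</code> (about 10 minutes for a small DB, 1-3 hours for a large one)</li>
<li>Multi-stage <code>vacuumdb --analyze-in-stages</code>, which produces a rough baseline quickly and then the full statistics</li>
<li>Time reserved inside the maintenance window for the statistics rebuild</li>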
</ul>
<h2 id="升級方法選擇">Choosing an upgrade method</h2>
<p>There are three mainstream methods; the choice depends on downtime tolerance and database size:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Downtime</th>
          <th>Risk</th>
          <th>Best for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pg_upgrade --link</code></td>
          <td>10-30 minutes</td>
          <td>Data dir and OS packages share one host; rollback is complicated</td>
          <td>&lt; 500GB, 30 minutes of downtime acceptable</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>Momentary cutover (&lt; 1 minute)</td>
          <td>Complex setup, long-running migration window</td>
          <td>TB scale, low-downtime requirement</td>
      </tr>
      <tr>
          <td>Blue-green deployment</td>
          <td>Momentary cutover</td>
          <td>Double hardware; strict traffic shifting needed during cutover</td>
          <td>Cloud-managed (built into Aurora / RDS)</td>
      </tr>
  </tbody>
</table>
<h3 id="pg_upgrade---link-流程"><code>pg_upgrade --link</code> workflow</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. install the pg17 binaries and initialize the new data dir (do not start it)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"># 2. stop pg14</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">sudo systemctl stop postgresql@14
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 3. run pg_upgrade (hard links, no data copy)</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">sudo -u postgres /usr/lib/postgresql/17/bin/pg_upgrade <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  --old-bindir<span class="o">=</span>/usr/lib/postgresql/14/bin <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --new-bindir<span class="o">=</span>/usr/lib/postgresql/17/bin <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  --old-datadir<span class="o">=</span>/var/lib/postgresql/14/main <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --new-datadir<span class="o">=</span>/var/lib/postgresql/17/main <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --link <span class="se">\
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="se"></span>  --jobs<span class="o">=</span><span class="m">8</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># 4. start pg17</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">sudo systemctl start postgresql@17
</span></span><span class="line"><span class="ln">16</span><span class="cl">
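</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1"># 5. rebuild optimizer statistics (since pg14, pg_upgrade no longer generates analyze_new_cluster.sh)</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">sudo -u postgres /usr/lib/postgresql/17/bin/vacuumdb --all --analyze-in-stages</span></span></code></pre></div><p><code>--link</code> creates hard links instead of copying the data directory, which suits large databases. The drawback is that <em>rolling back to pg14 is no longer possible</em> once the new cluster has started, because it writes to the shared data files. A complete backup and a tested restore are mandatory.</p>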
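<h2 id="故障演練">Failure drills</h2>
<h3 id="case-1extension-相容性沒先-auditupgrade-後啟動失敗">Case 1: Extension compatibility not audited first, startup fails after the upgrade</h3>
<p><strong>Symptom</strong>: pg_upgrade finishes, <code>pg_ctl start</code> fails, and the log shows <code>could not load library &quot;timescaledb-2.13.so&quot;</code>.</p>
<p><strong>Root cause</strong>: the installed TimescaleDB build targets pg14, while pg17 needs a TimescaleDB release with pg17 support (2.17 or later). Nothing verified this before the upgrade (a <code>pg_upgrade --check</code> dry run reports libraries missing from the new installation), so the library is not found in the pg17 library path.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-upgrade audit</strong>: list the target-version mapping for every extension and upgrade the extensions ahead of time (on pg14, with <code>ALTER EXTENSION ... UPDATE</code>)</li>
<li><strong>Rollback</strong>: with <code>--link</code> the data dir is already irreversible, so restore from backup and retry</li>
<li><strong>Prevention</strong>: do a complete dry run on staging so every known path is verified before the production upgrade</li>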
</ol>
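<p>The pre-upgrade extension audit can be scripted. The following is a minimal sketch, assuming the installed versions come from <code>SELECT extname, extversion FROM pg_extension</code> and that the team maintains its own mapping of versions known to work on the target major. The function name and the mapping are hypothetical, and the versions in the usage note are placeholders, not a real compatibility matrix.</p>

```python
from typing import Dict, Set


def find_blocking_extensions(
    installed: Dict[str, str],
    supported_on_target: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Return extensions that would block the upgrade, with a short reason.

    installed: extension name -> installed version on the old cluster.
    supported_on_target: extension name -> versions known to work on the
        target major version (hand-maintained by the team; an assumption).
    """
    blocking = {}
    for name, version in installed.items():
        ok_versions = supported_on_target.get(name)
        if ok_versions is None:
            blocking[name] = "no known build for target"
        elif version not in ok_versions:
            blocking[name] = f"installed {version}, update to one of {sorted(ok_versions)}"
    return blocking
```

<p>An empty result means every installed extension has a known-good version for the target. Anything else is a list of extensions to update on the old cluster, or to replace, before running <code>pg_upgrade</code>.</p>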
<h3 id="case-2application-用-deprecated-sql跑壞">Case 2: Application uses deprecated SQL and breaks</h3>
<p><strong>Symptom</strong>: after the upgrade some application queries fail outright with <code>ERROR: type &quot;regtype&quot; does not have a cast</code>.</p>
<p><strong>Root cause</strong>: pg16 removed some implicit casts; the application code relied on an implicit cast, and the query now runs only with an explicit cast.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-upgrade</strong>: run the application test suite against a pg17 staging instance to catch incompatible queries</li>
<li><strong>Urgent</strong>: fix the queries found on staging in the application code, deploy that, and only then upgrade the DB</li>
<li><strong>Long term</strong>: have application code use an ORM / query builder so raw SQL does not depend on PG version-specific behavior</li>
</ol>
<h3 id="case-3analyze-沒跑production-query-性能崩">Case 3: <code>ANALYZE</code> not run, production query performance collapses</h3>
<p><strong>Symptom</strong>: five minutes after the upgrade, application p99 latency jumps from 50 ms to 5000 ms; query plans degrade from index scans to seq scans.</p>
<p><strong>Root cause</strong>: <code>pg_upgrade</code> does not carry <code>pg_statistic</code> over, so the planner works without statistics, cannot estimate selectivity, and conservatively picks seq scans.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># run right after the upgrade; the stages execute in order</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">vacuumdb --all --analyze-in-stages --jobs<span class="o">=</span><span class="m">4</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># Stage 1: minimal stats (default_statistics_target=1; fast, ~5 minutes)</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Stage 2: medium stats (default_statistics_target=10; ~30 minutes)</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># Stage 3: full stats (configured default_statistics_target; 1-3 hours)</span></span></span></code></pre></div><p><code>--analyze-in-stages</code> runs in three stages, and stage 1 alone already lets the planner make roughly correct decisions, so it is acceptable for stage 3 to still be running when the maintenance window ends.</p>
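<h3 id="case-4logical-replication-slot-漏-dropdebezium-卡死">Case 4: Logical replication slot not handled, Debezium stalls</h3>
<p><strong>Symptom</strong>: after the upgraded cluster comes up, the Debezium connector log shows <code>slot not found</code> and consumption stalls; downstream Kafka messages stop flowing.</p>
<p><strong>Root cause</strong>: logical replication slots are not carried across this upgrade. <code>pg_upgrade</code> migrates logical slots only when the old cluster is already PostgreSQL 17 or later, so on a pg14 source the slot simply does not exist on the new cluster, and nobody planned for recreating it.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-upgrade</strong>: list every logical replication slot, pause the Debezium consumer, and drop the slot</li>
<li><strong>Recreate after the upgrade</strong>: create the slot at a new starting LSN and start Debezium with a snapshot mode that skips the data snapshot instead of <code>initial</code>, which avoids a full re-init load. For the PostgreSQL connector that mode is <code>snapshot.mode=no_data</code> (named <code>never</code> in older releases); <code>schema_only_recovery</code> belongs to the connectors that keep a schema history. This is only safe if the consumer drained the old slot after writes stopped and the new slot exists before writes resume</li>
<li><strong>Architecture</strong>: consider the <em>outbox pattern</em> going forward, with CDC following only the outbox table, to lower the cost of rebuilding logical slots</li>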
</ol>
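<h3 id="case-5standby-沒同步升replication-斷">Case 5: Standby not upgraded in step, replication breaks</h3>
<p><strong>Symptom</strong>: after the primary moves to pg17 the standby is still on pg14 and replication does not connect; <code>pg_stat_replication</code> shows no standby connection.</p>
<p><strong>Root cause</strong>: streaming replication does not cross major versions; the standby must be <em>upgraded together with the primary</em> or <em>rebuilt from a fresh base backup after the upgrade</em>.</p>
<p><strong>Fix</strong>:</p>
<p>Two strategies:</p>
<ol>
<li><strong>In-place standby upgrade</strong>: <code>pg_upgrade</code> is not run on the standby itself. When the primary was upgraded with <code>--link</code>, the documented route is to stop streaming, shut the standby down cleanly, and <code>rsync</code> the old and new data directories from the primary with hard links preserved, then reconnect (keep <code>archive_command</code> and <code>restore_command</code> aligned on the standby)</li>
<li><strong>Rebuild standby</strong>: once the primary is upgraded, rebuild the standby with <code>pg_basebackup</code> (suits small standbys and fast networks)</li>
</ol>
<p>In a Patroni HA environment, a <em>rolling upgrade</em> means bringing up an upgraded node, failing over to it, and then rebuilding the old primary as a new standby. Because streaming replication cannot cross major versions, the upgraded node cannot be a sync standby that is still streaming from pg14; it has to be fed by logical replication. Complexity is high, so rehearse it on staging.</p>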
<h2 id="capacity--downtime-trade-off">Capacity / downtime trade-off</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Estimated downtime (500GB DB)</th>
          <th>Hardware cost</th>
          <th>Risk</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pg_upgrade --link</code></td>
          <td>15-30 minutes (including the first ANALYZE stage)</td>
          <td>Same as current</td>
          <td>High (irreversible)</td>
      </tr>
      <tr>
          <td><code>pg_upgrade</code> (default copy mode)</td>
          <td>1-3 hours</td>
          <td>Temporarily 2x storage</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>&lt; 1 minute cutover</td>
          <td>Temporarily 2x compute + storage</td>
          <td>Medium (complex)</td>
      </tr>
      <tr>
          <td>Blue-green</td>
          <td>Momentary cutover (&lt; 30 seconds)</td>
          <td>Sustained 2x (can be torn down after cutover)</td>
          <td>Low (cloud managed)</td>
      </tr>
  </tbody>
</table>
<p>Practical defaults:</p>
<ul>
<li>&lt; 100GB and 30 minutes of downtime is acceptable: <code>pg_upgrade --link</code></li>
<li>100GB - 1TB and &lt; 5 minutes of downtime required: logical replication (stock PostgreSQL)</li>
<li>1TB+ or a strict SLA: blue-green via Aurora / RDS (cloud managed)</li>
</ul>
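<p>These defaults can be written down as a small decision helper. The following is a minimal sketch that encodes only the three rules listed above; the function name, the <code>strict_sla</code> flag, and the fallback string for combinations the list does not cover are hypothetical.</p>

```python
def default_upgrade_method(
    db_size_gb: float,
    max_downtime_min: float,
    strict_sla: bool = False,
) -> str:
    """Pick the default upgrade method from DB size and tolerated downtime."""
    # Rule 3: 1TB+ or a strict SLA goes to blue-green.
    if db_size_gb >= 1000 or strict_sla:
        return "blue-green"
    # Rule 1: small DB that can tolerate 30 minutes of downtime.
    if db_size_gb < 100 and max_downtime_min >= 30:
        return "pg_upgrade --link"
    # Rule 2: mid-size DB that needs under 5 minutes of downtime.
    if 100 <= db_size_gb < 1000 and max_downtime_min < 5:
        return "logical replication"
    # Combinations outside the listed defaults need a case-by-case decision.
    return "no listed default"
```

<p>For instance, a 50GB database that tolerates 30 minutes of downtime maps to <code>pg_upgrade --link</code>, while a 400GB database with a 3-minute budget maps to logical replication.</p>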
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-patroni-ha-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>HA cluster upgrade workflow:</p>
<ol>
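<li>Bring up an upgraded node outside the cluster and replicate into it (logical replication, since physical streaming cannot cross major versions)</li>
<li>Promote the new node and fail the old cluster over to it</li>
<li>Rebuild the remaining standbys</li>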
</ol>
<p>PostgreSQL 17+ supports <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">logical slots that survive failover</a>, which reduces the impact on logical consumers during a major version upgrade.</p>
<h3 id="跟-monitoring-整合">Integration with monitoring</h3>
<p>Metrics that deserve special attention during the upgrade:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Pre-upgrade baseline
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_database_size</span><span class="p">(</span><span class="s1">&#39;myapp&#39;</span><span class="p">),</span><span class="w"> </span><span class="k">version</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Post-upgrade verification
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">pg_database_size</span><span class="p">(</span><span class="s1">&#39;myapp&#39;</span><span class="p">),</span><span class="w"> </span><span class="k">version</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_user_tables</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">last_analyze</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- should be 0; any remaining table means ANALYZE has not finished</span></span></span></code></pre></div><p>Three Prometheus checks, each alerting when its condition is violated: <code>pg_database_size</code> differs by &lt; 1% after the upgrade, <code>pg_stat_replication</code> lag is &lt; 10s, and <code>pg_query_p99_latency</code> is &lt; 1.5x the baseline.</p>
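<p>The three thresholds can also be checked in a post-upgrade gate script. The following is a minimal sketch, assuming the five input values have already been collected (for example from the queries above and from the monitoring system); the function name, parameter names, and violation labels are hypothetical.</p>

```python
from typing import List


def post_upgrade_violations(
    size_before_bytes: int,
    size_after_bytes: int,
    replication_lag_s: float,
    p99_ms: float,
    baseline_p99_ms: float,
) -> List[str]:
    """Return the names of the checks that fail; an empty list means all pass."""
    violations = []
    # Check 1: database size should differ by less than 1% after the upgrade.
    size_drift = abs(size_after_bytes - size_before_bytes) / size_before_bytes
    if size_drift >= 0.01:
        violations.append("database size drift")
    # Check 2: replication lag should stay under 10 seconds.
    if replication_lag_s >= 10:
        violations.append("replication lag")
    # Check 3: p99 latency should stay under 1.5x the pre-upgrade baseline.
    if p99_ms >= 1.5 * baseline_p99_ms:
        violations.append("p99 latency")
    return violations
```

<p>An empty list means all three conditions hold. The Case 3 numbers (p99 going from 50 ms to 5000 ms) would be reported as a <code>p99 latency</code> violation.</p>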
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Aurora major version upgrade</strong>: blue-green deployment is the default and the workflow is entirely different from self-managed; see the mapping section of <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a></li>
<li><strong>Cross-major version skip upgrade</strong>: pg13 → pg17 spans 4 majors and breaking changes accumulate; the recommendation is to upgrade <em>one major at a time</em> rather than in a <em>single hop</em></li>
<li><strong>Extension lifecycle management</strong>: automate the audit of extension compatibility with each PG version and run a dry run every quarter</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
<li>Matching migration: <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a> / <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> (this article validates the <em>missed-category</em> case)</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL → Aurora Migration: protocol-compatible, operationally redesigned</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook that cross-links &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> (the self-managed source) and &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&amp;#43;75% 效能改善的 production 證據">Aurora&lt;/a> (the cloud-managed target). Compared with the two earlier migrations (&lt;a href="https://tarrragon.github.io/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic&lt;/a>, with high schema divergence, and &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &amp;#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB&lt;/a>, a drop-in), this one is the &lt;em>middle ground&lt;/em>: the wire protocol is drop-in, but the operational model is redesigned. Each phase transition is guarded by a &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a>.&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼遷operational-cost--ha--dr-三條-driver">Why migrate: three drivers (operational cost / HA / DR)&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>Trigger scenario&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>Operational cost&lt;/strong>&lt;/td>
 &lt;td>Self-managed PostgreSQL + Patroni HA + pgBackRest backup + monitoring takes 0.5-2 FTE; Aurora shifts that responsibility to AWS so SREs can focus on the application&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>HA reliability&lt;/strong>&lt;/td>
 &lt;td>Patroni split-brain / DCS quorum issues bite occasionally, with production failover at 4-15s; Aurora fails over automatically across AZs in &amp;lt; 30s, and shared storage loses no data&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>DR / backup&lt;/strong>&lt;/td>
 &lt;td>Self-managed PITR + cross-region replication is complex; Aurora simplifies it with built-in PITR + global database + backup retention&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>The reverse driver (Aurora → self-managed) also exists, mainly because &lt;em>Aurora becomes the more expensive option at 10TB+ scale&lt;/em> or because &lt;em>a required PostgreSQL extension is not supported on Aurora&lt;/em> (for example TimescaleDB or Citus).&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook that cross-links <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> (the self-managed source) and <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a> (the cloud-managed target). Compared with the two earlier migrations (<a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a>, with high schema divergence, and <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a>, a drop-in), this one is the <em>middle ground</em>: the wire protocol is drop-in, but the operational model is redesigned. Each phase transition is guarded by a <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a>.</p></blockquote>
<h2 id="為什麼遷operational-cost--ha--dr-三條-driver">Why migrate: three drivers (operational cost / HA / DR)</h2>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger scenario</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Operational cost</strong></td>
          <td>Self-managed PostgreSQL + Patroni HA + pgBackRest backup + monitoring takes 0.5-2 FTE; Aurora shifts that responsibility to AWS so SREs can focus on the application</td>
      </tr>
      <tr>
          <td><strong>HA reliability</strong></td>
          <td>Patroni split-brain / DCS quorum issues bite occasionally, with production failover at 4-15s; Aurora fails over automatically across AZs in &lt; 30s, and shared storage loses no data</td>
      </tr>
      <tr>
          <td><strong>DR / backup</strong></td>
          <td>Self-managed PITR + cross-region replication is complex; Aurora simplifies it with built-in PITR + global database + backup retention</td>
      </tr>
  </tbody>
</table>
<p>The reverse driver (Aurora → self-managed) also exists, mainly because <em>Aurora becomes the more expensive option at 10TB+ scale</em> or because <em>a required PostgreSQL extension is not supported on Aurora</em> (for example TimescaleDB or Citus).</p>
<h2 id="結構protocol-相容--operational-phased-的混合">Structure: a hybrid of protocol compatibility and operational phasing</h2>
<p>Set against the two earlier articles, the Aurora migration combines a <em>protocol drop-in</em> (the application keeps its SQL) with an <em>operational redesign</em> (HA / backup / monitoring are all replaced):</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Splunk → Elastic (high schema divergence)</th>
          <th>Redis → DragonflyDB (drop-in)</th>
          <th>PostgreSQL → Aurora (middle)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Wire protocol</td>
          <td>Entirely different (SPL vs KQL)</td>
          <td>Identical (RESP)</td>
          <td>Identical (PostgreSQL wire)</td>
      </tr>
      <tr>
          <td>Schema / data model</td>
          <td>High divergence (CIM vs ECS)</td>
          <td>Identical</td>
          <td>Identical</td>
      </tr>
      <tr>
          <td>Application code</td>
          <td>Must change</td>
          <td>Unchanged</td>
          <td>Unchanged</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Different</td>
          <td>Similar</td>
          <td><strong>Very different</strong></td>
      </tr>
      <tr>
          <td>HA / replication</td>
          <td>Different</td>
          <td>Similar</td>
          <td><strong>Fully redesigned</strong></td>
      </tr>
      <tr>
          <td>Backup model</td>
          <td>Different</td>
          <td>Simplified</td>
          <td><strong>Fully replaced by AWS-native</strong></td>
      </tr>
      <tr>
          <td>Migration duration</td>
          <td>4-9 months</td>
          <td>1-4 weeks</td>
          <td>6-12 weeks</td>
      </tr>
      <tr>
          <td>Phased structure needed</td>
          <td>Clearly 6 phases</td>
          <td>Not needed</td>
          <td><strong>Hybrid</strong> (3 operational phases + drop-in cutover)</td>
      </tr>
  </tbody>
</table>
<p><strong>Hypothesis check</strong>: the structure of a migration playbook is determined by its <em>dimension of greatest difference</em>. Splunk → Elastic is phased around schema differences, while the Aurora migration is partially phased around operational differences.</p>
<h2 id="operational-redesign-對位">Operational redesign mapping</h2>
<p>How Aurora's operational model differs from self-managed PostgreSQL:</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>Self-managed PostgreSQL</th>
          <th>Aurora</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage</td>
          <td>Local disk / EBS, coupled to compute</td>
          <td>Shared storage with 6 copies across AZs, decoupled from compute</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Patroni + DCS quorum + watchdog</td>
          <td>Aurora's own failover; with shared storage, promotion needs no data copy</td>
      </tr>
      <tr>
          <td>Read replica</td>
          <td>Streaming replication, managed by Patroni</td>
          <td>Aurora reader endpoint, routed automatically by the cluster</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest / WAL-G + S3</td>
          <td>Automatic continuous backup + PITR (built in)</td>
      </tr>
      <tr>
          <td>Failover time</td>
          <td>15-60s (Patroni)</td>
          <td>&lt; 30s (same AZ) / 1-2 min (cross AZ)</td>
      </tr>
      <tr>
          <td>Connection management</td>
          <td>PgBouncer required</td>
          <td>RDS Proxy recommended (AWS-managed connection pooling)</td>
      </tr>
      <tr>
          <td>Major version upgrade</td>
          <td>Manual, with downtime</td>
          <td>Aurora's own blue/green deployment</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Prometheus + grafana-postgresql</td>
          <td>CloudWatch + Performance Insights</td>
      </tr>
      <tr>
          <td>Extension support</td>
          <td>Install anything</td>
          <td><strong>Allowlist</strong>, limited to AWS-approved extensions</td>
      </tr>
      <tr>
          <td>Custom config</td>
          <td>Full control of postgresql.conf</td>
          <td>Parameter Group (restricted)</td>
      </tr>
      <tr>
          <td>OS / kernel access</td>
          <td>Full control</td>
          <td><strong>None</strong> (fully managed)</td>
      </tr>
  </tbody>
</table>
<p>Every operational concept above needs its own migration plan: the application code stays the same, but <em>the whole body of operations knowledge is replaced</em>.</p>
<h2 id="migration-流程3-phase-operational--drop-in-cutover">Migration workflow: 3 operational phases + drop-in cutover</h2>
<h3 id="phase-0pre-migration-audit1-2-週">Phase 0: Pre-migration audit (1-2 weeks)</h3>
<ol>
<li><strong>Extension inventory mapping</strong>:</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">extname</span><span class="p">,</span><span class="w"> </span><span class="n">extversion</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- compare against the Aurora supported extensions list
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1">-- unsupported ones (e.g. TimescaleDB / Citus) need an alternative; check version-specific support for pg_repack / pg_partman</span></span></span></code></pre></div><ol start="2">
<li><strong>Custom config inventory</strong>:</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">setting</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_settings</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">source</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s1">&#39;default&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- compare against the parameters adjustable in an Aurora Parameter Group</span></span></span></code></pre></div><ol start="3">
<li><strong>Capacity assessment</strong>:</li>
</ol>
<ul>
<li>Current IOPS / connections / storage / WAL rate</li>
<li>The matching Aurora instance class (db.r6g.large to db.r6g.16xlarge)</li>
<li>Estimated cost (vCPU + IOPS + storage + backup retention)</li>
</ul>
<ol start="4">
<li><strong>Application connection pool audit</strong>:</li>
</ol>
<ul>
<li>Whether the PgBouncer configuration can move to RDS Proxy as-is</li>
<li>Connection string and IAM authentication preparation</li>
</ul>
<h3 id="phase-1operational-infrastructure-準備2-3-週">Phase 1: Operational infrastructure preparation (2-3 weeks)</h3>
<ol>
<li>Create the Aurora cluster (Terraform / CloudFormation)</li>
<li>Configure the Parameter Group to mirror the self-managed settings</li>
<li>Configure the Security Group + IAM role</li>
<li>Set up RDS Proxy (recommended; centralizes connection management)</li>
<li>Set up CloudWatch alerts + a Performance Insights baseline</li>
<li>Configure backup retention + the PITR window</li>
</ol>
<h3 id="phase-2data-migration取決於-dataset-大小">Phase 2: Data migration (depends on dataset size)</h3>
<p>Two routes:</p>
<h4 id="路線-aaws-dms推薦中等規模--5tb">Route A: AWS DMS (recommended for mid-size datasets, &lt; 5TB)</h4>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">self-managed Postgres ──(DMS)──→ Aurora
</span></span><span class="line"><span class="ln">2</span><span class="cl">                         |
</span></span><span class="line"><span class="ln">3</span><span class="cl">                  full load + CDC continuous</span></span></code></pre></div><ul>
<li>Set the DMS task to <code>Full Load + Ongoing Replication</code></li>
<li>Estimate the full-load duration (100GB takes roughly 1-3 hours, depending on instance class)</li>
<li>CDC keeps running until cutover</li>
</ul>
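<p>The full-load figure above can be extrapolated to other dataset sizes for a first planning estimate. The following is a minimal sketch that scales the article's "100GB in 1-3 hours" range linearly. Linear scaling is an assumption (LOB columns, index builds, and network throughput all change the real number), and the function and constant names are hypothetical.</p>

```python
from typing import Tuple

# Reference point taken from the bullet above: 100GB loads in roughly 1-3 hours.
REFERENCE_GB = 100.0
REFERENCE_HOURS_LOW = 1.0
REFERENCE_HOURS_HIGH = 3.0


def estimate_full_load_hours(dataset_gb: float) -> Tuple[float, float]:
    """Return a (low, high) full-load estimate in hours, assuming linear scaling."""
    scale = dataset_gb / REFERENCE_GB
    return (scale * REFERENCE_HOURS_LOW, scale * REFERENCE_HOURS_HIGH)
```

<p>Under that assumption a 2TB dataset lands at roughly 20-60 hours of full load, which is the kind of number that decides whether CDC has to run for days before cutover.</p>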
<h4 id="路線-blogical-replication推薦-5tb-或要精準控制">Route B: Logical replication (recommended for 5TB+ or when precise control is needed)</h4>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Source: create the publication
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">migrate_pub</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">ALL</span><span class="w"> </span><span class="n">TABLES</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Aurora: create the subscription
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">SUBSCRIPTION</span><span class="w"> </span><span class="n">migrate_sub</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">CONNECTION</span><span class="w"> </span><span class="s1">&#39;host=&lt;source&gt; dbname=&lt;db&gt; user=&lt;replicator&gt;&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">  </span><span class="n">PUBLICATION</span><span class="w"> </span><span class="n">migrate_pub</span><span class="p">;</span></span></span></code></pre></div><ul>
<li>Once the initial COPY finishes, the subscription switches to streaming changes</li>
<li>For details, see <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
</ul>
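<p>To judge when streaming has caught up, replication lag can be computed in bytes from two LSN values. The sketch below assumes the values were already fetched as text, for example <code>pg_current_wal_lsn()</code> on the source and <code>confirmed_flush_lsn</code> from <code>pg_replication_slots</code>; the helper names are hypothetical.</p>

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(source_lsn: str, confirmed_lsn: str) -> int:
    """Bytes of WAL the subscriber has not confirmed yet (never negative)."""
    return max(0, lsn_to_int(source_lsn) - lsn_to_int(confirmed_lsn))


def caught_up(source_lsn: str, confirmed_lsn: str, max_lag_bytes: int = 0) -> bool:
    """True when the lag is within the tolerated threshold."""
    return replication_lag_bytes(source_lsn, confirmed_lsn) <= max_lag_bytes


print(replication_lag_bytes("16/B374D848", "16/B374D800"))
```

<p>On the database side, <code>pg_wal_lsn_diff</code> performs the same subtraction; the Python version is only convenient when the two values come from different connections.</p>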
<h3 id="phase-3cutover-跟-verification">Phase 3: Cutover and verification</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. Put the application into maintenance mode (block writes)
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. Wait for replication lag → 0
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Confirm row counts + checksums match on the Aurora side
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. Switch the application connection string to the Aurora endpoint
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. Lift maintenance mode
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. Keep the self-managed side read-only as a standby for 1-2 weeks</span></span></code></pre></div><p>The cutover window depends on dataset size:</p>
<ul>
<li>&lt; 100GB: 1-2 hours</li>
<li>100GB - 1TB: 2-4 hours</li>
<li>1TB+: consider a <em>zero-downtime cutover</em> via blue-green deployment</li>
</ul>
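<p>Step 3 of the cutover checklist (row count plus checksum) can be sketched as an order-insensitive fingerprint per table. This is an illustration under assumptions: rows are already fetched as tuples, both sides render values identically through <code>repr</code>, and the function names are hypothetical. For large tables the hashing would normally be pushed into SQL instead.</p>

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) that does not depend on row order."""
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        # Sum modulo 2**64 so duplicate rows still change the checksum.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


def mismatched_tables(source, target):
    """Compare {table_name: fingerprint} maps and list tables that differ."""
    names = sorted(set(source) | set(target))
    return [name for name in names if source.get(name) != target.get(name)]


source = {"orders": table_fingerprint([(1, "a"), (2, "b")])}
target = {"orders": table_fingerprint([(2, "b"), (1, "a")])}
print(mismatched_tables(source, target))  # same rows in a different order
```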
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1extension-不支援application-直接壞">Case 1: An unsupported extension breaks the application outright</h3>
<p><strong>Symptom</strong>: after cutover, some application queries fail with <code>extension &quot;pg_repack&quot; not available</code> and batch jobs break.</p>
<p><strong>Root cause</strong>: the Phase 0 audit missed that the application uses pg_repack for maintenance; the extension was not available on the target Aurora cluster, so the cron job from the self-managed side could not be ported as-is.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Mandatory pre-migration audit</strong>: compare <code>SELECT extname FROM pg_extension</code> against the <a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Extensions.html">Aurora extension whitelist</a> for the target engine version</li>
<li><strong>Alternatives</strong>:
<ul>
<li>pg_repack → Aurora's own vacuum plus storage auto-resize</li>
<li>TimescaleDB → switch to declarative partitioning or move to Timestream</li>
<li>Citus → evaluate staying self-managed or redesigning the schema</li>
</ul>
</li>
<li><strong>Fallback strategy</strong>: if the extension is essential to the application, consider postponing the migration or choosing an alternative cloud (such as AlloyDB or Citus on Azure)</li>
</ol>
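<p>The audit in step 1 is a set difference. A minimal sketch follows, assuming the installed names come from <code>SELECT extname FROM pg_extension</code> and the supported names were copied by hand from the AWS list for the target engine version. The sample sets below are placeholders for illustration, not the real Aurora list.</p>

```python
def unsupported_extensions(installed, supported):
    """Return installed extensions missing from the supported list, sorted."""
    installed_names = {name.lower() for name in installed}
    supported_names = {name.lower() for name in supported}
    return sorted(installed_names - supported_names)


# Placeholder data: replace with the query result and the list from the AWS docs.
installed = ["plpgsql", "pg_stat_statements", "timescaledb", "citus"]
supported = {"plpgsql", "pg_stat_statements", "postgis"}

blockers = unsupported_extensions(installed, supported)
if blockers:
    print("Resolve before migrating:", ", ".join(blockers))
```

<p>Running this per database (not only per cluster) matters, because <code>pg_extension</code> is database-scoped.</p>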
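<h3 id="case-2replication-slot-不直通">Case 2: Replication slots do not carry over</h3>
<p><strong>Symptom</strong>: the self-managed side runs Debezium CDC to capture application events; after cutover the CDC pipeline breaks and messages stop flowing into Kafka.</p>
<p><strong>Root cause</strong>: Aurora places restrictions on logical replication slots. A slot on the self-managed primary is not migrated with the data, and logical decoding on Aurora stays off until <code>rds.logical_replication</code> is enabled in the cluster parameter group, so an external consumer such as Debezium cannot keep reading its old slot; the pipeline has to be re-established on the Aurora side or moved to <em>DMS CDC</em>.</p>
<p><strong>Fix</strong>:</p>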
<ol>
<li><strong>Pre-migration audit</strong>: list every logical consumer (Debezium / Kafka Connect / in-house CDC)</li>
<li><strong>Alternatives</strong>:
<ul>
<li>Replace Debezium with DMS CDC (natively supported on Aurora)</li>
<li>Evaluate RDS Database Activity Streams (a newer feature)</li>
<li>Redesign CDC: the application writes to an outbox table, and an Aurora trigger publishes to SNS → Lambda → Kafka</li>
</ul>
</li>
<li><strong>Accept the cost</strong>: rebuilding the CDC pipeline is 2-4 weeks of work and belongs in the migration scope</li>
</ol>
<h3 id="case-3autovacuum-行為跟-self-managed-不同">Case 3: Autovacuum behaves differently from self-managed</h3>
<p><strong>Symptom</strong>: a few days after cutover, bloat metrics on specific hot tables look abnormal and application query latency p99 rises; Performance Insights shows autovacuum running 3x more often than on the self-managed side.</p>
<p><strong>Root cause</strong>: the default Aurora Parameter Group configures autovacuum differently from the self-managed setup: <code>autovacuum_vacuum_cost_limit</code> defaults lower and <code>autovacuum_vacuum_scale_factor</code> is more aggressive; vacuum also behaves differently on shared storage.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Align the Parameter Group</strong>: copy the self-managed <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a> settings into the Aurora Parameter Group</li>
<li><strong>Per-table tuning</strong>: hot-table <code>ALTER TABLE SET (autovacuum_*)</code> settings can be migrated as-is</li>
<li><strong>Accept the difference</strong>: Aurora's storage design means vacuum does not have to follow the same cadence as self-managed, so the SRE mental model needs adjusting</li>
</ol>
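<h3 id="case-4iam-認證強制application-端改-connection-logic">Case 4: IAM authentication becomes mandatory and the application must change its connection logic</h3>
<p><strong>Symptom</strong>: after production moves to Aurora the application still uses password authentication, and the SOC team requires IAM authentication for compliance; the application's connection logic needs major changes and token rotation logic has to be added.</p>
<p><strong>Root cause</strong>: the self-managed side used a fixed username/password, while Aurora recommends (and in some environments requires) IAM authentication; tokens expire after 15 minutes, so the application has to change how it connects.</p>
<p><strong>Fix</strong>:</p>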
<ol>
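<li><strong>Include it in the migration scope</strong>: the authentication migration is required work and cannot be bolted on afterwards</li>
<li><strong>SDK integration</strong>: use the AWS SDK plus RDS Proxy to abstract token rotation so the application does not manage tokens directly</li>
<li><strong>Hybrid period</strong>: keep password auth until the application has fully switched to IAM, then disable password auth</li>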
</ol>
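<p>Where the application does have to manage tokens itself, the 15-minute lifetime suggests a small cache that refreshes shortly before expiry. The sketch below is an assumption-laden illustration: <code>TokenCache</code> and its parameters are hypothetical, and <code>fetch_token</code> stands for whatever obtains a token (for example a thin wrapper around boto3's <code>generate_db_auth_token</code>).</p>

```python
import time


class TokenCache:
    """Cache a short-lived auth token and refresh it before it expires."""

    def __init__(self, fetch_token, lifetime_s=900, refresh_margin_s=120,
                 clock=time.monotonic):
        self._fetch_token = fetch_token      # callable returning a new token
        self._lifetime_s = lifetime_s        # 15 minutes, per the text above
        self._refresh_margin_s = refresh_margin_s
        self._clock = clock                  # injectable for tests
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refresh early so a new connection never starts with a stale token.
        if self._token is None or now >= self._expires_at - self._refresh_margin_s:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime_s
        return self._token


# Usage with a stand-in fetcher; a real one would call the AWS SDK.
cache = TokenCache(lambda: "example-token")
password_for_new_connection = cache.get()
```

<p>The token is only needed when a connection is opened, so the cache belongs in the connection factory of the pool, not around individual queries.</p>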
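<h3 id="case-5cost-model-預估錯月底帳單炸">Case 5: The cost model was misestimated and the month-end bill blows up</h3>
<p><strong>Symptom</strong>: the first month's Aurora bill is 50-80% above the estimate; IOPS, backup storage, and I/O cost all exceed expectations.</p>
<p><strong>Root cause</strong>: Aurora pricing has three layers (compute instance / storage / I/O):</p>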
<ul>
<li>Storage: actual data + backup × retention</li>
<li>I/O: every block read or written is billed (not metered separately on self-managed)</li>
<li>Backup: usage beyond the backup retention allowance is charged as snapshot storage</li>
</ul>
<p>On the self-managed side the team is used to a <em>fixed EC2 + EBS</em> cost; Aurora's I/O-based billing hits high-IOPS workloads hard.</p>
<p><strong>Fix</strong>:</p>
<ol>
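<li><strong>Pre-migration cost estimate</strong>: estimate I/O volume from <code>pg_stat_database</code> on the self-managed side and feed it into the Aurora pricing calculator</li>
<li><strong>I/O optimization</strong>: switch to the Aurora I/O-Optimized storage configuration (higher fixed compute and storage rates, no per-I/O charge); suited to high-IOPS workloads</li>
<li><strong>Backup retention control</strong>: do not default to the 35-day maximum; size retention to compliance needs (7-14 days is usually enough)</li>
<li><strong>Reserved Instance</strong>: prepay a 1- or 3-year term for steady workloads and save 30-40%</li>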
</ol>
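<p>The I/O part of the estimate in step 1 is simple arithmetic once an average I/O rate is known. The sketch below assumes the rate was derived from deltas of <code>pg_stat_database</code> counters over a representative period, which only approximates how Aurora meters I/O. The price per million requests is a placeholder parameter; take the real figure from the current AWS price list for the region.</p>

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def monthly_io_cost(reads_per_s, writes_per_s, price_per_million_io):
    """Estimate the monthly I/O charge from average request rates."""
    total_requests = (reads_per_s + writes_per_s) * SECONDS_PER_MONTH
    return total_requests / 1_000_000 * price_per_million_io


# Placeholder numbers for illustration only.
estimate = monthly_io_cost(reads_per_s=700, writes_per_s=300,
                           price_per_million_io=0.20)
print(f"Estimated I/O charge: ${estimate:,.2f} per month")
```

<p>Comparing this figure against the price difference of I/O-Optimized gives a first answer to whether step 2 pays off.</p>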
<h2 id="capacity--cost-對照">Capacity / cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Self-managed PostgreSQL (EC2 + EBS)</th>
          <th>Aurora</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance cost</td>
          <td>EC2 + EBS (self-managed compute + storage)</td>
          <td>Aurora instance class + storage + I/O</td>
      </tr>
      <tr>
          <td>HA cost</td>
          <td>Patroni across 3 AZs + 3 EBS copies</td>
          <td>Aurora shared storage across 3 AZs (built in)</td>
      </tr>
      <tr>
          <td>Backup cost</td>
          <td>pgBackRest + S3 archive</td>
          <td>Aurora automatic continuous backup (built in)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-2 FTE (HA / backup / patching)</td>
          <td>0.1-0.3 FTE (application side + Parameter Group)</td>
      </tr>
      <tr>
          <td>Monthly cost at 1TB</td>
          <td>$400-800 (incl. HA)</td>
          <td>$700-1500 (incl. HA)</td>
      </tr>
      <tr>
          <td>Monthly cost at 10TB</td>
          <td>$2K-4K</td>
          <td>$4K-8K (I/O cost is significant)</td>
      </tr>
      <tr>
          <td>Monthly cost at 50TB+</td>
          <td>$10K-20K</td>
          <td>$30K+ (the cost flips; self-managed is cheaper)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading</strong>: below 10TB, Aurora still comes out cheaper once operational (FTE) cost is factored in; at 50TB+ Aurora is significantly more expensive and needs Reserved Instances plus I/O-Optimized to stay competitive.</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-patroni-ha-對位">Mapping to <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>Patroni is <em>retired</em> after the Aurora migration, replaced by Aurora's own failover; the SRE mental model still has to adjust:</p>
<ul>
<li>The <code>pg_rewind</code> step that Patroni relies on has no equivalent (shared storage)</li>
<li>The <code>synchronous_commit</code> behavior tuned under Patroni is hidden inside Aurora's storage layer</li>
<li>Cross-region on Aurora uses <em>Global Database</em>, not a Patroni cross-region setup</li>
</ul>
<h3 id="跟-pitr-對位">Mapping to <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR</a></h3>
<p>Rebuilding through self-managed PITR is a lot of work; Aurora PITR is a native API call:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">aws rds restore-db-cluster-to-point-in-time <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --source-db-cluster-identifier myapp-prod <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --db-cluster-identifier myapp-prod-restored <span class="se">\
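</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --restore-to-time 2026-05-19T14:30:00Z</span></span></code></pre></div><p>No base backup + WAL replay thinking is needed; the storage layer handles it automatically. The call restores the cluster only, so a DB instance still has to be created in the restored cluster before it can serve connections.</p>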
<h3 id="跟-pgbouncer--rds-proxy"><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer</a> → RDS Proxy</h3>
<p>In most cases PgBouncer can be replaced by RDS Proxy:</p>
<ul>
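<li>Equivalent transaction-level pooling</li>
<li>IAM authentication integration</li>
<li>Connection multiplexing for Lambda / serverless workloads (watch for connection pinning, which disables multiplexing for the pinned session)</li>
<li><strong>Limitation</strong>: RDS Proxy is still catching up on some PG 14+ features, and prepared statements behave differently</li>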
</ul>
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Aurora Serverless v2 evaluation</strong>: a good fit for variable workloads; more expensive for steady workloads</li>
<li><strong>Babelfish evaluation</strong>: runs the SQL Server protocol on Aurora (for consolidating multiple sources onto Aurora)</li>
<li><strong>Cross-region DR</strong>: Aurora Global Database vs self-managed cross-region streaming + Patroni</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Target vendor：<a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a></li>
<li>Parallel migration playbooks: <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic Security</a> / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a></li>
<li>Further migration within the Aurora family: <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/" data-link-title="PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift" data-link-desc="Aurora DSQL（2024-12 re:Invent preview / 2025-05 GA）是 AWS 推的 PG wire-compatible *active-active distributed SQL*、跟 self-managed PG / Aurora PG 不同 paradigm（OCC &#43; snapshot isolation &#43; multi-region strong consistency）。Migration 結構是 *protocol drop-in &#43; paradigm shift*：app SQL 不太改、但 transaction retry / extension 缺位 / 多 region 一致性需重設計。本文走 DSQL vs Aurora PG vs self-managed PG 三軸對比、為什麼遷的三條 driver（global write / operational zero-touch / region resiliency）、Type E phased plan、5 production 踩雷（transaction retry 沒處理 / extension 缺位 / sequence throughput 限制 / Aurora PG 直升 DSQL 不可行 / region failover semantic）、跟 PG → Aurora 跟 PG → CockroachDB 對比">→ Aurora DSQL</a> (moving from Aurora PG to DSQL active-active distributed; a Type E paradigm shift)</li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL Archiving</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL → Aurora DSQL Migration：PG wire-compatible Distributed SQL 的 Paradigm Shift</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora-dsql/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook, cross-linked to &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> (source) and &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&amp;#43;75% 效能改善的 production 證據">Aurora&lt;/a> (DSQL also belongs to the Aurora family, but follows a different paradigm). Set against &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &amp;#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">migrate-to-aurora&lt;/a> (PG → Aurora PG, protocol drop-in + operational redesign) and &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &amp;#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb&lt;/a> (PG → CRDB, Type E paradigm shift), this article covers &lt;em>the paradigm shift from PG to DSQL within the Aurora family&lt;/em>. Each phase transition is guarded by a &lt;a 
href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a>.&lt;/p>&lt;/blockquote>
&lt;blockquote>
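&lt;p>&lt;strong>Time anchor&lt;/strong>: Aurora DSQL entered &lt;strong>preview at re:Invent in 2024-12&lt;/strong> and reached &lt;strong>GA on 2025-05-27&lt;/strong>. Vendor claims in this article reflect the public state as of 2025-2026; before an actual migration, defer to the AWS docs (features are still evolving).&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼遷global-write--operational-zero-touch--region-resiliency-三條-driver">Why migrate: three drivers, Global Write / Operational Zero-touch / Region Resiliency&lt;/h2>
&lt;p>PG → DSQL is not a &amp;quot;natural evolution&amp;quot;; it is a paradigm switch made when &lt;em>application requirements outgrow the single-primary model&lt;/em>. Each of the three typical drivers maps to one kind of application constraint. They are not &amp;quot;pick one of three&amp;quot;; rather, at least one is a hard requirement and the other two are a bonus:&lt;/p></description><content:encoded><![CDATA[<blockquote>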
<p>This article is a cross-vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook, cross-linked to <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> (source) and <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a> (DSQL also belongs to the Aurora family, but follows a different paradigm). Set against <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">migrate-to-aurora</a> (PG → Aurora PG, protocol drop-in + operational redesign) and <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a> (PG → CRDB, Type E paradigm shift), this article covers <em>the paradigm shift from PG to DSQL within the Aurora family</em>. Each phase transition is guarded by a <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a>.</p></blockquote>
<blockquote>
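<p><strong>Time anchor</strong>: Aurora DSQL entered <strong>preview at re:Invent in 2024-12</strong> and reached <strong>GA on 2025-05-27</strong>. Vendor claims in this article reflect the public state as of 2025-2026; before an actual migration, defer to the AWS docs (features are still evolving).</p></blockquote>
<h2 id="為什麼遷global-write--operational-zero-touch--region-resiliency-三條-driver">Why migrate: three drivers, Global Write / Operational Zero-touch / Region Resiliency</h2>
<p>PG → DSQL is not a &quot;natural evolution&quot;; it is a paradigm switch made when <em>application requirements outgrow the single-primary model</em>. Each of the three typical drivers maps to one kind of application constraint. They are not &quot;pick one of three&quot;; rather, at least one is a hard requirement and the other two are a bonus:</p>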
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger scenario</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Global write</strong></td>
          <td>The application needs multi-region active-active writes (not Aurora PG's single-writer + read replica model)</td>
      </tr>
      <tr>
          <td><strong>Operational zero-touch</strong></td>
          <td>The team does not want to manage Patroni / PgBouncer / autovacuum / failover / backup retention; Aurora PG already removes about half of that, and DSQL goes further to zero-touch</td>
      </tr>
      <tr>
          <td><strong>Region resiliency</strong></td>
          <td>The application switches over transparently when a whole region fails (Aurora PG cross-region replicas are asynchronous; DSQL is strongly consistent across regions)</td>
      </tr>
  </tbody>
</table>
<p>Reverse drivers (DSQL → Aurora PG) also exist:</p>
<ul>
<li>PG extensions are required (pgvector / TimescaleDB / PostGIS / pg_repack); DSQL does not support them</li>
<li>Cost: DSQL is 2-5x more expensive than Aurora PG (depending on the number of regions)</li>
<li>Single-region OLTP does not need the overhead of distributed transactions</li>
</ul>
<h2 id="結構protocol-drop-in--paradigm-shift">Structure: Protocol Drop-in + Paradigm Shift</h2>
<p>DSQL is PG wire-compatible (you can connect with <code>psql</code>), but internally it is a <em>distributed SQL engine</em>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>self-managed PG</th>
          <th>Aurora PG</th>
          <th>Aurora DSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Wire protocol</td>
          <td>PG</td>
          <td>PG</td>
          <td>PG (subset)</td>
      </tr>
      <tr>
          <td>Architecture</td>
          <td>Single primary</td>
          <td>Single primary + shared storage</td>
          <td><strong>Active-active distributed</strong></td>
      </tr>
      <tr>
          <td>Multi-region write</td>
          <td>Not supported (async replica)</td>
          <td>Not supported (async replica)</td>
          <td><strong>Strongly consistent across regions</strong></td>
      </tr>
      <tr>
          <td>Transaction model</td>
          <td>MVCC + snapshot isolation</td>
          <td>MVCC + snapshot isolation</td>
          <td><strong>OCC + strong snapshot isolation</strong></td>
      </tr>
      <tr>
          <td>Extension</td>
          <td>Any</td>
          <td>AWS whitelist</td>
          <td><strong>No extension support</strong></td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Fully self-managed</td>
          <td>AWS manages storage / failover</td>
          <td>AWS manages everything; zero-touch</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Patroni 15-60s</td>
          <td>Aurora 30s</td>
          <td>N/A (always active-active; no failover concept)</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>Self-managed instance</td>
          <td>Instance hour + storage</td>
          <td>Per-DPU + multi-AZ replication</td>
      </tr>
  </tbody>
</table>
<p><strong>The core of the paradigm shift</strong>:</p>
<ol>
<li><strong>Transaction semantics</strong>: DSQL uses OCC (Optimistic Concurrency Control) + strong snapshot isolation, unlike PG's default read committed or its repeatable read snapshot. When the same row has concurrent writes, the conflict is detected only at commit time and one transaction is aborted, so the application must handle <code>40001</code> serialization_failure</li>
<li><strong>No extensions</strong>: PostGIS / pgvector / TimescaleDB / pg_partman are all unavailable; applications that depend on these features must split them out</li>
<li><strong>No stateful sessions</strong>: DSQL has built-in connection pooling, so the application cannot rely on session state (temp tables / prepared statements / advisory locks)</li>
</ol>
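<p>The commit-time conflict described in point 1 can be illustrated with a toy in-memory model. This is not how DSQL is implemented; it is a sketch in which each transaction takes a snapshot at begin, buffers writes without locking, and is rejected at commit if another transaction already changed a row it wrote. All class names are hypothetical.</p>

```python
class SerializationFailure(Exception):
    """Toy stand-in for SQLSTATE 40001 (serialization_failure)."""
    sqlstate = "40001"


class Store:
    """In-memory rows: key -> (version, value)."""

    def __init__(self):
        self.rows = {}

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self._store = store
        # Freeze a snapshot at begin; reads never see later commits.
        self._snapshot = dict(store.rows)
        self._writes = {}

    def read(self, key):
        if key in self._writes:
            return self._writes[key]
        return self._snapshot.get(key, (0, None))[1]

    def write(self, key, value):
        # No lock is taken; the write is only buffered.
        self._writes[key] = value

    def commit(self):
        # The conflict check happens only now, at commit time.
        for key in self._writes:
            seen = self._snapshot.get(key, (0, None))[0]
            current = self._store.rows.get(key, (0, None))[0]
            if current != seen:
                raise SerializationFailure(f"write-write conflict on {key!r}")
        for key, value in self._writes.items():
            seen = self._snapshot.get(key, (0, None))[0]
            self._store.rows[key] = (seen + 1, value)


store = Store()
t1, t2 = store.begin(), store.begin()
t1.write("balance", 100)
t2.write("balance", 200)
t1.commit()                      # first committer wins
try:
    t2.commit()                  # second committer is aborted
except SerializationFailure as exc:
    print("aborted with SQLSTATE", exc.sqlstate)
```

<p>On PG with row locks, the second writer would have waited instead of failing, which is why existing application code usually has no retry path for this error.</p>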
<h2 id="schema-gappg-對-dsql-限制">Schema gap: DSQL restrictions relative to PG</h2>
<p>DSQL is a PG-compatible <em>subset</em>; several categories of features are not supported:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>PG support</th>
          <th>DSQL support</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Extension</td>
          <td>Yes</td>
          <td>No (no <code>CREATE EXTENSION</code>)</td>
      </tr>
      <tr>
          <td>Foreign key constraint</td>
          <td>Yes</td>
          <td>No (the application maintains referential integrity)</td>
      </tr>
      <tr>
          <td>View / Materialized view</td>
          <td>Yes</td>
          <td>Views: partial / Materialized views: no</td>
      </tr>
      <tr>
          <td>JSON / JSONB</td>
          <td>Yes</td>
          <td>Partial (no GIN index acceleration)</td>
      </tr>
      <tr>
          <td>Foreign data wrapper</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Stored procedure (PL/pgSQL)</td>
          <td>Yes</td>
          <td>Partial (many restrictions)</td>
      </tr>
      <tr>
          <td>Trigger</td>
          <td>Yes</td>
          <td>Partial</td>
      </tr>
      <tr>
          <td>LISTEN / NOTIFY</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td><code>SELECT ... FOR UPDATE</code></td>
          <td>Yes</td>
          <td>Partial (DSQL OCC semantics)</td>
      </tr>
      <tr>
          <td>Sequence (serial / identity)</td>
          <td>Yes</td>
          <td>Supported, but with coordination overhead at high throughput</td>
      </tr>
      <tr>
          <td>Table partition</td>
          <td>Yes</td>
          <td>Partial</td>
      </tr>
      <tr>
          <td>Logical replication slot</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<p><strong>Schema audit required before migration</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Find all extension dependencies
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_extension</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- Find materialized views
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">schemaname</span><span class="p">,</span><span class="w"> </span><span class="n">matviewname</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_matviews</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Find sequences
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_sequences</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- Find foreign data wrapper servers
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_foreign_server</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- Find user-defined triggers
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_trigger</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="n">tgisinternal</span><span class="p">;</span></span></span></code></pre></div><p>Treat every hit as a migration blocker until it is either removed or confirmed to work on DSQL.</p>
<h2 id="operational-redesign">Operational Redesign</h2>
<p>Compared with self-managed PG or Aurora PG, the DSQL operational model is much simpler but carries different semantics:</p>
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>self-managed PG</th>
          <th>Aurora PG</th>
          <th>Aurora DSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage</td>
          <td>Local / EBS</td>
          <td>Shared, 6 copies</td>
          <td>Distributed log + replicated state</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Patroni</td>
          <td>Aurora failover</td>
          <td>Always HA (no failover concept)</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest / WAL-G</td>
          <td>Built-in continuous</td>
          <td>Built-in continuous (deeper integration)</td>
      </tr>
      <tr>
          <td>Connection pool</td>
          <td>PgBouncer / PgCat</td>
          <td>RDS Proxy recommended</td>
          <td>Built in (no configuration needed)</td>
      </tr>
      <tr>
          <td>Major version upgrade</td>
          <td>Manual + downtime</td>
          <td>Aurora blue/green</td>
          <td>Fully transparent (AWS performs the upgrade)</td>
      </tr>
      <tr>
          <td>Read replica</td>
          <td>Streaming replication</td>
          <td>Reader endpoint</td>
          <td>No distinction (every region reads and writes)</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>Prometheus / pg_stat_*</td>
          <td>CloudWatch + Performance Insights</td>
          <td>CloudWatch (simplified)</td>
      </tr>
      <tr>
          <td>Expected SRE FTE</td>
          <td>0.5-2</td>
          <td>0.2-0.5</td>
          <td>&lt; 0.1</td>
      </tr>
  </tbody>
</table>
<h2 id="migration-流程type-e-phased-plan">Migration process: Type E Phased Plan</h2>
<p>The phased plan for a Type E paradigm shift is structured much like <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a>:</p>
<h3 id="phase-1schema--application-audit">Phase 1: Schema / Application Audit</h3>
<ul>
<li>跑 schema audit（extension / MV / FDW / sequence / trigger）</li>
<li>識別 application 哪些 query / transaction pattern 需重設計</li>
<li>估算 <em>能直接遷的 % vs 需重寫的 %</em>、典型 60-80% / 20-40%</li>
</ul>
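<p>As a rough illustration of the Phase 1 audit, the sketch below counts migration-sensitive constructs in a schema dump and estimates the share of statements that need attention. The pattern table and the function names <code>audit_sql</code> and <code>rewrite_share</code> are assumptions made for this example, and the regexes are illustrative rather than exhaustive; a real audit would more likely query the system catalogs.</p>

```python
import re

# Hypothetical pattern table: feature label -> regex over DDL / application SQL.
# The labels mirror the audit list above; the regexes are deliberately simple.
AUDIT_PATTERNS = {
    "extension": r"\bCREATE\s+EXTENSION\b",
    "materialized_view": r"\bCREATE\s+MATERIALIZED\s+VIEW\b",
    "foreign_data_wrapper": r"\bCREATE\s+(FOREIGN\s+TABLE|SERVER)\b",
    "sequence": r"\b(BIG)?SERIAL\b|\bCREATE\s+SEQUENCE\b|\bnextval\s*\(",
    "trigger": r"\bCREATE\s+TRIGGER\b",
    "listen_notify": r"\b(LISTEN|NOTIFY)\s+\w+",
}


def audit_sql(sql_text):
    """Count occurrences of each migration-sensitive feature in SQL text."""
    return {
        label: len(re.findall(pattern, sql_text, flags=re.IGNORECASE))
        for label, pattern in AUDIT_PATTERNS.items()
    }


def rewrite_share(statements):
    """Return the fraction of statements that hit at least one audit pattern."""
    if not statements:
        return 0.0
    flagged = sum(1 for stmt in statements if any(audit_sql(stmt).values()))
    return flagged / len(statements)
```

<p>Feeding <code>rewrite_share</code> the statements of a <code>pg_dump --schema-only</code> output gives a first, coarse number to compare against the 60-80% / 20-40% split; application-side query patterns still need a separate review.</p>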
<h3 id="phase-2application-改造不上-dsql先在-pg-跑">Phase 2: Application Changes (Not on DSQL Yet; Run on PG First)</h3>
<ul>
<li>Add transaction retry middleware (intercept <code>40001</code>, exponential backoff)</li>
<li>Replace serial / bigserial with UUID</li>
<li>Remove features that depend on LISTEN/NOTIFY (switch to SQS / EventBridge)</li>
<li>Remove materialized views (switch to an application-side cache or incremental ETL)</li>
<li>Move stored procedures into application code</li>
<li>Run staging on PG and confirm the new application code still behaves correctly</li>
</ul>
<h3 id="phase-3dsql-cluster-建立--schema-遷">Phase 3: DSQL Cluster Setup + Schema Migration</h3>
<ul>
<li>Create the DSQL cluster</li>
<li>Apply DDL (a subset of the PG schema, no extensions)</li>
<li>DMS (Database Migration Service) initial load + ongoing replication</li>
<li>Run shadow traffic against both sides and compare query results</li>
</ul>
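<p>The shadow-traffic comparison in Phase 3 can be sketched as a multiset diff, so that differences in row order between the two engines do not count as mismatches. The function name <code>diff_result_sets</code> is hypothetical, and the rows are assumed to be already fetched from each side as tuples of hashable values.</p>

```python
from collections import Counter


def diff_result_sets(pg_rows, dsql_rows):
    """Compare two result sets as multisets so row order does not matter.

    Returns (only_in_pg, only_in_dsql), each a list of rows.
    """
    pg_counts = Counter(map(tuple, pg_rows))
    dsql_counts = Counter(map(tuple, dsql_rows))
    # Counter subtraction keeps only the positive surplus on each side.
    only_in_pg = list((pg_counts - dsql_counts).elements())
    only_in_dsql = list((dsql_counts - pg_counts).elements())
    return only_in_pg, only_in_dsql
```

<p>Two empty lists mean the query returned the same rows on both sides. Queries whose correctness depends on <code>ORDER BY</code> need an additional ordered comparison, which this sketch intentionally leaves out.</p>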
<h3 id="phase-4cutover">Phase 4: Cutover</h3>
<ul>
<li>Switch the application connection string to DSQL</li>
<li>Keep PG read-only for one week as the rollback target if something goes wrong</li>
<li>Monitor the <code>40001</code> retry rate and behavior during scaling events</li>
</ul>
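<p>One way to watch the <code>40001</code> retry rate after cutover is a sliding window over recent transactions. This is a minimal sketch: the class name <code>RetryRateMonitor</code>, the window size, and the 5% alert threshold are all assumptions for illustration, not values taken from DSQL documentation.</p>

```python
from collections import deque


class RetryRateMonitor:
    """Track the share of recent transactions that hit SQLSTATE 40001."""

    def __init__(self, window=1000, alert_threshold=0.05):
        # Each entry is True when the transaction needed at least one retry.
        self.outcomes = deque(maxlen=window)
        self.alert_threshold = alert_threshold  # hypothetical alert line

    def record(self, retried):
        self.outcomes.append(bool(retried))

    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        return self.rate() >= self.alert_threshold
```

<p>The retry middleware would call <code>record()</code> once per transaction; in production the same signal usually goes to the existing metrics pipeline rather than an in-process object.</p>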
<h3 id="phase-5多-region-拓展如適用">Phase 5: Multi-Region Expansion (If Applicable)</h3>
<ul>
<li>Add a second region endpoint</li>
<li>Change the application to multi-region routing (latency-based)</li>
<li>Test behavior under region failure / network partition</li>
</ul>
<h2 id="5-個-production-踩雷">5 Production Pitfalls</h2>
<h3 id="case-1transaction-retry-沒處理">Case 1: Transaction Retry Not Handled</h3>
<p><strong>Scenario</strong>: on PG, two transactions updating the same row go through lock + wait; on DSQL, in the same situation one of them receives <code>40001 serialization_failure</code>, the application does not catch it, and the user sees a 500 error.</p>
<p>Fixes:</p>
<ul>
<li>Add retry middleware in the DAO layer: catch <code>40001</code> + exponential backoff (with jitter)</li>
<li>Cap retries at 3-5 attempts; beyond that, return a 4xx to the user</li>
<li>Do not perform side effects (API calls / message sends) inside a transaction, because a retry repeats them</li>
</ul>
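<p>The "no side effects inside the transaction" rule can be sketched by having the transaction body return its side effects as deferred callables, which run only once, after an attempt succeeds. Everything here is illustrative: <code>SerializationFailure</code> is a stand-in for whatever exception your driver maps to SQLSTATE <code>40001</code>, and <code>run_with_deferred_effects</code> is a hypothetical helper name.</p>

```python
import random
import time


class SerializationFailure(Exception):
    """Stand-in for the driver exception mapped to SQLSTATE 40001 (assumption)."""


def run_with_deferred_effects(txn_body, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Retry txn_body on 40001 and run its side effects once, after success.

    txn_body() must do only database work and return (result, effects), where
    effects is a list of zero-argument callables (API calls, message sends).
    """
    for attempt in range(max_attempts):
        try:
            result, effects = txn_body()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise  # retries exhausted, let the caller map this to an error
            # Exponential backoff plus jitter before the next attempt.
            sleep((2 ** attempt) * base_delay + random.random() * base_delay)
            continue
        for effect in effects:
            effect()  # runs exactly once, outside the retried section
        return result
```

<p>Because the effects run after the transaction has committed, a crash between commit and effect can still lose a message; a transactional outbox table is the usual follow-up when that matters.</p>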





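<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">import</span> <span class="nn">random</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="kn">import</span> <span class="nn">time</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># Placeholder: replace with the exception your driver raises for SQLSTATE 40001.</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="k">class</span> <span class="nc">SerializationError</span><span class="p">(</span><span class="ne">Exception</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="k">pass</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="k">def</span> <span class="nf">with_retry</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">max_attempts</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_attempts</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">        <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">            <span class="k">return</span> <span class="n">fn</span><span class="p">()</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">        <span class="k">except</span> <span class="n">SerializationError</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">            <span class="k">if</span> <span class="n">attempt</span> <span class="o">==</span> <span class="n">max_attempts</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">                <span class="k">raise</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">            <span class="c1"># Exponential backoff plus jitter before the next attempt</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">            <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">((</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.05</span> <span class="o">+</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o">*</span> <span class="mf">0.05</span><span class="p">)</span></span></span></code></pre></div><h3 id="case-2extension-缺位feature-整段掉">Case 2: Missing Extensions, Whole Features Lost</h3>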
<p><strong>Scenario</strong>: production PG uses pgvector for RAG search, PostGIS for a store locator, and TimescaleDB for metrics; after switching to DSQL, all three features are gone.</p>
<p>Fixes:</p>
<ul>
<li>Do not migrate directly; first assess <em>which extension is load-bearing</em></li>
<li>pgvector → use an external Pinecone / Weaviate, or keep PG for the vector workload</li>
<li>PostGIS → keep PG for the GIS workload</li>
<li>TimescaleDB → switch to Amazon Timestream or keep PG</li>
<li>Put only the transactional core that <em>does not depend on extensions</em> on DSQL</li>
</ul>
<p>A common topology in practice: DSQL runs the transactional core, alongside PG (vector) + PG (GIS) + Timestream (metrics).</p>
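<h3 id="case-3sequence-高吞吐撞-coordination-overhead">Case 3: High-Throughput Sequences Hit Coordination Overhead</h3>
<p><strong>Scenario</strong>: with a <code>SERIAL</code> / <code>GENERATED AS IDENTITY</code> PK on DSQL, sequence nextval becomes the bottleneck once inserts exceed 1000/s, and insert latency jumps from 5ms to 80-100ms+.</p>
<p>DSQL supports sequences, but they are distributed counters rather than local atomic counters: every nextval needs cross-region coordination to guarantee uniqueness. This is fine at low throughput and hits a wall at high throughput.</p>
<p>Fixes:</p>
<ul>
<li>Switch the PK of high-throughput tables to a UUID, which needs no coordination: <code>gen_random_uuid()</code> yields a random UUID v4 (not time-sortable); for a time-sortable UUID v7, use an application-side library</li>
<li>Or use an application-side ULID (time-sortable, 128 bits, 26 characters as text)</li>
<li>Avoid application logic that depends on consecutive integer PKs altogether (for reporting / paging, use <code>ORDER BY created_at, id</code>)</li>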
</ul>
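<p>For readers without a UUID v7 library at hand, the layout is small enough to sketch directly: a 48-bit Unix millisecond timestamp, the version nibble, then random bits around the variant marker. The function below is a minimal sketch under that layout; the name <code>uuid7</code> and the optional <code>now_ms</code> parameter are choices made for this example, and it does not add the monotonic counter that some libraries use within the same millisecond.</p>

```python
import os
import time
import uuid


def uuid7(now_ms=None):
    """Build a UUID version 7 value: 48-bit Unix-ms timestamp, then random bits."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are used
    rand_a = rand % 2**12             # 12 bits placed after the version nibble
    rand_b = (rand // 2**12) % 2**62  # 62 bits placed after the variant bits
    value = (
        (now_ms % 2**48) * 2**80  # timestamp occupies the top 48 bits
        + 7 * 2**76               # version nibble = 7
        + rand_a * 2**64
        + 2 * 2**62               # variant bits 0b10
        + rand_b
    )
    return uuid.UUID(int=value)
```

<p>Because the timestamp sits in the most significant bits, values generated in later milliseconds sort after earlier ones, which keeps index inserts roughly append-only without any cross-node coordination.</p>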





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Switch to a UUID PK
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">UUID</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="n">gen_random_uuid</span><span class="p">(),</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>Low-throughput tables (settings / config) can keep sequences; for high-volume transactional tables (orders / events), UUIDs are recommended.</p>
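<h3 id="case-4aurora-pg-直升-dsql-想當-in-place">Case 4: Treating Aurora PG → DSQL as an In-Place Upgrade</h3>
<p><strong>Scenario</strong>: the team assumes that "Aurora PG and Aurora DSQL are both Aurora, so a direct upgrade should work", requests a cluster modify, and discovers they are two entirely separate services.</p>
<p>Fixes:</p>
<ul>
<li>This is not an in-place upgrade; it is a full migration (DMS + cutover)</li>
<li>Treat DSQL as an entirely new cluster type and go through the full Phase 1-4 process</li>
<li>Aurora PG → Aurora DSQL is no easier than PG → CRDB; wire compatibility only solves application connectivity, not the schema / paradigm differences</li>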
</ul>
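<h3 id="case-5region-failover-semantic">Case 5: Region Failover Semantics</h3>
<p><strong>Scenario</strong>: the team assumes "DSQL multi-region equals high availability" and designs around "writes still work when a whole region goes down", then finds in testing that "during a network partition DSQL relies on quorum and may reject writes".</p>
<p>DSQL is strongly consistent across regions and chooses CP (not AP) under CAP: during a network partition, some regions reject writes; it is not "always writable".</p>
<p>Fixes:</p>
<ul>
<li>Design the application to handle rejected writes (retry after the partition recovers)</li>
<li>Do not use DSQL as an "always writable" cache or queue</li>
<li>If you truly need AP behavior, use DynamoDB (global tables)</li>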
</ul>
<h2 id="capacity-規劃">Capacity Planning</h2>
<p>DSQL billing differs considerably from Aurora PG:</p>
<table>
  <thead>
      <tr>
          <th>Billing item</th>
          <th>Aurora PG</th>
          <th>Aurora DSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance</td>
          <td>Per-instance hour</td>
          <td>None (serverless)</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>Per-GB-month</td>
          <td>Per-GB-month (multi-replica pricing)</td>
      </tr>
      <tr>
          <td>IO</td>
          <td>Per-million IO</td>
          <td>Billed per transaction</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>Per-GB-month</td>
          <td>Built-in (no extra charge)</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>Cross-region replica (extra)</td>
          <td>Full price per region × N</td>
      </tr>
  </tbody>
</table>
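<p>Cost in practice: Aurora PG db.r6g.4xlarge multi-AZ runs about $2000/month → the same workload on DSQL runs about $5000-10000/month (depending on the number of regions).</p>
<p>When DSQL cost pays off:</p>
<ul>
<li>Multi-region active-active is a hard requirement (not a nice-to-have)</li>
<li>Operational FTE savings exceed the cost difference</li>
<li>Burst workloads (DSQL scales automatically; pre-provisioned Aurora PG wastes capacity while idle)</li>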
</ul>
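<p>The "FTE savings exceed the cost difference" condition is simple arithmetic, sketched below. The infrastructure figures would come from the estimate above; the fully loaded monthly FTE cost is an input you must supply yourself, and the function name <code>monthly_break_even</code> is hypothetical.</p>

```python
def monthly_break_even(aurora_cost, dsql_cost, fte_saved, fte_monthly_cost):
    """Return the net monthly saving of moving to DSQL.

    A positive result means the operational saving outweighs the extra
    infrastructure spend; a negative result means DSQL costs more overall.
    """
    infra_delta = dsql_cost - aurora_cost      # extra infrastructure spend
    ops_saving = fte_saved * fte_monthly_cost  # value of the freed SRE time
    return ops_saving - infra_delta
```

<p>For example, with Aurora PG at $2000/month, DSQL at $5000/month, and half an FTE freed, the move only breaks even once a full FTE costs at least $6000/month; at the upper $10000 estimate the bar rises to $16000/month.</p>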
<h2 id="跟既有-migration-playbook-對比">Comparison with Existing Migration Playbooks</h2>
<table>
  <thead>
      <tr>
          <th>Migration</th>
          <th>Type</th>
          <th>Main structure</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">→ Aurora PG</a></td>
          <td>C</td>
          <td>Protocol drop-in + operational redesign</td>
      </tr>
      <tr>
          <td><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">→ CockroachDB</a></td>
          <td>E</td>
          <td>Paradigm shift (distributed SQL)</td>
      </tr>
      <tr>
          <td>→ Aurora DSQL (this article)</td>
          <td>E</td>
          <td>Paradigm shift (PG-compatible distributed)</td>
      </tr>
  </tbody>
</table>
<p><strong>Choosing between Aurora DSQL and CockroachDB</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Aurora DSQL</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PG compatibility</td>
          <td>Wire-compatible, more complete</td>
          <td>High, but with differences</td>
      </tr>
      <tr>
          <td>Vendor lock-in</td>
          <td>AWS only</td>
          <td>Multi-cloud / on-prem</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>AWS pricing</td>
          <td>Self-managed or CockroachDB Cloud</td>
      </tr>
      <tr>
          <td>Multi-region model</td>
          <td>Strong consistency built in</td>
          <td>Configurable (regional / global tables)</td>
      </tr>
      <tr>
          <td>Extension</td>
          <td>None</td>
          <td>Partial (CDC / changefeed)</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Zero-touch</td>
          <td>Self-managed or managed</td>
      </tr>
  </tbody>
</table>
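<p>Choose DSQL: already committed to AWS, no appetite for managing infrastructure, need PG semantics.
Choose CRDB: multi-cloud, an SRE team that can self-manage, need fine-grained control.</p>
<h2 id="相關連結">Related Links</h2>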
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">migrate-to-aurora</a>: comparison with Aurora PG (Type C)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a>: comparison with CRDB (Type E)</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/extension-ecosystem/" data-link-title="PostgreSQL Extension Ecosystem：把 PG 變成 vector DB / time-series / sharded 的 plugin 生態" data-link-desc="PG 的 extension 機制不只是 plugin、是 *結構性產品線擴張* — pgvector 讓 PG 變 vector DB、TimescaleDB 變 time-series、Citus 變 sharded、PostGIS 變 GIS。本文走 PG extension lifecycle、6 個 production-critical extension（pg_stat_statements / pg_partman / pg_repack / pgvector / TimescaleDB / PostGIS）、5 production 踩雷（extension version 跟 PG version 對齊 / managed PG 限制 / upgrade order / shared_preload_libraries 衝突 / extension 跟 logical replication 互動）、cloud vendor 對 extension 的限制">extension-ecosystem</a>: extensions that DSQL does not support</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/connection-scaling/" data-link-title="PostgreSQL Connection Scaling：process-per-connection model 跟為什麼 pooler 是必裝" data-link-desc="PG 每個 client connection fork 一個 backend process（不是 thread）、RAM 成本 5-15MB/connection、context switch 跟 fork() cost 在 100&#43; connection 後線性放大、所以 pooler 不是 *optional optimization* 而是 *production prerequisite*。本文走 process-per-connection model 跟 MySQL thread-per-connection 對比、max_connections &#43; shared_buffers &#43; work_mem 三 GUC 互動、application-side pool vs middleware pool vs RDS Proxy 三層選擇、5 production 踩雷（connection storm / fork() cost 在 burst 流量 / shared_buffers 跟 connection 數壓縮 / double-pool 配置錯誤 / max_connections 設太大反而慢）、跟 PgBouncer config 互補不重複">connection-scaling</a>: DSQL's built-in pool compared with PgBouncer</li>
</ul>
<h2 id="下一步">Next Steps</h2>
<ul>
<li>See <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora overview</a> to get to know the Aurora family</li>
<li>See <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/" data-link-title="PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration" data-link-desc="PostgreSQL → CockroachDB 是 Schema / Operational / Paradigm 三維皆 High 的 multi-axis migration、實證 [#127](/report/content-structure-by-max-diff-dimension/) 的「多重歸類跟 tie-breaking」規則；主結構走 Type E paradigm shift、Schema 差 &#43; Operational redesign 抽出獨立段；涵蓋 transaction model 重設計、SQL dialect gap、5 個 production 踩雷">migrate-to-cockroachdb</a> to compare with another Type E migration</li>
<li>Return to <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL overview</a> for the full picture</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL → CockroachDB：三維皆 High 的多重歸類 migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-cockroachdb/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB&lt;/a>。本文是 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 多重歸類跟 tie-breaking&lt;/a> 規則的實證 — 三維皆 High 配對的處理方式不是「選 type A 或 type C 或 type E」、是 &lt;em>主導維度走 Type E、其他高維度獨立加段&lt;/em>。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關。&lt;/p>&lt;/blockquote>
&lt;h2 id="三維皆-high決策矩陣">三維皆 High：決策矩陣&lt;/h2>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> 對 PostgreSQL → CockroachDB：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>評估&lt;/th>
 &lt;th>等級&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema / API&lt;/td>
 &lt;td>PostgreSQL wire protocol 兼容、但 SQL feature set 部分缺（CTE recursive 部分 / window function 部分 / extension 完全缺）&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational model&lt;/td>
 &lt;td>Single-node + Patroni → distributed Raft + 自動 rebalance；HA / backup / topology 全換&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Abstraction / paradigm&lt;/td>
 &lt;td>Single-node MVCC + transaction → distributed Serializable Snapshot Isolation (SSI)&lt;/td>
 &lt;td>&lt;strong>High&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Number of components&lt;/td>
 &lt;td>同 1 個 DB cluster&lt;/td>
 &lt;td>Low&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application change&lt;/td>
 &lt;td>Transaction retry pattern 必須改、ORM 可能需 patch&lt;/td>
 &lt;td>Medium&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>3 維 High + 1 維 Medium。按 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">methodology audit Step 5&lt;/a> 的多重歸類處理規則：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook that cross-links to <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> and <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a>. It is a worked example of the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 multi-classification and tie-breaking</a> rule: a pairing where three dimensions are all High is handled not by "picking Type A or Type C or Type E" but by <em>letting the dominant dimension follow Type E and adding separate sections for the other high dimensions</em>. Each stage transition is guarded by a <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a>.</p></blockquote>
<h2 id="三維皆-high決策矩陣">Three Dimensions All High: Decision Matrix</h2>
<p>Running the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> on PostgreSQL → CockroachDB:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>PostgreSQL wire protocol compatible, but parts of the SQL feature set are missing (recursive CTE partial / window functions partial / extensions entirely missing)</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Single-node + Patroni → distributed Raft + automatic rebalancing; HA / backup / topology all change</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>Single-node MVCC + transaction → distributed Serializable Snapshot Isolation (SSI)</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>Same: 1 DB cluster</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>The transaction retry pattern must change; the ORM may need patching</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
<p>3 dimensions High + 1 dimension Medium. Following the multi-classification rule in <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">methodology audit Step 5</a>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Dominant dimension (priority order): Schema &gt; Paradigm &gt; Operational &gt; Components
</span></span><span class="line"><span class="ln">2</span><span class="cl">
</span></span><span class="line"><span class="ln">3</span><span class="cl">Applied here: Schema High + Paradigm High + Operational High
</span></span><span class="line"><span class="ln">4</span><span class="cl">- Schema is High, but CRDB provides PostgreSQL wire protocol compatibility
</span></span><span class="line"><span class="ln">5</span><span class="cl">- Paradigm is High: the fundamental *single-node → distributed* shift, and what readers care about most
</span></span><span class="line"><span class="ln">6</span><span class="cl">- Operational is High, but is largely downstream of Paradigm
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl">→ Main structure follows Paradigm (Type E); Schema + Operational are pulled out as separate supplementary sections</span></span></code></pre></div><p>No single type label is forced: this article takes a multi-axis form, <em>Type E as the main structure + Type A / C supplements for the other high dimensions</em>.</p>
<h2 id="結構-differentiatortype-e-主結構--多軸增補段">Structural Differentiator: Type E Main Structure + Multi-Axis Supplementary Sections</h2>
<p>Compared against the 5 migration playbooks from the previous batch:</p>
<table>
  <thead>
      <tr>
          <th>Structural element</th>
          <th>Type A Splunk → Elastic</th>
          <th>Type B Redis → DragonflyDB</th>
          <th>Type C PostgreSQL → Aurora</th>
          <th>Type D Datadog → Grafana</th>
          <th>Type E Kafka ↔ NATS</th>
          <th><strong>This article (three dimensions High)</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Phased translation</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>partial</td>
      </tr>
      <tr>
          <td>Compatibility audit</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Operational redesign mapping</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
          <td><strong>yes (separate section)</strong></td>
      </tr>
      <tr>
          <td>Schema gap mapping</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td><strong>yes (separate section)</strong></td>
      </tr>
      <tr>
          <td>Parallel streams</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Paradigm contrast</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Application redesign</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Long-term hybrid architecture</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>yes</td>
          <td>partial (some workloads)</td>
      </tr>
  </tbody>
</table>
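<p>This article is a hybrid of "Type E as the main structure + a Type A schema gap section + a Type C operational redesign section", with 9-10 sections and 260-300 lines.</p>
<h2 id="維度-1paradigm-shift主導">Dimension 1: Paradigm Shift (Dominant)</h2>
<p>CRDB is a <em>distributed SQL DB</em>, not a "multi-node edition of PostgreSQL". Core differences:</p>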
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>PostgreSQL</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transaction isolation</td>
          <td>MVCC, Read Committed by default</td>
          <td>Serializable Snapshot Isolation (SSI), strongly consistent</td>
      </tr>
      <tr>
          <td>Transaction conflict</td>
          <td>Lock + wait under Read Committed (the later writer blocks, then proceeds)</td>
          <td>Retry-on-conflict; the application must handle the <code>40001</code> retry code</td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Streaming replication + standby</td>
          <td>Raft consensus; every write goes through quorum + automatic rebalancing</td>
      </tr>
      <tr>
          <td>Partition</td>
          <td>Declarative partitioning (manual)</td>
          <td>Automatic range-based + locality-aware</td>
      </tr>
      <tr>
          <td>Latency p99</td>
          <td>1-10ms (single region)</td>
          <td>5-50ms (cross-AZ Raft quorum)</td>
      </tr>
      <tr>
          <td>Throughput limit</td>
          <td>Single-primary ceiling ~10-50K TPS</td>
          <td>Scales linearly by adding nodes, ~5K TPS / node</td>
      </tr>
  </tbody>
</table>
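<p>The throughput row implies a simple sizing rule, sketched below under the table's own linear-scaling assumption of roughly 5K TPS per node. The function name <code>estimate_node_count</code> is hypothetical, and the 3-node floor is an assumption reflecting the smallest cluster that can keep a Raft quorum after losing one node; real sizing depends on the workload, replication factor, and hardware.</p>

```python
import math


def estimate_node_count(target_tps, tps_per_node=5000, min_nodes=3):
    """Rough CRDB node count under the linear-scaling assumption in the table."""
    return max(min_nodes, math.ceil(target_tps / tps_per_node))
```

<p>A 50K TPS target, which is near the single-primary ceiling quoted for PostgreSQL, maps to about 10 nodes under this estimate; that is a planning starting point to validate with a load test, not a guarantee.</p>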
<p>The key paradigm change: <em>a transaction is a retryable operation, not one guaranteed to commit on the first attempt</em>. All transaction code needs to be wrapped in a retry loop (CRDB provides the <code>cockroach_restart</code> savepoint).</p>
<h2 id="維度-2schema-gappostgresql-features-crdb-不支援">Dimension 2: Schema Gap (PostgreSQL Features CRDB Does Not Support)</h2>
<p>CRDB advertises PostgreSQL compatibility, but the <em>coverage rate is 80-90%</em>; common gaps:</p>
<table>
  <thead>
      <tr>
          <th>PostgreSQL feature</th>
          <th>CRDB status</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stored procedure / function (PL/pgSQL)</td>
          <td>Limited (partial support in CRDB 22.2+)</td>
          <td>Must be audited and rewritten within the migration scope</td>
      </tr>
      <tr>
          <td>Common Table Expression (CTE) recursive</td>
          <td>Limited (depth + structure)</td>
          <td>Complex CTEs may not run; query refactoring required</td>
      </tr>
      <tr>
          <td>Window function 全集</td>
          <td>Partial</td>
          <td>Reporting queries need case-by-case verification</td>
      </tr>
      <tr>
          <td>Extensions (pg_repack / pgaudit / TimescaleDB)</td>
          <td><strong>Not supported</strong></td>
          <td>Use CRDB's own alternative or handle it in the application layer</td>
      </tr>
      <tr>
          <td>Triggers</td>
          <td>Limited</td>
          <td>Move audit / data integrity logic to the application layer</td>
      </tr>
      <tr>
          <td>Custom types / domain</td>
          <td>Partial</td>
          <td>Use CHECK constraints instead</td>
      </tr>
      <tr>
          <td>Geographic types (PostGIS)</td>
          <td>CRDB native geo support (different syntax)</td>
          <td>Rewrite spatial queries</td>
      </tr>
      <tr>
          <td><code>SELECT FOR UPDATE</code> semantics</td>
          <td>Equivalent, but the underlying mechanism differs (distributed lock)</td>
          <td>Watch for different deadlock patterns</td>
      </tr>
      <tr>
          <td>Advisory locks</td>
          <td><strong>Not supported</strong></td>
          <td>Use another distributed lock on the application side (Redis / Consul)</td>
      </tr>
  </tbody>
</table>
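<p>The migration must <em>first audit the full set of SQL features in use</em>, list the gaps, and decide for each gap whether to work around it or retire the feature.</p>
<h2 id="維度-3operational-redesign">Dimension 3: Operational Redesign</h2>
<p>The CRDB operational model is completely different:</p>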
<table>
  <thead>
      <tr>
          <th>Operational concept</th>
          <th>PostgreSQL self-managed</th>
          <th>CRDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster bootstrap</td>
          <td>Patroni / Stolon + manual</td>
          <td><code>cockroach init</code> + automatic Raft formation</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>Patroni + DCS + watchdog</td>
          <td>Built-in Raft, no single primary</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Patroni-managed, 15-60s</td>
          <td>Transparent Raft re-election, &lt; 5s</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>pgBackRest + WAL archive</td>
          <td><code>BACKUP TO</code> (incremental + full)</td>
      </tr>
      <tr>
          <td>Restore</td>
          <td><code>pgBackRest restore</code> + PITR</td>
          <td><code>RESTORE FROM</code></td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Streaming + logical</td>
          <td>Built-in; no equivalent of logical replication</td>
      </tr>
      <tr>
          <td>Schema migration</td>
          <td><code>pg_dump</code> / Flyway / Liquibase</td>
          <td><code>cockroach sql</code> + online schema change (lock-free)</td>
      </tr>
      <tr>
          <td>Monitoring</td>
          <td>pg_stat_* views + Prometheus exporter</td>
          <td>CRDB admin UI + Prometheus (different schema)</td>
      </tr>
      <tr>
          <td>Sizing</td>
          <td>Vertical scaling (one large node)</td>
          <td>Horizontal scaling (many small nodes)</td>
      </tr>
  </tbody>
</table>
<p>The SRE mental model must be retrained from scratch: <em>no primary / no streaming lag / no standby promotion</em>.</p>
<h2 id="migration-流程混合形態">Migration Process (Hybrid Form)</h2>
<p>This is not a linear phased plan but a hybrid of <em>phased + parallel + partial</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Phase 0: scoping
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  - List applications; separate "fits CRDB" from "stays on PostgreSQL"
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  - SQL feature audit
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  - Application transaction pattern audit
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Phase 1: schema port + application rewrite
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  - Convert DDL to CRDB syntax
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  - Find alternatives for unsupported extensions
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  - Add retry loops to application transaction code
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl">Phase 2: dual-write period (some applications start using CRDB)
</span></span><span class="line"><span class="ln">12</span><span class="cl">  - New applications use CRDB
</span></span><span class="line"><span class="ln">13</span><span class="cl">  - Old applications stay on PostgreSQL
</span></span><span class="line"><span class="ln">14</span><span class="cl">  - CDC bridge (Debezium → Kafka → CRDB consumer)
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl">Phase 3: cut over the applications that fit
</span></span><span class="line"><span class="ln">17</span><span class="cl">  - Each application cuts over independently
</span></span><span class="line"><span class="ln">18</span><span class="cl">  - Not "the whole DB in one switch"
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">Phase 4: long-term hybrid architecture
</span></span><span class="line"><span class="ln">21</span><span class="cl">  - Some workloads stay on PostgreSQL permanently (not suited to a distributed database)
</span></span><span class="line"><span class="ln">22</span><span class="cl">  - CRDB runs the workloads that suit a distributed database</span></span></code></pre></div><p>Overall 3-6 months, and it does not converge to all-CRDB.</p>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1transaction-retry-沒處理application-大量-40001-error">Case 1: transaction retries not handled, the application floods with <code>40001</code> errors</h3>
<p><strong>Symptom</strong>: after cutover, 5-10% of application transactions fail with <code>restart transaction: TransactionRetryWithProtoRefreshError</code>, and the business operation fails with them.</p>
<p><strong>Root cause</strong>: PostgreSQL's default Read Committed isolation does not require the application to handle conflicts, whereas CRDB's Serializable isolation requires <em>retry-on-conflict</em>; the application code has no retry loop.</p>
<p><strong>Fix</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">// CRDB transaction with retry</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">for</span> <span class="nx">retries</span> <span class="o">:=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">retries</span> <span class="p">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="nx">retries</span><span class="o">++</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="nx">tx</span><span class="p">,</span> <span class="nx">_</span> <span class="o">:=</span> <span class="nx">db</span><span class="p">.</span><span class="nf">Begin</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="c1">// ... transaction logic ...</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="nx">err</span> <span class="o">:=</span> <span class="nx">tx</span><span class="p">.</span><span class="nf">Commit</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="k">if</span> <span class="nx">err</span> <span class="o">!=</span> <span class="kc">nil</span> <span class="o">&amp;&amp;</span> <span class="nx">strings</span><span class="p">.</span><span class="nf">Contains</span><span class="p">(</span><span class="nx">err</span><span class="p">.</span><span class="nf">Error</span><span class="p">(),</span> <span class="s">&#34;40001&#34;</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        <span class="nx">time</span><span class="p">.</span><span class="nf">Sleep</span><span class="p">(</span><span class="nf">backoff</span><span class="p">(</span><span class="nx">retries</span><span class="p">))</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        <span class="k">continue</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="k">break</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>At the framework level, CRDB-provided client libraries (cockroach-go / crdb-jdbc) ship a retry helper that wraps this loop.</p>
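<p>The loop above calls a <code>backoff</code> helper it never defines and detects the error by string matching. A minimal sketch of both pieces follows. It assumes the driver's error exposes the SQLSTATE through a <code>SQLState()</code> method; <code>sqlStateError</code>, <code>fakeDBError</code>, <code>noSleep</code> and <code>runWithRetry</code> are hypothetical names introduced for illustration, and the 10ms base / 1s cap are arbitrary choices.</p>

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sqlStateError is a hypothetical interface for driver errors that expose a
// SQLSTATE code; check your driver's error type before relying on it.
type sqlStateError interface {
	error
	SQLState() string
}

// fakeDBError stands in for a driver error so the sketch runs without a database.
type fakeDBError struct{ code string }

func (e fakeDBError) Error() string    { return "SQLSTATE " + e.code }
func (e fakeDBError) SQLState() string { return e.code }

// isRetryable reports whether err carries SQLSTATE 40001 (serialization_failure).
func isRetryable(err error) bool {
	var se sqlStateError
	return errors.As(err, &se) && se.SQLState() == "40001"
}

// backoff returns base * 2^attempt, capped at maxDelay. Jitter is left out to
// keep the sketch deterministic; production code usually adds it.
func backoff(attempt int) time.Duration {
	const base = 10 * time.Millisecond
	const maxDelay = time.Second
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 7 { // 10ms << 7 = 1.28s, already past the cap
		return maxDelay
	}
	if d := base << uint(attempt); d < maxDelay {
		return d
	}
	return maxDelay
}

// noSleep lets callers skip real waiting (useful in tests).
var noSleep = func(time.Duration) {}

// runWithRetry calls fn until it succeeds, fails with a non-retryable error,
// or maxAttempts is exhausted. fn must run the whole transaction from BEGIN.
func runWithRetry(maxAttempts int, sleep func(time.Duration), fn func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil || !isRetryable(err) {
			return err
		}
		sleep(backoff(attempt))
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := runWithRetry(10, time.Sleep, func() error {
		calls++
		if calls < 3 {
			return fakeDBError{code: "40001"} // simulated serialization conflict
		}
		return nil
	})
	fmt.Println("attempts:", calls, "error:", err)
}
```

<p>Unlike the inline loop, this version returns non-retryable errors to the caller and reports exhaustion instead of silently falling through.</p>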
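<h3 id="case-2extension-缺位application-feature-整段掉">Case 2: a missing extension takes down an entire application feature</h3>
<p><strong>Symptom</strong>: after cutover, a geospatial feature in the application errors out immediately because a PostGIS function does not exist; the migration plan overlooked it.</p>
<p><strong>Root cause</strong>: CRDB's native geospatial support differs in syntax / API, so a PostGIS extension cannot be carried over as-is.</p>
<p><strong>Fix</strong>:</p>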
<ol>
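<li><strong>Always run an extension audit before migrating</strong>: list everything in <code>pg_extension</code>, then map each extension to a CRDB feature or retire it</li>
<li><strong>PostGIS replacement</strong>: CRDB's native ST_* functions; part of the syntax lines up, but spatial indexes differ</li>
<li><strong>Features that cannot be replaced</strong>: consider keeping them on PostgreSQL (hybrid architecture)</li>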
</ol>
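<p>The audit step can be sketched as a small classification pass over the output of <code>SELECT extname FROM pg_extension</code>. This is an illustration only: <code>auditExtensions</code> and the <code>plan</code> map are hypothetical, and the map's contents come from the migration team rather than from any real CockroachDB compatibility list.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// extensionQuery lists installed extensions on the PostgreSQL side.
const extensionQuery = `SELECT extname FROM pg_extension ORDER BY extname`

// auditExtensions splits installed extensions into those with a documented
// plan (replace, rewrite, retire) and those still unresolved. The plan map is
// supplied by the caller; nothing here encodes real compatibility data.
func auditExtensions(installed []string, plan map[string]string) (map[string]string, []string) {
	resolved := make(map[string]string)
	var unresolved []string
	for _, ext := range installed {
		if target, ok := plan[ext]; ok {
			resolved[ext] = target
		} else {
			unresolved = append(unresolved, ext)
		}
	}
	sort.Strings(unresolved)
	return resolved, unresolved
}

func main() {
	// In a real audit this slice would come from running extensionQuery.
	installed := []string{"plpgsql", "postgis", "pg_trgm"}
	plan := map[string]string{
		"postgis": "port to native ST_* functions, verify each call site",
	}
	resolved, unresolved := auditExtensions(installed, plan)
	fmt.Println("query:", extensionQuery)
	fmt.Println("resolved:", resolved)
	fmt.Println("blocks cutover until decided:", unresolved)
}
```

<p>Treating a non-empty unresolved list as a cutover blocker is one way to keep an overlooked extension from surfacing in production.</p>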
<h3 id="case-3sequential-pk-撞-raft-quorum-瓶頸">Case 3: sequential primary keys hit a Raft quorum bottleneck</h3>
<p><strong>Symptom</strong>: after cutover, write throughput and latency fall short of expectations; CRDB cluster CPU stays &lt; 30% yet write p99 latency is high.</p>
<p><strong>Root cause</strong>: the application uses sequential primary keys (<code>SERIAL</code> / identity columns); CRDB stores adjacent keys in the <em>same range</em> and therefore the same Raft group, so writes serialize and cannot scale out in parallel.</p>
<p><strong>Fix</strong>:</p>
<ol>
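<li><strong>Switch to random UUIDs (<code>gen_random_uuid()</code>) or a hash-sharded primary key</strong>: writes spread across ranges; time-ordered IDs such as UUID v7 or <code>unique_rowid()</code> keep insertion order but still cluster new writes unless the index is hash-sharded</li>
<li><strong><code>PRIMARY KEY (region, id)</code></strong>: in multi-region or multi-tenant setups the leading column splits the keyspace naturally</li>
<li><strong>Leave unsuitable workloads on PostgreSQL</strong>: not every schema suits a distributed database</li>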
</ol>
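<p>Why sequential keys concentrate on one range can be shown with a toy model. The sketch below splits a 64-bit keyspace into equal-width ranges and compares sequential IDs with hashed ones. It is not how CockroachDB actually splits ranges (real ranges split by size and load); <code>rangeOf</code>, <code>hashedKey</code>, <code>distinctRanges</code> and <code>buildKeys</code> are hypothetical helpers.</p>

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// rangeOf maps a 64-bit key to one of n equal-width key ranges (toy model).
func rangeOf(key uint64, n int) int {
	if n <= 1 {
		return 0
	}
	width := ^uint64(0)/uint64(n) + 1
	return int(key / width)
}

// hashedKey scrambles an id so consecutive ids no longer sort next to each other.
func hashedKey(id uint64) uint64 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], id)
	sum := sha256.Sum256(buf[:])
	return binary.BigEndian.Uint64(sum[:8])
}

// distinctRanges counts how many ranges receive at least one write.
func distinctRanges(keys []uint64, n int) int {
	seen := make(map[int]bool)
	for _, k := range keys {
		seen[rangeOf(k, n)] = true
	}
	return len(seen)
}

// buildKeys returns ids 1..count as-is and in hashed form.
func buildKeys(count int) (sequential, hashed []uint64) {
	for id := uint64(1); id <= uint64(count); id++ {
		sequential = append(sequential, id)
		hashed = append(hashed, hashedKey(id))
	}
	return sequential, hashed
}

func main() {
	const ranges = 8
	sequential, hashed := buildKeys(1000)
	fmt.Println("ranges hit by sequential ids:", distinctRanges(sequential, ranges))
	fmt.Println("ranges hit by hashed ids:    ", distinctRanges(hashed, ranges))
}
```

<p>In this model every sequential insert lands in a single range, while hashed keys reach several, which is the effect a random or hash-sharded key aims for.</p>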
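<h3 id="case-4long-transaction-對-raft-衝擊">Case 4: long transactions strain Raft</h3>
<p><strong>Symptom</strong>: transactions that run longer than a minute (batch processing / large ETL) retry heavily and eventually fail; during the same window the retry rate of other, short transactions rises as well.</p>
<p><strong>Root cause</strong>: a long CRDB transaction holds write intents on every range it touches and blocks other transactions; the probability of an SSI conflict grows roughly with the square of the transaction's duration.</p>
<p><strong>Fix</strong>:</p>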
<ol>
<li><strong>Split long transactions</strong>: run a batch as many short transactions and keep the checkpoint in the application layer</li>
<li><strong>Keep heavy ETL off CRDB</strong>: export through CRDB CDC to an OLAP system (Snowflake / BigQuery) and run the batch there</li>
<li><strong>Use follower reads for long read-only transactions</strong>: <code>AS OF SYSTEM TIME</code> reads a historical snapshot and does not contend with in-flight writes, which suits reporting</li>
</ol>
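<p>Splitting a long transaction into short ones with an application-level checkpoint might look like the sketch below. The <code>commit</code> callback stands for one short transaction (BEGIN, write the batch, COMMIT); <code>processInBatches</code> and <code>errSimulated</code> are hypothetical names, and the batch size of 1000 is arbitrary.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated is used to demonstrate a failed batch.
var errSimulated = errors.New("simulated commit failure")

// processInBatches splits ids into chunks of batchSize and calls commit once
// per chunk; each call stands for one short transaction. It returns the number
// of ids durably processed (the checkpoint), so a rerun can resume from there.
func processInBatches(ids []int, batchSize int, commit func(batch []int) error) (int, error) {
	if batchSize <= 0 {
		return 0, errors.New("batchSize must be positive")
	}
	done := 0
	for done < len(ids) {
		end := done + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		if err := commit(ids[done:end]); err != nil {
			return done, fmt.Errorf("batch starting at offset %d: %w", done, err)
		}
		done = end // advance the checkpoint only after a successful commit
	}
	return done, nil
}

func main() {
	ids := make([]int, 2500)
	for i := range ids {
		ids[i] = i + 1
	}
	txCount := 0
	checkpoint, err := processInBatches(ids, 1000, func(batch []int) error {
		txCount++ // a real commit would open and commit a transaction here
		return nil
	})
	fmt.Println("transactions:", txCount, "checkpoint:", checkpoint, "error:", err)
}
```

<p>Each batch holds intents only for its own rows and its own short lifetime, and a failure leaves a checkpoint to resume from rather than a full rollback.</p>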
<h3 id="case-5backup--restore-行為跟-postgresql-不同sre-runbook-失效">Case 5: backup / restore behaves differently from PostgreSQL, so SRE runbooks stop working</h3>
<p><strong>Symptom</strong>: a DBA tries <code>pg_restore</code> and it fails because the CRDB backup format is completely different; incident response stalls for 1-2 hours.</p>
<p><strong>Root cause</strong>: CRDB backups use a <em>cluster-internal format</em> that PostgreSQL tooling cannot read; the SRE runbook still assumes the PostgreSQL world, so the mental model is wrong exactly when it matters.</p>
<p><strong>Fix</strong>:</p>
<ol>
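<li><strong>Rewrite the runbook</strong>: CRDB-specific backup / restore procedures, plus SRE training</li>
<li><strong>DR drill</strong>: run a full DR drill before cutover, completed with CRDB tooling and without leaning on PostgreSQL experience</li>
<li><strong>Multi-region backup</strong>: configure CRDB backups across regions so a single-region failure does not take the backups with it</li>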
</ol>
<h2 id="capacity-規劃">Capacity planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>PostgreSQL self-managed</th>
          <th>CockroachDB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-node ceiling</td>
          <td>~10-50K TPS (vertical scale to 32-128 vCPU)</td>
          <td>~5K TPS / node (scale horizontally by adding nodes)</td>
      </tr>
      <tr>
          <td>Cross-region</td>
          <td>High-latency cross-region streaming</td>
          <td>Native by design; locality-aware queries</td>
      </tr>
      <tr>
          <td>Sharding</td>
          <td>Manual partitioning / pg_partman</td>
          <td>Automatic, range-based</td>
      </tr>
      <tr>
          <td>Storage / TPS ratio</td>
          <td>Unchanged</td>
          <td>3x storage across nodes (Raft quorum, 3 replicas by default)</td>
      </tr>
      <tr>
          <td>Total cost (10TB)</td>
          <td>$2-4K USD / month (self-managed)</td>
          <td>$5-10K USD / month (CRDB Cloud + 3x storage)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading</strong>: CRDB costs significantly more, so choosing it has to be driven by a <em>paradigm requirement</em> (distributed transactions / multi-region / linear scale); if the goal is only lower cost or better availability, <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">Aurora</a> is the better deal.</p>
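<p>The table's figures translate into a rough sizing calculation. The sketch below uses the ~5K TPS per node and 3-replica default from the table and assumes throughput scales linearly with node count, which real workloads only approximate; <code>nodesForTPS</code> and <code>rawStorageTB</code> are hypothetical helpers.</p>

```go
package main

import "fmt"

// nodesForTPS returns how many nodes are needed for targetTPS given a per-node
// throughput, never fewer than minNodes (3 for a default 3-replica quorum).
// It assumes linear scaling, which is an idealization.
func nodesForTPS(targetTPS, perNodeTPS, minNodes int) int {
	if perNodeTPS <= 0 {
		return minNodes
	}
	n := (targetTPS + perNodeTPS - 1) / perNodeTPS // ceiling division
	if n < minNodes {
		return minNodes
	}
	return n
}

// rawStorageTB returns the physical storage consumed by logicalTB of data at
// the given replication factor.
func rawStorageTB(logicalTB, replicationFactor int) int {
	return logicalTB * replicationFactor
}

func main() {
	fmt.Println("nodes for 30K TPS:", nodesForTPS(30000, 5000, 3)) // 6
	fmt.Println("raw storage for 10TB:", rawStorageTB(10, 3), "TB") // 30 TB
}
```

<p>A workload that a single 32-128 vCPU PostgreSQL node can absorb therefore maps to several CRDB nodes plus three copies of the data, which is where most of the cost gap in the last row comes from.</p>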
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-postgresql--aurora-migration-對比">Comparison with <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a></h3>
<p>Two exit paths from self-managed PostgreSQL:</p>
<ul>
<li><strong>Aurora</strong>: operational simplification, protocol drop-in, moderate cost increase; suits production systems that <em>do not need distributed transactions</em></li>
<li><strong>CRDB</strong>: distributed paradigm shift, the application must change, significant cost increase; suits workloads that <em>genuinely need distribution</em></li>
</ul>
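<p>Most applications do not need distributed transactions, so Aurora is the more reasonable choice; go to CRDB only when cross-region strong consistency or linear scale by adding nodes is a real requirement.</p>
<h3 id="跟-application-transaction-pattern-重設計">Redesigning application transaction patterns</h3>
<p>CRDB forces the application to change its transaction code, and a retry loop is mandatory. Shifting the team's mental model is the main migration effort; the purely technical part is comparatively small.</p>
<h3 id="下一步議題">Topics for follow-up</h3>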
<ul>
<li><strong>CRDB → PostgreSQL reverse migration</strong>: once the business simplifies and distribution is no longer needed, migrating back is expensive; in practice CRDB is a <em>one-way lock-in</em></li>
<li><strong>CRDB Serverless</strong>: low entry cost and a good fit for bursty workloads; steady workloads still belong on a dedicated cluster</li>
<li><strong>Multi-region active-active</strong>: CRDB's real strength, but network cost balloons; the ROI only works out for customers such as finance and government</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source / target vendor：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> / <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a></li>
<li>Counterpart migration: <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> (the other exit path from PostgreSQL)</li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content structure is determined by the largest diff dimension</a> (this article validates <em>multi-axis handling of multiple classifications</em>)</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Partition Redesign：當 monthly partition 越跑越慢</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。對應 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F「Topology re-layout」&lt;/a> 第 2 個 dogfood（第 1 個是 &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding&lt;/a>）— 驗證 Type F anatomy 在不同 vendor 上的通用性。&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼-monthly-partition-越跑越慢">Why monthly partitions keep getting slower&lt;/h2>
&lt;p>At launch the monthly range partition design was perfectly reasonable: one partition per month, twelve per year, and with partition pruning a query such as &lt;code>WHERE event_time &amp;gt;= '2026-05-01'&lt;/code> touched a single partition and ran fast. After 18 months in production:&lt;/p>
&lt;ul>
&lt;li>Monthly partition size grew from 50GB to 500GB (10x traffic)&lt;/li>
&lt;li>A within-month query such as &lt;code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'&lt;/code> still targets the entire 500GB monthly partition (pruning granularity stops at the month)&lt;/li>
&lt;li>Vacuuming one monthly partition takes 6-8 hours and no longer fits the maintenance window&lt;/li>
&lt;li>Dropping old partitions to reclaim storage happens on a monthly cadence, but the retention policy requires daily granularity&lt;/li>
&lt;/ul>
&lt;p>The partition design needs a &lt;em>redesign&lt;/em>, not an "optimization": moving from monthly to daily range partitions takes the partition count from 36 (3-year retention) to 1095.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. It is the second dogfood of <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F "Topology re-layout"</a> (the first was <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding</a>) and checks whether the Type F anatomy generalizes across vendors.</p></blockquote>
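<h2 id="為什麼-monthly-partition-越跑越慢">Why monthly partitions keep getting slower</h2>
<p>At launch the monthly range partition design was perfectly reasonable: one partition per month, twelve per year, and with partition pruning a query such as <code>WHERE event_time &gt;= '2026-05-01'</code> touched a single partition and ran fast. After 18 months in production:</p>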
<ul>
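<li>Monthly partition size grew from 50GB to 500GB (10x traffic)</li>
<li>A within-month query such as <code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'</code> still targets the entire 500GB monthly partition (pruning granularity stops at the month)</li>
<li>Vacuuming one monthly partition takes 6-8 hours and no longer fits the maintenance window</li>
<li>Dropping old partitions to reclaim storage happens on a monthly cadence, but the retention policy requires daily granularity</li>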
</ul>
<p>The partition design needs a <em>redesign</em>, not an "optimization": moving from monthly to daily range partitions takes the partition count from 36 (3-year retention) to 1095.</p>
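<p>The effect of pruning granularity can be put into numbers. The sketch below estimates how much data a 15-day query has to consider under each layout, assuming the query is aligned to partition boundaries, every touched partition is read in full, and a 500GB month is spread evenly over 30 days; <code>scannedGB</code> is a hypothetical helper.</p>

```go
package main

import "fmt"

// scannedGB estimates the data volume behind a query that covers queryDays,
// when each partition spans partitionDays and holds partitionGB. It assumes
// the query starts on a partition boundary and touched partitions are read
// in full.
func scannedGB(queryDays, partitionDays int, partitionGB float64) float64 {
	partitions := (queryDays + partitionDays - 1) / partitionDays // ceiling division
	return float64(partitions) * partitionGB
}

func main() {
	const monthGB = 500.0
	monthly := scannedGB(15, 30, monthGB)  // 1 monthly partition
	daily := scannedGB(15, 1, monthGB/30)  // 15 daily partitions
	fmt.Printf("monthly layout: %.0f GB\n", monthly)
	fmt.Printf("daily layout:   %.0f GB\n", daily)
}
```

<p>Under these assumptions the daily layout halves the volume for a 15-day query, and the saving grows as the queried window shrinks toward a single day.</p>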
<p><a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">Diff dimension audit</a> results:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same PostgreSQL, same table definition, partition key unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Same PostgreSQL operational stack</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Same OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Same single DB</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>None (partition pruning is transparent)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td><strong>Data topology</strong></td>
          <td><strong>Partition strategy changes from monthly to daily</strong></td>
          <td><strong>High</strong></td>
      </tr>
  </tbody>
</table>
<p>Five of the six dimensions are Low and data topology is High, which makes this <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">Type F "Topology re-layout"</a>.</p>
<h2 id="pre-layout-analysispartition-不平衡偵測">Pre-layout analysis: detecting partition imbalance</h2>
<p>Before executing the redesign, quantify the current topology:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. Size and estimated row count per partition
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">partition_name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">size</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">reltuples</span><span class="p">::</span><span class="nb">bigint</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">estimated_rows</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="n">pg_stat_get_last_vacuum_time</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">last_vacuum</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_inherits</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">parent</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhparent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">child</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhrelid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;events&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 2. Partition pruning effectiveness
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2026-05-15&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="c1">-- Expected: about 15 small partitions scanned (target: daily) vs. 1 whole partition (current: monthly)
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1">-- Observed: under the monthly design the planner still targets the full monthly partition (~500GB) even though the query spans only 15 days
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="c1">-- 3. Find partition imbalance
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">  </span><span class="n">to_char</span><span class="p">(</span><span class="n">event_time</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY-MM&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">row_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="c1">-- Find hot / cold months to predict the distribution after the redesign</span></span></span></code></pre></div><p>Output of the pre-layout phase:</p>
<ul>
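<li><strong>Current topology, quantified</strong>: 36 monthly partitions, 1.8TB total, largest partition 500GB, smallest 50GB</li>
<li><strong>Hot key distribution</strong>: 80% of traffic hits the most recent 3 months</li>
<li><strong>Redesign target</strong>: daily partitions for the most recent 3 months (hot), weekly partitions for data between 3 months and 1 year old (cold), monthly partitions beyond 1 year (sub-partition strategy)</li>
<li><strong>Migration scope</strong>: the 1095 partitions are not all created up front; they are built in stages that follow the retention policy</li>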
</ul>
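<p>The tiered target above needs far fewer partitions than a uniform daily layout. The sketch below counts them, assuming 90 days for "3 months", 365 days for "1 year", 7-day weeks and 30-day months as simplifications; <code>tieredPartitionCount</code> and <code>ceilDiv</code> are hypothetical helpers.</p>

```go
package main

import "fmt"

// ceilDiv returns a/b rounded up, for non-negative a and positive b.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// tieredPartitionCount returns the partitions needed for retentionDays of
// data when the newest dailyDays are split by day, data up to weeklyUntilDay
// old by 7-day week, and everything older by 30-day "month".
func tieredPartitionCount(retentionDays, dailyDays, weeklyUntilDay int) int {
	if dailyDays > retentionDays {
		dailyDays = retentionDays
	}
	if weeklyUntilDay > retentionDays {
		weeklyUntilDay = retentionDays
	}
	if weeklyUntilDay < dailyDays {
		weeklyUntilDay = dailyDays
	}
	daily := dailyDays
	weekly := ceilDiv(weeklyUntilDay-dailyDays, 7)
	monthly := ceilDiv(retentionDays-weeklyUntilDay, 30)
	return daily + weekly + monthly
}

func main() {
	const retention = 3 * 365 // 1095 days
	fmt.Println("uniform daily:", tieredPartitionCount(retention, retention, retention))
	fmt.Println("tiered:       ", tieredPartitionCount(retention, 90, 365))
}
```

<p>With these assumptions the tiered layout comes to 90 daily + 40 weekly + 25 monthly partitions, which keeps planning overhead well below the 1095-partition uniform case.</p>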
<h2 id="re-layout-機制attach--detach-線上重劃">Re-layout mechanism: online re-partitioning with ATTACH / DETACH</h2>
<p>PostgreSQL cannot change the partition strategy of an existing table in place; the path is <em>a new partition tree plus data migration</em>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. Create the new daily-partitioned table (parallel to events)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">event_time</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
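</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="c1">-- 2. Pre-create daily partitions for the next 90 days (this SELECT only emits the DDL; execute the result, e.g. with psql's \gexec)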
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="n">format</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="s1">&#39;CREATE TABLE events_daily_%s PARTITION OF events_daily FOR VALUES FROM (%L) TO (%L)&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">to_char</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY_MM_DD&#39;</span><span class="p">),</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">generate_series</span><span class="p">(</span><span class="k">current_date</span><span class="p">,</span><span class="w"> </span><span class="k">current_date</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;90 days&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">d</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- 3. dual-write phase: every write goes to both events and events_daily
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1">-- (via a trigger, as here, or application-side)
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">()</span><span class="w"> </span><span class="k">RETURNS</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="err">$$</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">BEGIN</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="k">NEW</span><span class="p">.</span><span class="o">*</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">  </span><span class="k">RETURN</span><span class="w"> </span><span class="k">NEW</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"></span><span class="k">END</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"></span><span class="err">$$</span><span class="w"> </span><span class="k">LANGUAGE</span><span class="w"> </span><span class="n">plpgsql</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">AFTER</span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="k">FOR</span><span class="w"> </span><span class="k">EACH</span><span class="w"> </span><span class="k">ROW</span><span class="w"> </span><span class="k">EXECUTE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="w"></span><span class="c1">-- 4. backfill historical data per partition
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-05-02&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w"></span><span class="c1">-- ... repeat one day partition at a time to avoid a long transaction
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w"></span><span class="c1">-- 5. cutover: rename swap
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="w"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="w"></span><span class="k">COMMIT</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="w"></span><span class="c1">-- 6. Observe for 1-2 weeks, then DROP events_old</span></span></span></code></pre></div><p>Key point: the rename swap runs in a <em>single transaction</em>, so the cutover is atomic from the application's point of view. Application connections do not need to reconnect, but the prepared statement cache may need to be refreshed.</p>
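<p>One practical caveat for the swap: <code>ALTER TABLE ... RENAME</code> takes an <code>ACCESS EXCLUSIVE</code> lock, so it queues behind any long-running query on <code>events</code> and blocks new queries while it waits. A minimal sketch of the same cutover with a lock timeout follows; the 5 second budget is an illustrative assumption, not a value from the original runbook.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;
-- Assumed budget: give up rather than queue behind long-running queries
SET LOCAL lock_timeout = &#39;5s&#39;;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_daily RENAME TO events;
DROP TRIGGER events_dual_write ON events_old;
COMMIT;</code></pre></div>
<p>If the timeout fires, the transaction aborts and nothing is renamed, so the swap can simply be retried later.</p>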
<h2 id="execution-flow-per-step">Execution flow per-step</h2>
<p>Five steps, each with its own rollback boundary:</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Action</th>
          <th>Rollback boundary</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 Pre-create partitions</td>
          <td>Create events_daily plus 90 days of partitions; no impact on production</td>
          <td>DROP events_daily; no impact</td>
      </tr>
      <tr>
          <td>2 Dual-write</td>
          <td>Add the trigger so writes land on both sides; observe the diff</td>
          <td>DROP the trigger; events_daily is left behind for cleanup</td>
      </tr>
      <tr>
          <td>3 Backfill</td>
          <td>Backfill historical data day by day; use CHECK constraints to guard integrity</td>
          <td>DROP the backfilled partitions; the source events table is unaffected</td>
      </tr>
      <tr>
          <td>4 Verify</td>
          <td>Run sample queries against events and events_daily and confirm the row counts match</td>
          <td>Still in dual-write; the cutover can be paused if a diff is found</td>
      </tr>
      <tr>
          <td>5 Cutover</td>
          <td>Rename swap</td>
          <td><strong>Not cleanly reversible</strong>: rolling back needs a reverse rename plus restarting dual-write, and rows written after the swap exist only in the new table</td>
      </tr>
  </tbody>
</table>
<p>Step 5 is the boundary past which rollback stops being cheap: schedule it in a <em>low-traffic maintenance window</em>, and take a backup checkpoint before the cutover.</p>
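<p>The Step 4 row-count check can be sketched as a per-day comparison. This is a minimal sketch that assumes both tables are still under their pre-cutover names and that comparing daily counts is a sufficient first signal; it does not compare row contents.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Lists only the days where the two tables disagree (or one side has no rows)
SELECT day,
       o.row_count AS events_rows,
       n.row_count AS events_daily_rows
FROM (
  SELECT date_trunc(&#39;day&#39;, event_time) AS day, count(*) AS row_count
  FROM events
  GROUP BY 1
) AS o
FULL JOIN (
  SELECT date_trunc(&#39;day&#39;, event_time) AS day, count(*) AS row_count
  FROM events_daily
  GROUP BY 1
) AS n USING (day)
WHERE o.row_count IS DISTINCT FROM n.row_count
ORDER BY day;</code></pre></div>
<p>An empty result means the counts match for every day present in either table. On a large table this scan is expensive, so it is better restricted to a date range or run on a replica.</p>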
<h2 id="production-故障演練">Production failure drills</h2>
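<h3 id="case-1backfill-期間-long-transaction-阻塞-vacuum">Case 1: A long backfill transaction blocks vacuum</h3>
<p><strong>Symptom</strong>: a backfill runs <code>INSERT INTO events_daily SELECT * FROM events WHERE ...</code> for 6 hours. During that time autovacuum on events makes no progress, dead tuples accumulate, and production queries slow down.</p>
<p><strong>Root cause</strong>: an open PostgreSQL transaction pins the <em>xmin horizon</em>. Vacuum can only reclaim dead tuples that no active transaction could still see, so a long backfill, being a long open transaction, makes vacuum ineffective for as long as it runs.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Split the INSERT into batches</strong>: break each day's backfill into small batches (about 100,000 rows per transaction); every commit releases the xmin</li>
<li><strong>Use COPY instead of INSERT</strong>: <code>COPY ... FROM</code> does not accept a subquery, so export with <code>COPY (SELECT * FROM events WHERE ...) TO STDOUT</code> and pipe the result into <code>COPY events_daily FROM STDIN</code>. COPY is usually the fastest bulk path in PostgreSQL, but each COPY is still one transaction, so keep the slices bounded</li>
<li><strong>Read from a standby</strong>: pull the backfill data from a standby so the long read does not run on the primary. Logical decoding on a standby requires PostgreSQL 16+, and with <code>hot_standby_feedback = on</code> a long standby query still holds back vacuum on the primary</li>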
</ol>
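<p>The batching fix can be sketched as a procedure that commits after every slice. This is a minimal sketch under stated assumptions: the procedure name <code>backfill_events_day</code> and the hourly slice size are hypothetical, it needs PostgreSQL 11+ (transaction control inside procedures), and it assumes the day being backfilled predates the dual-write trigger so rows are not copied twice.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical helper: backfill one day in hourly slices, one transaction per slice
CREATE OR REPLACE PROCEDURE backfill_events_day(p_day date)
LANGUAGE plpgsql
AS $$
DECLARE
  slice_start timestamptz := p_day;
  day_end     timestamptz := p_day + 1;
BEGIN
  WHILE slice_start &lt; day_end LOOP
    INSERT INTO events_daily
    SELECT * FROM events
    WHERE event_time &gt;= slice_start
      AND event_time &lt;  slice_start + interval &#39;1 hour&#39;;
    COMMIT;  -- ends the transaction so this session stops pinning the xmin horizon
    slice_start := slice_start + interval &#39;1 hour&#39;;
  END LOOP;
END;
$$;

-- Must be called outside an explicit transaction block, otherwise COMMIT fails
CALL backfill_events_day(DATE &#39;2026-05-01&#39;);</code></pre></div>
<p>Slicing by time rather than by row count keeps each batch aligned with the partition key; a fixed row budget per batch would need a keyset-style loop instead.</p>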
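<h3 id="case-2trigger-dual-write-對-application-造成-latency">Case 2: Trigger-based dual-write adds latency for the application</h3>
<p><strong>Symptom</strong>: after the trigger is added, application write latency p99 rises from 5ms to 25-50ms; high-throughput batch jobs time out outright.</p>
<p><strong>Root cause</strong>: every INSERT fires the trigger function, which runs a second INSERT into events_daily. IO doubles, and so does index maintenance.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Move dual-write to the application</strong>: application code writes both sides explicitly and batches the writes to amortize IO</li>
<li><strong>Use a logical replication slot</strong>: feed events → events_daily through logical replication instead of a trigger to take the extra IO off the write path. Built-in subscriptions map tables by name, so this needs a consumer that applies the changes to events_daily</li>
<li><strong>Minimize the dual-write window</strong>: enable the trigger only for the backfill and verify phases, and drop it as part of the cutover transaction so no writes fall into a gap</li>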
</ol>
<h3 id="case-3partition_pruning-沒命中planner-仍掃所有-partition">Case 3: Partition pruning misses and the planner still scans every partition</h3>
<p><strong>Symptom</strong>: after the cutover, some application queries jump from 200ms to 5000ms; EXPLAIN shows all 1095 partitions being scanned under <code>Append</code>.</p>
<p><strong>Root cause</strong>: with 1000+ partitions, planning time grows for some queries (including prepared statements that carry no bound on the partition key); or the query uses <code>WHERE event_time = some_function(now())</code>, whose value is unknown at plan time, so plan-time pruning does not fire. A STABLE function can still be pruned at executor startup; a VOLATILE one cannot be pruned at all.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong><code>enable_partition_pruning = on</code></strong> is the default; confirm it has not been disabled</li>
<li><strong>PG 11+ runtime pruning</strong>: prepared statements that run with a generic plan rely on runtime pruning to fill the gap</li>
<li><strong>Mixed-granularity strategy</strong>: 1095 daily partitions is too many; switch to <em>daily for the most recent 90 days, monthly before that</em> to cut the partition count</li>
<li><strong>Planner statistics</strong>: run <code>ANALYZE</code> to rebuild statistics; a large partition tree needs fresh stats for the planner</li>
</ol>
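<p>Whether pruning actually fires can be checked directly from the plan. The sketch below assumes the post-cutover table name <code>events</code>; the prepared statement name <code>recent_events</code> is hypothetical, and <code>plan_cache_mode</code> requires PostgreSQL 12+.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 1. Confirm the setting has not been turned off
SHOW enable_partition_pruning;

-- 2. now() is STABLE, so pruning happens at executor startup;
--    the Append node then reports a &#34;Subplans Removed&#34; count
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM events
WHERE event_time &gt;= now() - interval &#39;1 day&#39;;

-- 3. Force a generic plan to see what a prepared statement gets at runtime
PREPARE recent_events(timestamptz) AS
  SELECT count(*) FROM events WHERE event_time &gt;= $1;
SET plan_cache_mode = force_generic_plan;
EXPLAIN (ANALYZE, COSTS OFF)
EXECUTE recent_events(now() - interval &#39;1 day&#39;);
RESET plan_cache_mode;</code></pre></div>
<p>If every partition still appears as a scanned child of <code>Append</code>, the predicate is not usable for pruning and the query itself needs to change.</p>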
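<h3 id="case-4constraint-exclusion-失敗跨-partition-unique-不-enforce">Case 4: Uniqueness is not enforced across partitions</h3>
<p><strong>Symptom</strong>: after the cutover, the same event for one user shows up in several partitions; the intended unique constraint <code>(user_id, event_id)</code> is not enforced, and a data audit finds duplicates.</p>
<p><strong>Root cause</strong>: a <code>UNIQUE</code> constraint on a PostgreSQL partitioned table <em>must include the partition key</em>, so <code>UNIQUE (user_id, event_id)</code> has to become <code>UNIQUE (user_id, event_id, event_time)</code>. That constraint only rejects rows whose <code>event_time</code> is identical as well; it never deduplicates <code>(user_id, event_id)</code> on its own. Any narrower guarantee comes from partition-local unique indexes on <code>(user_id, event_id)</code>: on monthly partitions they mean "unique per user and event_id within a month", and on daily partitions "unique within a day". The unique scope shrinks from month to day, so deduplication across days within the same month stops working.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Pre-redesign</strong>: state the <em>time scope</em> of the unique constraint explicitly and decide whether the smaller scope after the redesign is acceptable</li>
<li><strong>Application-side dedup</strong>: enforce cross-partition uniqueness with an application-level lookup, for example by claiming keys in Redis with <code>SET key value NX EX ttl</code>, which succeeds only when the key is new (plain <code>SETEX</code> overwrites silently)</li>
<li><strong>Fall back to a non-partitioned dedup table</strong>: create a standalone user_events_dedup table that the application checks before writing</li>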
</ol>
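<p>The dedup-table fix can be sketched as follows. The column names and types of <code>user_events_dedup</code> and the three-column shape of <code>events</code> are assumptions for illustration; the real table almost certainly has more columns.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Assumed shape: a plain (non-partitioned) table, so the key is unique globally
CREATE TABLE user_events_dedup (
  user_id  bigint NOT NULL,
  event_id bigint NOT NULL,
  PRIMARY KEY (user_id, event_id)
);

-- Claim the key first; write the event only if the claim succeeded
WITH claimed AS (
  INSERT INTO user_events_dedup (user_id, event_id)
  VALUES (42, 1001)
  ON CONFLICT DO NOTHING
  RETURNING user_id, event_id
)
INSERT INTO events (user_id, event_id, event_time)
SELECT user_id, event_id, now()
FROM claimed;</code></pre></div>
<p>Doing the claim and the write in one statement keeps them in the same transaction, so a failed event insert also rolls back the claim. The trade-off is one extra index write per event and a dedup table that grows without its own retention policy.</p>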
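<h3 id="case-5drop-老-partition-太頻繁shared_buffers-cache-miss-爆">Case 5: Dropping old partitions too often causes a burst of shared_buffers cache misses</h3>
<p><strong>Symptom</strong>: once daily partitions are live, a nightly cron job DROPs <code>events_2025_05_18</code> (the partition that is 90 days old). After the DROP a large share of shared_buffers is invalidated, and application query latency p99 jumps from 10ms to 100-200ms for about 30 minutes.</p>
<p><strong>Root cause</strong>: PostgreSQL invalidates every shared_buffers page that belongs to the dropped table; after dropping a large partition (10GB+) the cache hit rate falls from 99% to 60% and the application waits on disk IO. The DROP also holds an <code>ACCESS EXCLUSIVE</code> lock on the parent table while it runs, which can stall queries queued behind it.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Run the DROP off-peak</strong>: schedule the cron job for 3-4 a.m., away from business peaks</li>
<li><strong>Pre-warm the next partition</strong>: before the DROP, use <code>pg_prewarm</code> to load the hot partitions into cache</li>
<li><strong>Switch to DETACH plus a delayed DROP TABLE</strong>: DETACH is fast, and <code>DETACH PARTITION ... CONCURRENTLY</code> (PG 14+) avoids the <code>ACCESS EXCLUSIVE</code> lock on the parent; move the DROP TABLE into a weekly batch to lower its frequency</li>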
</ol>
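<p>Fixes 2 and 3 can be combined roughly as below. The partition names are hypothetical examples following the <code>events_daily_YYYY_MM_DD</code> pattern from the migration script, <code>DETACH ... CONCURRENTLY</code> needs PostgreSQL 14+, and <code>pg_prewarm</code> is a contrib extension that has to be available on the server.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Nightly job: detach only. CONCURRENTLY cannot run inside a transaction block
-- and is rejected when the parent has a default partition.
ALTER TABLE events DETACH PARTITION events_daily_2026_02_17 CONCURRENTLY;

-- Optional: load a hot partition into shared_buffers (returns the number of blocks read)
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm(&#39;events_daily_2026_05_18&#39;);

-- Weekly batch: drop the already-detached tables
DROP TABLE events_daily_2026_02_17;</code></pre></div>
<p>After the detach the old table is a standalone relation, so the later DROP no longer needs a lock on the parent.</p>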
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Monthly partition (current)</th>
          <th>Daily partition (target)</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Partition count</td>
          <td>36 (3-year retention)</td>
          <td>1095 (3-year retention)</td>
          <td>30x partition count; planner cost rises (see Planning time)</td>
      </tr>
      <tr>
          <td>Single partition size</td>
          <td>50-500GB</td>
          <td>1-20GB</td>
          <td>Daily partitions are easier to vacuum</td>
      </tr>
      <tr>
          <td>DROP old data</td>
          <td>Monthly cadence</td>
          <td>Daily cadence</td>
          <td>Finer-grained retention control</td>
      </tr>
      <tr>
          <td>Query latency</td>
          <td>50-200ms when a query spans many partitions</td>
          <td>5-50ms when a query spans few partitions</td>
          <td>Most queries are faster with daily partitions</td>
      </tr>
      <tr>
          <td>Planning time</td>
          <td>5-10ms</td>
          <td>50-100ms (for generic plans)</td>
          <td>Planning overhead grows by about one order of magnitude</td>
      </tr>
      <tr>
          <td>Maintenance window</td>
          <td>Vacuuming one partition takes 6 hours</td>
          <td>Vacuuming one partition takes 5-30 minutes</td>
          <td>Smaller maintenance window; can run daily</td>
      </tr>
  </tbody>
</table>
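<p><strong>Reading</strong>: daily partitions fit workloads with <em>high traffic, queries that span a few days rather than whole months, and fine-grained retention</em>; very large partitions (TB-scale per day) still need sub-partitioning.</p>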
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-autovacuum-tuning-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a></h3>
<p>Autovacuum behavior after moving to daily partitions:</p>
<ul>
<li>Each daily partition is autovacuumed independently; tune scale_factor and threshold per partition</li>
<li>Raise <code>autovacuum_max_workers</code> from 3 to 6-10, because the partition count grows sharply</li>
<li>Cold partitions (&gt; 30 days): set <code>autovacuum_enabled = false</code> to avoid wasting CPU. Freeze them first, since anti-wraparound autovacuum still runs on tables where autovacuum is disabled</li>
</ul>
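<p>The per-partition settings above are ordinary storage parameters on the leaf partitions. A minimal sketch, with hypothetical partition names and illustrative threshold values that are not tuned recommendations:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hot partition: vacuum earlier than the global defaults would (values are assumptions)
ALTER TABLE events_daily_2026_05_18 SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold    = 10000
);

-- Cold partition: freeze once, then stop routine autovacuum
VACUUM (FREEZE, ANALYZE) events_daily_2026_04_01;
ALTER TABLE events_daily_2026_04_01 SET (autovacuum_enabled = false);</code></pre></div>
<p>These parameters are set on each leaf partition rather than on the partitioned parent, so whatever job creates new partitions also has to apply them.</p>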
<h3 id="跟-patroni-ha-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>Partition migration must not run during a failover; run it only while the cluster is in a stable state, and re-check partition health after Patroni promotes a new leader.</p>
<h3 id="跟-logical-replication--debezium-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></h3>
<p><code>publish_via_partition_root = true</code> (PostgreSQL 13+) makes the publication report changes as coming from the parent table, so CDC consumers do not need a subscription per partition.</p>
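<p>As a minimal sketch, the option is set when the publication is created; the publication name <code>events_pub</code> is a hypothetical example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Changes from every partition are published under the parent table&#39;s identity
CREATE PUBLICATION events_pub
  FOR TABLE events
  WITH (publish_via_partition_root = true);</code></pre></div>
<p>Without the option, changes are published under each leaf partition's name, and every daily partition created later becomes a new table the consumer has to know about.</p>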
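<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Archive strategy across daily partitions</strong>: archive to S3 cold storage; daily granularity gives finer retention control</li>
<li><strong>pg_partman extension</strong>: automates daily partition creation in place of hand-written cron DDL (its maintenance still has to be scheduled, through its background worker or a job scheduler); confirm Aurora / RDS support first</li>
<li><strong>Sub-partitioning</strong>: if traffic grows sharply, partition on two axes, daily by time plus list by tenant</li>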
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>平行 deep article：<a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a>（partition 基礎）/ <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a></li>
<li>平行 Type F dogfood：<a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis Cluster Re-sharding</a>（dogfood #1）/ <a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">MongoDB Shard + Multi-DC</a>（dogfood #3、F-multi-region sub-type）</li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content 結構由最大差異維度決定</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Data topology 是第 6 audit 維度</a></li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL Multi-Region GDPR Rollout：政策驅動的 migration 屬本 methodology 嗎</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/multi-region-gdpr-rollout/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/multi-region-gdpr-rollout/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。同時是 &lt;a href="https://tarrragon.github.io/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation&lt;/a> 第 1 點「6 維仍可能漏類（identity / consistency / residency 三軸候選）」的 &lt;em>residency 軸驗證&lt;/em>、跟 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration playbook methodology「何時不該套」段&lt;/a> 對「政策合規驅動」是否在 methodology scope 的反思。&lt;/p>&lt;/blockquote>
&lt;h2 id="政策驅動的-migration-屬本-methodology-嗎">政策驅動的 migration 屬本 methodology 嗎&lt;/h2>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> 「何時不該套」段曾把「compliance-driven migration」歸為排除情境、後來改寫為「不在排除範圍 — 法規驅動只是 driver、資料層仍走 type A-E 之一」。本文是該改寫的 &lt;em>正面實證&lt;/em> — GDPR EU residency 強制需求驅動 single-region → multi-region rollout、本文是 &lt;em>政策驅動但仍走 audit + type 對映流程&lt;/em> 的 case study。&lt;/p>
&lt;p>但 reviewer D 在第三輪 audit 提出：residency 不只是 &lt;em>driver&lt;/em>、本身是 &lt;em>cross-cutting constraint&lt;/em>、反向約束 topology + operational + schema；該不該升 &lt;em>獨立 audit 軸&lt;/em>？本文是該議題的 dogfood。&lt;/p>
&lt;h2 id="三層約束driver--topology--contract">三層約束：driver / topology / contract&lt;/h2>
&lt;p>GDPR 對 PostgreSQL multi-region rollout 的影響在三個層次：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。同時是 <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation</a> 第 1 點「6 維仍可能漏類（identity / consistency / residency 三軸候選）」的 <em>residency 軸驗證</em>、跟 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration playbook methodology「何時不該套」段</a> 對「政策合規驅動」是否在 methodology scope 的反思。</p></blockquote>
<h2 id="政策驅動的-migration-屬本-methodology-嗎">Does a policy-driven migration belong to this methodology?</h2>
<p>The "when not to apply" section of the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> once listed "compliance-driven migration" as an excluded scenario. It was later rewritten as "not excluded: regulation is only the driver, and the data layer still follows one of types A-E". This article is <em>positive evidence</em> for that rewrite: a mandatory GDPR EU residency requirement drives a single-region → multi-region rollout, which makes it a case study of a migration that is <em>policy-driven yet still goes through the audit + type-mapping flow</em>.</p>
<p>However, reviewer D raised a question in the third audit round: residency is not only a <em>driver</em>; it is itself a <em>cross-cutting constraint</em> that constrains topology, operational, and schema decisions in reverse. Should it be promoted to an <em>independent audit axis</em>? This article dogfoods that question.</p>
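<h2 id="三層約束driver--topology--contract">Three layers of constraint: driver / topology / contract</h2>
<p>GDPR affects a PostgreSQL multi-region rollout at three layers:</p>
<ol>
<li><strong>Driver layer</strong>: GDPR Chapter V (Articles 44-49) restricts transferring EU personal data outside the EU/EEA without an adequacy decision or appropriate safeguards. This article assumes the stricter working rule that EU customer data must be <em>physically stored in the EU</em>, which is the root reason for the multi-region migration</li>
<li><strong>Topology layer</strong>: cross-region replication cannot <em>freely copy</em> EU customer data across regions; it must be partitioned by GDPR scope, so topology design is constrained by compliance</li>
<li><strong>Contract layer</strong>: an audit must be able to <em>demonstrate</em> that "EU data stays in the EU"; operation logs and replication evidence must be traceable, so the application and ops contracts pick up compliance obligations</li>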
</ol>
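<p>One way the topology-layer constraint can show up in PostgreSQL itself is a publication row filter, so that rows tagged as EU never enter the replication stream to a non-EU subscriber. This is a hedged sketch: the <code>customers</code> table, the <code>region</code> column, and the publication name are hypothetical, and row filters require PostgreSQL 15+.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical names; runs on the publisher that holds both EU and non-EU rows
CREATE PUBLICATION us_east_pub
  FOR TABLE customers WHERE (region &lt;&gt; &#39;eu&#39;);</code></pre></div>
<p>For UPDATE and DELETE to be published, the columns used in the filter must be covered by the table's replica identity. A row filter is only a technical control; it does not by itself produce the audit evidence the contract layer asks for.</p>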
<p>Running the <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">6-dimension diff dimension audit</a> against "single us-east → us-east + eu-west":</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same PostgreSQL; may add a region column</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>HA / backup / monitoring redesigned across regions</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Same OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Same PostgreSQL instances + Patroni</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>Routing logic by user region; changes are required</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Data topology</td>
          <td>Single → multi-region replication</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td><strong>Residency contract</strong></td>
          <td><strong>EU data must not leave the EU; log and replication scope are constrained</strong></td>
          <td><strong>High</strong></td>
      </tr>
  </tbody>
</table>
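<p>The 6-dimension audit cannot capture the "Residency contract = High" axis. Classified with the existing 6 dimensions, this rollout becomes Type F multi-axis (topology and operational High, application change Medium) plus an add-on policy-compliance section, but that classification <em>misses the reverse constraints that compliance places on topology / operational / application</em>:</p>
<ul>
<li>Topology layer: the 6 dimensions only audit "does the topology change", not "is the topology's scope constrained by compliance"</li>
<li>Operational layer: the 6 dimensions only audit "is the operational model redesigned", not "do audit log / encryption / access control meet the compliance requirements"</li>
<li>Application layer: the 6 dimensions only audit "does application code change", not "does data routing follow the residency rule"</li>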
</ul>
<p><strong>Residency is not only a driver; it is a cross-cutting constraint</strong> that constrains 3-4 other dimensions in reverse and carries its own workload (compliance evidence collection / DPIA / audit prep).</p>
<h2 id="residency-axis-是否獨立3-個論據">Is the residency axis independent? Three arguments</h2>
<p><strong>Yes, residency is an independent axis</strong>:</p>
<ol>
<li><strong>It can occur on its own</strong>: take an existing multi-region setup and add a rule such as "credit card data may only live in us-east" (for example an internal policy adopted for PCI scope). That is a <em>pure residency change</em>: the 6 existing dimensions all stay Low (no topology redesign, no operational redesign, the application only adds a routing rule), yet residency still constrains routing and log scope</li>
<li><strong>It drives the workload distribution</strong>: the workload split for this multi-region GDPR rollout:
<ul>
<li>Topology setup (logical replication / region setup): ~25%</li>
<li>Operational redesign (HA / backup / monitoring): ~20%</li>
<li>Application routing change (region detection / data filter): ~15%</li>
<li><strong>Residency compliance (DPIA / audit log / access control / encryption / evidence): ~40%</strong></li>
</ul>
</li>
<li><strong>Cross-cutting nature</strong>: residency affects more than "where the data lives". It also determines:
<ul>
<li>Whether backups may be stored cross-region (under a strict residency policy, usually not)</li>
<li>Whether audit logs contain EU PII (requires EU-side logging plus a cross-region log filter)</li>
<li>Whether encryption keys may be shared across regions (in most cases, no)</li>
<li>Whether application access logs contain EU IPs / user IDs</li>
</ul>
</li>
</ol>
<p><strong>No, residency can be folded into operational + driver</strong>:</p>
<ul>
<li>Counter-argument: residency is an operational sub-topic; adding audit and replication-scope rules is enough</li>
<li>Rebuttal: residency constrains topology / application / operational in reverse and carries its own compliance workload (DPIA / cross-border transfer agreements / data subject rights); it is not merely an operational sub-topic</li>
</ul>
<p>Evidence: about 40% of this migration's workload is compliance, which supports treating residency as an <em>independent workload axis</em>.</p>
<h2 id="結構type-f-multi-axis--residency-compliance-獨立段">Structure: Type F multi-axis + a separate residency compliance section</h2>
<p>This article's structure is <em>primarily Type F</em> (topology high + operational high) plus a <em>separate residency compliance section</em> (which belongs to none of the 6 dimensions):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. Does a policy-driven migration belong to this methodology? (meta-reflection opening)
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. Three layers of constraint: driver / topology / contract
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Arguments for whether the residency axis is independent
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. Structure differentiator (Type F multi-axis + residency compliance section)
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. Reverse constraints of EU residency on topology / operational / application
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. Migration flow (including the DPIA and evidence collection phases)
</span></span><span class="line"><span class="ln">7</span><span class="cl">7. Production failure drills
</span></span><span class="line"><span class="ln">8</span><span class="cl">8. Capacity / cost (including compliance audit cost)
</span></span><span class="line"><span class="ln">9</span><span class="cl">9. Integration / next steps</span></span></code></pre></div><p>Nine sections, 240-270 lines. Compared with a standard Type F article, that is one extra section for residency compliance and one for the meta-reflection.</p>
<h2 id="eu-residency-對其他維度的反向約束">Reverse constraints of EU residency on the other dimensions</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Residency rule → Topology constraint:
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">- EU customer data must not replicate to us-east
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">- Backups of EU tables must not be stored in a non-EU region
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">- Publications feeding a us-east subscriber must filter out EU data on the publisher side
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">Residency rule → Operational constraint:
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">- Cross-region monitoring must not export EU PII to a global SaaS (Datadog)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">- Audit logs containing EU user_id must be stored in the EU
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">- Encryption keys (KMS) must not be shared across regions (the EU side uses an EU KMS)
</span></span><span class="line"><span class="ln">10</span><span class="cl">- DBA / SRE access to EU data must come from an EU jurisdiction and leave an audit trail
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">Residency rule → Application constraint:
</span></span><span class="line"><span class="ln">13</span><span class="cl">- The application must detect the user's region and route to the matching DB endpoint
</span></span><span class="line"><span class="ln">14</span><span class="cl">- Cross-region joins / aggregates over EU users must run on the EU side
</span></span><span class="line"><span class="ln">15</span><span class="cl">- The data export feature must reject cross-region export requests</span></span></code></pre></div><p>Each reverse constraint is <em>new work</em> that falls outside the 6-dimension audit.</p>
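<p>The application constraints can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the endpoint host names, the <code>EU_COUNTRIES</code> set, and the function names are all hypothetical, and a real system would key residency off the account's declared residency rather than a short country list.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Hypothetical endpoint names; replace with your own cluster DNS names.
DB_ENDPOINTS = {
    "eu": "pg-eu.internal.example:5432",
    "us": "pg-us.internal.example:5432",
}

# Country codes treated as EU-resident in this sketch (deliberately incomplete).
EU_COUNTRIES = frozenset({"DE", "FR", "IE", "NL", "ES", "IT"})


def residency_region(country_code):
    """Map a user's country code to the region that must hold their data."""
    return "eu" if country_code.upper() in EU_COUNTRIES else "us"


def endpoint_for(country_code):
    """Return the only DB endpoint this user's queries may be sent to."""
    return DB_ENDPOINTS[residency_region(country_code)]


def check_export(user_country, target_region):
    """Reject an export whose target region differs from the data's home region."""
    home = residency_region(user_country)
    if home != target_region:
        raise PermissionError(
            "cross-region export blocked: data lives in %s, target is %s"
            % (home, target_region)
        )
    return True
</code></pre></div>
<p>Routing and export checks share one <code>residency_region</code> function so that the two rules cannot drift apart.</p>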
<h2 id="migration-流程含-dpia--evidence-collection">Migration flow (including DPIA + evidence collection)</h2>
<p>10 steps over 5 months:</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Step</th>
          <th>Matching 6-dimension axis / compliance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0 Pre-migration</td>
          <td>1. DPIA（Data Protection Impact Assessment）</td>
          <td>Compliance pre-requisite</td>
      </tr>
      <tr>
          <td>0</td>
          <td>2. Legal review of the cross-border transfer agreement</td>
          <td>Compliance</td>
      </tr>
      <tr>
          <td>1 Setup</td>
          <td>3. EU PostgreSQL cluster build + Patroni</td>
          <td>Operational + Topology</td>
      </tr>
      <tr>
          <td>1</td>
          <td>4. EU KMS + audit log + monitoring stack</td>
          <td>Operational + Residency</td>
      </tr>
      <tr>
          <td>2 Data</td>
          <td>5. Configure the logical replication filter (exclude EU tables from us-east)</td>
          <td>Topology + Residency</td>
      </tr>
      <tr>
          <td>2</td>
          <td>6. Initial sync of EU tables to the EU cluster</td>
          <td>Topology</td>
      </tr>
      <tr>
          <td>3 App</td>
          <td>7. Add region detection + routing to the application</td>
          <td>Application change</td>
      </tr>
      <tr>
          <td>3</td>
          <td>8. Ban cross-region queries (reject cross-region joins that touch EU tables)</td>
          <td>Application + Residency</td>
      </tr>
      <tr>
          <td>4 Verify</td>
          <td>9. Compliance audit + evidence package</td>
          <td>Residency</td>
      </tr>
      <tr>
          <td>4</td>
          <td>10. DPO sign-off + DR drill</td>
          <td>Residency + Operational</td>
      </tr>
  </tbody>
</table>
<p>Steps 1, 2, 9, and 10 are <em>residency / compliance-specific</em> and fall outside the existing 6 dimensions.</p>
<h2 id="production-故障演練">Production failure drills</h2>
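<h3 id="case-1replication-filter-漏-tableeu-資料-leak-到-us-east">Case 1: Replication filter misses a table, EU data leaks to us-east</h3>
<p><strong>Symptom</strong>: six months in, an internal audit finds EU customer data in the <code>customers</code> table on the us-east side. The replication filter was never updated, and the newly added <code>eu_customer_extensions</code> table was replicated to us-east automatically.</p>
<p><strong>Root cause</strong>: the PostgreSQL logical replication publication was created with <code>FOR ALL TABLES</code>, which automatically includes every table added later. This is a choice made at creation time, not a default: a publication created without a <code>FOR</code> clause starts empty. The publication should have used an explicit <code>FOR TABLE list...</code> that goes through GDPR review.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Switch the publication to an explicit table list</strong>: <code>CREATE PUBLICATION xxx FOR TABLE users, orders, ...</code> instead of <code>FOR ALL TABLES</code></li>
<li><strong>Add a GDPR check to schema change review</strong>: every DDL PR must answer "does the new table contain EU PII, and should it be filtered?"</li>
<li><strong>Replication monitor</strong>: periodically run <code>SELECT * FROM pg_publication_tables</code>, compare the result against the expected list, and alert on drift immediately</li>
<li><strong>Evidence collection</strong>: archive the filter configuration and audit logs so that the DPO can establish when a leak started</li>
</ol>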
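<p>The replication monitor from step 3 reduces to a set comparison. The sketch below assumes the rows have already been fetched from <code>pg_publication_tables</code> (its <code>schemaname</code> and <code>tablename</code> columns); the function name <code>publication_drift</code> and the sample table names are illustrative, not part of any library.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def publication_drift(expected, actual_rows):
    """Compare an approved table list against rows from pg_publication_tables.

    expected:    iterable of "schema.table" strings approved by the GDPR review.
    actual_rows: iterable of (schemaname, tablename) tuples, e.g. the result of
                 SELECT schemaname, tablename FROM pg_publication_tables
                 WHERE pubname = %s
    """
    approved = set(expected)
    published = {"%s.%s" % (schema, table) for schema, table in actual_rows}
    return {
        "unexpected": sorted(published - approved),  # replicated but never approved
        "missing": sorted(approved - published),     # approved but not replicated
    }


# Sample data standing in for a real query result.
rows = [
    ("public", "users"),
    ("public", "orders"),
    ("public", "eu_customer_extensions"),
]
drift = publication_drift(["public.users", "public.orders"], rows)
if drift["unexpected"]:
    print("ALERT: unapproved tables in publication:", drift["unexpected"])
</code></pre></div>
<p>Keeping the approved list in version control gives the evidence trail from step 4 for free: every change to the list has an author and a review.</p>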
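<h3 id="case-2backup-跨-region-store合規違規">Case 2: Backups stored cross-region, compliance violation</h3>
<p><strong>Symptom</strong>: after a year in production, a GDPR audit finds backups of EU tables in a US S3 bucket (us-east-1), which conflicts with the transfer restrictions in Articles 44-49.</p>
<p><strong>Root cause</strong>: the shared pgBackRest configuration pointed every cluster at one <em>global S3 bucket</em> in us-east-1. pgBackRest writes to whichever repository it is configured with, so the EU PostgreSQL cluster's backups landed in us-east: a cross-border transfer with no transfer mechanism.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Per-region backup config</strong>: the EU cluster uses an EU S3 bucket (eu-west-1), written into its pgBackRest config</li>
<li><strong>Backup test</strong>: run a backup restore drill every month and validate that the backup comes from the EU region</li>
<li><strong>Enforce with policy</strong>: deny the EU backup role any S3 request whose <code>aws:RequestedRegion</code> is not <code>eu-west-1</code>, so a misconfigured repository fails instead of silently writing abroad</li>
<li><strong>Same for the audit log archive</strong>: log shipping must respect the region as well</li>
</ol>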
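<p>The per-region backup rule can be checked before a config ever ships. This is a rough sketch that parses a pgBackRest-style INI file with Python's standard <code>configparser</code> and flags any <code>repoN-s3-region</code> outside the EU. The function name, the sample bucket names, and the <code>eu-</code> prefix rule are assumptions made for illustration; real config files may use syntax that <code>configparser</code> does not accept.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import configparser


def non_eu_repos(config_text, allowed_prefix="eu-"):
    """Return (section, option, region) for every repoN-s3-region outside the EU."""
    # strict=False tolerates repeated options; interpolation=None keeps "%" literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(config_text)
    offenders = []
    for section in parser.sections():
        for option, value in parser.items(section):
            if option.startswith("repo") and option.endswith("-s3-region"):
                if not value.startswith(allowed_prefix):
                    offenders.append((section, option, value))
    return offenders


# Hypothetical config for an EU cluster with one compliant and one stray repository.
SAMPLE = """
[global]
repo1-type=s3
repo1-s3-bucket=example-eu-backups
repo1-s3-region=eu-west-1
repo2-type=s3
repo2-s3-bucket=example-global-backups
repo2-s3-region=us-east-1
"""

print(non_eu_repos(SAMPLE))
</code></pre></div>
<p>Run as a CI step on the config repository, a non-empty result can block the deploy. It complements the monthly restore drill but does not replace it.</p>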
<h3 id="case-3monitor-saas-收集-eu-pii合規-alert">Case 3: Monitoring SaaS collects EU PII, compliance alert</h3>
<p><strong>Symptom</strong>: Datadog APM captured <code>user_email</code> in traces of EU customer requests. The DPO caught it and required deleting the past 90 days of Datadog data.</p>
<p><strong>Root cause</strong>: APM traces collect application context by default, PII included. The Datadog organization in use was hosted in us-east, so the PII crossed the border into us-east, which is a violation.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Scrub PII in APM</strong>: the application scrubs user_email / user_id before emitting a trace and replaces them with a hash (a hash is pseudonymised rather than anonymised data, so it still needs care)</li>
<li><strong>EU-specific monitoring stack</strong>: EU PostgreSQL + APM use Grafana on EU EKS and send nothing to Datadog</li>
<li><strong>Audit cross-region SaaS use</strong>: every external SaaS (Datadog / Sentry / NewRelic) must be configured in a GDPR-friendly way</li>
<li><strong>Privacy by design</strong>: logs and traces scrub PII by default rather than as an opt-in</li>
</ol>
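<p>A scrubbing step along the lines of fix 1 might look like the sketch below. It uses a keyed hash (HMAC-SHA256 from the standard library), so the same user still correlates across traces without exposing the raw value. The attribute names in <code>PII_KEYS</code>, the <code>h:</code> prefix, and the 16-character truncation are assumptions made for illustration; this is not Datadog's own scrubbing API.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import hashlib
import hmac

# Attribute names treated as PII in this sketch; extend for your own schema.
PII_KEYS = frozenset({"user_email", "user_id"})


def scrub_attributes(attributes, secret_key):
    """Return a copy of trace attributes with PII values replaced by keyed hashes.

    secret_key must be bytes and should stay inside the EU environment.
    """
    scrubbed = {}
    for key, value in attributes.items():
        if key in PII_KEYS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            scrubbed[key] = "h:" + digest.hexdigest()[:16]
        else:
            scrubbed[key] = value
    return scrubbed
</code></pre></div>
<p>Calling <code>scrub_attributes</code> at the single point where spans are created makes scrubbing the default path, which is what fix 4 asks for.</p>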
<h3 id="case-4cross-region-query-跑-eu--us-資料residency-違規">Case 4: Cross-region query touches EU + US data, residency violation</h3>
<p><strong>Symptom</strong>: a BI dashboard runs a cross-region aggregation query (EU sales + US sales). PostgreSQL FDW on the us-east cluster queries the EU cluster, and the EU-side server log shows PII being exported to us-east.</p>
<p><strong>Root cause</strong>: developers used the PostgreSQL Foreign Data Wrapper (FDW) as a convenient way to run cross-region queries, unaware that GDPR treats this as a cross-border PII export.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Architecture: aggregate at edge</strong>: BI runs a <em>per-region aggregate</em> and composes the results in the BI layer (no PII); it never joins across regions directly</li>
<li><strong>Restrict FDW</strong>: disable FDW from us-east → EU cluster and enforce a one-way data flow</li>
<li><strong>DBA access policy</strong>: DBAs must not query the EU cluster directly from a us-east jumpbox</li>
<li><strong>Query audit</strong>: run PII detection (regex / NER) over the production query log and alert immediately on a cross-border export</li>
</ol>
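<p>The "aggregate at edge" pattern from fix 1 is sketched below. Each region reduces its own rows to PII-free totals, and only those totals travel to the BI layer. The function names and the <code>amount</code> field (integer cents) are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def regional_summary(region, rows):
    """Reduce raw rows to PII-free totals; run this inside the region that owns the data."""
    return {
        "region": region,
        "order_count": len(rows),
        "revenue": sum(row["amount"] for row in rows),
    }


def compose(summaries):
    """Combine per-region summaries in the BI layer; no row-level data crosses regions."""
    return {
        "order_count": sum(s["order_count"] for s in summaries),
        "revenue": sum(s["revenue"] for s in summaries),
        "by_region": {s["region"]: s["revenue"] for s in summaries},
    }
</code></pre></div>
<p>Only <code>compose</code> runs outside the EU, and it only ever sees three numbers per region, so the cross-region join from the symptom never has to exist.</p>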
<h3 id="case-5dr-drill-跨-region-failover暴露-residency-assumption-失敗">Case 5: Cross-region failover DR drill exposes a failed residency assumption</h3>
<p><strong>Symptom</strong>: the DR drill "EU fully unavailable, fail over to us-east" ran, and us-east turned out to hold <em>no EU data</em> because the strict residency filter had been in place all along. On the business side, EU customers could not be served for 24 hours.</p>
<p><strong>Root cause</strong>: strict GDPR residency conflicts with strict DR availability. <em>Cross-region DR</em> requires <em>holding data across regions</em>, while <em>strict residency</em> means the <em>DR scope is limited</em>.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>DR strategy revision</strong>: the EU side runs multi-AZ within the EU and does not rely on another region; accept a longer RTO for the case where the whole EU region is unavailable</li>
<li><strong>Compliance + DR negotiation</strong>: discuss with the DPO / legal whether <em>a short-window cross-border DR is acceptable</em>, and sign a cross-border transfer agreement</li>
<li><strong>Backup recovery inside the EU</strong>: EU backups are stored across AZs, not across regions; after an EU AZ disaster, rebuild from another EU AZ</li>
<li><strong>State the RTO trade-off explicitly</strong>: the EU customer SLA says "RTO 1 hour within regional DR, 24-48 hours for global DR"; residency and DR are a <em>mutually exclusive trade-off</em></li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Single region</th>
          <th>Multi-region GDPR-compliant</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Infrastructure cost</td>
          <td>baseline</td>
          <td>+60-100% (two clusters + cross-region replication)</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.5-1</td>
          <td>1-2 FTE (SREs for two regions + compliance)</td>
      </tr>
      <tr>
          <td>Compliance cost</td>
          <td>0</td>
          <td>$50-200K USD setup（DPIA / audit / DPO time）+ ongoing</td>
      </tr>
      <tr>
          <td>Egress cost</td>
          <td>Low</td>
          <td>High (cross-region replication traffic)</td>
      </tr>
      <tr>
          <td>Application latency</td>
          <td>Single AZ</td>
          <td>Low for both: EU customers connect to the EU, US customers to the US</td>
      </tr>
      <tr>
          <td>DR RTO</td>
          <td>30 minutes (single region)</td>
          <td>EU regional 1 hour / global 24-48 hours</td>
      </tr>
      <tr>
          <td>Audit cost</td>
          <td>Minimal</td>
          <td>Quarterly DPIA + annual compliance audit</td>
      </tr>
  </tbody>
</table>
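<p><strong>Reading</strong>: a GDPR-compliant multi-region setup costs 1.5-2.5x, but compliance is <em>necessary spend</em>, and judging it through a cost-optimization lens misleads. For most European businesses the payback horizon is 7+ years when counted against avoiding a fine of up to 4% of global annual turnover.</p>
<h2 id="整合--下一步">Integration / next steps</h2>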
<h3 id="跟-postgresql--aurora-對位">跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> 對位</h3>
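<p>Aurora Global Database simplifies cross-region setup, but it replicates the whole cluster to its secondary regions with no table-level filter, so the residency split still has to be enforced at the application layer. "Use Aurora" does not by itself solve GDPR.</p>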
<h3 id="跟-multi-dc-mongodb-對位">跟 <a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">Multi-DC MongoDB</a> 對位</h3>
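<p>Both articles cover a multi-region rollout, but this one adds a compliance dimension. The MongoDB article is driven purely by capacity + DR, while this one adds a residency constraint, so the structures differ.</p>
<h3 id="跟-128-self-aware-limitation-第-1-點對位">Alignment with point 1 of the #128 self-aware limitations</h3>
<p>This article tests the <em>residency axis candidate</em>:</p>
<ul>
<li><strong>Yes, the axis is independent</strong>: it reverse-constrains topology + operational + application and carries its own compliance workload (DPIA / evidence collection / DPO sign-off)</li>
<li><strong>Insufficient as a driver</strong>: classifying residency as a driver, as the methodology does, is too narrow and ignores its nature as a cross-cutting constraint</li>
</ul>
<p>The audit may later expand to 7 dimensions (adding residency / compliance contract). Evaluate again after accumulating cases from other compliance regimes such as PCI / HIPAA / SOX.</p>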
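<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Unifying the three axis candidates Identity + Consistency + Residency</strong>: the three articles in this batch test them separately; once more evidence accumulates, consider a standalone #129 card or expanding the audit to 7-8 dimensions</li>
<li><strong>Schrems II + new EU data transfer rules</strong>: transatlantic data transfer regulation changes fast, so the playbook has a short half-life</li>
<li><strong>Data localization in China / Russia / India</strong>: similar to GDPR but different in the details; evaluate after cases accumulate</li>
</ul>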
<h2 id="相關連結">相關連結</h2>
<ul>
<li>上游 vendor 頁：<a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>平行 multi-region case：<a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">MongoDB Shard + Multi-DC</a></li>
<li>平行 axis 候選驗證：<a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/migrate-to-aws-secrets-manager/" data-link-title="Vault → AWS Secrets Manager：「secret」不是「secret」、identity model 才是核心差異" data-link-desc="Vault → AWS Secrets Manager migration 表面是 secret store 替換、實際核心是 identity model 對位（Vault token &#43; policy vs AWS IAM &#43; resource policy）；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 identity axis 候選 — identity 是否獨立 audit 軸；5 個 production 踩雷（IAM principal 對位 / dynamic credential 對等失敗 / lease lifecycle 模型不同 / audit log 結構差 / 計費模型反轉）">Vault → AWS Secrets Manager</a>（identity 候選）/ <a href="/blog/backend/01-database/vendors/dynamodb/consistency-model-optimization/" data-link-title="DynamoDB Strongly Consistent → Eventually Consistent：same protocol, different contract" data-link-desc="DynamoDB consistency model 從 strongly consistent read 改 eventually consistent read 是 50% cost 優化但風險集中在 application contract — 同 vendor / 同 protocol / 同 table / 不同 read consistency；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 提出的 consistency axis 候選；涵蓋 read pattern audit / 5 個 production 踩雷">DynamoDB Consistency Model</a>（consistency 候選）</li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 self-aware limitation 第 1 點</a>（residency axis 候選驗證、本文是該驗證的 dogfood）</li>
</ul>
]]></content:encoded></item><item><title>PostgreSQL pgBouncer 配置 + 連線池治理</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/</guid><description>&lt;p>PostgreSQL 的 connection 是 &lt;em>昂貴的 process&lt;/em>、每個 connection ~10MB RAM、idle connection 也吃 backend slot。當 application instance 數量爆炸（K8s replica × 多 deployment × pool size）、直接連 PostgreSQL 會把 backend slot 耗盡、新 connection 全 refuse — 即使 active query 不多。pgBouncer 是 &lt;em>connection pool proxy&lt;/em>、把幾千個 application connection 收斂成幾百個 PostgreSQL backend connection、production-grade PostgreSQL 部署的標配。&lt;/p>
&lt;p>本文不是 pgBouncer overview（請看 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor 頁&lt;/a> 中 connection pool 段）— 而是 &lt;em>production 部署 + 故障演練&lt;/em> 的實作層教學。覆蓋三層 pool（application → pgBouncer → PostgreSQL）的對齊、transaction pooling 跟 session pooling 的選擇陷阱、跟 HA failover 的整合、容量規劃。&lt;/p>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>典型觸發場景：團隊規模從 50 人爬到 200 人、microservice 從 20 個爬到 100 個、K8s replica 從 3 個爬到每服務 5-10 個。直連 PostgreSQL 的 connection 計算：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">100 service × 6 replica × 30 application pool = 18000 connection&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>PostgreSQL 預設 &lt;code>max_connections = 100&lt;/code>、production 設 &lt;code>max_connections = 500-1000&lt;/code> 已經是上限（每多一個都加 memory + context switch cost）。18000 連線打 PostgreSQL 直接打爆。&lt;/p>
&lt;p>進一步問題：&lt;/p>
&lt;ul>
&lt;li>一半 connection 是 &lt;em>idle&lt;/em>（application pool 預留、實際沒查詢）— 浪費 backend slot&lt;/li>
&lt;li>Cold start 時所有 replica 同時建 connection、瞬間 spike&lt;/li>
&lt;li>DB failover 時所有 application 同時 reconnect、prod-test pattern 跑不通&lt;/li>
&lt;li>DNS-based failover 時 application connection pool 不知道 backend 換了&lt;/li>
&lt;/ul>
&lt;p>pgBouncer 解這四個問題。但 &lt;em>引入 pgBouncer&lt;/em> 後又會引入新的問題層（pgBouncer 跟 application pool 不對齊、transaction pooling 的 session state 限制、HA 故障時 pgBouncer 也要 failover）— 本文討論這些。&lt;/p>
&lt;h2 id="核心概念pool-mode--sizing">核心概念：pool mode + sizing&lt;/h2>
&lt;p>pgBouncer 的 first-class concept 是 &lt;em>pool mode&lt;/em>、決定 application connection 跟 PostgreSQL backend connection 的綁定方式：&lt;/p></description><content:encoded><![CDATA[<p>A PostgreSQL connection is an <em>expensive process</em>: each connection costs roughly 10MB of RAM, and an idle connection still occupies a backend slot. When the number of application instances explodes (K8s replicas × many deployments × pool size), connecting directly to PostgreSQL exhausts the backend slots and every new connection is refused, even when few queries are active. pgBouncer is a <em>connection pool proxy</em> that funnels thousands of application connections into a few hundred PostgreSQL backend connections, and it is standard in production-grade PostgreSQL deployments.</p>
<p>This article is not a pgBouncer overview (see the connection pool section of the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor page</a>). It is an implementation-level guide to <em>production deployment + failure drills</em>. It covers aligning the three pool layers (application → pgBouncer → PostgreSQL), the pitfalls of choosing between transaction pooling and session pooling, integration with HA failover, and capacity planning.</p>
<h2 id="問題情境">Problem scenario</h2>
<p>A typical trigger: the team grows from 50 to 200 people, microservices from 20 to 100, and K8s replicas from 3 to 5-10 per service. The connection count when connecting directly to PostgreSQL:</p>





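<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">100 services × 6 replicas × 30 connections per application pool = 18000 connections</span></span></code></pre></div><p>PostgreSQL defaults to <code>max_connections = 100</code>, and setting <code>max_connections = 500-1000</code> in production is already near the practical ceiling (every additional connection adds memory and context-switch cost). 18000 connections would overwhelm PostgreSQL outright.</p>
<p>Further problems:</p>
<ul>
<li>Half of the connections are <em>idle</em> (reserved by the application pool and running no queries), which wastes backend slots</li>
<li>On cold start all replicas open connections at the same time, causing an instant spike</li>
<li>On DB failover every application reconnects at the same time, and reconnect behaviour that worked in test breaks down at production scale</li>
<li>With DNS-based failover the application connection pool does not know that the backend has changed</li>
</ul>
<p>pgBouncer addresses these four problems. But <em>introducing pgBouncer</em> adds a new layer of problems (misalignment between pgBouncer and the application pool, the session-state limits of transaction pooling, pgBouncer itself needing failover during an HA incident), and those are what this article discusses.</p>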
<h2 id="核心概念pool-mode--sizing">Core concepts: pool mode + sizing</h2>
<p>pgBouncer's first-class concept is the <em>pool mode</em>, which decides how an application connection is bound to a PostgreSQL backend connection:</p>
<ul>
<li><strong>Session pooling</strong>: once an application connection gets a backend connection, it stays bound to that backend for the whole session and releases it only on disconnect. The semantics match a direct connection and session state is preserved. But <em>idle connections still occupy backend slots</em>, so consolidation is poor. It suits workloads with <em>few connections that need session state</em> (prepared statements, temporary tables, advisory locks, and so on).</li>
<li><strong>Transaction pooling</strong>: an application connection is bound to a backend only <em>for the duration of a transaction</em> and releases it right after commit / rollback. Different transactions on the same application connection may land on different backends. Consolidation is high (idle connections hold no backend slot), but <em>session state is tightly restricted</em>: no session-level <code>SET</code>, no advisory locks held across transactions, and no prepared statements unless pgBouncer 1.21+ tracks them through <code>max_prepared_statements</code>.</li>
<li><strong>Statement pooling</strong>: the backend is released after every statement. Consolidation is extreme, but <em>a transaction cannot even span statements</em>, so the vast majority of applications cannot use it; it fits only batch query scenarios.</li>
</ul>
<p><strong>Default to transaction pooling in production</strong>, with prepared statements disabled in the application (or use <a href="https://www.pgbouncer.org/config.html#max_prepared_statements">PgBouncer-supported prepared statements</a>, which require pgBouncer 1.21+). Enable session pooling only for the exceptions.</p>
<p><strong>Pool sizing formula</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">PostgreSQL max_connections     = pgBouncer instances × pools × default_pool_size + reserve
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgBouncer default_pool_size    = backend connection cap per (user, database) pool
</span></span><span class="line"><span class="ln">3</span><span class="cl">Application pool size          = pgBouncer connections held by each application instance</span></span></code></pre></div><p>Example: 50 application replicas with a pool of 30 each, one pgBouncer with default_pool_size = 20, and 3 databases accessed by one user each (so 3 pools).</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Total application → pgBouncer = 50 × 30 = 1500 connections
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgBouncer → PostgreSQL        = 3 × 20 = 60 connections
</span></span><span class="line"><span class="ln">3</span><span class="cl">PostgreSQL max_connections    = 60 + reserve (50 held back for admin / migration) = 110</span></span></code></pre></div><p>1500 → 110 is a 13.6x reduction, and PostgreSQL stays within a reasonable limit.</p>
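<p>The same arithmetic as a small Python helper, useful for trying other replica counts. The function and parameter names are illustrative, and the sketch assumes a single pgBouncer instance with one user per database, as in the example above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def pool_plan(replicas, app_pool_size, pools, default_pool_size, reserve):
    """Work through the sizing arithmetic for one pgBouncer instance.

    pools:   number of (user, database) pools; one user per database here.
    reserve: PostgreSQL connections held back for admin / migration work.
    """
    client_conns = replicas * app_pool_size       # application to pgBouncer
    backend_conns = pools * default_pool_size     # pgBouncer to PostgreSQL
    max_connections = backend_conns + reserve
    return {
        "client_conns": client_conns,
        "backend_conns": backend_conns,
        "max_connections": max_connections,
        "ratio": round(client_conns / max_connections, 1),
    }


plan = pool_plan(replicas=50, app_pool_size=30, pools=3, default_pool_size=20, reserve=50)
print(plan)
</code></pre></div>
<p>With the numbers from the example this reproduces 1500 client connections, 60 backend connections, <code>max_connections</code> of 110, and a ratio of 13.6.</p>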
<h2 id="step-by-step-配置">Step-by-step configuration</h2>
<p><strong>pgBouncer.ini</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">[databases]</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="na">mydb</span> <span class="o">=</span> <span class="s">host=postgres-primary.internal port=5432 dbname=mydb auth_user=pgbouncer</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="k">[pgbouncer]</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">listen_port</span> <span class="o">=</span> <span class="s">6432</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">listen_addr</span> <span class="o">=</span> <span class="s">0.0.0.0</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">auth_type</span> <span class="o">=</span> <span class="s">scram-sha-256</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="na">auth_file</span> <span class="o">=</span> <span class="s">/etc/pgbouncer/userlist.txt</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="na">auth_query</span> <span class="o">=</span> <span class="s">SELECT usename, passwd FROM pg_shadow WHERE usename=$1</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="na">pool_mode</span> <span class="o">=</span> <span class="s">transaction</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="na">default_pool_size</span> <span class="o">=</span> <span class="s">20</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="na">min_pool_size</span> <span class="o">=</span> <span class="s">5</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="na">reserve_pool_size</span> <span class="o">=</span> <span class="s">10</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="na">reserve_pool_timeout</span> <span class="o">=</span> <span class="s">5</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="na">max_client_conn</span> <span class="o">=</span> <span class="s">2000</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="na">max_db_connections</span> <span class="o">=</span> <span class="s">100</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="na">server_idle_timeout</span> <span class="o">=</span> <span class="s">600</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="na">server_lifetime</span> <span class="o">=</span> <span class="s">3600</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="na">server_connect_timeout</span> <span class="o">=</span> <span class="s">15</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="na">server_login_retry</span> <span class="o">=</span> <span class="s">5</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="na">client_idle_timeout</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="na">client_login_timeout</span> <span class="o">=</span> <span class="s">60</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">
</span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="na">stats_period</span> <span class="o">=</span> <span class="s">60</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="na">log_connections</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="na">log_disconnections</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="na">log_pooler_errors</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl">
</span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="na">admin_users</span> <span class="o">=</span> <span class="s">pgbouncer_admin</span>
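</span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="na">stats_users</span> <span class="o">=</span> <span class="s">pgbouncer_stats</span></span></span></code></pre></div><p>Key fields explained:</p>
<ul>
<li><code>pool_mode = transaction</code>: fits the vast majority of production scenarios</li>
<li><code>default_pool_size = 20</code>: the cap on backend connections to PostgreSQL for each (user, database) pool; factor it into PostgreSQL <code>max_connections</code> whenever you tune it</li>
<li><code>reserve_pool_size = 10</code> + <code>reserve_pool_timeout = 5</code>: when default_pool_size is used up and a client has waited 5 seconds without getting a connection, the reserve pool kicks in. It is a buffer for <em>sudden spikes</em>, not part of the baseline</li>
<li><code>max_client_conn = 2000</code>: the maximum number of application connections pgBouncer accepts</li>
<li><code>server_lifetime = 3600</code>: backend connections older than 1 hour are closed once they are no longer in use, which keeps long-lived connections from accumulating memory bloat (check connection age in PostgreSQL <code>pg_stat_activity</code>)</li>
<li><code>auth_query</code>: pgBouncer looks up passwords in PostgreSQL <code>pg_shadow</code>, so the local userlist only needs the <code>auth_user</code> entry rather than every application user. This is the recommended production setup. Reading <code>pg_shadow</code> directly needs superuser rights, so a <code>SECURITY DEFINER</code> lookup function is the usual way to keep <code>auth_user</code> unprivileged</li>
</ul>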
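<p>Because each pool can also draw on <code>reserve_pool_size</code>, the worst-case backend count is higher than <code>pools × default_pool_size</code>. The sketch below reads the two settings from a pgBouncer-style INI with Python's standard <code>configparser</code>. The function name and the trimmed sample config are assumptions, and the real file's <code>max_db_connections</code> cap is ignored for simplicity.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import configparser


def worst_case_backends(ini_text, pools):
    """Upper bound on server connections one pgBouncer can open across `pools` pools."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    cfg = parser["pgbouncer"]
    # Each pool may grow to default_pool_size plus its reserve during a spike.
    per_pool = cfg.getint("default_pool_size") + cfg.getint("reserve_pool_size", fallback=0)
    return pools * per_pool


# Trimmed-down sample holding only the settings this check needs.
SAMPLE_INI = """
[pgbouncer]
default_pool_size = 20
reserve_pool_size = 10
max_client_conn = 2000
"""

print(worst_case_backends(SAMPLE_INI, pools=3))  # 3 * (20 + 10) = 90
</code></pre></div>
<p>For the three-pool example this gives 90, which still fits inside the <code>max_connections</code> of 110 derived earlier but leaves only 20 connections for admin work during a spike.</p>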
<p><strong>Application-side pool settings</strong>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># Example: Spring Boot HikariCP</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.url</span><span class="p">:</span><span class="w"> </span><span class="l">jdbc:postgresql://pgbouncer.internal:6432/mydb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.maximum-pool-size</span><span class="p">:</span><span class="w"> </span><span class="m">30</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.minimum-idle</span><span class="p">:</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.connection-timeout</span><span class="p">:</span><span class="w"> </span><span class="m">30000</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.idle-timeout</span><span class="p">:</span><span class="w"> </span><span class="m">600000</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="nt">spring.datasource.hikari.max-lifetime</span><span class="p">:</span><span class="w"> </span><span class="m">1800000</span><span class="w">  </span><span class="c"># 30 min, shorter than pgBouncer server_lifetime (60 min)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c"># Example: SQLAlchemy (Python, not YAML)</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="l">engine = create_engine(</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="s2">&#34;postgresql://pgbouncer.internal:6432/mydb&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="l">pool_size=30,</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="l">max_overflow=5,</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="l">pool_pre_ping=True,       </span><span class="w"> </span><span class="c"># required: detects stale connections</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="l">pool_recycle=1800,        </span><span class="w"> </span><span class="c"># 30 min, kept below pgBouncer server_lifetime (60 min)</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="l">)</span></span></span></code></pre></div><p><strong>Aligning the application with pgBouncer</strong>:</p>
<ul>
<li>Keep the application's <code>max-lifetime</code> &lt; pgBouncer <code>server_lifetime</code>, so the application does not end up holding a connection that pgBouncer has already recycled</li>
<li><code>pool_pre_ping = True</code>: runs a lightweight liveness check (such as <code>SELECT 1</code>) before every checkout to detect stale connections; this is required under transaction pooling</li>
<li>Do <em>not</em> use prepared statements on the application side (unless pgBouncer 1.21+ is configured with <code>max_prepared_statements</code>)</li>
</ul>
<h2 id="故障演練--邊界-case">Failure drills / edge cases</h2>
<h3 id="case-1pool-exhaustiondefault_pool_size-用滿">Case 1: Pool exhaustion (default_pool_size fully used)</h3>
<p>Symptoms: the application log shows <code>ERROR: no more connections allowed</code> (pgBouncer raises this once <code>max_client_conn</code> is reached), the pgBouncer log reports <code>pool is full</code>, and <code>SHOW POOLS</code> on the pgBouncer admin console shows <code>cl_waiting &gt; 0</code>.</p>
<p>Debug:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Connect to the pgBouncer admin console
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="err">\</span><span class="k">c</span><span class="w"> </span><span class="n">pgbouncer</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">POOLS</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Check cl_active / cl_waiting / sv_active / sv_idle
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">SERVERS</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Check server connection state (active / idle / used)</span></span></span></code></pre></div><p>Fix:</p>
<ul>
<li>Short term: raise <code>default_pool_size</code> together with PostgreSQL <code>max_connections</code>, and add a reserve pool</li>
<li>Mid term: find <em>long-running queries</em> (check <code>query_start</code> in PostgreSQL <code>pg_stat_activity</code> and kill queries that run too long)</li>
<li>Long term: split databases, move reads to read replicas, or move OLAP queries to a data warehouse</li>
</ul>
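<p>The <code>SHOW POOLS</code> check above can be scripted. The following is a minimal sketch, assuming the rows have already been fetched from the admin console and turned into dictionaries; the helper name <code>pools_with_waiting_clients</code> and the sample rows are hypothetical, and only the column names come from <code>SHOW POOLS</code>.</p>

```python
from typing import Iterable


def pools_with_waiting_clients(rows: Iterable[dict]) -> list[str]:
    """Return "database/user" labels for pools whose cl_waiting is above zero."""
    flagged = []
    for row in rows:
        if int(row["cl_waiting"]) > 0:
            flagged.append(f'{row["database"]}/{row["user"]}')
    return flagged


# Hypothetical rows shaped like SHOW POOLS output (only the columns used here).
sample_rows = [
    {"database": "mydb", "user": "app", "cl_active": 40, "cl_waiting": 7, "sv_active": 20, "sv_idle": 0},
    {"database": "reports", "user": "bi", "cl_active": 3, "cl_waiting": 0, "sv_active": 3, "sv_idle": 2},
]

print(pools_with_waiting_clients(sample_rows))
```

<p>A pool that stays in this list across several polls is the one to inspect first in <code>pg_stat_activity</code>.</p>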
<h3 id="case-2transaction-pooling-下-session-state-漏洞">Case 2: Session state leaks under transaction pooling</h3>
<p>Symptoms: intermittent failures such as <code>prepared statement &quot;S_3&quot; does not exist</code> or <code>relation &quot;tmp_xxx&quot; does not exist</code>, and advisory locks that are never released.</p>
<p>Cause: the application uses prepared statements, temporary tables, or advisory locks, but the backend connection is released when the transaction commits. The next transaction gets a different backend, where that session state does not exist.</p>
<p>Fix:</p>
<ul>
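<li>Disable server-side prepared statements in the application driver (JDBC <code>prepareThreshold=0</code>; with SQLAlchemy this is a driver-level setting passed through <code>connect_args</code>, for example <code>prepare_threshold=None</code> for psycopg 3 or <code>statement_cache_size=0</code> for asyncpg)</li>
<li>Replace temporary tables with <a href="https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-UNLOGGED-TABLES">unlogged tables</a> plus cleanup</li>
<li>Replace session-level advisory locks with row-level locks or an application-level lock (Redis)</li>
<li>Or switch to session pooling, at the cost of connection multiplexing efficiency</li>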
</ul>
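<h3 id="case-3dns-based-failover-後-application-連到舊-master">Case 3: Application still reaches the old master after a DNS-based failover</h3>
<p>Symptoms: after PostgreSQL switches master, application writes <em>succeed or fail intermittently</em>, depending on which node the connection lands on.</p>
<p>Cause: pgBouncer sits between the application and PostgreSQL, so the application cannot tell that the backend has changed; pgBouncer itself also has to reload its config before it connects to the new master.</p>
<p>Fix:</p>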
<ul>
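<li>Use the pgBouncer <code>RECONNECT</code> admin command: it closes every backend connection as soon as it is released, so later queries open fresh connections</li>
<li>Let an HA tool such as Patroni / Stolon trigger the pgBouncer reconnect automatically</li>
<li>Enable <code>pool_pre_ping</code> on the application side so stale connections are evicted automatically</li>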
</ul>
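<h3 id="case-4server-lifetime-recycle-跟-in-flight-transaction-衝突">Case 4: Server lifetime recycle conflicts with in-flight transactions</h3>
<p>Symptoms: occasional <code>server closed the connection unexpectedly</code> errors that coincide with long-running transactions.</p>
<p>Cause: pgBouncer <code>server_lifetime = 3600</code> recycles backend connections. pgBouncer does not cut a connection while a transaction is running on it, but once the lifetime has passed the connection is still closed as soon as it is released.</p>
<p>Fix:</p>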
<ul>
<li>Confirm that no transaction runs <em>longer than 1 hour</em> (check <code>xact_start</code> in PostgreSQL <code>pg_stat_activity</code>)</li>
<li>Raise <code>server_lifetime</code> if necessary, accepting a higher risk of memory bloat</li>
<li>Enforce a transaction timeout on the application side</li>
</ul>
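<p>The <code>xact_start</code> check can be expressed as a small filter. This is a sketch under stated assumptions: the rows stand in for the result of a <code>pg_stat_activity</code> query, and the helper name <code>long_transactions</code> and the sample data are hypothetical.</p>

```python
from datetime import datetime, timedelta, timezone


def long_transactions(rows, now, max_age=timedelta(hours=1)):
    """Return pids of sessions whose transaction has been open longer than max_age."""
    return [
        row["pid"]
        for row in rows
        if row["xact_start"] is not None and now - row["xact_start"] > max_age
    ]


# Hypothetical snapshot; xact_start is None when no transaction is open.
now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
sample_rows = [
    {"pid": 101, "xact_start": now - timedelta(minutes=5)},
    {"pid": 202, "xact_start": now - timedelta(hours=2)},
    {"pid": 303, "xact_start": None},
]

print(long_transactions(sample_rows, now))
```

<p>Setting <code>max_age</code> equal to <code>server_lifetime</code> keeps the check aligned with the recycle window discussed above.</p>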
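<h3 id="case-5pgbouncer-自己-crash--oom">Case 5: pgBouncer itself crashes / runs out of memory</h3>
<p>Symptoms: every application loses its PostgreSQL connections at the same time.</p>
<p>Cause: pgBouncer is a single process (unless several processes share one port through <code>so_reuseport</code>, available since 1.12), so a memory leak, an OOM kill, or a deployment event takes down the whole connection layer.</p>
<p>Fix:</p>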
<ul>
<li>Run multiple pgBouncer instances behind a load balancer (HAProxy / Envoy) and point the application at the LB</li>
<li><code>so_reuseport = 1</code> (1.12+) lets multiple pgBouncer processes share one port</li>
<li>Resource limits and alerts: RSS &gt; N, connection count &gt; M</li>
<li>HA mode: active-passive with keepalived</li>
</ul>
<h2 id="容量--cost-規劃">Capacity / cost planning</h2>
<p><strong>Capacity ceiling of a single pgBouncer</strong>:</p>
<ul>
<li><code>max_client_conn</code>: in practice &lt; 5000 per instance (beyond that, CPU and file descriptors get tight)</li>
<li><code>default_pool_size × number of databases</code>: in practice &lt; 200 per instance</li>
<li>Single process, CPU bound: already a bottleneck around the 10K QPS level, so scale out horizontally</li>
</ul>
<p><strong>When to add a pgBouncer instance</strong>:</p>
<ul>
<li>Application connections exceed 3000 per pgBouncer instance</li>
<li>pgBouncer CPU usage &gt; 60% (baseline, not counting spikes)</li>
<li>Cross-region applications need a region-local pgBouncer</li>
</ul>
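<p>The three triggers above can be encoded as a checklist function. This is only a sketch: the thresholds are the rule-of-thumb figures from the list, and the function name <code>scale_out_reasons</code> is hypothetical.</p>

```python
def scale_out_reasons(client_conns, baseline_cpu_pct, needs_region_local):
    """Return the reasons (possibly none) to add another pgBouncer instance."""
    reasons = []
    if client_conns > 3000:
        reasons.append("client connections above 3000 on one instance")
    if baseline_cpu_pct > 60:
        reasons.append("baseline CPU above 60%")
    if needs_region_local:
        reasons.append("cross-region application needs a region-local pgBouncer")
    return reasons


print(scale_out_reasons(client_conns=3400, baseline_cpu_pct=45, needs_region_local=False))
```

<p>Returning the reasons rather than a boolean keeps the capacity review traceable when the thresholds are tuned later.</p>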
<p><strong>When to change the architecture (pgBouncer is no longer enough)</strong>:</p>
<ul>
<li>PostgreSQL backend connections exceed 500 (more than pgBouncer alone can absorb) → move to read replicas / partitioning / sharding</li>
<li>Write volume is too high (50K+ TPS) → move to sharding (<a href="https://www.citusdata.com">Citus</a> for PostgreSQL; <a href="https://vitess.io">Vitess</a> is the comparable layer on the MySQL side) or to globally distributed SQL (<a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 Globally distributed OLTP</a>)</li>
<li>The application relies heavily on prepared statements / session state → move to <a href="https://github.com/postgresml/pgcat">PgCat</a> (written in Rust, with more complete session feature support) or go back to session pooling</li>
</ul>
<h2 id="整合--下一步">Integration / next steps</h2>
<p><strong>Integrating with HA failover</strong> (<a href="https://github.com/zalando/patroni">Patroni</a>):</p>
<ul>
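<li>After Patroni switches master, it triggers a pgBouncer <code>RECONNECT</code></li>
<li>The new master address comes from service discovery (Consul / etcd), typically rendered into the pgBouncer config or exposed through DNS, rather than being hard-coded in the config</li>
<li>The application does not need to be aware of the failover; its connections get a backend on the new master through pgBouncer</li>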
</ul>
<p><strong>Integrating with monitoring</strong>:</p>
<ul>
<li>Scrape the pgBouncer admin console outputs <code>SHOW STATS</code> / <code>SHOW POOLS</code> / <code>SHOW SERVERS</code> into Prometheus (<a href="https://github.com/jbub/pgbouncer_exporter">pgbouncer_exporter</a>)</li>
<li>Metrics to watch: <code>cl_waiting</code> (clients waiting for a backend), <code>sv_active</code> (active backends), <code>avg_query_time</code>, <code>avg_xact_time</code></li>
<li>Alerts: <code>cl_waiting &gt; 0 for 30s</code>, <code>server connection error rate &gt; 0</code></li>
</ul>
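<p>The "<code>cl_waiting &gt; 0</code> for 30s" alert is a sustained-condition check rather than a single-sample check. A minimal sketch of that logic, assuming samples arrive as time-ordered <code>(timestamp_seconds, cl_waiting)</code> pairs; the function name <code>waiting_alert</code> is hypothetical, and in practice the same rule would usually live in the alerting system itself:</p>

```python
def waiting_alert(samples, min_duration_s=30):
    """Return True if cl_waiting stayed above zero for at least min_duration_s.

    samples: list of (timestamp_seconds, cl_waiting), sorted by timestamp.
    """
    streak_start = None
    for ts, cl_waiting in samples:
        if cl_waiting > 0:
            if streak_start is None:
                streak_start = ts  # first sample of a new waiting streak
            if ts - streak_start >= min_duration_s:
                return True
        else:
            streak_start = None  # streak broken, start over
    return False


sustained = [(0, 1), (10, 2), (20, 1), (30, 3)]
blip = [(0, 1), (10, 0), (20, 1), (30, 1)]
print(waiting_alert(sustained), waiting_alert(blip))
```

<p>Resetting the streak on any zero sample is what keeps short queueing blips from paging anyone.</p>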
<p><strong>Integrating with application observability</strong>:</p>
<ul>
<li>DB spans in the application APM (<a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a> / Honeycomb / OpenTelemetry) show the <em>latency the application sees</em>, while pgBouncer metrics show the <em>pgBouncer ↔ PostgreSQL latency</em>; the gap between the two exposes connection wait time</li>
</ul>
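<p>The gap described above is a simple subtraction once the units agree. A rough sketch, assuming the application span is measured in milliseconds and that pgBouncer's <code>avg_query_time</code> is reported in microseconds; the function name and the sample numbers are hypothetical, and the result also contains network time between the application and pgBouncer, so it is an upper bound on pure queueing:</p>

```python
def estimated_wait_ms(app_db_span_ms, pgbouncer_avg_query_time_us):
    """Approximate time spent outside query execution, in milliseconds."""
    return app_db_span_ms - pgbouncer_avg_query_time_us / 1000.0


# Hypothetical readings: 12.5 ms seen by the application, 4000 us seen by pgBouncer.
print(estimated_wait_ms(app_db_span_ms=12.5, pgbouncer_avg_query_time_us=4000))
```

<p>A gap that grows while <code>avg_query_time</code> stays flat points at pool queueing rather than slow queries.</p>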
<p><strong>When to revisit this configuration</strong>:</p>
<ul>
<li>The number of applications doubles (triggers a pool sizing recalculation)</li>
<li>PostgreSQL upgrade (pgBouncer and PostgreSQL version compatibility)</li>
<li>Cross-region deployment (whether to run a region-local pgBouncer)</li>
<li>Switching to RDS Proxy / Aurora Cluster Endpoint (managed alternatives)</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a>: this article expands the "pgBouncer / PgCat configuration best practice" backlog item at the end of that page</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/connection-scaling/" data-link-title="PostgreSQL Connection Scaling：process-per-connection model 跟為什麼 pooler 是必裝" data-link-desc="PG 每個 client connection fork 一個 backend process（不是 thread）、RAM 成本 5-15MB/connection、context switch 跟 fork() cost 在 100&#43; connection 後線性放大、所以 pooler 不是 *optional optimization* 而是 *production prerequisite*。本文走 process-per-connection model 跟 MySQL thread-per-connection 對比、max_connections &#43; shared_buffers &#43; work_mem 三 GUC 互動、application-side pool vs middleware pool vs RDS Proxy 三層選擇、5 production 踩雷（connection storm / fork() cost 在 burst 流量 / shared_buffers 跟 connection 數壓縮 / double-pool 配置錯誤 / max_connections 設太大反而慢）、跟 PgBouncer config 互補不重複">Connection Scaling Deep Dive</a>: the process-per-connection model and why a pooler is mandatory (root cause vs configuration)</li>
<li><a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">1.1 High-concurrency data access</a>: the upstream topic, when a connection pool is needed</li>
<li><a href="/blog/backend/knowledge-cards/connection-pool/" data-link-title="Connection Pool" data-link-desc="說明連線池如何限制下游資源並影響服務容量">Connection Pool card</a>: the conceptual foundation</li>
<li><a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Vendor 深度技術文章方法論</a> — 本文是該方法論的 demo #1</li>
<li><a href="/blog/backend/09-performance-capacity/cases/ntt-docomo-lemino-japanese-streaming/" data-link-title="9.C29 NTT DOCOMO Lemino：3 個月達 500 萬 MAU 的串流後端" data-link-desc="Lemino 用 DynamoDB &#43; AWS Media Services 撐 30 channels live &#43; 5M MAU、工程工時下降 90%">9.C29 Lemino RDB connection limit case</a>: connection exhaustion was the main driver of the vendor switch in the streaming-surge scenario</li>
<li>Official: <a href="https://www.pgbouncer.org/usage.html">pgBouncer Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>Aurora PostgreSQL I/O-Optimized Cost</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/aurora-io-optimized-cost/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/aurora-io-optimized-cost/</guid><description>&lt;p>Aurora PostgreSQL I/O-Optimized cost 的核心責任是把 Aurora storage configuration 從定價選項轉成 workload 決策。AWS 官方文件將 Aurora cluster storage configuration 分成 Aurora Standard 與 Aurora I/O-Optimized；前者適合一般 I/O 分布，後者針對 I/O 密集 workload 提供不同成本結構。&lt;/p>
&lt;p>本文的判讀錨點是：I/O-Optimized 是成本與 workload profile 決策，而非效能保證。要看的是 read / write I/O charge、storage、instance、backup、replica、query pattern、maintenance 與未來成長。&lt;/p>
&lt;p>官方文件路由的核心責任是固定時間敏感 claim。實作前先查 &lt;a href="https://docs.aws.amazon.com/en_us/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.StorageReliability.html">Aurora storage configurations&lt;/a> 與 &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.storage-type.html">supported engines / regions&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="cost-model">Cost Model&lt;/h2>
&lt;p>Cost model 的核心責任是拆解 Aurora bill 的來源。Aurora 成本通常包含 instance、storage、I/O request、backup、replica、data transfer 與 support / operation。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>成本項&lt;/th>
 &lt;th>Standard 判讀&lt;/th>
 &lt;th>I/O-Optimized 判讀&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Instance&lt;/td>
 &lt;td>仍依 instance / capacity 計費&lt;/td>
 &lt;td>仍依 instance / capacity 計費&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Storage&lt;/td>
 &lt;td>依儲存使用量&lt;/td>
 &lt;td>依 I/O-Optimized storage 設定&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>I/O requests&lt;/td>
 &lt;td>I/O 成本可成為主要變動項&lt;/td>
 &lt;td>I/O charge 結構改變，適合高 I/O workload&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup / snapshot&lt;/td>
 &lt;td>依保留與使用量&lt;/td>
 &lt;td>仍需納入總成本&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data transfer&lt;/td>
 &lt;td>跨 AZ / region / service 需審查&lt;/td>
 &lt;td>仍需納入總成本&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>成本評估要用真實帳單和 CloudWatch 指標。只用平均 QPS 估算會漏掉 batch job、vacuum、index build、replica、backfill 與報表查詢帶來的 I/O 尖峰。&lt;/p>
&lt;h2 id="workload-signals">Workload Signals&lt;/h2>
&lt;p>Workload signals 的核心責任是找出 I/O 是否為主要成本與瓶頸。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>訊號&lt;/th>
 &lt;th>意義&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>I/O request 成本占比高&lt;/td>
 &lt;td>Standard 可能受 I/O charge 影響大&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Buffer cache hit ratio 低&lt;/td>
 &lt;td>工作集超過 memory 或 query 掃描過重&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>大量 random read / write&lt;/td>
 &lt;td>storage I/O 壓力明顯&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>ETL / backfill 經常跑&lt;/td>
 &lt;td>短期 I/O spike 可能影響帳單與 latency&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index / query 設計已優化&lt;/td>
 &lt;td>成本切換更能反映真實 workload&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>先做 query 與 index review。若 I/O 來自缺 index、全表掃描、過度 eager loading 或不必要 backfill，直接切 I/O-Optimized 只會把浪費制度化。&lt;/p>
&lt;h2 id="evaluation-process">Evaluation Process&lt;/h2>
&lt;p>Evaluation process 的核心責任是讓切換決策可回溯。&lt;/p>
&lt;ol>
&lt;li>收集 30 到 90 天成本：instance、storage、I/O、backup、transfer。&lt;/li>
&lt;li>收集 workload 指標：read/write IOPS、cache hit、slow query、top SQL。&lt;/li>
&lt;li>標記特殊事件：migration、backfill、incident、seasonality。&lt;/li>
&lt;li>建立 Standard vs I/O-Optimized 成本試算。&lt;/li>
&lt;li>在 staging / canary 確認 application behavior。&lt;/li>
&lt;li>設定切換後 7 / 14 / 30 天回顧點。&lt;/li>
&lt;/ol>
&lt;p>試算要包含季節性。月初結算、年度促銷、批次報表與資料重整都可能讓 I/O profile 和普通週不同。&lt;/p></description><content:encoded><![CDATA[<p>Aurora PostgreSQL I/O-Optimized cost 的核心責任是把 Aurora storage configuration 從定價選項轉成 workload 決策。AWS 官方文件將 Aurora cluster storage configuration 分成 Aurora Standard 與 Aurora I/O-Optimized；前者適合一般 I/O 分布，後者針對 I/O 密集 workload 提供不同成本結構。</p>
<p>The reading anchor of this article: I/O-Optimized is a cost and workload-profile decision, not a performance guarantee. What matters is read / write I/O charges, storage, instance, backup, replicas, query pattern, maintenance, and future growth.</p>
<p>The official documentation links exist to pin down time-sensitive claims. Before implementing, check <a href="https://docs.aws.amazon.com/en_us/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.StorageReliability.html">Aurora storage configurations</a> and <a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.storage-type.html">supported engines / regions</a>; this article was last checked on 2026-05-22.</p>
<h2 id="cost-model">Cost Model</h2>
<p>The cost model breaks down where the Aurora bill comes from. Aurora cost usually includes instance, storage, I/O requests, backup, replicas, data transfer, and support / operations.</p>
<table>
  <thead>
      <tr>
          <th>Cost item</th>
          <th>Standard reading</th>
          <th>I/O-Optimized reading</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance</td>
          <td>Still billed by instance / capacity</td>
          <td>Still billed by instance / capacity</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>Billed by storage used</td>
          <td>Billed under the I/O-Optimized storage configuration</td>
      </tr>
      <tr>
          <td>I/O requests</td>
          <td>I/O cost can become the main variable item</td>
          <td>The I/O charge structure changes; suits high-I/O workloads</td>
      </tr>
      <tr>
          <td>Backup / snapshot</td>
          <td>Billed by retention and usage</td>
          <td>Still counts toward total cost</td>
      </tr>
      <tr>
          <td>Data transfer</td>
          <td>Cross-AZ / region / service traffic needs review</td>
          <td>Still counts toward total cost</td>
      </tr>
  </tbody>
</table>
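<p>Base the cost evaluation on real bills and CloudWatch metrics. Estimating from average QPS alone misses the I/O spikes caused by batch jobs, vacuum, index builds, replicas, backfills, and reporting queries.</p>
<h2 id="workload-signals">Workload Signals</h2>
<p>Workload signals determine whether I/O is the main cost and the main bottleneck.</p>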
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Meaning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>High share of I/O request cost</td>
          <td>Standard is likely to be strongly affected by I/O charges</td>
      </tr>
      <tr>
          <td>Low buffer cache hit ratio</td>
          <td>The working set exceeds memory, or queries scan too much</td>
      </tr>
      <tr>
          <td>Heavy random read / write</td>
          <td>Clear storage I/O pressure</td>
      </tr>
      <tr>
          <td>ETL / backfill runs frequently</td>
          <td>Short-term I/O spikes can affect both the bill and latency</td>
      </tr>
      <tr>
          <td>Index / query design already optimized</td>
          <td>A cost switch then reflects the real workload more accurately</td>
      </tr>
  </tbody>
</table>
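<p>Do the query and index review first. If the I/O comes from missing indexes, full table scans, excessive eager loading, or unnecessary backfills, switching straight to I/O-Optimized only institutionalizes the waste.</p>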
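<p>The buffer cache hit ratio signal from the table can be computed from two PostgreSQL counters. A minimal sketch, assuming <code>blks_hit</code> and <code>blks_read</code> have been read from <code>pg_stat_database</code>; the function name and sample counters are hypothetical, and what counts as a "low" ratio is workload-specific:</p>

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Return blks_hit / (blks_hit + blks_read), or None when there is no traffic."""
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


# Hypothetical counters: 9.5M buffer hits and 0.5M blocks read from storage.
print(cache_hit_ratio(blks_hit=9_500_000, blks_read=500_000))
```

<p>Comparing the ratio before and after the query / index review shows how much of the I/O was avoidable.</p>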
<h2 id="evaluation-process">Evaluation Process</h2>
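<p>The evaluation process makes the switch decision traceable.</p>
<ol>
<li>Collect 30 to 90 days of cost: instance, storage, I/O, backup, transfer.</li>
<li>Collect workload metrics: read/write IOPS, cache hit, slow queries, top SQL.</li>
<li>Mark special events: migration, backfill, incident, seasonality.</li>
<li>Build a Standard vs I/O-Optimized cost estimate.</li>
<li>Confirm application behavior in staging / canary.</li>
<li>Set review points at 7 / 14 / 30 days after the switch.</li>
</ol>
<p>The estimate must include seasonality. Month-start settlement, annual promotions, batch reports, and data reorganization can all make the I/O profile differ from an ordinary week.</p>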
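<p>Step 4 and the seasonality note can be sketched as a per-month comparison. Everything numeric here is a placeholder: the monthly spend figures are invented, the two uplift constants are not real prices, and the model assumes I/O-Optimized removes the per-request I/O charge while raising instance and storage rates, which should be verified against the current AWS pricing page.</p>

```python
def compare_month(instance, storage, io, other, instance_uplift, storage_uplift):
    """Return (standard_total, io_optimized_total) for one month of spend."""
    standard = instance + storage + io + other
    # Assumption: no per-request I/O charge, higher instance and storage rates.
    io_optimized = instance * (1 + instance_uplift) + storage * (1 + storage_uplift) + other
    return round(standard, 2), round(io_optimized, 2)


# Hypothetical monthly spend in USD: (instance, storage, io, other).
months = {
    "ordinary month": (1000.0, 200.0, 300.0, 100.0),
    "month-end close": (1000.0, 200.0, 900.0, 100.0),
}
# Placeholder uplifts, not real prices: look up the current rates before use.
INSTANCE_UPLIFT = 0.30
STORAGE_UPLIFT = 1.25

for name, (instance, storage, io, other) in months.items():
    standard, io_optimized = compare_month(
        instance, storage, io, other, INSTANCE_UPLIFT, STORAGE_UPLIFT
    )
    cheaper = "I/O-Optimized" if standard > io_optimized else "Standard"
    print(f"{name}: standard={standard} io_optimized={io_optimized} cheaper={cheaper}")
```

<p>Running the comparison month by month, instead of on an average, is what surfaces the cases where the answer flips with the season.</p>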
<h2 id="migration-and-rollback">Migration and Rollback</h2>
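<p>Migration and rollback planning puts the storage configuration change into the change process. Aurora storage configuration is a cluster-level decision, so first confirm supported regions, engine version, switching restrictions, maintenance window, and rollback conditions.</p>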
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-check</td>
          <td>engine version, region support, current bill</td>
      </tr>
      <tr>
          <td>Cost baseline</td>
          <td>recent cost and I/O metrics</td>
      </tr>
      <tr>
          <td>Change window</td>
          <td>application traffic, maintenance</td>
      </tr>
      <tr>
          <td>Post-check</td>
          <td>latency, I/O, error, bill trend</td>
      </tr>
      <tr>
          <td>Review</td>
          <td>cost and performance at 7 / 14 / 30 days</td>
      </tr>
  </tbody>
</table>
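<p>Rollback conditions must be explicit. If the cost reduction after switching misses the target, latency does not improve, or the workload profile changes, re-evaluate Standard and query optimization.</p>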
<h2 id="anti-patterns">Anti-Patterns</h2>
<p>The anti-pattern list keeps a billing option from being treated as performance tuning.</p>
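<table>
  <thead>
      <tr>
          <th>Anti-pattern</th>
          <th>Risk</th>
          <th>Fix direction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Switching without looking at top SQL</td>
          <td>Folds the cost of bad queries into the new plan</td>
          <td>Do the query / index review first</td>
      </tr>
      <tr>
          <td>Extrapolating a full year from one day's bill</td>
          <td>Ignores seasonality</td>
          <td>Look at one full business cycle at minimum</td>
      </tr>
      <tr>
          <td>Ignoring backup / transfer</td>
          <td>Distorts the total cost estimate</td>
          <td>Compare all bill components together</td>
      </tr>
      <tr>
          <td>No review after switching</td>
          <td>Cost drift has no owner</td>
          <td>Set 7 / 14 / 30 day tripwires</td>
      </tr>
  </tbody>
</table>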
<p>The value of I/O-Optimized comes from aligning the cost structure with the workload. It should be a joint decision between FinOps and database operations.</p>
<h2 id="下一步路由">Next steps</h2>
<p>After Aurora I/O-Optimized cost, read <a href="../migrate-to-aurora/">PostgreSQL to Aurora Migration</a> for Aurora migration, <a href="../query-optimization/">Query Optimization</a> for query cost, and <a href="/blog/backend/09-performance-capacity/bottleneck-localization/" data-link-title="9.5 瓶頸定位流程" data-link-desc="從 app 到 DB / cache / broker / 第三方 quota 的逐層瓶頸定位">Bottleneck Localization</a> for capacity and bottleneck analysis.</p>
]]></content:encoded></item><item><title>Managed PostgreSQL Comparison</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/managed-pg-comparison/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/managed-pg-comparison/</guid><description>&lt;p>Managed PostgreSQL comparison 的核心責任是把「都是 PostgreSQL」拆成不同的操作責任邊界。Managed service 可能代管 backup、patch、replica、minor upgrade、monitoring、connection proxy、serverless scaling 或 branch workflow；但 application schema、query、migration、role、cost 與 incident decision 仍需要 team 承擔。&lt;/p>
&lt;p>本文的判讀錨點是：managed PostgreSQL 是 operation trade-off，而非 vendor-neutral checkbox。選型要看 workload、合規、extension、HA / DR、connection、cost visibility、exit route 與 team skill。&lt;/p>
&lt;p>官方文件路由的核心責任是固定 provider claim。實作前分別查 &lt;a href="https://docs.cloud.google.com/alloydb/docs">AlloyDB docs&lt;/a>、&lt;a href="https://cloud.google.com/sql/postgresql">Cloud SQL for PostgreSQL&lt;/a>、&lt;a href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/overview">Azure Database for PostgreSQL Flexible Server&lt;/a> 與 &lt;a href="https://supabase.com/docs/guides/deployment/branching">Supabase branching docs&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="provider-boundary">Provider Boundary&lt;/h2>
&lt;p>Provider boundary 的核心責任是定義 vendor 接手哪些資料庫操作。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>類型&lt;/th>
 &lt;th>代表選項&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Cloud managed PostgreSQL&lt;/td>
 &lt;td>RDS PostgreSQL、Cloud SQL、Azure PG&lt;/td>
 &lt;td>標準 PostgreSQL、雲平台整合&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Aurora PostgreSQL-compatible&lt;/td>
 &lt;td>Amazon Aurora PostgreSQL&lt;/td>
 &lt;td>AWS 生態、高可用 storage layer、read scaling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Serverless / branching PG&lt;/td>
 &lt;td>Neon、Supabase 部分能力&lt;/td>
 &lt;td>dev preview、稀疏 workload、快速分支&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Specialist managed PG&lt;/td>
 &lt;td>Crunchy Bridge 等&lt;/td>
 &lt;td>PostgreSQL 專業支援、extension 需求&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Self-managed&lt;/td>
 &lt;td>VM / K8s 上自管&lt;/td>
 &lt;td>需要完整控制、具備 DBA 能力&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Provider boundary 要寫成 responsibility matrix。誰負責 backup restore、major upgrade、extension enable、failover、connection proxy、audit export、encryption key、support ticket 與 incident decision。&lt;/p>
&lt;p>Serverless / branching PG 這一列的 Neon 與 Supabase 不在同一個 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/capability-outsourcing-depth/" data-link-title="Capability Outsourcing Depth（外包深度）" data-link-desc="說明外包一塊後端能力有三種深度（managed 基礎設施、feature SaaS、BaaS bundle）、深度決定保留多少控制權與遷出代價">外包深度&lt;/a>。Neon 是純 serverless PostgreSQL（managed 基礎設施）；Supabase 是把 Postgres 當其中一塊的 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/baas/" data-link-title="BaaS（Backend as a Service）" data-link-desc="說明把認證、資料庫、檔案儲存、推播打包成現成模組、由前端 SDK 直連的後端交付形態">BaaS bundle&lt;/a>（同時含 Auth、Storage、Realtime）。只需要資料庫、兩者皆可比較且 Neon 更輕；要連認證、儲存一起到位、才是 Supabase 的賣點。這個外包深度差異與「該買整個 bundle 還是只用它的 Postgres」的判讀、見 &lt;a href="https://tarrragon.github.io/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22 能力級買 vs 建&lt;/a>。&lt;/p>
&lt;h2 id="evaluation-dimensions">Evaluation Dimensions&lt;/h2>
&lt;p>Evaluation dimensions 的核心責任是讓比較避免只看價格或品牌。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PostgreSQL fidelity&lt;/td>
 &lt;td>engine version、extension、parameter、superuser 限制&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>HA / DR&lt;/td>
 &lt;td>AZ failover、cross-region replica、PITR、restore drill&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection&lt;/td>
 &lt;td>max connection、pooler、proxy、serverless cold start&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Migration&lt;/td>
 &lt;td>import/export、logical replication、downtime window&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Observability&lt;/td>
 &lt;td>logs、metrics、slow query、audit、SIEM export&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Security&lt;/td>
 &lt;td>network、IAM、KMS、TLS、RLS / pgAudit support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cost&lt;/td>
 &lt;td>instance、storage、I/O、backup、egress、support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Exit&lt;/td>
 &lt;td>dump、logical replication、snapshot portability&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>PostgreSQL fidelity 是第一關。若服務依賴 extension、logical decoding、superuser function、custom parameter 或 filesystem access，managed provider 的限制會直接影響可行性。&lt;/p></description><content:encoded><![CDATA[<p>Managed PostgreSQL comparison 的核心責任是把「都是 PostgreSQL」拆成不同的操作責任邊界。Managed service 可能代管 backup、patch、replica、minor upgrade、monitoring、connection proxy、serverless scaling 或 branch workflow；但 application schema、query、migration、role、cost 與 incident decision 仍需要 team 承擔。</p>
<p>The reading anchor of this article: managed PostgreSQL is an operational trade-off, not a vendor-neutral checkbox. Selection depends on workload, compliance, extensions, HA / DR, connections, cost visibility, exit route, and team skill.</p>
<p>The official documentation links exist to pin down provider claims. Before implementing, check each of <a href="https://docs.cloud.google.com/alloydb/docs">AlloyDB docs</a>, <a href="https://cloud.google.com/sql/postgresql">Cloud SQL for PostgreSQL</a>, <a href="https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/overview">Azure Database for PostgreSQL Flexible Server</a>, and <a href="https://supabase.com/docs/guides/deployment/branching">Supabase branching docs</a>; this article was last checked on 2026-05-22.</p>
<h2 id="provider-boundary">Provider Boundary</h2>
<p>The provider boundary defines which database operations the vendor takes over.</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Representative options</th>
          <th>Suitable scenarios</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cloud managed PostgreSQL</td>
          <td>RDS PostgreSQL, Cloud SQL, Azure PG</td>
          <td>Standard PostgreSQL, cloud platform integration</td>
      </tr>
      <tr>
          <td>Aurora PostgreSQL-compatible</td>
          <td>Amazon Aurora PostgreSQL</td>
          <td>AWS ecosystem, highly available storage layer, read scaling</td>
      </tr>
      <tr>
          <td>Serverless / branching PG</td>
          <td>Neon, some Supabase capabilities</td>
          <td>Dev previews, sparse workloads, fast branching</td>
      </tr>
      <tr>
          <td>Specialist managed PG</td>
          <td>Crunchy Bridge and similar</td>
          <td>Expert PostgreSQL support, extension requirements</td>
      </tr>
      <tr>
          <td>Self-managed</td>
          <td>Self-run on VMs / K8s</td>
          <td>Full control required, DBA skills in-house</td>
      </tr>
  </tbody>
</table>
<p>Write the provider boundary as a responsibility matrix: who owns backup restore, major upgrades, extension enablement, failover, connection proxy, audit export, encryption keys, support tickets, and incident decisions.</p>
<p>Neon and Supabase in the Serverless / branching PG row do not sit at the same <a href="/blog/backend/knowledge-cards/capability-outsourcing-depth/" data-link-title="Capability Outsourcing Depth（外包深度）" data-link-desc="說明外包一塊後端能力有三種深度（managed 基礎設施、feature SaaS、BaaS bundle）、深度決定保留多少控制權與遷出代價">outsourcing depth</a>. Neon is pure serverless PostgreSQL (managed infrastructure); Supabase is a <a href="/blog/backend/knowledge-cards/baas/" data-link-title="BaaS（Backend as a Service）" data-link-desc="說明把認證、資料庫、檔案儲存、推播打包成現成模組、由前端 SDK 直連的後端交付形態">BaaS bundle</a> in which Postgres is one component (alongside Auth, Storage, and Realtime). If only a database is needed, the two are comparable and Neon is the lighter option; Supabase's selling point is getting authentication and storage in place together. For this difference in outsourcing depth, and for the question of whether to buy the whole bundle or use only its Postgres, see <a href="/blog/backend/00-service-selection/capability-buy-vs-build/" data-link-title="0.22 能力級買 vs 建：feature-as-a-service 與 BaaS bundle 選型" data-link-desc="在交付形態決定整個系統要不要自建之後、逐能力判斷該外包還是自建：辨識 managed 基礎設施、feature SaaS 與 BaaS bundle 三種外包深度、no-code 到 dev-tool 的服務光譜、買 vs 建判準與權重浮動、整合接縫與遷出代價">0.22 Capability-level buy vs build</a>.</p>
<h2 id="evaluation-dimensions">Evaluation Dimensions</h2>
<p>The evaluation dimensions keep the comparison from looking only at price or brand.</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Review questions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PostgreSQL fidelity</td>
          <td>engine version, extensions, parameters, superuser restrictions</td>
      </tr>
      <tr>
          <td>HA / DR</td>
          <td>AZ failover, cross-region replica, PITR, restore drill</td>
      </tr>
      <tr>
          <td>Connection</td>
          <td>max connections, pooler, proxy, serverless cold start</td>
      </tr>
      <tr>
          <td>Migration</td>
          <td>import/export, logical replication, downtime window</td>
      </tr>
      <tr>
          <td>Observability</td>
          <td>logs, metrics, slow query, audit, SIEM export</td>
      </tr>
      <tr>
          <td>Security</td>
          <td>network, IAM, KMS, TLS, RLS / pgAudit support</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>instance, storage, I/O, backup, egress, support</td>
      </tr>
      <tr>
          <td>Exit</td>
          <td>dump, logical replication, snapshot portability</td>
      </tr>
  </tbody>
</table>
<p>PostgreSQL fidelity is the first gate. If the service depends on extensions, logical decoding, superuser functions, custom parameters, or filesystem access, the managed provider's restrictions directly affect feasibility.</p>
<h2 id="workload-fit">Workload Fit</h2>
<p>Workload fit aligns provider capabilities with product requirements.</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>Priorities</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SaaS OLTP</td>
          <td>HA, PITR, connection pool, online migration</td>
      </tr>
      <tr>
          <td>Analytics-heavy OLTP</td>
          <td>read replica, I/O cost, work_mem, warehouse boundary</td>
      </tr>
      <tr>
          <td>Dev / preview env</td>
          <td>branching, fast restore, low idle cost</td>
      </tr>
      <tr>
          <td>Regulated workload</td>
          <td>audit, KMS, network isolation, retention</td>
      </tr>
      <tr>
          <td>Extension-heavy app</td>
          <td>PostGIS, pgvector, TimescaleDB, logical decoding support</td>
      </tr>
  </tbody>
</table>
<p>Serverless / branching PG suits previews and sparse workloads, but sustained high-throughput production requires a review of cold start, connections, storage-separation latency, and the cost curve.</p>
<p>Aurora PostgreSQL suits AWS-heavy architectures and a highly available storage layer, but PostgreSQL compatibility, parameter restrictions, I/O cost, and migration / exit all need review.</p>
<h2 id="migration-and-exit">Migration and Exit</h2>
<p>Migration and exit planning keeps a managed service from becoming a one-way door. Before adopting one, know how to get in and how to get out.</p>
<table>
  <thead>
      <tr>
          <th>Flow</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Import</td>
          <td>dump / restore, logical replication, DMS</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>freeze window, replica catch-up, validation</td>
      </tr>
      <tr>
          <td>Rollback</td>
          <td>source snapshot, write replay, DNS switch</td>
      </tr>
      <tr>
          <td>Exit</td>
          <td>pg_dump, logical replication, snapshot export</td>
      </tr>
      <tr>
          <td>Rehearsal</td>
          <td>staging restore, row count, checksum</td>
      </tr>
  </tbody>
</table>
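<p>The exit route has to be more concrete than a verbal promise. At minimum, be able to export the data in staging to vanilla PostgreSQL or to the next managed provider, and run an application smoke test against it.</p>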
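<p>The rehearsal evidence in the table (row count, checksum) can be checked with two small helpers. This is a sketch under stated assumptions: the counts and rows are assumed to have been fetched separately from the source and the target, and the function names and sample data are hypothetical.</p>

```python
import hashlib


def row_count_mismatches(source_counts, target_counts):
    """Return {table: (source_count, target_count)} for tables that differ."""
    tables = sorted(set(source_counts) | set(target_counts))
    return {
        table: (source_counts.get(table), target_counts.get(table))
        for table in tables
        if source_counts.get(table) != target_counts.get(table)
    }


def table_checksum(rows):
    """Order-independent SHA-256 over the rows of one table."""
    digest = hashlib.sha256()
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical per-table row counts after a staging restore.
source = {"users": 1200, "orders": 5400, "audit_log": 98000}
target = {"users": 1200, "orders": 5398}

print(row_count_mismatches(source, target))
print(table_checksum([(1, "a"), (2, "b")]) == table_checksum([(2, "b"), (1, "a")]))
```

<p>A table missing on one side shows up with <code>None</code> as its count, which separates "not migrated" from "partially migrated".</p>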
<h2 id="cost-review">Cost Review</h2>
<p>The cost review turns managed convenience into total cost. Total cost includes instance, storage, I/O, backup, replicas, egress, support, observability, operations labor, and incident cost.</p>
<table>
  <thead>
      <tr>
          <th>Cost driver</th>
          <th>Common misjudgment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>I/O</td>
          <td>Looking only at the instance price</td>
      </tr>
      <tr>
          <td>Backup retention</td>
          <td>Long retention is overlooked</td>
      </tr>
      <tr>
          <td>Cross-region replica</td>
          <td>Data transfer / storage grows</td>
      </tr>
      <tr>
          <td>Observability export</td>
          <td>Log volume and SIEM cost</td>
      </tr>
      <tr>
          <td>Serverless idle</td>
          <td>Idle cost is low, but a sustained workload costs differently</td>
      </tr>
  </tbody>
</table>
<p>The cost review needs tripwires. When the I/O cost share rises, backup retention gets longer, replicas are added, or a serverless workload becomes always-on, re-evaluate the plan.</p>
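<p>The four tripwires can be written as a comparison between two billing snapshots. A minimal sketch: the snapshot fields, the function name <code>cost_tripwires</code>, and the 0.8 "always-on" threshold are all hypothetical choices made for illustration.</p>

```python
def cost_tripwires(previous, current, always_on_ratio=0.8):
    """Return the tripwires that fired between two billing snapshots."""
    fired = []
    previous_io_share = previous["io_cost"] / previous["total_cost"]
    current_io_share = current["io_cost"] / current["total_cost"]
    if current_io_share > previous_io_share:
        fired.append("I/O cost share increased")
    if current["backup_retention_days"] > previous["backup_retention_days"]:
        fired.append("backup retention got longer")
    if current["replicas"] > previous["replicas"]:
        fired.append("replica count increased")
    # Share of hours in the period during which the serverless instance was active.
    if current["serverless_active_ratio"] >= always_on_ratio:
        fired.append("serverless workload is effectively always on")
    return fired


# Hypothetical snapshots for two consecutive months.
last_month = {"io_cost": 200.0, "total_cost": 1000.0, "backup_retention_days": 7,
              "replicas": 1, "serverless_active_ratio": 0.30}
this_month = {"io_cost": 400.0, "total_cost": 1200.0, "backup_retention_days": 7,
              "replicas": 2, "serverless_active_ratio": 0.35}

print(cost_tripwires(last_month, this_month))
```

<p>An empty result means no re-evaluation is due; any entry names the cost driver to look at first.</p>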
<h2 id="decision-route">Decision Route</h2>
<p>The core responsibility of the decision route is to turn provider selection into a concrete route.</p>
<table>
  <thead>
      <tr>
          <th>Requirement</th>
          <th>Preferred route</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard cloud-platform PostgreSQL</td>
          <td>RDS / Cloud SQL / Azure PG</td>
      </tr>
      <tr>
          <td>AWS ecosystem + HA storage layer</td>
          <td>Aurora PostgreSQL</td>
      </tr>
      <tr>
          <td>Preview branch / dev env</td>
          <td>Neon / Supabase branch workflow</td>
      </tr>
      <tr>
          <td>Extension / PG specialist support</td>
          <td>specialist managed PG</td>
      </tr>
      <tr>
          <td>Full control and special extensions</td>
          <td>self-managed PostgreSQL</td>
      </tr>
  </tbody>
</table>
<p>The final choice of managed provider comes back to team skill. Maintaining fewer components is the value; limits that are outsourced to a vendor without being understood come back during incidents and migrations.</p>
<h2 id="下一步路由">Next Steps</h2>
<p>After the managed PostgreSQL comparison, read <a href="../migrate-to-aurora/">PostgreSQL to Aurora Migration</a> for Aurora migration, <a href="../migrate-to-aurora-dsql/">PostgreSQL to Aurora DSQL</a> for Aurora DSQL, and <a href="../specialized-pg-variants/">Specialized PostgreSQL Variants</a> for serverless / specialized variants.</p>
]]></content:encoded></item><item><title>PostgreSQL Connection Pooler Comparison</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-pooler-comparison/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/connection-pooler-comparison/</guid><description>&lt;p>The core responsibility of a PostgreSQL connection pooler comparison is to read connection pressure, transaction semantics, and operational responsibility separately. PostgreSQL backend processes are expensive, so once application instances scale out, the connection pooler often becomes the first layer of capacity control protecting the database.&lt;/p>
&lt;p>The interpretive anchor of this article: a pooler addresses connection fan-out and queueing; it does not make the queries themselves faster. Slow queries, lock waits, long transactions, and wrong indexes still go back to &lt;a href="../query-optimization/">Query Optimization&lt;/a> and &lt;a href="../mvcc-lock-model/">MVCC / lock model&lt;/a>.&lt;/p>
&lt;h2 id="pooling-models">Pooling Models&lt;/h2>
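&lt;p>The core responsibility of the pooling model is to decide how long a client connection stays bound to a server connection. PgBouncer represents the most common PostgreSQL pooler model; its official documentation divides pool modes into session, transaction, and statement.&lt;/p>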
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Mode&lt;/th>
 &lt;th>Server connection binding&lt;/th>
 &lt;th>Good fit&lt;/th>
 &lt;th>Main risk&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Session&lt;/td>
 &lt;td>Entire client session&lt;/td>
 &lt;td>Uses session state, temp tables&lt;/td>
 &lt;td>Low compression ratio&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Transaction&lt;/td>
 &lt;td>Duration of one transaction&lt;/td>
 &lt;td>Web APIs, short transactions, stateless queries&lt;/td>
 &lt;td>Limited semantics for session variables and prepared statements&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Statement&lt;/td>
 &lt;td>Single statement&lt;/td>
 &lt;td>Special read-only workloads&lt;/td>
 &lt;td>Transaction workflows are restricted&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>App pool&lt;/td>
 &lt;td>Inside the application process&lt;/td>
 &lt;td>Single service, low fan-out&lt;/td>
 &lt;td>Total connections get out of control as instances multiply&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/transaction-pooling/" data-link-title="Transaction Pooling" data-link-desc="說明 connection pooler 的 transaction 綁定模式如何壓縮連線並改變 session 語意">Transaction pooling&lt;/a> is valuable because it collapses many idle client connections into a few active server connections. It requires the application to move session state back inside the request / transaction boundary; timezone, role, search_path, prepared statements, and advisory locks, for example, all need explicit management.&lt;/p>
&lt;p>Session pooling is valuable for compatibility. If the application relies heavily on temp tables, LISTEN / NOTIFY, session-level settings, or server-side prepared statements, session pooling reduces behavioral differences, at the cost of weaker connection compression.&lt;/p>
&lt;h2 id="product-boundary">Product Boundary&lt;/h2>
&lt;p>The core responsibility of the product boundary is to put the pooler in the right operational position. Responsibility boundaries differ widely between options.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Option&lt;/th>
 &lt;th>Main responsibility&lt;/th>
 &lt;th>Good fit&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PgBouncer&lt;/td>
 &lt;td>Lightweight PostgreSQL connection pooling&lt;/td>
 &lt;td>Self-managed VM / K8s; the standard route for transaction pooling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Odyssey&lt;/td>
 &lt;td>Multi-tenant pooler with complex routing&lt;/td>
 &lt;td>Large deployments that need advanced routing / auth&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>RDS Proxy&lt;/td>
 &lt;td>AWS managed connection proxy&lt;/td>
 &lt;td>RDS / Aurora ecosystem; teams that want less proxy operations work&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application pool&lt;/td>
 &lt;td>In-service connection pool&lt;/td>
 &lt;td>Few instances, total connections under control&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>No pooler&lt;/td>
 &lt;td>Connects directly to PostgreSQL&lt;/td>
 &lt;td>Small services, low concurrency, connection count far below the limit&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
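&lt;p>PgBouncer's operational focus is mode, pool size, server reset query, auth, TLS, and metrics. It fits well between the application and the database, where it takes on connection queueing and backpressure.&lt;/p></description><content:encoded><![CDATA[<p>The core responsibility of a PostgreSQL connection pooler comparison is to read connection pressure, transaction semantics, and operational responsibility separately. PostgreSQL backend processes are expensive, so once application instances scale out, the connection pooler often becomes the first layer of capacity control protecting the database.</p>
<p>The interpretive anchor of this article: a pooler addresses connection fan-out and queueing; it does not make the queries themselves faster. Slow queries, lock waits, long transactions, and wrong indexes still go back to <a href="../query-optimization/">Query Optimization</a> and <a href="../mvcc-lock-model/">MVCC / lock model</a>.</p>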
<h2 id="pooling-models">Pooling Models</h2>
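<p>The core responsibility of the pooling model is to decide how long a client connection stays bound to a server connection. PgBouncer represents the most common PostgreSQL pooler model; its official documentation divides pool modes into session, transaction, and statement.</p>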
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Server connection binding</th>
          <th>Good fit</th>
          <th>Main risk</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Session</td>
          <td>Entire client session</td>
          <td>Uses session state, temp tables</td>
          <td>Low compression ratio</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>Duration of one transaction</td>
          <td>Web APIs, short transactions, stateless queries</td>
          <td>Limited semantics for session variables and prepared statements</td>
      </tr>
      <tr>
          <td>Statement</td>
          <td>Single statement</td>
          <td>Special read-only workloads</td>
          <td>Transaction workflows are restricted</td>
      </tr>
      <tr>
          <td>App pool</td>
          <td>Inside the application process</td>
          <td>Single service, low fan-out</td>
          <td>Total connections get out of control as instances multiply</td>
      </tr>
  </tbody>
</table>
<p><a href="/blog/backend/knowledge-cards/transaction-pooling/" data-link-title="Transaction Pooling" data-link-desc="說明 connection pooler 的 transaction 綁定模式如何壓縮連線並改變 session 語意">Transaction pooling</a> is valuable because it collapses many idle client connections into a few active server connections. It requires the application to move session state back inside the request / transaction boundary; timezone, role, search_path, prepared statements, and advisory locks, for example, all need explicit management.</p>
<p>Session pooling is valuable for compatibility. If the application relies heavily on temp tables, LISTEN / NOTIFY, session-level settings, or server-side prepared statements, session pooling reduces behavioral differences, at the cost of weaker connection compression.</p>
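<p>Why session state breaks under transaction pooling can be shown with a toy model. The sketch below does not talk to PostgreSQL or PgBouncer; <code>ServerConnection</code> and <code>TransactionPooler</code> are hypothetical classes, round-robin stands in for "whichever backend is free", and only a literal <code>SET search_path = ...</code> statement is interpreted.</p>

```python
class ServerConnection:
    """Stand-in for one PostgreSQL backend; it holds session-level state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.search_path = "public"  # simplified default for the sketch


class TransactionPooler:
    """Toy pooler: a server connection is bound only for one transaction."""

    def __init__(self, servers: list[ServerConnection]) -> None:
        self._servers = servers
        self._next = 0

    def run_transaction(self, statements: list[str]) -> str:
        # Round-robin assignment stands in for "any idle server connection".
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        for statement in statements:
            if statement.startswith("SET search_path = "):
                # Session-level SET sticks to the backend, not to the client.
                server.search_path = statement.split("= ", 1)[1]
        return f"{server.name}: search_path={server.search_path}"


if __name__ == "__main__":
    pooler = TransactionPooler([ServerConnection("s1"), ServerConnection("s2")])
    print(pooler.run_transaction(["SET search_path = tenant_a"]))  # lands on s1
    print(pooler.run_transaction(["SELECT 1"]))  # lands on s2, setting is gone
    print(pooler.run_transaction(["SELECT 1"]))  # back on s1, inherits tenant_a
```

<p>The second transaction loses the setting and the third inherits one it never asked for, which is why such state has to be set inside each transaction or kept on a session-mode pool.</p>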
<h2 id="product-boundary">Product Boundary</h2>
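<p>The core responsibility of the product boundary is to put the pooler in the right operational position. Responsibility boundaries differ widely between options.</p>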
<table>
  <thead>
      <tr>
          <th>Option</th>
          <th>Main responsibility</th>
          <th>Good fit</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PgBouncer</td>
          <td>Lightweight PostgreSQL connection pooling</td>
          <td>Self-managed VM / K8s; the standard route for transaction pooling</td>
      </tr>
      <tr>
          <td>Odyssey</td>
          <td>Multi-tenant pooler with complex routing</td>
          <td>Large deployments that need advanced routing / auth</td>
      </tr>
      <tr>
          <td>RDS Proxy</td>
          <td>AWS managed connection proxy</td>
          <td>RDS / Aurora ecosystem; teams that want less proxy operations work</td>
      </tr>
      <tr>
          <td>Application pool</td>
          <td>In-service connection pool</td>
          <td>Few instances, total connections under control</td>
      </tr>
      <tr>
          <td>No pooler</td>
          <td>Connects directly to PostgreSQL</td>
          <td>Small services, low concurrency, connection count far below the limit</td>
      </tr>
  </tbody>
</table>
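<p>PgBouncer's operational focus is mode, pool size, server reset query, auth, TLS, and metrics. It fits well between the application and the database, where it takes on connection queueing and backpressure.</p>
<p>A managed proxy's operational focus is platform limits, failover behavior, credential integration, latency overhead, and observability. If the team wants one less pooler process to maintain, a managed proxy lowers operating cost, but the team has to accept the cloud platform's boundaries.</p>
<h2 id="decision-signals">Decision Signals</h2>
<p>The core responsibility of decision signals is to judge when to introduce a pooler and which kind. Connection pressure has to be shown with evidence.</p>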
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Underlying problem</th>
          <th>Suggested route</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Connection count approaching <code>max_connections</code></td>
          <td>Application fan-out too high</td>
          <td>PgBouncer transaction pooling</td>
      </tr>
      <tr>
          <td>Many idle connections</td>
          <td>Client connections sit idle for long periods</td>
          <td>Transaction pooling or app pool tuning</td>
      </tr>
      <tr>
          <td>Reconnect storm after failover</td>
          <td>Clients reconnect at once and hit the primary</td>
          <td>Pooler queue + jitter</td>
      </tr>
      <tr>
          <td>High query latency but low connection count</td>
          <td>Query / lock / index problem</td>
          <td>Query optimization</td>
      </tr>
      <tr>
          <td>Heavy dependence on session state</td>
          <td>Transaction pooling compatibility risk</td>
          <td>Session pooling or refactor session state</td>
      </tr>
  </tbody>
</table>
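<p>The success signals for a connection pooler are a lower database backend count, an observable queue, a stable error rate, and controlled tail latency. If adoption only moves timeouts from the DB to the pooler, the capacity model still needs adjustment.</p>
<h2 id="transaction-pooling-compatibility">Transaction Pooling Compatibility</h2>
<p>The core responsibility of transaction pooling compatibility is to surface the application's implicit dependencies on session state. These dependencies need to be caught in staging first.</p>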
<table>
  <thead>
      <tr>
          <th>Dependency type</th>
          <th>Risk</th>
          <th>Fix strategy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>SET search_path</code></td>
          <td>The next transaction may run on a different connection</td>
          <td>Set it explicitly in each transaction, or pin the schema</td>
      </tr>
      <tr>
          <td>temp table</td>
          <td>The server connection is released after the transaction</td>
          <td>Switch to a permanent staging table or session mode</td>
      </tr>
      <tr>
          <td>prepared statement</td>
          <td>Server-side state is unstable</td>
          <td>Use client-side prepare or session mode</td>
      </tr>
      <tr>
          <td>advisory lock</td>
          <td>Lock ownership gets confused</td>
          <td>Transaction-scoped lock, or move off the pooler path</td>
      </tr>
      <tr>
          <td>LISTEN / NOTIFY</td>
          <td>The session channel needs a persistent connection</td>
          <td>Dedicated direct connection</td>
      </tr>
  </tbody>
</table>
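<p>Compatibility review has to run at three layers: repository, migration, and background job. Web requests are usually easy to make transaction-safe; migration tools, CDC jobs, and worker queues often hold long connections and session state, and need separate configuration.</p>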
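<p>A first pass of the compatibility review can be a text scan over the SQL a service issues. The sketch below is a starting point under stated assumptions: the regular expressions are illustrative and not exhaustive, <code>find_session_dependencies</code> is a hypothetical helper, and SQL built dynamically by an ORM will not be caught by a static scan.</p>

```python
import re

# One illustrative pattern per dependency type in the compatibility table.
SESSION_STATE_PATTERNS = {
    # Session-level SET; SET LOCAL is transaction-scoped and is not flagged.
    "SET search_path": re.compile(r"\bSET\s+(?!LOCAL\b)(SESSION\s+)?search_path\b", re.I),
    "temp table": re.compile(r"\bCREATE\s+(TEMP|TEMPORARY)\s+TABLE\b", re.I),
    "prepared statement": re.compile(r"^\s*PREPARE\s+\w+", re.I | re.M),
    # Session-level advisory locks; pg_advisory_xact_lock is not flagged.
    "advisory lock": re.compile(r"\bpg_(try_)?advisory_lock(_shared)?\s*\(", re.I),
    "LISTEN / NOTIFY": re.compile(r"^\s*LISTEN\s+\w+", re.I | re.M),
}


def find_session_dependencies(sql: str) -> list[str]:
    """Return the dependency types whose pattern appears in the SQL text."""
    return [name for name, pattern in SESSION_STATE_PATTERNS.items()
            if pattern.search(sql)]


if __name__ == "__main__":
    print(find_session_dependencies("SET search_path TO tenant_a; SELECT 1"))
    print(find_session_dependencies("CREATE TEMP TABLE t (id int);\nLISTEN jobs;"))
```

<p>Hits from a scan like this feed the staging compatibility matrix; they do not replace running the workload through the pooler in staging.</p>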
<h2 id="sizing-and-evidence">Sizing and Evidence</h2>
<p>The core responsibility of sizing and evidence is to set pool size from the workload. A pool that is too large passes pressure straight through to PostgreSQL; one that is too small causes queueing and timeouts.</p>
<p>Basic sizing steps:</p>
<ol>
<li>Measure active query concurrency, not just request concurrency.</li>
<li>Reserve database connections for admin, replication, migration, and emergency access.</li>
<li>Give each service a pool quota so that no single service consumes every backend.</li>
<li>Observe wait time, server utilization, client timeouts, and query latency.</li>
<li>Validate failover / reconnect storms with a load test.</li>
</ol>
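<p>Steps 2 and 3 above can be sketched as simple arithmetic. The function name <code>pool_quotas</code> and the proportional split are assumptions chosen for illustration; the inputs are the database connection limit, the number of reserved connections, and each service's measured peak active-query concurrency.</p>

```python
def pool_quotas(max_connections: int, reserved: int,
                demand: dict[str, int]) -> dict[str, int]:
    """Split the non-reserved backends across services by measured demand.

    demand maps a service name to its measured peak active-query concurrency.
    """
    budget = max_connections - reserved
    if budget <= 0:
        raise ValueError("reserved connections leave no capacity for applications")
    total_demand = sum(demand.values())
    if total_demand <= budget:
        # Demand fits: give each service what it measured, keep the rest as headroom.
        return dict(demand)
    # Demand exceeds the budget: scale down proportionally, at least 1 per service.
    # With many tiny services the floor of 1 can exceed the budget, so the
    # result still needs a final check by the caller.
    return {name: max(1, budget * need // total_demand)
            for name, need in demand.items()}


if __name__ == "__main__":
    print(pool_quotas(200, 20, {"api": 150, "worker": 90, "report": 60}))
```

<p>Whatever demand does not fit inside the quota is meant to wait in the pooler queue, which is where the wait-time metrics in step 4 come from.</p>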
<p>A pooler dashboard needs at least client connections, server connections, waiting clients, pool wait time, server reuse, timeout count, and authentication failures.</p>
<h2 id="anti-patterns">Anti-Patterns</h2>
<p>The core responsibility of the anti-pattern list is to rule out common pooler misuse early.</p>
<table>
  <thead>
      <tr>
          <th>Anti-pattern</th>
          <th>Risk</th>
          <th>Fix direction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Setting pool size to the DB limit</td>
          <td>The DB loses its protective layer</td>
          <td>Per-service quota + reserved admin capacity</td>
      </tr>
      <tr>
          <td>Shipping transaction pooling straight to production</td>
          <td>Session state dependencies blow up in production</td>
          <td>Staging compatibility matrix</td>
      </tr>
      <tr>
          <td>Pooler without metrics</td>
          <td>Queueing incidents are hard to read</td>
          <td>Pooler dashboard + alerts</td>
      </tr>
      <tr>
          <td>Migrations share the web pool</td>
          <td>Long DDL blocks web requests</td>
          <td>Dedicated migration connection and maintenance window</td>
      </tr>
      <tr>
          <td>Retry without jitter</td>
          <td>Amplifies reconnect storms</td>
          <td>Exponential backoff + jitter</td>
      </tr>
  </tbody>
</table>
<p>A pooler is a backpressure component. Under overload it should let the system queue, reject, and stay observable, rather than push every request into the database.</p>
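<p>The "exponential backoff + jitter" fix from the table can be sketched as follows. This uses the common "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; <code>backoff_delays</code>, the 0.1 s base, and the 10 s cap are illustrative assumptions.</p>

```python
import random
from typing import Optional


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Return one randomized delay in seconds per retry attempt (full jitter)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Drawing uniformly below the ceiling spreads clients out in time,
        # so they do not all reconnect in the same instant after a failover.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(6, rng=random.Random(7))):
        print(f"attempt {i}: sleep {delay:.3f}s")
```

<p>Passing a seeded <code>random.Random</code> keeps tests reproducible; production callers can omit it.</p>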
<h2 id="下一步路由">Next Steps</h2>
<p>After the connection pooler comparison, read <a href="../pgbouncer-config/">PgBouncer config</a> for implementation, <a href="../connection-scaling/">Connection Scaling</a> to observe connection pressure, and <a href="../hands-on/connection-pool-lab/">Connection Pool Lab</a> for hands-on practice.</p>
]]></content:encoded></item><item><title>PostgreSQL Cross-region DR</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/cross-region-dr/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/cross-region-dr/</guid><description>&lt;p>The core responsibility of PostgreSQL cross-region DR is to write data recovery, service switchover, and data-consistency risk under a regional incident into a process that can be rehearsed. Cross-region DR is usually triggered by regulation, business continuity, cloud region failure, regional isolation, or high-availability commitments.&lt;/p>
&lt;p>The interpretive anchor of this article: cross-region DR is a recovery strategy and is not automatically equivalent to multi-region active-active. PostgreSQL can support different RPO / RTO targets through backup / WAL archive, physical standby, logical replication, managed service replicas, or application-level replication; every route carries data lag, switchover, and failback costs.&lt;/p>
&lt;h2 id="dr-strategy">DR Strategy&lt;/h2>
&lt;p>The core responsibility of the DR strategy is to align recovery targets with the technical route.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Strategy&lt;/th>
 &lt;th>RPO / RTO profile&lt;/th>
 &lt;th>Good fit&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Backup + WAL archive&lt;/td>
 &lt;td>RPO depends on WAL archiving, RTO on restore&lt;/td>
 &lt;td>Cost-sensitive, infrequent disaster recovery&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cross-region standby&lt;/td>
 &lt;td>RPO close to replication lag, shorter RTO&lt;/td>
 &lt;td>Needs faster read startup / promotion&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Logical replication&lt;/td>
 &lt;td>Table-level / selective DR&lt;/td>
 &lt;td>Cross-version, cross-schema, partial data sync&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Managed global DB&lt;/td>
 &lt;td>Cloud platform provides cross-region replicas&lt;/td>
 &lt;td>Teams that want less self-managed replication and promotion work&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application replay&lt;/td>
 &lt;td>Rebuild state from events / queues&lt;/td>
 &lt;td>Domain events are already the source of truth&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>RPO has to be defined by the business. If payments, orders, and inventory tolerate only seconds of loss, a backup-only route is usually not enough; for internal reports or rebuildable data, backup + WAL archive may suffice.&lt;/p>
&lt;h2 id="physical-vs-logical">Physical vs Logical&lt;/h2>
&lt;p>The core responsibility of physical vs logical is to distinguish byte-level recovery from row-level replication. A physical replica preserves PostgreSQL cluster-level state; &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/logical-replication/" data-link-title="Logical Replication" data-link-desc="說明以表為粒度解碼 row-level 變更的複製方式，對照 byte-level 的實體複製">logical replication&lt;/a> provides table / publication-level flexibility.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Aspect&lt;/th>
 &lt;th>Physical standby&lt;/th>
 &lt;th>Logical replication&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Granularity&lt;/td>
 &lt;td>cluster / database&lt;/td>
 &lt;td>table / publication&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Version flexibility&lt;/td>
 &lt;td>Usually requires version and system compatibility&lt;/td>
 &lt;td>Can support cross-version / selective migration&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DDL&lt;/td>
 &lt;td>Follows WAL / must be compatible&lt;/td>
 &lt;td>Needs schema coordination&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failover&lt;/td>
 &lt;td>promote standby&lt;/td>
 &lt;td>Application / target DB switchover&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Risk&lt;/td>
 &lt;td>replication lag, timeline&lt;/td>
 &lt;td>slot lag, schema drift, missing key&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
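&lt;p>Physical standby fits whole-system DR. Its runbook has to cover WAL archive, replication lag, promotion, timeline, DNS / connection string switchover, and failback.&lt;/p>
&lt;p>Logical replication fits partial data or cross-version conversion. Its runbook has to cover publication, subscription, replication slot, schema migration ordering, and data diff.&lt;/p>
&lt;h2 id="failover-runbook">Failover Runbook&lt;/h2>
&lt;p>The core responsibility of the failover runbook is to turn disaster switchover into steps that can be rehearsed. The minimal flow includes incident declare, source freeze, replica health check, promote, traffic switch, data validation, and rollback / rebuild.&lt;/p></description><content:encoded><![CDATA[<p>The core responsibility of PostgreSQL cross-region DR is to write data recovery, service switchover, and data-consistency risk under a regional incident into a process that can be rehearsed. Cross-region DR is usually triggered by regulation, business continuity, cloud region failure, regional isolation, or high-availability commitments.</p>
<p>The interpretive anchor of this article: cross-region DR is a recovery strategy and is not automatically equivalent to multi-region active-active. PostgreSQL can support different RPO / RTO targets through backup / WAL archive, physical standby, logical replication, managed service replicas, or application-level replication; every route carries data lag, switchover, and failback costs.</p>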
<h2 id="dr-strategy">DR Strategy</h2>
<p>The core responsibility of the DR strategy is to align recovery targets with the technical route.</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>RPO / RTO profile</th>
          <th>Good fit</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Backup + WAL archive</td>
          <td>RPO depends on WAL archiving, RTO on restore</td>
          <td>Cost-sensitive, infrequent disaster recovery</td>
      </tr>
      <tr>
          <td>Cross-region standby</td>
          <td>RPO close to replication lag, shorter RTO</td>
          <td>Needs faster read startup / promotion</td>
      </tr>
      <tr>
          <td>Logical replication</td>
          <td>Table-level / selective DR</td>
          <td>Cross-version, cross-schema, partial data sync</td>
      </tr>
      <tr>
          <td>Managed global DB</td>
          <td>Cloud platform provides cross-region replicas</td>
          <td>Teams that want less self-managed replication and promotion work</td>
      </tr>
      <tr>
          <td>Application replay</td>
          <td>Rebuild state from events / queues</td>
          <td>Domain events are already the source of truth</td>
      </tr>
  </tbody>
</table>
<p>RPO has to be defined by the business. If payments, orders, and inventory tolerate only seconds of loss, a backup-only route is usually not enough; for internal reports or rebuildable data, backup + WAL archive may suffice.</p>
<h2 id="physical-vs-logical">Physical vs Logical</h2>
<p>The core responsibility of physical vs logical is to distinguish byte-level recovery from row-level replication. A physical replica preserves PostgreSQL cluster-level state; <a href="/blog/backend/knowledge-cards/logical-replication/" data-link-title="Logical Replication" data-link-desc="說明以表為粒度解碼 row-level 變更的複製方式，對照 byte-level 的實體複製">logical replication</a> provides table / publication-level flexibility.</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>Physical standby</th>
          <th>Logical replication</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Granularity</td>
          <td>cluster / database</td>
          <td>table / publication</td>
      </tr>
      <tr>
          <td>Version flexibility</td>
          <td>Usually requires version and system compatibility</td>
          <td>Can support cross-version / selective migration</td>
      </tr>
      <tr>
          <td>DDL</td>
          <td>Follows WAL / must be compatible</td>
          <td>Needs schema coordination</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>promote standby</td>
          <td>Application / target DB switchover</td>
      </tr>
      <tr>
          <td>Risk</td>
          <td>replication lag, timeline</td>
          <td>slot lag, schema drift, missing key</td>
      </tr>
  </tbody>
</table>
<p>Physical standby fits whole-system DR. Its runbook has to cover WAL archive, replication lag, promotion, timeline, DNS / connection string switchover, and failback.</p>
<p>Logical replication fits partial data or cross-version conversion. Its runbook has to cover publication, subscription, replication slot, schema migration ordering, and data diff.</p>
<h2 id="failover-runbook">Failover Runbook</h2>
<p>The core responsibility of the failover runbook is to turn disaster switchover into steps that can be rehearsed. The minimal flow includes incident declare, source freeze, replica health check, promote, traffic switch, data validation, and rollback / rebuild.</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>Action</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Declare incident</td>
          <td>Confirm the scope of the primary-region incident</td>
          <td>incident decision log</td>
      </tr>
      <tr>
          <td>Freeze source</td>
          <td>Stop writes or confirm the source is unavailable</td>
          <td>last known LSN / timestamp</td>
      </tr>
      <tr>
          <td>Check replica</td>
          <td>Lag, WAL received, read health</td>
          <td>replica status snapshot</td>
      </tr>
      <tr>
          <td>Promote</td>
          <td>Promote the standby or activate the target</td>
          <td>new timeline / role</td>
      </tr>
      <tr>
          <td>Switch traffic</td>
          <td>DNS, secret, connection string</td>
          <td>app smoke test</td>
      </tr>
      <tr>
          <td>Validate</td>
          <td>Row count, critical invariant</td>
          <td>validation report</td>
      </tr>
      <tr>
          <td>Rebuild</td>
          <td>Rebuild the old primary or a new standby</td>
          <td>follow-up runbook</td>
      </tr>
  </tbody>
</table>
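<p>Failover decisions need an owner. Automation can execute the steps, but whether to accept data loss, freeze writes, or promote still needs a clearly responsible person and a tripwire.</p>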
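<p>The split between automation and owner can be sketched as a gate in front of the promote step. This is an illustration under assumptions: <code>FailoverDecision</code> and its fields are hypothetical, the replica lag is taken as already measured, and the RPO tripwire is whatever the business agreed to.</p>

```python
from dataclasses import dataclass, field


@dataclass
class FailoverDecision:
    """Gate for the promote step: automation inside the tripwire, owner outside."""
    replica_lag_seconds: float
    rpo_tripwire_seconds: float
    owner: str
    owner_accepts_data_loss: bool = False
    log: list[str] = field(default_factory=list)  # feeds the incident decision log

    def may_promote(self) -> bool:
        if self.replica_lag_seconds <= self.rpo_tripwire_seconds:
            self.log.append(
                f"lag {self.replica_lag_seconds}s within tripwire; promote allowed")
            return True
        if self.owner_accepts_data_loss:
            self.log.append(
                f"{self.owner} accepted up to {self.replica_lag_seconds}s of data loss")
            return True
        self.log.append(f"lag exceeds tripwire; waiting for {self.owner}")
        return False


if __name__ == "__main__":
    decision = FailoverDecision(replica_lag_seconds=40, rpo_tripwire_seconds=5,
                                owner="dba-oncall")
    print(decision.may_promote(), decision.log)
```

<p>Every call appends to the log, so the evidence column of the runbook gets filled as a side effect of making the decision.</p>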
<h2 id="data-reconciliation">Data Reconciliation</h2>
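<p>The core responsibility of data reconciliation is to handle data differences after a cross-region switchover. As long as replication lag exists, there may be unapplied transactions after failover.</p>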
<table>
  <thead>
      <tr>
          <th>Difference type</th>
          <th>Handling</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Committed but not replicated</td>
          <td>Compensate from source WAL / app log / events</td>
      </tr>
      <tr>
          <td>Duplicate writes from client retries</td>
          <td>Deduplicate by idempotency key / natural key</td>
      </tr>
      <tr>
          <td>sequence / identity</td>
          <td>Target sequence reset / collision check</td>
      </tr>
      <tr>
          <td>external side effect</td>
          <td>Payment, email, and queue need reconciliation</td>
      </tr>
  </tbody>
</table>
<p>Reconciliation starts by defining the critical tables. A full diff of every table is expensive; high-risk data such as payments, orders, permissions, ledger, and mutation log needs dedicated validation queries.</p>
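<p>The "duplicate writes from client retries" row can be sketched as a replay filter. The example assumes each replayed write carries an <code>idempotency_key</code> field, which is a hypothetical field name; a natural key such as an order number would work the same way.</p>

```python
def dedupe_by_idempotency_key(writes: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split replayed writes into (kept, duplicates) using the idempotency key."""
    seen: set[str] = set()
    kept: list[dict] = []
    duplicates: list[dict] = []
    for write in writes:
        key = write["idempotency_key"]
        if key in seen:
            # A retry of a write that was already applied: set aside for review.
            duplicates.append(write)
        else:
            seen.add(key)
            kept.append(write)
    return kept, duplicates


if __name__ == "__main__":
    replay = [
        {"idempotency_key": "pay-001", "amount": 100},
        {"idempotency_key": "pay-002", "amount": 250},
        {"idempotency_key": "pay-001", "amount": 100},  # client retry after failover
    ]
    kept, duplicates = dedupe_by_idempotency_key(replay)
    print(len(kept), len(duplicates))
```

<p>Duplicates are returned rather than dropped so that the validation report can show what was filtered.</p>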
<h2 id="drill-design">Drill Design</h2>
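<p>The core responsibility of drill design is to verify RPO / RTO on a regular schedule. DR documentation is only credible after a drill.</p>
<p>A drill includes at least:</p>
<ol>
<li>Restore from backup + WAL to a specified point in time.</li>
<li>Promote a standby into an isolated environment.</li>
<li>Run the application smoke test against the DR endpoint.</li>
<li>Calculate the actual RPO / RTO.</li>
<li>Record failure points, manual steps, and fixes for the next round.</li>
</ol>
<p>Drills should avoid destructive actions in production. Use an isolated VPC, a staging app, read-only validation, and mocked external side effects.</p>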
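<p>The step that calculates the actual RPO / RTO can be sketched from three timestamps recorded during the drill. The function name <code>measure_drill</code> and the three inputs are assumptions for illustration; here RPO is approximated as the gap between the last replicated commit and the start of the incident, and RTO as the time from incident start to restored service.</p>

```python
from datetime import datetime, timedelta


def measure_drill(last_replicated_commit: datetime, incident_start: datetime,
                  service_restored: datetime) -> dict[str, timedelta]:
    """Return the observed RPO and RTO for one drill run."""
    return {
        # Writes committed after this point never reached the DR side.
        "rpo": incident_start - last_replicated_commit,
        # Time during which the service was unavailable.
        "rto": service_restored - incident_start,
    }


if __name__ == "__main__":
    result = measure_drill(
        last_replicated_commit=datetime(2026, 5, 22, 10, 0, 0),
        incident_start=datetime(2026, 5, 22, 10, 0, 12),
        service_restored=datetime(2026, 5, 22, 10, 9, 12),
    )
    print(result["rpo"], result["rto"])
```

<p>Comparing these measured values with the agreed targets is what turns a drill into evidence.</p>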
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>The core responsibility of no-go conditions is to mark the boundary of PostgreSQL cross-region DR.</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Suggested route</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Concurrent transactional writes in multiple regions are a core requirement</td>
          <td>Distributed SQL such as CockroachDB / Spanner / YugabyteDB</td>
      </tr>
      <tr>
          <td>RPO close to zero across a long inter-region distance</td>
          <td>Evaluate the latency cost of synchronous replication</td>
      </tr>
      <tr>
          <td>Team lacks DR drill capability</td>
          <td>Managed service + vendor runbook</td>
      </tr>
      <tr>
          <td>Data residency restricts cross-region replication</td>
          <td>Regional shard / policy-driven replication</td>
      </tr>
  </tbody>
</table>
<p>Cross-region DR has to be honest about latency. Making every region a writer requires a distributed transaction model; the PostgreSQL DR route mainly provides recovery and switchover.</p>
<h2 id="下一步路由">Next Steps</h2>
<p>After cross-region DR, read <a href="../pitr-wal-archiving/">PITR / WAL Archiving</a> for recovery implementation, <a href="../replication-topology/">Replication Topology</a> for replication architecture, and <a href="../multi-region-gdpr-rollout/">Multi-region GDPR Rollout</a> for data policy in cross-region rollouts.</p>
]]></content:encoded></item><item><title>PostgreSQL Developer / DBA Responsibility Split</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/developer-dba-responsibility-split/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/developer-dba-responsibility-split/</guid><description>&lt;p>The core responsibility of the PostgreSQL developer / DBA responsibility split is to divide database decisions into application ownership, database operation, and platform governance. PostgreSQL is deep in features, and incidents often span query, schema, connection, backup, replication, and capacity; when the split is vague, problems grow during releases and incidents.&lt;/p>
&lt;p>The interpretive anchor of this article: the developer / DBA split should give every decision a clear owner, evidence, review gate, and rollback, rather than hand the database to one side.&lt;/p>
&lt;h2 id="ownership-map">Ownership Map&lt;/h2>
&lt;p>The core responsibility of the ownership map is to define who can change what and who must verify what.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Aspect&lt;/th>
 &lt;th>Developer owner&lt;/th>
 &lt;th>DBA / platform owner&lt;/th>
 &lt;th>Shared gate&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Schema design&lt;/td>
 &lt;td>domain model, constraint, query&lt;/td>
 &lt;td>naming, storage, partition, extension&lt;/td>
 &lt;td>migration review&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query performance&lt;/td>
 &lt;td>repository SQL, query shape&lt;/td>
 &lt;td>index, planner, statistics, capacity&lt;/td>
 &lt;td>explain evidence&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Migration&lt;/td>
 &lt;td>app compatibility, rollback&lt;/td>
 &lt;td>lock impact, DDL strategy, PITR&lt;/td>
 &lt;td>release gate&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection&lt;/td>
 &lt;td>pool usage, transaction length&lt;/td>
 &lt;td>pooler, max connection, proxy&lt;/td>
 &lt;td>load test&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup / DR&lt;/td>
 &lt;td>restore smoke test&lt;/td>
 &lt;td>WAL archive, PITR, replica&lt;/td>
 &lt;td>restore drill&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Security&lt;/td>
 &lt;td>tenant / workflow intent&lt;/td>
 &lt;td>role, RLS, audit, grant&lt;/td>
 &lt;td>access review&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
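&lt;p>The point of this table is the shared gate. Developers know the product semantics best, and DBA / platform knows database risk best; a production change needs evidence from both sides combined.&lt;/p>
&lt;h2 id="schema-and-migration">Schema and Migration&lt;/h2>
&lt;p>The core responsibility of schema and migration is to keep application releases and database changes in sync. Developers should provide the business invariant, compatibility window, and read/write path; DBA / platform should review locks, index builds, table rewrites, replica lag, and rollback.&lt;/p>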
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Migration type&lt;/th>
 &lt;th>Developer evidence&lt;/th>
 &lt;th>DBA / platform evidence&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Add nullable column&lt;/td>
 &lt;td>app read/write compatibility&lt;/td>
 &lt;td>DDL lock time, replica impact&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Add NOT NULL&lt;/td>
 &lt;td>backfill plan, default behavior&lt;/td>
 &lt;td>table rewrite / validation strategy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index build&lt;/td>
 &lt;td>query contract, expected selectivity&lt;/td>
 &lt;td>concurrent build, disk, bloat&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Partition change&lt;/td>
 &lt;td>routing logic, retention behavior&lt;/td>
 &lt;td>detach / attach, maintenance window&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Type change&lt;/td>
 &lt;td>serialization, API compatibility&lt;/td>
 &lt;td>cast risk, rewrite duration&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
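&lt;p>Migration review starts from failure modes. If a migration hangs, who stops the rollout; if a backfill causes lag, who throttles it; if old and new app versions run side by side, which schema is compatible with both.&lt;/p>
&lt;h2 id="query-and-capacity">Query and Capacity&lt;/h2>
&lt;p>The core responsibility of query and capacity is to align query shape with database resources. Developers are responsible for avoiding N+1, long transactions, unbounded queries, and incorrect pagination; DBA / platform is responsible for index, statistics, vacuum, work_mem, connection, and storage.&lt;/p></description><content:encoded><![CDATA[<p>The core responsibility of the PostgreSQL developer / DBA responsibility split is to divide database decisions into application ownership, database operation, and platform governance. PostgreSQL is deep in features, and incidents often span query, schema, connection, backup, replication, and capacity; when the split is vague, problems grow during releases and incidents.</p>
<p>The interpretive anchor of this article: the developer / DBA split should give every decision a clear owner, evidence, review gate, and rollback, rather than hand the database to one side.</p>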
<h2 id="ownership-map">Ownership Map</h2>
<p>The core responsibility of the ownership map is to define who can change what and who must verify what.</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>Developer owner</th>
          <th>DBA / platform owner</th>
          <th>Shared gate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema design</td>
          <td>domain model, constraint, query</td>
          <td>naming, storage, partition, extension</td>
          <td>migration review</td>
      </tr>
      <tr>
          <td>Query performance</td>
          <td>repository SQL, query shape</td>
          <td>index, planner, statistics, capacity</td>
          <td>explain evidence</td>
      </tr>
      <tr>
          <td>Migration</td>
          <td>app compatibility, rollback</td>
          <td>lock impact, DDL strategy, PITR</td>
          <td>release gate</td>
      </tr>
      <tr>
          <td>Connection</td>
          <td>pool usage, transaction length</td>
          <td>pooler, max connection, proxy</td>
          <td>load test</td>
      </tr>
      <tr>
          <td>Backup / DR</td>
          <td>restore smoke test</td>
          <td>WAL archive, PITR, replica</td>
          <td>restore drill</td>
      </tr>
      <tr>
          <td>Security</td>
          <td>tenant / workflow intent</td>
          <td>role, RLS, audit, grant</td>
          <td>access review</td>
      </tr>
  </tbody>
</table>
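<p>The point of this table is the shared gate. Developers know the product semantics best, and DBA / platform knows database risk best; a production change needs evidence from both sides combined.</p>
<h2 id="schema-and-migration">Schema and Migration</h2>
<p>The core responsibility of schema and migration is to keep application releases and database changes in sync. Developers should provide the business invariant, compatibility window, and read/write path; DBA / platform should review locks, index builds, table rewrites, replica lag, and rollback.</p>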
<table>
  <thead>
      <tr>
          <th>Migration type</th>
          <th>Developer evidence</th>
          <th>DBA / platform evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Add nullable column</td>
          <td>app read/write compatibility</td>
          <td>DDL lock time, replica impact</td>
      </tr>
      <tr>
          <td>Add NOT NULL</td>
          <td>backfill plan, default behavior</td>
          <td>table rewrite / validation strategy</td>
      </tr>
      <tr>
          <td>Index build</td>
          <td>query contract, expected selectivity</td>
          <td>concurrent build, disk, bloat</td>
      </tr>
      <tr>
          <td>Partition change</td>
          <td>routing logic, retention behavior</td>
          <td>detach / attach, maintenance window</td>
      </tr>
      <tr>
          <td>Type change</td>
          <td>serialization, API compatibility</td>
          <td>cast risk, rewrite duration</td>
      </tr>
  </tbody>
</table>
<p>Migration review starts from failure modes. If a migration hangs, who stops the rollout; if a backfill causes lag, who throttles it; if old and new app versions run side by side, which schema is compatible with both.</p>
<h2 id="query-and-capacity">Query and Capacity</h2>
<p>The core responsibility of query and capacity is to align query shape with database resources. Developers are responsible for avoiding N+1, long transactions, unbounded queries, and incorrect pagination; DBA / platform is responsible for index, statistics, vacuum, work_mem, connection, and storage.</p>
<p>Minimal evidence for a query review:</p>
<ol>
<li>SQL text or repository method.</li>
<li>Expected cardinality and data volume.</li>
<li><code>EXPLAIN</code> / <code>EXPLAIN ANALYZE</code> output.</li>
<li>Index dependencies and fallback plan.</li>
<li>Timeout, pagination, and transaction boundary.</li>
</ol>
<p>Capacity review has to place the query inside the workload. A single fast query does not mean the whole system is stable; high-frequency queries, batch jobs, migration backfills, and CDC consumers all share I/O, CPU, locks, and WAL.</p>
<h2 id="incident-roles">Incident Roles</h2>
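<p>The core responsibility of incident roles is to give database incidents a division of labor. During an incident, developers look at workflow, feature flags, traffic, and recent deploys; DBA / platform looks at locks, replicas, WAL, disk, pooler, and backup.</p>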
<table>
  <thead>
      <tr>
          <th>Incident</th>
          <th>Developer first response</th>
          <th>DBA / platform first response</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lock storm</td>
          <td>Pause the related workflow, stop the rollout</td>
          <td>Check blocking PID, DDL, transactions</td>
      </tr>
      <tr>
          <td>Connection exhaustion</td>
          <td>Lower app concurrency, stop the retry storm</td>
          <td>Pooler queue, max connection, admin access</td>
      </tr>
      <tr>
          <td>Replica lag</td>
          <td>Pause heavy writes / backfill</td>
          <td>WAL sender, slot, standby apply</td>
      </tr>
      <tr>
          <td>Bad migration</td>
          <td>Block the release, preserve the failed state</td>
          <td>Restore point, rollback / PITR</td>
      </tr>
      <tr>
          <td>Slow query spike</td>
          <td>Feature flag, query owner</td>
          <td>Plan regression, statistics, index</td>
      </tr>
  </tbody>
</table>
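<p>Incident command has to keep a decision record. Database incidents often involve high-pressure operations such as kill session, promote replica, drop slot, and restore backup; every operation needs its reason and recovery route recorded.</p>
<h2 id="review-cadence">Review Cadence</h2>
<p>The core responsibility of review cadence is to make database quality part of daily work. A suggested cadence:</p>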
<table>
  <thead>
      <tr>
          <th>Cadence</th>
          <th>Review content</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Every release</td>
          <td>migration diff, new query, role / grant</td>
      </tr>
      <tr>
          <td>Weekly</td>
          <td>slow query, lock wait, replica lag, pool</td>
      </tr>
      <tr>
          <td>Monthly</td>
          <td>backup restore drill, index bloat, vacuum</td>
      </tr>
      <tr>
          <td>Quarterly</td>
          <td>DR drill, major version plan, extension review</td>
      </tr>
  </tbody>
</table>
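<p>Review cadence should match service risk. High-volume or compliance-bound services need shorter cycles; internal tools can run lighter, but must still keep backup / restore evidence.</p>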
<h2 id="handoff-artifact">Handoff Artifact</h2>
<p>The handoff artifact lets the next maintainer take over.</p>
<p>Minimum content:</p>
<ol>
<li>Database owner, application owner, platform owner.</li>
<li>Schema migration process and rollback route.</li>
<li>Query review checklist.</li>
<li>Connection / pooler policy.</li>
<li>Backup / PITR / DR evidence.</li>
<li>Security / role / audit owner.</li>
<li>Incident escalation route.</li>
</ol>
<p>This artifact should link back to the <a href="../">PostgreSQL overview</a>, the <a href="../hands-on/schema-migration-evidence-lab/">Schema Migration Evidence Lab</a>, and the <a href="../hands-on/pitr-restore-drill/">PITR Restore Drill</a>.</p>
<h2 id="下一步路由">Next Steps</h2>
<p>Once responsibilities are assigned: for the migration gate, read <a href="../online-schema-change/">Online Schema Change</a>; for connection ownership, read <a href="../connection-pooler-comparison/">Connection Pooler Comparison</a>; for security ownership, read <a href="../security-rls-audit-logging/">Security / RLS / Audit Logging</a>.</p>
]]></content:encoded></item><item><title>PostgreSQL Logical Decoding Plugins</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-decoding-plugins/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-decoding-plugins/</guid><description>&lt;p>PostgreSQL logical decoding plugins turn changes recorded in the WAL into an event format that external consumers can understand. As the official PostgreSQL logical decoding documentation describes, logical decoding uses a &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/replication-slot/" data-link-title="Replication Slot" data-link-desc="說明邏輯複製如何用 slot 追蹤消費進度，並對來源端造成保留壓力">replication slot&lt;/a> to decode WAL changes into plugin output; the output plugin determines whether consumers see the PostgreSQL protocol, JSON, test text, or a custom format.&lt;/p>
&lt;p>The reading anchor for this article: plugin selection is a CDC contract decision. It affects schema evolution, event fields, delete representation, transaction boundaries, consumer compatibility, slot lag, and failure recovery.&lt;/p>
&lt;h2 id="plugin-boundary">Plugin Boundary&lt;/h2>
&lt;p>The plugin boundary defines how database changes leave PostgreSQL. Common options include the built-in &lt;code>pgoutput&lt;/code>, the test-oriented &lt;code>test_decoding&lt;/code>, JSON-oriented plugins, and the plugins / protocols supported by the Debezium connector.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Plugin / path&lt;/th>
 &lt;th>Primary responsibility&lt;/th>
 &lt;th>Best fit&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>pgoutput&lt;/code>&lt;/td>
 &lt;td>PostgreSQL logical replication protocol&lt;/td>
 &lt;td>Built-in logical replication; the common Debezium path&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>test_decoding&lt;/code>&lt;/td>
 &lt;td>Human-readable test output&lt;/td>
 &lt;td>Lab, debug, teaching&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>wal2json&lt;/code>&lt;/td>
 &lt;td>JSON change event&lt;/td>
 &lt;td>Custom consumers, legacy CDC&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>decoderbufs&lt;/td>
 &lt;td>Protobuf event&lt;/td>
 &lt;td>Pipelines with a strong schema contract&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Native subscription&lt;/td>
 &lt;td>DB-to-DB replication&lt;/td>
 &lt;td>Table replication between PostgreSQL instances&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;code>pgoutput&lt;/code> suits standardized CDC. It aligns with the publication / subscription model and stays on PostgreSQL's main logical replication path.&lt;/p>
&lt;p>&lt;code>test_decoding&lt;/code> suits teaching and troubleshooting. It shows the inserts / updates / deletes that happened inside a transaction, but it is meant for testing and understanding and should not serve as a production event contract.&lt;/p>
&lt;h2 id="replication-slot-responsibility">Replication Slot Responsibility&lt;/h2>
&lt;p>Replication slot responsibility means protecting consumer progress while managing WAL retention. A logical slot makes PostgreSQL retain WAL that the consumer has not yet confirmed; when the consumer stalls, slot lag turns into disk pressure.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Signal&lt;/th>
 &lt;th>Meaning&lt;/th>
 &lt;th>Operational response&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>confirmed_flush_lsn&lt;/code>&lt;/td>
 &lt;td>Position the consumer has confirmed&lt;/td>
 &lt;td>Use it to gauge CDC progress&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>retained WAL size&lt;/td>
 &lt;td>WAL retained because of the slot&lt;/td>
 &lt;td>Alert, tune the consumer, drop / advance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>inactive slot&lt;/td>
 &lt;td>Consumer is offline&lt;/td>
 &lt;td>Check the connector, pause releases&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>publication table diff&lt;/td>
 &lt;td>CDC scope and schema disagree&lt;/td>
 &lt;td>Review publication / table ownership&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>A slot is a production resource. Every logical slot needs an owner, a consumer, an SLO, a drop condition, a backfill plan, and an alert.&lt;/p>
&lt;h2 id="event-contract">Event Contract&lt;/h2>
&lt;p>The event contract tells downstream consumers what each change means. A CDC event should at least specify the key, before/after image, operation, commit timestamp, transaction ordering, schema version, and delete representation.&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL logical decoding plugins turn changes recorded in the WAL into an event format that external consumers can understand. As the official PostgreSQL logical decoding documentation describes, logical decoding uses a <a href="/blog/backend/knowledge-cards/replication-slot/" data-link-title="Replication Slot" data-link-desc="說明邏輯複製如何用 slot 追蹤消費進度，並對來源端造成保留壓力">replication slot</a> to decode WAL changes into plugin output; the output plugin determines whether consumers see the PostgreSQL protocol, JSON, test text, or a custom format.</p>
<p>The reading anchor for this article: plugin selection is a CDC contract decision. It affects schema evolution, event fields, delete representation, transaction boundaries, consumer compatibility, slot lag, and failure recovery.</p>
<h2 id="plugin-boundary">Plugin Boundary</h2>
<p>The plugin boundary defines how database changes leave PostgreSQL. Common options include the built-in <code>pgoutput</code>, the test-oriented <code>test_decoding</code>, JSON-oriented plugins, and the plugins / protocols supported by the Debezium connector.</p>
<table>
  <thead>
      <tr>
          <th>Plugin / path</th>
          <th>Primary responsibility</th>
          <th>Best fit</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>pgoutput</code></td>
          <td>PostgreSQL logical replication protocol</td>
          <td>Built-in logical replication; the common Debezium path</td>
      </tr>
      <tr>
          <td><code>test_decoding</code></td>
          <td>Human-readable test output</td>
          <td>Lab, debug, teaching</td>
      </tr>
      <tr>
          <td><code>wal2json</code></td>
          <td>JSON change event</td>
          <td>Custom consumers, legacy CDC</td>
      </tr>
      <tr>
          <td>decoderbufs</td>
          <td>Protobuf event</td>
          <td>Pipelines with a strong schema contract</td>
      </tr>
      <tr>
          <td>Native subscription</td>
          <td>DB-to-DB replication</td>
          <td>Table replication between PostgreSQL instances</td>
      </tr>
  </tbody>
</table>
<p><code>pgoutput</code> suits standardized CDC. It aligns with the publication / subscription model and stays on PostgreSQL's main logical replication path.</p>
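<p>Because <code>pgoutput</code> is driven by publications, the CDC scope can be declared and inspected in SQL. The sketch below assumes two hypothetical tables, <code>public.orders</code> and <code>public.order_items</code>, and a publication name chosen purely for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Publish only the tables the CDC contract covers (hypothetical names).
CREATE PUBLICATION cdc_orders FOR TABLE public.orders, public.order_items;

-- Inspect what the publication currently exposes.
SELECT pubname, schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'cdc_orders';
</code></pre></div>
<p>Listing tables explicitly, rather than publishing every table, keeps the publication reviewable alongside table ownership. Streaming changes from it still requires <code>wal_level = logical</code>.</p>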
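<p><code>test_decoding</code> suits teaching and troubleshooting. It shows the inserts / updates / deletes that happened inside a transaction, but it is meant for testing and understanding and should not serve as a production event contract.</p>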
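<p>A lab session with <code>test_decoding</code> might look like the sketch below. It assumes <code>wal_level = logical</code>, a role allowed to create replication slots, and a hypothetical table <code>public.orders (id, status)</code>; the slot name is also made up.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Create a throwaway logical slot that uses the test_decoding plugin.
SELECT * FROM pg_create_logical_replication_slot('lab_slot', 'test_decoding');

-- Make a change in the hypothetical table.
INSERT INTO public.orders (id, status) VALUES (1, 'new');

-- Peek at decoded changes without advancing the slot.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('lab_slot', NULL, NULL);

-- Drop the slot afterwards so it stops retaining WAL.
SELECT pg_drop_replication_slot('lab_slot');
</code></pre></div>
<p><code>pg_logical_slot_peek_changes</code> leaves the slot position untouched, so the same changes can be inspected repeatedly; <code>pg_logical_slot_get_changes</code> would consume them.</p>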
<h2 id="replication-slot-responsibility">Replication Slot Responsibility</h2>
<p>Replication slot responsibility means protecting consumer progress while managing WAL retention. A logical slot makes PostgreSQL retain WAL that the consumer has not yet confirmed; when the consumer stalls, slot lag turns into disk pressure.</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Meaning</th>
          <th>Operational response</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>confirmed_flush_lsn</code></td>
          <td>Position the consumer has confirmed</td>
          <td>Use it to gauge CDC progress</td>
      </tr>
      <tr>
          <td>retained WAL size</td>
          <td>WAL retained because of the slot</td>
          <td>Alert, tune the consumer, drop / advance</td>
      </tr>
      <tr>
          <td>inactive slot</td>
          <td>Consumer is offline</td>
          <td>Check the connector, pause releases</td>
      </tr>
      <tr>
          <td>publication table diff</td>
          <td>CDC scope and schema disagree</td>
          <td>Review publication / table ownership</td>
      </tr>
  </tbody>
</table>
<p>A slot is a production resource. Every logical slot needs an owner, a consumer, an SLO, a drop condition, a backfill plan, and an alert.</p>
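<p>The signals in the table above can be read from the <code>pg_replication_slots</code> view. The following is a minimal monitoring sketch, meant to run on the primary; alert thresholds are deliberately left out because they depend on disk headroom.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Logical slots ordered by how much WAL each one forces the server to keep.
SELECT slot_name,
       plugin,
       active,                -- false means no consumer is connected
       confirmed_flush_lsn,   -- position the consumer has confirmed
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
</code></pre></div>
<p>A slot that is inactive and sits at the top of this list is the usual candidate for the drop condition defined by its owner.</p>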
<h2 id="event-contract">Event Contract</h2>
<p>The event contract tells downstream consumers what each change means. A CDC event should at least specify the key, before/after image, operation, commit timestamp, transaction ordering, schema version, and delete representation.</p>
<table>
  <thead>
      <tr>
          <th>Contract aspect</th>
          <th>Review question</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Key</td>
          <td>Does the table have a replica identity / primary key?</td>
      </tr>
      <tr>
          <td>Update image</td>
          <td>Is the before value required?</td>
      </tr>
      <tr>
          <td>Delete</td>
          <td>Tombstone, key-only delete, or soft delete?</td>
      </tr>
      <tr>
          <td>Ordering</td>
          <td>Must transaction order be preserved?</td>
      </tr>
      <tr>
          <td>Schema evolution</td>
          <td>How are new, renamed, and dropped columns communicated?</td>
      </tr>
      <tr>
          <td>Backfill</td>
          <td>How do the initial snapshot and streaming join up?</td>
      </tr>
  </tbody>
</table>
<p><a href="/blog/backend/knowledge-cards/replica-identity/" data-link-title="Replica Identity" data-link-desc="說明 row-level 變更事件如何帶穩定 key，讓下游能正確套用 update 與 delete">Replica identity</a> is a core CDC setting. A table without a stable key makes update / delete events hard for downstream to apply correctly; such tables should first get a primary key or an explicit replica identity.</p>
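<p>One way to find tables at risk is to query the catalog for tables that have no primary key and still use the default (or no) replica identity. The table and index names in the two <code>ALTER TABLE</code> statements are hypothetical and only show the two usual fixes.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- relreplident: d = default (primary key), i = index, f = full, n = nothing.
SELECT c.oid::regclass AS table_name, c.relreplident
FROM pg_class AS c
JOIN pg_namespace AS n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND c.relreplident IN ('d', 'n')
  AND NOT EXISTS (
        SELECT 1
        FROM pg_index AS i
        WHERE i.indrelid = c.oid AND i.indisprimary
      );

-- Fix 1 (hypothetical names): reuse an existing unique index on NOT NULL columns.
ALTER TABLE public.audit_events
  REPLICA IDENTITY USING INDEX audit_events_event_uid_key;

-- Fix 2 (hypothetical name): log the whole old row; simple, but writes more WAL.
ALTER TABLE public.legacy_notes REPLICA IDENTITY FULL;
</code></pre></div>
<p><code>REPLICA IDENTITY FULL</code> also answers the "before value" question in the table above, at the cost of larger WAL records for every update and delete.</p>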
<h2 id="connector-patterns">Connector Patterns</h2>
<p>Connector patterns connect plugin output to an actual pipeline. Debezium, a custom consumer, and a native DB subscription each carry different operational responsibilities.</p>
<table>
  <thead>
      <tr>
          <th>Pattern</th>
          <th>Strengths</th>
          <th>Risks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Debezium connector</td>
          <td>Mature snapshot + streaming workflow</td>
          <td>Connector state, Kafka / offset operations</td>
      </tr>
      <tr>
          <td>Native subscription</td>
          <td>Native PostgreSQL DB-to-DB</td>
          <td>Schema drift, DDL coordination</td>
      </tr>
      <tr>
          <td>Custom consumer</td>
          <td>Customizable event contract</td>
          <td>Slot management and error handling are yours to own</td>
      </tr>
      <tr>
          <td>Batch export + CDC</td>
          <td>Backfill and streaming are separated</td>
          <td>Cutover LSN and duplicate handling</td>
      </tr>
  </tbody>
</table>
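<p>A connector must define where backfill hands over to streaming. The most common incidents are consuming the stream before the snapshot has finished, or failing to record the cutover LSN; either one leaves downstream with duplicated or missing data.</p>
<h2 id="failure-modes">Failure Modes</h2>
<p>Failure modes split CDC incidents into four layers: database, connector, schema, and downstream.</p>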
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>Signal</th>
          <th>First response</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Slot lag growth</td>
          <td>Retained WAL keeps growing</td>
          <td>Pause heavy writes, fix the connector, evaluate dropping the slot</td>
      </tr>
      <tr>
          <td>Schema break</td>
          <td>Connector fails to parse</td>
          <td>Stop the DDL rollout, add schema evolution handling</td>
      </tr>
      <tr>
          <td>Missing key</td>
          <td>Update / delete lacks an applicable key</td>
          <td>Fix replica identity / key contract</td>
      </tr>
      <tr>
          <td>Duplicate event</td>
          <td>Consumer restart or offset rewind</td>
          <td>Idempotent consumer</td>
      </tr>
      <tr>
          <td>Downstream slow</td>
          <td>Kafka / sink lag grows</td>
          <td>Scale the sink, tune batches, protect the slot</td>
      </tr>
  </tbody>
</table>
<p>Slot lag is the highest-priority signal because it consumes PostgreSQL WAL storage. The runbook needs explicit thresholds for when to pause producers, when to drop the slot, and how to rebuild the snapshot.</p>
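<p>On PostgreSQL 13 and later, one of those thresholds can be enforced by the server itself through <code>max_slot_wal_keep_size</code>. The sketch below is an assumption-laden example: the <code>50GB</code> value is arbitrary, and <code>ALTER SYSTEM</code> is often unavailable on managed services, where the parameter is set through the provider instead.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Cap the WAL that replication slots may retain (illustrative value).
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- wal_status: reserved / extended / unreserved / lost.
-- safe_wal_size: bytes that can still be written before the slot is at risk.
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS safe_wal
FROM pg_replication_slots;
</code></pre></div>
<p>The trade-off is deliberate: once a slot exceeds the cap it becomes <code>lost</code> and the consumer needs a fresh snapshot, which protects the primary's disk at the cost of the CDC pipeline.</p>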
<h2 id="selection-checklist">Selection Checklist</h2>
<p>The selection checklist makes plugin selection reviewable.</p>
<ol>
<li>Whether downstream needs DB-to-DB replication, JSON events, Protobuf events, or connector-managed events.</li>
<li>Whether every table has a stable key and a replica identity.</li>
<li>How the initial snapshot joins up with streaming.</li>
<li>How schema evolution is communicated to consumers.</li>
<li>How slot lag, connector lag, and sink lag are alerted on.</li>
<li>Whether the consumer is idempotent.</li>
<li>How slots / offsets are rebuilt after disaster recovery.</li>
</ol>
<p>Decide on the plugin and connector only after completing this checklist. CDC succeeds when downstream keeps correct data over the long term, not merely when a slot has been created.</p>
<h2 id="下一步路由">Next Steps</h2>
<p>After logical decoding plugins: to implement the CDC pipeline, read <a href="../logical-replication-debezium/">Logical Replication / Debezium</a>; for slot operations, read <a href="../replication-slot-management/">Replication Slot Management</a>; for cross-database migration, read <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>.</p>
]]></content:encoded></item><item><title>PostgreSQL pg_partman Advanced</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pg-partman-advanced/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pg-partman-advanced/</guid><description>&lt;p>PostgreSQL pg_partman advanced is about automating the day-to-day maintenance of &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/table-partitioning/" data-link-title="Table Partitioning" data-link-desc="說明單一資料庫內如何把大表拆成多個分區，並由查詢規劃器只掃相關片段">declarative partitioning&lt;/a>. pg_partman can create future partitions, manage retention, and run maintenance jobs, so that time-based or serial-based partitions no longer depend on manual DDL.&lt;/p>
&lt;p>The reading anchor for this article: pg_partman solves partition lifecycle operations, not partition strategy itself. Partition key, query pattern, retention, indexes, foreign keys, and migration must still be settled first in &lt;a href="../declarative-partitioning/">Declarative Partitioning&lt;/a> and &lt;a href="../partition-redesign/">Partition Redesign&lt;/a>.&lt;/p>
&lt;h2 id="responsibility-boundary">Responsibility Boundary&lt;/h2>
&lt;p>The responsibility boundary separates native PostgreSQL partitioning from pg_partman.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Layer&lt;/th>
 &lt;th>Responsibility&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>PostgreSQL declarative partitioning&lt;/td>
 &lt;td>Partitioned table, constraints, planner pruning&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>pg_partman&lt;/td>
 &lt;td>Future partition premake, retention, maintenance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Scheduler / job runner&lt;/td>
 &lt;td>Run maintenance on a schedule&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DBA / platform&lt;/td>
 &lt;td>Monitoring, backup, DDL review&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Application&lt;/td>
 &lt;td>Query pattern, use of the partition key&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>pg_partman's value lies in reducing repetitive DDL. It does not pick the right partition key for the application, nor does it automatically fix cross-partition query design.&lt;/p>
&lt;h2 id="core-concepts">Core Concepts&lt;/h2>
&lt;p>Core concepts cover the pg_partman operational vocabulary.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Concept&lt;/th>
 &lt;th>Meaning&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Parent table&lt;/td>
 &lt;td>Entry point of the partitioned table&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Child table&lt;/td>
 &lt;td>Partition that actually stores the data&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Premake&lt;/td>
 &lt;td>Create future partitions ahead of time&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Retention&lt;/td>
 &lt;td>Automatically detach / drop old partitions&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Maintenance&lt;/td>
 &lt;td>Job that creates new partitions and applies retention&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Template&lt;/td>
 &lt;td>Template from which child partitions inherit indexes / constraints&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Premake protects against inserts hitting a partition that does not exist. If partition creation falls behind the clock, application inserts fail or land in the default partition; production should alert on the future partition count.&lt;/p>
&lt;p>Retention is a data lifecycle operation. Dropping an old partition is fast, but first confirm legal retention, backups, analytics dependencies, and downstream CDC.&lt;/p>
&lt;h2 id="setup-pattern">Setup Pattern&lt;/h2>
&lt;p>The setup pattern puts pg_partman adoption behind the migration gate.&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXTENSION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IF&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">EXISTS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">pg_partman&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">bigserial&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">tenant_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">uuid&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">timestamptz&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="n">payload&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">jsonb&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NOT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">NULL&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PARTITION&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">RANGE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="p">);&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>實際建立 partman config 要依 pg_partman 版本與 provider 支援文件執行。Managed PostgreSQL 可能限制 extension version、background worker 或 scheduler，因此 setup 前要先確認 provider boundary。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL pg_partman advanced 的核心責任是把 <a href="/blog/backend/knowledge-cards/table-partitioning/" data-link-title="Table Partitioning" data-link-desc="說明單一資料庫內如何把大表拆成多個分區，並由查詢規劃器只掃相關片段">declarative partitioning</a> 的日常維護自動化。pg_partman 可以協助建立未來 partition、管理 retention、執行 maintenance job，讓 time-based 或 serial-based partition 不再依賴人工 DDL。</p>
<p>The reading anchor for this article: pg_partman solves partition lifecycle operations, not partition strategy itself. Partition key, query pattern, retention, indexes, foreign keys, and migration must still be settled first in <a href="../declarative-partitioning/">Declarative Partitioning</a> and <a href="../partition-redesign/">Partition Redesign</a>.</p>
<h2 id="responsibility-boundary">Responsibility Boundary</h2>
<p>The responsibility boundary separates native PostgreSQL partitioning from pg_partman.</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>Responsibility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PostgreSQL declarative partitioning</td>
          <td>Partitioned table, constraints, planner pruning</td>
      </tr>
      <tr>
          <td>pg_partman</td>
          <td>Future partition premake, retention, maintenance</td>
      </tr>
      <tr>
          <td>Scheduler / job runner</td>
          <td>Run maintenance on a schedule</td>
      </tr>
      <tr>
          <td>DBA / platform</td>
          <td>Monitoring, backup, DDL review</td>
      </tr>
      <tr>
          <td>Application</td>
          <td>Query pattern, use of the partition key</td>
      </tr>
  </tbody>
</table>
<p>pg_partman's value lies in reducing repetitive DDL. It does not pick the right partition key for the application, nor does it automatically fix cross-partition query design.</p>
<h2 id="core-concepts">Core Concepts</h2>
<p>Core concepts cover the pg_partman operational vocabulary.</p>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>Meaning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Parent table</td>
          <td>Entry point of the partitioned table</td>
      </tr>
      <tr>
          <td>Child table</td>
          <td>Partition that actually stores the data</td>
      </tr>
      <tr>
          <td>Premake</td>
          <td>Create future partitions ahead of time</td>
      </tr>
      <tr>
          <td>Retention</td>
          <td>Automatically detach / drop old partitions</td>
      </tr>
      <tr>
          <td>Maintenance</td>
          <td>Job that creates new partitions and applies retention</td>
      </tr>
      <tr>
          <td>Template</td>
          <td>Template from which child partitions inherit indexes / constraints</td>
      </tr>
  </tbody>
</table>
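<p>Premake protects against inserts hitting a partition that does not exist. If partition creation falls behind the clock, application inserts fail or land in the default partition; production should alert on the future partition count.</p>
<p>Retention is a data lifecycle operation. Dropping an old partition is fast, but first confirm legal retention, backups, analytics dependencies, and downstream CDC.</p>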
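<p>Both signals, future partition count and default partition rows, can be read straight from the catalog. The sketch below assumes PostgreSQL 12 or later, a hypothetical parent table <code>public.events</code>, and a default partition named <code>public.events_default</code>; adjust the names to the actual schema.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Leaf partitions of the parent with their bounds; compare the upper
-- bounds with now() to count how many future partitions exist.
SELECT t.relid AS partition_name,
       pg_get_expr(c.relpartbound, c.oid) AS bounds
FROM pg_partition_tree('public.events') AS t
JOIN pg_class AS c ON c.oid = t.relid
WHERE t.isleaf
ORDER BY t.relid::text;

-- Rows that matched no range partition (assumed default partition name).
SELECT count(*) AS stray_rows
FROM public.events_default;
</code></pre></div>
<p>A non-zero <code>stray_rows</code> is worth alerting on even when inserts still succeed, because rows in the default partition block the later creation of the partition they belong to.</p>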
<h2 id="setup-pattern">Setup Pattern</h2>
<p>The setup pattern puts pg_partman adoption behind the migration gate.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">pg_partman</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="n">bigserial</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">tenant_id</span><span class="w"> </span><span class="n">uuid</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">created_at</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">created_at</span><span class="p">);</span></span></span></code></pre></div><p>The actual partman config must be created according to the pg_partman version and the provider's support documentation. Managed PostgreSQL may restrict the extension version, background worker, or scheduler, so confirm the provider boundary before setup.</p>
<p>Minimum setup evidence:</p>
<ol>
<li>Extension version.</li>
<li>Parent table DDL.</li>
<li>Partition key and interval.</li>
<li>Premake count.</li>
<li>Retention policy.</li>
<li>Maintenance job schedule.</li>
<li>Test inserts into the current / future partitions.</li>
</ol>
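<p>For orientation only, registering the <code>events</code> table with pg_partman could look like the sketch below. It assumes pg_partman 5.x installed into a schema named <code>partman</code> (drop or change the qualifier if the extension lives elsewhere); on 4.x, <code>create_parent</code> additionally expects <code>p_type := 'native'</code>. The interval, premake, and retention values are illustrative, not recommendations.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Register the parent table (assumed: pg_partman 5.x, schema "partman").
SELECT partman.create_parent(
  p_parent_table := 'public.events',
  p_control      := 'created_at',
  p_interval     := '1 day',
  p_premake      := 7
);

-- Retention lives in part_config; keep_table = true detaches instead of dropping.
UPDATE partman.part_config
SET retention            = '90 days',
    retention_keep_table = true
WHERE parent_table = 'public.events';

-- Run maintenance once by hand to verify premake before scheduling it.
CALL partman.run_maintenance_proc();

-- Evidence: the configuration that is actually in effect.
SELECT parent_table, partition_interval, premake, retention
FROM partman.part_config
WHERE parent_table = 'public.events';
</code></pre></div>
<p>The final query doubles as setup evidence for items 3 to 5 of the list above. Check the function signature against the installed version before running any of this.</p>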
<h2 id="maintenance-runbook">Maintenance Runbook</h2>
<p>The maintenance runbook makes the partition lifecycle observable.</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Meaning</th>
          <th>Response</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>future partition count</td>
          <td>Whether premake is sufficient</td>
          <td>Run maintenance manually, fix the scheduler</td>
      </tr>
      <tr>
          <td>default partition rows</td>
          <td>Routing failure or missing partition</td>
          <td>Create the partition, move the data, fix app timestamps</td>
      </tr>
      <tr>
          <td>old partition count</td>
          <td>Whether retention is running</td>
          <td>Check the policy, legal hold, job errors</td>
      </tr>
      <tr>
          <td>maintenance duration</td>
          <td>DDL / lock / catalog pressure</td>
          <td>Adjust the schedule, split the table</td>
      </tr>
      <tr>
          <td>index build time</td>
          <td>Cost of building child indexes</td>
          <td>Template / concurrent strategy review</td>
      </tr>
  </tbody>
</table>
<p>The maintenance job needs an owner. Cron, pg_cron, a background worker, a Kubernetes job, or a managed scheduler all work; what matters is that a job failure raises an alert and someone acts on it.</p>
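<p>If pg_cron is the chosen scheduler, the job and its failure signal can both live in SQL. This sketch assumes pg_cron 1.4 or later is installed in the same database and that pg_partman sits in a schema named <code>partman</code>; the job name and schedule are made up.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Run partition maintenance at minute 5 of every hour (illustrative schedule).
SELECT cron.schedule(
  'partman-maintenance',
  '5 * * * *',
  $$CALL partman.run_maintenance_proc()$$
);

-- Recent runs; alert on any status other than 'succeeded'.
SELECT jobid, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 20;
</code></pre></div>
<p>Feeding the second query into the existing alerting stack is what turns "the job exists" into "the job has an owner".</p>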
<h2 id="migration-and-backfill">Migration and Backfill</h2>
<p>Migration and backfill covers converting an existing large table into partman-managed partitions. This is usually riskier than adopting pg_partman on a new table.</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Audit</td>
          <td>Table size, query pattern, write rate</td>
      </tr>
      <tr>
          <td>New schema</td>
          <td>Parent table, child partitions, indexes</td>
      </tr>
      <tr>
          <td>Backfill</td>
          <td>Batch size, lag, locks, checksum</td>
      </tr>
      <tr>
          <td>Dual write</td>
          <td>App compatibility</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>Rename / view / routing switch</td>
      </tr>
      <tr>
          <td>Cleanup</td>
          <td>Old table retention, rollback</td>
      </tr>
  </tbody>
</table>
<p>Backfill must keep WAL, replica lag, autovacuum, index bloat, and locks under control. Large tables should first go through a shadow table or the partition redesign playbook rather than being rebuilt in place during peak traffic.</p>
<h2 id="failure-modes">Failure Modes</h2>
<p>Failure modes lists the common pg_partman incidents.</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>Signal</th>
          <th>Fix direction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Future partitions not created</td>
          <td>Inserts fail or the default partition grows</td>
          <td>Add partitions, fix the maintenance schedule</td>
      </tr>
      <tr>
          <td>Retention drops too early</td>
          <td>Queries miss historical data</td>
          <td>Restore from backup, adjust the policy, legal review</td>
      </tr>
      <tr>
          <td>Managed provider lacks support</td>
          <td>Extension / worker restrictions</td>
          <td>Switch to a manual partition job or another provider</td>
      </tr>
      <tr>
          <td>Index / constraint drift</td>
          <td>Child partition schemas diverge</td>
          <td>Template review, schema diff</td>
      </tr>
      <tr>
          <td>Planner pruning not applied</td>
          <td>Query omits the partition key</td>
          <td>Query rewrite, index review</td>
      </tr>
  </tbody>
</table>
<p>pg_partman incidents are usually lifecycle incidents. The runbook should check the maintenance job first, then partition metadata and application queries.</p>
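<p>For the pruning row in the table above, <code>EXPLAIN</code> shows whether the planner can skip partitions. The two queries below run against the <code>events</code> table from the setup section, assumed here to live in schema <code>public</code>; the dates and tenant UUID are placeholders.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Predicate on the partition key: the plan should list only the
-- partition(s) that cover 2026-05-01.
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM public.events
WHERE created_at &gt;= timestamptz '2026-05-01'
  AND created_at &lt;  timestamptz '2026-05-02';

-- No predicate on created_at: every partition appears in the plan.
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM public.events
WHERE tenant_id = '00000000-0000-0000-0000-000000000001';
</code></pre></div>
<p>If the first plan still lists every partition, check that the predicate compares <code>created_at</code> directly, without wrapping it in a function or cast.</p>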
<h2 id="下一步路由">Next Steps</h2>
<p>After pg_partman advanced: for partition design, read <a href="../declarative-partitioning/">Declarative Partitioning</a>; for repartitioning strategy, read <a href="../partition-redesign/">Partition Redesign</a>; for the migration gate, read <a href="../online-schema-change/">Online Schema Change</a>.</p>
]]></content:encoded></item><item><title>PostgreSQL Security / RLS / Audit Logging</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/security-rls-audit-logging/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/security-rls-audit-logging/</guid><description>&lt;p>PostgreSQL security / RLS / audit logging splits database security into access boundaries, row visibility, and operational evidence. PostgreSQL roles / grants decide who can connect and operate on a schema; &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/row-level-security/" data-link-title="Row-Level Security" data-link-desc="說明資料庫如何用 policy 限制同一張表中哪些 row 對某個角色可見或可寫">Row Level Security&lt;/a> decides which rows of a table are visible to a given role; audit logging turns sensitive operations into evidence that can be queried, retained, and alerted on.&lt;/p>
&lt;p>The reading anchor for this article: database security is the downstream line of defense behind application auth. The application still owns identity, session, tenant, and workflow; the PostgreSQL security layer adds least privilege, tenant isolation, and forensic evidence at the data boundary.&lt;/p>
&lt;h2 id="role-and-grant-baseline">Role and Grant Baseline&lt;/h2>
&lt;p>The role and grant baseline separates people, services, migrations, and analytical queries. A production database should at least distinguish an application role, a migration role, a read-only role, an admin role, and a replication / CDC role.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Role type&lt;/th>
 &lt;th>Privilege responsibility&lt;/th>
 &lt;th>Common risk&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Application&lt;/td>
 &lt;td>Product reads and writes&lt;/td>
 &lt;td>Over-broad privileges, can run DDL, can read every schema&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Migration&lt;/td>
 &lt;td>Schema changes&lt;/td>
 &lt;td>Shares a role with the app, making incidents hard to trace&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Read-only&lt;/td>
 &lt;td>Analytics, debug, support&lt;/td>
 &lt;td>Reads PII or cross-tenant data&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replication / CDC&lt;/td>
 &lt;td>Logical replication, slot access&lt;/td>
 &lt;td>Privilege and WAL retention risk&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Admin&lt;/td>
 &lt;td>Emergency operations&lt;/td>
 &lt;td>Admin role used for daily work&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Grant review should start from schema ownership. Tables, sequences, functions, views, and extensions all have a privilege surface; managing only table grants misses sequence updates, function execution, and extension usage.&lt;/p>
&lt;h2 id="row-level-security">Row Level Security&lt;/h2>
&lt;p>Row Level Security enforces row visibility at the database layer. The official PostgreSQL RLS documentation describes how policies can restrict which rows normal queries return and which rows can be inserted, updated, or deleted; this gives the tenant boundary an extra guard at the database layer.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>RLS use case&lt;/th>
 &lt;th>Fit condition&lt;/th>
 &lt;th>Review question&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Multi-tenant SaaS&lt;/td>
 &lt;td>tenant_id is explicit and every query can supply it&lt;/td>
 &lt;td>Does the policy cover SELECT / INSERT / UPDATE?&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Support access&lt;/td>
 &lt;td>Support role needs restricted queries&lt;/td>
 &lt;td>Is break-glass access audited?&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Regional data&lt;/td>
 &lt;td>Rows carry region / residency&lt;/td>
 &lt;td>Does the policy align with GDPR / residency requirements?&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sensitive subset&lt;/td>
 &lt;td>PII rows need special isolation&lt;/td>
 &lt;td>Is masking / tokenization still required?&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
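&lt;p>RLS policies need a positive allow rule. Every table with RLS enabled needs tests: same-tenant reads succeed, cross-tenant rows are isolated, inserts with a mismatched tenant are rejected, and admin / support exceptions are logged.&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL security / RLS / audit logging splits database security into access boundaries, row visibility, and operational evidence. PostgreSQL roles / grants decide who can connect and operate on a schema; <a href="/blog/backend/knowledge-cards/row-level-security/" data-link-title="Row-Level Security" data-link-desc="說明資料庫如何用 policy 限制同一張表中哪些 row 對某個角色可見或可寫">Row Level Security</a> decides which rows of a table are visible to a given role; audit logging turns sensitive operations into evidence that can be queried, retained, and alerted on.</p>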
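<p>The reading anchor for this article: database security is the downstream line of defense behind application auth. The application still owns identity, session, tenant, and workflow; the PostgreSQL security layer adds least privilege, tenant isolation, and forensic evidence at the data boundary.</p>
<h2 id="role-and-grant-baseline">Role and Grant Baseline</h2>
<p>The role and grant baseline separates people, services, migrations, and analytical queries. A production database should at least distinguish an application role, a migration role, a read-only role, an admin role, and a replication / CDC role.</p>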
<table>
  <thead>
      <tr>
          <th>Role type</th>
          <th>Responsibility</th>
          <th>Common risks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application</td>
          <td>Product reads and writes</td>
          <td>Over-broad privileges: can run DDL, can read every schema</td>
      </tr>
      <tr>
          <td>Migration</td>
          <td>Schema changes</td>
          <td>Shares a role with the app, so incidents are hard to trace</td>
      </tr>
      <tr>
          <td>Read-only</td>
          <td>Analytics, debugging, support</td>
          <td>Reads PII or cross-tenant data</td>
      </tr>
      <tr>
          <td>Replication / CDC</td>
          <td>Logical replication, slot access</td>
          <td>Privilege scope and WAL retention risk</td>
      </tr>
      <tr>
          <td>Admin</td>
          <td>Emergency operations</td>
          <td>Admin role used for day-to-day work</td>
      </tr>
  </tbody>
</table>
<p>Grant review starts from schema ownership. Tables, sequences, functions, views, and extensions each carry their own privilege surface; reviewing only table grants misses sequence updates, function execution, and extension usage.</p>
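<p>As a minimal sketch of that baseline, the statements below split ownership from day-to-day access and grant each object type separately. The role names (<code>app_migrator</code>, <code>app_rw</code>, <code>app_ro</code>) and the schema name <code>app</code> are assumptions made for illustration, not names from this article; authentication setup is omitted.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical role and schema names; adjust to your own layout.
CREATE ROLE app_migrator LOGIN;   -- owns the schema and runs DDL
CREATE ROLE app_rw LOGIN;         -- application reads and writes, no DDL
CREATE ROLE app_ro LOGIN;         -- analytics / support, read-only

CREATE SCHEMA app AUTHORIZATION app_migrator;
GRANT USAGE ON SCHEMA app TO app_rw, app_ro;

-- Functions are executable by PUBLIC by default; remove that before granting explicitly.
REVOKE EXECUTE ON ALL FUNCTIONS IN SCHEMA app FROM PUBLIC;

-- Tables, sequences, and functions are separate privilege surfaces.
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA app TO app_rw;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA app TO app_rw;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA app TO app_rw;
GRANT SELECT ON ALL TABLES IN SCHEMA app TO app_ro;

-- Cover objects that app_migrator creates in later migrations.
ALTER DEFAULT PRIVILEGES FOR ROLE app_migrator IN SCHEMA app
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_rw;
ALTER DEFAULT PRIVILEGES FOR ROLE app_migrator IN SCHEMA app
    GRANT USAGE, SELECT ON SEQUENCES TO app_rw;
</code></pre></div>
<p>The <code>GRANT ... ON ALL ... IN SCHEMA</code> forms only affect objects that already exist, which is why the default-privilege statements are included; without them a new table from the next migration would be invisible to the application role.</p>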
<h2 id="row-level-security">Row Level Security</h2>
<p>Row Level Security enforces row visibility inside the database. The official PostgreSQL RLS documentation describes policies that restrict which rows normal queries return and which rows INSERT, UPDATE, and DELETE may touch, which gives the tenant boundary an extra guard at the database layer.</p>
<table>
  <thead>
      <tr>
          <th>RLS use case</th>
          <th>When it fits</th>
          <th>Review question</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-tenant SaaS</td>
          <td>tenant_id is explicit and available to every query</td>
          <td>Does the policy cover SELECT / INSERT / UPDATE / DELETE?</td>
      </tr>
      <tr>
          <td>Support access</td>
          <td>Support role needs restricted queries</td>
          <td>Is break-glass access audited?</td>
      </tr>
      <tr>
          <td>Regional data</td>
          <td>Rows carry a region / residency attribute</td>
          <td>Does the policy align with GDPR / residency requirements?</td>
      </tr>
      <tr>
          <td>Sensitive subset</td>
          <td>PII rows need extra isolation</td>
          <td>Are masking / tokenization still required?</td>
      </tr>
  </tbody>
</table>
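<p>Every RLS policy needs a positive allow rule. Every table with RLS enabled needs tests: same-tenant reads succeed, cross-tenant reads are isolated, inserts with a mismatched tenant are rejected, and admin / support exceptions are logged.</p>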





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">invoices</span><span class="w"> </span><span class="n">ENABLE</span><span class="w"> </span><span class="k">ROW</span><span class="w"> </span><span class="k">LEVEL</span><span class="w"> </span><span class="k">SECURITY</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">POLICY</span><span class="w"> </span><span class="n">tenant_isolation</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">invoices</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">USING</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">current_setting</span><span class="p">(</span><span class="s1">&#39;app.tenant_id&#39;</span><span class="p">)::</span><span class="n">uuid</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">WITH</span><span class="w"> </span><span class="k">CHECK</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">current_setting</span><span class="p">(</span><span class="s1">&#39;app.tenant_id&#39;</span><span class="p">)::</span><span class="n">uuid</span><span class="p">);</span></span></span></code></pre></div><p>This policy relies on the application setting <code>app.tenant_id</code> inside the transaction. Behind a connection pooler, the setting must be scoped to the transaction boundary so that session state does not leak from one request to the next.</p>
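<p>A minimal sketch of the transaction-scoped setting follows. The tenant UUID and the <code>total</code> column are placeholders chosen for illustration. <code>set_config(..., true)</code> has the same transaction-local scope as <code>SET LOCAL</code> and is used here because it accepts a bind parameter from the application driver.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;

-- The third argument (is_local = true) limits the value to this transaction.
SELECT set_config('app.tenant_id', '11111111-1111-1111-1111-111111111111', true);

-- Run as a non-owner, non-superuser role; only rows of the tenant set above are visible.
SELECT id, total FROM invoices;

COMMIT;
-- The transaction-local value no longer applies after COMMIT, so the next
-- request on the same pooled connection has to set it again.
</code></pre></div>
<p>A later transaction on the same connection that forgets to set the value should fail on the <code>uuid</code> cast rather than return another tenant's rows, but that behaviour is worth confirming in your own RLS test instead of assuming it.</p>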
<h2 id="audit-logging">Audit Logging</h2>
<p>Audit logging turns sensitive data operations into queryable evidence. PostgreSQL's native logging can record connections, DDL, errors, and slow queries; extensions such as pgAudit add session-level and object-level auditing.</p>
<table>
  <thead>
      <tr>
          <th>Audit type</th>
          <th>Purpose</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DDL audit</td>
          <td>Track schema changes</td>
          <td>migration id, role, statement, timestamp</td>
      </tr>
      <tr>
          <td>Sensitive read</td>
          <td>Queries on PII / payment / health data</td>
          <td>role, tenant, operation, reason</td>
      </tr>
      <tr>
          <td>Privilege change</td>
          <td>grant / revoke / role changes</td>
          <td>actor, target role, approval</td>
      </tr>
      <tr>
          <td>Failed access</td>
          <td>Permission errors and RLS blocks</td>
          <td>error code, role, relation</td>
      </tr>
      <tr>
          <td>Break-glass</td>
          <td>emergency admin access</td>
          <td>ticket id, duration, review result</td>
      </tr>
  </tbody>
</table>
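<p>Audit logs must reach a SIEM or a centralized log store. Logs that live only on the database host are expensive to query after an incident; the production runbook should define retention, masking, access control, and alerting.</p>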
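<p>The configuration below is one possible starting point for the DDL, privilege-change, and sensitive-read rows of the table. It assumes pgAudit is installed and already listed in <code>shared_preload_libraries</code>; the <code>auditor</code> role and the <code>customers</code> table are hypothetical names.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Assumes pgaudit is in shared_preload_libraries (changing that needs a restart).
CREATE EXTENSION IF NOT EXISTS pgaudit;

-- Session audit: log DDL and role / privilege changes.
ALTER SYSTEM SET pgaudit.log = 'ddl, role';

-- Object audit: statements that use a privilege granted to this role are logged.
CREATE ROLE auditor NOLOGIN;               -- hypothetical role name
ALTER SYSTEM SET pgaudit.role = 'auditor';
GRANT SELECT ON customers TO auditor;      -- hypothetical PII table: audit its reads

-- Native logging for connection evidence.
ALTER SYSTEM SET log_connections = 'on';
ALTER SYSTEM SET log_disconnections = 'on';

SELECT pg_reload_conf();
</code></pre></div>
<p>Object audit keeps log volume lower than logging every read, which matters once the log is shipped to a SIEM. The exact classes to enable are a policy decision and should be checked against the pgAudit documentation for your version.</p>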
<h2 id="pii-and-data-protection-boundary">PII and Data Protection Boundary</h2>
<p>The PII and data protection boundary connects database privileges to the data protection strategy. RLS can restrict row visibility, but protecting PII also requires masking, tokenization, encryption, retention, and deletion evidence.</p>
<table>
  <thead>
      <tr>
          <th>Data type</th>
          <th>Database control</th>
          <th>Cross-module route</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tenant data</td>
          <td>RLS, tenant-scoped role</td>
          <td>data access review</td>
      </tr>
      <tr>
          <td>PII</td>
          <td>column grant, masking view</td>
          <td><a href="/blog/backend/07-security-data-protection/data-protection-and-masking-governance/" data-link-title="7.4 資料保護與遮罩治理" data-link-desc="以問題驅動方式整理資料分級、遮罩、匯出與備份治理">Data Protection</a></td>
      </tr>
      <tr>
          <td>Audit log</td>
          <td>append-only storage, retention</td>
          <td>SIEM / incident evidence</td>
      </tr>
      <tr>
          <td>Deletion request</td>
          <td>tombstone, cascade review</td>
          <td>retention policy, legal hold</td>
      </tr>
  </tbody>
</table>
<p>Column-level grants and masking views suit read-only analysts. The application role usually needs plaintext to run its workflows; analyst and support roles should go through restricted views.</p>
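<p>A short sketch of both controls, assuming a hypothetical table <code>customers(id, tenant_id, email, full_name, created_at)</code> and a read-only role <code>app_ro</code>; the masking rule is only an example, not a recommended redaction format.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Start from no access on the base table.
REVOKE ALL ON customers FROM app_ro;

-- Column-level grant: only non-sensitive columns are readable directly.
GRANT SELECT (id, tenant_id, created_at) ON customers TO app_ro;

-- Masking view: exposes a redacted form of the sensitive column.
CREATE VIEW customers_masked AS
SELECT id,
       tenant_id,
       left(email, 2) || '***@' || split_part(email, '@', 2) AS email_masked,
       created_at
FROM customers;

GRANT SELECT ON customers_masked TO app_ro;
</code></pre></div>
<p>By default a view reads its base table with the view owner's privileges, so RLS policies are evaluated for the owner rather than the caller. PostgreSQL 15 and later offer the <code>security_invoker</code> view option for cases where the caller's policies should apply; which behaviour you want is a design decision to record in the RLS test.</p>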
<h2 id="operational-evidence">Operational Evidence</h2>
<p>Operational evidence makes security settings verifiable. Run a fixed set of checks after every release or privilege change.</p>
<ol>
<li>Role matrix: the schema / table / sequence / function grants held by each role.</li>
<li>RLS test: visibility tests for tenant A, tenant B, support, and admin.</li>
<li>Audit sample: whether DDL, sensitive reads, and failed access reach the log.</li>
<li>Pooler compatibility: whether <code>SET LOCAL app.tenant_id</code> is aligned with the transaction.</li>
<li><a href="/blog/backend/knowledge-cards/break-glass-access/" data-link-title="Break-Glass Access" data-link-desc="說明緊急情況下臨時授予的高權限存取，如何用工單、時限與事後審查治理">Break-glass</a> drill: whether emergency access can be requested, revoked, and reviewed.</li>
</ol>
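<p>Keep the evidence in the release artifact. When security settings exist only as written documentation, it is hard to prove after an incident that they were actually in effect.</p>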
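<p>The role matrix and RLS test items could be captured with queries along these lines. This is a sketch under assumptions: <code>app_rw</code> is a hypothetical application role, the tenant UUID is a placeholder, and the session running the checks must be allowed to see that role's grants and to switch to it.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- 1. Role matrix: table-level grants held by the application role.
SELECT table_schema, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE grantee = 'app_rw'
ORDER BY table_schema, table_name, privilege_type;

-- 2. RLS test: acting as tenant A, no row of another tenant may be visible.
BEGIN;
SET LOCAL ROLE app_rw;
SELECT set_config('app.tenant_id', '11111111-1111-1111-1111-111111111111', true);

SELECT count(*) AS foreign_rows   -- the release gate expects 0
FROM invoices
WHERE tenant_id &lt;&gt; current_setting('app.tenant_id')::uuid;

ROLLBACK;
</code></pre></div>
<p>Storing the query output next to the migration id gives the release artifact a concrete record of what each role could do at that point in time.</p>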
<h2 id="failure-modes">Failure Modes</h2>
<p>This section lists common database security incidents ahead of time.</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>Signal</th>
          <th>Remediation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Over-privileged app role</td>
          <td>App can run DDL / DROP / GRANT</td>
          <td>role split + least privilege</td>
      </tr>
      <tr>
          <td>RLS bypass</td>
          <td>Owner / superuser path or policy gap</td>
          <td>dedicated app role + RLS test</td>
      </tr>
      <tr>
          <td>Pooler state drift</td>
          <td>Tenant setting leaks into the next request</td>
          <td><code>SET LOCAL</code> + transaction pooling review</td>
      </tr>
      <tr>
          <td>Audit gap</td>
          <td>No actor can be found for a sensitive operation</td>
          <td>pgAudit / log schema / SIEM route</td>
      </tr>
      <tr>
          <td>Support overread</td>
          <td>Support role can read every tenant</td>
          <td>masking view + ticket-scoped access</td>
      </tr>
  </tbody>
</table>
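<p>RLS bypass review should focus on the table owner and superuser paths. Table owners bypass policies unless the table is set to <code>FORCE ROW LEVEL SECURITY</code>, and superusers and roles with <code>BYPASSRLS</code> always bypass them. Production application connections should use a dedicated role and avoid running ordinary requests as the table owner.</p>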
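<p>The catalog queries below are one way to review those bypass paths; they use only standard system catalogs, and <code>invoices</code> is the example table from the policy earlier in this article.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Roles that bypass RLS regardless of policies.
SELECT rolname, rolsuper, rolbypassrls
FROM pg_roles
WHERE rolsuper OR rolbypassrls;

-- Tables with RLS enabled but not forced: the owner still bypasses policies.
SELECT c.oid::regclass AS table_name,
       pg_get_userbyid(c.relowner) AS owner
FROM pg_class c
WHERE c.relkind IN ('r', 'p')
  AND c.relrowsecurity
  AND NOT c.relforcerowsecurity;

-- Apply policies to the table owner as well.
ALTER TABLE invoices FORCE ROW LEVEL SECURITY;
</code></pre></div>
<p>Forcing RLS on the owner also affects migrations and maintenance jobs that run as the owner, so it is usually paired with an explicit policy or a separate role for those tasks.</p>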
<h2 id="下一步路由">Next Steps</h2>
<p>Once security, RLS, and audit logging are in place: for privilege and PII governance, read <a href="/blog/backend/07-security-data-protection/data-protection-and-masking-governance/" data-link-title="7.4 資料保護與遮罩治理" data-link-desc="以問題驅動方式整理資料分級、遮罩、匯出與備份治理">Data Protection</a>; for connection state risk, read <a href="../connection-pooler-comparison/">Connection Pooler Comparison</a>; hands-on practice can go into the release gate of the <a href="../hands-on/schema-migration-evidence-lab/">Schema Migration Evidence Lab</a>.</p>
]]></content:encoded></item><item><title>PostgreSQL to YugabyteDB / TiDB Migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-yugabytedb-tidb/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-yugabytedb-tidb/</guid><description>&lt;p>PostgreSQL to YugabyteDB / TiDB migration 的核心責任是處理從 single-primary PostgreSQL 走向 distributed SQL 的資料拓撲變更。這條路線通常由 multi-region write、horizontal scale、tenant sharding、availability 或 single-node capacity ceiling 觸發；其中 YugabyteDB 走 PostgreSQL-compatible YSQL 路線，TiDB 走 MySQL-compatible distributed SQL 路線，兩者的 application diff audit 不同。&lt;/p>
&lt;p>本文的判讀錨點是：API compatibility 只解決入口語法的一部分。YugabyteDB 要審查 PostgreSQL 相容與 distributed operation 差異；TiDB 要額外處理 PostgreSQL → MySQL dialect / driver / tooling 轉換。Distributed SQL 會改變 transaction latency、placement、index cost、DDL、sequence、lock、backup、observability 與 incident route。&lt;/p>
&lt;h2 id="official-documentation-route">Official Documentation Route&lt;/h2>
&lt;p>Official documentation route 的核心責任是把 compatibility claim 固定到可回查來源。YugabyteDB compatibility 先查 &lt;a href="https://docs.yugabyte.com/stable/reference/configuration/postgresql-compatibility/">YugabyteDB PostgreSQL compatibility&lt;/a>；TiDB compatibility 先查 &lt;a href="https://docs.pingcap.com/tidb/stable/mysql-compatibility/">TiDB MySQL compatibility&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="driver-check">Driver Check&lt;/h2>
&lt;p>Driver check 的核心責任是確認 distributed SQL 解決的是核心問題。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>代表需求&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Multi-region write&lt;/td>
 &lt;td>多地使用者都要低延遲寫入&lt;/td>
 &lt;td>consistency level、latency budget&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Horizontal write scaling&lt;/td>
 &lt;td>單 primary CPU / I/O 到頂&lt;/td>
 &lt;td>shard key、hot key、cross-shard txn&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tenant distribution&lt;/td>
 &lt;td>tenant 可依 region / size 分布&lt;/td>
 &lt;td>tenant placement、rebalance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Availability&lt;/td>
 &lt;td>節點 / zone failure 容忍&lt;/td>
 &lt;td>quorum、failover、RPO / RTO&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operational consolidation&lt;/td>
 &lt;td>多 PG shard 想收斂&lt;/td>
 &lt;td>migration complexity、cost&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>若主要問題是 read scaling、connection 數或 query index，先評估 read replica、pooler、partition、Citus 或 Aurora；distributed SQL 適合資料拓撲問題。&lt;/p>
&lt;h2 id="compatibility-audit">Compatibility Audit&lt;/h2>
&lt;p>Compatibility audit 的核心責任是把 PostgreSQL behavior 逐項對照 target。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Protocol / API&lt;/td>
 &lt;td>YugabyteDB YSQL vs TiDB MySQL protocol&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>SQL dialect&lt;/td>
 &lt;td>function、extension、type、DDL support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Transaction&lt;/td>
 &lt;td>isolation、lock、deadlock、retry&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sequence / ID&lt;/td>
 &lt;td>global sequence latency、UUID policy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Index&lt;/td>
 &lt;td>secondary index placement、write cost&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Foreign key&lt;/td>
 &lt;td>distributed FK cost / support&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Extension&lt;/td>
 &lt;td>PostGIS、pgvector、custom extension；TiDB 路線需改寫或拆出&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tooling&lt;/td>
 &lt;td>migration tool、CDC、backup、monitoring&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Compatibility audit 要用 application query suite。只看 schema import 會漏掉 transaction retry、query planner、distributed index、dialect rewrite 與 latency。TiDB 路線還要加 PostgreSQL driver / SQL / type / migration tool 轉 MySQL ecosystem 的審查。&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL to YugabyteDB / TiDB migration handles the data topology change from single-primary PostgreSQL to distributed SQL. This route is usually triggered by multi-region writes, horizontal scale, tenant sharding, availability, or a single-node capacity ceiling. YugabyteDB takes the PostgreSQL-compatible YSQL route and TiDB takes the MySQL-compatible distributed SQL route, so the application diff audit differs between the two.</p>
<p>The reading anchor for this article: API compatibility solves only part of the entry-point syntax. YugabyteDB requires reviewing PostgreSQL compatibility and distributed operation differences; TiDB additionally requires a PostgreSQL → MySQL dialect / driver / tooling conversion. Distributed SQL changes transaction latency, placement, index cost, DDL, sequences, locking, backup, observability, and incident routing.</p>
<h2 id="official-documentation-route">Official Documentation Route</h2>
<p>The official documentation route pins compatibility claims to sources that can be re-checked. For YugabyteDB compatibility, start with <a href="https://docs.yugabyte.com/stable/reference/configuration/postgresql-compatibility/">YugabyteDB PostgreSQL compatibility</a>; for TiDB compatibility, start with <a href="https://docs.pingcap.com/tidb/stable/mysql-compatibility/">TiDB MySQL compatibility</a>. This article was last checked on 2026-05-22.</p>
<h2 id="driver-check">Driver Check</h2>
<p>The driver check confirms that distributed SQL actually solves the core problem. Here "driver" means the motivation for migrating, not a database client driver.</p>
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Underlying need</th>
          <th>Review questions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-region write</td>
          <td>Users in multiple regions all need low-latency writes</td>
          <td>consistency level, latency budget</td>
      </tr>
      <tr>
          <td>Horizontal write scaling</td>
          <td>Single primary CPU / I/O is saturated</td>
          <td>shard key, hot key, cross-shard txn</td>
      </tr>
      <tr>
          <td>Tenant distribution</td>
          <td>Tenants can be distributed by region / size</td>
          <td>tenant placement, rebalance</td>
      </tr>
      <tr>
          <td>Availability</td>
          <td>Tolerate node / zone failure</td>
          <td>quorum, failover, RPO / RTO</td>
      </tr>
      <tr>
          <td>Operational consolidation</td>
          <td>Consolidate multiple PG shards</td>
          <td>migration complexity, cost</td>
      </tr>
  </tbody>
</table>
<p>If the main problem is read scaling, connection count, or query indexing, evaluate read replicas, a pooler, partitioning, Citus, or Aurora first; distributed SQL fits data topology problems.</p>
<h2 id="compatibility-audit">Compatibility Audit</h2>
<p>The compatibility audit compares PostgreSQL behavior against the target item by item.</p>
<table>
  <thead>
      <tr>
          <th>Area</th>
          <th>Review questions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Protocol / API</td>
          <td>YugabyteDB YSQL vs TiDB MySQL protocol</td>
      </tr>
      <tr>
          <td>SQL dialect</td>
          <td>function, extension, type, DDL support</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>isolation, lock, deadlock, retry</td>
      </tr>
      <tr>
          <td>Sequence / ID</td>
          <td>global sequence latency, UUID policy</td>
      </tr>
      <tr>
          <td>Index</td>
          <td>secondary index placement, write cost</td>
      </tr>
      <tr>
          <td>Foreign key</td>
          <td>distributed FK cost / support</td>
      </tr>
      <tr>
          <td>Extension</td>
          <td>PostGIS, pgvector, custom extensions; the TiDB route requires rewriting them or moving them to a separate service</td>
      </tr>
      <tr>
          <td>Tooling</td>
          <td>migration tool, CDC, backup, monitoring</td>
      </tr>
  </tbody>
</table>
<p>The compatibility audit should run the application query suite. Checking only the schema import misses transaction retries, query planner behavior, distributed indexes, dialect rewrites, and latency. The TiDB route also needs an audit of moving the PostgreSQL driver, SQL, types, and migration tooling to the MySQL ecosystem.</p>
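<p>Before running the query suite, an inventory of the source database helps scope the audit rows above. The queries below are a sketch that uses standard PostgreSQL catalogs on the source side; what each finding means on the target still has to be checked against the target's own documentation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Extensions in use: each one needs a supported equivalent or a rewrite on the target.
SELECT extname, extversion
FROM pg_extension
ORDER BY extname;

-- Sequences: candidates for a global-sequence latency review or a UUID policy.
SELECT schemaname, sequencename, increment_by, cache_size
FROM pg_sequences
ORDER BY schemaname, sequencename;

-- Foreign keys: each one may become a distributed check on the target.
SELECT conrelid::regclass  AS child_table,
       confrelid::regclass AS parent_table,
       conname
FROM pg_constraint
WHERE contype = 'f'
ORDER BY 1, 2;
</code></pre></div>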
<h2 id="data-topology">Data Topology</h2>
<p>Data topology decides how data is distributed. Whether distributed SQL succeeds often depends on the primary key, tenant key, region placement, and hot key control.</p>
<table>
  <thead>
      <tr>
          <th>Topology decision</th>
          <th>Question to ask</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution key</td>
          <td>Can queries co-locate the data they touch?</td>
      </tr>
      <tr>
          <td>Region placement</td>
          <td>Does the data need residency / low latency?</td>
      </tr>
      <tr>
          <td>Hot key</td>
          <td>Are writes concentrated on a few high-write tenants / accounts?</td>
      </tr>
      <tr>
          <td>Secondary index</td>
          <td>Do index writes cross shards / regions?</td>
      </tr>
      <tr>
          <td>Transaction span</td>
          <td>Do transactions often span tenants / regions?</td>
      </tr>
  </tbody>
</table>
<p>Topology design starts from the highest-frequency workflow. If the core transaction crosses shards every time, the latency and conflict cost of distributed SQL will be high.</p>
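<p>A hot-key check can be run on the source before any topology is chosen. The query below assumes a hypothetical <code>orders</code> table with <code>tenant_id</code> and <code>created_at</code> columns, and uses recent inserts as a rough proxy for write load.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Share of rows written in the last 7 days, per tenant.
SELECT tenant_id,
       count(*) AS writes_7d,
       round(100.0 * count(*) / sum(count(*)) OVER (), 2) AS pct_of_writes
FROM orders
WHERE created_at &gt;= now() - interval '7 days'
GROUP BY tenant_id
ORDER BY writes_7d DESC
LIMIT 20;
</code></pre></div>
<p>If a handful of tenants account for most writes, a tenant-based distribution key concentrates load on a few shards, and that finding belongs in the topology review.</p>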
<h2 id="migration-phases">Migration Phases</h2>
<p>Migration phases reduce the risk of a cross-topology migration.</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lab import</td>
          <td>schema import, query suite, driver test</td>
      </tr>
      <tr>
          <td>Topology design</td>
          <td>key, placement, region, index review</td>
      </tr>
      <tr>
          <td>Backfill</td>
          <td>snapshot, batch, checksum</td>
      </tr>
      <tr>
          <td>CDC catch-up</td>
          <td>LSN / change stream, lag, idempotency</td>
      </tr>
      <tr>
          <td>Shadow read</td>
          <td>result diff, latency profile</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>freeze, final sync, traffic switch</td>
      </tr>
      <tr>
          <td>Rollback</td>
          <td>source PG snapshot, write replay plan</td>
      </tr>
  </tbody>
</table>
<p>CDC catch-up needs a clear cutover LSN. The worst case in a distributed SQL migration is when both source and target have accepted writes and there is no reconciliation plan.</p>
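<p>On the source PostgreSQL primary, the cutover position and CDC lag could be recorded with queries like these. This is a sketch that assumes logical replication slots feed the CDC pipeline; the <code>orders</code> table and the id range in the last query are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Record the source position once writes are frozen: the candidate cutover LSN.
SELECT pg_current_wal_lsn() AS cutover_lsn;

-- How far each logical slot is behind the current WAL position, in bytes.
SELECT slot_name,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Per-batch evidence for backfill verification; run the same range on the target.
SELECT count(*) AS row_count, min(id) AS min_id, max(id) AS max_id
FROM orders
WHERE id BETWEEN 1 AND 100000;
</code></pre></div>
<p>Traffic should switch only after the consumer has confirmed a position at or beyond the recorded cutover LSN; keeping both values in the cutover log gives the rollback plan a fixed reference point.</p>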
<h2 id="application-changes">Application Changes</h2>
<p>Application changes make the program accept the failure modes of a distributed system.</p>
<ol>
<li>Transaction retry: serialization / conflict errors must be retryable.</li>
<li>Idempotency: critical writes need a natural key or an idempotency key.</li>
<li>Latency budget: cross-region transactions must be included in the SLO.</li>
<li>Pagination / ordering: review the sort cost of distributed queries.</li>
<li>Connection / driver: test the target driver, TLS, pooling, and load balancing.</li>
</ol>
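<p>An application that assumes single-node, low-latency transactions will see gaps in tail latency and retry behavior after migration. The TiDB route also incurs conversion costs for the driver, placeholders, SQL functions, type mapping, and error codes; these should surface first under failure injection in staging.</p>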
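<p>The retry and idempotency items can be made concrete with a write that is safe to repeat. The sketch below uses PostgreSQL syntax with a hypothetical <code>payments</code> table; whether <code>ON CONFLICT</code> behaves identically on YSQL should be verified in the lab import, and the TiDB route needs the MySQL-dialect equivalent instead.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Hypothetical table: the primary key makes a retried write a no-op.
CREATE TABLE payments (
    idempotency_key uuid PRIMARY KEY,
    account_id      bigint NOT NULL,
    amount_cents    bigint NOT NULL,
    created_at      timestamptz NOT NULL DEFAULT now()
);

-- Safe to re-run after a serialization failure (SQLSTATE 40001) or a timeout:
-- a second attempt with the same key hits the conflict and changes nothing.
INSERT INTO payments (idempotency_key, account_id, amount_cents)
VALUES ('22222222-2222-2222-2222-222222222222', 42, 1999)
ON CONFLICT (idempotency_key) DO NOTHING;
</code></pre></div>
<p>The application generates the key once per logical operation and reuses it on every retry, so the retry loop does not need to know whether the previous attempt committed.</p>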
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go conditions stop distributed SQL from being treated as a universal scaling fix.</p>
<table>
  <thead>
      <tr>
          <th>No-go signal</th>
          <th>Alternative route</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Main bottleneck is a handful of slow queries</td>
          <td>query optimization / index</td>
      </tr>
      <tr>
          <td>Most transactions span global data</td>
          <td>Redesign bounded contexts or stay on a single primary</td>
      </tr>
      <tr>
          <td>Team lacks distributed operations skills</td>
          <td>managed provider / simpler topology</td>
      </tr>
      <tr>
          <td>Heavy dependence on PostgreSQL extensions</td>
          <td>Stay on PG or split out a specialized service</td>
      </tr>
      <tr>
          <td>RPO / rollback has not been rehearsed</td>
          <td>Complete the migration playbook first</td>
      </tr>
      <tr>
          <td>Want to keep the PostgreSQL driver / SQL surface</td>
          <td>Evaluate YugabyteDB / CockroachDB / Citus first</td>
      </tr>
  </tbody>
</table>
<p>The value of distributed SQL comes from topology fit. If the workload has no natural distribution boundary, adopting it only trades a single-point bottleneck for distributed complexity.</p>
<h2 id="下一步路由">Next Steps</h2>
<p>After the PostgreSQL to YugabyteDB / TiDB migration material, read <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">Global Distributed OLTP</a> first; if the need is distributed tables inside PostgreSQL, read <a href="../citus-distributed/">Citus Distributed</a>; for the cross-vendor process, read <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>.</p>
]]></content:encoded></item><item><title>Specialized PostgreSQL Variants</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/</guid><description>&lt;p>Specialized PostgreSQL variants 的核心責任是把 PostgreSQL ecosystem 裡的 specialized engines、extensions 與 managed variants 放到正確服務位置。PostgreSQL 的擴充性讓它能支援 geospatial、time-series、vector search、distributed table、serverless branch 與 managed acceleration；但每個變體都改變 operation、migration、cost 與 lock-in。&lt;/p>
&lt;p>本文的判讀錨點是：PostgreSQL compatibility 是入口，不等於相同責任。選 variant 前，要先說清楚新增能力解決哪個 workload，並確認 exit route。&lt;/p>
&lt;h2 id="variant-taxonomy">Variant Taxonomy&lt;/h2>
&lt;p>Variant taxonomy 的核心責任是把變體按資料模型與操作責任分類。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>類型&lt;/th>
 &lt;th>代表&lt;/th>
 &lt;th>主要解決問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Extension domain&lt;/td>
 &lt;td>PostGIS、pgvector、TimescaleDB&lt;/td>
 &lt;td>geospatial、vector、time-series&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Distributed PG&lt;/td>
 &lt;td>Citus、Cosmos DB for PostgreSQL&lt;/td>
 &lt;td>sharding、distributed query&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Managed accelerated PG&lt;/td>
 &lt;td>AlloyDB、Aurora PG&lt;/td>
 &lt;td>managed performance / HA / platform&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Serverless / branching&lt;/td>
 &lt;td>Neon、Supabase workflow&lt;/td>
 &lt;td>preview、branch、稀疏 workload&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Compatibility layer&lt;/td>
 &lt;td>YugabyteDB、部分 distributed SQL&lt;/td>
 &lt;td>PostgreSQL-like API + distributed storage&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>分類的重點是避免把不同變體視為同一種升級。Extension domain 強化單一資料模型；distributed PG 改變資料拓撲；managed accelerated PG 改變操作邊界；serverless PG 改變 lifecycle。&lt;/p>
&lt;h2 id="workload-fit">Workload Fit&lt;/h2>
&lt;p>Workload fit 的核心責任是判斷 variant 是否匹配資料形狀。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Workload&lt;/th>
 &lt;th>合適路線&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Geospatial query&lt;/td>
 &lt;td>PostGIS&lt;/td>
 &lt;td>index、SRID、資料量、query latency&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Time-series retention&lt;/td>
 &lt;td>TimescaleDB / partition strategy&lt;/td>
 &lt;td>compression、chunk、retention&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Vector search&lt;/td>
 &lt;td>pgvector / pgvectorscale&lt;/td>
 &lt;td>recall、latency、index build、hybrid search&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tenant sharding&lt;/td>
 &lt;td>Citus / distributed PG&lt;/td>
 &lt;td>distribution key、co-location、rebalance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Preview environment&lt;/td>
 &lt;td>serverless / branching PG&lt;/td>
 &lt;td>data privacy、branch lifecycle&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cloud-managed acceleration&lt;/td>
 &lt;td>AlloyDB / Aurora&lt;/td>
 &lt;td>compatibility、cost、exit&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Variant 要先證明普通 PostgreSQL 加 index / partition / read replica 已到邊界。若基礎 query design 還沒成熟，導入 variant 會把複雜度提前。&lt;/p>
&lt;h2 id="migration-gap">Migration Gap&lt;/h2>
&lt;p>Migration gap 的核心責任是列出從 vanilla PostgreSQL 進入 variant 的差異。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>差異面&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>DDL&lt;/td>
 &lt;td>extension object、distributed table、chunk&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query&lt;/td>
 &lt;td>planner、function、operator、pushdown&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data movement&lt;/td>
 &lt;td>backfill、reshard、index build&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Operation&lt;/td>
 &lt;td>backup、restore、upgrade、failover&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Tooling&lt;/td>
 &lt;td>ORM、migration tool、CDC、monitoring&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Exit&lt;/td>
 &lt;td>dump / restore 是否回到 vanilla PG&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Migration 要有 compatibility test。每個核心 query 在 variant 上跑 explain、latency、result correctness；每個 migration step 都要有 rollback 或 rebuild path。&lt;/p></description><content:encoded><![CDATA[<p>Specialized PostgreSQL variants place the specialized engines, extensions, and managed variants of the PostgreSQL ecosystem in the right service position. PostgreSQL's extensibility lets it support geospatial, time-series, vector search, distributed tables, serverless branching, and managed acceleration, but every variant changes operations, migration, cost, and lock-in.</p>
<p>The reading anchor for this article: PostgreSQL compatibility is the entry point, not a guarantee of identical responsibilities. Before choosing a variant, state clearly which workload the added capability solves, and confirm the exit route.</p>
<h2 id="variant-taxonomy">Variant Taxonomy</h2>
<p>The variant taxonomy classifies variants by data model and operational responsibility.</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Examples</th>
          <th>Primary problem solved</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Extension domain</td>
          <td>PostGIS, pgvector, TimescaleDB</td>
          <td>geospatial, vector, time-series</td>
      </tr>
      <tr>
          <td>Distributed PG</td>
          <td>Citus, Cosmos DB for PostgreSQL</td>
          <td>sharding, distributed query</td>
      </tr>
      <tr>
          <td>Managed accelerated PG</td>
          <td>AlloyDB, Aurora PG</td>
          <td>managed performance / HA / platform</td>
      </tr>
      <tr>
          <td>Serverless / branching</td>
          <td>Neon, Supabase workflow</td>
          <td>preview, branch, sparse workloads</td>
      </tr>
      <tr>
          <td>Compatibility layer</td>
          <td>YugabyteDB, some distributed SQL engines</td>
          <td>PostgreSQL-like API + distributed storage</td>
      </tr>
  </tbody>
</table>
<p>The point of the classification is to avoid treating different variants as the same kind of upgrade. The extension domain strengthens a single data model; distributed PG changes the data topology; managed accelerated PG changes the operational boundary; serverless PG changes the lifecycle.</p>
<h2 id="workload-fit">Workload Fit</h2>
<p>Workload fit judges whether a variant matches the shape of the data.</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>Suitable route</th>
          <th>Review questions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Geospatial query</td>
          <td>PostGIS</td>
          <td>index, SRID, data volume, query latency</td>
      </tr>
      <tr>
          <td>Time-series retention</td>
          <td>TimescaleDB / partition strategy</td>
          <td>compression, chunk, retention</td>
      </tr>
      <tr>
          <td>Vector search</td>
          <td>pgvector / pgvectorscale</td>
          <td>recall, latency, index build, hybrid search</td>
      </tr>
      <tr>
          <td>Tenant sharding</td>
          <td>Citus / distributed PG</td>
          <td>distribution key, co-location, rebalance</td>
      </tr>
      <tr>
          <td>Preview environment</td>
          <td>serverless / branching PG</td>
          <td>data privacy, branch lifecycle</td>
      </tr>
      <tr>
          <td>Cloud-managed acceleration</td>
          <td>AlloyDB / Aurora</td>
          <td>compatibility, cost, exit</td>
      </tr>
  </tbody>
</table>
<p>A variant must first show that plain PostgreSQL with indexes / partitioning / read replicas has reached its limit. If the basic query design is not yet mature, adopting a variant brings complexity in prematurely.</p>
<h2 id="migration-gap">Migration Gap</h2>
<p>The migration gap lists the differences when moving from vanilla PostgreSQL to a variant.</p>
<table>
  <thead>
      <tr>
          <th>Area of difference</th>
          <th>Review questions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DDL</td>
          <td>extension object, distributed table, chunk</td>
      </tr>
      <tr>
          <td>Query</td>
          <td>planner, function, operator, pushdown</td>
      </tr>
      <tr>
          <td>Data movement</td>
          <td>backfill, reshard, index build</td>
      </tr>
      <tr>
          <td>Operation</td>
          <td>backup, restore, upgrade, failover</td>
      </tr>
      <tr>
          <td>Tooling</td>
          <td>ORM, migration tool, CDC, monitoring</td>
      </tr>
      <tr>
          <td>Exit</td>
          <td>Does dump / restore get back to vanilla PG?</td>
      </tr>
  </tbody>
</table>
<p>Migration needs a compatibility test. Run every core query on the variant and check its EXPLAIN plan, latency, and result correctness; every migration step needs a rollback or rebuild path.</p>
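<p>One way to structure such a compatibility test is to run the same pair of statements on vanilla PostgreSQL and on the variant, then compare the outputs. The <code>orders</code> table and its columns are hypothetical stand-ins for a core query of your own workload.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Plan and latency: compare plan shape, timing, and buffer usage on both sides.
EXPLAIN (ANALYZE, BUFFERS)
SELECT tenant_id, count(*)
FROM orders
WHERE created_at &gt;= now() - interval '1 day'
GROUP BY tenant_id;

-- Result correctness: a stable aggregate over a fixed window must match exactly.
SELECT count(*) AS row_count, sum(amount_cents) AS total_cents
FROM orders
WHERE created_at &gt;= date '2026-01-01'
  AND created_at &lt; date '2026-02-01';
</code></pre></div>
<p>A fixed time window is used for the correctness check so that ongoing writes do not change the expected answer between the two runs.</p>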
<h2 id="lock-in-and-exit">Lock-In and Exit</h2>
<p>Lock-in and exit separates variant-specific capability from portability.</p>
<table>
  <thead>
      <tr>
          <th>Lock-in source</th>
          <th>Control</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Extension-specific type</td>
          <td>adapter layer, domain boundary</td>
      </tr>
      <tr>
          <td>Managed-only feature</td>
          <td>decision record, exit test</td>
      </tr>
      <tr>
          <td>Distributed table DDL</td>
          <td>topology doc, reshard runbook</td>
      </tr>
      <tr>
          <td>Serverless branch API</td>
          <td>dev workflow boundary</td>
      </tr>
      <tr>
          <td>Proprietary index / function</td>
          <td>fallback query / export strategy</td>
      </tr>
  </tbody>
</table>
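<p>Lock-in is acceptable, but it has to be named. If a variant significantly lowers cost or raises capability, adopting it is a reasonable decision; the engineering responsibility is to keep exit evidence and a migration plan.</p>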
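<p>Naming the lock-in can start from the catalog. The query below is a sketch that lists table columns whose type is provided directly by an extension, which is one form of the "extension-specific type" row above; it does not detect array types, domains, or extension functions used in views and indexes.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Columns whose data type is a member of an installed extension.
SELECT c.oid::regclass AS table_name,
       a.attname       AS column_name,
       t.typname       AS type_name,
       e.extname       AS extension
FROM pg_attribute a
JOIN pg_class c     ON c.oid = a.attrelid
JOIN pg_type t      ON t.oid = a.atttypid
JOIN pg_depend d    ON d.classid = 'pg_type'::regclass
                   AND d.objid = t.oid
                   AND d.refclassid = 'pg_extension'::regclass
                   AND d.deptype = 'e'
JOIN pg_extension e ON e.oid = d.refobjid
WHERE c.relkind IN ('r', 'p')
  AND a.attnum &gt; 0
  AND NOT a.attisdropped
ORDER BY 1, 2;
</code></pre></div>
<p>The output can be attached to the decision record so that the exit test knows which columns need an export or fallback strategy.</p>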
<h2 id="decision-matrix">Decision Matrix</h2>
<p>The decision matrix connects variant routing back to the main PostgreSQL chapter.</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Next step</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Geospatial queries are a core product capability</td>
          <td><a href="../postgis-deep-dive/">PostGIS Deep Dive</a></td>
      </tr>
      <tr>
          <td>Time-series data and retention are the main pressure</td>
          <td><a href="../timescaledb-deep-dive/">TimescaleDB Deep Dive</a></td>
      </tr>
      <tr>
          <td>Vector search integrated inside PG</td>
          <td><a href="../pgvector-deep-dive/">pgvector Deep Dive</a></td>
      </tr>
      <tr>
          <td>Tenant sharding / distributed query</td>
          <td><a href="../citus-distributed/">Citus Distributed</a></td>
      </tr>
      <tr>
          <td>Managed provider selection</td>
          <td><a href="../managed-pg-comparison/">Managed PostgreSQL Comparison</a></td>
      </tr>
      <tr>
          <td>Distributed SQL API compatibility evaluation</td>
          <td><a href="../migrate-to-yugabytedb-tidb/">PostgreSQL to YugabyteDB / TiDB</a></td>
      </tr>
  </tbody>
</table>
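<p>The decision matrix should be updated as cases accumulate. Variant selection depends most on the actual workload: data volume, query pattern, SLO, team skill, compliance, and exit cost.</p>
<h2 id="review-checklist">Review Checklist</h2>
<p>The review checklist keeps a specialized variant from being chosen on feature appeal alone.</p>
<ol>
<li>Does the workload really need the specialized capability?</li>
<li>Have vanilla PostgreSQL indexes / partitioning / replicas been evaluated?</li>
<li>What are the version and support policy of the extension / managed feature?</li>
<li>Is there a backup / restore / upgrade runbook?</li>
<li>Are the migration tool, CDC, and observability supported?</li>
<li>Has the exit route been rehearsed at least in staging?</li>
<li>Does the cost model cover storage, compute, I/O, support, and operations?</li>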
</ol>
<p>Only after the checklist is complete can the variant enter a formal proposal. This keeps the flexibility of the PostgreSQL ecosystem while preventing a variant from becoming an invisible platform migration.</p>
<h2 id="下一步路由">Next Steps</h2>
<p>After specialized variants, return to the <a href="../">PostgreSQL overview</a> for service positioning; for a managed provider comparison, read <a href="../managed-pg-comparison/">Managed PostgreSQL Comparison</a>; for cross-vendor migration, read <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a>.</p>
]]></content:encoded></item></channel></rss>