<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Patroni on Tarragon</title><link>https://tarrragon.github.io/blog/tags/patroni/</link><description>Recent content in Patroni on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Mon, 18 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/patroni/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL Patroni HA: the 5-stage failover lifecycle from leader loss to client reconnect</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/</guid><description>&lt;blockquote>
&lt;p>This article is the implementation-layer deep dive for the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview covers where PostgreSQL sits in the OLTP landscape; this article focuses on the lifecycle design of &lt;em>Patroni-based HA&lt;/em>: the 5 stages from normal operation to completed failover, each with its configuration, failure modes, and recovery.&lt;/p>&lt;/blockquote></description><content:encoded><![CDATA[<blockquote>
<p>This article is the implementation-layer deep dive for the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview covers where PostgreSQL sits in the OLTP landscape; this article focuses on the lifecycle design of <em>Patroni-based HA</em>: the 5 stages from normal operation to completed failover, each with its configuration, failure modes, and recovery.</p></blockquote>
<h2 id="failover-lifecycle5-段不是一條曲線">Failover lifecycle: 5 stages, not one curve</h2>
<p>PostgreSQL has no built-in automatic failover: when the primary dies, applications hang until an SRE manually promotes a standby, which typically takes 5-30 minutes end to end. Patroni breaks this chain into an <em>automated 5-stage lifecycle</em>, where each stage has its own trigger, configuration, and failure modes:</p>
<table>
  <thead>
      <tr>
          <th>Stage</th>
          <th>Trigger</th>
          <th>Action</th>
          <th>Failure mode</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>1. Detection</strong></td>
          <td>The leader's heartbeat in the DCS (etcd / Consul) goes missing</td>
          <td>Standbys start watching while the missed-heartbeat time accumulates up to the TTL</td>
          <td>The DCS itself partitions → false detection kicks off an erroneous failover</td>
      </tr>
      <tr>
          <td><strong>2. Election</strong></td>
          <td>The TTL expires and the DCS releases the leader lock</td>
          <td>Standbys race to write the leader key (DCS quorum-based)</td>
          <td>Network partition → both sides believe they are the leader (split-brain)</td>
      </tr>
      <tr>
          <td><strong>3. Promotion</strong></td>
          <td>The new leader's DCS key write succeeds</td>
          <td>Runs <code>pg_ctl promote</code>, stops streaming replication, starts accepting writes</td>
          <td>Standby too far behind → promotion refused, or data missing after takeover</td>
      </tr>
      <tr>
          <td><strong>4. Reconfiguration</strong></td>
          <td>The routing layer's health checks against the Patroni REST API see the new leader</td>
          <td>HAProxy / PgBouncer shift traffic to the new leader</td>
          <td>Slow routing-layer health checks → traffic keeps hitting the old leader</td>
      </tr>
      <tr>
          <td><strong>5. Recovery</strong></td>
          <td>The old leader comes back (manually / automatically)</td>
          <td>Runs <code>pg_rewind</code>, then rejoins streaming replication as a standby</td>
          <td>WAL divergence too large → a fresh base backup is required</td>
      </tr>
  </tbody>
</table>
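<p>Each stage has its own configuration; it is not a matter of "just set one timeout". The sections below walk through them one by one.</p>
<h2 id="stage-1detection--dcs-heartbeat-跟-ttl">Stage 1: Detection (DCS heartbeat and TTL)</h2>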





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># patroni.yml core configuration</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="nt">scope</span><span class="p">:</span><span class="w"> </span><span class="l">myapp-pg-cluster</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l">/db/</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">pg-node-1                               </span><span class="w"> </span><span class="c"># matches the hostname</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="nt">etcd</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">  </span><span class="nt">hosts</span><span class="p">:</span><span class="w"> </span><span class="l">etcd1:2379,etcd2:2379,etcd3:2379      </span><span class="w"> </span><span class="c"># DCS quorum</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">  </span><span class="nt">protocol</span><span class="p">:</span><span class="w"> </span><span class="l">https</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="nt">bootstrap</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">  </span><span class="nt">dcs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="nt">ttl</span><span class="p">:</span><span class="w"> </span><span class="m">30</span><span class="w">                                     </span><span class="c"># leader lock TTL</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="nt">loop_wait</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">                               </span><span class="c"># Patroni main loop interval</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">    </span><span class="nt">retry_timeout</span><span class="p">:</span><span class="w"> </span><span class="m">10</span><span class="w">                           </span><span class="c"># upper bound on DCS retries</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="nt">maximum_lag_on_failover</span><span class="p">:</span><span class="w"> </span><span class="m">1048576</span><span class="w">            </span><span class="c"># standby must be within 1MB of lag to be promoted</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="nt">synchronous_mode</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">                     </span><span class="c"># async / sync trade-off</span></span></span></code></pre></div><p>Key intuitions:</p>
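<ul>
<li><strong>TTL (30s) = how long the leader can be unreachable before it is considered dead</strong>. Set it too short (&lt; 15s) and transient network jitter is mistaken for a dead leader; set it too long (&gt; 60s) and unavailability drags on.</li>
<li><strong>loop_wait + 2 × retry_timeout &lt;= TTL</strong>: Patroni has to refresh the leader key in the DCS before the TTL runs out, even when a DCS call needs to be retried. With <code>loop_wait=10</code> and <code>retry_timeout=10</code> that is 10 + 2 × 10 = 30s, which exactly fits <code>ttl=30</code>.</li>
<li><strong>maximum_lag_on_failover</strong>: a standby whose WAL lag exceeds this threshold <em>does not take part in the election</em>, which prevents the data loss of promoting a standby that is 5 minutes behind.</li>
</ul>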
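<p>The timing rule can be checked mechanically before a config is rolled out. The following is a minimal Python sketch, assuming the three values have already been read from <code>patroni.yml</code>; it uses the rule as Patroni's documentation states it (<code>loop_wait + 2 * retry_timeout &lt;= ttl</code>) plus the 15s / 60s thresholds from this article. The function name <code>check_dcs_timing</code> is hypothetical and not part of Patroni.</p>

```python
def check_dcs_timing(ttl: int, loop_wait: int, retry_timeout: int) -> list[str]:
    """Return human-readable warnings for a Patroni DCS timing triple."""
    warnings = []

    # The leader key must be refreshed before it expires,
    # even when one DCS call has to be retried.
    budget = loop_wait + 2 * retry_timeout
    if budget > ttl:
        warnings.append(
            f"loop_wait + 2*retry_timeout = {budget}s exceeds ttl = {ttl}s"
        )

    # Thresholds taken from the article, not from Patroni itself.
    if ttl < 15:
        warnings.append("ttl < 15s: network jitter may look like a dead leader")
    if ttl > 60:
        warnings.append("ttl > 60s: failover detection will be slow")
    return warnings


if __name__ == "__main__":
    for problem in check_dcs_timing(ttl=20, loop_wait=10, retry_timeout=10):
        print(problem)
```

<p>With the values from the config above (<code>ttl=30</code>, <code>loop_wait=10</code>, <code>retry_timeout=10</code>) the function returns an empty list; lowering <code>ttl</code> to 20 without touching the other two produces one warning.</p>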
<h2 id="stage-2election--dcs-quorum--watchdog-防-split-brain">Stage 2: Election (DCS quorum + watchdog against split-brain)</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="nt">watchdog</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="l">required                               </span><span class="w"> </span><span class="c"># required / automatic / off</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="nt">device</span><span class="p">:</span><span class="w"> </span><span class="l">/dev/watchdog</span><span class="w">
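</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="nt">safety_margin</span><span class="p">:</span><span class="w"> </span><span class="m">5</span></span></span></code></pre></div><p>The biggest risk during election is <em>split-brain</em>: under a network partition the old leader is still alive but cut off from the DCS, a standby is promoted to new leader, and applications end up writing to two PostgreSQL instances at once. Once the data has diverged it <em>cannot be reconciled automatically</em>.</p>
<p>There are two layers of protection:</p>
<ol>
<li><strong>DCS quorum</strong>: run etcd / Consul with at least 3 nodes; the leader key can only be written with a majority quorum, so a minority partition cannot elect a new leader</li>
<li><strong>Watchdog (Linux kernel)</strong>: enforced in <code>required</code> mode. Patroni must periodically <em>poke</em> <code>/dev/watchdog</code>; if Patroni itself crashes or is frozen by the OS, the kernel reboots the whole machine, so the old leader cannot keep accepting writes after losing contact with the DCS</li>
</ol>
<p>Watchdog <code>required</code> is a hard requirement for production-grade setups: <code>automatic</code> silently runs without a watchdog when the device is unavailable, and <code>off</code> disables it, so neither guarantees protection in a split-brain scenario.</p>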
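<p>The majority rule behind the first layer is simple arithmetic. Here is a minimal Python sketch of it, with hypothetical helper names; it models only the counting, not etcd's or Consul's actual consensus protocol.</p>

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def sides_with_quorum(partition_sizes: list[int]) -> list[int]:
    """Return the indexes of the partition sides that still hold a majority."""
    needed = quorum_size(sum(partition_sizes))
    return [i for i, size in enumerate(partition_sizes) if size >= needed]


if __name__ == "__main__":
    # A 3-node DCS split 2 / 1: only the 2-node side may write the leader key.
    print(sides_with_quorum([2, 1]))
    # A 4-node DCS split 2 / 2: neither side has a majority.
    print(sides_with_quorum([2, 2]))
```

<p>Because a strict majority is more than half of the nodes, at most one side of any partition can hold it. An even-sized DCS adds no fault tolerance over the next smaller odd size, since a clean 2 / 2 split leaves no side able to elect.</p>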
<h2 id="stage-3promotion--pg_ctl--replication-slot-切換">Stage 3: Promotion (pg_ctl + replication slot switchover)</h2>
<p>Once the new leader has written the DCS key successfully, Patroni automatically runs:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># Done by Patroni internally; do not run this by hand</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pg_ctl promote -D /var/lib/postgresql/data
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># standby.signal is removed, so primary_conninfo no longer applies</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># PostgreSQL switches to a new timeline ID and writes a timeline history file</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># the node starts accepting writes</span></span></span></code></pre></div><p>Key concerns during promotion:</p>
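<ul>
<li><strong>timeline divergence</strong>: the new leader starts a new timeline ID, forking at the LSN it had reached when the old leader was lost. Standbys that have not received WAL past the fork point can simply follow the new timeline; a node that has (typically the old leader) needs <code>pg_rewind</code> to get back to the fork point first</li>
<li><strong>replication slot handling</strong>: the replication slots that lived on the old leader are stale as far as the DCS is concerned, and the new leader recreates them; a logical replication consumer that is not idempotent will see some messages replayed</li>
<li><strong>promotion latency</strong>: typically 3-10 seconds (<code>pg_ctl promote</code> itself takes &lt; 5s, plus the DCS write confirmation)</li>
</ul>
<h2 id="stage-4reconfiguration--client-routing-切換">Stage 4: Reconfiguration (client routing switchover)</h2>
<p>PostgreSQL promoting a new leader is not enough on its own, because applications do not know about it; a routing layer in front has to redirect them. A typical topology:</p>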





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[client] → [HAProxy / pgBouncer] → [pg-node-1 (leader)]
</span></span><span class="line"><span class="ln">2</span><span class="cl">                                 → [pg-node-2 (standby, read)]
</span></span><span class="line"><span class="ln">3</span><span class="cl">                                 → [pg-node-3 (standby, read)]</span></span></code></pre></div><p>The Patroni REST API exposes the <code>/leader</code>, <code>/replica</code>, and <code>/health</code> endpoints, and HAProxy runs its <em>health checks</em> against them:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl"># haproxy.cfg
</span></span><span class="line"><span class="ln">2</span><span class="cl">backend pg-write
</span></span><span class="line"><span class="ln">3</span><span class="cl">  option httpchk OPTIONS /leader
</span></span><span class="line"><span class="ln">4</span><span class="cl">  http-check expect status 200
</span></span><span class="line"><span class="ln">5</span><span class="cl">  server pg-node-1 pg-node-1:5432 check port 8008
</span></span><span class="line"><span class="ln">6</span><span class="cl">  server pg-node-2 pg-node-2:5432 check port 8008 backup
</span></span><span class="line"><span class="ln">7</span><span class="cl">  server pg-node-3 pg-node-3:5432 check port 8008 backup</span></span></code></pre></div><p>Key delays during reconfiguration:</p>
<ul>
<li>HAProxy health check interval (default 2s) × failure threshold (default 3 consecutive failures) = ~6s to notice the switch</li>
<li>PgBouncer does not run health checks of its own; reconnection is triggered by application-side retries and dropped connections</li>
<li>End-to-end reconfiguration usually takes 10-20s (including PostgreSQL promotion time)</li>
</ul>
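<p>The HAProxy side of this can be sketched in a few lines of Python. This is an illustrative model only, assuming the check interval and failure threshold correspond to HAProxy's <code>inter</code> and <code>fall</code> server settings and that only the node holding the leader lock answers <code>/leader</code> with HTTP 200; the function names are hypothetical.</p>

```python
def seconds_to_mark_down(inter_s: float = 2.0, fall: int = 3) -> float:
    """Time for a failing server to be marked down: `fall` failed checks in a row."""
    return inter_s * fall


def writable_backends(status_by_node: dict[str, int]) -> list[str]:
    """Nodes that pass an `expect status 200` check against /leader."""
    return sorted(node for node, status in status_by_node.items() if status == 200)


if __name__ == "__main__":
    print(seconds_to_mark_down())  # default interval and threshold
    # After a failover: pg-node-2 now holds the leader lock.
    observed = {"pg-node-1": 503, "pg-node-2": 200, "pg-node-3": 503}
    print(writable_backends(observed))
```

<p>Shortening the interval speeds up detection but increases check traffic to the Patroni REST API, so the two values are best tuned together with the TTL rather than in isolation.</p>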
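<h2 id="stage-5recovery--pg_rewind-跟-base-backup-取捨">Stage 5: Recovery (pg_rewind vs. base backup)</h2>
<p>When the old leader comes back it becomes a standby, but its WAL has already diverged, so one recovery path has to be chosen:</p>
<ul>
<li><strong><code>pg_rewind</code></strong>: rewinds the old leader to the divergence point and then reattaches streaming replication. Conditions: the diverged WAL is small (&lt; a few GB), the timelines can be aligned, and the cluster runs with <code>wal_log_hints = on</code> or data checksums, which <code>pg_rewind</code> requires</li>
<li><strong>Fresh base backup</strong>: pulls a complete base + WAL from the new leader with <code>pg_basebackup</code>. Conditions: always possible, but slow (1-4 hours at TB scale)</li>
</ul>
<p>With <code>use_pg_rewind</code> enabled, Patroni tries pg_rewind first; it only falls back to a fresh base backup on its own when the <code>remove_data_directory_*</code> options are enabled as well. A production configuration:</p>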





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln">1</span><span class="cl"><span class="nt">postgresql</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="nt">use_pg_rewind</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="nt">remove_data_directory_on_rewind_failure</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">   </span><span class="c"># on rewind failure, wipe the data dir and take a base backup</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="nt">remove_data_directory_on_diverged_timelines</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span></span></span></code></pre></div><h2 id="production-故障演練">Production failure drills</h2>
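<h3 id="case-1split-brain-due-to-dcs-partition">Case 1: Split-brain due to DCS partition</h3>
<p><strong>Symptoms</strong>: two PostgreSQL nodes are both accepting writes, and applications see a flood of write conflicts / unique constraint violations.</p>
<p><strong>Root cause</strong>: a DCS (etcd) partition. Two etcd nodes sat on opposite sides of the partition and each believed it had quorum; in reality it was a split vote and neither side should have proceeded. Patroni elected a leader on each side.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>The DCS must have an odd number of nodes (3 / 5 / 7), with majority quorum strictly enforced</li>
<li>When the DCS spans AZs / regions, size the quorum with partition probability in mind (one node in each of 3 AZs is the production minimum)</li>
<li>Watchdog <code>required</code> mode is the last line of defense: if a DCS partition coincides with a quorum failure, the watchdog forces a reboot of the node that lost contact</li>
</ol>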
<h3 id="case-2standby-落後太多無法-failover">Case 2: Standbys too far behind, failover impossible</h3>
<p><strong>Symptoms</strong>: after the primary is lost, the Patroni log shows <code>Following members have lag greater than maximum_lag_on_failover</code>, every standby is refused promotion, and the cluster is unavailable.</p>
<p><strong>Root cause</strong>: maximum_lag_on_failover is set to 1MB, but standby replication lag had grown to 50MB (write-heavy workload + slow disk on the standby). The safety mechanism did its job, but the price is that <em>no standby can be promoted</em>; someone has to relax the threshold manually or wait for a standby to catch up.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Prevention</strong>: match standby capacity / IO to the primary so lag does not build up; a Prometheus alert on <code>pg_replication_lag_bytes &gt; 10MB</code> catches it before it becomes a problem</li>
<li><strong>Stopgap</strong>: run <code>patronictl edit-config</code> to temporarily raise maximum_lag_on_failover to 50MB, accepting the possible loss of up to 50MB worth of writes in exchange for availability</li>
<li><strong>Long term</strong>: synchronous replication (one standby forced to be synchronous), which guarantees at least one zero-lag standby</li>
</ol>
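<p>The eligibility decision in this case comes down to comparing byte offsets. Below is a minimal Python sketch, assuming LSNs are available in PostgreSQL's textual <code>X/Y</code> form (two hexadecimal halves of a 64-bit position); the function names and the node data are hypothetical, and Patroni's own implementation may differ.</p>

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000000' into an absolute byte position."""
    high, low = lsn.split("/")
    return int(high, 16) * 2**32 + int(low, 16)


def eligible_standbys(
    leader_lsn: str,
    standby_lsns: dict[str, str],
    maximum_lag_on_failover: int = 1048576,
) -> list[str]:
    """Standbys whose lag behind the last known leader position is within the limit."""
    leader_pos = lsn_to_bytes(leader_lsn)
    return sorted(
        name
        for name, lsn in standby_lsns.items()
        if leader_pos - lsn_to_bytes(lsn) <= maximum_lag_on_failover
    )


if __name__ == "__main__":
    standbys = {"pg-node-2": "0/4FF0000", "pg-node-3": "0/1E00000"}
    print(eligible_standbys("0/5000000", standbys))
    # Stopgap from the fix list: relax the limit to 50MB.
    print(eligible_standbys("0/5000000", standbys, 50 * 1024 * 1024))
```

<p>In this made-up example <code>pg-node-3</code> is exactly 50MB behind, so it only becomes a candidate once the limit is relaxed, which is the availability-for-data-loss trade described in the stopgap.</p>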
<h3 id="case-3promotion-後-application-connection-storm">Case 3: Application connection storm after promotion</h3>
<p><strong>Symptoms</strong>: within 30-120 seconds after failover completes, application logs fill with <code>connection refused</code> / <code>password authentication failed</code>, and the applications' own retries turn into a retry storm.</p>
<p><strong>Root cause</strong>: the new leader has only just been promoted and is still warming up (shared memory / cache not primed) while applications flood it with connection requests that pile up against <code>max_connections</code>; the application retries lack jitter, so the queue keeps building.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Applications use <em>exponential backoff with jitter</em> instead of retrying immediately</li>
<li>PgBouncer / a connection pool caps the number of connections each application instance opens to PG; no direct connections to PG</li>
<li>Run <code>pg_prewarm</code> on the standbys ahead of time to warm the cache for hot tables, so cache misses do not spike after promotion</li>
</ol>
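<h3 id="case-4pg_rewind-失敗退到-base-backup-沒做">Case 4: pg_rewind fails and the base backup fallback never runs</h3>
<p><strong>Symptoms</strong>: after the old leader comes back, the Patroni log shows <code>pg_rewind failed</code>, the old leader stays in STARTING and cannot rejoin the cluster; it only recovers after an SRE runs pg_basebackup by hand.</p>
<p><strong>Root cause</strong>: <code>remove_data_directory_on_rewind_failure: false</code> (the default). When rewind fails, Patroni does not wipe the data dir on its own and waits for an SRE to step in; operations had no runbook and was stuck at this step for hours.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>In production, set <code>remove_data_directory_on_rewind_failure: true</code> + <code>remove_data_directory_on_diverged_timelines: true</code> so that Patroni falls back automatically</li>
<li>Keep the data dir on a dedicated PV / disk so that wiping it is a contained risk (not on the root disk)</li>
<li>Capacity planning: include the estimated base backup time in the RTO (a TB-scale base backup takes 1-4 hours, which a 30-minute RTO cannot absorb)</li>
</ol>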
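<h3 id="case-5watchdog-觸發整機-reboot誤殺">Case 5: Watchdog triggers a full-machine reboot (false kill)</h3>
<p><strong>Symptoms</strong>: a production server reboots unexpectedly with no actual fault, and <code>dmesg</code> shows <code>watchdog: BUG: soft lockup</code>. (That message comes from the kernel's soft-lockup detector, which is separate from the <code>/dev/watchdog</code> device Patroni pokes, so treat it as a hint of a stalled system rather than proof of the cause.)</p>
<p><strong>Root cause</strong>: Patroni's main loop stalled for 60+ seconds on a briefly slow etcd, and the kernel watchdog triggered a reboot; PostgreSQL itself was not hung, the Patroni-watchdog chain was simply oversensitive.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Give the HA loop more headroom by raising <code>ttl</code> rather than <code>safety_margin</code>: the watchdog fires <code>safety_margin</code> seconds before the leader key expires, so the loop has roughly <code>ttl - safety_margin - loop_wait</code> seconds to complete (15s with the values above), and a larger <code>safety_margin</code> (10-15) shrinks that window instead of widening it</li>
<li>Deploy etcd and Patroni on a low-latency network (same AZ, &lt; 5ms); cross-region etcd is not recommended</li>
<li>Watchdog device: softdog (software emulation) vs. a hardware watchdog; softdog is easier to observe while debugging</li>
</ol>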
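<p>How much stall the HA loop can absorb before the watchdog resets the machine follows from three settings. Here is a minimal Python sketch, assuming the relationship given in Patroni's watchdog documentation (the watchdog expires <code>safety_margin</code> seconds before the leader key TTL, leaving at least <code>ttl - safety_margin - loop_wait</code> seconds for one loop); the function name is hypothetical.</p>

```python
def ha_loop_window(ttl: int, loop_wait: int, safety_margin: int) -> int:
    """Seconds one HA loop iteration may take before the watchdog fires."""
    return ttl - safety_margin - loop_wait


if __name__ == "__main__":
    # Values used in this article's examples.
    print(ha_loop_window(ttl=30, loop_wait=10, safety_margin=5))
    # Raising safety_margin narrows the window ...
    print(ha_loop_window(ttl=30, loop_wait=10, safety_margin=15))
    # ... while raising ttl widens it, at the cost of slower detection.
    print(ha_loop_window(ttl=60, loop_wait=10, safety_margin=5))
```

<p>The trade-off is explicit here: every second of extra tolerance for a slow DCS is a second added to the worst-case detection time in Stage 1.</p>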
<h2 id="容量規劃">Capacity planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Estimate</th>
          <th>Watch out</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cluster size</td>
          <td>3-5 nodes (leader + 2-4 standbys)</td>
          <td>&lt; 3 is not real HA (with a single standby, losing it leaves no failover candidate)</td>
      </tr>
      <tr>
          <td>DCS size</td>
          <td>3 / 5 / 7 nodes (odd-numbered quorum)</td>
          <td>5-node etcd is the production standard</td>
      </tr>
      <tr>
          <td>TTL</td>
          <td>30s (default 30, production 20-60)</td>
          <td>&lt; 15s is oversensitive, &gt; 60s is sluggish</td>
      </tr>
      <tr>
          <td>maximum_lag_on_failover</td>
          <td>1MB (default)</td>
          <td>Large, write-heavy tables can go to 10-100MB</td>
      </tr>
      <tr>
          <td>Synchronous standby</td>
          <td>1 sync + N async is the usual production setup</td>
          <td>All-async loses data easily; all-sync blows up write latency</td>
      </tr>
      <tr>
          <td>RTO</td>
          <td>10-45 seconds (detection up to 30s, bounded by TTL, + promotion 5-10s + reconfig ~5s)</td>
          <td>&gt; 60s means the chain needs an audit</td>
      </tr>
      <tr>
          <td>RPO</td>
          <td>Close to 0 in sync mode; same order of magnitude as the lag in async mode</td>
          <td>In async mode, lag can reach MB-GB levels when disk IO is slow</td>
      </tr>
  </tbody>
</table>
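<p>The RTO row is a sum of per-stage budgets, which makes it easy to recompute when any one setting changes. This is a minimal Python sketch under the assumption that worst-case detection time equals the TTL; the function names and the 60-second audit threshold default mirror the table and are otherwise hypothetical.</p>

```python
def worst_case_rto(ttl_s: float, promotion_s: float, reconfig_s: float) -> float:
    """Upper bound on failover time: detection (TTL) + promotion + routing switch."""
    return ttl_s + promotion_s + reconfig_s


def needs_audit(rto_s: float, threshold_s: float = 60.0) -> bool:
    """Flag a failover chain whose worst-case RTO exceeds the audit threshold."""
    return rto_s > threshold_s


if __name__ == "__main__":
    rto = worst_case_rto(ttl_s=30, promotion_s=10, reconfig_s=5)
    print(rto, needs_audit(rto))
```

<p>A clean shutdown of the leader usually releases the lock without waiting for the TTL, so the observed RTO is often well below this bound; the bound is what should be compared against the SLO.</p>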
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-pgbouncer-整合">Integrating with <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PgBouncer</a></h3>
<p>PgBouncer is not aware of Patroni failovers on its own; it relies on:</p>
<ol>
<li><strong>HAProxy in the layer above PgBouncer</strong>: HAProxy runs the Patroni health checks, and PgBouncer connections are re-routed</li>
<li><strong>PgBouncer reload</strong>: after failover, an SRE or automation runs <code>pgbouncer -R</code> to force reconnection to the backend (<code>-R</code> performs an online restart and is marked deprecated in recent PgBouncer releases; the admin console's <code>RECONNECT</code> command is an alternative worth checking against your version)</li>
<li><strong>Connection pool drain</strong>: set <code>pool_lifetime_max=5min</code> on the application-side connection pool so that old connections age out naturally</li>
</ol>
<h3 id="跟-cert-managertls-rotation">With cert-manager (TLS rotation)</h3>
<p>The Patroni REST API and PostgreSQL streaming replication both use TLS, and cert rotation must not interrupt service:</p>
<ol>
<li>After cert-manager rotates a certificate, both Patroni and PostgreSQL need a reload (not a restart)</li>
<li><code>patronictl reload &lt;cluster&gt;</code> does not trigger a failover; it only reloads the config</li>
<li>PostgreSQL <code>pg_ctl reload</code> sends SIGHUP and picks up the new cert smoothly</li>
</ol>
<h3 id="跟-backup--pitr">With backup / PITR</h3>
<p>Patroni does not manage backups, but after a standby is promoted the WAL archive has to line up with the new leader's timeline:</p>
<ol>
<li><code>archive_command</code> only has the <code>%p</code> (path) and <code>%f</code> (file name) placeholders, with no separate timeline placeholder; the timeline ID is carried in the first 8 hex digits of each WAL segment file name, so <code>archive_command = 'wal-g wal-push %p'</code> archives it implicitly</li>
<li>Backup tools (pgBackRest / WAL-G) handle timeline switches, so archiving is not interrupted</li>
<li>See the <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PITR + WAL archiving deep article</a></li>
</ol>
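<p>Since the timeline travels inside the WAL file name, an archive listing can be checked for a timeline switch without any extra metadata. Below is a minimal Python sketch, assuming standard 24-hex-digit WAL segment names (8 digits of timeline ID followed by 16 digits of segment position); the function name and the sample file names are hypothetical.</p>

```python
def wal_timeline(segment_name: str) -> int:
    """Extract the timeline ID from a 24-hex-digit WAL segment file name."""
    if len(segment_name) != 24:
        raise ValueError(f"not a WAL segment name: {segment_name!r}")
    int(segment_name, 16)  # raises ValueError if the name is not pure hex
    return int(segment_name[:8], 16)


if __name__ == "__main__":
    archived = [
        "000000010000000000000003",  # written by the old leader
        "000000020000000000000004",  # written after promotion
    ]
    print(sorted({wal_timeline(name) for name in archived}))
```

<p>Seeing more than one timeline in the archive after a failover is expected; what matters for PITR is that the matching timeline history file was archived too, which is covered in the linked article.</p>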
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Multi-region Patroni</strong>: DCS quorum design for cross-region deployments, where the trade-offs are entirely different from a single region</li>
<li><strong>PostgreSQL 16+ streaming replication slot persistence</strong>: simplifies reconnecting logical consumers after a standby promotion</li>
<li><strong>Kubernetes operator integration</strong>: when Patroni runs on K8s, the StatefulSet + pod identity + DCS deployment model</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Upstream chapter: <a href="/blog/backend/01-database/high-concurrency-access/" data-link-title="1.1 高併發下的 SQL 讀寫邊界" data-link-desc="說明高併發服務如何共用資料庫 client、控制 transaction、管理 connection pool、避免資料庫成為瓶頸">High Concurrency Access</a>, covering the full connection / replication / HA chain</li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer configuration</a> / <a href="/blog/backend/07-security-data-protection/vendors/hashicorp-vault/dynamic-credential/" data-link-title="HashiCorp Vault Dynamic Credential：lease 治理跟 application 整合的實作層" data-link-desc="Vault database secrets engine 怎麼配、application 怎麼 renew lease、production 五大踩雷（lease 過期 race、DB max_connections 撞牆、Vault sealed、token expire、scope 過寬）、容量規劃跟 vault-agent injector 整合">Vault Dynamic Credential</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item></channel></rss>