<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Wal-Archive on Tarragon</title><link>https://tarrragon.github.io/blog/tags/wal-archive/</link><description>Recent content in Wal-Archive on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Mon, 18 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/wal-archive/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL PITR + WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/</guid><description>&lt;blockquote>
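&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. The overview already establishes that backup / recovery is a must-have OLTP capability; this article focuses on &lt;em>the two-track data design behind PITR (Point-In-Time Recovery) and five production failure modes&lt;/em>.&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">Problem scenario&lt;/h2>
&lt;p>A logical bug ships to production and runs for 6 hours before anyone notices: a batch job has set user.email to NULL on 500,000 rows. The obvious options all fall short:&lt;/p>
&lt;ul>
&lt;li>Restore the latest daily backup (last night): loses all of today's legitimate writes (orders, sign-ups)&lt;/li>
&lt;li>Promote a standby: the standby has already replicated the bug and is in the same state as the primary&lt;/li>
&lt;li>Rebuild from application logs: some operations are irreversible (emails already sent)&lt;/li>
&lt;/ul>
&lt;p>PITR is the standard answer to this kind of &lt;em>logical disaster&lt;/em>: instead of restoring to the backup's point in time, you &lt;em>restore to the moment just before the bug hit&lt;/em> (for example, 1 minute earlier). It needs two tracks of data, &lt;em>base backup + WAL archive&lt;/em>: the base backup is a snapshot, the WAL archive holds every write made after that snapshot, and recovery replays WAL up to a specified timestamp / LSN / transaction ID.&lt;/p>
&lt;h2 id="核心概念base-backup--wal-archive-的雙軌設計">Core concept: the two-track design of base backup + WAL archive&lt;/h2>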





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">[Base backup t0] + [WAL archive t0 → now]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓ ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> full snapshot incremental log
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> ↓ ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> └────── recover to t_target ──→ [restored cluster at t_target]&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The two tracks are independent but must line up:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Base backup&lt;/strong>: a snapshot of the entire data directory at some point in time. &lt;code>pg_basebackup&lt;/code> / &lt;code>pgBackRest&lt;/code> / &lt;code>WAL-G&lt;/code> all produce one; it is typically run &lt;em>daily or weekly&lt;/em>&lt;/li>
&lt;li>&lt;strong>WAL archive&lt;/strong>: every WAL segment from the start of the base backup onward is pushed to external storage (S3 / GCS / NFS). &lt;code>archive_command&lt;/code> drives this, and PostgreSQL only &lt;em>recycles&lt;/em> a segment once it has been archived successfully&lt;/li>
&lt;/ol>
&lt;p>Together they determine the RPO (recovery point objective):&lt;/p>
&lt;ul>
&lt;li>RPO ≈ WAL archive frequency (near real time with streaming; with file-based archiving it is bounded by &lt;code>archive_timeout&lt;/code>, which defaults to 0, i.e. disabled, and is commonly set to 60 seconds)&lt;/li>
&lt;li>RPO is not the base backup frequency: daily base backup + WAL archived every minute → RPO of about 1 minute&lt;/li>
&lt;/ul>
&lt;p>The RTO (recovery time objective) depends on &lt;em>base backup size + the amount of WAL to replay&lt;/em>:&lt;/p></description><content:encoded><![CDATA[<blockquote>
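<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. The overview already establishes that backup / recovery is a must-have OLTP capability; this article focuses on <em>the two-track data design behind PITR (Point-In-Time Recovery) and five production failure modes</em>.</p></blockquote>
<h2 id="問題情境">Problem scenario</h2>
<p>A logical bug ships to production and runs for 6 hours before anyone notices: a batch job has set user.email to NULL on 500,000 rows. The obvious options all fall short:</p>
<ul>
<li>Restore the latest daily backup (last night): loses all of today's legitimate writes (orders, sign-ups)</li>
<li>Promote a standby: the standby has already replicated the bug and is in the same state as the primary</li>
<li>Rebuild from application logs: some operations are irreversible (emails already sent)</li>
</ul>
<p>PITR is the standard answer to this kind of <em>logical disaster</em>: instead of restoring to the backup's point in time, you <em>restore to the moment just before the bug hit</em> (for example, 1 minute earlier). It needs two tracks of data, <em>base backup + WAL archive</em>: the base backup is a snapshot, the WAL archive holds every write made after that snapshot, and recovery replays WAL up to a specified timestamp / LSN / transaction ID.</p>
<h2 id="核心概念base-backup--wal-archive-的雙軌設計">Core concept: the two-track design of base backup + WAL archive</h2>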





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">[Base backup t0]  +  [WAL archive t0 → now]
</span></span><span class="line"><span class="ln">2</span><span class="cl">     ↓                       ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">  full snapshot          incremental log
</span></span><span class="line"><span class="ln">4</span><span class="cl">     ↓                       ↓
</span></span><span class="line"><span class="ln">5</span><span class="cl">     └────── recover to t_target ──→ [restored cluster at t_target]</span></span></code></pre></div><p>The two tracks are independent but must line up:</p>
<ol>
<li><strong>Base backup</strong>: a snapshot of the entire data directory at some point in time. <code>pg_basebackup</code> / <code>pgBackRest</code> / <code>WAL-G</code> all produce one; it is typically run <em>daily or weekly</em></li>
<li><strong>WAL archive</strong>: every WAL segment from the start of the base backup onward is pushed to external storage (S3 / GCS / NFS). <code>archive_command</code> drives this, and PostgreSQL only <em>recycles</em> a segment once it has been archived successfully</li>
</ol>
<p>Together they determine the RPO (recovery point objective):</p>
<ul>
<li>RPO ≈ WAL archive frequency (near real time with streaming; with file-based archiving it is bounded by <code>archive_timeout</code>, which defaults to 0, i.e. disabled, and is commonly set to 60 seconds)</li>
<li>RPO is not the base backup frequency: daily base backup + WAL archived every minute → RPO of about 1 minute</li>
</ul>
<p>The RTO (recovery time objective) depends on <em>base backup size + the amount of WAL to replay</em>:</p>
<ul>
<li>Restoring the base backup: ~1-4 hours (TB scale)</li>
<li>WAL replay time ≈ accumulated archive volume / replay throughput</li>
</ul>
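<p>The RTO relationship above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function name <code>estimate_rto_minutes</code> and all four throughput and size inputs are assumptions chosen for the example, not measured values.</p>

```bash
#!/usr/bin/env bash
# Rough RTO estimate in minutes: base backup restore time + WAL replay time.
# All four inputs are hypothetical planning numbers, not measurements.
estimate_rto_minutes() {
  local base_gb=$1       # size of the base backup to restore, in GB
  local restore_mbs=$2   # restore throughput, in MB/s
  local wal_gb=$3        # WAL accumulated since the base backup, in GB
  local replay_mbs=$4    # WAL replay throughput, in MB/s
  local restore_s=$(( base_gb * 1024 / restore_mbs ))
  local replay_s=$(( wal_gb * 1024 / replay_mbs ))
  echo $(( (restore_s + replay_s) / 60 ))
}

# Example: 2 TB base backup at 200 MB/s, 50 GB of WAL replayed at 50 MB/s
estimate_rto_minutes 2048 200 50 50   # prints 191
```

<p>With these assumed numbers the estimate is a little over 3 hours, which sits inside the 1-4 hour range quoted for TB-scale restores; real throughput should come from a restore drill.</p>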
<h2 id="step-by-step-配置">Step-by-step configuration</h2>
<h3 id="primaryarchive_command-設好">Primary: set up archive_command</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">wal_level</span> <span class="o">=</span> <span class="s">replica                          # default is replica; PITR needs at least this</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">archive_mode</span> <span class="o">=</span> <span class="s">on                            # enable archiving</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">archive_command</span> <span class="o">=</span> <span class="s">&#39;wal-g wal-push %p&#39;        # or pgBackRest / a custom script</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">archive_timeout</span> <span class="o">=</span> <span class="s">60                         # force a segment switch at least every 60s when writes are sparse</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">max_wal_size</span> <span class="o">=</span> <span class="s">4GB</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">checkpoint_timeout</span> <span class="o">=</span> <span class="s">15min</span></span></span></code></pre></div><p><code>archive_command</code> must <em>return exit code 0 only on success</em>; on a non-zero exit PostgreSQL keeps retrying the same segment, and while it keeps failing WAL piles up in <code>pg_wal</code> until the disk fills. <strong>Critical: archive_command must never fail silently.</strong></p>
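<p>To make the exit-code contract concrete, here is a minimal sketch of a fail-loud archive step. It copies to a local directory purely for illustration; the function name <code>archive_segment</code> and the destination directory are hypothetical, and a production setup would still be better served by pgBackRest / WAL-G.</p>

```bash
#!/usr/bin/env bash
# Sketch of a fail-loud archive step. archive_segment and the destination
# directory are hypothetical; a real setup would push to S3 / GCS instead.
archive_segment() {
  local src=$1        # what PostgreSQL passes as %p
  local dest_dir=$2   # hypothetical archive directory
  local dest
  dest="$dest_dir/$(basename "$src")"

  # Refuse to overwrite a segment that has already been archived.
  if [ -e "$dest" ]; then
    echo "archive: $dest already exists" >&2
    return 1
  fi
  # Copy under a temporary name so a partial file never appears under the
  # final name, and report any failure with a non-zero status.
  if ! cp "$src" "$dest.tmp"; then
    echo "archive: copy of $src failed" >&2
    rm -f "$dest.tmp"
    return 1
  fi
  mv "$dest.tmp" "$dest"
}
```

<p>Every failure path returns non-zero and writes to stderr, so PostgreSQL retries the segment and the failure shows up in the server log instead of disappearing.</p>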
<h3 id="用-pgbackrest-取代手寫-script">Use pgBackRest instead of a hand-written script</h3>
<p>Writing your own archive script for production is strongly discouraged, since pgBackRest / WAL-G / Barman have already worked through the edge cases:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># pgbackrest.conf</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">[global]</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="na">repo1-type</span><span class="o">=</span><span class="s">s3</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">repo1-s3-bucket</span><span class="o">=</span><span class="s">mybucket</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">repo1-s3-region</span><span class="o">=</span><span class="s">us-east-1</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># keep 4 full backups</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">repo1-retention-full</span><span class="o">=</span><span class="s">4</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># keep 8 differential backups</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="na">repo1-retention-diff</span><span class="o">=</span><span class="s">8</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># encrypt at rest (also requires repo1-cipher-pass)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="na">repo1-cipher-type</span><span class="o">=</span><span class="s">aes-256-cbc</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># parallel processes for backup / restore / archive</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="na">process-max</span><span class="o">=</span><span class="s">8</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="k">[main]</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="na">pg1-path</span><span class="o">=</span><span class="s">/var/lib/postgresql/16/main</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># run a full backup</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgbackrest --stanza<span class="o">=</span>main backup --type<span class="o">=</span>full
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># postgresql.conf setting (not a shell command): archive through pgBackRest</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># archive_command = &#39;pgbackrest --stanza=main archive-push %p&#39;</span></span></span></code></pre></div><p>pgBackRest handles: parallel push, compression, encryption, checksums, archive replay timing, the backup catalog, and automatic retention cleanup.</p>
<h3 id="restorerecovery_target_time">Restore: recovery_target_time</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. Pull the base backup from the S3 repo</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgbackrest --stanza<span class="o">=</span>main --type<span class="o">=</span><span class="nb">time</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --target<span class="o">=</span><span class="s2">&#34;2026-05-18 14:30:00+00&#34;</span> restore
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 2. Start PostgreSQL: it enters recovery mode and replays WAL up to the target time</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># (pgBackRest has already written recovery.signal + postgresql.auto.conf)</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"># 3. After confirming the target timestamp was reached, promote</span>
</span></span><span class="line"><span class="ln">9</span><span class="cl">pg_ctl promote -D /var/lib/postgresql/16/main</span></span></code></pre></div><p>Three kinds of recovery target are commonly used:</p>
<ul>
<li><strong><code>recovery_target_time</code></strong>: recover to a timestamp</li>
<li><strong><code>recovery_target_xid</code></strong>: recover to a transaction ID (easy to pinpoint only if your logs record xids)</li>
<li><strong><code>recovery_target_lsn</code></strong>: recover to a WAL LSN (the most precise, but the LSN has to be recorded in advance)</li>
</ul>
<p>Production mostly uses timestamps, because application logs carry timestamps that make the target easy to locate.</p>
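<h2 id="故障演練--邊界-case">Failure drills / edge cases</h2>
<h3 id="case-1archive_command-靜默失敗">Case 1: archive_command fails silently</h3>
<p><strong>Symptom</strong>: during a PITR test the DBA finds that the last 3 days of WAL are missing from S3, yet PostgreSQL raised no alert and <code>pg_wal</code> shows no backlog (the segments were recycled long ago).</p>
<p><strong>Root cause</strong>: archive_command was a shell wrapper around <code>aws s3 cp %p s3://bucket/... 2&gt;/dev/null</code> that swallowed both the error message and the failure status. The redirect alone only hides stderr and does not change the exit code; the code is lost when the wrapper finishes with something that succeeds (a trailing <code>|| true</code>, a later command, or a script that never propagates the status). PostgreSQL sees exit code 0, treats the segment as archived, advances and recycles the old WAL, and the archive never actually receives it.</p>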
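<p>The difference between hiding the message and hiding the status is easy to check in a shell. In the sketch below, <code>failing_upload</code> is a hypothetical stand-in for an upload that fails; the other function names are likewise made up for the demonstration.</p>

```bash
#!/usr/bin/env bash
# failing_upload stands in for an upload that fails (hypothetical helper).
failing_upload() { echo "upload failed" >&2; return 1; }

# Redirecting stderr hides the message but keeps the exit status.
quiet_but_honest() { failing_upload 2>/dev/null; }

# A trailing "|| true" (or any later command that succeeds) replaces the
# status with 0, which PostgreSQL would read as "segment archived".
silent_fail() { failing_upload 2>/dev/null || true; }

# Print how the caller, i.e. PostgreSQL, would see each variant.
report() {
  if "$1"; then echo "$1: exit 0"; else echo "$1: non-zero exit"; fi
}

report quiet_but_honest   # prints "quiet_but_honest: non-zero exit"
report silent_fail        # prints "silent_fail: exit 0"
```

<p>Only the second variant produces the silent data loss described in this case; the first still triggers PostgreSQL's retry, it is just harder to debug.</p>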
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Never mask the exit code</strong>: archive_command must <em>fail loud</em> and return non-zero on failure</li>
<li><strong>Use pgBackRest / WAL-G</strong> instead of a hand-written shell script</li>
<li><strong>Monitoring</strong>: alert on archive lag</li>
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">last_archived_time</span><span class="p">,</span><span class="w"> </span><span class="n">now</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">last_archived_time</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">lag</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_archiver</span><span class="p">;</span></span></span></code></pre></div><p>Alert if lag &gt; 5 minutes. This catches a stalled archiver; a command that falsely reports success still advances <code>last_archived_time</code>, so it is only caught by checking the archive itself, as in the next step.</p>
<ol start="4">
<li><strong>Test restores regularly</strong>: run a PITR drill every month, actually restoring from the archive and verifying the recovered timestamp</li>
</ol>
<h3 id="case-2wal-archive-lagprimary-disk-壓力">Case 2: WAL archive lag and primary disk pressure</h3>
<p><strong>Symptom</strong>: the <code>pg_wal</code> directory keeps growing and <code>df -h</code> shows 90%+; <code>pg_stat_archiver</code> shows <code>failed_count</code> climbing and a <code>last_failed_time</code> of 30 minutes ago; archive_command cannot get its writes out (S3 throttling / slow network).</p>
<p><strong>Root cause</strong>: archive_command writes to S3, S3 rate-limits or the connection times out, and PostgreSQL retries; meanwhile WAL stays in <code>pg_wal</code> and cannot be recycled, so disk usage keeps growing.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Prevention</strong>: retries inside <code>archive_command</code> plus parallel push (pgBackRest: <code>archive-async=y</code> together with <code>process-max</code>)</li>
<li><strong>Alert</strong>: <code>pg_stat_archiver.failed_count</code> increasing, plus primary disk usage &gt; 80%</li>
<li><strong>Emergency</strong>: temporarily point archive_command at local NFS or another storage target and sync to S3 once it recovers; do not simply disable archiving (that leaves a gap in the WAL chain and loses data for PITR)</li>
<li><strong>Architecture</strong>: keep at least two copies of the archive across regions so that a single storage failure does not stop archiving</li>
</ol>
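<p>The alert rule in the list above can be sketched as a small decision function. The function name <code>archiver_alert</code> and the three levels are hypothetical; the inputs are assumed to be two consecutive samples of <code>pg_stat_archiver.failed_count</code> and the disk usage percentage of the volume holding <code>pg_wal</code>.</p>

```bash
#!/usr/bin/env bash
# Decide an alert level from two samples of failed_count and the disk usage
# percentage. Name and levels are hypothetical; the 80% threshold is the one
# given in the list above.
archiver_alert() {
  local prev_failed=$1 cur_failed=$2 disk_pct=$3
  local failing=0 disk_high=0
  if [ "$cur_failed" -gt "$prev_failed" ]; then failing=1; fi
  if [ "$disk_pct" -gt 80 ]; then disk_high=1; fi

  if [ "$failing" -eq 1 ] && [ "$disk_high" -eq 1 ]; then
    echo "page"   # archiving is failing and the disk is filling up
  elif [ "$failing" -eq 1 ] || [ "$disk_high" -eq 1 ]; then
    echo "warn"   # only one of the two signals is present
  else
    echo "ok"
  fi
}

archiver_alert 10 12 91   # prints "page"
```

<p>Combining the two signals keeps a brief S3 hiccup at warning level while still paging before the disk actually fills.</p>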
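<h3 id="case-3recovery-跑到-wrong-target-time">Case 3: recovery runs to the wrong target time</h3>
<p><strong>Symptom</strong>: after the PITR restore the data looks like <em>a chunk is missing</em>; the DBA realizes too late that the target time was set 30 minutes too early, recovery has already promoted, subsequent WAL is now on a new timeline, and there is no way to roll forward from here.</p>
<p><strong>Root cause</strong>: recovery is irreversible. Once promote opens a new timeline, the remaining WAL of the old timeline is no longer replayed on it; to recover to a later timestamp you must <em>restore the base backup and replay WAL again from scratch</em>.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong><code>recovery_target_action = pause</code></strong> (available since PostgreSQL 9.5 and the default when a recovery target is set with <code>hot_standby</code> on; set it explicitly so a tool or config change cannot switch it to <code>promote</code> unnoticed): recovery <em>pauses</em> at the target time instead of promoting automatically, and the DBA queries the data by hand and promotes only once it looks right</li>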
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">recovery_target_time</span> <span class="o">=</span> <span class="s">&#39;2026-05-18 14:30:00+00&#39;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">recovery_target_action</span> <span class="o">=</span> <span class="s">pause</span></span></span></code></pre></div><ol start="2">
<li><strong>Iterate on PITR attempts</strong>: restore into an <em>isolated staging cluster</em>, verify the target time there, and only then run against production</li>
<li><strong>Record where the target time came from</strong>: cross-check application logs / event timestamps to avoid time zone mix-ups (<code>+00</code> UTC vs local time)</li>
</ol>
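<p>For the time zone point, one way to avoid mistakes is to convert the timestamp from the application log into UTC mechanically rather than by hand. The sketch below assumes GNU <code>date</code> (its <code>-d</code> option is not available in BSD / macOS <code>date</code>); the function name <code>to_utc_target</code> is hypothetical.</p>

```bash
#!/usr/bin/env bash
# Convert a local timestamp with an explicit offset into the UTC form used
# for recovery_target_time. Assumes GNU date (the -d option).
to_utc_target() {
  date -u -d "$1" '+%Y-%m-%d %H:%M:%S+00'
}

# 22:30 at UTC+8 is 14:30 UTC
to_utc_target '2026-05-18 22:30:00 +0800'   # prints 2026-05-18 14:30:00+00
```

<p>Always passing an explicit offset on the input side means the result does not depend on the time zone of the machine running the conversion.</p>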
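<h3 id="case-4base-backup-過期未清storage-爆">Case 4: expired base backups never purged, storage blows up</h3>
<p><strong>Symptom</strong>: the S3 backup bucket grows from 200GB to 5TB within six months; only then does the DBA discover that retention was never configured and 180 days of daily base backups are being kept.</p>
<p><strong>Root cause</strong>: the hand-written archive_command script has no retention logic, or pgBackRest was configured with <code>repo1-retention-full=180</code> and nobody noticed; the database itself keeps growing and daily full backups accumulate on top.</p>
<p><strong>Fix</strong>:</p>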





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># pgBackRest retention: 4 full + auto-expire archive</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># keep 4 full backups</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">repo1-retention-full</span><span class="o">=</span><span class="s">4</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># keep 8 differential backups</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">repo1-retention-diff</span><span class="o">=</span><span class="s">8</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># keep WAL archive aligned with the retained full backups</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">repo1-retention-archive</span><span class="o">=</span><span class="s">4</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="na">repo1-retention-archive-type</span><span class="o">=</span><span class="s">full</span></span></span></code></pre></div><p>Note that <code>repo1-retention-full</code> counts backups, not days: with daily full backups a value of 4 covers about 4 days, while weekly fulls plus daily differentials cover about 4 weeks.</p><p>Storage budgeting:</p>
<ul>
<li>daily full + diff + WAL archive ≈ 1-2x DB size per day</li>
<li>4-week retention → ~30-60x DB size of storage</li>
<li>cross-region replication → a further 2-3x</li>
</ul>
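<p>The budgeting rule of thumb above is plain multiplication, sketched here so the inputs are explicit. The function name <code>storage_budget_gb</code> and the example numbers are assumptions for illustration only.</p>

```bash
#!/usr/bin/env bash
# Storage budget in GB = DB size x daily backup factor x retention days x copies.
# Hypothetical helper; integer inputs only.
storage_budget_gb() {
  local db_gb=$1          # current database size in GB
  local daily_factor=$2   # backup + WAL volume per day, as a multiple of DB size
  local days=$3           # retention window in days
  local copies=$4         # stored copies (e.g. 2 with one cross-region replica)
  echo $(( db_gb * daily_factor * days * copies ))
}

# 500 GB database, 1x DB size per day, 28 days, 2 copies
storage_budget_gb 500 1 28 2   # prints 28000
```

<p>In this example a 500 GB database ends up needing roughly 28 TB of backup storage, which is why retention settings deserve a review before the bill does it for you.</p>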
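<h3 id="case-5timeline-分歧後-recovery-模糊">Case 5: ambiguous recovery after timelines diverge</h3>
<p><strong>Symptom</strong>: production has been through one failover (Patroni promote) followed by one PITR; now another PITR is needed, targeting the moment just before the failover, but the archive holds more than one timeline and it is unclear which one the recovery target refers to.</p>
<p><strong>Root cause</strong>: every promote opens a new timeline ID (recorded in a <code>.history</code> file); in archive storage the same LSN can exist on different timelines; a recovery target time near the divergence point is ambiguous.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Set <strong><code>recovery_target_timeline</code></strong> explicitly to the timeline to follow (the default is <code>latest</code> since PostgreSQL 12 and <code>current</code> before that)</li>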
</ol>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">recovery_target_time</span> <span class="o">=</span> <span class="s">&#39;2026-05-15 10:00:00+00&#39;</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">recovery_target_timeline</span> <span class="o">=</span> <span class="s">&#39;3&#39;                 # follow timeline 3</span></span></span></code></pre></div><ol start="2">
<li><strong>Get familiar with <code>.history</code> files</strong>: <code>/wal_archive/000000XX.history</code> records the timeline switch points; read it before a PITR</li>
<li><strong>Prevention</strong>: take a new base backup <em>immediately</em> after every promote to simplify future PITR (no need to cross timelines)</li>
</ol>
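<p>When inspecting the archive it helps to read the timeline straight from the file names: a WAL segment name is 24 hex digits and the first 8 are the timeline ID. The helper name <code>wal_timeline</code> below is hypothetical.</p>

```bash
#!/usr/bin/env bash
# Extract the timeline ID (first 8 hex digits) from a WAL segment file name
# and print it as a decimal number.
wal_timeline() {
  local name=$1
  local tli_hex
  tli_hex=$(printf '%s' "$name" | cut -c1-8)
  printf '%d\n' "0x$tli_hex"
}

wal_timeline 000000030000000A000000FF   # prints 3
```

<p>Listing the archive and grouping segments by this value shows quickly which timelines exist and where each one starts.</p>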
<h2 id="容量--cost-規劃">Capacity / cost planning</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Estimate</th>
          <th>Watch for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base backup size</td>
          <td>Proportional to the size of the DB data directory (after compression by the backup tool)</td>
          <td>Each backup ~0.5-1x DB size</td>
      </tr>
      <tr>
          <td>WAL archive size</td>
          <td>~5-50GB / day depending on write volume</td>
          <td>A write-heavy 1TB DB can produce 100GB+ / day</td>
      </tr>
      <tr>
          <td>Storage retention</td>
          <td>4-12 weeks is typical</td>
          <td>30-60x DB size budget</td>
      </tr>
      <tr>
          <td>Base backup time</td>
          <td>1-4 hours at TB scale</td>
          <td>Run in a maintenance window</td>
      </tr>
      <tr>
          <td>Restore time</td>
          <td>base backup restore + WAL replay</td>
          <td>TB-scale PITR typically takes 2-6 hours</td>
      </tr>
      <tr>
          <td>Network bandwidth</td>
          <td>100-500 Mbps during a full backup</td>
          <td>Watch egress cost across regions</td>
      </tr>
  </tbody>
</table>
<p>Practical defaults:</p>
<ul>
<li>Daily full backup + 4 weeks retention</li>
<li>WAL archive every 60s (<code>archive_timeout = 60</code>)</li>
<li>Cross-region replication (S3 → S3 cross-region)</li>
<li>Monthly restore drill to verify the backups are usable</li>
</ul>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-patroni-ha-整合">Integrating with <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>Patroni does not manage backups, but the timeline switch after a promotion affects the archive:</p>
<ol>
<li>archive_command only provides <code>%p</code> (path) and <code>%f</code> (file name); there is no timeline placeholder. The timeline ID is already the first 8 hex digits of every WAL file name, so segments from different timelines do not overwrite each other; the archive command should still refuse to overwrite an existing file</li>
<li>Patroni's <code>recovery_conf</code> includes <code>restore_command</code>, so a standby being cloned can pull WAL from the archive</li>
<li>Run a <em>full backup</em> after every Patroni failover to simplify future PITR</li>
</ol>
<h3 id="跟-logical-replication-對位">How it compares with <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">logical replication</a></h3>
<p>PITR and logical replication serve different use cases:</p>
<ul>
<li>PITR is for <em>disaster recovery</em> (logical bug / corruption): a full restore to a point in time</li>
<li>Logical replication is for <em>continuous sync</em>: real-time replication to Kafka or across databases</li>
</ul>
<p>Both <em>depend on WAL</em> but have different goals; they can run on the same PostgreSQL instance at the same time without conflict.</p>
<h3 id="跟-monitoring--alert">Monitoring + alerts</h3>
<p>Key metrics:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- archive health
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_stat_archiver</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- archived_count, failed_count, last_archived_wal, last_archived_time
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
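</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- WAL segments currently in pg_wal (a growing count suggests an archive backlog; the exact backlog is the number of .ready files in pg_wal/archive_status)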
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_ls_waldir</span><span class="p">()</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">&#39;^[0-9A-F]{24}$&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- time of the last base backup
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1">-- (pgBackRest info output or backup catalog)</span></span></span></code></pre></div><p>Three Prometheus alerts: archive failed_count increasing, archive lag &gt; 5min, no base backup for &gt; 25h.</p>
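<p>For the exact archive backlog, PostgreSQL keeps a <code>.ready</code> marker per segment in <code>pg_wal/archive_status</code> until archiving succeeds, then renames it to <code>.done</code>. A minimal sketch of counting those markers from the shell follows; the function name <code>count_ready</code> is hypothetical and the script is assumed to run as a user that can read the data directory.</p>

```bash
#!/usr/bin/env bash
# Count segments still waiting to be archived by counting *.ready markers
# in the archive_status directory.
count_ready() {
  local status_dir=$1   # e.g. "$PGDATA/pg_wal/archive_status"
  local n=0 f
  for f in "$status_dir"/*.ready; do
    # When nothing matches, the glob stays literal, so check existence.
    if [ -e "$f" ]; then n=$(( n + 1 )); fi
  done
  echo "$n"
}
```

<p>A value that stays near zero means the archiver is keeping up; a steadily rising value is the backlog that eventually fills the disk in Case 2.</p>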
<h3 id="下一步議題">Topics for follow-up</h3>
<ul>
<li><strong>Incremental backup (PG 17+)</strong>: base backups no longer have to be full every time, only a base plus incrementals</li>
<li><strong>Block-level differential</strong>: already supported by pgBackRest</li>
<li><strong>Cloud-native alternatives</strong>: RDS / Aurora use storage-layer snapshots instead of a self-managed PITR chain</li>
<li><strong><code>pg_dump</code> vs PITR</strong>: pg_dump is a logical backup (restoring into a different schema or major version is fine), PITR is physical (it requires the same major version and architecture)</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Upstream chapter: <a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">Database Migration Playbook</a> (PITR is the rollback path when a migration fails)</li>
<li>Sibling deep articles: <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a> / <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> / <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item></channel></rss>