<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Partition on Tarragon</title><link>https://tarrragon.github.io/blog/tags/partition/</link><description>Recent content in Partition on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/partition/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL Partition Redesign: When Monthly Partitions Keep Getting Slower</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/partition-redesign/</guid><description>&lt;blockquote>
&lt;p>This article is an implementation-layer deep dive under the &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview. It is the second dogfood of &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F "Topology re-layout"&lt;/a> (the first was &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding&lt;/a>) and tests whether the Type F anatomy generalizes across vendors.&lt;/p>&lt;/blockquote>
&lt;h2 id="為什麼-monthly-partition-越跑越慢">Why monthly partitions keep getting slower&lt;/h2>
&lt;p>At launch, monthly range partitioning was a reasonable design: one partition per month, twelve per year, and partition pruning let a query such as &lt;code>WHERE event_time &amp;gt;= '2026-05-01'&lt;/code> touch a single partition, so reads were fast. After 18 months in production, the picture changed:&lt;/p>
&lt;ul>
&lt;li>Monthly partition size grew from 50GB to 500GB (10x traffic)&lt;/li>
&lt;li>A query inside one month, such as &lt;code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'&lt;/code>, still scans the whole 500GB month, because pruning granularity stops at the month&lt;/li>
&lt;li>Vacuuming a single monthly partition takes 6-8 hours and no longer fits the maintenance window&lt;/li>
&lt;li>Dropping old partitions frees storage on a monthly cadence, while the retention policy calls for daily granularity&lt;/li>
&lt;/ul>
&lt;p>The partition layout needs a &lt;em>redesign&lt;/em>, not an "optimization": move from monthly to daily range partitions, which takes the partition count from 36 (3-year retention) to 1095 (3 × 365).&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is an implementation-layer deep dive under the <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview. It is the second dogfood of <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Type F "Topology re-layout"</a> (the first was <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis cluster re-sharding</a>) and tests whether the Type F anatomy generalizes across vendors.</p></blockquote>
<h2 id="為什麼-monthly-partition-越跑越慢">Why monthly partitions keep getting slower</h2>
<p>At launch, monthly range partitioning was a reasonable design: one partition per month, twelve per year, and partition pruning let a query such as <code>WHERE event_time &gt;= '2026-05-01'</code> touch a single partition, so reads were fast. After 18 months in production, the picture changed:</p>
<ul>
<li>Monthly partition size grew from 50GB to 500GB (10x traffic)</li>
<li>A query inside one month, such as <code>WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15'</code>, still scans the whole 500GB month, because pruning granularity stops at the month</li>
<li>Vacuuming a single monthly partition takes 6-8 hours and no longer fits the maintenance window</li>
<li>Dropping old partitions frees storage on a monthly cadence, while the retention policy calls for daily granularity</li>
</ul>
<p>The partition layout needs a <em>redesign</em>, not an "optimization": move from monthly to daily range partitions, which takes the partition count from 36 (3-year retention) to 1095 (3 × 365).</p>
<p><a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> results:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Assessment</th>
          <th>Level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>Same PostgreSQL, same table definition, partition key unchanged</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>Same PostgreSQL operational stack</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Same OLTP RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Same single DB</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>None (partition pruning is transparent)</td>
          <td>Low</td>
      </tr>
      <tr>
          <td><strong>Data topology</strong></td>
          <td><strong>Partition strategy changes from monthly to daily</strong></td>
          <td><strong>High</strong></td>
      </tr>
  </tbody>
</table>
<p>Five dimensions at Low plus data topology at High = <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">Type F "Topology re-layout"</a>.</p>
<h2 id="pre-layout-analysispartition-不平衡偵測">Pre-layout analysis: detecting partition imbalance</h2>
<p>Quantify the current topology before starting the redesign:</p>





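<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. size and estimated row count per partition (pg_relation_size counts the main heap only; use pg_total_relation_size to include indexes and TOAST)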
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">partition_name</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">pg_size_pretty</span><span class="p">(</span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">size</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">child</span><span class="p">.</span><span class="n">reltuples</span><span class="p">::</span><span class="nb">bigint</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">estimated_rows</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">  </span><span class="n">pg_stat_get_last_vacuum_time</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">last_vacuum</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">pg_inherits</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">parent</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhparent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">pg_class</span><span class="w"> </span><span class="n">child</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">pg_inherits</span><span class="p">.</span><span class="n">inhrelid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">parent</span><span class="p">.</span><span class="n">relname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;events&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">pg_relation_size</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">oid</span><span class="p">)</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 2. does partition pruning hit?
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"></span><span class="k">EXPLAIN</span><span class="w"> </span><span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span><span class="w"> </span><span class="n">BUFFERS</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2026-05-15&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="c1">-- expected: 1 monthly partition scanned today; about 15 daily partitions under the target layout
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1">-- observed: under the monthly layout the query spans only 15 days, yet the whole monthly partition (~500GB) is scanned
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="c1">-- 3. find partition imbalance
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">  </span><span class="n">to_char</span><span class="p">(</span><span class="n">event_time</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY-MM&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">month</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">  </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">row_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span><span class="w">
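</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="c1">-- find hot and cold months to estimate the distribution after the redesign (this scans all of events: run it off-peak)</span></span></span></code></pre></div><p>Outputs of the pre-layout phase:</p>
<ul>
<li><strong>Current topology, quantified</strong>: 36 monthly partitions, 1.8TB total, largest partition 500GB, smallest 50GB</li>
<li><strong>Hot-key distribution</strong>: 80% of traffic hits the most recent 3 months</li>
<li><strong>Redesign target</strong>: daily partitions for the most recent 3 months (hot), weekly partitions for data older than 3 months (cold), monthly partitions for data older than 1 year (a tiered, mixed-granularity strategy)</li>
<li><strong>Migration scope</strong>: the 1095 partitions are not all created up front; they are built in stages that follow the retention policy</li>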
</ul>
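<p>To size the daily partitions before building them, the per-day distribution inside the hot window can be measured directly. A minimal sketch, assuming the <code>events</code> table and <code>event_time</code> column from the queries above; the 3-month window and the UTC day boundary are illustrative choices, not values taken from the measurements:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Rows per day over roughly the most recent 3 months (the hot window).
-- The range predicate on event_time lets pruning skip older monthly partitions.
SELECT
  date_trunc('day', event_time AT TIME ZONE 'UTC')::date AS day,
  count(*) AS row_count
FROM events
WHERE event_time &gt;= date_trunc('month', now()) - interval '3 months'
GROUP BY 1
ORDER BY 1;</code></pre></div>
<p>Multiplying the busiest day's row count by the average row size (partition size divided by estimated rows, both from query 1) gives a rough upper bound for a single daily partition.</p>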
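<h2 id="re-layout-機制attach--detach-線上重劃">Re-layout mechanism: online re-layout with ATTACH / DETACH</h2>
<p>PostgreSQL cannot change the partition strategy of an existing table in place; the redesign has to go through <em>a new partition tree plus data migration</em>:</p>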





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. create the new daily-partitioned table (parallel to events)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="n">id</span><span class="w"> </span><span class="nb">bigint</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="n">event_time</span><span class="w"> </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">  </span><span class="n">payload</span><span class="w"> </span><span class="n">jsonb</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">event_time</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">
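</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1">-- 2. generate DDL for the next 90 days of daily partitions (this SELECT only prints the statements; in psql, end it with \gexec instead of ; to execute them)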
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">  </span><span class="n">format</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="s1">&#39;CREATE TABLE events_daily_%s PARTITION OF events_daily FOR VALUES FROM (%L) TO (%L)&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">to_char</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;YYYY_MM_DD&#39;</span><span class="p">),</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">  </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">generate_series</span><span class="p">(</span><span class="k">current_date</span><span class="p">,</span><span class="w"> </span><span class="k">current_date</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;90 days&#39;</span><span class="p">,</span><span class="w"> </span><span class="nb">interval</span><span class="w"> </span><span class="s1">&#39;1 day&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">d</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- 3. dual-write phase: every write goes to both events and events_daily
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1">-- (via a trigger, as here, or application-side; this trigger mirrors INSERT only)
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="k">REPLACE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">()</span><span class="w"> </span><span class="k">RETURNS</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="err">$$</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">BEGIN</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">  </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="k">NEW</span><span class="p">.</span><span class="o">*</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">  </span><span class="k">RETURN</span><span class="w"> </span><span class="k">NEW</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"></span><span class="k">END</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"></span><span class="err">$$</span><span class="w"> </span><span class="k">LANGUAGE</span><span class="w"> </span><span class="n">plpgsql</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span><span class="k">AFTER</span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="k">FOR</span><span class="w"> </span><span class="k">EACH</span><span class="w"> </span><span class="k">ROW</span><span class="w"> </span><span class="k">EXECUTE</span><span class="w"> </span><span class="k">FUNCTION</span><span class="w"> </span><span class="n">dual_write_events</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="w"></span><span class="c1">-- 4. backfill historical data per partition
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">events_daily</span><span class="w">
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">event_time</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2026-05-02&#39;</span><span class="p">;</span><span class="w">
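</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w"></span><span class="c1">-- ... repeat one day at a time to avoid a long transaction; the target daily partition must already exist, and days written after the trigger was installed must be skipped or deduplicated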
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w"></span><span class="c1">-- 5. cutover: rename swap
</span></span></span><span class="line"><span class="ln">36</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">37</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">38</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events_daily</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">39</span><span class="cl"><span class="w"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TRIGGER</span><span class="w"> </span><span class="n">events_dual_write</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events_old</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">40</span><span class="cl"><span class="w"></span><span class="k">COMMIT</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">41</span><span class="cl"><span class="w">
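</span></span></span><span class="line"><span class="ln">42</span><span class="cl"><span class="w"></span><span class="c1">-- 6. observe for 1-2 weeks, then DROP events_old</span></span></span></code></pre></div><p>Key point: the rename swap runs in a <em>single transaction</em>, so the cutover is atomic from the application's point of view. Each <code>ALTER TABLE ... RENAME</code> takes an <code>ACCESS EXCLUSIVE</code> lock, so the swap waits for in-flight transactions on both tables and blocks new queries until it commits. Application connections do not need to reconnect, but prepared statement caches may need to be refreshed.</p>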
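<p>The backfill in step 4 needs a target partition for every historical day, while the generator in step 2 only covers future dates. A minimal sketch of creating one month of historical daily partitions in a single <code>DO</code> block; the date range is an example, and the <code>events_daily_YYYY_MM_DD</code> naming simply follows step 2:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">DO $$
DECLARE
  first_day date := date '2026-05-01';  -- example range start (assumption)
  d date;
BEGIN
  FOR i IN 0..30 LOOP                   -- 31 days: 2026-05-01 through 2026-05-31
    d := first_day + i;
    EXECUTE format(
      'CREATE TABLE IF NOT EXISTS %I PARTITION OF events_daily FOR VALUES FROM (%L) TO (%L)',
      'events_daily_' || to_char(d, 'YYYY_MM_DD'),
      d,
      d + 1                             -- upper bound is exclusive
    );
  END LOOP;
END;
$$;</code></pre></div>
<p>The date bounds are converted to <code>timestamptz</code> in the session time zone, so this should run with the same <code>TimeZone</code> setting that was used for the partitions from step 2.</p>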
<h2 id="execution-flow-per-step">Execution flow per-step</h2>
<p>Five steps, each with its own rollback boundary:</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
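          <th>Action</th>
          <th>Rollback boundary</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 Pre-create partitions</td>
          <td>Create events_daily plus 90 days of partitions; no production impact</td>
          <td>DROP events_daily; no impact</td>
      </tr>
      <tr>
          <td>2 Dual-write</td>
          <td>Add the trigger to write to both tables; observe diffs</td>
          <td>DROP the trigger; events_daily is left behind for cleanup</td>
      </tr>
      <tr>
          <td>3 Backfill</td>
          <td>Backfill historical data one day at a time; where a day is loaded into a standalone table and then attached, a CHECK constraint matching the partition bound guards the data and lets ATTACH PARTITION skip its validation scan</td>
          <td>DROP the backfilled partitions; the source events table is untouched</td>
      </tr>
      <tr>
          <td>4 Verify</td>
          <td>Run sample queries against events and events_daily; confirm row counts match</td>
          <td>Still dual-writing; the cutover can be postponed if a diff appears</td>
      </tr>
      <tr>
          <td>5 Cutover</td>
          <td>Rename swap</td>
          <td><strong>Not cleanly reversible</strong>: rolling back needs a reverse rename plus restarting dual-write, and rows written after the cutover exist only in the new table</td>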
      </tr>
  </tbody>
</table>
<p>Step 5 is the point of no return: schedule it in a <em>low-traffic maintenance window</em>, and take a backup checkpoint before the cutover.</p>
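<p>The Verify step can be made concrete with a per-day row-count comparison. A minimal sketch, assuming both tables are still being dual-written and that one backfilled month is under review; the date range is an example, and equal counts are a necessary check rather than proof that the row contents match:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">-- Days where events and events_daily disagree on row count.
-- An empty result means the counts match for every day in the range.
WITH src AS (
  SELECT date_trunc('day', event_time) AS day, count(*) AS n
  FROM events
  WHERE event_time &gt;= '2026-05-01' AND event_time &lt; '2026-06-01'
  GROUP BY 1
),
dst AS (
  SELECT date_trunc('day', event_time) AS day, count(*) AS n
  FROM events_daily
  WHERE event_time &gt;= '2026-05-01' AND event_time &lt; '2026-06-01'
  GROUP BY 1
)
SELECT
  COALESCE(src.day, dst.day) AS day,
  COALESCE(src.n, 0)         AS events_rows,
  COALESCE(dst.n, 0)         AS events_daily_rows
FROM src
FULL JOIN dst ON src.day = dst.day
WHERE src.n IS DISTINCT FROM dst.n
ORDER BY 1;</code></pre></div>
<p>A day with more rows in <code>events_daily</code> than in <code>events</code> usually points at rows copied twice, once by the trigger and once by the backfill.</p>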
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1backfill-期間-long-transaction-阻塞-vacuum">Case 1: a long transaction during backfill blocks vacuum</h3>
<p><strong>Symptom</strong>: a backfill <code>INSERT INTO events_daily SELECT * FROM events WHERE ...</code> runs for 6 hours; during that time autovacuum on events reclaims nothing, dead tuples pile up, and production queries slow down.</p>
<p><strong>Root cause</strong>: an open transaction pins the <em>xmin horizon</em>. Vacuum can only reclaim dead tuples that no active transaction could still see, so a long backfill is a long-open transaction, and vacuum cannot remove anything that died after it started.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Split into batched INSERTs</strong>: break each day's backfill into small batches (100,000 rows per transaction); every commit releases the xmin</li>
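<li><strong>Use COPY rather than INSERT</strong>: <code>COPY ... FROM</code> cannot read from a query, so stream the rows instead: <code>COPY (SELECT * FROM events WHERE ...) TO STDOUT</code> piped into <code>COPY events_daily FROM STDIN</code>. COPY is usually the fastest bulk-load path in PostgreSQL, but each COPY is still one transaction, so keep the slices small to limit the effect on vacuum</li>
<li><strong>Run the backfill from a standby</strong>: pull the data from a standby, for example through logical replication (logical decoding from a standby requires PostgreSQL 16 or later), so that no long transaction runs on the primary</li>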
</ol>
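<p>The first fix can be sketched as a procedure that commits after every slice. This is an illustration under assumptions: the procedure name <code>backfill_day</code> and the one-hour slice are hypothetical (the fix above suggests 100,000-row batches; a time slice is used here because it needs no extra bookkeeping), both tables are assumed to share the same column order, and transaction control inside procedures requires PostgreSQL 11 or later:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">CREATE OR REPLACE PROCEDURE backfill_day(p_day date)
LANGUAGE plpgsql
AS $$
DECLARE
  slice_start timestamptz := p_day::timestamptz;        -- midnight, session time zone
  day_end     timestamptz := (p_day + 1)::timestamptz;  -- next midnight (exclusive)
BEGIN
  WHILE slice_start &lt; day_end LOOP
    INSERT INTO events_daily
    SELECT * FROM events
    WHERE event_time &gt;= slice_start
      AND event_time &lt;  slice_start + interval '1 hour';
    COMMIT;  -- ends the transaction so its snapshot no longer holds back vacuum
    slice_start := slice_start + interval '1 hour';
  END LOOP;
END;
$$;

-- Call it outside an explicit transaction block; inside one, the COMMIT fails.
CALL backfill_day(date '2026-05-01');</code></pre></div>
<p>Each slice holds a snapshot only for as long as its own INSERT runs, which is the property the batching fix relies on.</p>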
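<h3 id="case-2trigger-dual-write-對-application-造成-latency">Case 2: trigger-based dual-write adds application latency</h3>
<p><strong>Symptom</strong>: after the trigger is added, application write latency p99 rises from 5ms to 25-50ms; high-throughput batch jobs time out outright.</p>
<p><strong>Root cause</strong>: every INSERT fires the trigger function, which performs a second INSERT into events_daily; IO doubles, and so does index maintenance.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong>Switch to application-side dual-write</strong>: the application writes to both tables explicitly and batches the writes over its connection pool to amortize IO</li>
<li><strong>Use a logical replication slot</strong>: feed events_daily from events through logical decoding rather than a trigger, which moves the second write off the request path (built-in subscriptions match tables by name, so this typically means a custom consumer of the slot)</li>
<li><strong>Minimize the dual-write window</strong>: enable the trigger only for the backfill and verify phases, and drop it inside the cutover transaction so that no write falls into a gap</li>
</ol>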
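<p>Before choosing among these fixes, the trigger's share of the write latency can be measured, since <code>EXPLAIN ANALYZE</code> reports the time spent in triggers. A minimal sketch, assuming <code>events</code> has the same three columns as the <code>events_daily</code> definition above; the inserted values are placeholders, and the transaction is rolled back so nothing is persisted:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO events (id, event_time, payload)
VALUES (1, now(), '{"probe": true}'::jsonb);
-- The output includes a trigger line for events_dual_write
-- that reports its total time and number of calls.

ROLLBACK;  -- discard the probe row in both tables</code></pre></div>
<p>Comparing the trigger time with the statement's total execution time shows how much of the added latency the dual-write accounts for.</p>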
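<h3 id="case-3partition_pruning-沒命中planner-仍掃所有-partition">Case 3: partition pruning misses and the planner still scans every partition</h3>
<p><strong>Symptom</strong>: after the cutover, some application queries jump from 200ms to 5000ms; EXPLAIN shows all 1095 partitions being scanned under the <code>Append</code> node.</p>
<p><strong>Root cause</strong>: with 1000+ partitions, planning time grows for some queries (including prepared statements that carry no bound on the partition key); or the query filters with something like <code>WHERE event_time = some_function(now())</code>, where the comparison value is not a plan-time constant, so plan-time pruning cannot fire. A stable expression can still be pruned at executor startup; a volatile function prevents pruning altogether.</p>
<p><strong>Fixes</strong>:</p>
<ol>
<li><strong><code>enable_partition_pruning = on</code></strong> is the default; confirm it has not been disabled</li>
<li><strong>Runtime pruning (PG 11+)</strong>: when a prepared statement uses a generic plan, runtime pruning takes over; EXPLAIN reports it as <code>Subplans Removed</code></li>
<li><strong>Mixed-granularity strategy</strong>: 1095 daily partitions is a lot; a mix of <em>daily for the most recent 90 days, monthly before that</em> cuts the partition count</li>
<li><strong>Planner statistics</strong>: run <code>ANALYZE</code> on the parent to rebuild statistics; autovacuum does not analyze a partitioned parent table itself, so a new partition tree starts without parent-level stats</li>
</ol>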
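<p>The first two fixes can be checked from a psql session. A minimal sketch, assuming the post-cutover <code>events</code> table; the prepared statement name <code>recent_events</code> and the literal dates are hypothetical, and <code>plan_cache_mode</code> requires PostgreSQL 12 or later:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql">SHOW enable_partition_pruning;  -- should report: on

-- Plan-time pruning: the bounds are constants, so only the matching
-- partition should appear under Append.
EXPLAIN (COSTS OFF)
SELECT count(*) FROM events
WHERE event_time &gt;= '2026-05-01' AND event_time &lt; '2026-05-02';

-- Executor-startup pruning: now() is stable, not constant, so pruned
-- partitions are omitted and counted in a "Subplans Removed: N" line.
EXPLAIN (COSTS OFF)
SELECT count(*) FROM events
WHERE event_time &gt;= now() - interval '1 day';

-- Same check for a prepared statement forced onto a generic plan.
SET plan_cache_mode = force_generic_plan;
PREPARE recent_events(timestamptz) AS
  SELECT count(*) FROM events WHERE event_time &gt;= $1;
EXPLAIN (COSTS OFF) EXECUTE recent_events('2026-05-18');
DEALLOCATE recent_events;
RESET plan_cache_mode;</code></pre></div>
<p>If a plan lists every partition and has no <code>Subplans Removed</code> line, the predicate is not usable for pruning and the query itself needs to change.</p>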
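<h3 id="case-4constraint-exclusion-失敗跨-partition-unique-不-enforce">Case 4: unique constraint scope shrinks, cross-partition uniqueness is not enforced</h3>
<p><strong>Symptom</strong>: after the cutover, the same event for a user shows up in several partitions; the intended uniqueness of <code>(user_id, event_id)</code> is not enforced, and a data audit catches duplicates.</p>
<p><strong>Root cause</strong>: a <code>UNIQUE</code> constraint on a partitioned table <em>must include the partition key</em>, so <code>UNIQUE (user_id, event_id)</code> has to become <code>UNIQUE (user_id, event_id, event_time)</code>. That constraint only rejects rows whose <code>event_time</code> is also identical; the same <code>(user_id, event_id)</code> with a different timestamp is accepted under either layout. Month-wide dedup exists only if a unique index on <code>(user_id, event_id)</code> was created directly on each monthly partition, which makes the scope "unique per user and event_id within one month". With daily partitions the same per-partition index means "unique within one day": the scope shrinks from a month to a day, and the cross-day dedup inside a month is lost.</p>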
<p><strong>修法</strong>：</p>
<ol>
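<li><strong>Pre-redesign</strong>: state the <em>time scope</em> of each unique constraint explicitly, and decide whether the narrower scope after the redesign is acceptable.</li>
<li><strong>Application-side dedup</strong>: enforce cross-partition uniqueness with a lookup in the application layer (for example, short-lived keys written with Redis SETEX).</li>
<li><strong>Fall back to a non-partitioned dedup table</strong>: create a standalone user_events_dedup table and have the application look it up before writing.</li>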
</ol>
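<p>The third fix might look like the sketch below. The table name <code>user_events_dedup</code> comes from the list above; the column types and the sample values are assumptions. Note that <code>INSERT ... ON CONFLICT DO NOTHING</code> is a swapped-in variant of the "look up before writing" step, chosen here because it avoids a race between the lookup and the insert.</p>
<pre><code class="language-sql">-- Non-partitioned table: its primary key is a global uniqueness guarantee.
-- Column types are assumed; match them to the events table.
CREATE TABLE IF NOT EXISTS user_events_dedup (
    user_id    bigint      NOT NULL,
    event_id   bigint      NOT NULL,
    first_seen timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, event_id)
);

-- Claim the key atomically. A returned row means this is the first
-- occurrence and the event should be written; no row means duplicate.
INSERT INTO user_events_dedup (user_id, event_id)
VALUES (42, 1001)
ON CONFLICT (user_id, event_id) DO NOTHING
RETURNING user_id, event_id;
</code></pre>
<p>Because this table is not partitioned, it needs its own retention job; <code>first_seen</code> is included so old keys can be deleted once they fall outside the dedup window.</p>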
<h3 id="case-5drop-老-partition-太頻繁shared_buffers-cache-miss-爆">Case 5: Old partitions dropped too often, shared_buffers cache misses spike</h3>
<p><strong>Symptoms</strong>: once daily partitions are live, a cron job drops <code>events_2025_05_18</code> (the partition from 90 days earlier) in the early hours every day; each DROP invalidates a large share of shared_buffers, and application query p99 latency jumps from 10ms to 100-200ms for about 30 minutes.</p>
<p><strong>Root cause</strong>: PostgreSQL invalidates every shared_buffers page that belongs to a dropped table; after a large partition (10GB+) is dropped, the cache hit rate falls from 99% to 60% and the application waits on disk IO.</p>
<p><strong>Fixes</strong>:</p>
<ol>
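<li><strong>Run DROP off-peak</strong>: schedule the cron job for 3-4 a.m., away from business peak hours.</li>
<li><strong>Prewarm the next partition</strong>: before the DROP, use <code>pg_prewarm</code> to load hot partitions into the cache proactively.</li>
<li><strong>Switch to DETACH + delayed DROP TABLE</strong>: DETACH is fast; move DROP TABLE into a weekly batch to lower the frequency.</li>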
</ol>
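<p>Fixes 2 and 3 can be combined roughly as follows. This is a sketch under assumptions: <code>events_2025_05_18</code> is the partition named above, <code>events_2026_05_19</code> is a hypothetical name for the current hot partition, and <code>DETACH ... CONCURRENTLY</code> requires PostgreSQL 14 or later.</p>
<pre><code class="language-sql">-- Nightly, off-peak: detach only. CONCURRENTLY needs PostgreSQL 14+ and
-- cannot run inside a transaction block.
ALTER TABLE events DETACH PARTITION events_2025_05_18 CONCURRENTLY;

-- Optional: load a hot partition's heap into shared_buffers.
-- events_2026_05_19 is a hypothetical name for the current day's partition.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('events_2026_05_19'::regclass);

-- Weekly batch: drop the tables that were detached earlier.
DROP TABLE IF EXISTS events_2025_05_18;
</code></pre>
<p><code>pg_prewarm</code> on a table loads only that relation; hot indexes need their own <code>pg_prewarm</code> call with the index name.</p>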
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Monthly partition (current)</th>
          <th>Daily partition (target)</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Partition count</td>
          <td>36 (3-year retention)</td>
          <td>1095 (3-year retention)</td>
          <td>30x partition count; planner cost rises slightly</td>
      </tr>
      <tr>
          <td>Single partition size</td>
          <td>50-500GB</td>
          <td>1-20GB</td>
          <td>Daily partitions are easier to vacuum</td>
      </tr>
      <tr>
          <td>DROP old data</td>
          <td>Monthly cadence</td>
          <td>Daily cadence</td>
          <td>Finer retention control</td>
      </tr>
      <tr>
          <td>Query latency</td>
          <td>50-200ms when a query spans many partitions</td>
          <td>5-50ms when a query spans few partitions</td>
          <td>Most queries are faster on daily</td>
      </tr>
      <tr>
          <td>Planning time</td>
          <td>5-10ms</td>
          <td>50-100ms (with a generic plan)</td>
          <td>Planning overhead up by about one order of magnitude</td>
      </tr>
      <tr>
          <td>Maintenance window</td>
          <td>6 hours to vacuum one partition</td>
          <td>5-30 minutes to vacuum one partition</td>
          <td>Smaller maintenance window; can run daily</td>
      </tr>
  </tbody>
</table>
<p><strong>Reading the table</strong>: daily partitions suit <em>high-traffic workloads with many day-range queries and fine-grained retention</em>; a very large partition (TB-scale for a single day) still needs to be split with sub-partitioning.</p>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-autovacuum-tuning-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">autovacuum tuning</a></h3>
<p>Autovacuum behaviour after moving to daily partitions:</p>
<ul>
<li>Each daily partition is autovacuumed independently, so scale_factor + threshold can be tuned per partition.</li>
<li><code>autovacuum_max_workers</code> needs to go from 3 to 6-10 (the partition count explodes).</li>
<li>Cold partitions (&gt; 30 days): set <code>autovacuum_enabled = false</code> so they do not waste CPU.</li>
</ul>
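<p>The per-partition settings above could be applied along these lines. The partition names <code>events_2026_05_19</code> and <code>events_2026_04_01</code> and the numeric thresholds are illustrative assumptions, not values from a measured workload.</p>
<pre><code class="language-sql">-- Recent, write-heavy partition: tighter thresholds (values are illustrative).
ALTER TABLE events_2026_05_19 SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold    = 10000
);

-- Cold partition older than 30 days: skip routine autovacuum.
-- Anti-wraparound vacuum still runs regardless of this setting.
ALTER TABLE events_2026_04_01 SET (autovacuum_enabled = false);

-- Cluster-wide worker count, for self-managed servers. Managed services
-- such as RDS / Aurora set this through a parameter group instead, and a
-- restart may be required depending on the PostgreSQL version.
ALTER SYSTEM SET autovacuum_max_workers = 6;
</code></pre>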
<h3 id="跟-patroni-ha-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a></h3>
<p>Partition migration must not run during a failover; run it only while the cluster is in a stable state, and re-check partition health after Patroni promotes a new leader.</p>
<h3 id="跟-logical-replication--debezium-整合">Integration with <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a></h3>
<p><code>publish_via_partition_root = true</code> makes the publication publish changes as coming from the parent table, so CDC consumers do not need a subscription per partition.</p>
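<p>On the publisher this is a single option on the publication. A minimal sketch, assuming the parent table is <code>events</code>; the publication name <code>events_pub</code> is hypothetical, and the option needs PostgreSQL 13 or later.</p>
<pre><code class="language-sql">-- Publisher side (PostgreSQL 13+). events_pub is a hypothetical name.
CREATE PUBLICATION events_pub
    FOR TABLE events
    WITH (publish_via_partition_root = true);

-- Check the flag on existing publications.
SELECT pubname, pubviaroot FROM pg_publication;
</code></pre>
<p>With the flag set, downstream consumers see changes attributed to <code>events</code> rather than to individual daily partitions, so adding or dropping a partition does not change the stream's table identity.</p>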
<h3 id="下一步議題">Next topics</h3>
<ul>
<li><strong>Archive strategy across daily partitions</strong>: archive to S3 cold storage; daily granularity allows finer retention control.</li>
<li><strong>pg_partman extension</strong>: creates daily partitions automatically, with no cron job; confirm Aurora / RDS support first.</li>
<li><strong>Sub-partitioning</strong>: if traffic explodes later, use two-axis partitioning, "daily by time + list by tenant".</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Upstream vendor page: <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>Parallel deep articles: <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">Declarative Partitioning</a> (partition basics) / <a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">Autovacuum Tuning</a></li>
<li>Parallel Type F dogfood: <a href="/blog/backend/02-cache-redis/vendors/redis/cluster-resharding/" data-link-title="Redis Cluster Re-sharding：source = target，但 topology 重劃的 5 段流程" data-link-desc="Redis cluster re-sharding 是 5 type migration 漏類實證 — source / target 同 cluster、無 schema / paradigm 差、但 16384 slot 重分配是核心；本文涵蓋 4 種 re-sharding driver、slot migration 機制、redis-cli --cluster rebalance / reshard 工具、5 個 production 踩雷（cluster busy / replica lag / client cache stale / cross-slot transaction / monitor gap）">Redis Cluster Re-sharding</a> (dogfood #1) / <a href="/blog/backend/01-database/vendors/mongodb/shard-expansion-multi-dc/" data-link-title="MongoDB Shard Expansion &#43; Multi-DC：Type F「不需要 parallel run」的 multi-region 例外" data-link-desc="MongoDB sharded cluster 加 shard &#43; 跨 DC expansion 是 Type F「topology re-layout」第 3 個 dogfood — 同時改 sharding &#43; replication topology &#43; region distribution；驗證 [#128](/report/data-topology-as-audit-dimension/) self-aware limitation 第 3 點「Type F 不需要 parallel run」claim 的例外（multi-region rollout 必須 parallel run &#43; 切流量）；涵蓋 chunk migration / replica set add member / cross-DC routing">MongoDB Shard + Multi-DC</a> (dogfood #3, F-multi-region sub-type)</li>
<li>Methodology: <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> / <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">#127 Process content structure is determined by the largest diff dimension</a> / <a href="/blog/report/data-topology-as-audit-dimension/" data-link-title="Data topology 是 process content 的第 6 audit 維度" data-link-desc="Process content 的 diff dimension audit 原本 5 維（schema / operational / paradigm / components / application change）漏了 *data topology* — 資料在 cluster / partition / region 之間的分佈拓樸；topology 不在既有 5 維任一個、但決定 re-sharding / partition redesign / multi-region rollout 的結構；本卡擴 audit 到 6 維、新增 Type F「Topology re-layout」結構">#128 Data topology is the 6th audit dimension</a></li>
</ul>
]]></content:encoded></item></channel></rss>