<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>MySQL on Tarragon</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/</link><description>Recent content in MySQL on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Wed, 13 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/index.xml" rel="self" type="application/rss+xml"/><item><title>MySQL → PostgreSQL：從 SQL dialect diff 跑出來的 Type A 6-phase migration</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-postgresql/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-postgresql/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a>。本文是 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type A 的標準形態實證。&lt;/p>&lt;/blockquote>
&lt;h2 id="三類-sql-dialect-diff-sample先看具體差距">三類 SQL dialect diff sample：先看具體差距&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- 1. Auto increment / sequence
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">INT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">AUTO_INCREMENT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">SERIAL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 或 PG 10+:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">CREATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nb">INT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">GENERATED&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">ALWAYS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">IDENTITY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">PRIMARY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 2. String concatenation
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL: CONCAT(a, b) 或 a || b 在 ANSI mode
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CONCAT&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">first_name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">last_name&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL: a || b 或 CONCAT(a, b)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">first_name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39; &amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">||&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">last_name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 注意: PostgreSQL 對 NULL || x = NULL、MySQL CONCAT 對 NULL 處理不同
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 3. UPSERT
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">INSERT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INTO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;Alice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">DUPLICATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">KEY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">UPDATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL (9.5+)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">INSERT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INTO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">VALUES&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;Alice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">CONFLICT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">DO&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">UPDATE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">SET&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">EXCLUDED&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">23&lt;/span>&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">24&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 4. Index hint / FORCE INDEX
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">25&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">26&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FORCE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">idx_created_at&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2025-01-01&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">27&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL: 沒對應 syntax、依賴 planner + statistics
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">28&lt;/span>&lt;span class="cl">&lt;span class="c1">-- 必要時用 enable_seqscan=off 或 pg_hint_plan extension
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">29&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">30&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- 5. JSON path
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">31&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL 5.7+
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">32&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">data&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="s1">&amp;#39;$.name&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">33&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="c1">-- PostgreSQL
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">34&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">data&lt;/span>&lt;span class="o">-&amp;gt;&lt;/span>&lt;span class="s1">&amp;#39;name&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">35&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">data&lt;/span>&lt;span class="o">-&amp;gt;&amp;gt;&lt;/span>&lt;span class="s1">&amp;#39;name&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- 取出 text&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>5 個 sample 看出 MySQL → PostgreSQL 主要工作是 &lt;em>SQL dialect translation&lt;/em>；不是 5-10 個函數差、是 &lt;em>跨整個 application SQL surface 的 audit + 改寫&lt;/em>。對應 &lt;a href="https://tarrragon.github.io/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit&lt;/a> 結果：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor migration playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> 跟 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a>。本文是 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> Type A 的標準形態實證。</p></blockquote>
<h2 id="三類-sql-dialect-diff-sample先看具體差距">三類 SQL dialect diff sample：先看具體差距</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. Auto increment / sequence
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1">-- MySQL
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INT</span><span class="w"> </span><span class="n">AUTO_INCREMENT</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- 或 PG 10+:
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INT</span><span class="w"> </span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">IDENTITY</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- 2. String concatenation
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1">-- MySQL: CONCAT(a, b) 或 a || b 在 ANSI mode
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">CONCAT</span><span class="p">(</span><span class="n">first_name</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="p">,</span><span class="w"> </span><span class="n">last_name</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL: a || b 或 CONCAT(a, b)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">first_name</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="s1">&#39; &#39;</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">last_name</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- 注意: PostgreSQL 對 NULL || x = NULL、MySQL CONCAT 對 NULL 處理不同
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- 3. UPSERT
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="c1">-- MySQL
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alice&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">DUPLICATE</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="k">UPDATE</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">VALUES</span><span class="p">(</span><span class="n">name</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL (9.5+)
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;Alice&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w"></span><span class="k">ON</span><span class="w"> </span><span class="n">CONFLICT</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="k">DO</span><span class="w"> </span><span class="k">UPDATE</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">EXCLUDED</span><span class="p">.</span><span class="n">name</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w"></span><span class="c1">-- 4. Index hint / FORCE INDEX
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1">-- MySQL
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">FORCE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="p">(</span><span class="n">idx_created_at</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2025-01-01&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL: 沒對應 syntax、依賴 planner + statistics
</span></span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="c1">-- 必要時用 enable_seqscan=off 或 pg_hint_plan extension
</span></span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="w"></span><span class="c1">-- 5. JSON path
</span></span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="c1">-- MySQL 5.7+
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">data</span><span class="o">-&gt;</span><span class="s1">&#39;$.name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL
</span></span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">data</span><span class="o">-&gt;</span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">35</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">data</span><span class="o">-&gt;&gt;</span><span class="s1">&#39;name&#39;</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 取出 text</span></span></span></code></pre></div><p>5 個 sample 看出 MySQL → PostgreSQL 主要工作是 <em>SQL dialect translation</em>；不是 5-10 個函數差、是 <em>跨整個 application SQL surface 的 audit + 改寫</em>。對應 <a href="/blog/report/content-structure-by-max-diff-dimension/" data-link-title="Process content 結構由最大差異維度決定、不是 universal phased" data-link-desc="跨 X process content（migration / upgrade / rollout / playbook）的結構由 source / target 之間 *差異維度組合* 決定、不存在 universal phased 模板；6 種 migration / process type 實證（schema 差 / drop-in / operational / multi-tool / paradigm / topology re-layout）跑出 6 種不同結構；寫作前必須做 *6 維 diff dimension audit* 才能決定結構、跳過會套錯模板">diff dimension audit</a> 結果：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評估</th>
          <th>等級</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema / API</td>
          <td>SQL dialect 差大、CREATE TABLE / INDEX / function 都差</td>
          <td><strong>High</strong></td>
      </tr>
      <tr>
          <td>Operational model</td>
          <td>兩者都 OLTP RDBMS、replication 概念對等但語法不同</td>
          <td>Medium</td>
      </tr>
      <tr>
          <td>Abstraction / paradigm</td>
          <td>同 SQL RDBMS</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Number of components</td>
          <td>同 1 個</td>
          <td>Low</td>
      </tr>
      <tr>
          <td>Application change</td>
          <td>ORM 多數能 cover、raw SQL 必改</td>
          <td>Medium</td>
      </tr>
  </tbody>
</table>
<p>主導維度 Schema = High、走 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Type A 6-phase playbook</a> 標準結構。</p>
<h2 id="phase-0rule-audit--sql-surface-盤點">Phase 0：rule audit + SQL surface 盤點</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 1. 列所有 stored procedure
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">routine_schema</span><span class="p">,</span><span class="w"> </span><span class="k">routine_name</span><span class="p">,</span><span class="w"> </span><span class="n">routine_type</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">information_schema</span><span class="p">.</span><span class="n">routines</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">routine_schema</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;mysql&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;sys&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;information_schema&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;performance_schema&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="c1">-- 2. 列所有 trigger
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">trigger_name</span><span class="p">,</span><span class="w"> </span><span class="n">event_object_table</span><span class="p">,</span><span class="w"> </span><span class="n">action_statement</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">information_schema</span><span class="p">.</span><span class="n">triggers</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 3. 列所有 view
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">table_name</span><span class="p">,</span><span class="w"> </span><span class="n">view_definition</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">information_schema</span><span class="p">.</span><span class="n">views</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="c1">-- 4. 列所有 index 含 prefix length
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="c1">-- PostgreSQL 對 prefix index 處理不同、要逐個 audit</span></span></span></code></pre></div><p>Audit 主要產出三類清單：</p>
<ul>
<li><strong>Direct port</strong>：標準 SQL feature、PG 直接接受</li>
<li><strong>Translate</strong>：MySQL-specific syntax、需要改寫（UPSERT / CONCAT NULL 行為 / index hint）</li>
<li><strong>Refactor</strong>：MySQL-specific behavior（auto_increment session-level / SELECT FOUND_ROWS / GROUP BY 寬鬆 / TEXT 隱性 cast）— 不能直接 port、application code 也要改</li>
</ul>
<h2 id="phase-1schema-對位">Phase 1：schema 對位</h2>
<table>
  <thead>
      <tr>
          <th>MySQL</th>
          <th>PostgreSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>INT AUTO_INCREMENT</code></td>
          <td><code>INT GENERATED ALWAYS AS IDENTITY</code> 或 <code>SERIAL</code></td>
      </tr>
      <tr>
          <td><code>TINYINT(1)</code> (boolean usage)</td>
          <td><code>BOOLEAN</code></td>
      </tr>
      <tr>
          <td><code>DATETIME</code></td>
          <td><code>TIMESTAMP WITHOUT TIME ZONE</code></td>
      </tr>
      <tr>
          <td><code>DATETIME(6)</code> (microsecond)</td>
          <td><code>TIMESTAMP(6)</code></td>
      </tr>
      <tr>
          <td><code>VARCHAR(N)</code> with charset</td>
          <td><code>VARCHAR(N)</code> (UTF-8 always)</td>
      </tr>
      <tr>
          <td><code>TEXT</code></td>
          <td><code>TEXT</code> (no length limit)</td>
      </tr>
      <tr>
          <td><code>LONGTEXT</code></td>
          <td><code>TEXT</code></td>
      </tr>
      <tr>
          <td><code>JSON</code></td>
          <td><code>JSONB</code> (推薦、indexed) 或 <code>JSON</code></td>
      </tr>
      <tr>
          <td><code>ENUM('a','b','c')</code></td>
          <td>自定 <code>TYPE foo AS ENUM('a','b','c')</code> 或 <code>VARCHAR + CHECK</code></td>
      </tr>
      <tr>
          <td><code>SET('a','b')</code></td>
          <td>Array <code>TEXT[]</code> + CHECK</td>
      </tr>
      <tr>
          <td><code>BINARY(N)</code></td>
          <td><code>BYTEA</code></td>
      </tr>
      <tr>
          <td>Index prefix <code>KEY (col(10))</code></td>
          <td>Functional index <code>CREATE INDEX ON t (LEFT(col, 10))</code></td>
      </tr>
      <tr>
          <td><code>FULLTEXT INDEX</code></td>
          <td><code>tsvector</code> + GIN index</td>
      </tr>
      <tr>
          <td>Geographic types</td>
          <td>PostGIS extension（必須先裝）</td>
      </tr>
  </tbody>
</table>
<p>Schema 對位表存版控、application code refactor 時對照。</p>
<h2 id="phase-2translation-pipeline3-tier-跟-splunk--elastic-類似">Phase 2：Translation pipeline（3-tier 跟 Splunk → Elastic 類似）</h2>
<h3 id="tier-1vendor--community-tool">Tier 1：vendor / community tool</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># pgloader：成熟工具、cover ~70-80% schema + data</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pgloader mysql://user:pass@mysql-host/dbname <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>         postgresql://user:pass@pg-host/dbname
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 或 AWS DMS（managed、適合 RDS / Aurora target）</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># DMS task: Full Load + CDC</span></span></span></code></pre></div><h3 id="tier-2自家-sql-refactor">Tier 2：自家 SQL refactor</h3>
<p>對 ORM 不能 cover 的 raw SQL：</p>
<ul>
<li>Manual grep <code>application code</code> 找 <code>auto_increment</code> / <code>ON DUPLICATE KEY</code> / <code>FORCE INDEX</code> / <code>FOUND_ROWS()</code> / <code>CONCAT NULL</code></li>
<li>寫 codemod / lint rule、CI 強制 check（PG-incompatible SQL block PR）</li>
</ul>
<h3 id="tier-3tricky-case-manual">Tier 3：tricky case manual</h3>
<p>例：MySQL <code>SELECT * FROM t1, t2 WHERE t1.id = t2.id GROUP BY t1.id</code>（implicit GROUP BY 寬鬆）— PG 嚴格 GROUP BY 必須 list 所有 non-aggregate column；application code refactor 必要。</p>
<h2 id="phase-3parallel-run">Phase 3：Parallel run</h2>
<p>雙寫 + 雙讀比對 1-2 個月：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Application ──→ MySQL (write + read primary)
</span></span><span class="line"><span class="ln">2</span><span class="cl">            └─→ PostgreSQL (write only + read shadow)
</span></span><span class="line"><span class="ln">3</span><span class="cl">                                    ↓
</span></span><span class="line"><span class="ln">4</span><span class="cl">                            Diff checker (latency / result diff)</span></span></code></pre></div><p><code>pt-table-checksum</code> (MySQL) + 自家 checksum scanner 對 sample table 跑 daily checksum、找 schema 對位錯。</p>
<h2 id="phase-4cutover">Phase 4：Cutover</h2>
<ul>
<li>設 application maintenance window（30 分鐘）</li>
<li>Drain MySQL write、等 last LSN propagated to PG</li>
<li>Application switch connection string → PG</li>
<li>解除 maintenance、monitor 24-48 hours</li>
</ul>
<h2 id="phase-5cleanup">Phase 5：Cleanup</h2>
<ul>
<li>MySQL read-only 1-2 週（fallback window）</li>
<li>之後 stop replication、decommission MySQL</li>
</ul>
<h2 id="production-故障演練">Production 故障演練</h2>
<h3 id="case-1auto_increment-vs-serial-跨-transaction-行為差">Case 1：Auto_increment vs SERIAL 跨 transaction 行為差</h3>
<p><strong>徵兆</strong>：cutover 後某 batch job 跑得比 MySQL 慢 5-10x、PG log 顯示 sequence 競爭。</p>
<p><strong>根因</strong>：MySQL <code>AUTO_INCREMENT</code> 取值受 <code>innodb_autoinc_lock_mode</code> 控制（8.0 預設 mode=2 interleaved 可並行、mode=0 才是 table-level lock；詳見 <a href="/blog/backend/01-database/vendors/mysql/lock-contention/" data-link-title="MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現" data-link-desc="MySQL InnoDB 的 lock 是 row-level、但 *為什麼某些 row 莫名其妙也被 lock* 是 gap lock / next-key lock 設計造成的隱性行為。本文從一個 production case 開場（staging 重現 deadlock / production 6 個月後突然爆）、走 5 種 InnoDB lock 類型（record / gap / next-key / insert intention / auto-inc）、isolation level 對 lock 行為的決定性影響、deadlock detection / SHOW ENGINE INNODB STATUS 解讀、5 production 踩雷（gap lock 阻塞 INSERT / auto-inc lock contention / FK lock cascading / large transaction lock holding / READ COMMITTED 跟 binlog ROW 互動）">Lock contention</a>）、PG <code>SERIAL</code> 是 <em>sequence-level non-transactional</em>；mode=0 場景跟 PG SERIAL 差異最大、mode=2 跟 PG SERIAL 行為較接近（皆可亂號、皆可並行）。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>改 UUID v7 / bigserial</strong>：消除 sequence 競爭</li>
<li><strong>bigserial + cache</strong>：<code>CREATE SEQUENCE ... CACHE 100</code>、batch 預取 100 個 ID 降 contention</li>
<li><strong>批量 insert 改 COPY</strong>：<code>COPY t FROM STDIN</code> 是 PG 對 batch 最快路徑</li>
</ol>
<h3 id="case-2charset--collation-跑出-unicode-異常">Case 2：Charset / collation 跑出 unicode 異常</h3>
<p><strong>徵兆</strong>：cutover 後某些用戶名 / 中文文字 query 對不到結果、<code>SELECT * WHERE name = '張三'</code> 返回空。</p>
<p><strong>根因</strong>：MySQL default <code>utf8mb3</code>（3-byte UTF-8、不能存 emoji / 部分 unicode）、PG default <code>UTF8</code> 全 unicode；資料遷移時 MySQL 端的 utf8mb3 column 帶到 PG 後 <em>bytes 不變</em> 但 <em>collation rule 變</em>；string comparison 結果差。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration audit</strong>：MySQL 強制 <code>utf8mb4</code>、avoid utf8mb3 data</li>
<li><strong>Collation 對位</strong>：MySQL <code>utf8mb4_unicode_ci</code> → PG <code>LC_COLLATE = 'C.utf8'</code> 或 ICU collation</li>
<li><strong>Application encoding contract</strong>：明示 UTF-8 全範圍、不接受 utf8mb3-only client</li>
</ol>
<h3 id="case-3case-sensitivity-反轉">Case 3：Case sensitivity 反轉</h3>
<p><strong>徵兆</strong>：cutover 後 application query <code>SELECT * FROM users</code> 報錯 <code>relation does not exist</code>；但 <code>SELECT * FROM &quot;Users&quot;</code> works。</p>
<p><strong>根因</strong>：MySQL Linux default <em>table name case-sensitive</em>、Windows <em>case-insensitive</em>、配置 <code>lower_case_table_names</code> 影響；PG <em>all identifier folded to lowercase unless quoted</em>。MySQL on macOS 開發環境是 case-insensitive、PG 嚴格 case-sensitive、application code 端可能用 mixed case。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Schema migration 階段強制 lowercase</strong>：所有 table / column name 統一 lowercase</li>
<li><strong>Application code refactor</strong>：grep raw SQL 找 mixed case identifier、改 lowercase</li>
<li><strong>ORM 端設定 <code>naming_strategy</code></strong>：JPA / Hibernate 等明示 lowercase mapping</li>
</ol>
<h3 id="case-4replication-行為差cdc-pipeline-失效">Case 4：Replication 行為差、CDC pipeline 失效</h3>
<p><strong>徵兆</strong>：MySQL 端 binlog-based CDC（Debezium MySQL connector）跑得好好的、cutover 後 PG 端要重建 CDC pipeline、初期 1-2 週 message 模式異常。</p>
<p><strong>根因</strong>：MySQL binlog row format vs PG logical replication slot 完全不同 protocol；Debezium 對兩家連接器是 <em>獨立</em> binary、message schema 部分對等但不直通。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-cutover 建 PG 端 CDC</strong>：Debezium PG connector 提前部署、初期跟 MySQL CDC 並存比對</li>
<li><strong>Schema registry 同步</strong>：Avro schema 從 MySQL 端 export、註冊 PG 端 connector 用同 schema</li>
<li><strong>Consumer 端 idempotent</strong>：cutover 期間 dual-source、consumer 必須 idempotent 避免 duplicate</li>
</ol>
<h3 id="case-5fulltext-index-對應-tsvectorapplication-search-broken">Case 5：FULLTEXT INDEX 對應 tsvector、application search broken</h3>
<p><strong>徵兆</strong>：cutover 後 application 全文搜尋功能失效、<code>MATCH(name) AGAINST('xxx')</code> 不被 PG 認；application 端 raw SQL 對 search 寫死。</p>
<p><strong>根因</strong>：MySQL <code>FULLTEXT INDEX</code> + <code>MATCH ... AGAINST</code> syntax PG 不支援；PG 用 <code>tsvector + ts_rank + to_tsquery</code>、概念對等但 syntax 完全不同。</p>
<p><strong>修法</strong>：</p>
<ol>
<li><strong>Pre-migration</strong>：列 application 用到的 fulltext search 場景、改寫成 tsvector pattern</li>
<li><strong>大型 search 改 Elasticsearch / Meilisearch</strong>：fulltext 是專門 search engine 的本職、不該用 RDBMS 解</li>
<li><strong>降級為 LIKE</strong>：簡單 case <code>WHERE name ILIKE '%xxx%'</code>、performance 較差但相容性好</li>
</ol>
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL</th>
          <th>PostgreSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Instance cost</td>
          <td>對等（同 EC2 / RDS spec）</td>
          <td>對等</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>對等</td>
          <td>對等</td>
      </tr>
      <tr>
          <td>Connection pooling</td>
          <td>proxysql / mysql-proxy</td>
          <td>PgBouncer（更成熟）</td>
      </tr>
      <tr>
          <td>Index performance</td>
          <td>對等</td>
          <td>對等</td>
      </tr>
      <tr>
          <td>JSON performance</td>
          <td>Improving</td>
          <td>JSONB 領先</td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>Async binlog</td>
          <td>Async streaming + logical</td>
      </tr>
      <tr>
          <td>Extension ecosystem</td>
          <td>少</td>
          <td>大（PostGIS / TimescaleDB / pgvector）</td>
      </tr>
      <tr>
          <td>Migration cost (one-time)</td>
          <td>-</td>
          <td>2-6 FTE 月 × project length（含 application）</td>
      </tr>
  </tbody>
</table>
<p>Migration 主要 cost 在 <em>application code refactor + dual-write window operational</em>、不是 DB itself。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="跟-postgresql--aurora-migration-串接">跟 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora migration</a> 串接</h3>
<p>部分組織走 <em>MySQL → PostgreSQL → Aurora</em> 兩段：</p>
<ul>
<li>先 MySQL → self-managed PostgreSQL（schema 對位 + application 改）</li>
<li>穩定後 self-managed PostgreSQL → Aurora（operational simplification）</li>
</ul>
<p>不要一次跑 <em>MySQL → Aurora PostgreSQL compat</em>、認知負擔太大、failure mode 互相干擾。</p>
<h3 id="跟-logical-replication--debezium-對位">跟 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a> 對位</h3>
<p>PG 端 CDC pipeline 在 cutover 完成後立刻可用；可作為 <em>downstream CDC 重建</em> 的契機、設計 outbox pattern 更穩。</p>
<h3 id="下一步議題">下一步議題</h3>
<ul>
<li><strong>MySQL 8 vs PostgreSQL 16 feature gap</strong>：MySQL 8 加了 CTE / window function / generated column；2025+ feature parity 漸高、migration ROI 評估會變</li>
<li><strong>Reverse migration</strong>（PG → MySQL）：少見、通常是 application 端 dependency lock-in（用了 MySQL-specific stored procedure）</li>
<li><strong>MariaDB → PostgreSQL</strong>：跟 MySQL → PG 類似、MariaDB 部分 syntax 略接近 PG（如 <code>RETURNING</code>）</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>Source / target vendor：<a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> / <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a></li>
<li>後續路線：<a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a></li>
<li>平行 migration playbook：<a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a>（同為 Type A 高 schema 差）</li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a>（本文驗證 Type A 標準形態）</li>
</ul>
]]></content:encoded></item><item><title>MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/replication-topology/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/replication-topology/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>replication topology&lt;/em> — 從 single primary 到 multi-replica 部署的 3 個 trade-off 軸跟 5 段配置。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="replication-的-3-個-trade-off-軸--mode-選擇">Replication 的 3 個 trade-off 軸 + mode 選擇&lt;/h2>
&lt;p>Replication mode 選擇看起來是「選 async 還是 semi-sync」、但決策實際是 3 個獨立 trade-off 軸的權衡、async / semi-sync 是這些軸的兩個常見組合 &lt;em>名稱&lt;/em>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>軸&lt;/th>
 &lt;th>端 A&lt;/th>
 &lt;th>端 B&lt;/th>
 &lt;th>MySQL 旋鈕&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>Durability&lt;/strong>&lt;/td>
 &lt;td>primary 寫完就 commit&lt;/td>
 &lt;td>至少一個 standby 收到才 commit&lt;/td>
 &lt;td>&lt;code>rpl_semi_sync_master_enabled&lt;/code> / sync ack count&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Latency&lt;/strong>&lt;/td>
 &lt;td>client 等 primary 寫完 OK&lt;/td>
 &lt;td>client 等 standby ack（額外 RTT）&lt;/td>
 &lt;td>&lt;code>rpl_semi_sync_master_timeout&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Consistency&lt;/strong>&lt;/td>
 &lt;td>replica 隨時可能 stale&lt;/td>
 &lt;td>replica 跟 primary 保證讀到一致&lt;/td>
 &lt;td>application read routing rule（不是 replication 旋鈕）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>「async vs semi-sync」實際上是 &lt;em>durability + latency 兩軸&lt;/em> 的選擇、不影響 &lt;em>consistency 軸&lt;/em>（consistency 在 read routing 層決定）。Group Replication / MySQL Cluster（synchronous multi-primary）會同時改三軸、是另一個故事、不在本文 scope。&lt;/p>
&lt;p>跟這三軸獨立的、是 &lt;em>replication 機制本身的可維護性&lt;/em>。binlog position-based replication 用 &lt;code>(file, position)&lt;/code> 標 replica 進度、failover 時要對齊 position 容易出錯；&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/gtid/" data-link-title="GTID" data-link-desc="說明全域交易識別碼如何讓複製進度與故障切換不依賴實體 log 位置">&lt;strong>GTID（Global Transaction Identifier）&lt;/strong>&lt;/a>用全域 transaction ID 標進度、failover / re-pointing 不必算 position。GTID 是 &lt;em>跨 mode 的 infrastructure&lt;/em>、不是第三種 mode。&lt;/p>
&lt;h2 id="async-replicationdefault--高-throughput-的代價">Async replication：default + 高 throughput 的代價&lt;/h2>
&lt;p>Async 是 MySQL 預設、行為：&lt;/p>
&lt;ol>
&lt;li>Primary 寫 binlog、立刻 commit、回應 client OK&lt;/li>
&lt;li>Replica 的 IO thread 從 primary pull binlog event 到 local relay log&lt;/li>
&lt;li>Replica 的 SQL thread apply relay log（單 thread 或 multi-thread parallel）&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Trade-off&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>replication topology</em> — 從 single primary 到 multi-replica 部署的 3 個 trade-off 軸跟 5 段配置。</p></blockquote>
<hr>
<h2 id="replication-的-3-個-trade-off-軸--mode-選擇">Replication 的 3 個 trade-off 軸 + mode 選擇</h2>
<p>Replication mode 選擇看起來是「選 async 還是 semi-sync」、但決策實際是 3 個獨立 trade-off 軸的權衡、async / semi-sync 是這些軸的兩個常見組合 <em>名稱</em>：</p>
<table>
  <thead>
      <tr>
          <th>軸</th>
          <th>端 A</th>
          <th>端 B</th>
          <th>MySQL 旋鈕</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Durability</strong></td>
          <td>primary 寫完就 commit</td>
          <td>至少一個 standby 收到才 commit</td>
          <td><code>rpl_semi_sync_master_enabled</code> / sync ack count</td>
      </tr>
      <tr>
          <td><strong>Latency</strong></td>
          <td>client 等 primary 寫完 OK</td>
          <td>client 等 standby ack（額外 RTT）</td>
          <td><code>rpl_semi_sync_master_timeout</code></td>
      </tr>
      <tr>
          <td><strong>Consistency</strong></td>
          <td>replica 隨時可能 stale</td>
          <td>replica 跟 primary 保證讀到一致</td>
          <td>application read routing rule（不是 replication 旋鈕）</td>
      </tr>
  </tbody>
</table>
<p>「async vs semi-sync」實際上是 <em>durability + latency 兩軸</em> 的選擇、不影響 <em>consistency 軸</em>（consistency 在 read routing 層決定）。Group Replication / MySQL Cluster（synchronous multi-primary）會同時改三軸、是另一個故事、不在本文 scope。</p>
<p>跟這三軸獨立的、是 <em>replication 機制本身的可維護性</em>。binlog position-based replication 用 <code>(file, position)</code> 標 replica 進度、failover 時要對齊 position 容易出錯；<a href="/blog/backend/knowledge-cards/gtid/" data-link-title="GTID" data-link-desc="說明全域交易識別碼如何讓複製進度與故障切換不依賴實體 log 位置"><strong>GTID（Global Transaction Identifier）</strong></a>用全域 transaction ID 標進度、failover / re-pointing 不必算 position。GTID 是 <em>跨 mode 的 infrastructure</em>、不是第三種 mode。</p>
<h2 id="async-replicationdefault--高-throughput-的代價">Async replication：default + 高 throughput 的代價</h2>
<p>Async 是 MySQL 預設、行為：</p>
<ol>
<li>Primary 寫 binlog、立刻 commit、回應 client OK</li>
<li>Replica 的 IO thread 從 primary pull binlog event 到 local relay log</li>
<li>Replica 的 SQL thread apply relay log（單 thread 或 multi-thread parallel）</li>
</ol>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>Durability：primary 寫完 commit、replica 還沒 pull = primary 在這瞬間 crash + 永久故障 → <em>data loss</em>（已 commit 的 transaction 在 replica 不存在）</li>
<li>Latency：client 不等 replica、寫入延遲 = primary 自身寫 binlog 的時間（通常 &lt; 1ms with <code>innodb_flush_log_at_trx_commit=1</code>）</li>
<li>Consistency：replica 可能 lag、application 讀 replica 會 stale；用 <code>SHOW SLAVE STATUS</code> 看 <code>Seconds_Behind_Master</code></li>
</ul>
<p><strong>適用</strong>：</p>
<ul>
<li>主流選擇（90% 場景）</li>
<li>Failover loss 在容忍範圍（多數 web 應用容忍 1-2 秒 data loss）</li>
<li>Read scaling 為主要 driver、絕對 durability 非首要</li>
</ul>
<p><strong>不適用</strong>：</p>
<ul>
<li>金融交易 / 訂單系統、不允許 any data loss</li>
<li>Compliance 要求 zero data loss（PCI-DSS / 部分監管場景）</li>
</ul>
<h2 id="semi-sync-replication至少一個-standby-ack-才-commit">Semi-sync replication：至少一個 standby ack 才 commit</h2>
<p>Semi-sync 在 async 基礎上加 <em>primary 等至少 N 個 replica ack 才 commit</em> 的步驟：</p>
<ol>
<li>Primary 寫 binlog</li>
<li>Primary 發送 binlog event 到所有 replica</li>
<li><em>Primary 等至少 N 個 replica 回 ack</em>（N 是 <code>rpl_semi_sync_master_wait_for_slave_count</code>、預設 1）</li>
<li>Primary commit、回應 client</li>
</ol>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>Durability：至少 N 個 replica 收到 binlog（不一定 apply）、primary crash 後 replica 還有 binlog 可 promote、保證 zero data loss（但是 <em>binlog-level</em>、不是 <em>applied-level</em>）</li>
<li>Latency：client 等 primary + 一輪 replica ack RTT；跨 AZ 通常 +1-3ms、跨 region 可能 +50-200ms</li>
<li>Consistency：跟 async 一樣、replica apply 仍 async、application 讀 replica 仍可能 stale</li>
</ul>
<p><strong>MySQL 5.7+ 區分 <em>standard</em> 跟 <em>Loss-Less</em> semi-sync</strong>：</p>
<ul>
<li>Standard semi-sync（5.5-5.6）：primary 先 commit 再等 ack、ack 超時 fallback 成 async — <em>仍可能 lose data</em></li>
<li>Loss-Less semi-sync（5.7+、<code>rpl_semi_sync_master_wait_point=AFTER_SYNC</code>）：primary 寫完 binlog 但 <em>先等 ack 再 commit</em>、ack 超時 fallback async 之前已寫 binlog 仍保證 durable</li>
</ul>
<p>Production 場景必須用 Loss-Less semi-sync、不是 standard。</p>
<p><strong>適用</strong>：</p>
<ul>
<li>金融交易 / 訂單 / payment ledger</li>
<li>不允許 data loss、可接受寫入延遲 +1-3ms</li>
<li>已有 multi-AZ / multi-region 部署、replica 物理上可靠</li>
</ul>
<p><strong>不適用</strong>：</p>
<ul>
<li>跨 region semi-sync（RTT 50-200ms）通常不划算 — 寫吞吐砍半、改用 <em>region-local sync replica + cross-region async chain</em></li>
<li>寫吞吐 &gt; 50K WPS 且容忍 sub-second loss — async 即可</li>
</ul>
<h2 id="gtid-based-replication機制升級跨-mode-都需要">GTID-based replication：機制升級、跨 mode 都需要</h2>
<p>GTID 把每個 transaction 標一個全域 ID：<code>&lt;server_uuid&gt;:&lt;transaction_id&gt;</code>。Replica 紀錄「已 apply 的 GTID set」、不再用 <code>(binlog_file, position)</code>。</p>
<p><strong>為什麼 GTID 比 binlog position 好</strong>：</p>
<ul>
<li><strong>Failover re-pointing 簡單</strong>：promote 新 primary 後、其他 replica 重新 attach 不必算 <code>MASTER_LOG_FILE</code> + <code>MASTER_LOG_POS</code>、用 <code>CHANGE MASTER TO MASTER_AUTO_POSITION=1</code> 即可</li>
<li><strong>Multi-source replication 可行</strong>：一個 replica 從多個 primary 拉、各 primary 的 GTID set 獨立 track</li>
<li><strong>Consistency check 容易</strong>：兩個 server 對 GTID set、就知道誰落後、有無 gap</li>
<li><strong>跟 group replication / MySQL Cluster 必需</strong>：5.7+ 多 primary 場景 GTID 是前提</li>
</ul>
<p><strong>設定流程</strong>（兩階段、不能直接開）：</p>
<ol>
<li>
<p><strong>Phase 1 (預備、所有 server 同 mode)</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">gtid_mode</span> <span class="o">=</span> <span class="s">ON_PERMISSIVE  -- 接受 GTID 跟 non-GTID transaction</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">enforce_gtid_consistency</span> <span class="o">=</span> <span class="s">ON  -- 拒絕無法用 GTID 表達的 statement（CREATE TABLE...SELECT 等）</span></span></span></code></pre></div></li>
<li>
<p><strong>Phase 2 (rolling、全部 server 都 Phase 1 後)</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">gtid_mode</span> <span class="o">=</span> <span class="s">ON  -- 只接受 GTID transaction</span></span></span></code></pre></div></li>
</ol>
<p>跳 phase 直接 <code>gtid_mode=ON</code> 會讓 replication break（既有 non-GTID transaction 無法處理）。Production 啟用 GTID 要排 maintenance window、跑完 phase 1 觀察 1-2 天再進 phase 2。</p>
<h2 id="配置-step-by-steploss-less-semi-sync--gtid-組合">配置 step-by-step（Loss-Less semi-sync + GTID 組合）</h2>
<p>實務最常見組合：Loss-Less semi-sync + GTID。配置順序：</p>
<h3 id="step-1primary--replica-都開-gtid兩-phase-跑完">Step 1：Primary + replica 都開 GTID（兩 phase 跑完）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># my.cnf on primary AND replica</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">gtid_mode</span> <span class="o">=</span> <span class="s">ON</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">enforce_gtid_consistency</span> <span class="o">=</span> <span class="s">ON</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">log_bin</span> <span class="o">=</span> <span class="s">mysql-bin</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">log_slave_updates</span> <span class="o">=</span> <span class="s">1  -- replica 也記 binlog (chained replication 需要)</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">binlog_format</span> <span class="o">=</span> <span class="s">ROW    -- ROW 比 STATEMENT 安全</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">sync_binlog</span> <span class="o">=</span> <span class="s">1        -- 每次 commit fsync binlog</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="na">innodb_flush_log_at_trx_commit</span> <span class="o">=</span> <span class="s">1  -- 每次 commit fsync InnoDB log</span></span></span></code></pre></div><h3 id="step-2primary-安裝-semi-sync-plugin">Step 2：Primary 安裝 semi-sync plugin</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">PLUGIN</span><span class="w"> </span><span class="n">rpl_semi_sync_master</span><span class="w"> </span><span class="n">SONAME</span><span class="w"> </span><span class="s1">&#39;semisync_master.so&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="k">GLOBAL</span><span class="w"> </span><span class="n">rpl_semi_sync_master_enabled</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="k">GLOBAL</span><span class="w"> </span><span class="n">rpl_semi_sync_master_wait_for_slave_count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 至少 1 個 ack
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="k">GLOBAL</span><span class="w"> </span><span class="n">rpl_semi_sync_master_wait_point</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">AFTER_SYNC</span><span class="p">;</span><span class="w">   </span><span class="c1">-- Loss-Less
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SET</span><span class="w"> </span><span class="k">GLOBAL</span><span class="w"> </span><span class="n">rpl_semi_sync_master_timeout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">10000</span><span class="p">;</span><span class="w">           </span><span class="c1">-- 10s timeout、超時 fallback async</span></span></span></code></pre></div><h3 id="step-3replica-安裝-semi-sync-plugin">Step 3：Replica 安裝 semi-sync plugin</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">INSTALL</span><span class="w"> </span><span class="n">PLUGIN</span><span class="w"> </span><span class="n">rpl_semi_sync_slave</span><span class="w"> </span><span class="n">SONAME</span><span class="w"> </span><span class="s1">&#39;semisync_slave.so&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="k">GLOBAL</span><span class="w"> </span><span class="n">rpl_semi_sync_slave_enabled</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="n">STOP</span><span class="w"> </span><span class="n">SLAVE</span><span class="w"> </span><span class="n">IO_THREAD</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">START</span><span class="w"> </span><span class="n">SLAVE</span><span class="w"> </span><span class="n">IO_THREAD</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 重啟 IO thread 啟用 semi-sync</span></span></span></code></pre></div><h3 id="step-4replica-attach-primary">Step 4：Replica attach primary</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="n">CHANGE</span><span class="w"> </span><span class="n">MASTER</span><span class="w"> </span><span class="k">TO</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="n">MASTER_HOST</span><span class="o">=</span><span class="s1">&#39;primary.example.com&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="n">MASTER_PORT</span><span class="o">=</span><span class="mi">3306</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="n">MASTER_USER</span><span class="o">=</span><span class="s1">&#39;repl&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">  </span><span class="n">MASTER_PASSWORD</span><span class="o">=</span><span class="s1">&#39;...&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="n">MASTER_AUTO_POSITION</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 用 GTID auto-position
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">START</span><span class="w"> </span><span class="n">SLAVE</span><span class="p">;</span></span></span></code></pre></div><h3 id="step-5驗證">Step 5：驗證</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Primary: 確認 semi-sync 啟用 + 有 active client
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">STATUS</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;Rpl_semi_sync_master_status&#39;</span><span class="p">;</span><span class="w">      </span><span class="c1">-- ON
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">STATUS</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;Rpl_semi_sync_master_clients&#39;</span><span class="p">;</span><span class="w">     </span><span class="c1">-- ≥ 1
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">STATUS</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;Rpl_semi_sync_master_yes_tx&#39;</span><span class="p">;</span><span class="w">      </span><span class="c1">-- &gt; 0 (有 transaction 走 semi-sync)
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">STATUS</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;Rpl_semi_sync_master_no_tx&#39;</span><span class="p">;</span><span class="w">       </span><span class="c1">-- 應該 = 0 (沒有 fallback 成 async)
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="c1">-- Replica: 確認 GTID + IO thread 正常
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">SLAVE</span><span class="w"> </span><span class="n">STATUS</span><span class="err">\</span><span class="k">G</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- Slave_IO_Running: Yes
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1">-- Slave_SQL_Running: Yes
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1">-- Retrieved_Gtid_Set: 跟 primary Executed_Gtid_Set 接近
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1">-- Seconds_Behind_Master: 觀察 lag</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-replication-lag-暴衝--單-sql-thread-bottleneck">1. Replication lag 暴衝 — 單 SQL thread bottleneck</h3>
<p>預設 replica 的 SQL thread 是 <em>單 thread</em> apply、primary 多 thread 寫入時 replica 跟不上、lag 從 &lt; 100ms 飆到分鐘級。常見觸發：批次 UPDATE / DELETE、大 transaction、index rebuild。</p>
<p>修法：</p>
<ul>
<li>啟用 <em>multi-thread replication</em>：<code>slave_parallel_workers = 8</code>（per database 或 per logical clock parallel）</li>
<li>5.7+ 用 <code>slave_parallel_type = LOGICAL_CLOCK</code>：依 primary 上的 group commit 並行度自動 parallel</li>
<li>8.0+ 的 <em>writeset-based parallel</em>：<code>binlog_transaction_dependency_tracking = WRITESET</code>、更細粒度並行</li>
</ul>
<p>監控：<code>Seconds_Behind_Master</code> 是 <em>表面指標</em>、實際看 <code>Executed_Gtid_Set</code> 跟 primary 對比的 GTID gap 更準。</p>
<h3 id="2-semi-sync-timeout-fallback-成-async沒監控就看不見">2. Semi-sync timeout fallback 成 async（沒監控就看不見）</h3>
<p><code>rpl_semi_sync_master_timeout</code> 預設 10000ms（10 秒）、超時後 <em>自動 fallback async</em>、直到 replica 重連。Application 視角看不到任何 error、但 <em>durability guarantee 已失效</em>。</p>
<p>修法：</p>
<ul>
<li>監控 <code>Rpl_semi_sync_master_status</code> — fallback 後變 OFF</li>
<li>監控 <code>Rpl_semi_sync_master_no_tx</code> — fallback 期間每個 transaction 都計數</li>
<li>Alert 規則：5 分鐘內 <code>no_tx</code> 增加 &gt; 0 即告警</li>
<li>Timeout 設太短（&lt; 5s）容易 false positive、設太長（&gt; 30s）crash 時 data loss 風險增</li>
</ul>
<h3 id="3-gtid-gap--replica-無法-attach">3. GTID gap — replica 無法 attach</h3>
<p>Replica 重新 attach primary 時報 <code>ERROR 1236: ... transactions you need from master are purged</code>、原因是 primary 的 <code>binlog_expire_logs_seconds</code> 過短、需要的 binlog 已被清掉。GTID 模式下這個錯誤更明顯（直接看 GTID gap）、但 binlog position 模式下也一樣。</p>
<p>修法：</p>
<ul>
<li><code>binlog_expire_logs_seconds = 604800</code>（7 天）作為 baseline</li>
<li>大流量 server 確認 disk 容量能撐 7 天 binlog（一個高峰小時 binlog 可能 GB 級）</li>
<li>真的 gap 太大時用 <em>base backup + replay binlog</em> 重建 replica、不要硬 reset GTID</li>
</ul>
<h3 id="4-loss-less-semi-sync-不一定真的-loss-less">4. Loss-Less semi-sync 不一定真的 loss-less</h3>
<p><code>AFTER_SYNC</code> 模式 <em>primary 寫 binlog → 等 ack → commit</em>、看起來 zero loss。但 <em>primary 寫完 binlog 還沒等 ack 時 crash</em> + replica <em>剛好沒收到那個 binlog event</em> + replica promote — 這個 binlog event 在新 primary 不存在、但舊 primary 的 binlog 仍紀錄為 <em>已寫 binlog 未 commit</em>。client 收到 <em>connection lost</em>、不知道 transaction 是否成功。</p>
<p>修法：</p>
<ul>
<li>接受這個 <em>edge case unknown state</em>、application 用 idempotency key + retry 處理</li>
<li>Loss-Less semi-sync 保證的是 <em>已 commit transaction 不會丟</em>、不是 <em>所有寫入都 ack-and-tell</em></li>
<li>真的 zero unknown state 需要 group replication / Galera Cluster / MySQL Cluster（synchronous multi-primary）</li>
</ul>
<h3 id="5-chained-replication-雪崩">5. Chained replication 雪崩</h3>
<p>Topology 是 <code>primary → replica1 → replica2 → ...</code>（hub-and-spoke 之外的選擇、節省 primary 出口頻寬）。Replica1 SQL thread 卡住、replica2 跟 replica3 都被 block、整條 chain 雪崩。</p>
<p>修法：</p>
<ul>
<li>避免超過 2 層 chain（primary → tier1 replica → tier2 replica 是上限）</li>
<li>用 <em>parallel binary log relay</em>（5.7+ <code>slave_pending_jobs_size_max</code> + parallel workers）讓 chain 中段不阻塞</li>
<li>規模真的大、改用 <em>binlog server</em>（如 Maxwell / MaxScale）解耦 chain dependency</li>
<li>跨 region 用 <em>region-local hub + cross-region async</em>、不是長 chain</li>
</ul>
<h2 id="容量--cost-對照">容量 / cost 對照</h2>
<table>
  <thead>
      <tr>
          <th>配置</th>
          <th>寫吞吐影響</th>
          <th>Replica overhead</th>
          <th>適合 workload</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Async + binlog position</td>
          <td>baseline</td>
          <td>低（IO + SQL thread）</td>
          <td>高吞吐、容忍 sub-second loss</td>
      </tr>
      <tr>
          <td>Async + GTID</td>
          <td>baseline</td>
          <td>同上、failover 容易</td>
          <td>大多數 production 預設</td>
      </tr>
      <tr>
          <td>Loss-Less semi-sync + GTID（1 ack）</td>
          <td>-10% ~ -20%</td>
          <td>同上 + ack RTT</td>
          <td>金融、訂單、不容忍 data loss</td>
      </tr>
      <tr>
          <td>Loss-Less semi-sync + GTID（2 ack）</td>
          <td>-15% ~ -30%</td>
          <td>同上、跨 AZ</td>
          <td>強 durability + multi-AZ HA</td>
      </tr>
      <tr>
          <td>Group Replication（synchronous）</td>
          <td>-30% ~ -50%</td>
          <td>高（每 transaction quorum）</td>
          <td>不允許 single-primary、multi-primary 寫入</td>
      </tr>
  </tbody>
</table>
<p>跨 AZ semi-sync 通常加 1-3ms、跨 region 加 50-200ms — 寫密集 workload 跨 region semi-sync 通常不划算、改用 <em>region-local sync + cross-region async chain</em>。</p>
<h2 id="整合--下一步">整合 / 下一步</h2>
<h3 id="aurora-mysql">Aurora MySQL</h3>
<p>Aurora MySQL 用 <em>AWS-managed storage layer</em>、storage 自動 replicate 6 份跨 3 AZ、不需要應用層配 semi-sync。從自管 MySQL 遷 Aurora 時、上方所有 semi-sync 配置 <em>消失</em>、改成 Aurora storage quorum（4 of 6 write、3 of 6 read）。</p>
<p>trade-off 軸的 <em>durability</em> 完全交給 Aurora、application 只關心 <em>latency</em> + <em>consistency</em>。詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>。</p>
<h3 id="vitesssharding-layer">Vitess（sharding layer）</h3>
<p>Vitess shard 內部仍用 MySQL replication（async or semi-sync）、Vitess 不取代 replication topology、是 <em>上層 routing</em>。Vitess <code>vttablet</code> 每個 shard 有自己的 primary + replica、跟本文 topology 設計一致。</p>
<p>Vitess 比較大議題在 <em>cross-shard transaction</em>（VReplication 跨 shard binlog stream）、不是 replication topology — 詳見 MySQL backlog 中 <em>Vitess sharding 設計</em> 篇（待寫）。</p>
<h3 id="proxysqlread-replica-routing">ProxySQL（read replica routing）</h3>
<p>ProxySQL 是 MySQL 生態的 <em>connection pool + query routing</em> 標準、按 query type（SELECT vs DML）跟 replica lag 自動 route。寫入路 primary、讀走 replica、replica lag &gt; N 秒時暫時退路 primary 維持 consistency。</p>
<p>ProxySQL 跟本文 replication topology 是 <em>互補不重疊</em> — replication 設定哪些 server 有什麼資料、ProxySQL 設定 query 怎麼分配。詳見 MySQL backlog 中 <em>ProxySQL 配置</em> 篇（待寫）。</p>
<h3 id="orchestratorha-failover">Orchestrator（HA failover）</h3>
<p>Orchestrator 是 MySQL HA topology 管理 + 自動 failover 工具、用 GTID 偵測 replica 進度、failover 時自動 promote 最新 replica。對比 PostgreSQL 的 Patroni（詳見 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">Patroni HA</a>）— 兩者角色相同、Orchestrator 需要 GTID + 對 MySQL 行為熟、Patroni 需要 DCS（etcd / Consul）+ 對 PG 行為熟。</p>
<p>詳見 MySQL backlog 中 <em>Orchestrator failover 設計</em> 篇（待寫）。</p>
<h3 id="cdcmaxwell--debezium">CDC（Maxwell / Debezium）</h3>
<p>Maxwell（Zendesk 出品、MySQL-only）跟 Debezium（Red Hat、MySQL / PG / MongoDB 都支援）都讀 MySQL binlog 轉成 event stream（Kafka / Kinesis / Pulsar）。Binlog 必須 <code>ROW</code> format、GTID 啟用後 <em>exactly-once</em> delivery 更好維護（不需算 binlog position）。</p>
<p>跟 PG logical replication + Debezium 對比、MySQL 用 binlog（physical / row-level）不是 logical decoding、所以 schema change 時 <em>CDC consumer 要 schema-aware</em> 處理。詳見 MySQL backlog 中 <em>Binary log + Maxwell / Debezium CDC</em> 篇（待寫）。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PostgreSQL Replication Topology</a>（PG sibling、streaming + LSN + slot 機制 vs MySQL binlog 對位）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PostgreSQL Patroni HA</a>（PG sibling、不同 HA 機制）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PostgreSQL Logical Replication + Debezium</a>（PG CDC sibling、不同 replication 抽象層）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-slot-management/" data-link-title="PostgreSQL Replication Slot Management：Physical / Logical / Failover Slot 治理" data-link-desc="PG replication slot 是 *primary 端的 standby 進度紀錄*、防 WAL premature deletion。但 orphan slot 會吃 disk、failover 後 logical slot 不會自動跟新 primary、是 PG 操作的 hidden complexity。本文走 physical / logical slot 差異、slot lifecycle、failover slot synchronization（PG 17&#43; 新特性）、orphan slot 治理、5 production 踩雷（orphan slot disk 爆 / logical slot lag / failover 後 slot 丟 / wal_keep_size 跟 slot 衝突 / connection 同時打 slot 數量限制）">PostgreSQL Replication Slot Management</a>（PG slot 治理、MySQL 無對應概念）</li>
<li><a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>（managed MySQL、replication 交給 storage layer）</li>
<li><a href="/blog/backend/01-database/transaction-boundary/" data-link-title="1.3 Transaction 與一致性邊界" data-link-desc="交易邊界、isolation level、retry 策略、distributed transaction（2PC、Saga）與跨 region 強一致取捨">1.3 Transaction Boundary</a>（transaction 行為跟 replication 互動）</li>
<li><a href="/blog/backend/01-database/kv-document-capacity-planning/" data-link-title="1.10 KV / Document DB 容量規劃" data-link-desc="DynamoDB / Cosmos DB / Bigtable / MongoDB 等 KV / Document DB 的容量設計、partition key 取捨、capacity mode 選擇">1.10 KV / Document DB 容量規劃</a> / <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>（替代路徑）</li>
<li><a href="/blog/backend/knowledge-cards/quorum/" data-link-title="Quorum" data-link-desc="分散式系統以多數節點同意作為提交或讀取有效性的門檻">quorum 卡片</a> / <a href="/blog/backend/knowledge-cards/eventual-consistency/" data-link-title="Eventual Consistency" data-link-desc="允許短暫不一致、最終收斂到同一資料狀態的一致性語意">eventual consistency 卡片</a> / <a href="/blog/backend/knowledge-cards/stale-read/" data-link-title="Stale Read" data-link-desc="讀取到落後於最新寫入版本的舊資料">stale read 卡片</a></li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/replication.html">MySQL Replication</a> / <a href="https://dev.mysql.com/doc/refman/8.0/en/replication-semisync.html">Semi-Sync</a> / <a href="https://dev.mysql.com/doc/refman/8.0/en/replication-gtids.html">GTID</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/online-schema-change-tools/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/online-schema-change-tools/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>online schema change&lt;/em> — gh-ost 跟 pt-online-schema-change 兩條工具路徑的機制對比。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>機制&lt;/th>
 &lt;th>pt-online-schema-change（Percona）&lt;/th>
 &lt;th>gh-ost（GitHub）&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>同步機制&lt;/td>
 &lt;td>&lt;strong>MySQL trigger&lt;/strong>（原表 INSERT/UPDATE/DELETE 觸發寫 ghost）&lt;/td>
 &lt;td>&lt;strong>Binlog stream&lt;/strong>（讀 primary binlog 寫 ghost）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Primary 寫入 overhead&lt;/td>
 &lt;td>trigger 觸發成本（同 transaction 內）&lt;/td>
 &lt;td>0（binlog 已存在）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replica lag 影響&lt;/td>
 &lt;td>trigger 在 primary 跑、replica 自然 lag&lt;/td>
 &lt;td>從 replica 讀 binlog、可主動 throttle&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Foreign key&lt;/td>
 &lt;td>部分支援（drop/recreate strategy）&lt;/td>
 &lt;td>不支援（必須先 drop FK）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Roll back（過程中）&lt;/td>
 &lt;td>困難（trigger 已建、要清乾淨）&lt;/td>
 &lt;td>容易（drop ghost table 即可）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>暫停 / resume&lt;/td>
 &lt;td>不支援&lt;/td>
 &lt;td>支援（gh-ost interactive command）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>切換時 lock 持續&lt;/td>
 &lt;td>rename 期間 metadata lock（毫秒級）&lt;/td>
 &lt;td>rename 期間 metadata lock（毫秒級）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>工具 binary&lt;/td>
 &lt;td>Perl 腳本（Percona Toolkit）&lt;/td>
 &lt;td>Go binary（單一可執行檔）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>推出年份&lt;/td>
 &lt;td>2011&lt;/td>
 &lt;td>2016&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>兩工具最終結果一樣（ghost table 取代原表）、但 &lt;em>過程中對 production 的影響非常不同&lt;/em>。選哪個取決於：trigger overhead 可不可接受、是否有 foreign key、是否需要 resume/throttle 能力、團隊熟悉哪條工具鏈。&lt;/p>
&lt;h2 id="為什麼-alter-table-需要-online-path">為什麼 ALTER TABLE 需要 online path&lt;/h2>
&lt;p>MySQL 8.0 之前的 &lt;code>ALTER TABLE&lt;/code> 多數情況下 &lt;em>rebuild 整張表&lt;/em> — 過程中 &lt;em>primary key 之外的 read/write 都 block&lt;/em>。100 GB 表 ALTER 跑 hours、production write 全部失敗。&lt;/p>
&lt;p>MySQL 8.0 加 &lt;em>Instant DDL&lt;/em>（部分 ALTER 不 rebuild、只改 metadata、毫秒級完成）、但 &lt;em>能用 instant 的 ALTER 是 subset&lt;/em>：&lt;/p>
&lt;ul>
&lt;li>支援：ADD COLUMN（末尾）、DROP COLUMN（部分情境）、RENAME COLUMN&lt;/li>
&lt;li>不支援：ADD INDEX、CHANGE COLUMN type、ADD/DROP PRIMARY KEY、ADD FOREIGN KEY&lt;/li>
&lt;/ul>
&lt;p>不支援 instant 的場景仍要走 ghost table。Percona 跟 GitHub 各自從 production 痛點出發、產出 pt-osc（2011）跟 gh-ost（2016）。&lt;/p>
&lt;h2 id="pt-online-schema-change用-trigger-同步寫入">pt-online-schema-change：用 trigger 同步寫入&lt;/h2>
&lt;p>pt-osc 流程：&lt;/p>
&lt;ol>
&lt;li>CREATE ghost table（跟原表同 schema + 你要的 ALTER）&lt;/li>
&lt;li>在原表上 &lt;em>建 3 個 trigger&lt;/em>：INSERT / UPDATE / DELETE&lt;/li>
&lt;li>任何寫入原表的 transaction &lt;em>同時觸發 trigger&lt;/em> 寫對應 ghost&lt;/li>
&lt;li>背景 chunk-by-chunk copy 既有 row 到 ghost&lt;/li>
&lt;li>全部 copy 完後 &lt;code>RENAME TABLE&lt;/code>：原表 → archive、ghost → 原表名（atomic、metadata lock 毫秒級）&lt;/li>
&lt;li>Drop trigger、drop archive&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Trade-off&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>online schema change</em> — gh-ost 跟 pt-online-schema-change 兩條工具路徑的機制對比。</p></blockquote>
<hr>
<table>
  <thead>
      <tr>
          <th>機制</th>
          <th>pt-online-schema-change（Percona）</th>
          <th>gh-ost（GitHub）</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>同步機制</td>
          <td><strong>MySQL trigger</strong>（原表 INSERT/UPDATE/DELETE 觸發寫 ghost）</td>
          <td><strong>Binlog stream</strong>（讀 primary binlog 寫 ghost）</td>
      </tr>
      <tr>
          <td>Primary 寫入 overhead</td>
          <td>trigger 觸發成本（同 transaction 內）</td>
          <td>0（binlog 已存在）</td>
      </tr>
      <tr>
          <td>Replica lag 影響</td>
          <td>trigger 在 primary 跑、replica 自然 lag</td>
          <td>從 replica 讀 binlog、可主動 throttle</td>
      </tr>
      <tr>
          <td>Foreign key</td>
          <td>部分支援（drop/recreate strategy）</td>
          <td>不支援（必須先 drop FK）</td>
      </tr>
      <tr>
          <td>Roll back（過程中）</td>
          <td>困難（trigger 已建、要清乾淨）</td>
          <td>容易（drop ghost table 即可）</td>
      </tr>
      <tr>
          <td>暫停 / resume</td>
          <td>不支援</td>
          <td>支援（gh-ost interactive command）</td>
      </tr>
      <tr>
          <td>切換時 lock 持續</td>
          <td>rename 期間 metadata lock（毫秒級）</td>
          <td>rename 期間 metadata lock（毫秒級）</td>
      </tr>
      <tr>
          <td>工具 binary</td>
          <td>Perl 腳本（Percona Toolkit）</td>
          <td>Go binary（單一可執行檔）</td>
      </tr>
      <tr>
          <td>推出年份</td>
          <td>2011</td>
          <td>2016</td>
      </tr>
  </tbody>
</table>
<p>兩工具最終結果一樣（ghost table 取代原表）、但 <em>過程中對 production 的影響非常不同</em>。選哪個取決於：trigger overhead 可不可接受、是否有 foreign key、是否需要 resume/throttle 能力、團隊熟悉哪條工具鏈。</p>
<h2 id="為什麼-alter-table-需要-online-path">為什麼 ALTER TABLE 需要 online path</h2>
<p>MySQL 8.0 之前的 <code>ALTER TABLE</code> 多數情況下 <em>rebuild 整張表</em> — 過程中 <em>primary key 之外的 read/write 都 block</em>。100 GB 表 ALTER 跑 hours、production write 全部失敗。</p>
<p>MySQL 8.0 加 <em>Instant DDL</em>（部分 ALTER 不 rebuild、只改 metadata、毫秒級完成）、但 <em>能用 instant 的 ALTER 是 subset</em>：</p>
<ul>
<li>支援：ADD COLUMN（末尾）、DROP COLUMN（部分情境）、RENAME COLUMN</li>
<li>不支援：ADD INDEX、CHANGE COLUMN type、ADD/DROP PRIMARY KEY、ADD FOREIGN KEY</li>
</ul>
<p>不支援 instant 的場景仍要走 ghost table。Percona 跟 GitHub 各自從 production 痛點出發、產出 pt-osc（2011）跟 gh-ost（2016）。</p>
<h2 id="pt-online-schema-change用-trigger-同步寫入">pt-online-schema-change：用 trigger 同步寫入</h2>
<p>pt-osc 流程：</p>
<ol>
<li>CREATE ghost table（跟原表同 schema + 你要的 ALTER）</li>
<li>在原表上 <em>建 3 個 trigger</em>：INSERT / UPDATE / DELETE</li>
<li>任何寫入原表的 transaction <em>同時觸發 trigger</em> 寫對應 ghost</li>
<li>背景 chunk-by-chunk copy 既有 row 到 ghost</li>
<li>全部 copy 完後 <code>RENAME TABLE</code>：原表 → archive、ghost → 原表名（atomic、metadata lock 毫秒級）</li>
<li>Drop trigger、drop archive</li>
</ol>
<p><strong>Trade-off</strong>：</p>
<ul>
<li><em>寫入 overhead</em>：每個 primary 寫入 transaction 都多一次 trigger 執行、寫吞吐降 10-30%</li>
<li><em>Replica lag</em>：trigger 跟原寫入同 transaction、replica 上每個 row 也跑 trigger、replica lag 可能暴增（缺少主動 throttle）</li>
<li><em>Roll back 困難</em>：tool 跑到一半失敗、trigger 已建、要手動清掉才能 retry</li>
<li><em>FK 處理</em>：原表有 FK 指向時、ghost table 要先 drop FK 再 recreate、操作複雜</li>
</ul>
<p><strong>適用</strong>：</p>
<ul>
<li>寫吞吐 &lt; 50% capacity（有 buffer 撐 trigger overhead）</li>
<li>無 FK 或 FK 簡單</li>
<li>沒有 replica lag 敏感的 read（trigger 在 replica 也跑）</li>
</ul>
<p><strong>不適用</strong>：</p>
<ul>
<li>高寫吞吐（&gt; 80% capacity）— trigger overhead 直接 saturate</li>
<li>大量 FK 結構</li>
<li>需要 throttle / pause / resume</li>
</ul>
<h2 id="gh-ost用-binlog-stream-同步寫入">gh-ost：用 binlog stream 同步寫入</h2>
<p>gh-ost 流程：</p>
<ol>
<li>CREATE ghost table</li>
<li><em>從 replica 讀 binlog</em>（不在 primary 加 trigger）</li>
<li>同步 <em>primary 上的寫入</em> 透過 binlog event 寫到 ghost</li>
<li>背景 chunk-by-chunk copy 既有 row 到 ghost</li>
<li>全部 copy 完後 swap：<code>RENAME TABLE</code></li>
<li>Drop archive</li>
</ol>
<p><strong>Trade-off</strong>：</p>
<ul>
<li><em>寫入 overhead</em>：0（binlog 已經寫了、gh-ost 只是 consumer）</li>
<li><em>Replica lag 影響</em>：gh-ost 可監測 replica lag、超過 threshold 自動 throttle copy（不影響 primary 寫入）</li>
<li><em>Roll back 容易</em>：取消時直接 drop ghost table、原表完全沒被改動</li>
<li><em>FK 不支援</em>：gh-ost 設計上不處理 FK、有 FK 必須先 drop / restructure</li>
</ul>
<p><strong>適用</strong>：</p>
<ul>
<li>高寫吞吐 production（trigger overhead 不可接受）</li>
<li>需要 throttle / pause / resume（gh-ost interactive command 可動態調 chunk size、cut-over 時點）</li>
<li>已用 GitHub-flavored MySQL operations workflow</li>
</ul>
<p><strong>不適用</strong>：</p>
<ul>
<li>有複雜 FK 結構、不想動 schema</li>
<li>Replica 跑不了 binlog（極少數場景）</li>
</ul>
<h2 id="配置-step-by-stepgh-ost">配置 step-by-step（gh-ost）</h2>
<p>實務 production 多用 gh-ost（GitHub / Slack / Booking.com 等）。pt-osc 用於有 FK 或舊系統。</p>
<h3 id="gh-ost-一個-alter-命令">gh-ost 一個 ALTER 命令</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">gh-ost <span class="se">\
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="se"></span>  --host<span class="o">=</span>replica.example.com <span class="se">\ </span>          <span class="c1"># 從 replica 讀 binlog</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  --user<span class="o">=</span>ghost <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  --password<span class="o">=</span>... <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>  --database<span class="o">=</span>production <span class="se">\
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="se"></span>  --table<span class="o">=</span>orders <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  --alter<span class="o">=</span><span class="s1">&#39;ADD COLUMN status VARCHAR(20) DEFAULT NULL, ADD INDEX idx_status (status)&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --allow-on-master<span class="o">=</span><span class="nb">false</span> <span class="se">\ </span>             <span class="c1"># 不直接連 primary 讀 binlog</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  --chunk-size<span class="o">=</span><span class="m">1000</span> <span class="se">\ </span>                   <span class="c1"># 每批 copy 1000 row</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">  --max-load<span class="o">=</span><span class="s1">&#39;Threads_running=50&#39;</span> <span class="se">\ </span>     <span class="c1"># primary load 限制</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  --critical-load<span class="o">=</span><span class="s1">&#39;Threads_running=200&#39;</span> <span class="se">\ </span><span class="c1"># 超過直接 abort</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  --max-lag-millis<span class="o">=</span><span class="m">1500</span> <span class="se">\ </span>               <span class="c1"># replica lag 限制</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">  --throttle-additional-flag-file<span class="o">=</span>/tmp/throttle <span class="se">\ </span> <span class="c1"># touch 此檔 throttle</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">  --postpone-cut-over-flag-file<span class="o">=</span>/tmp/postpone <span class="se">\ </span>   <span class="c1"># touch 此檔延後 cut-over</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  --execute                              <span class="c1"># 真的執行（沒這個只 dry-run）</span></span></span></code></pre></div><h3 id="interactive-commandgh-ost-跑起來後">Interactive command（gh-ost 跑起來後）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 連 gh-ost socket（同 directory）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;status&#34;</span> <span class="p">|</span> nc -U /tmp/gh-ost.production.orders.sock
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># 動態調 chunk size</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;chunk-size=500&#34;</span> <span class="p">|</span> nc -U /tmp/gh-ost.production.orders.sock
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 立即觸發 cut-over（不再等）</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;unpostpone&#34;</span> <span class="p">|</span> nc -U /tmp/gh-ost.production.orders.sock
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># Abort 並 drop ghost</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;panic&#34;</span> <span class="p">|</span> nc -U /tmp/gh-ost.production.orders.sock</span></span></code></pre></div><h2 id="配置-step-by-steppt-osc">配置 step-by-step（pt-osc）</h2>
<p>對比 gh-ost 的 binlog reader、pt-osc 命令更短但配置義務同樣多：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">pt-online-schema-change <span class="se">\
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="se"></span>  --host<span class="o">=</span>primary.example.com <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  --user<span class="o">=</span>ghost <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  --password<span class="o">=</span>... <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>  --alter<span class="o">=</span><span class="s1">&#39;ADD COLUMN status VARCHAR(20) DEFAULT NULL, ADD INDEX idx_status (status)&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="se"></span>  <span class="nv">D</span><span class="o">=</span>production,t<span class="o">=</span>orders <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  --chunk-size<span class="o">=</span><span class="m">1000</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --max-load<span class="o">=</span><span class="s1">&#39;Threads_running=50&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  --critical-load<span class="o">=</span><span class="s1">&#39;Threads_running=200&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --max-lag<span class="o">=</span>1.5 <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --check-replication-filters <span class="se">\ </span>          <span class="c1"># 防 binlog filter 漏 trigger</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">  --alter-foreign-keys-method<span class="o">=</span>auto <span class="se">\ </span>     <span class="c1"># auto / rebuild_constraints / drop_swap / none</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">  --execute</span></span></code></pre></div><p><code>--alter-foreign-keys-method</code> 是 pt-osc 對 FK 處理的策略選項、四種選擇對 production 影響非常不同（rebuild 重建 FK / drop_swap 用更快但少了 atomic、none 是不處理）。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-pt-osc-trigger-overhead-不可預期">1. pt-osc trigger overhead 不可預期</h3>
<p><code>--max-load='Threads_running=50'</code> 看起來保護了 server、但 trigger 在 transaction 內、production 的 <em>每個寫入</em> 都加 trigger 開銷。<code>Threads_running</code> 是 <em>當下</em> 數字、看不到 trigger 累積 latency。常見場景：高峰時段下 pt-osc、預期 30% overhead、實際 60%、p99 飆 5x。</p>
<p>修法：</p>
<ul>
<li>高峰時段不跑 pt-osc、排 off-peak window</li>
<li>用 <em>staging environment</em> 跑 production-like load 預估 trigger overhead</li>
<li>對寫吞吐 &gt; 50% capacity 的 server 改用 gh-ost</li>
</ul>
<h3 id="2-gh-ost-binlog-lag-跟-primary-寫入率追不上">2. gh-ost binlog lag 跟 primary 寫入率追不上</h3>
<p>gh-ost 從 replica 讀 binlog、binlog event 進來速度有上限。如果 <em>primary 寫入率超過 gh-ost binlog consume 速度</em>（每秒幾千 transaction 對某些 server 已是 ceiling）、gh-ost 永遠追不上、cut-over 會長時間卡住。</p>
<p>修法：</p>
<ul>
<li>gh-ost 預設用 <em>replica binlog</em>、改用 <code>--allow-on-master</code> 直接從 primary 讀（如果 primary 容量夠）</li>
<li>提高 <code>--chunk-size</code> 加快 copy（同時用 <code>--max-load</code> 防過載）</li>
<li>真的追不上、考慮 <em>暫停部分寫入流量</em>（throttle traffic，而非 throttle tool）</li>
</ul>
<h3 id="3-foreign-key-constraint--兩工具都尷尬">3. Foreign key constraint — 兩工具都尷尬</h3>
<p>原表有 FK 指向（其他 table FK references 這張表）、ghost table 切換時 <em>新 ghost 沒有那些 FK 指向</em>。Cut-over 一瞬間、FK 從指向「原表」變成指向「archive 表」、外部 constraint 失效。</p>
<p>修法（pt-osc）：</p>
<ul>
<li>用 <code>--alter-foreign-keys-method=rebuild_constraints</code>：先 ALTER 外部 table FK 指向 ghost、再 cut-over</li>
<li>或 <code>drop_swap</code>：cut-over 前 drop FK、cut-over 後 recreate（更快但 cut-over 期間 FK 失效）</li>
</ul>
<p>修法（gh-ost）：</p>
<ul>
<li>gh-ost 不支援 — 手動 drop FK / 重 setup FK</li>
<li>或維護 schema 改 FK 結構（FK 改在 application 層 enforce）</li>
</ul>
<h3 id="4-pt-osc-trigger-跟-application-既有-trigger-衝突">4. pt-osc trigger 跟 application 既有 trigger 衝突</h3>
<p>原表上已經有 application 自建 trigger、pt-osc 在原表 <em>再加 3 個 trigger</em>、新舊 trigger 執行順序 MySQL 不保證（多 trigger 同事件按 <em>未定義順序</em>）。Application 行為可能 subtly broken。</p>
<p>修法：</p>
<ul>
<li>跑 pt-osc 前 audit 原表 trigger（<code>SHOW TRIGGERS FROM production LIKE 'orders'</code>）</li>
<li>如果有 application trigger、考慮 <em>暫時 disable 再 ALTER</em> 或改 gh-ost</li>
<li>gh-ost 不在原表加 trigger、不會碰到這個問題</li>
</ul>
<h3 id="5-cut-over-瞬間-deadlock--兩工具都有但表現不同">5. Cut-over 瞬間 deadlock — 兩工具都有但表現不同</h3>
<p>Cut-over 用 <code>RENAME TABLE original TO archive, ghost TO original</code>（atomic operation）。但 cut-over 瞬間需要 <em>metadata lock</em>、跟 <em>進行中的 long-running transaction</em> 衝突會 wait。Long-running transaction 持續、cut-over 永遠 wait、最後 timeout 失敗。</p>
<p>修法（gh-ost）：</p>
<ul>
<li><code>--cut-over-lock-timeout-seconds=3</code>、超時 abort、稍後 retry</li>
<li><code>--postpone-cut-over-flag-file</code>：先把 copy 跑完、等流量空檔再觸發 cut-over</li>
</ul>
<p>修法（pt-osc）：</p>
<ul>
<li><code>--set-vars=&quot;lock_wait_timeout=60&quot;</code>、cut-over 等更久（風險：long transaction 撐住更久 server 更多 lock wait）</li>
<li>或排在 long transaction 已知不會跑的時段（nightly backup 後）</li>
</ul>
<h2 id="容量--時間估算">容量 / 時間估算</h2>
<p>對 100 GB 表、ALTER 加 column + 加 index 為例：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>pt-osc</th>
          <th>gh-ost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>估算總時間</td>
          <td>6-12 小時（依 chunk size + load）</td>
          <td>5-10 小時（同上、可動態調整）</td>
      </tr>
      <tr>
          <td>寫吞吐影響</td>
          <td>-10% ~ -30%（trigger overhead）</td>
          <td>&lt; 5%（binlog 已存在）</td>
      </tr>
      <tr>
          <td>Replica lag</td>
          <td>1-10 秒（trigger 在 replica 跑）</td>
          <td>自動 throttle 在 threshold 內</td>
      </tr>
      <tr>
          <td>Disk 額外需求</td>
          <td>~原表大小 + index（ghost 用）</td>
          <td>同左</td>
      </tr>
      <tr>
          <td>Rollback 成本</td>
          <td>中（清 trigger）</td>
          <td>低（drop ghost）</td>
      </tr>
  </tbody>
</table>
<p>兩工具總時間接近、<em>影響 production 的差異大</em>。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-gtid--replication-topology">跟 GTID / Replication topology</h3>
<p>兩工具都 <em>依賴 replication</em> — pt-osc 透過 trigger 確保 replica 同步、gh-ost 直接從 replica 讀 binlog。Pre-requisite：</p>
<ul>
<li>Binlog <code>ROW</code> format（兩工具都要）</li>
<li>GTID 啟用（gh-ost 更需要、binlog re-pointing 容易）</li>
<li>詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a></li>
</ul>
<h3 id="跟-vitess">跟 Vitess</h3>
<p>Vitess 有自己的 <em>VReplication-based online DDL</em>、不用 gh-ost 或 pt-osc。Vitess online DDL 在 shard 內部用類似 gh-ost 的 binlog stream 機制、但有 Vitess-aware schema management。詳見 <em>Vitess sharding 設計</em> 篇（待寫）。</p>
<h3 id="跟-aurora-mysql">跟 Aurora MySQL</h3>
<p>Aurora MySQL 仍支援 gh-ost / pt-osc、但 <em>Aurora 自己的 fast DDL</em>（部分 ALTER） 比 8.0 Instant DDL 更廣。先檢查 Aurora 文件、能用 native fast DDL 就不用 ghost table tool。詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>。</p>
<h3 id="跟-planetscale">跟 PlanetScale</h3>
<p>PlanetScale（managed Vitess）走 <em>branch-based schema migration</em> — 建 schema branch、跑 schema change、deploy 時 atomic merge。schema change 由 PlanetScale 內建流程承擔。詳見 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">PlanetScale migration playbook</a>。</p>
<h2 id="production-casegh-ost-operation-workflow">Production case：gh-ost operation workflow</h2>
<p>Online schema change 的 production 責任是把大表 DDL 拆成可暫停、可節流、可切換的資料搬移流程。gh-ost 作為 GitHub 開源工具，把 schema change 轉成 ghost table copy、binlog tailing 與 controlled cutover；這讓 operator 可以在 replica lag、application load 或部署窗口變化時調整速度。</p>
<p>這個案例要回收到三個操作判準。第一，throttle 指標要接 production SLO，例如 replica lag、thread running、application latency 或錯誤率，而非只看 copy rows/sec。第二，pause / resume 是變更治理能力，代表 schema change 可以配合 incident response、deploy freeze 與商業尖峰窗口。第三，cutover 要設 rollback window 與 owner，因為 rename table 的瞬間仍是高風險控制點。</p>
<p>gh-ost workflow 的 sibling 路由是 <a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">PostgreSQL Online Schema Change</a>。PostgreSQL 常靠 fast ALTER、MVCC 與 extension 工具解決同類需求；MySQL 的 ghost table tool 更常成為標準路徑，主因是大表 DDL、metadata lock 與 replication event 的組合壓力不同。</p>
<h2 id="何時用哪一個">何時用哪一個</h2>
<table>
  <thead>
      <tr>
          <th>情境</th>
          <th>選擇</th>
          <th>原因</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>標準 production write &lt; 50% capacity</td>
          <td>gh-ost（預設）</td>
          <td>寫入 overhead 0、控制更細</td>
      </tr>
      <tr>
          <td>高寫吞吐 (&gt; 80% capacity)</td>
          <td>gh-ost（必須）</td>
          <td>pt-osc trigger overhead 直接 OOM</td>
      </tr>
      <tr>
          <td>有 FK constraint 需要保留</td>
          <td>pt-osc</td>
          <td>gh-ost 不處理 FK</td>
      </tr>
      <tr>
          <td>有 application-side trigger 在原表</td>
          <td>gh-ost</td>
          <td>pt-osc trigger 跟既有 trigger 不可預期</td>
      </tr>
      <tr>
          <td>需要 pause / resume 能力</td>
          <td>gh-ost</td>
          <td>pt-osc 不支援</td>
      </tr>
      <tr>
          <td>已用 Percona Toolkit 整套（pt-table-checksum / pt-archiver）</td>
          <td>pt-osc</td>
          <td>工具鏈一致</td>
      </tr>
      <tr>
          <td>已用 Vitess</td>
          <td>Vitess online DDL</td>
          <td>維持 Vitess schema workflow</td>
      </tr>
      <tr>
          <td>已用 PlanetScale</td>
          <td>branch-based</td>
          <td>維持 PlanetScale schema workflow</td>
      </tr>
      <tr>
          <td>已用 Aurora MySQL + native fast DDL OK</td>
          <td>不用 ghost table</td>
          <td>直接 ALTER</td>
      </tr>
  </tbody>
</table>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（binlog ROW format + GTID 是 pre-requisite）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/online-schema-change/" data-link-title="PostgreSQL Online Schema Change：先用 ALTER 內建特性、不能解才 pg_repack / pg-osc" data-link-desc="PostgreSQL ALTER TABLE 對多數變更已是 *fast catalog-only*（add column nullable / drop column / 改 default），不必走 ghost table tool。本文走 PG 內建 fast DDL 行為、何時必須走 pg_repack / pg-osc、兩工具機制對比（trigger-based vs WAL-shipping）、配置 step-by-step、5 production 踩雷（lock 升級 / VACUUM FULL 誤用 / pg_repack version mismatch / concurrent index 失敗清理 / generated stored column 不能 online）、跟 MySQL gh-ost / pt-osc sibling 對比">PostgreSQL Online Schema Change</a>（PG sibling、為什麼 PG 比 MySQL 少用 ghost table — fast ALTER 覆蓋多數變更）</li>
<li><a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>（managed MySQL fast DDL）</li>
<li><a href="https://planetscale.com/">PlanetScale</a>（branch-based 不用 ghost table）</li>
<li><a href="/blog/backend/01-database/database-migration-playbook/" data-link-title="1.6 資料庫轉換實作：雙寫、回填、切流與回滾" data-link-desc="同 DB 內 schema 演進與資料變更的可分段驗證流程、跟 1.12 cross-DB migration 分工">1.6 Database Migration Playbook</a>（schema migration 治理）</li>
<li><a href="/blog/backend/knowledge-cards/expand-contract/" data-link-title="Expand / Contract" data-link-desc="說明先擴充相容面、再收斂舊路徑的遷移做法">Expand / Contract 卡片</a>（schema migration 設計原則）</li>
<li>官方：<a href="https://github.com/github/gh-ost">gh-ost</a> / <a href="https://docs.percona.com/percona-toolkit/pt-online-schema-change.html">pt-online-schema-change</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/proxysql-config/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/proxysql-config/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>ProxySQL 配置&lt;/em> — connection pool + query routing 的 4 段 lifecycle 跟 rule chain 設計。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="proxysql-lifecycle每個-query-走-4-段">ProxySQL Lifecycle：每個 query 走 4 段&lt;/h2>
&lt;p>從 application 連 ProxySQL 到拿到 response、每個 query 都走完整 4 段：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">1. Connection 接入 → application connect 到 ProxySQL（不是 MySQL）
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">2. Query parse + rule match → ProxySQL 解析 query、match query rule chain
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">3. Backend route → 決定走哪個 hostgroup（primary / replica）+ 哪個 server
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">4. Response 返回 → 將 result set 回 application、connection 可被 reuse&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>每段都有獨立配置 + failure mode + 觀測 metric。ProxySQL 不是 &lt;em>簡單的 connection pool&lt;/em>、是 &lt;em>query-aware proxy&lt;/em> — 看得到 SQL 內容才能做 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/read-write-split/" data-link-title="Read-Write Split" data-link-desc="說明讀寫流量如何分流到 primary 與 replica，以及它引入的一致性責任">read/write split&lt;/a>、replica lag-aware routing、query mirroring。&lt;/p>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &amp;#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PostgreSQL pgBouncer&lt;/a> 比、pgBouncer 是 &lt;em>transaction-level pool&lt;/em>（只看連線、不看 SQL）、ProxySQL 是 &lt;em>query-level proxy&lt;/em>（看 SQL、做 routing decision）。能力不同、target use case 不同。&lt;/p>
&lt;h2 id="stage-1connection-接入--hostgroup--server--user-三層-schema">Stage 1：Connection 接入 — Hostgroup / Server / User 三層 schema&lt;/h2>
&lt;p>ProxySQL 不直接 expose backend MySQL、用 &lt;em>hostgroup&lt;/em> 作為 routing 抽象。Application 不知道有幾個 backend、只知道 ProxySQL。&lt;/p>
&lt;p>&lt;strong>核心 table（在 &lt;code>main&lt;/code> database）&lt;/strong>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Table&lt;/th>
 &lt;th>角色&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>mysql_servers&lt;/code>&lt;/td>
 &lt;td>列每個 backend MySQL server、屬於哪個 hostgroup&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>mysql_replication_hostgroups&lt;/code>&lt;/td>
 &lt;td>定義 writer hostgroup ↔ reader hostgroup 配對、自動偵測 primary 切換&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>mysql_users&lt;/code>&lt;/td>
 &lt;td>列允許連 ProxySQL 的 application user、預設 hostgroup&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>mysql_query_rules&lt;/code>&lt;/td>
 &lt;td>Query rule chain、決定哪個 query 走哪個 hostgroup&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>典型部署&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>ProxySQL 配置</em> — connection pool + query routing 的 4 段 lifecycle 跟 rule chain 設計。</p></blockquote>
<hr>
<h2 id="proxysql-lifecycle每個-query-走-4-段">ProxySQL Lifecycle：每個 query 走 4 段</h2>
<p>從 application 連 ProxySQL 到拿到 response、每個 query 都走完整 4 段：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. Connection 接入        →  application connect 到 ProxySQL（不是 MySQL）
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. Query parse + rule match  → ProxySQL 解析 query、match query rule chain
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Backend route          →  決定走哪個 hostgroup（primary / replica）+ 哪個 server
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. Response 返回          →  將 result set 回 application、connection 可被 reuse</span></span></code></pre></div><p>每段都有獨立配置 + failure mode + 觀測 metric。ProxySQL 不是 <em>簡單的 connection pool</em>、是 <em>query-aware proxy</em> — 看得到 SQL 內容才能做 <a href="/blog/backend/knowledge-cards/read-write-split/" data-link-title="Read-Write Split" data-link-desc="說明讀寫流量如何分流到 primary 與 replica，以及它引入的一致性責任">read/write split</a>、replica lag-aware routing、query mirroring。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PostgreSQL pgBouncer</a> 比、pgBouncer 是 <em>transaction-level pool</em>（只看連線、不看 SQL）、ProxySQL 是 <em>query-level proxy</em>（看 SQL、做 routing decision）。能力不同、target use case 不同。</p>
<h2 id="stage-1connection-接入--hostgroup--server--user-三層-schema">Stage 1：Connection 接入 — Hostgroup / Server / User 三層 schema</h2>
<p>ProxySQL 不直接 expose backend MySQL、用 <em>hostgroup</em> 作為 routing 抽象。Application 不知道有幾個 backend、只知道 ProxySQL。</p>
<p><strong>核心 table（在 <code>main</code> database）</strong>：</p>
<table>
  <thead>
      <tr>
          <th>Table</th>
          <th>角色</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>mysql_servers</code></td>
          <td>列每個 backend MySQL server、屬於哪個 hostgroup</td>
      </tr>
      <tr>
          <td><code>mysql_replication_hostgroups</code></td>
          <td>定義 writer hostgroup ↔ reader hostgroup 配對、自動偵測 primary 切換</td>
      </tr>
      <tr>
          <td><code>mysql_users</code></td>
          <td>列允許連 ProxySQL 的 application user、預設 hostgroup</td>
      </tr>
      <tr>
          <td><code>mysql_query_rules</code></td>
          <td>Query rule chain、決定哪個 query 走哪個 hostgroup</td>
      </tr>
  </tbody>
</table>
<p><strong>典型部署</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 進 ProxySQL admin (6032 port)
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="n">mysql</span><span class="w"> </span><span class="o">-</span><span class="n">uadmin</span><span class="w"> </span><span class="o">-</span><span class="n">padmin</span><span class="w"> </span><span class="o">-</span><span class="n">h127</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">1</span><span class="w"> </span><span class="o">-</span><span class="n">P6032</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w"></span><span class="c1">-- 設 2 個 hostgroup：10=writer、20=reader
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">mysql_servers</span><span class="p">(</span><span class="n">hostgroup_id</span><span class="p">,</span><span class="w"> </span><span class="n">hostname</span><span class="p">,</span><span class="w"> </span><span class="n">port</span><span class="p">,</span><span class="w"> </span><span class="n">weight</span><span class="p">,</span><span class="w"> </span><span class="n">max_connections</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">  </span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;primary.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">3306</span><span class="p">,</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="mi">200</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">  </span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;replica1.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">3306</span><span class="p">,</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">  </span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;replica2.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">3306</span><span class="p">,</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="c1">-- 自動偵測 primary（用 read_only flag）
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">mysql_replication_hostgroups</span><span class="p">(</span><span class="n">writer_hostgroup</span><span class="p">,</span><span class="w"> </span><span class="n">reader_hostgroup</span><span class="p">,</span><span class="w"> </span><span class="k">comment</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="w"> </span><span class="mi">20</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;production cluster&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="c1">-- 設 application user、預設走 reader（保守）
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">mysql_users</span><span class="p">(</span><span class="n">username</span><span class="p">,</span><span class="w"> </span><span class="n">password</span><span class="p">,</span><span class="w"> </span><span class="n">default_hostgroup</span><span class="p">,</span><span class="w"> </span><span class="n">max_connections</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;app&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;app_password&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">20</span><span class="p">,</span><span class="w"> </span><span class="mi">1000</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w"></span><span class="c1">-- 套用設定到 runtime
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"></span><span class="k">LOAD</span><span class="w"> </span><span class="n">MYSQL</span><span class="w"> </span><span class="n">SERVERS</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">RUNTIME</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w"></span><span class="k">LOAD</span><span class="w"> </span><span class="n">MYSQL</span><span class="w"> </span><span class="n">USERS</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">RUNTIME</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w"></span><span class="c1">-- 持久化到 disk（重啟保留）
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="c1"></span><span class="n">SAVE</span><span class="w"> </span><span class="n">MYSQL</span><span class="w"> </span><span class="n">SERVERS</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">DISK</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w"></span><span class="n">SAVE</span><span class="w"> </span><span class="n">MYSQL</span><span class="w"> </span><span class="n">USERS</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">DISK</span><span class="p">;</span></span></span></code></pre></div><p>注意 ProxySQL 的 <em>三層 state</em>：<code>disk</code>（持久化）→ <code>memory</code>（編輯區）→ <code>runtime</code>（實際運作）。每次改完要 <code>LOAD ... TO RUNTIME</code> 才生效、<code>SAVE ... TO DISK</code> 才能 reboot 保留。沒 <code>SAVE</code> 重啟後 config 消失是新手最常踩的雷。</p>
<h2 id="stage-2query-parse--rule-match--query-rule-engine">Stage 2：Query Parse + Rule Match — query rule engine</h2>
<p>ProxySQL 不只 forward connection、看 <em>SQL 內容</em> 決定怎麼 route。Query rule 是 <em>ordered chain</em>、match 第一個符合的 rule。</p>
<p><strong>Query rule 核心欄位</strong>：</p>
<table>
  <thead>
      <tr>
          <th>欄位</th>
          <th>意義</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>rule_id</code></td>
          <td>排序（越小越先 match）</td>
      </tr>
      <tr>
          <td><code>match_pattern</code></td>
          <td>regex 比對 SQL（支援 <code>^SELECT</code> / <code>FOR UPDATE</code> 等）</td>
      </tr>
      <tr>
          <td><code>destination_hostgroup</code></td>
          <td>match 後送哪個 hostgroup</td>
      </tr>
      <tr>
          <td><code>apply</code></td>
          <td>match 後是否停 chain（1=stop、0=繼續看後面 rule）</td>
      </tr>
      <tr>
          <td><code>cache_ttl</code></td>
          <td>result cache TTL（毫秒）— ProxySQL 內建 query cache</td>
      </tr>
      <tr>
          <td><code>mirror_hostgroup</code></td>
          <td>query 鏡像送到第二個 hostgroup（不等 response、用於 shadow test）</td>
      </tr>
  </tbody>
</table>
<p><strong>典型讀寫分離 rule</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- Rule 100: SELECT ... FOR UPDATE 必須走 primary
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">mysql_query_rules</span><span class="p">(</span><span class="n">rule_id</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">match_pattern</span><span class="p">,</span><span class="w"> </span><span class="n">destination_hostgroup</span><span class="p">,</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;^SELECT.*FOR UPDATE$&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w"></span><span class="c1">-- Rule 200: 一般 SELECT 走 replica（reader）
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">mysql_query_rules</span><span class="p">(</span><span class="n">rule_id</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">match_pattern</span><span class="p">,</span><span class="w"> </span><span class="n">destination_hostgroup</span><span class="p">,</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">200</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;^SELECT&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">20</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="c1">-- Rule 300: BEGIN / START TRANSACTION 走 primary
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">mysql_query_rules</span><span class="p">(</span><span class="n">rule_id</span><span class="p">,</span><span class="w"> </span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">match_pattern</span><span class="p">,</span><span class="w"> </span><span class="n">destination_hostgroup</span><span class="p">,</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">300</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;^(BEGIN|START TRANSACTION)&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w"></span><span class="c1">-- 其他（INSERT / UPDATE / DELETE）預設走 default_hostgroup（user 設的）
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1">-- application user default 設 10 (writer)、所以寫入自動走 primary
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w"></span><span class="k">LOAD</span><span class="w"> </span><span class="n">MYSQL</span><span class="w"> </span><span class="n">QUERY</span><span class="w"> </span><span class="n">RULES</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">RUNTIME</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w"></span><span class="n">SAVE</span><span class="w"> </span><span class="n">MYSQL</span><span class="w"> </span><span class="n">QUERY</span><span class="w"> </span><span class="n">RULES</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">DISK</span><span class="p">;</span></span></span></code></pre></div><p><strong>Rule 順序很重要</strong>：<code>rule_id</code> 100 先 match、200 再 match、依此類推。Rule 200 比 100 寬鬆（任何 SELECT）、所以 <code>FOR UPDATE</code> 必須先 match rule 100 才不會誤送 replica。</p>
<h2 id="stage-3backend-route--replica-lag-aware--circuit-breaker">Stage 3：Backend Route — replica lag-aware + circuit breaker</h2>
<p>Rule match 後 ProxySQL 從 hostgroup 內挑一個 server。Backend selection 不是 pure round-robin、考慮：</p>
<ul>
<li><em>Weight</em>：每個 server <code>weight</code> 比例分配（典型用於 replica capacity 不同）</li>
<li><em>Replica lag</em>：若 hostgroup 設 <code>max_replication_lag</code>、lag 超過 threshold 的 replica 自動暫時退出</li>
<li><em>Connection count</em>：避免某個 server connection 滿</li>
<li><em>Server status</em>：<code>mysql_servers.status</code> (ONLINE / SHUNNED / OFFLINE_SOFT / OFFLINE_HARD) 決定是否可用</li>
</ul>
<p><strong>Replica lag-aware routing 配置</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 給整個 reader hostgroup 設 lag threshold
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">mysql_servers</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SET</span><span class="w"> </span><span class="n">max_replication_lag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">5</span><span class="w">  </span><span class="c1">-- 秒
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">hostgroup_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">20</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">LOAD</span><span class="w"> </span><span class="n">MYSQL</span><span class="w"> </span><span class="n">SERVERS</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">RUNTIME</span><span class="p">;</span></span></span></code></pre></div><p>ProxySQL 內部用 <em>monitor module</em> 定期跑 <code>SHOW SLAVE STATUS</code>、lag 超過 5 秒 → 該 replica 暫時退出 reader hostgroup。讀 query 自動避開 lagging replica。</p>
<p><strong>Circuit breaker（自動 shun）</strong>：server 連續失敗 → ProxySQL 自動 <code>SHUNNED</code>、避免持續打 broken server。但 <em>application 層仍要處理 retry</em>、ProxySQL 不保證 query 100% 成功。</p>
<h2 id="stage-4response-返回--connection-multiplexing">Stage 4：Response 返回 — connection multiplexing</h2>
<p>ProxySQL 對 application connection 跟 backend connection 是 <em>N:M 多工</em>：</p>
<ul>
<li>Application connection 跟 ProxySQL 1:1</li>
<li>ProxySQL 跟 backend MySQL connection 共用 pool（multiplexing）</li>
</ul>
<p><strong>Multiplexing 條件</strong>：</p>
<ul>
<li>Transaction 內：connection 綁定特定 backend（保 transaction atomicity）</li>
<li>跨 transaction：connection 可以換 backend</li>
<li><code>SET</code> statement 改 session variable：connection 黏死 backend（防 session state leak）</li>
<li>User variable（<code>@var</code>）：connection 黏死 backend</li>
</ul>
<p><strong>結果</strong>：application 看到的是「自己有 1000 個 connection」、ProxySQL 後端可能只有 100 connection 到 MySQL。對 connection-bound MySQL（max_connections 限制）是關鍵 cost saving。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-query-rule-順序錯亂--for-update-被-select-route-到-replica">1. Query rule 順序錯亂 — <code>FOR UPDATE</code> 被 SELECT route 到 replica</h3>
<p>Rule 200（<code>^SELECT</code>）寫在 rule 100（<code>^SELECT.*FOR UPDATE$</code>）之前、ProxySQL match 第一個 rule（rule 200）就停、<code>SELECT ... FOR UPDATE</code> 被送 replica、replica 沒 lock、application 假設有 lock 跑 race condition。</p>
<p>修法：</p>
<ul>
<li><code>rule_id</code> 排序：精確 rule（多條件 regex）放小、寬鬆 rule 放大</li>
<li>用 <code>apply=1</code> 強制停 chain、不要讓 query 繼續往下 match</li>
<li>跑 ProxySQL <code>SHOW PROCESSLIST</code> + audit log 確認 routing 正確</li>
</ul>
<h3 id="2-connection-漂移--multiplexing-把-session-variable-弄丟">2. Connection 漂移 — Multiplexing 把 session variable 弄丟</h3>
<p>Application 跑 <code>SET sql_mode=...</code>、ProxySQL 把這 connection 暫時黏死 backend 1。下個 query ProxySQL forget、把 connection unstick、實際 forward 到 backend 2（沒 <code>SET sql_mode</code>）、SQL 解析行為不同、application bug。</p>
<p>修法：</p>
<ul>
<li>用 <code>mysql-multiplexing=false</code> 全 disable（最簡單但浪費 connection pool 效率）</li>
<li>或在 application init 連線後跑的 <code>SET</code> 全列在 <code>mysql_users.connect_init</code>（每個 connection ProxySQL 自動跑、不會漂移）</li>
<li>避免 application 中途改 session variable、改成全部走 ProxySQL connect_init</li>
</ul>
<h3 id="3-write-不小心-route-到-replica--default_hostgroup-設錯">3. Write 不小心 route 到 replica — <code>default_hostgroup</code> 設錯</h3>
<p>Application user <code>default_hostgroup</code> 設 20 (reader)、INSERT / UPDATE / DELETE 沒 match 到任何 rule（沒寫 catch-all write rule）、走 default → 送 replica → replica 是 read-only → error。或更糟：replica 不是 read-only mode、寫入 <em>寫到 replica 上</em>、replication 反向不同步、data corruption。</p>
<p>修法：</p>
<ul>
<li>Application user <code>default_hostgroup</code> 設 10 (writer) — 寫入預設走 primary</li>
<li>Replica MySQL 一定要 <code>read_only=1</code>（防 stale write 寫到 replica）</li>
<li>監控 <code>mysql_query_rules</code> match 率、寫入 query 應該大部分透過 default_hostgroup 路由、不是個別 rule</li>
</ul>
<h3 id="4-runtime--disk-schema-drift--改了-runtime-沒-save重啟-config-消失">4. Runtime / disk schema drift — 改了 runtime 沒 save、重啟 config 消失</h3>
<p><code>LOAD ... TO RUNTIME</code> 跟 <code>SAVE ... TO DISK</code> 是兩個獨立操作。On-call 在事故中改 ProxySQL 配置（add server、調 query rule）、<code>LOAD</code> 套到 runtime 但忘記 <code>SAVE</code>、隔天 ProxySQL 重啟（OS update / crash）、config 回到 disk 版本、半夜 alert。</p>
<p>修法：</p>
<ul>
<li>每次 <code>LOAD ... TO RUNTIME</code> 後立刻 <code>SAVE ... TO DISK</code>（變成 habit）</li>
<li>用 IaC（Terraform / Ansible）管 ProxySQL config、不要手動改 admin</li>
<li>監控：對比 <code>runtime_mysql_servers</code> 跟 <code>mysql_servers</code>（disk）、有 diff 即告警</li>
</ul>
<h3 id="5-mirror-traffic-副作用--insert-鏡像到-staging-寫了兩次">5. Mirror traffic 副作用 — INSERT 鏡像到 staging 寫了兩次</h3>
<p><code>mirror_hostgroup</code> 把 query 鏡像送到第二個 hostgroup（不等 response、用於 shadow test 新 schema）。但 <em>鏡像是真實執行</em>、不是 dry-run。鏡像 INSERT 到 staging hostgroup → staging 真的多了 row。如果 staging hostgroup 接到 production 表（誤接）、production 寫入 doubled。</p>
<p>修法：</p>
<ul>
<li>Mirror 只用於 <em>獨立 staging cluster</em>、不混用 production schema</li>
<li>Mirror 設定要 review（規則 <code>match_pattern</code> 跟 <code>mirror_hostgroup</code> 配對）</li>
<li>開 mirror 前在 staging 跑 dry-run、確認 schema 跟 production isolated</li>
</ul>
<h2 id="容量規劃要點">容量規劃要點</h2>
<p>對 100 application instance × 50 connection / instance = 5000 application connection 場景：</p>
<table>
  <thead>
      <tr>
          <th>配置</th>
          <th>ProxySQL 設定</th>
          <th>MySQL backend 配置</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application → ProxySQL</td>
          <td><code>mysql-max_connections=10000</code></td>
          <td>不影響</td>
      </tr>
      <tr>
          <td>ProxySQL → MySQL primary</td>
          <td><code>max_connections=200</code>（per server）</td>
          <td>MySQL <code>max_connections=300</code>（多 100 buffer for admin）</td>
      </tr>
      <tr>
          <td>ProxySQL → MySQL replica</td>
          <td><code>max_connections=200</code>（per server）</td>
          <td>同上</td>
      </tr>
      <tr>
          <td>ProxySQL 數量（HA）</td>
          <td>至少 2 instance（HAProxy / VIP）</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memory per ProxySQL</td>
          <td>2-4 GB（query rule cache + connection pool）</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>ProxySQL 本身需要 HA：放兩個 instance 後面接 VIP（keepalived）或 HAProxy。Application 連 VIP / HAProxy、不直接連 ProxySQL hostname（單點失效）。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>ProxySQL 透過 <em>monitor module</em> 自動偵測 primary（檢查 <code>read_only</code> flag）+ replica lag（檢查 <code>Seconds_Behind_Master</code>）。這個 monitor 依賴 MySQL replication 已配好（GTID + binlog ROW format）。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>。</p>
<h3 id="跟-orchestrator-ha">跟 Orchestrator HA</h3>
<p>Orchestrator 自動 failover 後新 primary 的 <code>read_only</code> flag 變 0、舊 primary 變 1。ProxySQL monitor 偵測到、自動把 hostgroup 10（writer）的 server 切換、application 不必改 connection string。</p>
<p>詳見 <em>Orchestrator failover 設計</em> 篇（待寫）。</p>
<h3 id="跟-osc-toolgh-ost--pt-osc">跟 OSC tool（gh-ost / pt-osc）</h3>
<p>ProxySQL 可以 <em>暫時 throttle</em> application 對某張表的寫入（query rule <code>delay</code> 欄位）、配合 OSC tool cut-over 時段降低 metadata lock 衝突。</p>
<p>詳見 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>。</p>
<h3 id="跟-aurora-mysql--rds-proxy">跟 Aurora MySQL / RDS Proxy</h3>
<p>Aurora MySQL 推 <em>RDS Proxy</em>（AWS managed proxy）取代 ProxySQL — 跟 IAM 整合、failover &lt; 30 秒。但 RDS Proxy <em>沒有 query routing rule engine</em>（只做 connection pool）、不能讀寫分離。Aurora user 仍可能用 ProxySQL 在前面、再用 RDS Proxy 作 backend connection pool。</p>
<p>詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>。</p>
<h3 id="跟-postgresql-pgbouncer-對比">跟 PostgreSQL pgBouncer 對比</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>ProxySQL（MySQL）</th>
          <th>pgBouncer（PostgreSQL）</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>抽象層</td>
          <td>Query-level proxy</td>
          <td>Transaction-level pool</td>
      </tr>
      <tr>
          <td>Query routing</td>
          <td>內建（rule engine）</td>
          <td>無（不看 SQL）</td>
      </tr>
      <tr>
          <td>Connection pool</td>
          <td>內建</td>
          <td>核心功能</td>
      </tr>
      <tr>
          <td>Read/write split</td>
          <td>內建（自動 + rule）</td>
          <td>要 application 層或 HAProxy 配</td>
      </tr>
      <tr>
          <td>Replica lag-aware</td>
          <td>內建</td>
          <td>無</td>
      </tr>
      <tr>
          <td>Query cache</td>
          <td>內建</td>
          <td>無</td>
      </tr>
  </tbody>
</table>
<p>ProxySQL 是 <em>query 層中介</em>、pgBouncer 是 <em>connection 層中介</em>。詳見 <a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">pgBouncer 配置</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（read replica routing 前提）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a>（OSC + ProxySQL throttle 整合）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pgbouncer-config/" data-link-title="PostgreSQL pgBouncer 配置 &#43; 連線池治理" data-link-desc="pgBouncer transaction pooling 配置、跟 application connection pool 的分層、production 故障演練（pool exhaustion / stale connection / DNS failover）跟容量規劃">PostgreSQL pgBouncer</a>（PG sibling、不同抽象層）</li>
<li><a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>（RDS Proxy + ProxySQL 取捨）</li>
<li><a href="/blog/backend/knowledge-cards/connection-pool/" data-link-title="Connection Pool" data-link-desc="說明連線池如何限制下游資源並影響服務容量">Connection Pool 卡片</a></li>
<li>官方：<a href="https://proxysql.com/documentation/">ProxySQL Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Orchestrator Failover：HA 工具自己怎麼 HA？raft cluster + GTID-based promotion 的兩段 paradox</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/orchestrator-failover/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/orchestrator-failover/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>Orchestrator failover&lt;/em> — 自動 HA 的工具雙層架構跟 5 段 decision tree。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;blockquote>
&lt;p>用詞註：Orchestrator 工具命名與 MySQL 5.7- SQL 命令（&lt;code>SHOW SLAVE STATUS&lt;/code> / &lt;code>CHANGE MASTER TO&lt;/code> / &lt;code>STOP SLAVE&lt;/code> 等）沿用 &lt;em>master / slave&lt;/em>。MySQL 8.0+ 改採 &lt;em>primary / replica&lt;/em>、但 SQL syntax 仍保留別名。本文出現 master / slave 處對應 8.0 primary / replica 概念。&lt;/p>&lt;/blockquote>
&lt;p>讀者第一個會問的問題：「Orchestrator 自己會壞嗎？壞了誰 failover Orchestrator？」這個 paradox 是 &lt;em>任何 HA 工具&lt;/em> 的核心議題、PostgreSQL 的 Patroni 用 DCS（etcd / Consul）解決、MySQL 的 Orchestrator 用 &lt;em>內建 raft cluster&lt;/em> 解決：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">被管的 (Layer 1): primary MySQL → replica MySQL → replica MySQL → ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">管理者 (Layer 2): orchestrator instance × 3 (or 5) — 用 raft 自己選 leader
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">管理者狀態存放 (Layer 3): 每個 orchestrator instance 自己有 MySQL backend (state)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Orchestrator 3 個 instance 構成 &lt;em>raft cluster&lt;/em>、自己選 leader。Leader 才有 &lt;em>寫入 state&lt;/em> + &lt;em>發起 failover&lt;/em> 權限、其他 instance follower 同步 state。Leader 失聯 → raft 重新選 leader（&amp;lt; 10 秒）、新 leader 繼續 manage MySQL topology。&lt;/p>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &amp;#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PostgreSQL Patroni&lt;/a> 不同：Patroni 需要 &lt;em>外部 DCS&lt;/em>（etcd / Consul）作為 source of truth、Patroni 本身 stateless；Orchestrator 內建 raft、不需要外部 DCS、但每個 orchestrator instance 需要 &lt;em>自己的 MySQL backend&lt;/em> 存 state。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>Orchestrator failover</em> — 自動 HA 的工具雙層架構跟 5 段 decision tree。</p></blockquote>
<hr>
<blockquote>
<p>用詞註：Orchestrator 工具命名與 MySQL 5.7- SQL 命令（<code>SHOW SLAVE STATUS</code> / <code>CHANGE MASTER TO</code> / <code>STOP SLAVE</code> 等）沿用 <em>master / slave</em>。MySQL 8.0+ 改採 <em>primary / replica</em>、但 SQL syntax 仍保留別名。本文出現 master / slave 處對應 8.0 primary / replica 概念。</p></blockquote>
<p>讀者第一個會問的問題：「Orchestrator 自己會壞嗎？壞了誰 failover Orchestrator？」這個 paradox 是 <em>任何 HA 工具</em> 的核心議題、PostgreSQL 的 Patroni 用 DCS（etcd / Consul）解決、MySQL 的 Orchestrator 用 <em>內建 raft cluster</em> 解決：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">被管的 (Layer 1):       primary MySQL → replica MySQL → replica MySQL → ...
</span></span><span class="line"><span class="ln">2</span><span class="cl">管理者 (Layer 2):       orchestrator instance × 3 (or 5) — 用 raft 自己選 leader
</span></span><span class="line"><span class="ln">3</span><span class="cl">管理者狀態存放 (Layer 3): 每個 orchestrator instance 自己有 MySQL backend (state)</span></span></code></pre></div><p>Orchestrator 3 個 instance 構成 <em>raft cluster</em>、自己選 leader。Leader 才有 <em>寫入 state</em> + <em>發起 failover</em> 權限、其他 instance follower 同步 state。Leader 失聯 → raft 重新選 leader（&lt; 10 秒）、新 leader 繼續 manage MySQL topology。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PostgreSQL Patroni</a> 不同：Patroni 需要 <em>外部 DCS</em>（etcd / Consul）作為 source of truth、Patroni 本身 stateless；Orchestrator 內建 raft、不需要外部 DCS、但每個 orchestrator instance 需要 <em>自己的 MySQL backend</em> 存 state。</p>
<h2 id="orchestrator-雙層架構管-mysql-的-layer-2">Orchestrator 雙層架構：管 MySQL 的 Layer 2</h2>
<p>Layer 1 是 <em>被管的</em> MySQL cluster — primary + replica 群。Layer 2 是 <em>管理者</em> — orchestrator instance 群。Layer 2 監視 Layer 1、Layer 2 自己用 raft 自管。</p>
<p><strong>Layer 1 對 Orchestrator 的需求</strong>：</p>
<ul>
<li>所有 MySQL server 啟用 <code>binlog</code> + <code>log_slave_updates</code>（讓 Orchestrator 看得到 binlog event）</li>
<li>啟用 GTID（Orchestrator failover decision 依賴 GTID 比較進度、不用算 binlog position）</li>
<li>每個 server 有 <em>orchestrator user</em>（<code>GRANT SUPER, REPLICATION CLIENT, REPLICATION SLAVE, PROCESS ON *.* TO 'orchestrator'@'%'</code>）</li>
</ul>
<p><strong>Layer 2 配置</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># /etc/orchestrator.conf.json (簡化)</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="na">{</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="na">&#34;MySQLOrchestratorHost&#34;: &#34;orchestrator-backend.example.com&#34;,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="na">&#34;MySQLOrchestratorPort&#34;: 3306,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  <span class="na">&#34;MySQLOrchestratorDatabase&#34;: &#34;orchestrator&#34;,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  <span class="c1"># 用 backend MySQL（每個 orchestrator instance 自己一個）+ raft 同步</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="na">&#34;RaftEnabled&#34;: true,</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="na">&#34;RaftDataDir&#34;: &#34;/var/lib/orchestrator&#34;,</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">  <span class="na">&#34;RaftBind&#34;: &#34;10.0.1.10:10008&#34;,</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">  <span class="na">&#34;RaftNodes&#34;: [</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="na">&#34;orchestrator1.example.com:10008&#34;,</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="na">&#34;orchestrator2.example.com:10008&#34;,</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="na">&#34;orchestrator3.example.com:10008&#34;</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">  <span class="na">],</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">
</span></span><span class="line"><span class="ln">17</span><span class="cl">  <span class="c1"># Topology discovery</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">  <span class="na">&#34;DiscoverByShowSlaveHosts&#34;: true,</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">  <span class="na">&#34;InstancePollSeconds&#34;: 5,</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl">  <span class="c1"># Failover detection</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">  <span class="na">&#34;FailureDetectionPeriodBlockMinutes&#34;: 60,</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">  <span class="na">&#34;RecoveryPeriodBlockSeconds&#34;: 3600,</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl">  <span class="c1"># Failover automation</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">  <span class="na">&#34;RecoverMasterClusterFilters&#34;: [&#34;*&#34;],</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">  <span class="na">&#34;RecoverIntermediateMasterClusterFilters&#34;: [&#34;*&#34;],</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl">  <span class="na">&#34;PreFailoverProcesses&#34;: [&#34;/usr/local/bin/orchestrator-fence-master.sh&#34;],</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">  <span class="na">&#34;PostFailoverProcesses&#34;: [&#34;/usr/local/bin/orchestrator-notify-proxysql.sh&#34;]</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="na">}</span></span></span></code></pre></div><h2 id="stage-1topology-discovery--自動發現--manual-seed">Stage 1：Topology Discovery — 自動發現 + manual seed</h2>
<p>Orchestrator 啟動後 <em>seed</em> 一個或多個 MySQL server、自動發現整個 topology：</p>
<ul>
<li>連 seed server → <code>SHOW SLAVE HOSTS</code> → 發現所有 replica</li>
<li>對每個 replica 跑 <code>SHOW MASTER STATUS</code> + <code>SHOW SLAVE STATUS</code> → 建立 <em>父子關係 graph</em></li>
<li>持續 poll（<code>InstancePollSeconds=5</code>）每 5 秒更新 topology state</li>
</ul>
<p><strong>Topology graph 的 node</strong>：</p>
<ul>
<li><em>Master</em>：no slave status、被多個 replica 指</li>
<li><em>Intermediate master</em>：有 slave status 也有下游 replica（chained replication）</li>
<li><em>Co-master</em>：互相 replicate（罕見、active-passive failover 場景）</li>
<li><em>Replica</em>：有 slave status、無下游</li>
</ul>
<p>Topology 可視化：Orchestrator UI（web）顯示 cluster 樹狀圖、操作員可手動 drag-and-drop replica 重新 attach。</p>
<h2 id="stage-2failure-detection--區分真壞跟假壞">Stage 2：Failure Detection — 區分真壞跟假壞</h2>
<p>Orchestrator 不是 <em>單一 ping 失敗就 failover</em>、有 <em>holistic detection</em>：</p>
<table>
  <thead>
      <tr>
          <th>指標</th>
          <th>解讀</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Master <code>connect fail</code></td>
          <td>可能 network blip、不一定真壞</td>
      </tr>
      <tr>
          <td>Master <code>timeout poll</code></td>
          <td>可能 master loaded、不一定真壞</td>
      </tr>
      <tr>
          <td><strong>Replica 全部 <code>IO error</code></strong></td>
          <td>Master 真的對 replica 不可達、強訊號</td>
      </tr>
      <tr>
          <td>Replica 看到 master 還活著</td>
          <td>Master 對 orchestrator 不可達、可能是 <em>orchestrator network</em> 問題、不是 master</td>
      </tr>
      <tr>
          <td>Replica lag 暴增</td>
          <td>Master 可能還活著但 overload、不一定要 failover</td>
      </tr>
  </tbody>
</table>
<p><strong>Detection rule</strong>：Master <em>自己連不上</em> + <em>至少一個 replica 也看 master IO error</em> → 判定 <code>DeadMaster</code>。單一 orchestrator 連不上 master 不觸發 — 防 orchestrator network 隔離造成的 false positive failover。</p>
<h2 id="stage-3failover-decision-tree--選哪個-replica-promote">Stage 3：Failover Decision Tree — 選哪個 replica promote</h2>
<p>判定 <code>DeadMaster</code> 後不是 <em>選最近的 replica</em>、用 decision tree：</p>
<ol>
<li><strong>GTID 最新的 replica</strong>：跟舊 master 同步最完整（用 <code>Executed_Gtid_Set</code> 對比）</li>
<li><strong>同 DC / AZ 的 replica</strong>（如果有 multi-DC 配置）</li>
<li><strong>手動指定的 promotion candidate</strong>（<code>promote_rule=must</code> 或 <code>prefer</code>）</li>
<li><strong>Semi-sync ack 的 replica</strong>（如果 semi-sync 啟用）</li>
</ol>
<p>GTID 最新是基本要求。其他規則是 <em>tie-breaker</em>。</p>
<p><strong>Errant transaction 處理</strong>：選出的 candidate replica 如果有 <em>errant GTID</em>（master 沒有但 replica 有的 transaction）、Orchestrator <em>不會 promote 這個 replica</em>（怕 errant transaction 變成 new master state）。改選次優 candidate。</p>
<h2 id="stage-4promote-action--5-步-atomic理想情況">Stage 4：Promote Action — 5 步 atomic（理想情況）</h2>
<p>選好 candidate 後執行：</p>
<ol>
<li><strong>Fence 舊 master</strong>（pre-failover hook）：把舊 master 對外停掉、防 split-brain</li>
<li><strong>STOP SLAVE on candidate</strong>：candidate 不再從舊 master pull binlog</li>
<li><strong>RESET SLAVE ALL on candidate</strong>：candidate 清掉 slave 配置、變成獨立 master</li>
<li><strong>Re-attach 其他 replica</strong>：用 <code>CHANGE MASTER TO MASTER_HOST=&lt;candidate&gt;, MASTER_AUTO_POSITION=1</code>（GTID auto-position）</li>
<li><strong>Post-failover hook</strong>：通知 ProxySQL / HAProxy / DNS 切流量</li>
</ol>
<p>每步任一失敗、Orchestrator 可能停在中間狀態、需要 <em>人工介入</em>。</p>
<h2 id="stage-5recovery--old-master-怎麼處理">Stage 5：Recovery — Old master 怎麼處理</h2>
<p>Failover 完、舊 master 可能：</p>
<ul>
<li><em>真的死了</em>：物理 server 故障 / region outage → 不必處理、未來修好作為新 replica re-attach</li>
<li><em>Network blip 後復活</em>：舊 master 自己 <em>仍認為自己是 master</em>、再次接受寫入會造成 split-brain</li>
</ul>
<p>修法：</p>
<ul>
<li><em>Fencing</em>（必須）：pre-failover hook 把舊 master 對外 firewall 掉、或 force <code>read_only=1</code>、防舊 master 復活後接受寫入</li>
<li><em>Manual reset</em>：舊 master 復活後人工 confirm 是否變成新 master 的 replica（不要自動、自動容易誤判）</li>
</ul>
<p>Orchestrator UI 在偵測到 errant master 時會標 warning、不會自動處理。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-split-brain--pre-failover-hook-沒-fence-舊-master">1. Split-brain — pre-failover hook 沒 fence 舊 master</h3>
<p>舊 master network blip 後復活、orchestrator 已 promote 新 master、application 部分 instance 連舊 master、部分連新 master、雙寫造成 data divergence。</p>
<p>修法：</p>
<ul>
<li><em>Pre-failover hook 必須 fence</em>（不是可選）：
<ul>
<li>物理 fencing：透過 IPMI 重啟 / 關 server</li>
<li>Network fencing：透過 firewall rule 切斷 server 對外連線</li>
<li>MySQL fencing：<code>SET GLOBAL read_only=1</code> + <code>KILL</code> 所有 active connection</li>
</ul>
</li>
<li>用 <em>VIP / DNS</em> 配合：fence 完才切 VIP / DNS 到新 master、避免 application 連舊 IP</li>
<li>不依賴 application 連線 string 動態變更（DNS TTL 期間仍可能連舊 IP）</li>
</ul>
<h3 id="2-pre-failover-hook-失敗--orchestrator-該停還是該繼續">2. Pre-failover hook 失敗 — Orchestrator 該停還是該繼續</h3>
<p>Pre-failover hook 跑失敗（fence script 因為 SSH 不通、IPMI 沒回應）。Orchestrator 有兩種策略：</p>
<ul>
<li><em>PostponeReplicaRecoveryOnLagMinutes</em>：等 hook 成功才繼續、可能永遠 stuck</li>
<li><em>FailMasterPromotionOnLagMinutes</em>：放棄 promotion、留 cluster degraded（無 master）</li>
</ul>
<p>兩者都不理想。多數 production 選 <em>PostponeReplicaRecoveryOnLagMinutes=10</em>：等 10 分鐘 hook 成功、超時則 alert 人工介入、不繼續 auto-promote（人工 review 才是正確選擇）。</p>
<h3 id="3-anti-flapping-窗口太短--master-抖動-vs-真死">3. Anti-flapping 窗口太短 — Master 抖動 vs 真死</h3>
<p><code>FailureDetectionPeriodBlockMinutes=60</code>：偵測一次 failure 後 60 分鐘內不再 trigger failover（即使再偵測到 failure）。預設 60 分鐘對 <em>第一次 failover 後 master 仍不穩</em> 的場景太長 — 60 分鐘內 master 真的死了第二次、orchestrator 不 failover。預設 60 分鐘對 <em>網路抖動</em> 的場景太短 — 60 分鐘內可能 multiple failover、cluster 一直在 promote。</p>
<p>修法：</p>
<ul>
<li>評估自己 cluster 的 <em>typical recovery time</em>：1-2 小時、設 <code>FailureDetectionPeriodBlockMinutes=120</code></li>
<li>監控 <em>failover 頻率</em>、單週 &gt; 2 次表示底層問題（網路 / hardware）、不是調 anti-flapping window 解決</li>
</ul>
<h3 id="4-gtid-errant-transaction--orchestrator-拒絕-promote-但沒講原因">4. GTID errant transaction — Orchestrator 拒絕 promote 但沒講原因</h3>
<p>Candidate replica 有 <em>errant GTID</em>（從別處 inject 的 transaction）、Orchestrator 拒絕 promote、log 訊息 <code>errant GTID detected</code>、但 <em>沒寫實際是哪個 GTID</em>。On-call 在事故中沒辦法 debug。</p>
<p>修法：</p>
<ul>
<li>平時 <em>監控 errant GTID</em>：定期跑 <code>pt-show-grants</code> + GTID 比對、不要等 failover 才發現</li>
<li>Orchestrator 的 <code>OrchestratorIssuesAGtidPurge</code> 設 true：preview mode 看 errant GTID 的位置</li>
<li>Errant GTID 來源通常是 <em>人為 inject</em>（DBA 直接寫 replica 然後 binlog 出現）、教育 DBA 不要直接連 replica 寫</li>
</ul>
<h3 id="5-vip--proxysql-整合斷層--切流量延遲">5. VIP / ProxySQL 整合斷層 — 切流量延遲</h3>
<p>Post-failover hook 跑完 <em>script 上報</em>「我切完了」、但實際 <em>VIP / DNS / ProxySQL 還沒看到變化</em>。Application 連 stale endpoint 30 秒、寫入失敗。</p>
<p>修法：</p>
<ul>
<li><em>Post-failover hook 不只 trigger 切換、要 wait 切換完成</em>：
<ul>
<li>VIP：等 <code>arping</code> 確認新 IP 已 propagate</li>
<li>ProxySQL：等 <code>mysql_servers</code> runtime table 更新 + 確認 monitor module 看到新 primary</li>
<li>DNS：先把 TTL 降到極短（5 秒）、再切 DNS、等 TTL 過</li>
</ul>
</li>
<li>Orchestrator <code>PostFailoverProcessesFailOnError=true</code>：hook 失敗整個 failover 標記失敗、人工檢查</li>
<li>ProxySQL 用 <code>mysql_replication_hostgroups</code> 自動偵測 read_only flag、可不依賴 hook（推薦）</li>
</ul>
<h2 id="容量規劃要點">容量規劃要點</h2>
<table>
  <thead>
      <tr>
          <th>元件</th>
          <th>配置建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Orchestrator instance 數量</td>
          <td>3（raft cluster 最小、odd number、容忍 1 個故障）</td>
      </tr>
      <tr>
          <td>每個 instance MySQL backend</td>
          <td>1 個獨立 MySQL（不要共用、不要用被管的 cluster）</td>
      </tr>
      <tr>
          <td>Backend MySQL spec</td>
          <td>t3.small 級別、Orchestrator state ~1 GB</td>
      </tr>
      <tr>
          <td>Network latency</td>
          <td>raft 同 region 內、跨 AZ 可接受（&lt; 5ms）、跨 region 不推薦</td>
      </tr>
      <tr>
          <td>InstancePollSeconds</td>
          <td>5 秒（預設）— 越小越敏感、越大越省連線</td>
      </tr>
  </tbody>
</table>
<p>3 instance raft cluster 容忍 1 instance 故障。5 instance 容忍 2 instance 故障但 quorum cost 高、99% 場景 3 個夠用。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>Orchestrator 100% 依賴 GTID + binlog ROW format（<a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>）。沒 GTID 用 binlog position、failover 時 re-pointing 容易出錯、Orchestrator 強烈建議 GTID。</p>
<h3 id="跟-proxysql">跟 ProxySQL</h3>
<p><a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">ProxySQL</a> 用 <code>mysql_replication_hostgroups</code> 自動偵測 <code>read_only</code> flag — orchestrator 切完新 master 後、ProxySQL monitor module 自動看到新 master 的 <code>read_only=0</code>、自動更新 routing、application 不用改 connection string。</p>
<p>這個 <em>無需 post-failover hook 通知 ProxySQL</em> 的整合是 ProxySQL + Orchestrator 組合的最大優勢、比手動 hook 通知 VIP / DNS 可靠。</p>
<h3 id="跟-patronipostgresql-對應">跟 Patroni（PostgreSQL 對應）</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Orchestrator</th>
          <th>Patroni</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DCS</td>
          <td>內建 raft（不需外部）</td>
          <td>外部（etcd / Consul / ZooKeeper）</td>
      </tr>
      <tr>
          <td>State storage</td>
          <td>每 instance 一個 MySQL backend</td>
          <td>DCS 本身</td>
      </tr>
      <tr>
          <td>Topology discovery</td>
          <td>自動 + manual seed</td>
          <td>自動（透過 DCS）</td>
      </tr>
      <tr>
          <td>Fencing</td>
          <td>Pre-failover hook（自實作）</td>
          <td>Watchdog（內建）</td>
      </tr>
      <tr>
          <td>5+ year 生產驗證</td>
          <td>GitHub / Booking.com / Shopify</td>
          <td>Zalando / 多個歐美企業</td>
      </tr>
  </tbody>
</table>
<p>兩者角色相同、設計取捨不同。Patroni 對 DCS 高依賴、Orchestrator 對自己 backend MySQL 高依賴。</p>
<h3 id="跟-rds--aurora-mysql">跟 RDS / Aurora MySQL</h3>
<p>AWS RDS / Aurora 內建 multi-AZ failover、<em>不用 Orchestrator</em>。Aurora failover &lt; 30 秒、RDS failover ~60-120 秒。Aurora 把 replication / failover 整套封進 storage layer、application 看到的是 reader endpoint + writer endpoint。</p>
<p>詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>。</p>
<h3 id="跟-vitess">跟 Vitess</h3>
<p>Vitess shard 內部用 <em>VTOrc</em>（Vitess fork of Orchestrator）— 概念跟 Orchestrator 一致、針對 Vitess topology metadata 適配。</p>
<p>詳見 <em>Vitess sharding 設計</em> 篇（待寫）。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（GTID 是 Orchestrator pre-requisite）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">MySQL ProxySQL 配置</a>（Orchestrator + ProxySQL 自動失效切換組合）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PostgreSQL Patroni HA</a>（PG sibling、不同 HA 機制）</li>
<li><a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>（managed MySQL、Orchestrator 不需要）</li>
<li><a href="/blog/backend/knowledge-cards/quorum/" data-link-title="Quorum" data-link-desc="分散式系統以多數節點同意作為提交或讀取有效性的門檻">quorum 卡片</a> / <a href="/blog/backend/knowledge-cards/failover/" data-link-title="Failover" data-link-desc="說明主要服務或節點失效時如何切換到備援能力">failover 卡片</a></li>
<li>官方：<a href="https://github.com/openark/orchestrator">orchestrator GitHub</a> / <a href="https://github.com/openark/orchestrator/tree/master/docs">orchestrator docs</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/innodb-tuning/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/innodb-tuning/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>InnoDB engine tuning&lt;/em> — 4 個影響最大的 knob 跟對應 production 行為。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="開場常見痛點">開場：常見痛點&lt;/h2>
&lt;p>一個 100 GB MySQL DB、64 GB RAM 的 server、p99 query latency 從 5ms 飆到 50ms。第一直覺是 server overload — 但 CPU &amp;lt; 30%、disk IO 50 IOPS。為什麼慢？&lt;/p>
&lt;p>打開 &lt;code>SHOW VARIABLES LIKE 'innodb_buffer_pool_size'&lt;/code>：&lt;code>134217728&lt;/code>（128 MB）。對 64 GB RAM server、buffer pool 只用了 128 MB、剩 99.9% 的 working set 每次 query 都要從 disk 讀。CPU 閒、disk 沒滿、是因為 &lt;em>MySQL 自己不用 RAM&lt;/em> — 用 InnoDB 預設值跑 100 GB DB 等於 disk-only 模式。&lt;/p>
&lt;p>這個案例展示 InnoDB tuning 的核心：MySQL 預設值是 &lt;em>為 16 GB RAM 設計&lt;/em>、production server RAM 越大、預設值離 optimal 越遠。&lt;/p>
&lt;h2 id="4-個-critical-knob">4 個 critical knob&lt;/h2>
&lt;p>對 90% production case、調這 4 個就解決大部分 InnoDB 性能問題：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Knob&lt;/th>
 &lt;th>預設&lt;/th>
 &lt;th>對 production 建議&lt;/th>
 &lt;th>影響&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>innodb_buffer_pool_size&lt;/code>&lt;/td>
 &lt;td>128 MB&lt;/td>
 &lt;td>系統 RAM 50-75%（dedicated server 75%）&lt;/td>
 &lt;td>讀效能（資料能否在 RAM）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>innodb_log_file_size&lt;/code>&lt;/td>
 &lt;td>48 MB（×2 file）&lt;/td>
 &lt;td>1-4 GB（依寫吞吐、8.0.30+ 改 &lt;code>innodb_redo_log_capacity&lt;/code>）&lt;/td>
 &lt;td>寫效能（flush 頻率）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>innodb_flush_log_at_trx_commit&lt;/code>&lt;/td>
 &lt;td>1 (full ACID)&lt;/td>
 &lt;td>1（金融 / 訂單）/ 2（高吞吐可容 1 秒 loss）&lt;/td>
 &lt;td>寫吞吐 vs durability&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>innodb_io_capacity&lt;/code> + &lt;code>_max&lt;/code>&lt;/td>
 &lt;td>200 / 2000&lt;/td>
 &lt;td>SSD: 2000 / 20000; NVMe: 10000 / 40000&lt;/td>
 &lt;td>flush 速度（適配儲存）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>其他 knob（&lt;code>innodb_thread_concurrency&lt;/code> / &lt;code>innodb_buffer_pool_instances&lt;/code> / &lt;code>innodb_read_io_threads&lt;/code> 等）也有影響、但對多數 case &lt;em>先把這 4 個調對&lt;/em> 比微調其他 20 個重要。&lt;/p>
&lt;h2 id="knob-1buffer-pool--把-working-set-拉進-ram">Knob 1：Buffer pool — 把 working set 拉進 RAM&lt;/h2>
&lt;p>&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/buffer-pool/" data-link-title="Buffer Pool" data-link-desc="說明資料庫如何用記憶體快取磁碟頁，以降低 I/O 並影響查詢效能">InnoDB buffer pool&lt;/a> 是 &lt;em>page cache&lt;/em> — 從 disk 讀過的 16 KB page 快取在 RAM、下次 query 直接 RAM 讀。Buffer pool 越大、cache hit ratio 越高、disk IO 越少。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>InnoDB engine tuning</em> — 4 個影響最大的 knob 跟對應 production 行為。</p></blockquote>
<hr>
<h2 id="開場常見痛點">開場：常見痛點</h2>
<p>一個 100 GB MySQL DB、64 GB RAM 的 server、p99 query latency 從 5ms 飆到 50ms。第一直覺是 server overload — 但 CPU &lt; 30%、disk IO 50 IOPS。為什麼慢？</p>
<p>打開 <code>SHOW VARIABLES LIKE 'innodb_buffer_pool_size'</code>：<code>134217728</code>（128 MB）。對 64 GB RAM server、buffer pool 只用了 128 MB、剩 99.9% 的 working set 每次 query 都要從 disk 讀。CPU 閒、disk 沒滿、是因為 <em>MySQL 自己不用 RAM</em> — 用 InnoDB 預設值跑 100 GB DB 等於 disk-only 模式。</p>
<p>這個案例展示 InnoDB tuning 的核心：MySQL 預設值是 <em>為 16 GB RAM 設計</em>、production server RAM 越大、預設值離 optimal 越遠。</p>
<h2 id="4-個-critical-knob">4 個 critical knob</h2>
<p>對 90% production case、調這 4 個就解決大部分 InnoDB 性能問題：</p>
<table>
  <thead>
      <tr>
          <th>Knob</th>
          <th>預設</th>
          <th>對 production 建議</th>
          <th>影響</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>innodb_buffer_pool_size</code></td>
          <td>128 MB</td>
          <td>系統 RAM 50-75%（dedicated server 75%）</td>
          <td>讀效能（資料能否在 RAM）</td>
      </tr>
      <tr>
          <td><code>innodb_log_file_size</code></td>
          <td>48 MB（×2 file）</td>
          <td>1-4 GB（依寫吞吐、8.0.30+ 改 <code>innodb_redo_log_capacity</code>）</td>
          <td>寫效能（flush 頻率）</td>
      </tr>
      <tr>
          <td><code>innodb_flush_log_at_trx_commit</code></td>
          <td>1 (full ACID)</td>
          <td>1（金融 / 訂單）/ 2（高吞吐可容 1 秒 loss）</td>
          <td>寫吞吐 vs durability</td>
      </tr>
      <tr>
          <td><code>innodb_io_capacity</code> + <code>_max</code></td>
          <td>200 / 2000</td>
          <td>SSD: 2000 / 20000; NVMe: 10000 / 40000</td>
          <td>flush 速度（適配儲存）</td>
      </tr>
  </tbody>
</table>
<p>其他 knob（<code>innodb_thread_concurrency</code> / <code>innodb_buffer_pool_instances</code> / <code>innodb_read_io_threads</code> 等）也有影響、但對多數 case <em>先把這 4 個調對</em> 比微調其他 20 個重要。</p>
<h2 id="knob-1buffer-pool--把-working-set-拉進-ram">Knob 1：Buffer pool — 把 working set 拉進 RAM</h2>
<p><a href="/blog/backend/knowledge-cards/buffer-pool/" data-link-title="Buffer Pool" data-link-desc="說明資料庫如何用記憶體快取磁碟頁，以降低 I/O 並影響查詢效能">InnoDB buffer pool</a> 是 <em>page cache</em> — 從 disk 讀過的 16 KB page 快取在 RAM、下次 query 直接 RAM 讀。Buffer pool 越大、cache hit ratio 越高、disk IO 越少。</p>
<p><strong>Sizing</strong>：</p>
<ul>
<li><em>Dedicated MySQL server</em>：RAM 70-80%（剩 20-30% 給 OS / MySQL 其他結構 / connection buffer）</li>
<li><em>Shared server</em>：RAM 30-50%（看其他 process 需求）</li>
<li><em>Container / Kubernetes</em>：對 container memory limit 70%（不是 host RAM）</li>
</ul>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 64 GB RAM dedicated server</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">innodb_buffer_pool_size</span> <span class="o">=</span> <span class="s">48G</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">innodb_buffer_pool_instances</span> <span class="o">=</span> <span class="s">8  # 分 8 個 instance 降 mutex contention（每 instance 6 GB）</span></span></span></code></pre></div><p><strong>Buffer pool warm-up</strong>：MySQL 重啟後 buffer pool 是空的、要慢慢從 disk 把熱資料拉回 RAM。預設 5.7+ MySQL 啟動時 <em>dump buffer pool LRU list 到 disk</em>、重啟時 <em>自動 restore</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">innodb_buffer_pool_dump_at_shutdown</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">innodb_buffer_pool_load_at_startup</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">innodb_buffer_pool_dump_pct</span> <span class="o">=</span> <span class="s">75  # 只 dump 最 hot 的 75% page list</span></span></span></code></pre></div><p>沒這個 warm-up、重啟後第 1 個小時 query latency 都偏高、application 看到 p99 spike。</p>
<h2 id="knob-2redo-log--flush-頻率跟寫吞吐">Knob 2：Redo log — flush 頻率跟寫吞吐</h2>
<p>InnoDB 寫入 <em>先寫 redo log（順序寫）</em>、再非同步寫到 data file（隨機寫）。Redo log 滿了強迫 flush data file、flush 期間寫吞吐降。</p>
<p><code>innodb_log_file_size</code> 控制每個 log file 大小（預設 2 個 file）：</p>
<ul>
<li>5.7：預設 48 MB × 2 = 96 MB total</li>
<li>8.0：預設仍是 48 MB × 2、8.0.30+ 改用動態 <code>innodb_redo_log_capacity</code>（default 100 MB total）</li>
</ul>
<p>對 5K WPS server、預設容量可能 <em>每分鐘 flush 一次</em>、寫吞吐持續 stall。提高到 1-4 GB total、flush 改成每 30 分鐘一次、寫吞吐穩定。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">innodb_log_file_size</span> <span class="o">=</span> <span class="s">2G       # 大寫吞吐 server 設 1-4 GB</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">innodb_log_files_in_group</span> <span class="o">=</span> <span class="s">2   # 預設 2 個就夠</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">innodb_log_buffer_size</span> <span class="o">=</span> <span class="s">64M    # log 寫 disk 前的 RAM buffer</span></span></span></code></pre></div><p><strong>Trade-off</strong>：log file 越大、recovery 時間越長（crash 後 InnoDB 要 replay 全部 log）。1 GB log 通常 &lt; 1 分鐘 recovery、4 GB 可能 5 分鐘以上。SSD / NVMe 這個 trade-off 不嚴重、HDD 要注意。</p>
<p>MySQL 8.0+ 改進：log file 可動態調整（不用重啟）、且 <em>automatic redo log writer threads</em> 降低 mutex contention。</p>
<h2 id="knob-3flush-method--acid-vs-吞吐">Knob 3：Flush method — ACID vs 吞吐</h2>
<p><code>innodb_flush_log_at_trx_commit</code> 控制 <em>每個 transaction commit 時要不要 flush log 到 disk</em>：</p>
<ul>
<li><code>1</code>（預設）：每次 commit fsync log file → <em>zero data loss on crash</em></li>
<li><code>2</code>：每次 commit 寫 log file（但 OS-level cache、不 fsync）→ <em>server crash 不丟、OS crash 丟 1 秒</em></li>
<li><code>0</code>：每秒 fsync 一次 → <em>任何 crash 丟 1 秒</em></li>
</ul>
<p><code>sync_binlog</code> 對應 binlog（不是 InnoDB log）：</p>
<ul>
<li><code>1</code>（建議）：每次 commit fsync binlog</li>
<li><code>0</code>：依賴 OS sync、容易丟 binlog → replication / CDC 風險</li>
</ul>
<p><strong>Production 組合</strong>：</p>
<table>
  <thead>
      <tr>
          <th>用途</th>
          <th><code>innodb_flush_log_at_trx_commit</code></th>
          <th><code>sync_binlog</code></th>
          <th>寫吞吐</th>
          <th>Crash data loss</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>金融 / 訂單 / 支付</td>
          <td>1</td>
          <td>1</td>
          <td>baseline</td>
          <td>0</td>
      </tr>
      <tr>
          <td>一般 web 應用</td>
          <td>1</td>
          <td>1</td>
          <td>baseline</td>
          <td>0</td>
      </tr>
      <tr>
          <td>高寫吞吐 + 容忍 1 sec loss</td>
          <td>2</td>
          <td>1</td>
          <td>+30-50%</td>
          <td>OS crash 丟 1 秒</td>
      </tr>
      <tr>
          <td>Dev / test</td>
          <td>2</td>
          <td>0</td>
          <td>+50-100%</td>
          <td>不重要</td>
      </tr>
      <tr>
          <td>不要這樣設</td>
          <td>0</td>
          <td>0</td>
          <td>+100%</td>
          <td>任意 crash 丟資料</td>
      </tr>
  </tbody>
</table>
<p>多數 production 用 <code>1 + 1</code>、雖然慢但 <em>簡單可預測</em>。改成 <code>2 + 1</code> 之前要明確 <em>能容忍 1 秒 data loss</em>、且通常 review 過 Disaster Recovery Plan。</p>
<h2 id="knob-4io-capacity--適配儲存">Knob 4：IO capacity — 適配儲存</h2>
<p>InnoDB 後台 flush 速度受 <code>innodb_io_capacity</code> 限制：</p>
<ul>
<li><code>innodb_io_capacity</code>（一般）：後台 flush 目標 IOPS</li>
<li><code>innodb_io_capacity_max</code>（突發）：emergency flush 上限</li>
</ul>
<p><strong>對應儲存類型</strong>：</p>
<table>
  <thead>
      <tr>
          <th>儲存</th>
          <th>IOPS 能力</th>
          <th><code>innodb_io_capacity</code></th>
          <th><code>innodb_io_capacity_max</code></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>7200 RPM HDD</td>
          <td>~80 IOPS</td>
          <td>100</td>
          <td>200</td>
      </tr>
      <tr>
          <td>SSD (SATA)</td>
          <td>10K-50K IOPS</td>
          <td>2000</td>
          <td>20000</td>
      </tr>
      <tr>
          <td>NVMe SSD</td>
          <td>100K-500K IOPS</td>
          <td>10000</td>
          <td>40000</td>
      </tr>
      <tr>
          <td>EBS gp3</td>
          <td>3000-16000 IOPS</td>
          <td>5000</td>
          <td>16000</td>
      </tr>
      <tr>
          <td>EBS io2</td>
          <td>50K-256K IOPS</td>
          <td>20000</td>
          <td>60000</td>
      </tr>
  </tbody>
</table>
<p>預設 <code>200 / 2000</code> 是 <em>為 HDD 設計</em>、SSD / NVMe server 用預設值 = InnoDB 自我限速、flush 慢、寫入瓶頸。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># NVMe SSD server</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">innodb_io_capacity</span> <span class="o">=</span> <span class="s">10000</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">innodb_io_capacity_max</span> <span class="o">=</span> <span class="s">40000</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">innodb_flush_neighbors</span> <span class="o">=</span> <span class="s">0  # NVMe 不需要 group flush 相鄰 page</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-buffer-pool-沒-warm-up--重啟後-1-小時-p99-飆">1. Buffer pool 沒 warm-up — 重啟後 1 小時 p99 飆</h3>
<p>MySQL 重啟（OS upgrade / config change / failover）後、buffer pool 是空的、所有 query 第一次都 disk 讀、p99 latency 飆 5-10x、application 看到 timeout。</p>
<p>修法：</p>
<ul>
<li>啟用 <code>innodb_buffer_pool_dump_at_shutdown=1</code> + <code>innodb_buffer_pool_load_at_startup=1</code></li>
<li>對 <em>沒 graceful shutdown</em> 的 crash（OOM / kernel panic）、buffer pool 沒 dump、warm-up 後第一個小時仍辛苦</li>
<li>重要 server 重啟前手動 dump：<code>SET GLOBAL innodb_buffer_pool_dump_now=ON</code></li>
<li>對於不能容忍 cold cache 的場景、failover 前 <em>先 pre-warm new primary</em>（用 query replay 把 hot data 拉到 buffer pool）</li>
</ul>
<h3 id="2-log-file-size-設太小--checkpoint-storm">2. Log file size 設太小 — checkpoint storm</h3>
<p><code>innodb_log_file_size=48M</code> 預設、高寫吞吐 server log 每分鐘 flush 一次、flush 期間 <em>checkpoint storm</em> — 寫吞吐降 50%、p99 暴增。錯誤訊號是 <code>innodb_log_waits</code> 持續 &gt; 0。</p>
<p>修法：</p>
<ul>
<li>監控 <code>SHOW STATUS LIKE 'Innodb_log_waits'</code> — 應該長期接近 0</li>
<li>提高 <code>innodb_log_file_size</code> 到 1-4 GB（依寫吞吐）</li>
<li>8.0+ 可動態調整、5.7 需要 <em>正常 shutdown</em> 後改、開啟前先 dump buffer pool（避免 cold cache）</li>
</ul>
<h3 id="3-sync_binlog0-換速度--replication-永久-broken-風險">3. <code>sync_binlog=0</code> 換速度 — replication 永久 broken 風險</h3>
<p>開發 / staging 改 <code>sync_binlog=0</code>（加快寫入）、後來複製到 production 配置、production 同樣 <code>sync_binlog=0</code>。OS crash 後 binlog 缺最後幾秒 transaction、replica 跟 primary GTID set diverge、replication broken、要 <em>重建 replica from base backup</em>（小時級 recovery）。</p>
<p>修法：</p>
<ul>
<li><em>Production 永遠用 <code>sync_binlog=1</code></em>、不要為了寫吞吐犧牲 binlog durability</li>
<li>開發 / staging 配置跟 production 隔離、不要直接 copy config</li>
<li>Replica 失聯後 <em>用 GTID 自動 re-attach</em>（不是 binlog position）— 仍然需要 binlog 完整、<code>sync_binlog=0</code> 仍是風險</li>
</ul>
<h3 id="4-io-scheduler--不是-innodb-tuning-但影響大">4. IO scheduler — 不是 InnoDB tuning 但影響大</h3>
<p>Linux <code>noop</code> / <code>deadline</code> / <code>cfq</code> IO scheduler 對 SSD / NVMe 影響大：</p>
<ul>
<li><code>cfq</code>（traditional spinning disk default）：對 SSD 嚴重 bottleneck</li>
<li><code>deadline</code>：對 SSD 較好、但有 latency cap</li>
<li><code>noop</code> / <code>none</code>：對 NVMe 最好（讓 device 自己處理 queue）</li>
</ul>
<p><strong>Production check</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">cat /sys/block/sda/queue/scheduler
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 應該顯示： [none] mq-deadline (NVMe)</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># 或：         noop deadline [cfq] (cfq 是錯的)</span></span></span></code></pre></div><p>不是 InnoDB knob、但影響 InnoDB IO behavior &gt; 30%。InnoDB tuning 前先確認 OS-level IO scheduler 對。</p>
<h3 id="5-undo-log-膨脹--purge-跟不上">5. Undo log 膨脹 — purge 跟不上</h3>
<p>Undo log 紀錄 <em>未來可能 rollback 需要的舊版本 row</em>。長 transaction（hours-level）讓 undo log 持續累積、不能 purge、最後 InnoDB tablespace 膨脹幾 GB、disk 滿。</p>
<p>訊號：</p>
<ul>
<li><code>SHOW ENGINE INNODB STATUS</code> 看 <code>History list length</code> 持續成長（正常 &lt; 1000、異常 millions）</li>
<li><code>information_schema.innodb_metrics</code> 的 <code>trx_rseg_history_len</code></li>
</ul>
<p>修法：</p>
<ul>
<li>找 long-running transaction：<code>SELECT * FROM information_schema.innodb_trx WHERE trx_started &lt; NOW() - INTERVAL 1 HOUR</code></li>
<li>KILL 該 transaction（謹慎、可能 application bug）</li>
<li>8.0+ 用 separate undo tablespace（<code>innodb_undo_tablespaces</code>）、不污染 main tablespace、且可以 truncate</li>
</ul>
<h2 id="容量規劃要點">容量規劃要點</h2>
<p>對 64 GB RAM、NVMe SSD、5K WPS、100 GB DB 的 server：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># my.cnf production-ready baseline</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="k">[mysqld]</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># Buffer pool (75% RAM)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="na">innodb_buffer_pool_size</span> <span class="o">=</span> <span class="s">48G</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="na">innodb_buffer_pool_instances</span> <span class="o">=</span> <span class="s">8</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="na">innodb_buffer_pool_dump_at_shutdown</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="na">innodb_buffer_pool_load_at_startup</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># Redo log</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="na">innodb_log_file_size</span> <span class="o">=</span> <span class="s">2G</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="na">innodb_log_files_in_group</span> <span class="o">=</span> <span class="s">2</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="na">innodb_log_buffer_size</span> <span class="o">=</span> <span class="s">64M</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># Flush behavior</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="na">innodb_flush_log_at_trx_commit</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="na">sync_binlog</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="na">innodb_flush_method</span> <span class="o">=</span> <span class="s">O_DIRECT  # 跳過 OS page cache 避免 double cache</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"># IO capacity (NVMe)</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="na">innodb_io_capacity</span> <span class="o">=</span> <span class="s">10000</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="na">innodb_io_capacity_max</span> <span class="o">=</span> <span class="s">40000</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="na">innodb_flush_neighbors</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="na">innodb_lru_scan_depth</span> <span class="o">=</span> <span class="s">1024</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1"># Concurrency</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="na">innodb_thread_concurrency</span> <span class="o">=</span> <span class="s">0  # 0 = no limit (8.0+ 推薦)</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl"><span class="na">innodb_read_io_threads</span> <span class="o">=</span> <span class="s">8</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="na">innodb_write_io_threads</span> <span class="o">=</span> <span class="s">8</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">
</span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="c1"># 額外</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl"><span class="na">innodb_file_per_table</span> <span class="o">=</span> <span class="s">1</span>
</span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="na">innodb_strict_mode</span> <span class="o">=</span> <span class="s">1</span></span></span></code></pre></div><p>跨不同 server spec、<code>buffer_pool_size</code> / <code>io_capacity</code> 隨硬體調整、其他 knob 變動小。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p><code>sync_binlog=1</code> + <code>innodb_flush_log_at_trx_commit=1</code> 是 <em>durability baseline</em>、影響 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a> 的 <em>primary durability</em>。Semi-sync 加在這基礎上提供 <em>跨 server durability</em>。</p>
<h3 id="跟-proxysql">跟 ProxySQL</h3>
<p>ProxySQL connection pool 降低 <em>MySQL connection 開銷</em>、但 <em>每個 connection</em> 仍消耗 8-10 MB RAM（thread stack + session buffer）。Buffer pool 設 75% RAM 後、剩 25% 給 connection / temporary buffer / OS。Connection 太多會擠掉 buffer pool。</p>
<p>詳見 <a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">ProxySQL 配置</a>。</p>
<h3 id="跟-aurora-mysql">跟 Aurora MySQL</h3>
<p>Aurora 改寫 InnoDB storage layer、上方 knob 大多 <em>Aurora 自動管理</em>：</p>
<ul>
<li>Buffer pool size：Aurora compute instance 自動配</li>
<li>Redo log：Aurora 自己的 distributed log、不用 <code>innodb_log_file_size</code></li>
<li><code>sync_binlog</code> / <code>innodb_flush_log_at_trx_commit</code>：Aurora storage layer 保證 durability、應用層 knob 影響小</li>
</ul>
<p>Aurora user 仍可 tune <code>innodb_buffer_pool_size</code> 等、但操作面從 InnoDB 內部議題變成 <em>Aurora instance class 選擇</em>。詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>。</p>
<h3 id="跟-osc-tool">跟 OSC tool</h3>
<p>InnoDB tuning 不直接影響 OSC 工具行為、但 <em>log file size 太小</em> 時 gh-ost / pt-osc 寫 ghost table 容易 trigger checkpoint storm、放慢整個 schema migration。詳見 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>。</p>
<h2 id="觀測-metric">觀測 metric</h2>
<p><code>SHOW STATUS LIKE</code> + Performance Schema 提供：</p>
<ul>
<li><code>Innodb_buffer_pool_read_requests</code> / <code>_reads</code> → cache hit ratio = <code>1 - reads/read_requests</code>、應該 &gt; 99%</li>
<li><code>Innodb_log_waits</code> → checkpoint pressure、應該 = 0</li>
<li><code>Innodb_log_write_requests</code> / <code>_writes</code> → log buffer 效率</li>
<li><code>Innodb_rows_inserted</code> / <code>_updated</code> / <code>_read</code> → workload 形狀</li>
<li><code>Innodb_row_lock_waits</code> / <code>_time</code> → lock contention</li>
</ul>
<p>把這些丟進 <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a> / <a href="/blog/backend/04-observability/vendors/prometheus/" data-link-title="Prometheus" data-link-desc="Pull-based metrics 主流 OSS、PromQL 與 alerting">Prometheus</a> 透過 <a href="https://github.com/prometheus/mysqld_exporter">mysqld_exporter</a> / <a href="https://www.percona.com/software/database-tools/percona-monitoring-and-management">Percona Monitoring</a> 持續 trend。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（<code>sync_binlog</code> 跟 replication 互動）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">MySQL ProxySQL 配置</a>（connection 跟 buffer pool 爭 RAM）</li>
<li><a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>（managed MySQL、InnoDB tuning 部分轉手）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/autovacuum-tuning/" data-link-title="PostgreSQL autovacuum tuning：為什麼你的 autovacuum 永遠追不上 bloat" data-link-desc="MVCC 怎麼產生 dead tuple、autovacuum cost-based throttle 為什麼預設保守、per-table tuning 怎麼設、5 個 production 踩雷（cost_limit 太低 / 長 transaction blocks vacuum / anti-wraparound 在 peak / partition vacuum 滿 worker / index bloat 沒處理）、跟 partitioning &#43; monitoring 整合">PostgreSQL Autovacuum Tuning</a>（PG sibling、不同 engine 內部 tuning）</li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-default-se.html">InnoDB Configuration</a> / <a href="https://www.percona.com/blog/mysql-101-tuning-mysql-after-installation/">Percona Tuning Guide</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Binary Log + CDC：Maxwell / Debezium 是 binlog 第二消費者</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/binlog-cdc/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/binlog-cdc/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>CDC&lt;/em> — Maxwell / Debezium 怎麼讀 binlog 產生 event stream。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;p>MySQL CDC 的核心定位是 &lt;em>binlog consumer&lt;/em>。&lt;/p>
&lt;p>這個誤解來自跟 PostgreSQL CDC（&lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &amp;#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium&lt;/a>）混用名詞。PG 的 logical decoding 是 &lt;em>MySQL 沒有的能力&lt;/em> — PG 有 logical event（INSERT / UPDATE / DELETE 加上欄位 metadata）、輸出格式是 logical（人可讀、schema-aware）。MySQL 的 binlog 是 &lt;em>physical&lt;/em> — 紀錄的是 row 的 binary image、不帶 schema 資訊。&lt;/p>
&lt;p>Maxwell / Debezium 對 MySQL 是 &lt;em>binlog 第二消費者&lt;/em>：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Primary MySQL → binlog
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ├→ Replica 1（讀 binlog 同步）
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> ├→ Replica 2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> └→ Maxwell / Debezium（讀 binlog 解析、發 Kafka）&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>跟 replica 同一份 binlog stream，並非 separate logical decoding output。這個結構決定 CDC consumer 的設計：必須 &lt;em>自己處理 schema&lt;/em>（從 information_schema 拉、跟 binlog event 對齊）、必須 &lt;em>自己 track position&lt;/em>（binlog file + position 或 GTID）。&lt;/p>
&lt;h2 id="binlog-formatstatement--row--mixed">Binlog format：STATEMENT / ROW / MIXED&lt;/h2>
&lt;p>MySQL binlog 有 3 種 format、CDC 只能用 ROW：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Format&lt;/th>
 &lt;th>紀錄內容&lt;/th>
 &lt;th>CDC 可用？&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>STATEMENT&lt;/td>
 &lt;td>原始 SQL statement&lt;/td>
 &lt;td>不可用（CDC 看不到實際改的 row）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>ROW&lt;/td>
 &lt;td>每個改變的 row（before + after image）&lt;/td>
 &lt;td>CDC 標準&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>MIXED&lt;/td>
 &lt;td>預設 STATEMENT、特殊情況用 ROW&lt;/td>
 &lt;td>不推薦（CDC 行為不一致）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>ROW 是 CDC 唯一選擇、production 強制：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>CDC</em> — Maxwell / Debezium 怎麼讀 binlog 產生 event stream。</p></blockquote>
<hr>
<p>MySQL CDC 的核心定位是 <em>binlog consumer</em>。</p>
<p>這個誤解來自跟 PostgreSQL CDC（<a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">Logical Replication + Debezium</a>）混用名詞。PG 的 logical decoding 是 <em>MySQL 沒有的能力</em> — PG 有 logical event（INSERT / UPDATE / DELETE 加上欄位 metadata）、輸出格式是 logical（人可讀、schema-aware）。MySQL 的 binlog 是 <em>physical</em> — 紀錄的是 row 的 binary image、不帶 schema 資訊。</p>
<p>Maxwell / Debezium 對 MySQL 是 <em>binlog 第二消費者</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Primary MySQL → binlog
</span></span><span class="line"><span class="ln">2</span><span class="cl">              ├→ Replica 1（讀 binlog 同步）
</span></span><span class="line"><span class="ln">3</span><span class="cl">              ├→ Replica 2
</span></span><span class="line"><span class="ln">4</span><span class="cl">              └→ Maxwell / Debezium（讀 binlog 解析、發 Kafka）</span></span></code></pre></div><p>跟 replica 同一份 binlog stream，並非 separate logical decoding output。這個結構決定 CDC consumer 的設計：必須 <em>自己處理 schema</em>（從 information_schema 拉、跟 binlog event 對齊）、必須 <em>自己 track position</em>（binlog file + position 或 GTID）。</p>
<h2 id="binlog-formatstatement--row--mixed">Binlog format：STATEMENT / ROW / MIXED</h2>
<p>MySQL binlog 有 3 種 format、CDC 只能用 ROW：</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>紀錄內容</th>
          <th>CDC 可用？</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>STATEMENT</td>
          <td>原始 SQL statement</td>
          <td>不可用（CDC 看不到實際改的 row）</td>
      </tr>
      <tr>
          <td>ROW</td>
          <td>每個改變的 row（before + after image）</td>
          <td>CDC 標準</td>
      </tr>
      <tr>
          <td>MIXED</td>
          <td>預設 STATEMENT、特殊情況用 ROW</td>
          <td>不推薦（CDC 行為不一致）</td>
      </tr>
  </tbody>
</table>
<p>ROW 是 CDC 唯一選擇、production 強制：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="na">binlog_format</span> <span class="o">=</span> <span class="s">ROW</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">binlog_row_image</span> <span class="o">=</span> <span class="s">FULL  # FULL (all columns) / MINIMAL (only changed) / NOBLOB</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">log_bin_use_v1_row_events</span> <span class="o">=</span> <span class="s">0  # 用新版 event format</span></span></span></code></pre></div><p><code>binlog_row_image</code> 取捨：</p>
<ul>
<li><code>FULL</code>：每個 row event 包含所有 column（before + after）、binlog 大、CDC 完整</li>
<li><code>MINIMAL</code>：只包含 changed column + primary key、binlog 省 30-50% 空間、CDC 看不到 <em>未變 column</em></li>
<li><code>NOBLOB</code>：跟 FULL 一樣但 BLOB / TEXT column 只在 changed 時包含、平衡選擇</li>
</ul>
<p>對 <em>CDC 需要 full row payload</em>（例如下游 search index 重建）必須 <code>FULL</code>。對 <em>純 audit log</em> 可以 <code>MINIMAL</code>。</p>
<h2 id="row-format-的-raw-event-結構">ROW format 的 raw event 結構</h2>
<p>Binlog ROW event 的資料形狀是 <em>binary row image</em>，而非 <em>INSERT INTO orders VALUES (1, &lsquo;foo&rsquo;, 100)</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">TABLE_MAP_EVENT     - 對應 table schema metadata (table id + column type)
</span></span><span class="line"><span class="ln">2</span><span class="cl">                      ↓ 接續同一個 transaction 內所有 row event
</span></span><span class="line"><span class="ln">3</span><span class="cl">WRITE_ROWS_EVENT    - INSERT 的新 row image（column values）
</span></span><span class="line"><span class="ln">4</span><span class="cl">UPDATE_ROWS_EVENT   - UPDATE 的 before + after image
</span></span><span class="line"><span class="ln">5</span><span class="cl">DELETE_ROWS_EVENT   - DELETE 的 row image（被刪的 row）
</span></span><span class="line"><span class="ln">6</span><span class="cl">XID_EVENT           - transaction commit marker</span></span></code></pre></div><p>CDC consumer（Maxwell / Debezium）必須：</p>
<ol>
<li>接收 binlog event stream</li>
<li>看到 <code>TABLE_MAP_EVENT</code> 從中拿 table id → 對應 table name（cache 一份）</li>
<li>看到 <code>WRITE/UPDATE/DELETE_ROWS_EVENT</code> 用 table id 反查 schema、把 binary 解析成 column value</li>
<li>包成 JSON / Avro / Protobuf 推到 Kafka</li>
</ol>
<p>關鍵：<em>table schema 不在 binlog 內</em>、CDC consumer 必須 <em>獨立查 information_schema</em>。如果 schema 變了（ALTER TABLE）、CDC 必須 invalidate cache、重新查、否則新 column 的 row event 解析錯亂。</p>
<h2 id="maxwell-vs-debezium">Maxwell vs Debezium</h2>
<p>兩個是 MySQL CDC 主流選擇、不同設計取捨：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Maxwell</th>
          <th>Debezium MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>開發者</td>
          <td>Zendesk</td>
          <td>Red Hat</td>
      </tr>
      <tr>
          <td>語言</td>
          <td>Java（單一 binary）</td>
          <td>Java（Kafka Connect plugin）</td>
      </tr>
      <tr>
          <td>部署模式</td>
          <td>Standalone process</td>
          <td>Kafka Connect cluster</td>
      </tr>
      <tr>
          <td>支援 DB</td>
          <td>MySQL only</td>
          <td>MySQL / PostgreSQL / MongoDB / SQL Server / Oracle</td>
      </tr>
      <tr>
          <td>Output format</td>
          <td>JSON（內建）</td>
          <td>JSON / Avro / Protobuf（Kafka Connect）</td>
      </tr>
      <tr>
          <td>Producer</td>
          <td>Kafka / Kinesis / RabbitMQ / Pub/Sub</td>
          <td>Kafka（Kafka Connect 限制）</td>
      </tr>
      <tr>
          <td>Schema registry</td>
          <td>不支援</td>
          <td>支援（Confluent Schema Registry / Apicurio）</td>
      </tr>
      <tr>
          <td>Transformation</td>
          <td>filter / stream-level（內建）</td>
          <td>Single Message Transform (SMT)</td>
      </tr>
      <tr>
          <td>Bootstrapping</td>
          <td>一個 utility 從 <code>SELECT *</code> snapshot</td>
          <td>Built-in snapshot mode</td>
      </tr>
      <tr>
          <td>GTID 支援</td>
          <td>支援</td>
          <td>支援</td>
      </tr>
      <tr>
          <td>簡單性</td>
          <td>高（單一 binary）</td>
          <td>中（Kafka Connect 框架成本）</td>
      </tr>
  </tbody>
</table>
<p>選擇邏輯：</p>
<ul>
<li><em>只用 MySQL + 想要 simple operations</em> → Maxwell</li>
<li><em>已用 Kafka Connect、需要 schema registry、跨多種 DB</em> → Debezium</li>
<li><em>需要 Avro / Protobuf schema 嚴格 governance</em> → Debezium</li>
</ul>
<h2 id="配置-step-by-stepdebezium-mysql-connector">配置 step-by-step（Debezium MySQL connector）</h2>
<p>Debezium 是 Kafka Connect plugin、整套 stack：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c"># debezium-mysql.json - 部署到 Kafka Connect REST API</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span>{<span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">  </span><span class="nt">&#34;name&#34;: </span><span class="s2">&#34;orders-mysql-connector&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">  </span><span class="nt">&#34;config&#34;: </span>{<span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="nt">&#34;connector.class&#34;: </span><span class="s2">&#34;io.debezium.connector.mysql.MySqlConnector&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.hostname&#34;: </span><span class="s2">&#34;primary.example.com&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.port&#34;: </span><span class="s2">&#34;3306&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.user&#34;: </span><span class="s2">&#34;debezium&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.password&#34;: </span><span class="s2">&#34;...&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.server.id&#34;: </span><span class="s2">&#34;184054&#34;</span><span class="p">,</span><span class="w">          </span><span class="c"># 唯一 server ID (跟 MySQL replica 一樣)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="nt">&#34;topic.prefix&#34;: </span><span class="s2">&#34;production&#34;</span><span class="p">,</span><span class="w">            </span><span class="c"># Debezium 2.x（舊 1.x 用 database.server.name）</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.include.list&#34;: </span><span class="s2">&#34;orders_db&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">    </span><span class="nt">&#34;table.include.list&#34;: </span><span class="s2">&#34;orders_db.orders,orders_db.payments&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.history.kafka.bootstrap.servers&#34;: </span><span class="s2">&#34;kafka:9092&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="w">    </span><span class="nt">&#34;database.history.kafka.topic&#34;: </span><span class="s2">&#34;dbhistory.orders&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="w">    </span><span class="nt">&#34;include.schema.changes&#34;: </span><span class="s2">&#34;true&#34;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="w">    </span><span class="nt">&#34;snapshot.mode&#34;: </span><span class="s2">&#34;initial&#34;</span><span class="p">,</span><span class="w">              </span><span class="c"># 或 schema_only / when_needed / never</span><span class="w">
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="w">    </span><span class="nt">&#34;snapshot.locking.mode&#34;: </span><span class="s2">&#34;minimal&#34;</span><span class="p">,</span><span class="w">      </span><span class="c"># 避免 FLUSH TABLES WITH READ LOCK</span><span class="w">
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="w">    </span><span class="nt">&#34;gtid.source.includes&#34;: </span><span class="s2">&#34;...&#34;</span><span class="p">,</span><span class="w">           </span><span class="c"># 可選 GTID filter</span><span class="w">
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="w">    </span><span class="nt">&#34;tombstones.on.delete&#34;: </span><span class="s2">&#34;true&#34;</span><span class="p">,</span><span class="w">          </span><span class="c"># DELETE event 同 partition 跟一個 null tombstone</span><span class="w">
</span></span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="w">    </span><span class="nt">&#34;decimal.handling.mode&#34;: </span><span class="s2">&#34;double&#34;</span><span class="w">        </span><span class="c"># DECIMAL 處理: precise / string / double</span><span class="w">
</span></span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="w">  </span>}<span class="w">
</span></span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="w"></span>}</span></span></code></pre></div><p>deploy：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">curl -X POST -H <span class="s2">&#34;Content-Type: application/json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --data @debezium-mysql.json <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  http://kafka-connect:8083/connectors</span></span></code></pre></div><p>Output topic：<code>production.orders_db.orders</code> / <code>production.orders_db.payments</code> 等 — 每張 table 一個 topic。</p>
<h2 id="配置-step-by-stepmaxwell">配置 step-by-step（Maxwell）</h2>
<p>Maxwell 簡單很多：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">maxwell <span class="se">\
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="se"></span>  --host<span class="o">=</span>primary.example.com <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  --user<span class="o">=</span>maxwell <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  --password<span class="o">=</span>... <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>  --producer<span class="o">=</span>kafka <span class="se">\
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="se"></span>  --kafka.bootstrap.servers<span class="o">=</span>kafka:9092 <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  --kafka_topic<span class="o">=</span><span class="s2">&#34;maxwell.%{database}.%{table}&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --filter<span class="o">=</span><span class="s1">&#39;exclude: *.*, include: orders_db.*&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  --gtid_mode<span class="o">=</span><span class="nb">true</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --output_ddl<span class="o">=</span><span class="nb">true</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --output_xoffset<span class="o">=</span>true</span></span></code></pre></div><p>Maxwell event format：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  <span class="nt">&#34;database&#34;</span><span class="p">:</span> <span class="s2">&#34;orders_db&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nt">&#34;table&#34;</span><span class="p">:</span> <span class="s2">&#34;orders&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;update&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  <span class="nt">&#34;ts&#34;</span><span class="p">:</span> <span class="mi">1715000000</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  <span class="nt">&#34;xid&#34;</span><span class="p">:</span> <span class="mi">12345</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  <span class="nt">&#34;commit&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="nt">&#34;data&#34;</span><span class="p">:</span> <span class="p">{</span> <span class="nt">&#34;id&#34;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="nt">&#34;status&#34;</span><span class="p">:</span> <span class="s2">&#34;shipped&#34;</span><span class="p">,</span> <span class="nt">&#34;amount&#34;</span><span class="p">:</span> <span class="mf">100.50</span> <span class="p">},</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  <span class="nt">&#34;old&#34;</span><span class="p">:</span> <span class="p">{</span> <span class="nt">&#34;status&#34;</span><span class="p">:</span> <span class="s2">&#34;pending&#34;</span> <span class="p">}</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>Debezium 對應的 event 格式更複雜（envelope + before + after + source + ts_ms 各 nested）、但跟 schema registry 整合好。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-binlog-retention-太短--cdc-consumer-落後就-re-bootstrap">1. Binlog retention 太短 — CDC consumer 落後就 re-bootstrap</h3>
<p>CDC consumer 失聯（Kafka Connect cluster down、network issue）超過 binlog retention（預設 <code>binlog_expire_logs_seconds=2592000</code>、30 天、但有些 production 縮短到 1 天）、需要的 binlog event 已被 purge、consumer error。</p>
<p>修法：</p>
<ul>
<li><em>Production binlog retention &gt;= 7 天</em>（避免為了 disk 過度縮短）</li>
<li>監控 <code>Master_Log_File</code> 是否還在（如果 retention 設 7 天、確認當前 file 仍存在）</li>
<li>CDC consumer 失聯 alert 設 <em>早於 retention 期</em>（例如 6 天告警、給 24 小時修）</li>
<li>真的 missed binlog、必須 <em>re-snapshot table</em>（用 Debezium <code>snapshot.new.tables</code>）— 24 小時級工作</li>
</ul>
<h3 id="2-ddl-event-處理--schema-change-跟-row-event-對齊">2. DDL event 處理 — schema change 跟 row event 對齊</h3>
<p><code>ALTER TABLE orders ADD COLUMN status VARCHAR(20)</code> 之後、<code>UPDATE_ROWS_EVENT</code> 多一個 column。CDC consumer 如果還用舊 schema cache、解析 row 時欄位數對不上、event 丟。</p>
<p>修法（Debezium）：</p>
<ul>
<li><code>include.schema.changes=true</code>：DDL 進獨立 topic、consumer 監聽更新自己的 schema cache</li>
<li><code>database.history.kafka.topic</code>：Debezium 自己 track schema 歷史</li>
</ul>
<p>修法（Maxwell）：</p>
<ul>
<li><code>--output_ddl=true</code>：DDL 也進 stream、downstream 看到 DDL event 自己更新</li>
<li>沒有內建 schema history、要 <em>application 層處理</em></li>
</ul>
<p>修法（兩者通用）：</p>
<ul>
<li>用 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a> 取代直接 ALTER — 工具操作的 DDL 對 CDC consumer 更可預期</li>
<li>Schema 改動 <em>優先 add column 為 nullable</em>、避免 backfill 期間 CDC consumer 看到 mid-state</li>
</ul>
<h3 id="3-binlog_row_imageminimal-讓下游錯亂">3. <code>binlog_row_image=MINIMAL</code> 讓下游錯亂</h3>
<p><code>MINIMAL</code> 省 binlog 空間、但 row event 只含 changed column。下游 <em>search index 重建</em> 需要 <em>full row payload</em> 的場景下、<code>MINIMAL</code> 看不到未變的 column、index 缺欄位。</p>
<p>修法：</p>
<ul>
<li>CDC 需要 full payload 的場景 <em>必須 <code>FULL</code></em>、這項成本要納入容量規劃</li>
<li>如果空間真緊、考慮 <code>NOBLOB</code>（BLOB / TEXT 只在 changed 時包含、其他 column 仍 FULL）</li>
<li><em>統一設定</em>：production 全部 server 同一 binlog_row_image 設定</li>
</ul>
<h3 id="4-kafka-producer-跟-binlog-reader-速度差--lag-累積">4. Kafka producer 跟 binlog reader 速度差 — lag 累積</h3>
<p>Binlog reader 從 MySQL 讀 1000 event/sec、Kafka producer 寫得只有 800 event/sec、CDC consumer 自身 lag 累積、最終 disk 滿（producer 內部 buffer）。</p>
<p>修法：</p>
<ul>
<li>監控 <em>CDC consumer lag</em>：對 Debezium 看 Kafka Connect 的 <code>source-record-poll-rate</code> vs <code>source-record-write-rate</code></li>
<li>Kafka producer tuning：<code>batch.size</code> / <code>linger.ms</code> / <code>compression.type=snappy</code></li>
<li>Kafka broker capacity：partition 數量 ≥ Debezium task 數量、避免 partition 瓶頸</li>
<li>避免把 <em>過多 table</em> 給單一 Debezium connector — 用 <em>table grouping</em>（按 traffic 拆 connector）</li>
</ul>
<h3 id="5-schema-change-跟-downstream-consumer-不同步">5. Schema change 跟 downstream consumer 不同步</h3>
<p>CDC producer（Debezium）正確處理了 schema change、但 <em>downstream Kafka consumer</em> 用舊 schema deserialize、新 column 看不到 / type 解析錯。</p>
<p>修法：</p>
<ul>
<li>用 <em>Schema Registry</em>（Confluent / Apicurio）+ Avro：consumer 訂閱 schema、自動 evolve</li>
<li>不用 schema registry 時、CDC payload 設計 <em>backward-compatible</em>（新 column 為 optional）</li>
<li><em>Application 層 schema change protocol</em>：<a href="/blog/backend/knowledge-cards/expand-contract/" data-link-title="Expand / Contract" data-link-desc="說明先擴充相容面、再收斂舊路徑的遷移做法">Expand / Contract</a> — 先加 column、deploy consumer 認 column、再 backfill、最後 application 寫新 column</li>
<li>大型 schema change 跨多服務、建議 <em>先 freeze CDC stream、做 schema migration、resume stream</em>（極端但確定）</li>
</ul>
<h2 id="容量規劃要點">容量規劃要點</h2>
<table>
  <thead>
      <tr>
          <th>元件</th>
          <th>容量考量</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MySQL binlog disk</td>
          <td>retention × 寫吞吐 × event size（5K WPS × 1 KB × 7 天 ~= 3 GB / 天 = 21 GB）</td>
      </tr>
      <tr>
          <td>Debezium / Maxwell process</td>
          <td>1 vCPU + 2-4 GB RAM（per connector、視 throughput）</td>
      </tr>
      <tr>
          <td>Kafka topic partition</td>
          <td>每 table 1-10 partition（依寫吞吐）、保 key-based ordering</td>
      </tr>
      <tr>
          <td>Kafka 保留期</td>
          <td>7-30 天（讓 downstream consumer 有 recover window）</td>
      </tr>
      <tr>
          <td>Schema Registry</td>
          <td>&lt; 100 MB storage、replicate 跨 3 broker</td>
      </tr>
  </tbody>
</table>
<p>對 100K WPS server、CDC pipeline cost 大致是 <em>MySQL infra 的 5-10%</em>。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>CDC 是 <em>binlog 第二消費者</em>、需要 <em>GTID + binlog ROW format</em>（<a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>）。Debezium / Maxwell 都偏好從 <em>replica</em> 讀 binlog（不增加 primary 負擔）、但要小心 replica lag 加在 CDC lag 上。</p>
<h3 id="跟-osc-tool">跟 OSC tool</h3>
<p><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">gh-ost / pt-osc</a> 跑 schema change 時、會在 binlog 留下大量 row event（copy 既有 row 到 ghost）。CDC consumer 看到這些 event <em>是 normal-looking INSERT</em>、可能誤觸發 downstream side effect。</p>
<p>修法：</p>
<ul>
<li>CDC consumer 過濾 <em>ghost table prefix</em>（<code>_orders_new</code> / <code>_orders_gho</code>）— 不發 downstream</li>
<li>或暫停 CDC 期間跑 OSC（用 Debezium pause API）</li>
</ul>
<h3 id="跟-postgresql-logical-replication--debezium">跟 PostgreSQL Logical Replication + Debezium</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL（binlog）</th>
          <th>PostgreSQL（logical decoding）</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>抽象層</td>
          <td>Physical（row binary）</td>
          <td>Logical（row + schema-aware）</td>
      </tr>
      <tr>
          <td>Schema metadata</td>
          <td>不在 event 內、要查 information_schema</td>
          <td>在 event 內（plugin output）</td>
      </tr>
      <tr>
          <td>DDL handling</td>
          <td>DDL 本身是 binlog event</td>
          <td>DDL 不在 logical decoding output（要 trigger 自己 capture）</td>
      </tr>
      <tr>
          <td>啟用成本</td>
          <td>binlog ROW + GTID（基本 MySQL replication setup）</td>
          <td>logical replication slot + publication</td>
      </tr>
      <tr>
          <td>Snapshot</td>
          <td><code>SELECT *</code> + binlog catchup</td>
          <td>logical replication initial sync</td>
      </tr>
  </tbody>
</table>
<p>詳見 <a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PostgreSQL Logical Replication + Debezium</a> — 這是 sibling 對照，用來區分不同 abstraction。</p>
<h3 id="跟-aurora-mysql">跟 Aurora MySQL</h3>
<p>Aurora MySQL 5.7 / 8.0 都支援 binlog + GTID、CDC 可用。但 Aurora 推薦走 <em>Aurora-native database activity streams</em>（不同 abstraction）— 跟 Debezium 共存但有 overlapping。生產上 Debezium 仍是 cross-cloud 跟 vendor-neutral 選項、優先用 Debezium。</p>
<p>詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>。</p>
<h2 id="production-caseshopify-sharded-mysql-cdc">Production case：Shopify sharded MySQL CDC</h2>
<p>Sharded MySQL CDC 的核心責任是把多個 shard 的 binlog 轉成可消費、可回放、可觀測的事件流。<a href="/blog/backend/03-message-queue/cases/kafka-shopify-debezium-cdc/" data-link-title="3.C13 Shopify：Debezium CDC over sharded MySQL" data-link-desc="Shopify 100&#43; MySQL shard、150 Debezium connector、Black Friday 100K records/sec P99 &lt; 10s。">Shopify Debezium CDC over sharded MySQL</a> 提供的工程訊號是 100+ shard、約 150 個 Debezium connector、BFCM 期間 100K records/sec，以及 snapshot lock 與 oversized payload 對 CDC pipeline 的壓力。</p>
<p>這個案例要回收到三個操作判準。第一，connector 數量應跟 shard 拓撲一起設計，避免單一 connector 變成跨 shard bottleneck。第二，snapshot window 要排進 schema migration 與 event consumer 的變更計畫，避免 initial snapshot 把 production read path 壓滿。第三，oversized payload 要在 schema / outbox / topic 分流階段處理，避免 Kafka partition 與 downstream consumer 同時承受大訊息。</p>
<p>Shopify 案例的下一步路由是把本篇和 <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding</a> 一起讀。若讀者關心 broker 層的 partition、consumer lag 與 replay 策略，接到 <a href="/blog/backend/03-message-queue/vendors/kafka/" data-link-title="Apache Kafka" data-link-desc="Distributed event streaming platform、log-based 模型">Kafka vendor</a>；若關心資料庫端壓力，回到 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a> 與 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（binlog ROW + GTID 是 CDC pre-requisite）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a>（OSC + CDC 整合）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/logical-replication-debezium/" data-link-title="PostgreSQL Logical Replication &#43; Debezium CDC：replication slot × failure × recovery 對照" data-link-desc="PostgreSQL logical replication slot 跟 Debezium CDC 的失效模式對照表：slot lag 撐爆 primary disk / schema change 斷流 / 初始 COPY 鎖表 / zombie slot 不釋放 / replay storm 後 offset reset；publication / subscription / pgoutput 配置、跟 Kafka outbox pattern 整合">PostgreSQL Logical Replication + Debezium</a>（PG sibling、不同 abstraction）</li>
<li><a href="/blog/backend/knowledge-cards/outbox-pattern/" data-link-title="Outbox Pattern" data-link-desc="說明資料庫狀態變更與事件發布如何透過 outbox 維持一致">Outbox pattern 卡片</a>（CDC 跟 outbox 在 application-level event publishing 的關係）</li>
<li><a href="/blog/backend/knowledge-cards/expand-contract/" data-link-title="Expand / Contract" data-link-desc="說明先擴充相容面、再收斂舊路徑的遷移做法">Expand / Contract 卡片</a>（schema migration 跟 CDC consumer）</li>
<li>官方：<a href="https://debezium.io/documentation/reference/stable/connectors/mysql.html">Debezium MySQL Connector</a> / <a href="https://maxwells-daemon.io/">Maxwell</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/vitess-sharding/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/vitess-sharding/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>Vitess sharding&lt;/em> — 4 個 component 協作的完整 sharding 系統。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="問題情境mysql-寫吞吐撞上-single-primary-上限">問題情境：MySQL 寫吞吐撞上 single primary 上限&lt;/h2>
&lt;p>MySQL primary 單機極限大致 50K-100K WPS（依 schema / hardware）。超過這個級別、選項三條：&lt;/p>
&lt;ol>
&lt;li>&lt;em>Application 層 sharding&lt;/em>：每張 table 自己決定怎麼分片、application 寫 routing logic、跨 shard query / migration 都要自己處理&lt;/li>
&lt;li>&lt;em>Vitess&lt;/em>：proxy layer 自動 routing、cross-shard query 可選自動 split、resharding 自動化&lt;/li>
&lt;li>&lt;em>Distributed SQL&lt;/em>（CockroachDB / Spanner / Aurora DSQL）：跟 MySQL 不同 engine、application 改 driver&lt;/li>
&lt;/ol>
&lt;p>選 Vitess 的核心 driver：&lt;em>保留 MySQL wire protocol + 應用層幾乎不必改 + 透明分片&lt;/em>。代價是 4 個 component 的 operational complexity — Vitess 的責任範圍是完整分散式系統，而非單純 proxy。&lt;/p>
&lt;p>閱讀本文前可先對齊 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding&lt;/a> 的 shard key、routing、resharding 與 cross-shard query 語意；容量失衡時再接 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition&lt;/a>。&lt;/p>
&lt;h2 id="vitess-四件套每個-component-的責任">Vitess 四件套：每個 component 的責任&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl"> ┌─────────────────┐
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl"> Application ────→ │ VTGate │ ← 對外 MySQL wire protocol
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl"> │ (proxy + parse + route + aggregate) │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl"> └────┬─────┬──────┘
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl"> │ │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl"> ┌────────────┘ └──────────────┐
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl"> ▼ ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl"> ┌──────────────┐ ┌──────────────┐
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl"> │ VTTablet │ │ VTTablet │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl"> │ (per-MySQL │ │ (per-MySQL │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl"> │ sidecar) │ │ sidecar) │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl"> └─────┬────────┘ └─────┬────────┘
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl"> │ │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl"> ▼ ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl"> ┌──────────────┐ ┌──────────────┐
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl"> │ MySQL │ │ MySQL │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl"> │ (Shard -80) │ │ (Shard 80-) │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl"> └──────────────┘ └──────────────┘
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl"> Topology Service (etcd / Consul / ZooKeeper)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl"> ↑↓ 所有 component 共享 metadata
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl"> VSchema：keyspace 結構、shard 範圍、Vindex 定義&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="vtgate--query-routing-layer">VTGate — query routing layer&lt;/h3>
&lt;p>對 application 看起來像 MySQL（同樣 port、同樣 wire protocol、同樣 query 語法）、實際是 stateless proxy。每個 query VTGate：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>Vitess sharding</em> — 4 個 component 協作的完整 sharding 系統。</p></blockquote>
<hr>
<h2 id="問題情境mysql-寫吞吐撞上-single-primary-上限">問題情境：MySQL 寫吞吐撞上 single primary 上限</h2>
<p>MySQL primary 單機極限大致 50K-100K WPS（依 schema / hardware）。超過這個級別、選項三條：</p>
<ol>
<li><em>Application 層 sharding</em>：每張 table 自己決定怎麼分片、application 寫 routing logic、跨 shard query / migration 都要自己處理</li>
<li><em>Vitess</em>：proxy layer 自動 routing、cross-shard query 可選自動 split、resharding 自動化</li>
<li><em>Distributed SQL</em>（CockroachDB / Spanner / Aurora DSQL）：跟 MySQL 不同 engine、application 改 driver</li>
</ol>
<p>選 Vitess 的核心 driver：<em>保留 MySQL wire protocol + 應用層幾乎不必改 + 透明分片</em>。代價是 4 個 component 的 operational complexity — Vitess 的責任範圍是完整分散式系統，而非單純 proxy。</p>
<p>閱讀本文前可先對齊 <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding</a> 的 shard key、routing、resharding 與 cross-shard query 語意；容量失衡時再接 <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a>。</p>
<h2 id="vitess-四件套每個-component-的責任">Vitess 四件套：每個 component 的責任</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">                        ┌─────────────────┐
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   Application ────→    │     VTGate      │  ← 對外 MySQL wire protocol
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">                        │  (proxy + parse + route + aggregate)  │
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">                        └────┬─────┬──────┘
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                             │     │
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">                ┌────────────┘     └──────────────┐
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">                ▼                                 ▼
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        ┌──────────────┐                  ┌──────────────┐
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        │   VTTablet   │                  │   VTTablet   │
</span></span><span class="line"><span class="ln">10</span><span class="cl">        │ (per-MySQL   │                  │ (per-MySQL   │
</span></span><span class="line"><span class="ln">11</span><span class="cl">        │  sidecar)    │                  │  sidecar)    │
</span></span><span class="line"><span class="ln">12</span><span class="cl">        └─────┬────────┘                  └─────┬────────┘
</span></span><span class="line"><span class="ln">13</span><span class="cl">              │                                 │
</span></span><span class="line"><span class="ln">14</span><span class="cl">              ▼                                 ▼
</span></span><span class="line"><span class="ln">15</span><span class="cl">        ┌──────────────┐                  ┌──────────────┐
</span></span><span class="line"><span class="ln">16</span><span class="cl">        │    MySQL     │                  │    MySQL     │
</span></span><span class="line"><span class="ln">17</span><span class="cl">        │  (Shard -80) │                  │  (Shard 80-) │
</span></span><span class="line"><span class="ln">18</span><span class="cl">        └──────────────┘                  └──────────────┘
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl">   Topology Service (etcd / Consul / ZooKeeper)
</span></span><span class="line"><span class="ln">21</span><span class="cl">   ↑↓ 所有 component 共享 metadata
</span></span><span class="line"><span class="ln">22</span><span class="cl">   VSchema：keyspace 結構、shard 範圍、Vindex 定義</span></span></code></pre></div><h3 id="vtgate--query-routing-layer">VTGate — query routing layer</h3>
<p>對 application 看起來像 MySQL（同樣 port、同樣 wire protocol、同樣 query 語法）、實際是 stateless proxy。每個 query VTGate：</p>
<ol>
<li>Parse SQL → 找出 routing key（從 WHERE column 拿）</li>
<li>查 VSchema → 計算 routing key 對應的 shard</li>
<li>把 query 送該 shard 的 VTTablet</li>
<li>等 response、aggregate（如果是 cross-shard query）、回 application</li>
</ol>
<p>Stateless 設計 → VTGate 可以隨意 scale、放 N 個前面接 LB。多數 production 部署 3-10 個 VTGate per region。</p>
<h3 id="vttablet--per-mysql-agent">VTTablet — per-MySQL agent</h3>
<p>每個 MySQL instance 旁邊都跑一個 VTTablet。VTTablet 責任：</p>
<ul>
<li>把 MySQL primary 標記、上報給 topology</li>
<li>接 VTGate 的 query、轉發給 local MySQL</li>
<li>跑 <em>connection pool</em>（VTGate 跟 VTTablet 之間少量連線、VTTablet 跟 local MySQL 共享 connection）</li>
<li>跑 <em>query plan cache</em> / <em>transactional consistency check</em></li>
<li>處理 <em>online schema change</em>（Vitess 內建 OSC）</li>
<li>跟 VTOrc（fork of Orchestrator）配合做 failover</li>
</ul>
<p>VTTablet 是 Vitess 跟 MySQL 唯一連接點 — 沒 VTTablet 直接連 MySQL 不在 Vitess 管理下。</p>
<h3 id="vreplication--跨-shard-資料移動">VReplication — 跨 shard 資料移動</h3>
<p>VReplication 是 Vitess <em>跨 shard / 跨 keyspace / 跨 cluster</em> 資料移動引擎、底層用 MySQL binlog。用途：</p>
<ul>
<li><em>Resharding</em>：把 shard -80 拆成 -40 + 40-80、VReplication 自動拆 binlog event 對應 shard</li>
<li><em>Materialized view</em>：cross-shard aggregation 預計算</li>
<li><em>MoveTables</em>：跨 keyspace 移 table（schema-level migration）</li>
<li><em>VStream</em>：CDC、binlog event 對外輸出（可接 Kafka / Debezium）</li>
</ul>
<p>VReplication 的主要使用者是 <em>Vitess operator</em>，它和 application 行為直接相關（resharding 期間有 write split 行為）。</p>
<h3 id="vschema--sharding-metadata">VSchema — sharding metadata</h3>
<p>VSchema 是 keyspace 內 <em>哪張 table 怎麼 shard</em> 的定義、JSON 格式存 topology service。例子：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  <span class="nt">&#34;sharded&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  <span class="nt">&#34;vindexes&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    <span class="nt">&#34;hash&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">      <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="s2">&#34;hash&#34;</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  <span class="p">},</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  <span class="nt">&#34;tables&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="nt">&#34;orders&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">      <span class="nt">&#34;column_vindexes&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">          <span class="nt">&#34;column&#34;</span><span class="p">:</span> <span class="s2">&#34;user_id&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;hash&#34;</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">      <span class="p">]</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">    <span class="nt">&#34;users&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">      <span class="nt">&#34;column_vindexes&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl">        <span class="p">{</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">          <span class="nt">&#34;column&#34;</span><span class="p">:</span> <span class="s2">&#34;user_id&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">          <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;hash&#34;</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl">      <span class="p">]</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p><code>orders.user_id</code> 跟 <code>users.user_id</code> 用同一個 Vindex（hash）+ 同一個 column → 同 user_id 的 orders + users 落在同 shard、可以 JOIN 不跨 shard。</p>
<h2 id="vindexvitess-的-sharding-function">Vindex：Vitess 的 sharding function</h2>
<p>Vindex 是 Vitess 的 <em>shard key 計算函數</em>。內建多種：</p>
<table>
  <thead>
      <tr>
          <th>Vindex 類型</th>
          <th>計算方式</th>
          <th>適用</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>hash</code></td>
          <td>3DES-based null hash（非 MD5）→ 對應 shard range</td>
          <td>預設、均勻分布、適合 primary key</td>
      </tr>
      <tr>
          <td><code>binary_md5</code></td>
          <td>MD5(binary)</td>
          <td>binary key</td>
      </tr>
      <tr>
          <td><code>unicode_loose_xxhash</code></td>
          <td>xxHash on lowercased unicode</td>
          <td>string key</td>
      </tr>
      <tr>
          <td><code>numeric</code></td>
          <td>直接 numeric value</td>
          <td>連續 numeric range（適合 time-based）</td>
      </tr>
      <tr>
          <td><code>numeric_static_map</code></td>
          <td>預定義 map</td>
          <td>國家 code / region 等少 enum</td>
      </tr>
      <tr>
          <td><code>lookup_hash</code></td>
          <td>透過 lookup table 查 shard</td>
          <td>多個 column 都要 shard、需要二級 index</td>
      </tr>
  </tbody>
</table>
<p>最常用：<code>hash</code>（primary key）+ <code>lookup_hash</code>（secondary access pattern）。</p>
<h2 id="keyspace--shard--tablet-階層">Keyspace / Shard / Tablet 階層</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Keyspace (邏輯 database)
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   └── Shards
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">        ├── -80 (shard range 0-128)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">        │     ├── Primary tablet (1 MySQL primary)
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">        │     ├── Replica tablet × 2
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">        │     └── RDOnly tablet × 1 (analytics)
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">        └── 80- (shard range 128-256)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">              ├── Primary tablet
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">              ├── Replica tablet × 2
</span></span><span class="line"><span class="ln">10</span><span class="cl">              └── RDOnly tablet × 1</span></span></code></pre></div><p>Shard range 用 <em>binary hex prefix</em>（<code>-80</code> 表示 0 到 0x80、<code>80-</code> 表示 0x80 到 max）— 給 resharding 留 split 餘地（<code>-80</code> 可切成 <code>-40</code> + <code>40-80</code>）。</p>
<p>Tablet type：</p>
<ul>
<li><em>Primary</em>：寫入入口</li>
<li><em>Replica</em>：read traffic（Vitess query rules 控制）</li>
<li><em>RDOnly</em>：純 analytics / backup / VReplication source、低 SLA、不上 production read traffic</li>
</ul>
<h2 id="配置-step-by-steplocal-cluster">配置 step-by-step（local cluster）</h2>
<p>Production 通常用 Kubernetes operator（vitess-operator）部署、但理解概念用 local cluster 最快：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 用 vtctldclient 操作（替代舊的 vtctlclient）</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># 1. 建 unsharded keyspace</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">vtctldclient CreateKeyspace --durability-policy<span class="o">=</span>semi_sync commerce
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 2. 從一個 MySQL primary 開始（unsharded）</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">vtctldclient ApplySchema --sql<span class="o">=</span><span class="s2">&#34;CREATE TABLE orders (id INT PRIMARY KEY, user_id INT)&#34;</span> commerce
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 3. 把 keyspace 改成 sharded、定義 VSchema</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">vtctldclient ApplyVSchema --vschema<span class="o">=</span><span class="s1">&#39;{
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s1">  &#34;sharded&#34;: true,
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s1">  &#34;vindexes&#34;: {&#34;hash&#34;: {&#34;type&#34;: &#34;hash&#34;}},
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s1">  &#34;tables&#34;: {
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s1">    &#34;orders&#34;: {
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s1">      &#34;column_vindexes&#34;: [{&#34;column&#34;: &#34;user_id&#34;, &#34;name&#34;: &#34;hash&#34;}]
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s1">    }
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="s1">  }
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="s1">}&#39;</span> commerce
</span></span><span class="line"><span class="ln">19</span><span class="cl">
</span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="c1"># 4. 觸發 resharding：unsharded → 2 shards (-80, 80-)</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">vtctldclient Reshard --workflow<span class="o">=</span>initial-shard create <span class="se">\
</span></span></span><span class="line"><span class="ln">22</span><span class="cl"><span class="se"></span>  --source-shards<span class="o">=</span><span class="s2">&#34;commerce/0&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="se"></span>  --target-shards<span class="o">=</span><span class="s2">&#34;commerce/-80,commerce/80-&#34;</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1"># 5. 等資料 copy 完（VReplication 跑）</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">vtctldclient Workflow --keyspace<span class="o">=</span>commerce show initial-shard
</span></span><span class="line"><span class="ln">27</span><span class="cl">
</span></span><span class="line"><span class="ln">28</span><span class="cl"><span class="c1"># 6. SwitchTraffic：先切 RDOnly → 再切 Replica → 最後切 Primary</span>
</span></span><span class="line"><span class="ln">29</span><span class="cl">vtctldclient Reshard --workflow<span class="o">=</span>initial-shard switchtraffic <span class="se">\
</span></span></span><span class="line"><span class="ln">30</span><span class="cl"><span class="se"></span>  --tablet-types<span class="o">=</span><span class="s2">&#34;rdonly,replica&#34;</span>
</span></span><span class="line"><span class="ln">31</span><span class="cl">vtctldclient Reshard --workflow<span class="o">=</span>initial-shard switchtraffic <span class="se">\
</span></span></span><span class="line"><span class="ln">32</span><span class="cl"><span class="se"></span>  --tablet-types<span class="o">=</span><span class="s2">&#34;primary&#34;</span>
</span></span><span class="line"><span class="ln">33</span><span class="cl">
</span></span><span class="line"><span class="ln">34</span><span class="cl"><span class="c1"># 7. 完成、cleanup old shard</span>
</span></span><span class="line"><span class="ln">35</span><span class="cl">vtctldclient Reshard --workflow<span class="o">=</span>initial-shard complete</span></span></code></pre></div><p>實際 production 走 <em>Vitess Kubernetes operator</em>、用 <code>VitessCluster</code> CRD 宣告 desired state、operator 自動操作上面這些 step。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-cross-shard-transaction--vitess-不支援-atomic預設">1. Cross-shard transaction — Vitess 不支援 atomic（預設）</h3>
<p>兩個 user 的 order 在不同 shard、<code>BEGIN; UPDATE orders WHERE user_id=1; UPDATE orders WHERE user_id=2; COMMIT;</code> 跨兩個 shard。Vitess 預設 <em>不保證 atomic</em> — 兩個 shard 各自 commit、可能一個成功一個失敗、application 看到 partial state。</p>
<p>修法：</p>
<ul>
<li><em>避免 cross-shard transaction</em>：schema design 讓 transaction boundary 落在單一 shard 內</li>
<li>啟用 <em>atomic 2-phase commit</em>（Vitess <code>transaction_mode=TWOPC</code>、實驗性、performance penalty 大）</li>
<li>大規模需要 atomic 的場景應該換 distributed SQL（CockroachDB / Spanner），讓資料庫層承擔跨節點一致性</li>
</ul>
<h3 id="2-vstream-lag--resharding-期間-cdc-落後">2. VStream lag — Resharding 期間 CDC 落後</h3>
<p>Resharding 過程 VReplication 大量寫 binlog event、application <em>本來在用</em> 的 VStream（接 Kafka 等）共享同 binlog stream、可能 lag。Downstream consumer 看到 stale data 1-2 小時。</p>
<p>修法：</p>
<ul>
<li>Resharding 期間 <em>暫停非關鍵 VStream</em>（analytics ETL 可暫停、real-time recommendation 需要保留）</li>
<li>確認 binlog disk capacity &gt; resharding 期間預估 binlog 量 × 2（buffer）</li>
<li>Resharding 完成後 <em>手動驗證</em> VStream offset 已 catch up，把驗證結果留成 cutover evidence</li>
</ul>
<h3 id="3-vindex-不均勻--hot-shard">3. Vindex 不均勻 — Hot shard</h3>
<p>Vindex 預設 <code>hash</code> 對 <em>primary key 均勻分布</em>、但對 <em>natural key</em>（country / region / company_id 等）可能不均勻。10 個 country、其中 1 個 country 佔 80% traffic、單一 shard 永遠 hot。</p>
<p>修法：</p>
<ul>
<li><em>Composite Vindex</em>：combine <code>country + user_id</code> 兩 column 作為 shard key、user-level 仍均勻</li>
<li><em>Synthetic shard key</em>：application 層加 <code>sharding_key=hash(actual_key) % N</code>、控制分布</li>
<li>監控 <em>per-shard QPS</em>：<code>vtctldclient ShowVDiff</code> + Prometheus exporter</li>
<li>Hot shard 出現後 Vitess 可以 resharding 解（split hot shard 為 2 個小 shard）、但工作量大</li>
</ul>
<h3 id="4-resharding-切流量瞬間-deadlock">4. Resharding 切流量瞬間 deadlock</h3>
<p>Resharding 最後的 SwitchTraffic 切 primary 階段、舊 shard 仍接 write、Vitess 切 routing、Application 一瞬間連兩個 shard、相同 user_id 寫入可能跑兩邊、deadlock 或 lost update。</p>
<p>修法：</p>
<ul>
<li><em>SwitchTraffic 用 ReverseTraffic 預備</em>：先 switch、確認問題後可 reverse 回去</li>
<li>切流量 <em>只在 known quiet period</em>（夜間 / 週末早上）</li>
<li>VTGate <code>--retry-count=2</code> + <code>--track-vtgate-deadlock-events</code>：deadlock 自動 retry、不暴露給 application</li>
<li>真的失敗用 <code>Reshard cancel</code> 回 old state，讓 workflow 回到可驗證狀態</li>
</ul>
<h3 id="5-vreplication-workflow-卡住--cancel-前需要保護狀態">5. VReplication workflow 卡住 — cancel 前需要保護狀態</h3>
<p>VReplication workflow 跑到 50% 但 <em>某個 row 解析錯誤</em>（schema mismatch / blob 大小超過 limit）、workflow stuck、進度條卡住、無 timeout。整個 resharding flow halt。</p>
<p>修法：</p>
<ul>
<li>平時跑 <em>staging 資料 dry-run</em>、發現 schema 跟 blob 邊界問題</li>
<li>Workflow 卡住時 <code>vtctldclient Workflow show</code> 看 last_message / row_state</li>
<li>手動修問題 row（直接 MySQL 改）後 <em>resume workflow</em></li>
<li>大 cluster 建議 <em>VReplication 跑前先 SchemaApply audit</em>、確認 source / target schema 兼容</li>
</ul>
<h2 id="vitess-跟自管-sharding-對照">Vitess 跟自管 sharding 對照</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Vitess</th>
          <th>Application-level sharding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application 改動</td>
          <td>幾乎不必（保留 MySQL wire）</td>
          <td>大改（routing logic 寫 application）</td>
      </tr>
      <tr>
          <td>Cross-shard query</td>
          <td>VTGate 自動 split（受限）</td>
          <td>Application 自己處理</td>
      </tr>
      <tr>
          <td>Resharding</td>
          <td>VReplication 自動</td>
          <td>手寫腳本、操作複雜</td>
      </tr>
      <tr>
          <td>Online schema change</td>
          <td>Vitess 內建（VReplication-based）</td>
          <td>用 gh-ost / pt-osc</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>VTOrc 整合</td>
          <td>自管 Orchestrator</td>
      </tr>
      <tr>
          <td>Operational cost</td>
          <td>高（4 component 要懂）</td>
          <td>中（fewer abstractions、但 application logic 多）</td>
      </tr>
      <tr>
          <td>Cross-keyspace 共用 vindex</td>
          <td>內建（lookup_hash 跨 keyspace）</td>
          <td>自寫</td>
      </tr>
  </tbody>
</table>
<p>Vitess 的 <em>operational complexity</em> 是它的代價。10-20 人 SRE 團隊撐得住、5 人團隊用 <em>managed Vitess（PlanetScale）</em> 更實際。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>Vitess shard 內部仍用 MySQL replication（<a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>）— 每個 shard 有 primary + replica + rdonly。Vitess durability-policy 控制 primary 寫入是否等 replica ack（semi-sync）。</p>
<h3 id="跟-osc-tool">跟 OSC tool</h3>
<p>Vitess <em>不用 gh-ost / pt-osc</em>、用 VReplication-based online DDL。Vitess online DDL：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">vtctldclient ApplySchema --strategy<span class="o">=</span>vitess <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --sql<span class="o">=</span><span class="s2">&#34;ALTER TABLE orders ADD COLUMN status VARCHAR(20)&#34;</span> commerce</span></span></code></pre></div><p>詳見 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>。</p>
<h3 id="跟-proxysql">跟 ProxySQL</h3>
<p><em>Vitess 取代 ProxySQL</em>。VTGate 本身做 connection pool + query routing、不再需要 ProxySQL。混用會造成 routing 衝突（VTGate 期待自己決定 shard、ProxySQL 跟 VTGate 競爭）。詳見 <a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">ProxySQL 配置</a>。</p>
<h3 id="跟-orchestrator">跟 Orchestrator</h3>
<p>Vitess 用 <em>VTOrc</em>（fork of Orchestrator）作 failover、跟 Vitess topology metadata 整合。不用獨立 Orchestrator。詳見 <a href="/blog/backend/01-database/vendors/mysql/orchestrator-failover/" data-link-title="MySQL Orchestrator Failover：HA 工具自己怎麼 HA？raft cluster &#43; GTID-based promotion 的兩段 paradox" data-link-desc="Orchestrator 是 MySQL HA 自動 failover 的 de facto standard、但讀者第一個問題往往是「HA 工具自己會壞嗎」。本文走 Orchestrator 的雙層架構（管 MySQL 的 raft cluster &#43; 被 raft 管的 orchestrator instance）→ topology discovery → failure detection → failover decision tree → promote action → 5 production 踩雷（split-brain 跟 fencing / pre-failover hook 失敗 / anti-flapping window / GTID errant transaction / VIP 跟 ProxySQL 整合斷層）→ 跟 ProxySQL / Patroni / RDS 對比">Orchestrator failover 設計</a>。</p>
<h3 id="跟-planetscalemanaged-vitess">跟 PlanetScale（managed Vitess）</h3>
<p>PlanetScale 是 <em>Vitess managed service</em>、隱藏 4 component operational complexity、加 branch-based schema workflow。詳見 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">PlanetScale migration playbook</a>。</p>
<h3 id="跟-aurora-mysql">跟 Aurora MySQL</h3>
<p>Aurora 跟 Vitess 是 <em>不同 scale 路徑</em>：</p>
<ul>
<li>Aurora：single-region scaling（storage / compute 分離、最高 ~128 TB）</li>
<li>Vitess：horizontal sharding（無上限、靠加 shard scaling）</li>
</ul>
<p>兩者承擔的容量與操作責任不同。超過 Aurora single-region 上限的場景才考慮 Vitess。詳見 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a>。</p>
<h2 id="production-caseyoutube--vitess">Production case：YouTube / Vitess</h2>
<p>Vitess 的 production 責任是把 MySQL shard 拓撲變成應用可查詢、可遷移、可操作的資料庫層。YouTube / Vitess 的公開歷史提供的工程訊號是 VTGate、VTTablet、VReplication 與 VSchema 這組元件分工：application query 進 VTGate、tablet 層包住 MySQL、VSchema 描述 routing / sharding 規則、VReplication 支援 resharding 與資料搬移。</p>
<p>這個案例要回收到三個操作判準。第一，Vitess 是一套 database control plane，而非單一 proxy；導入時要把 topology service、tablet lifecycle、backup、failover 與 schema workflow 一起納入 ownership。第二，VSchema 是 application contract，shard key、lookup vindex 與 cross-shard query 都會影響產品功能設計。第三，VReplication 讓 resharding 可操作，但它仍需要 capacity window、backfill 監控與 cutover plan。</p>
<p>Vitess 的 sibling 路由是 <a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">PostgreSQL Citus Distributed</a> 與 <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>。Citus 保留 PostgreSQL 生態並用 coordinator / worker 拆分資料；CockroachDB / Spanner 則用 distributed SQL 重新定義交易與一致性邊界。選型時要先判斷自己是在延伸 MySQL 投資，還是在重新選 global OLTP model。</p>
<h2 id="何時用-vitess">何時用 Vitess</h2>
<table>
  <thead>
      <tr>
          <th>條件</th>
          <th>評估</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>流量 &gt; 50K WPS、單 primary 撐不住</td>
          <td>是 Vitess scope</td>
      </tr>
      <tr>
          <td>已有大量 MySQL 投資、不想換 distributed SQL</td>
          <td>是</td>
      </tr>
      <tr>
          <td>有 5-10 人 SRE / DBA 團隊</td>
          <td>是</td>
      </tr>
      <tr>
          <td>流量 &lt; 10K WPS</td>
          <td>否（過度設計、用單 MySQL + replica）</td>
      </tr>
      <tr>
          <td>5 人團隊、不想養 DBA</td>
          <td>否（用 PlanetScale managed）</td>
      </tr>
      <tr>
          <td>必須 multi-region 強一致 transaction</td>
          <td>否（CockroachDB / Spanner 才對）</td>
      </tr>
      <tr>
          <td>需要複雜 cross-shard analytics</td>
          <td>否（搭配 BigQuery / Snowflake）</td>
      </tr>
  </tbody>
</table>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（Vitess shard 內部）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a>（Vitess 不用 gh-ost / pt-osc）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">MySQL ProxySQL 配置</a>（Vitess 取代 ProxySQL）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/orchestrator-failover/" data-link-title="MySQL Orchestrator Failover：HA 工具自己怎麼 HA？raft cluster &#43; GTID-based promotion 的兩段 paradox" data-link-desc="Orchestrator 是 MySQL HA 自動 failover 的 de facto standard、但讀者第一個問題往往是「HA 工具自己會壞嗎」。本文走 Orchestrator 的雙層架構（管 MySQL 的 raft cluster &#43; 被 raft 管的 orchestrator instance）→ topology discovery → failure detection → failover decision tree → promote action → 5 production 踩雷（split-brain 跟 fencing / pre-failover hook 失敗 / anti-flapping window / GTID errant transaction / VIP 跟 ProxySQL 整合斷層）→ 跟 ProxySQL / Patroni / RDS 對比">MySQL Orchestrator failover</a>（VTOrc fork）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/citus-distributed/" data-link-title="PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster" data-link-desc="Citus 是 PG extension、把單機 PG 變成 *coordinator &#43; worker* sharded cluster、保留 PG SQL &#43; 加 distributed table &#43; reference table &#43; columnar storage。本文走 Citus 架構（coordinator / worker / distribution column）、3 種 table type（distributed / reference / local）、配置 step-by-step、5 production 踩雷（distribution column 選錯 / cross-shard transaction / reference table 過大 / colocate 不對齊 / worker failover）、跟 MySQL Vitess sharding sibling 對比">PostgreSQL Citus Distributed</a>（PG sibling、coordinator + worker 模型 vs Vitess VTGate + tablet）</li>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>（Vitess vs CockroachDB vs Spanner）</li>
<li><a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding</a>（shard key、routing、resharding 與 cross-shard query）</li>
<li>官方：<a href="https://vitess.io/docs/">Vitess Documentation</a> / <a href="https://github.com/planetscale/vitess-operator">Vitess Operator</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/modern-sql-features/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/modern-sql-features/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>8.0 modern SQL 特性&lt;/em> — 5 個關鍵能力 + 跟 PostgreSQL 對應特性的對比。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;p>「MySQL 是 SQL 簡單版」是個過時觀念。&lt;/p>
&lt;p>這個觀念的來源很合理：MySQL 5.x 時代沒 CTE、window function 要嗑 hack、recursive query 寫不出來、JSON 處理是字串 substring 拼接、複雜分析 query 只能丟去 PostgreSQL 或 Snowflake。整整 10 年 SQL 進階特性 MySQL 全缺、PostgreSQL 全有。&lt;/p>
&lt;p>MySQL 8.0（2018 推出）改變這件事。CTE / window function / lateral derived table / JSON_TABLE / hash join / atomic DDL / role-based authentication / common table expression 全部進來。&lt;strong>這不是「終於跟上 PG」、是 MySQL 第一次有資格進入 SQL 工程深度討論&lt;/strong>。但有 caveats：每個特性的 &lt;em>行為實現&lt;/em> 跟 PostgreSQL 對應特性都有 &lt;em>微妙差異&lt;/em>、不能假設 PG 經驗直接套用。&lt;/p>
&lt;p>對從 PostgreSQL 過來評估 MySQL 的讀者：本文是 &lt;em>特性對等驗證&lt;/em> — 哪些 8.0 特性真的可以 production 用、哪些是 marketing 但實作有 gap。對既有 MySQL 5.7 user：本文是 &lt;em>upgrade 5.7 → 8.0 的具體 ROI&lt;/em> — 從 SQL feature 角度看升級值不值得。&lt;/p>
&lt;h2 id="5-個關鍵特性--pg-對比">5 個關鍵特性 + PG 對比&lt;/h2>
&lt;h3 id="特性-1ctecommon-table-expression">特性 1：CTE（Common Table Expression）&lt;/h3>
&lt;p>MySQL 8.0 / PG 8.4+ 都支援。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- MySQL 8.0 + PG 都 OK
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">WITH&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">order_summary&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">SUM&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">amount&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AS&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">total&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2026-01-01&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">GROUP&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">total&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">users&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">JOIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">order_summary&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">u&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">total&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">1000&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>行為差異&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>8.0 modern SQL 特性</em> — 5 個關鍵能力 + 跟 PostgreSQL 對應特性的對比。</p></blockquote>
<hr>
<p>「MySQL 是 SQL 簡單版」是個過時觀念。</p>
<p>這個觀念的來源很合理：MySQL 5.x 時代沒 CTE、window function 要嗑 hack、recursive query 寫不出來、JSON 處理是字串 substring 拼接、複雜分析 query 只能丟去 PostgreSQL 或 Snowflake。整整 10 年 SQL 進階特性 MySQL 全缺、PostgreSQL 全有。</p>
<p>MySQL 8.0（2018 推出）改變這件事。CTE / window function / lateral derived table / JSON_TABLE / hash join / atomic DDL / role-based authentication / common table expression 全部進來。<strong>這不是「終於跟上 PG」、是 MySQL 第一次有資格進入 SQL 工程深度討論</strong>。但有 caveats：每個特性的 <em>行為實現</em> 跟 PostgreSQL 對應特性都有 <em>微妙差異</em>、不能假設 PG 經驗直接套用。</p>
<p>對從 PostgreSQL 過來評估 MySQL 的讀者：本文是 <em>特性對等驗證</em> — 哪些 8.0 特性真的可以 production 用、哪些是 marketing 但實作有 gap。對既有 MySQL 5.7 user：本文是 <em>upgrade 5.7 → 8.0 的具體 ROI</em> — 從 SQL feature 角度看升級值不值得。</p>
<h2 id="5-個關鍵特性--pg-對比">5 個關鍵特性 + PG 對比</h2>
<h3 id="特性-1ctecommon-table-expression">特性 1：CTE（Common Table Expression）</h3>
<p>MySQL 8.0 / PG 8.4+ 都支援。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- MySQL 8.0 + PG 都 OK
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">WITH</span><span class="w"> </span><span class="n">order_summary</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">total</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2026-01-01&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">user_id</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">os</span><span class="p">.</span><span class="n">total</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="n">u</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">order_summary</span><span class="w"> </span><span class="n">os</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os</span><span class="p">.</span><span class="n">user_id</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">os</span><span class="p">.</span><span class="n">total</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span></span></span></code></pre></div><p><strong>行為差異</strong>：</p>
<ul>
<li><strong>MySQL 8.0</strong>：CTE <em>不 materialize 為預設</em>、optimizer 把 CTE 視為 <em>inlined subquery</em>、CTE 引用兩次以上會 <em>重複計算</em></li>
<li><strong>PostgreSQL（&lt; 12）</strong>：CTE <em>fence by default</em>（materialize barrier）、optimizer 不 push predicate 進 CTE</li>
<li><strong>PostgreSQL（12+）</strong>：CTE 行為跟 MySQL 接近、有 <code>MATERIALIZED</code> / <code>NOT MATERIALIZED</code> keyword 明示</li>
</ul>
<p>對 PG 12+ user：可以套 MySQL 經驗。對 PG 11 以下 user：CTE 行為跟 MySQL 不一樣、要重看 query plan。</p>
<p><strong>Recursive CTE</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="k">RECURSIVE</span><span class="w"> </span><span class="n">org_chart</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">manager_id</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">depth</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">employees</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">manager_id</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="k">UNION</span><span class="w"> </span><span class="k">ALL</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">manager_id</span><span class="p">,</span><span class="w"> </span><span class="n">oc</span><span class="p">.</span><span class="n">depth</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">employees</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">org_chart</span><span class="w"> </span><span class="n">oc</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">e</span><span class="p">.</span><span class="n">manager_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oc</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">org_chart</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">depth</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>兩家都支援、但 MySQL 8.0 有 <em>深度上限</em>（<code>cte_max_recursion_depth=1000</code>、預設 1000、PG 預設 unlimited）。複雜 hierarchical query（深度 &gt; 1000）MySQL 需要顯式提高 limit。</p>
<h3 id="特性-2window-function">特性 2：Window Function</h3>
<p>MySQL 8.0 / PG 8.4+ 都支援、語法同 SQL standard。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">order_id</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">(</span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">created_at</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">running_total</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="n">RANK</span><span class="p">()</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">(</span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">amount</span><span class="w"> </span><span class="k">DESC</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rank_in_user</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span></span></span></code></pre></div><p><strong>行為差異</strong>：</p>
<ul>
<li><strong>執行 plan</strong>：MySQL 8.0 用 <em>window iterator</em>、單 partition 內 sort、外加 in-memory window buffer。PostgreSQL 有更成熟的 <em>WindowAgg node</em>、複雜 frame spec 處理更好</li>
<li><strong>Frame spec 支援度</strong>：兩家都支援 ROWS / RANGE / GROUPS、但 <em>GROUPS frame</em> MySQL 是 8.0.16+ 才補進、PG 11+ 才補</li>
<li><strong>大資料量 spill behavior</strong>：MySQL window function 超過 <code>sort_buffer_size</code>（預設 256K）會 spill 到 disk、Performance 雪崩。PG 用 <code>work_mem</code>（預設 4MB）、寬裕些但也會 spill</li>
</ul>
<p>對長期用 PG window function 寫複雜 reporting query 的 user：MySQL 8.0 可以做、但 <em>效能 tune</em> 工作量大、不是 drop-in。</p>
<h3 id="特性-3json_tablepg-主要賣點對比">特性 3：JSON_TABLE（PG 主要賣點對比）</h3>
<p>這是 user 點到的對比重點。</p>
<p><strong>MySQL 8.0 的 JSON_TABLE</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">t</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">.</span><span class="n">price</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="n">t</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">     </span><span class="n">JSON_TABLE</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">         </span><span class="n">t</span><span class="p">.</span><span class="n">metadata</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">         </span><span class="s1">&#39;$.variants[*]&#39;</span><span class="w"> </span><span class="n">COLUMNS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">             </span><span class="n">name</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="w"> </span><span class="n">PATH</span><span class="w"> </span><span class="s1">&#39;$.name&#39;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">             </span><span class="n">price</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="n">PATH</span><span class="w"> </span><span class="s1">&#39;$.price&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">         </span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">     </span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">j</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">t</span><span class="p">.</span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;shoes&#39;</span><span class="p">;</span></span></span></code></pre></div><p>JSON_TABLE 把 JSON document 內的 array element 展開成 <em>relational rows</em>、然後可以 JOIN / WHERE / GROUP BY。SQL:2016 standard 規範。</p>
<p><strong>PostgreSQL 對應</strong>：</p>
<p>PG 17+ 有 <code>JSON_TABLE</code>（SQL:2016 standard、跟 MySQL 同語法）、但歷史上 PG user 用兩條不同路線：</p>
<ol>
<li>
<p><strong>JSONB operator</strong>（PG 9.4+）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">metadata</span><span class="o">-&gt;</span><span class="s1">&#39;variants&#39;</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">variants</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">metadata</span><span class="w"> </span><span class="o">@&gt;</span><span class="w"> </span><span class="s1">&#39;{&#34;category&#34;: &#34;shoes&#34;}&#39;</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p><strong>jsonb_path_query</strong>（PG 12+）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">t</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">price</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="n">t</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">     </span><span class="n">jsonb_path_query</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*]&#39;</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">v</span><span class="p">;</span></span></span></code></pre></div></li>
</ol>
<p><strong>核心差異</strong>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL JSON_TABLE</th>
          <th>PG JSONB operator</th>
          <th>PG jsonb_path_query</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Index</td>
          <td>必須對 JSON column 建 <em>generated column + 一般 index</em>、不能直接 GIN index JSON path</td>
          <td><strong>GIN index 直接 over JSONB</strong>（業界唯一）</td>
          <td>可以走 GIN expression index</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>JSON column = LONGTEXT 包裝</td>
          <td>JSONB = binary、壓縮、index 友善</td>
          <td>同左</td>
      </tr>
      <tr>
          <td>Query 效率（複雜 path）</td>
          <td>中等（需要 generated column 加速）</td>
          <td>高（GIN index 直接）</td>
          <td>高</td>
      </tr>
      <tr>
          <td>SQL standard 對齊</td>
          <td>高（JSON_TABLE 是 standard）</td>
          <td>低（JSONB operator 是 PG 專有）</td>
          <td>中（jsonpath 是 standard）</td>
      </tr>
      <tr>
          <td>大 JSON（&gt; 1 MB）</td>
          <td>LONGTEXT 仍可、但 query 慢</td>
          <td>JSONB 壓縮 + 部分 read</td>
          <td>同左</td>
      </tr>
  </tbody>
</table>
<p><strong>選型結論</strong>：</p>
<ul>
<li><strong>MySQL 是 JSON-storage 角色</strong>（document 順手存進關聯 DB）：JSON_TABLE 夠用、配 generated column + index、production-ready</li>
<li><strong>MySQL 是 document-heavy workload</strong>（大量 JSON-driven query / 複雜 path / 高 selectivity）：PG JSONB GIN index 仍是 <em>clearly winner</em>、或直接用 MongoDB</li>
<li><strong>MySQL 8.0 JSON 不是 PG JSONB 替代</strong>：JSON_TABLE 是 <em>SQL standard 對齊</em>、好 portable、但 <em>index 跟 storage 仍弱</em></li>
</ul>
<p>對「JSON 是 PG 主要賣點」的判斷：JSONB binary storage + GIN index 是 PG 在 JSON workload 的 <em>結構性優勢</em>、MySQL 8.0 補了 SQL_TABLE 但 <em>index 那層沒補</em>。8.0 後 JSON 議題 <em>不是 deal-breaker for MySQL</em>（不像 5.7 時代直接 disqualify）、但仍不是 MySQL 主場。</p>
<h3 id="特性-4lateral-derived-table">特性 4：Lateral Derived Table</h3>
<p>MySQL 8.0.14+ / PG 9.3+ 都支援。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 對每個 user、找他最近 5 個 order
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">recent</span><span class="p">.</span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="n">u</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="k">LATERAL</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="n">order_id</span><span class="p">,</span><span class="w"> </span><span class="n">amount</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">    </span><span class="k">WHERE</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">u</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">    </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">recent</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="k">true</span><span class="p">;</span></span></span></code></pre></div><p>Lateral 讓 subquery 可以 <em>引用外部 reference column</em>（<code>u.id</code>）、不可能用 plain subquery 寫出來。</p>
<p><strong>行為差異</strong>：</p>
<ul>
<li>MySQL 8.0：lateral 後加、optimizer plan 仍在演進、複雜 lateral query 可能 plan 次優</li>
<li>PostgreSQL：lateral 早就成熟、plan 跟 join 直接 fuse、效率高</li>
</ul>
<p>對 PG-experienced 使用 lateral 寫 reporting query 的 user：MySQL 8.0 可以、但有時候要 hint optimizer 達到最佳 plan。</p>
<h3 id="特性-5hash-join">特性 5：Hash Join</h3>
<p>MySQL 8.0.18+ / PG 早已有。</p>
<p><strong>MySQL 8.0 之前</strong>：只有 <em>nested loop join</em>、大表 JOIN 完全失控（n × m row scan）。8.0.18 加 hash join、optimizer 在預估 row count 大時自動切。</p>
<p><strong>注意</strong>：MySQL 8.0 hash join 預設 <em>不對所有 join 開</em>、只在 <code>optimizer_switch='hash_join=on'</code> 且 join condition 是 <em>equality on indexed column</em> 時觸發。常見錯估：複雜 join 條件不觸發 hash join、optimizer fallback nested loop、query 永遠跑不完。</p>
<p><strong>PG 對應</strong>：PG 一直有 hash join、optimizer 預設 cover 廣、且有 <em>parallel hash join</em>（PG 11+）大表 JOIN 並行加速。</p>
<p>MySQL hash join 是 <em>補洞</em>、不是 <em>並肩特性</em>。複雜 OLAP query MySQL 仍弱於 PG。</p>
<h2 id="其他-80-特性一句話帶過">其他 8.0 特性（一句話帶過）</h2>
<ul>
<li><strong>Atomic DDL</strong>：CREATE TABLE / DROP / ALTER 變 transactional、crash recovery 不會留 orphan table（PG 早就 atomic）</li>
<li><strong>Role-based authentication</strong>：role 取代 group-level grant、user 可繼承 role（PG 早就 role 系統）</li>
<li><strong>CHECK constraint enforcement</strong>：5.7 可寫但不執行、8.0 真的 enforce（PG 一直執行）</li>
<li><strong>invisible index</strong>：建 index 但 optimizer 暫不用、適合 staging query plan 測試（PG 沒原生對應）</li>
<li><strong>Resource Group</strong>：query 跑時可分配 CPU thread 給特定 user group（PG 沒原生對應）</li>
<li><strong>Generated column</strong>：MySQL 5.7 已有、8.0 強化、可作為 JSON path 加速的 workaround</li>
</ul>
<h2 id="配置-step-by-step從-57--80-sql-feature-升級">配置 step-by-step（從 5.7 → 8.0 SQL feature 升級）</h2>
<p>如果已經是 8.0、所有特性都可以用、不必額外配置。如果是 5.7 → 8.0、需要：</p>
<ol>
<li><strong><code>character_set_server=utf8mb4</code></strong>：8.0 預設 utf8mb4（5.7 預設 latin1）、character set 不一致導致 query 行為微差</li>
<li><strong><code>default_authentication_plugin=mysql_native_password</code></strong>：8.0 預設 caching_sha2_password、舊 client 連不上、cluster upgrade 期間用 native_password 保兼容</li>
<li><strong><code>optimizer_switch='hash_join=on'</code></strong>：確認 hash join 啟用、預設應該已 ON</li>
<li><strong><code>cte_max_recursion_depth=10000</code></strong>：複雜 recursive CTE 需要時提高</li>
<li><strong>重新 review 所有 ORM-generated SQL</strong>：8.0 keywords 變多（WINDOW、RANK、LATERAL 等變成 reserved word）、5.7 識別碼可能變 syntax error</li>
</ol>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-cte-引用兩次--跑兩次">1. CTE 引用兩次 = 跑兩次</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">WITH</span><span class="w"> </span><span class="n">expensive</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="n">heavy</span><span class="w"> </span><span class="n">aggregation</span><span class="w"> </span><span class="p">...)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">expensive</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">UNION</span><span class="w"> </span><span class="k">ALL</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">expensive</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">other_condition</span><span class="p">;</span></span></span></code></pre></div><p>預期 CTE 跑一次、實際 MySQL 跑兩次。Query 時間 doubled。</p>
<p>修法：</p>
<ul>
<li>把 CTE 結果先 INSERT 進 <em>temporary table</em>、SELECT 兩次走 temp table（手動 materialize）</li>
<li>或 PG 用 <code>MATERIALIZED</code> keyword（MySQL 沒對應 hint、要手動 temp table）</li>
</ul>
<h3 id="2-window-function-大-partition-spill-到-disk">2. Window function 大 partition spill 到 disk</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">order_id</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">       </span><span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="n">OVER</span><span class="w"> </span><span class="p">(</span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">created_at</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 1 億 row</span></span></span></code></pre></div><p><code>sort_buffer_size=256K</code> 預設、單 partition &gt; 256K row 開始 spill disk、執行從秒級變分鐘級。</p>
<p>修法：</p>
<ul>
<li>提高 <code>sort_buffer_size</code>（per-connection、不要設太大、connection × buffer 會吃 RAM）</li>
<li>加 INDEX 包含 <code>user_id, created_at</code>、optimizer 可直接用 sorted index、不必額外 sort</li>
</ul>
<h3 id="3-json_table-跟-generated-column-取捨錯誤">3. JSON_TABLE 跟 generated column 取捨錯誤</h3>
<p>直接 JSON_TABLE on every query：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">products</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="n">JSON_TABLE</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;$.variants[*]&#39;</span><span class="w"> </span><span class="n">COLUMNS</span><span class="w"> </span><span class="p">(...));</span></span></span></code></pre></div><p>每次 query 跑 JSON parse、無 index 加速、大表 query 慢。</p>
<p>修法：</p>
<ul>
<li>
<p>對 <em>常 query 的 JSON path</em> 建 generated column：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">category</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">GENERATED</span><span class="w"> </span><span class="n">ALWAYS</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="n">JSON_UNQUOTE</span><span class="p">(</span><span class="n">metadata</span><span class="o">-&gt;</span><span class="s1">&#39;$.category&#39;</span><span class="p">))</span><span class="w"> </span><span class="n">STORED</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ADD</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_category</span><span class="w"> </span><span class="p">(</span><span class="n">category</span><span class="p">);</span></span></span></code></pre></div></li>
<li>
<p>JSON_TABLE 用於 <em>ad-hoc query</em>、不要當熱 path</p>
</li>
<li>
<p>跟 PG JSONB GIN 對比：PG 不必預先建 generated column、GIN index 直接 over JSONB</p>
</li>
</ul>
<h3 id="4-hash-join-沒觸發--optimizer-預估錯-row-count">4. Hash join 沒觸發 — Optimizer 預估錯 row count</h3>
<p>JOIN 大表預期 hash join、實際 MySQL 跑 nested loop、query 跑不完。常見原因：</p>
<ul>
<li>Table statistics 過時（沒跑 <code>ANALYZE TABLE</code>）</li>
<li>Join condition 不是 pure equality（<code>a.id = b.id + 1</code> 等）</li>
<li>一邊有 LIMIT、optimizer 估 small set、選 nested loop</li>
</ul>
<p>修法：</p>
<ul>
<li>跑 <code>ANALYZE TABLE</code> 更新 statistics</li>
<li>用 <code>EXPLAIN ANALYZE</code> 看實際 row count vs 估計</li>
<li>用 <code>optimizer_hint</code>（如 <code>/*+ HASH_JOIN(t1 t2) */</code>）強制</li>
</ul>
<h3 id="5-recursive-cte-深度上限--production-query-突然-fail">5. Recursive CTE 深度上限 — Production query 突然 fail</h3>
<p><code>cte_max_recursion_depth=1000</code> 預設、organization hierarchy / tree query 超過 1000 層直接 fail（<code>ER_CTE_MAX_RECURSION_DEPTH_EXCEEDED</code>）。</p>
<p>修法：</p>
<ul>
<li>評估真實 hierarchy 深度、設 <code>cte_max_recursion_depth=10000</code> 或更高</li>
<li>或 query 加 <code>WHERE depth &lt; N</code> 提前停（不依賴 implicit limit）</li>
<li>對極大 hierarchy（社群 follow graph 等）改用 <em>graph DB</em>（Neo4j）— MySQL recursive CTE 不是 graph workload 主場</li>
</ul>
<h2 id="mysql-80-vs-pg-sql-特性-cross-reference">MySQL 8.0 vs PG SQL 特性 cross-reference</h2>
<table>
  <thead>
      <tr>
          <th>特性</th>
          <th>MySQL 8.0</th>
          <th>PostgreSQL</th>
          <th>差異</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CTE</td>
          <td>8.0+</td>
          <td>8.4+</td>
          <td>PG 2009 即支援、MySQL 2018 才支援、約晚 9 年</td>
      </tr>
      <tr>
          <td>Recursive CTE</td>
          <td>8.0+（depth 限）</td>
          <td>8.4+（unlimited）</td>
          <td>PG 無深度上限</td>
      </tr>
      <tr>
          <td>Window function</td>
          <td>8.0+</td>
          <td>8.4+</td>
          <td>Frame spec 兩家略不同（GROUPS frame 推出時點）</td>
      </tr>
      <tr>
          <td>Lateral</td>
          <td>8.0.14+</td>
          <td>9.3+</td>
          <td>PG plan 較成熟</td>
      </tr>
      <tr>
          <td>JSON_TABLE</td>
          <td>8.0+</td>
          <td>17+</td>
          <td>MySQL 早 6 年（SQL:2016 standard）</td>
      </tr>
      <tr>
          <td>JSONB index</td>
          <td>無原生</td>
          <td>GIN index over JSONB</td>
          <td><strong>PG 結構優勢</strong></td>
      </tr>
      <tr>
          <td>Hash join</td>
          <td>8.0.18+</td>
          <td>早</td>
          <td>PG parallel hash join</td>
      </tr>
      <tr>
          <td>Atomic DDL</td>
          <td>8.0+</td>
          <td>早</td>
          <td>PG 一直 atomic</td>
      </tr>
      <tr>
          <td>Common keyword</td>
          <td>補齊</td>
          <td>完整</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Role-based auth</td>
          <td>8.0+</td>
          <td>早</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Materialized view</td>
          <td>無原生</td>
          <td>9.3+</td>
          <td><strong>PG 結構優勢</strong>（MySQL 用 trigger / scheduled refresh 模擬）</td>
      </tr>
      <tr>
          <td>Partial index</td>
          <td>無</td>
          <td>早</td>
          <td><strong>PG 結構優勢</strong></td>
      </tr>
      <tr>
          <td>Expression index</td>
          <td>8.0.13+</td>
          <td>早</td>
          <td>MySQL 後加</td>
      </tr>
      <tr>
          <td>Full-text search</td>
          <td>內建（InnoDB 5.6+）</td>
          <td>內建（tsvector）</td>
          <td>PG full-text 更成熟</td>
      </tr>
      <tr>
          <td>Foreign data wrapper</td>
          <td>無原生</td>
          <td>早（FDW）</td>
          <td><strong>PG 結構優勢</strong></td>
      </tr>
  </tbody>
</table>
<p>8.0 補了 <em>語法層</em> 大部分缺漏、<em>storage / index / extensibility 層</em> 仍是 PG 結構優勢。對「先選 SQL 工程深度」的 org、PG 仍領先；對「先選 ecosystem / replication / sharding」的 org、MySQL 已不是 disqualifier。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-innodb-tuning">跟 InnoDB Tuning</h3>
<p>JSON column 在 InnoDB 是 LONGTEXT 包裝、大 JSON 進 off-page storage（<code>innodb_default_row_format=DYNAMIC</code> 才行、Antelope format 不支援）。Buffer pool 對 LONGTEXT 較不友善、大 JSON workload 可能要更大 buffer pool。詳見 <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>。</p>
<h3 id="跟-query-optimization">跟 Query Optimization</h3>
<p>8.0 新 hash join + lateral derived 讓 <em>EXPLAIN ANALYZE</em> 結果更複雜。優化複雜 query 需要熟 <em>新 plan node 類型</em>。詳見 <em>Query Optimization deep dive</em> 篇（待寫）。</p>
<h3 id="跟-online-schema-change">跟 Online Schema Change</h3>
<p>JSON column 跟 generated column 的 schema change 走 gh-ost / pt-osc 沒問題、但 JSON 大表 ALTER 速度比一般 column 慢（每 row 重 serialize）。詳見 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>。</p>
<h3 id="跟-replication">跟 Replication</h3>
<p>Window function / CTE / JSON_TABLE 的 query <em>結果</em> replicate（row-level binlog 紀錄結果）、不 replicate <em>query 本身</em>。所以 replica apply 不會重新跑 window function、效率 OK。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>。</p>
<h2 id="何時-sql-特性是-mysql-選型-driver">何時 SQL 特性是 MySQL 選型 driver</h2>
<ul>
<li><strong>想要 SQL standard 對齊跨 vendor portable</strong>：MySQL 8.0 JSON_TABLE / window 都對齊 standard、PG 部分能力（JSONB operator）是 PG-only、portability MySQL 略好</li>
<li><strong>JSON workload &lt; 20% query</strong>：MySQL 8.0 + generated column 夠用、不必為 JSON 換 PG</li>
<li><strong>JSON workload &gt; 50% query + 複雜 path / aggregation</strong>：PG JSONB GIN 仍 winner、考慮 PG 或 MongoDB</li>
<li><strong>需要 materialized view / FDW / partial index</strong>：PG 仍領先、不要因為 SQL feature parity 假設 MySQL 全 cover</li>
<li><strong>既有 MySQL 投資 + SQL 工程深度上升</strong>：升 8.0 + 訓練團隊用新特性、不是換 vendor</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>（JSON column 對 buffer pool 影響）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>（JSON column 大表 ALTER）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>（ROW-format binlog 對 window function）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/sql-features-baseline/" data-link-title="PostgreSQL SQL Features：PG 早就有的、MySQL 8.0 才補的、PG 仍領先的" data-link-desc="PG 在 SQL features 上長期領先 MySQL — CTE / window function / lateral / partial index / FTS / JSONB / GIN index / materialized view 在 PG 早 5-15 年。MySQL 8.0（2018）補多數但 *index / storage / extension* 層仍是 PG 結構優勢。本文整理 PG 早期就有的特性、MySQL 8.0 補的差異、PG 仍領先的、跟 MySQL modern-sql-features sibling 反向視角">PostgreSQL SQL Features Baseline</a>（PG 反向視角、哪些特性 PG 早 5-15 年、MySQL 8.0 補齊後 PG 仍領先）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/jsonb-deep-dive/" data-link-title="PostgreSQL JSONB Deep Dive：Binary Storage &#43; GIN Index 為什麼是結構性優勢" data-link-desc="PG JSONB（9.4&#43;）是 *binary 儲存的 JSON*、可直接 GIN index、是 PG 在 JSON workload 的結構性優勢、跟 MongoDB / MySQL 8.0 JSON_TABLE 比仍領先。本文走 JSON vs JSONB 差異、GIN index 機制（jsonb_ops vs jsonb_path_ops）、operator &#43; path query、partial JSONB indexing、5 production 踩雷（大 JSONB 跟 TOAST / nested update / index 選錯 op class / jsonb_path_query 跟 jsonb_path_exists 行為差 / partial index 條件搞錯）、何時用 JSONB vs 拆 column">PostgreSQL JSONB Deep Dive</a>（PG sibling、binary storage + GIN index 跟 MySQL JSON_TABLE 對比）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor page</a>（JSON / SQL feature 對比 source）</li>
<li><a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB vendor page</a>（document-heavy workload 替代）</li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/mysql-nutshell.html">MySQL 8.0 What&rsquo;s New</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/group-replication/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/group-replication/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>Group Replication + InnoDB Cluster&lt;/em> — synchronous multi-primary 的 transaction model + 部署模型。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;p>把「Group Replication multi-primary mode」當成「multi-primary 直接線性 scale write」是常見誤解。&lt;/p>
&lt;p>Single-primary 跟 multi-primary 共用同一套 GR 機制（GCE atomic broadcast + certification + applier）— 切換 mode 是 &lt;em>配置變更&lt;/em>。但 &lt;em>性能效果&lt;/em> 經常跟讀者預期不同：在 single-primary cluster 上加開 &lt;code>group_replication_single_primary_mode=OFF&lt;/code>、預期 &lt;em>3 個 instance 都可以接受 write&lt;/em> 帶來吞吐倍增、實際上每個寫入仍要全 cluster GCE broadcast + certification、寫吞吐沒爆增 / latency 飆高 / certification 衝突回退增加。&lt;/p>
&lt;p>這篇 deep article 把 GR 的 &lt;em>certification 流程&lt;/em> 講清楚 — 為什麼「multi-primary」聽起來像「線性 scale」、實際是「保 strong consistency 的 multi-entry」。然後展開 InnoDB Cluster（GR + MySQL Shell + MySQL Router）作為 production deployment 工具。&lt;/p>
&lt;h2 id="group-replication-的-transaction-model">Group Replication 的 transaction model&lt;/h2>
&lt;p>GR 用 &lt;em>Group Communication Engine (GCE)&lt;/em>（Paxos 變種）達成 &lt;em>atomic broadcast&lt;/em> — 任何 write transaction 必須先 broadcast 到所有 member、所有 member 確認 &lt;em>certification pass&lt;/em> 才 commit。&lt;/p>
&lt;p>每個 transaction 的 GR lifecycle：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">1. Client → Member A: BEGIN; UPDATE ...; COMMIT;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">2. Member A: 先 local execute、收集 write_set（被改的 row + PK + transaction GTID）
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">3. Member A: write_set + binlog event → GCE broadcast to all members
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">4. GCE: Paxos consensus、所有 member 收到 broadcast、按 *相同順序*
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">5. Each Member: certification phase — 看 write_set 跟 *尚未 apply 的 incoming transactions* 是否有 PK 衝突
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">6. 若無衝突 → apply 該 transaction（local + remote member 都 apply）、回 client COMMIT OK
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">7. 若衝突 → certification fail、Member A 對 client 回 ERR_LOCK_DEADLOCK / GR_CONFLICT、application 必須 retry&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>核心結論&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>Group Replication + InnoDB Cluster</em> — synchronous multi-primary 的 transaction model + 部署模型。</p></blockquote>
<hr>
<p>把「Group Replication multi-primary mode」當成「multi-primary 直接線性 scale write」是常見誤解。</p>
<p>Single-primary 跟 multi-primary 共用同一套 GR 機制（GCE atomic broadcast + certification + applier）— 切換 mode 是 <em>配置變更</em>。但 <em>性能效果</em> 經常跟讀者預期不同：在 single-primary cluster 上加開 <code>group_replication_single_primary_mode=OFF</code>、預期 <em>3 個 instance 都可以接受 write</em> 帶來吞吐倍增、實際上每個寫入仍要全 cluster GCE broadcast + certification、寫吞吐沒爆增 / latency 飆高 / certification 衝突回退增加。</p>
<p>這篇 deep article 把 GR 的 <em>certification 流程</em> 講清楚 — 為什麼「multi-primary」聽起來像「線性 scale」、實際是「保 strong consistency 的 multi-entry」。然後展開 InnoDB Cluster（GR + MySQL Shell + MySQL Router）作為 production deployment 工具。</p>
<h2 id="group-replication-的-transaction-model">Group Replication 的 transaction model</h2>
<p>GR 用 <em>Group Communication Engine (GCE)</em>（Paxos 變種）達成 <em>atomic broadcast</em> — 任何 write transaction 必須先 broadcast 到所有 member、所有 member 確認 <em>certification pass</em> 才 commit。</p>
<p>每個 transaction 的 GR lifecycle：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. Client → Member A: BEGIN; UPDATE ...; COMMIT;
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. Member A: 先 local execute、收集 write_set（被改的 row + PK + transaction GTID）
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Member A: write_set + binlog event → GCE broadcast to all members
</span></span><span class="line"><span class="ln">4</span><span class="cl">4. GCE: Paxos consensus、所有 member 收到 broadcast、按 *相同順序*
</span></span><span class="line"><span class="ln">5</span><span class="cl">5. Each Member: certification phase — 看 write_set 跟 *尚未 apply 的 incoming transactions* 是否有 PK 衝突
</span></span><span class="line"><span class="ln">6</span><span class="cl">6. 若無衝突 → apply 該 transaction（local + remote member 都 apply）、回 client COMMIT OK
</span></span><span class="line"><span class="ln">7</span><span class="cl">7. 若衝突 → certification fail、Member A 對 client 回 ERR_LOCK_DEADLOCK / GR_CONFLICT、application 必須 retry</span></span></code></pre></div><p><strong>核心結論</strong>：</p>
<ul>
<li><em>Single-primary mode</em>：只有指定 member 接受 write、其他 member 純 apply、certification 仍跑（但衝突極少、因只有一個寫入源）</li>
<li><em>Multi-primary mode</em>：所有 member 都接受 write、certification 衝突常見、application 必須處理 conflict retry</li>
</ul>
<p><strong>「multi-primary 不會線性 scale write」的原因</strong>：</p>
<ul>
<li>每個 write 仍要全 cluster GCE broadcast + certification</li>
<li>寫吞吐 ceiling 受 <em>最慢 member + 網路延遲</em> 限制（不是「N members × M throughput」）</li>
<li>多寫入源增加 certification 衝突機率、衝突 retry 反而拖 throughput</li>
</ul>
<p><strong>「multi-primary 真實價值」</strong>：</p>
<ul>
<li><em>跨 region multi-active deploy</em>（每個 region local member 接受 local write、無 cross-region write latency）— 但需求極少、多數場景 single-primary + Aurora DSQL / Spanner 更實際</li>
<li><em>零停機 maintenance</em>（任一 member 下線、其他繼續接 write、不必 failover）— 但 single-primary mode 也提供同等 HA</li>
</ul>
<p>對 99% production case：<strong>single-primary mode</strong> 才是正確選擇。Multi-primary 是 <em>特殊 use case 工具</em>、不是 <em>預設 mode</em>。</p>
<h2 id="group-communication-enginegce">Group Communication Engine（GCE）</h2>
<p>GR 內建 GCE、基於 <em>XCom</em> protocol（Paxos 變種）。GCE 責任：</p>
<ul>
<li>Atomic broadcast：保證 message 到所有 member、按相同順序</li>
<li>Group membership：偵測 member join / leave / fail、reconfigure consensus</li>
<li>Network partition handling：minority partition 自動 fence（read-only）、majority 繼續服務</li>
</ul>
<p><strong>GCE 跟 Raft 對比</strong>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>GR XCom (Paxos-like)</th>
          <th>Raft</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Leader</td>
          <td>沒固定 leader、每個 message 選一個 sender</td>
          <td>固定 leader、其他 follower</td>
      </tr>
      <tr>
          <td>配置複雜度</td>
          <td>高（cluster member 列表 + IP allowlist）</td>
          <td>中（更易理解）</td>
      </tr>
      <tr>
          <td>Member 數量</td>
          <td>預設 3 (max 9)</td>
          <td>預設 3-5</td>
      </tr>
      <tr>
          <td>Performance</td>
          <td>高吞吐、低延遲（不必每次選 leader）</td>
          <td>Leader bottleneck 偶有</td>
      </tr>
      <tr>
          <td>工程實作</td>
          <td>XCom 在 MySQL 內部、不暴露 API</td>
          <td>etcd / Consul / TiKV 等獨立工具</td>
      </tr>
  </tbody>
</table>
<p>GR 的設計取捨：<em>緊耦合 MySQL</em>（不必外部 DCS）、<em>Paxos-like consensus</em>（不像 Raft 那麼簡單但效率更高）。trade-off 是 <em>對 ops 的 transparency 較低</em> — XCom 內部行為對 DBA 是 black box。</p>
<h2 id="innodb-clustergr--mysql-shell--mysql-router">InnoDB Cluster：GR + MySQL Shell + MySQL Router</h2>
<p>純 GR 是 <em>底層 replication mechanism</em>、要組成 production deployment 需要：</p>
<ul>
<li><em>MySQL Shell</em> (<code>mysqlsh</code>)：CLI 工具、提供 <code>dba.createCluster()</code> / <code>cluster.addInstance()</code> 等 cluster 管理 API</li>
<li><em>MySQL Router</em>：connection routing layer、自動發現 cluster topology、寫入 routing 給 primary、讀取 routing replica</li>
<li><em>MySQL Group Replication plugin</em>：在每個 MySQL instance 啟用</li>
</ul>
<p><strong>InnoDB Cluster = GR + Shell + Router</strong>、是 Oracle 推薦的 production GR deployment 方式。</p>
<h3 id="起始部署3-member-single-primary-cluster">起始部署（3 member single-primary cluster）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Step 1: 在每個 instance 啟 GR plugin + 配 my.cnf</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="o">[</span>mysqld<span class="o">]</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="nv">server_id</span> <span class="o">=</span> <span class="m">1</span>                          <span class="c1"># 各 instance 不同</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="nv">gtid_mode</span> <span class="o">=</span> ON
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="nv">enforce_gtid_consistency</span> <span class="o">=</span> ON
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="nv">log_bin</span> <span class="o">=</span> mysql-bin
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="nv">binlog_format</span> <span class="o">=</span> ROW
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="nv">master_info_repository</span> <span class="o">=</span> TABLE
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="nv">relay_log_info_repository</span> <span class="o">=</span> TABLE
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="nv">transaction_write_set_extraction</span> <span class="o">=</span> XXHASH64
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="nv">plugin_load_add</span> <span class="o">=</span> <span class="s1">&#39;group_replication.so&#39;</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="nv">group_replication_group_name</span> <span class="o">=</span> <span class="s2">&#34;aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee&#34;</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="nv">group_replication_start_on_boot</span> <span class="o">=</span> OFF
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="nv">group_replication_local_address</span> <span class="o">=</span> <span class="s2">&#34;node1.example.com:33061&#34;</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="nv">group_replication_group_seeds</span> <span class="o">=</span> <span class="s2">&#34;node1:33061,node2:33061,node3:33061&#34;</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="nv">group_replication_bootstrap_group</span> <span class="o">=</span> OFF
</span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="nv">group_replication_single_primary_mode</span> <span class="o">=</span> ON       <span class="c1"># 99% 場景用 ON</span>
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="nv">group_replication_enforce_update_everywhere_checks</span> <span class="o">=</span> OFF
</span></span><span class="line"><span class="ln">20</span><span class="cl">
</span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="c1"># Step 2: 用 MySQL Shell 從第一個 member bootstrap cluster</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">mysqlsh --user<span class="o">=</span>root --host<span class="o">=</span>node1.example.com
</span></span><span class="line"><span class="ln">23</span><span class="cl">&gt; dba.configureInstance<span class="o">(</span><span class="s1">&#39;root@node1:3306&#39;</span><span class="o">)</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">&gt; var <span class="nv">cluster</span> <span class="o">=</span> dba.createCluster<span class="o">(</span><span class="s1">&#39;prodCluster&#39;</span><span class="o">)</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">&gt; cluster.addInstance<span class="o">(</span><span class="s1">&#39;root@node2:3306&#39;</span><span class="o">)</span>
</span></span><span class="line"><span class="ln">26</span><span class="cl">&gt; cluster.addInstance<span class="o">(</span><span class="s1">&#39;root@node3:3306&#39;</span><span class="o">)</span>
</span></span><span class="line"><span class="ln">27</span><span class="cl">&gt; cluster.status<span class="o">()</span>  <span class="c1"># 應該顯示 3 member、1 PRIMARY + 2 SECONDARY</span>
</span></span><span class="line"><span class="ln">28</span><span class="cl">
</span></span><span class="line"><span class="ln">29</span><span class="cl"><span class="c1"># Step 3: 部署 MySQL Router</span>
</span></span><span class="line"><span class="ln">30</span><span class="cl">mysqlrouter --bootstrap root@node1:3306 --directory /etc/mysql-router --user<span class="o">=</span>mysqlrouter
</span></span><span class="line"><span class="ln">31</span><span class="cl">systemctl start mysql-router
</span></span><span class="line"><span class="ln">32</span><span class="cl">
</span></span><span class="line"><span class="ln">33</span><span class="cl"><span class="c1"># 完成 — application 連 mysql-router:6446 (R/W) 或 :6447 (R/O)</span></span></span></code></pre></div><p>Application 連 Router、Router 自動發現 cluster topology + 自動 failover routing。Application 不必知道哪個 instance 是 primary。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-certification-lag--multi-primary-模式-retry-storm">1. Certification lag — Multi-primary 模式 retry storm</h3>
<p>Multi-primary mode 下、3 個 instance 同時收到 <em>相同 row</em> 的 conflicting write、certification 階段必有 N-1 個 transaction 被退回。Application 看到 <code>ER_GR_CONFLICT_TRANSACTION_ABORTED</code>、retry、若不智能 retry（exponential backoff）會 retry storm、整個 cluster 寫吞吐暴降。</p>
<p>修法：</p>
<ul>
<li>99% 場景用 <em>single-primary mode</em>、避開 conflict</li>
<li>真的需要 multi-primary：application 必須 sharding-aware（不同 entry 寫不同 row range）、本質上跟 Vitess sharding 同概念但用 GR 機制</li>
<li>Application retry 用 <em>jitter exponential backoff</em>、不直接 retry</li>
</ul>
<h3 id="2-certification-queue-爆炸--single-primary-mode-仍受-cert-backlog-影響">2. Certification queue 爆炸 — Single-primary mode 仍受 cert backlog 影響</h3>
<p>Single-primary mode 下 primary 接受 write、broadcast 到 secondary。Secondary 跟 primary network latency / 處理速度差時、cert queue 累積。Cert queue 滿 → primary write 也被卡（GR 設計：所有 member 同步前不接受新 write、保 consistency）。</p>
<p>修法：</p>
<ul>
<li>監控 <code>group_replication_member_stats</code> view：<code>COUNT_TRANSACTIONS_IN_QUEUE</code> 持續 &gt; 0 是警訊</li>
<li>提高 <code>group_replication_message_cache_size</code>（預設 1 GB）給 large transaction 緩衝</li>
<li>確認 <em>所有 member 同 instance class</em>、不要混 spec</li>
<li>跨 region GR：完全不推薦（network latency 殺 cert throughput）</li>
</ul>
<h3 id="3-large-transaction--全-cluster-卡住">3. Large transaction — 全 cluster 卡住</h3>
<p>GR 必須把整個 transaction（含所有 write_set）一次 broadcast。10 GB transaction（大批量 UPDATE）必須一次塞滿 GCE buffer、cluster 內所有 member 都暫停接受新 transaction 直到 broadcast / apply 完成。常見場景：批次 archive / 大 backfill / <code>INSERT ... SELECT 1 億 row</code>。</p>
<p>修法：</p>
<ul>
<li><code>group_replication_transaction_size_limit</code>（預設 150 MB）超過直接 reject、不要設 unlimited</li>
<li>大批量寫入拆 chunk（每 chunk &lt; 100 MB）、用 application 層 loop</li>
<li>對 archive / backfill 用 <code>INSERT INTO archive SELECT ... LIMIT 10000</code> chunked、不是一個 transaction</li>
</ul>
<h3 id="4-network-partition--minority-partition-自動-read-only">4. Network partition — Minority partition 自動 read-only</h3>
<p>3 member cluster、network partition 把 1 個 member 隔離。被隔離 member 是 <em>minority</em>、自動進入 <em>read-only mode</em>（不接受 write）、防 split-brain。Application 連到 minority member 寫入會失敗。</p>
<p>修法：</p>
<ul>
<li>MySQL Router 自動發現 cluster topology、自動 route write 到 majority partition primary</li>
<li>Application 必須處理 connection error + retry（甚至 connection string 改成 <em>Router endpoint</em> 而非個別 instance）</li>
<li>監控 <code>group_replication_primary_member</code> UDF、確認哪個是真 primary</li>
</ul>
<h3 id="5-member-加入-catch-up--大量-binlog-阻擋-cluster-service">5. Member 加入 catch-up — 大量 binlog 阻擋 cluster service</h3>
<p>新 member 加入 cluster（new instance / 復原 failed member）必須 <em>catch-up</em> — apply 從 GR cluster start 到當前所有 binlog 才能 join consensus。如果 cluster 已運作 1 個月、binlog 累積 100 GB、catch-up 可能 6-12 小時、catch-up 期間 <em>該 member 不投票、其他 member 仍 service</em>、但 majority 安全邊界縮小（3 → 2 member working）。</p>
<p>修法：</p>
<ul>
<li>
<p>用 <em>MySQL Shell clone plugin</em> 直接 physical-snapshot 一個 existing member、跳過 binlog replay：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">&gt; cluster.addInstance<span class="o">(</span><span class="s1">&#39;root@node4:3306&#39;</span>, <span class="o">{</span>recoveryMethod: <span class="s1">&#39;clone&#39;</span><span class="o">})</span></span></span></code></pre></div></li>
<li>
<p>Clone 期間原 member 暫不接 write traffic（用 Router temporarily 排除）</p>
</li>
<li>
<p>規劃 maintenance window 加 member、不要在 peak load 期間</p>
</li>
</ul>
<h2 id="何時用-gr--innodb-cluster">何時用 GR / InnoDB Cluster</h2>
<table>
  <thead>
      <tr>
          <th>條件</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>需要 <em>zero-data-loss HA</em>（不容忍任何 binlog gap）</td>
          <td>GR single-primary</td>
      </tr>
      <tr>
          <td>需要 <em>自動 failover 而不必 Orchestrator + fence script</em></td>
          <td>GR / InnoDB Cluster</td>
      </tr>
      <tr>
          <td>需要 <em>跨 region multi-active</em>（且 conflict 可接受 / sharding-aware）</td>
          <td>GR multi-primary</td>
      </tr>
      <tr>
          <td>流量 &lt; 50K WPS、無嚴格 zero-loss 需求</td>
          <td>傳統 Orchestrator + Semi-sync 更簡單</td>
      </tr>
      <tr>
          <td>已用 Aurora / Cloud SQL 等 managed</td>
          <td>不用 GR、用 managed offering</td>
      </tr>
      <tr>
          <td>需要分散式 SQL（跨 region linearizable）</td>
          <td>Spanner / CockroachDB / Aurora DSQL（GR 不解決這個）</td>
      </tr>
  </tbody>
</table>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>GR 取代傳統 async / semi-sync replication、不是 <em>加在上面</em>。啟用 GR 後不要再配 <code>master-slave</code> style replication。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>。</p>
<h3 id="跟-orchestrator">跟 Orchestrator</h3>
<p>Orchestrator 跟 InnoDB Cluster 不該 <em>同時用</em> — 兩者都會 trigger failover、會打架。GR / InnoDB Cluster 內建 failover、不需要 Orchestrator。詳見 <a href="/blog/backend/01-database/vendors/mysql/orchestrator-failover/" data-link-title="MySQL Orchestrator Failover：HA 工具自己怎麼 HA？raft cluster &#43; GTID-based promotion 的兩段 paradox" data-link-desc="Orchestrator 是 MySQL HA 自動 failover 的 de facto standard、但讀者第一個問題往往是「HA 工具自己會壞嗎」。本文走 Orchestrator 的雙層架構（管 MySQL 的 raft cluster &#43; 被 raft 管的 orchestrator instance）→ topology discovery → failure detection → failover decision tree → promote action → 5 production 踩雷（split-brain 跟 fencing / pre-failover hook 失敗 / anti-flapping window / GTID errant transaction / VIP 跟 ProxySQL 整合斷層）→ 跟 ProxySQL / Patroni / RDS 對比">Orchestrator Failover</a>。</p>
<h3 id="跟-proxysql--mysql-router">跟 ProxySQL / MySQL Router</h3>
<p>ProxySQL 可以連 GR cluster（自動偵測 read_only flag）、但 <em>MySQL Router</em> 是 GR 原生的 routing layer、跟 InnoDB Cluster 緊耦合（透過 MySQL Shell metadata）。</p>
<p>選擇邏輯：</p>
<ul>
<li><em>純 MySQL stack, 想 Oracle-supported 整套</em> → MySQL Router</li>
<li><em>已用 ProxySQL（包含其他非 GR cluster）+ 統一 routing</em> → 仍用 ProxySQL</li>
</ul>
<p>詳見 <a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">ProxySQL 配置</a>。</p>
<h3 id="跟-innodb-tuning">跟 InnoDB Tuning</h3>
<p>GR 對 <code>innodb_flush_log_at_trx_commit</code> / <code>sync_binlog</code> 行為更敏感 — GR 要求 binlog 必須 <em>fsync to disk</em>（<code>sync_binlog=1</code>）保 zero-loss、不能用 <code>sync_binlog=0</code> 換速度。詳見 <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>。</p>
<h3 id="跟-postgresql-patroni-對比">跟 PostgreSQL Patroni 對比</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>InnoDB Cluster</th>
          <th>Patroni + PostgreSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Consensus</td>
          <td>GCE (Paxos-like) 內建</td>
          <td>依賴外部 DCS (etcd / Consul)</td>
      </tr>
      <tr>
          <td>Multi-primary</td>
          <td>支援（但少用）</td>
          <td>不支援（PG single-primary）</td>
      </tr>
      <tr>
          <td>HA tooling</td>
          <td>MySQL Shell + Router 整套</td>
          <td>Patroni + HAProxy + pgBouncer</td>
      </tr>
      <tr>
          <td>Setup 複雜度</td>
          <td>中（MySQL Shell 帶很多 abstraction）</td>
          <td>中（Patroni config + DCS）</td>
      </tr>
      <tr>
          <td>5-year production maturity</td>
          <td>Oracle-backed</td>
          <td>community-driven、廣用</td>
      </tr>
  </tbody>
</table>
<p>兩者角色相同、設計取捨不同。詳見 <a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PostgreSQL Patroni HA</a>。</p>
<h2 id="容量規劃要點">容量規劃要點</h2>
<table>
  <thead>
      <tr>
          <th>元件</th>
          <th>配置建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Member 數量</td>
          <td>3 (預設、容忍 1 failure)、5 (容忍 2 failure)</td>
      </tr>
      <tr>
          <td>Member 間 network latency</td>
          <td>&lt; 5ms（同 region 同 AZ 或跨 AZ）</td>
      </tr>
      <tr>
          <td>Network bandwidth</td>
          <td>至少 1 Gbps、broadcast traffic 重</td>
      </tr>
      <tr>
          <td>Transaction size limit</td>
          <td><code>group_replication_transaction_size_limit=150M</code></td>
      </tr>
      <tr>
          <td>Message cache</td>
          <td><code>group_replication_message_cache_size=1G</code>（預設）+ 看 lag 調</td>
      </tr>
      <tr>
          <td>MySQL Router instance</td>
          <td>至少 2 個（HA）、放 application 同 LB 後</td>
      </tr>
  </tbody>
</table>
<p>Member 跨 region：<em>不推薦</em>。GR 對 latency 敏感、跨 region 50-200ms RTT 嚴重影響 cert throughput。multi-region 需求用 Aurora Global Database / Spanner 等專為跨 region 設計的方案。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（GR 取代傳統 replication）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/orchestrator-failover/" data-link-title="MySQL Orchestrator Failover：HA 工具自己怎麼 HA？raft cluster &#43; GTID-based promotion 的兩段 paradox" data-link-desc="Orchestrator 是 MySQL HA 自動 failover 的 de facto standard、但讀者第一個問題往往是「HA 工具自己會壞嗎」。本文走 Orchestrator 的雙層架構（管 MySQL 的 raft cluster &#43; 被 raft 管的 orchestrator instance）→ topology discovery → failure detection → failover decision tree → promote action → 5 production 踩雷（split-brain 跟 fencing / pre-failover hook 失敗 / anti-flapping window / GTID errant transaction / VIP 跟 ProxySQL 整合斷層）→ 跟 ProxySQL / Patroni / RDS 對比">MySQL Orchestrator Failover</a>（GR / InnoDB Cluster 不必 Orchestrator）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">MySQL ProxySQL 配置</a>（routing layer 對比）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">MySQL InnoDB Tuning</a>（GR durability 需求）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/bdr-multi-master/" data-link-title="PostgreSQL BDR / Multi-Master：active-active 寫入的 3 種路徑跟 conflict 治理" data-link-desc="PG 預設是 single-primary、active-active 多寫入入口需要 *BDR (EDB)* / *pgEdge* / *Bucardo* 等 extension。本文走 3 種 multi-master 方案對比、conflict detection &#43; resolution model、async vs sync 取捨、配置 step-by-step（pgEdge 為主）、5 production 踩雷（last-write-wins data loss / sequence collision / DDL replication / conflict log 治理 / failover 後 timeline 分歧）、跟 MySQL Group Replication sibling 對比">PostgreSQL BDR / Multi-Master</a>（PG sibling、active-active 寫入 3 種路徑跟 conflict 治理）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/patroni-ha/" data-link-title="PostgreSQL Patroni HA：從 leader 失聯到 client 重連的 5 段 failover lifecycle" data-link-desc="Patroni 把 PostgreSQL HA 拆成 detection / election / promotion / reconfiguration / recovery 五段 lifecycle、每段都有獨立配置跟 failure mode；DCS quorum &#43; watchdog 防 split-brain、async/sync replication 取捨、5 個 production 踩雷、跟 PgBouncer / HAProxy / cert-manager 整合">PostgreSQL Patroni HA</a>（PG sibling、不同 consensus）</li>
<li><a href="/blog/backend/knowledge-cards/quorum/" data-link-title="Quorum" data-link-desc="分散式系統以多數節點同意作為提交或讀取有效性的門檻">quorum 卡片</a> / <a href="/blog/backend/knowledge-cards/quorum/" data-link-title="Quorum" data-link-desc="分散式系統以多數節點同意作為提交或讀取有效性的門檻">Paxos / Raft 對比</a></li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/group-replication.html">MySQL Group Replication</a> / <a href="https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-innodb-cluster.html">InnoDB Cluster</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/query-optimization/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/query-optimization/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>query optimization&lt;/em> — EXPLAIN / optimizer trace / hint 三層工具跟 5 個實際 case。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="5-個常見-production-case">5 個常見 production case&lt;/h2>
&lt;p>production 上 query 慢、root cause 幾乎都是 &lt;em>optimizer 選錯 plan&lt;/em>。從以下 5 個 case 進入 query optimization：&lt;/p>
&lt;h3 id="case-15-秒--50ms--join-順序選錯">Case 1：5 秒 → 50ms — JOIN 順序選錯&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="c1">-- 慢 (5 秒)：optimizer 選 customers 為 outer table、scan 全 1M row
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">amount&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">orders&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">JOIN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ON&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">customer_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">o&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">&amp;gt;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2026-05-01&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AND&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">c&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">region&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;TW&amp;#39;&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>EXPLAIN 顯示：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">+----+-------------+-------+------+---------------+--------+
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">| id | select_type | table | type | possible_keys | rows |
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">+----+-------------+-------+------+---------------+--------+
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">| 1 | SIMPLE | c | ALL | NULL | 1000000|
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">| 1 | SIMPLE | o | ref | idx_cust_id | 100 |
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl">+----+-------------+-------+------+---------------+--------+&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>c&lt;/code> table type=ALL（full scan）、rows=1M。問題：&lt;code>customers&lt;/code> 沒在 &lt;code>region&lt;/code> 上的 index、optimizer 預估「region=TW filter 沒效率、就 full scan」、但 region=TW 只佔 10% row（100K row）。&lt;/p>
&lt;p>修法：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">ALTER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">ADD&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">INDEX&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">idx_region&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">region&lt;/span>&lt;span class="p">);&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ANALYZE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">TABLE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">customers&lt;/span>&lt;span class="p">;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="c1">-- 更新 statistics&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>加 index 後 optimizer 切 plan：先 scan &lt;code>customers&lt;/code> 用 &lt;code>idx_region&lt;/code> 篩 100K row、再 join &lt;code>orders&lt;/code>。從 5 秒降到 50ms。&lt;/p>
&lt;h3 id="case-230-秒--200ms--range-scan-退化-all">Case 2：30 秒 → 200ms — Range scan 退化 ALL&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">events&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">created_at&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BETWEEN&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2026-05-01&amp;#39;&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">AND&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;2026-05-02&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">AND&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">user_id&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="mi">12345&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>events&lt;/code> 有 &lt;code>idx_user_id&lt;/code> 跟 &lt;code>idx_created_at&lt;/code> 兩個 index、optimizer 應該選一個 + 二級 filter、但實際 &lt;code>type=ALL&lt;/code>（full scan）。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>query optimization</em> — EXPLAIN / optimizer trace / hint 三層工具跟 5 個實際 case。</p></blockquote>
<hr>
<h2 id="5-個常見-production-case">5 個常見 production case</h2>
<p>production 上 query 慢、root cause 幾乎都是 <em>optimizer 選錯 plan</em>。從以下 5 個 case 進入 query optimization：</p>
<h3 id="case-15-秒--50ms--join-順序選錯">Case 1：5 秒 → 50ms — JOIN 順序選錯</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 慢 (5 秒)：optimizer 選 customers 為 outer table、scan 全 1M row
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">amount</span><span class="p">,</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">name</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">c</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="p">;</span></span></span></code></pre></div><p>EXPLAIN 顯示：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">+----+-------------+-------+------+---------------+--------+
</span></span><span class="line"><span class="ln">2</span><span class="cl">| id | select_type | table | type | possible_keys | rows   |
</span></span><span class="line"><span class="ln">3</span><span class="cl">+----+-------------+-------+------+---------------+--------+
</span></span><span class="line"><span class="ln">4</span><span class="cl">|  1 | SIMPLE      | c     | ALL  | NULL          | 1000000|
</span></span><span class="line"><span class="ln">5</span><span class="cl">|  1 | SIMPLE      | o     | ref  | idx_cust_id   | 100    |
</span></span><span class="line"><span class="ln">6</span><span class="cl">+----+-------------+-------+------+---------------+--------+</span></span></code></pre></div><p><code>c</code> table type=ALL（full scan）、rows=1M。問題：<code>customers</code> 沒在 <code>region</code> 上的 index、optimizer 預估「region=TW filter 沒效率、就 full scan」、但 region=TW 只佔 10% row（100K row）。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_region</span><span class="w"> </span><span class="p">(</span><span class="n">region</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">ANALYZE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">customers</span><span class="p">;</span><span class="w">  </span><span class="c1">-- 更新 statistics</span></span></span></code></pre></div><p>加 index 後 optimizer 切 plan：先 scan <code>customers</code> 用 <code>idx_region</code> 篩 100K row、再 join <code>orders</code>。從 5 秒降到 50ms。</p>
<h3 id="case-230-秒--200ms--range-scan-退化-all">Case 2：30 秒 → 200ms — Range scan 退化 ALL</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2026-05-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2026-05-02&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">AND</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">12345</span><span class="p">;</span></span></span></code></pre></div><p><code>events</code> 有 <code>idx_user_id</code> 跟 <code>idx_created_at</code> 兩個 index、optimizer 應該選一個 + 二級 filter、但實際 <code>type=ALL</code>（full scan）。</p>
<p>EXPLAIN ANALYZE 顯示：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">-&gt; Filter: ((events.user_id = 12345) and (events.created_at between ...))  (cost=2M rows=100)
</span></span><span class="line"><span class="ln">2</span><span class="cl">    -&gt; Table scan on events  (cost=2M rows=10000000)  (actual time=0.1..30s ...)</span></span></code></pre></div><p>問題：optimizer estimated rows=100、實際 <em>cardinality estimation</em> 失準（distribution skew）、選了 ALL。</p>
<p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 用 composite index 直接 cover 兩個條件
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_user_created</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">);</span></span></span></code></pre></div><p>Composite index 讓 optimizer 看到 <em>單一 index 直接 satisfy 兩個 predicate</em>、走 range scan + index condition pushdown。30 秒降到 200ms。</p>
<h3 id="case-38-秒--30ms--subquery-沒-unnest">Case 3：8 秒 → 30ms — Subquery 沒 unnest</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">customer_id</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">vip_level</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">3</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>5.6 之前 MySQL 把 <code>IN (subquery)</code> 寫成 <em>correlated subquery</em>、外表每 row 都 re-run subquery、極慢。5.6+ 加 subquery unnesting、轉換成 JOIN，但某些情況 unnest 失敗。</p>
<p>EXPLAIN 顯示：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">+----+--------------------+-----------+-------+
</span></span><span class="line"><span class="ln">2</span><span class="cl">| id | select_type        | table     | type  |
</span></span><span class="line"><span class="ln">3</span><span class="cl">+----+--------------------+-----------+-------+
</span></span><span class="line"><span class="ln">4</span><span class="cl">|  1 | PRIMARY            | orders    | ALL   |
</span></span><span class="line"><span class="ln">5</span><span class="cl">|  2 | DEPENDENT SUBQUERY | customers | unique_subquery |
</span></span><span class="line"><span class="ln">6</span><span class="cl">+----+--------------------+-----------+-------+</span></span></code></pre></div><p><code>DEPENDENT SUBQUERY</code> 是危險訊號。修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 手動改寫成 JOIN
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">c</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">vip_level</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">3</span><span class="p">;</span></span></span></code></pre></div><p>或用 <code>EXISTS</code>（部分 case 比 <code>IN</code> plan 好）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">c</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="k">WHERE</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;TW&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">vip_level</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="mi">3</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>不同寫法 plan 差異需用 EXPLAIN 驗證、不能假設「JOIN 一定比 IN 快」。</p>
<h3 id="case-42-秒--100ms--derived-table-沒-materialize">Case 4：2 秒 → 100ms — Derived table 沒 materialize</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="n">customer_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">order_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">customer_id</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">counts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">counts</span><span class="p">.</span><span class="n">customer_id</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">counts</span><span class="p">.</span><span class="n">order_count</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>5.6 之前 derived table（FROM subquery）每次 query 都 re-run、慢。5.7+ 有 <em>derived table materialization</em>、但 optimizer 有時不觸發。</p>
<p>EXPLAIN 顯示：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">+----+-------------+-------+------+
</span></span><span class="line"><span class="ln">2</span><span class="cl">| id | select_type | table | type |
</span></span><span class="line"><span class="ln">3</span><span class="cl">+----+-------------+-------+------+
</span></span><span class="line"><span class="ln">4</span><span class="cl">|  1 | PRIMARY     | o     | ALL  |
</span></span><span class="line"><span class="ln">5</span><span class="cl">|  2 | DERIVED     | orders| ALL  |  -- 沒 materialize、每次 join 都跑
</span></span><span class="line"><span class="ln">6</span><span class="cl">+----+-------------+-------+------+</span></span></code></pre></div><p>修法：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 顯式用 CTE + 改寫
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">WITH</span><span class="w"> </span><span class="n">counts</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="k">SELECT</span><span class="w"> </span><span class="n">customer_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">order_count</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">customer_id</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">JOIN</span><span class="w"> </span><span class="n">counts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">counts</span><span class="p">.</span><span class="n">customer_id</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">counts</span><span class="p">.</span><span class="n">order_count</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span></span></span></code></pre></div><p>但記得 MySQL CTE 也不 materialize 預設、可能要 <em>temporary table</em> 才強制 cache：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TEMPORARY</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">counts</span><span class="w"> </span><span class="k">AS</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">customer_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">order_count</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">customer_id</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">counts</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">customer_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">counts</span><span class="p">.</span><span class="n">customer_id</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">counts</span><span class="p">.</span><span class="n">order_count</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">DROP</span><span class="w"> </span><span class="k">TEMPORARY</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">counts</span><span class="p">;</span></span></span></code></pre></div><h3 id="case-510-秒--100ms--optimizer-選-index-不對">Case 5：10 秒 → 100ms — Optimizer 選 index 不對</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">30</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">active</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span></span></span></code></pre></div><p><code>users</code> 有 <code>idx_active</code> (selectivity 高) 跟 <code>idx_age</code> (selectivity 低)。Optimizer 選 <code>idx_age</code>、scan 60% rows、慢。</p>
<p>EXPLAIN：<code>key: idx_age</code> — 但 active=1 filter 後 row 量 &lt; 5%。</p>
<p>修法選一：</p>
<ol>
<li>
<p><strong>Index hint 強制</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="n">USE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="p">(</span><span class="n">idx_active</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">30</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">active</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p><strong>Composite index 取代</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_active_age</span><span class="w"> </span><span class="p">(</span><span class="n">active</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">DROP</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_age</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">users</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p><strong>Optimizer hint (8.0+)</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="cm">/*+ INDEX(users idx_active) */</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">users</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">30</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">active</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span></span></span></code></pre></div></li>
</ol>
<p>Composite index 是最持久解（不依賴 hint）。Index hint 是 quick fix、但對 future schema change 脆弱。</p>
<h2 id="explain-三層工具">EXPLAIN 三層工具</h2>
<h3 id="tool-1explain--query-plan-preview">Tool 1：EXPLAIN — query plan preview</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>輸出每個 step 的 <em>估計</em> cost / row count / key used。<strong>用於 quick check plan 形狀</strong>。</p>
<p>關鍵欄位：</p>
<ul>
<li><code>type</code>：access type（ALL &lt; index &lt; range &lt; ref &lt; eq_ref &lt; const）、ALL / index 是警訊</li>
<li><code>key</code>：實際選的 index、可能跟 <code>possible_keys</code> 不同</li>
<li><code>rows</code>：估計 scan row 數</li>
<li><code>Extra</code>：<code>Using filesort</code> / <code>Using temporary</code> / <code>Using index condition</code> 等行為標記</li>
</ul>
<h3 id="tool-2explain-analyze--實際執行統計">Tool 2：EXPLAIN ANALYZE — 實際執行統計</h3>
<p>8.0+ 加的。差別：實際 run query、回實際 row count / time、跟 estimate 對比。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="k">ANALYZE</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><p>輸出格式（tree format）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">-&gt; Nested loop inner join  (cost=2.4e6 rows=100000) (actual time=0.05..3.2 rows=10000 loops=1)
</span></span><span class="line"><span class="ln">2</span><span class="cl">    -&gt; Index range scan on orders using idx_created (cost=2.4e6 rows=10000) (actual time=0.04..3.0 rows=10000 loops=1)
</span></span><span class="line"><span class="ln">3</span><span class="cl">    -&gt; Single-row index lookup on customers using PRIMARY (cost=1 rows=1) (actual time=0.0001..0.0001 rows=1 loops=10000)</span></span></code></pre></div><p>關鍵：對比 <code>cost / rows</code>（estimate） vs <code>actual time / rows</code>。如果 estimate=100K / actual=10M、optimizer 嚴重低估、可能選錯 plan。</p>
<h3 id="tool-3optimizer-trace--看-optimizer-為何選這個-plan">Tool 3：Optimizer Trace — 看 optimizer 為何選這個 plan</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SET</span><span class="w"> </span><span class="n">optimizer_trace</span><span class="o">=</span><span class="s1">&#39;enabled=on&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">information_schema</span><span class="p">.</span><span class="n">optimizer_trace</span><span class="p">;</span></span></span></code></pre></div><p>輸出 JSON、列每個 step optimizer 考慮過的 plan + cost estimate + 為什麼選最終 plan。<strong>用於：optimizer 行為跟你預期不符時、debug 為什麼</strong>。</p>
<p>複雜 query 的 optimizer trace 可能 100+ KB、要熟讀 JSON 結構。production debug tool、不是常規 tool。</p>
<h2 id="optimizer-hint-vs-index-hint">Optimizer hint vs Index hint</h2>
<p>兩種 hint、語法不同、行為不同：</p>
<h3 id="index-hint5x-就有">Index hint（5.x 就有）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="n">USE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="p">(</span><span class="n">idx_name</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="k">FORCE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="p">(</span><span class="n">idx_name</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="k">IGNORE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="p">(</span><span class="n">idx_name</span><span class="p">)</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><ul>
<li><code>USE INDEX</code>：建議 optimizer 用這 index、但 optimizer 仍可拒絕</li>
<li><code>FORCE INDEX</code>：強制用、optimizer 不能拒絕</li>
<li><code>IGNORE INDEX</code>：禁止用</li>
</ul>
<p><strong>問題</strong>：</p>
<ul>
<li>對 table name 寫死、refactor / partition 時容易斷</li>
<li><code>FORCE</code> 太強、可能讓 optimizer 跑得比沒 hint 更慢（forced index 不是最佳 plan）</li>
</ul>
<h3 id="optimizer-hint80">Optimizer hint（8.0+）</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="cm">/*+ INDEX(table_name idx_name) */</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="cm">/*+ JOIN_ORDER(t1, t2, t3) */</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">t1</span><span class="p">,</span><span class="w"> </span><span class="n">t2</span><span class="p">,</span><span class="w"> </span><span class="n">t3</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="cm">/*+ HASH_JOIN(t1 t2) */</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">t1</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">t2</span><span class="w"> </span><span class="p">...;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="cm">/*+ NO_INDEX_MERGE(table) */</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">table</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="p">...;</span></span></span></code></pre></div><ul>
<li>更細粒度（join order / join method / index 選擇分開）</li>
<li>注入 query comment 內、不污染 SQL syntax</li>
<li>比 index hint 安全：optimizer 看 hint 但仍走 plan space search</li>
</ul>
<p><strong>推薦</strong>：</p>
<ul>
<li>8.0+ 用 optimizer hint</li>
<li>5.7 仍用 index hint、但謹慎 — 觀察 hint 加上去後 <em>實際 plan</em> 是否真的好</li>
</ul>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-statistics-過時--optimizer-估錯-row-count">1. Statistics 過時 — optimizer 估錯 row count</h3>
<p><code>information_schema.STATISTICS</code> 紀錄每個 index 的 cardinality。如果 <em>過 1 個月沒 ANALYZE</em>、statistics 跟實際資料 distribution 嚴重偏差、optimizer 估計錯。</p>
<p>修法：</p>
<ul>
<li>定期跑 <code>ANALYZE TABLE</code>（大表改 nightly cron）</li>
<li>8.0+ <code>innodb_stats_auto_recalc=ON</code> 預設、但變更超過 10% row 才觸發</li>
<li>設 <code>innodb_stats_persistent=ON</code>（預設、把 statistics 存 disk）+ <code>innodb_stats_persistent_sample_pages=20</code>（提高 sample 精度）</li>
</ul>
<h3 id="2-forced-index-用錯--hint-比沒-hint-還慢">2. Forced index 用錯 — Hint 比沒 hint 還慢</h3>
<p><code>FORCE INDEX (idx)</code> 強制 optimizer 用、但 <em>idx 不是最佳</em> 時、query 變慢。常見：開發 staging 試出 <code>FORCE INDEX</code> 有效、production 資料 distribution 不同、forced index 反而慢。</p>
<p>修法：</p>
<ul>
<li>用 <code>USE INDEX</code> 而不是 <code>FORCE INDEX</code>（optimizer 仍可換）</li>
<li>不依賴 hint、用 composite index / 重寫 query 達到目的</li>
<li>已用 hint 的 query 進 <em>staging review 機制</em>、確認 plan 仍合理</li>
</ul>
<h3 id="3-hash-join-沒觸發--equality-是-expression">3. Hash join 沒觸發 — Equality 是 expression</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">.</span><span class="n">parent_id</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span></span></span></code></pre></div><p><code>b.parent_id + 1</code> 是 expression、不是 raw column、optimizer 不選 hash join、用 nested loop。</p>
<p>修法：</p>
<ul>
<li>Schema 改：把 <code>parent_id + 1</code> 變成 <em>generated column</em></li>
<li>Query 改：JOIN 之前 <em>預計算 expression</em> 存 temp table</li>
<li>或 <code>/*+ HASH_JOIN(a b) */</code> 顯式（但 plan 仍可能拒絕）</li>
</ul>
<h3 id="4-range-scan-退化-all--cardinality-估計太低">4. Range scan 退化 ALL — Cardinality 估計太低</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="p">...,</span><span class="w"> </span><span class="mi">1000</span><span class="p">);</span></span></span></code></pre></div><p><code>IN</code> 1000 value、optimizer 預估「range scan 太多 lookup、不如 ALL」、選 full table scan。對 <em>中型表</em>（1M row）通常 IN 仍快、但 optimizer 估錯。</p>
<p>修法：</p>
<ul>
<li>
<p><code>IN</code> 拆成 <em>temp table JOIN</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TEMPORARY</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">in_values</span><span class="w"> </span><span class="p">(</span><span class="n">val</span><span class="w"> </span><span class="nb">INT</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">in_values</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="w"> </span><span class="p">(</span><span class="mi">2</span><span class="p">),</span><span class="w"> </span><span class="p">...,</span><span class="w"> </span><span class="p">(</span><span class="mi">1000</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">t</span><span class="p">.</span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">in_values</span><span class="w"> </span><span class="n">iv</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">t</span><span class="p">.</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">iv</span><span class="p">.</span><span class="n">val</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p>或 <code>optimizer_switch='index_merge=on'</code>（multi-value IN 可能走 index merge）</p>
</li>
<li>
<p>或大 <code>IN</code> 改 application 層拆批 query</p>
</li>
</ul>
<h3 id="5-derived-table-materialization-off--重複-scan">5. Derived table materialization off — 重複 scan</h3>
<p><code>optimizer_switch='derived_merge=on'</code>（預設 ON、derived table 自動 inline merge）某些 query 反而慢（merge 後 plan 變複雜）。或 <em>反向問題</em>：derived table <em>沒</em> materialize、每次都 re-run。</p>
<p>修法：</p>
<ul>
<li>看 EXPLAIN 是否有 <code>DERIVED</code> row、確認 materialization 行為</li>
<li>可 <code>optimizer_switch='derived_merge=off'</code> 強制 materialize（影響整個 connection、謹慎用）</li>
<li>大 derived table 改 explicit <em>temporary table</em> 完全控制</li>
</ul>
<h2 id="跟-postgresql-explain-對比">跟 PostgreSQL EXPLAIN 對比</h2>
<table>
  <thead>
      <tr>
          <th>工具</th>
          <th>MySQL</th>
          <th>PostgreSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Query plan preview</td>
          <td><code>EXPLAIN</code></td>
          <td><code>EXPLAIN</code></td>
      </tr>
      <tr>
          <td>實際執行統計</td>
          <td><code>EXPLAIN ANALYZE</code> (8.0+)</td>
          <td><code>EXPLAIN ANALYZE</code></td>
      </tr>
      <tr>
          <td>Optimizer 內部 trace</td>
          <td>optimizer_trace (JSON)</td>
          <td><code>auto_explain</code> extension</td>
      </tr>
      <tr>
          <td>Format</td>
          <td>TABLE / JSON / TREE</td>
          <td>TEXT / JSON / XML / YAML</td>
      </tr>
      <tr>
          <td>Parallel query plan</td>
          <td>受限（8.0 限 hash join）</td>
          <td>Full（PG 10+ parallel scan / aggregate / join）</td>
      </tr>
      <tr>
          <td>Index merge</td>
          <td>有</td>
          <td>有 (<code>bitmap index scan</code>)</td>
      </tr>
      <tr>
          <td>Genetic Query Optimizer</td>
          <td>無</td>
          <td>PG 有（適合 &gt; 12 table JOIN）</td>
      </tr>
      <tr>
          <td>Cost estimate accuracy</td>
          <td>中（histograms 8.0+）</td>
          <td>高（成熟 statistics）</td>
      </tr>
  </tbody>
</table>
<p>PG optimizer 整體更成熟、複雜 OLAP-style query plan 更穩定。MySQL 8.0 補了不少（histograms、hash join、derived table merge）、簡單 OLTP query 已 OK、複雜 query 仍弱。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-modern-sql-features">跟 Modern SQL Features</h3>
<p>CTE / window function / lateral / hash join 都改變 query plan space、optimizer 跟著要識別新 pattern。8.0 optimizer 對新 SQL feature plan 仍有改進空間。詳見 <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">Modern SQL Features</a>。</p>
<h3 id="跟-innodb-tuning">跟 InnoDB Tuning</h3>
<p>Query plan 受 <em>buffer pool hit rate</em> 影響 — optimizer 假設 random IO cost、實際資料在 buffer pool 內讀取快。Buffer pool 不夠時 plan estimate 失真。詳見 <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>。</p>
<h3 id="跟-proxysql">跟 ProxySQL</h3>
<p>ProxySQL query rule 不影響 optimizer plan、但可以 <em>rewrite query</em>（rule engine 的 <code>replace_pattern</code>）— 用於把 application 寫不好的 query 改成 optimizer-friendly 形式、application 不必改。詳見 <a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">ProxySQL 配置</a>。</p>
<h3 id="跟-lock-contention">跟 Lock Contention</h3>
<p>Slow query 持有 lock 久、其他 query wait、整個 cluster lock contention 爆。Query optimization 不只是 latency 問題、也是 <em>lock 影響範圍</em> 問題。詳見 <em>Lock Contention deep dive</em> 篇（待寫）。</p>
<h3 id="跟-partitioning">跟 Partitioning</h3>
<p>Partition pruning 是 optimizer 決定的、<code>EXPLAIN PARTITIONS</code> 看 partition 命中。partition + index 組合可能比 single big table + index 慢（cross-partition query overhead）。詳見 <em>Partitioning</em> 篇（待寫）。</p>
<h2 id="觀測-metric">觀測 metric</h2>
<p>Production 持續 monitor：</p>
<ul>
<li><code>Performance_schema.events_statements_summary_by_digest</code>：每個 query digest 的累計 time / row examined / row sent</li>
<li><code>slow_query_log</code>：slow query 進 log 檔（<code>long_query_time=1</code>）</li>
<li><code>sys.statements_with_full_table_scans</code>：列 query 用 full scan 的歷史</li>
<li><code>sys.schema_unused_indexes</code>：列從未用過的 index、可以 drop 省 write cost</li>
</ul>
<p>把這些丟進 Datadog / Percona Monitoring &amp; Management 做 trend analysis。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>（hash join / window / CTE 的 plan 議題）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">MySQL InnoDB Tuning</a>（buffer pool 對 plan estimate）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/proxysql-config/" data-link-title="MySQL ProxySQL 配置：connection / query / route / response 四段 lifecycle 跟 query rule 設計" data-link-desc="ProxySQL 是 MySQL 生態的 connection pool &#43; query routing 標準。本文走 connection → query parse → route → response 四段 lifecycle、query rule engine 的 rule chain 設計、Hostgroup / Server / User 三層 schema、配置 step-by-step（讀寫分離 &#43; replica lag-aware routing）、5 production 踩雷（query rule 順序錯亂 / connection 漂移 / write 路由到 replica / runtime / disk schema drift / mirror traffic 副作用）、跟 Replication / Orchestrator / HAProxy 整合">MySQL ProxySQL 配置</a>（query rewrite 整合）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change Tools</a>（add index 走 OSC）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/query-optimization/" data-link-title="PostgreSQL Query Optimization：EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具跟 4 個 case" data-link-desc="PG query 慢的根因常是 *planner 選錯 plan 或 statistics 過時*。本文從 4 個 production case 開場（seq scan vs index / hash vs nested loop / 多 column 統計缺 / parallel query 沒觸發）、走 EXPLAIN / EXPLAIN ANALYZE / auto_explain 三層工具、pg_hint_plan extension 跟 planner GUC 取捨、5 production 踩雷（ANALYZE 過時 / multi-column statistics / cost-base setting 不對齊硬體 / random_page_cost SSD 沒調 / parallel query 配置）、跟 MySQL query-optimization sibling 對比">PostgreSQL Query Optimization</a>（PG sibling、EXPLAIN ANALYZE / pg_hint_plan / auto_explain 三層工具）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/index-selection/" data-link-title="PostgreSQL Index Selection：B-tree / GIN / GiST / BRIN / Hash 對應 workload 的決策樹" data-link-desc="PG 有 6 種 index method（B-tree / Hash / GIN / GiST / SP-GiST / BRIN）跟 partial / expression / covering 三種變體、不是「都用 B-tree 就好」。每種 index 有自己的 query pattern、儲存代價、write amplification 跟 maintenance 成本。本文走 6 種 index 的適用 workload 對照、決策樹、partial / expression / covering / multi-column 變體、5 production 踩雷（過度 index / partial 條件不對 / B-tree 對 JSON 無效 / BRIN 對非 correlated 資料無效 / multi-column 順序錯）、跟 query-optimization 的 EXPLAIN 互補">PostgreSQL Index Selection</a>（B-tree / GIN / GiST / BRIN 決策樹 vs MySQL B-tree only）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor page</a>（EXPLAIN ANALYZE 對比）</li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/optimization.html">MySQL Optimization</a> / <a href="https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html">Optimizer Hints</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Partitioning：partition lifecycle 五段、跟 Vitess sharding 不同的「同 instance 內水平切割」</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/partitioning/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/partitioning/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>native partitioning&lt;/em> — 5 段 lifecycle + 4 種 type + 跟 Vitess sharding / PG partitioning 對比。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="partition-lifecycle-五段">Partition lifecycle 五段&lt;/h2>
&lt;p>MySQL native partitioning 是 &lt;em>同 instance 內把一個邏輯 table 拆成多個 physical sub-table&lt;/em>、optimizer 可選擇只 scan 相關 partition。整個 partition lifecycle 5 段：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Design 決定 partition key / type / 數量
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">Create CREATE TABLE ... PARTITION BY ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl">Query WHERE clause + partition pruning
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">Maintenance ADD / DROP / REORGANIZE / EXCHANGE
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">9&lt;/span>&lt;span class="cl">Drop 整個 partition 一次刪（比 DELETE FROM 快 1000x）&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>每段都有獨立工程決策。設計階段選錯 partition key、後續 maintenance + query 全部 broken。&lt;/p>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding&lt;/a> 對比：&lt;/p>
&lt;ul>
&lt;li>&lt;em>MySQL partitioning&lt;/em>：同 instance、optimizer 自動 pruning、無 cross-instance network cost&lt;/li>
&lt;li>&lt;em>Vitess sharding&lt;/em>：跨 instance、application 透過 VTGate routing、可線性 scale&lt;/li>
&lt;/ul>
&lt;p>兩者不衝突、可組合：Vitess shard 內部 &lt;em>再&lt;/em> 用 MySQL partition（例如：shard 切 16 個、每個 shard 的 table 再按月份 partition）。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>native partitioning</em> — 5 段 lifecycle + 4 種 type + 跟 Vitess sharding / PG partitioning 對比。</p></blockquote>
<hr>
<h2 id="partition-lifecycle-五段">Partition lifecycle 五段</h2>
<p>MySQL native partitioning 是 <em>同 instance 內把一個邏輯 table 拆成多個 physical sub-table</em>、optimizer 可選擇只 scan 相關 partition。整個 partition lifecycle 5 段：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Design       決定 partition key / type / 數量
</span></span><span class="line"><span class="ln">2</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">3</span><span class="cl">Create       CREATE TABLE ... PARTITION BY ...
</span></span><span class="line"><span class="ln">4</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">5</span><span class="cl">Query        WHERE clause + partition pruning
</span></span><span class="line"><span class="ln">6</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">7</span><span class="cl">Maintenance  ADD / DROP / REORGANIZE / EXCHANGE
</span></span><span class="line"><span class="ln">8</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">9</span><span class="cl">Drop         整個 partition 一次刪（比 DELETE FROM 快 1000x）</span></span></code></pre></div><p>每段都有獨立工程決策。設計階段選錯 partition key、後續 maintenance + query 全部 broken。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding</a> 對比：</p>
<ul>
<li><em>MySQL partitioning</em>：同 instance、optimizer 自動 pruning、無 cross-instance network cost</li>
<li><em>Vitess sharding</em>：跨 instance、application 透過 VTGate routing、可線性 scale</li>
</ul>
<p>兩者不衝突、可組合：Vitess shard 內部 <em>再</em> 用 MySQL partition（例如：shard 切 16 個、每個 shard 的 table 再按月份 partition）。</p>
<h2 id="4-種-partition-type">4 種 partition type</h2>
<h3 id="range-partitioning--連續區間切割">RANGE partitioning — 連續區間切割</h3>
<p>最常見、適合 time-series / 連續數字：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="n">AUTO_INCREMENT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="n">DATETIME</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">)</span><span class="w">              </span><span class="c1">-- PK 必須含 partition key
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="n">created_at</span><span class="p">))</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202601</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="s1">&#39;2026-02-01&#39;</span><span class="p">)),</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202602</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="s1">&#39;2026-03-01&#39;</span><span class="p">)),</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202603</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="s1">&#39;2026-04-01&#39;</span><span class="p">)),</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p_future</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="k">MAXVALUE</span><span class="w">  </span><span class="c1">-- 未來資料 fallback
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"></span><span class="p">);</span></span></span></code></pre></div><p>優點：</p>
<ul>
<li>Partition pruning 高效（時間 range query）</li>
<li>整個月 archive 直接 <code>ALTER TABLE orders DROP PARTITION p202601</code>、毫秒級</li>
</ul>
<p>缺點：</p>
<ul>
<li>必須 <em>預先建</em> 未來 partition（或用 <code>p_future</code> fallback、但 fallback partition 變大就失去 pruning 意義）</li>
<li><em>Hot partition</em> — 最新 partition 接收所有 INSERT、其他 partition 純歷史</li>
</ul>
<h3 id="list-partitioning--離散值切割">LIST partitioning — 離散值切割</h3>
<p>適合 enum-like value：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">users</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">region</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">region</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">LIST</span><span class="w"> </span><span class="n">COLUMNS</span><span class="w"> </span><span class="p">(</span><span class="n">region</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p_asia</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;TW&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;JP&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;KR&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;CN&#39;</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p_americas</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;US&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;CA&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;BR&#39;</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p_emea</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="s1">&#39;GB&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;DE&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;FR&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;IT&#39;</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>優點：對 enum-like value 直接命中、pruning 簡單。</p>
<p>缺點：value list 不能變更（不 supported <code>ALTER PARTITION ADD VALUE</code>）、新國家代碼必須 REORGANIZE。</p>
<h3 id="hash-partitioning--均勻分布">HASH partitioning — 均勻分布</h3>
<p>對 numeric / string column 取 hash、均勻分布：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">event_type</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">user_id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">HASH</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w"> </span><span class="n">PARTITIONS</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span></span></span></code></pre></div><p>優點：均勻分布、沒有 hot partition。</p>
<p>缺點：</p>
<ul>
<li><em>Range query 沒效</em> — <code>WHERE user_id BETWEEN 100 AND 200</code> 不能 pruning、scan 全部 partition</li>
<li>Partition 數量改變需要 REORGANIZE 整張表</li>
</ul>
<h3 id="key-partitioning--mysql-內部-hash">KEY partitioning — MySQL 內部 hash</h3>
<p>跟 HASH 類似、但用 MySQL 內部 hash function（不依賴 column 是否 integer）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">sessions</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">session_id</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">64</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="k">data</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">session_id</span><span class="p">,</span><span class="w"> </span><span class="n">user_id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w"> </span><span class="n">PARTITIONS</span><span class="w"> </span><span class="mi">16</span><span class="p">;</span></span></span></code></pre></div><p>用於 <em>string column 或 composite column</em> 的均勻分布。一般場景跟 HASH 效果接近。</p>
<h3 id="sub-partitioning--兩層切割">Sub-partitioning — 兩層切割</h3>
<p>RANGE + HASH 組合、深化分隔：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">big_events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="n">DATETIME</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">,</span><span class="w"> </span><span class="n">user_id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="n">created_at</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="n">SUBPARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">HASH</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span><span class="w"> </span><span class="n">SUBPARTITIONS</span><span class="w"> </span><span class="mi">4</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202601</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="s1">&#39;2026-02-01&#39;</span><span class="p">)),</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202602</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="s1">&#39;2026-03-01&#39;</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>每個 RANGE partition 又拆 4 個 HASH sub-partition、共 8 個 physical storage location。適合 <em>時間 range + user_id hash</em> 兩維度。</p>
<p>實務罕用、複雜性高、調 query plan 困難。多數 case 用 single-level partition 即可。</p>
<h2 id="partition-pruning--optimizer-怎麼選-partition">Partition Pruning — Optimizer 怎麼選 partition</h2>
<p><code>EXPLAIN PARTITIONS SELECT ...</code> 顯示 query 命中哪些 partition：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">EXPLAIN</span><span class="w"> </span><span class="n">PARTITIONS</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="k">BETWEEN</span><span class="w"> </span><span class="s1">&#39;2026-02-15&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="s1">&#39;2026-02-20&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="o">+</span><span class="c1">----+-------------+--------+------------+-------+
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="o">|</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">select_type</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="k">table</span><span class="w">  </span><span class="o">|</span><span class="w"> </span><span class="n">partitions</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="k">type</span><span class="w">  </span><span class="o">|</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="o">+</span><span class="c1">----+-------------+--------+------------+-------+
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="o">|</span><span class="w">  </span><span class="mi">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="k">SIMPLE</span><span class="w">      </span><span class="o">|</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">p202602</span><span class="w">    </span><span class="o">|</span><span class="w"> </span><span class="n">range</span><span class="w"> </span><span class="o">|</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="o">+</span><span class="c1">----+-------------+--------+------------+-------+</span></span></span></code></pre></div><p>只命中 <code>p202602</code>、其他 partition 不 scan。</p>
<p><strong>Pruning 失效場景</strong>：</p>
<ol>
<li>
<p><strong>Function on partition key</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="k">YEAR</span><span class="p">(</span><span class="n">created_at</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">2026</span><span class="w">  </span><span class="c1">-- 沒 pruning、scan 全部</span></span></span></code></pre></div><p>應該寫成：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="s1">&#39;2026-01-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2027-01-01&#39;</span></span></span></code></pre></div></li>
<li>
<p><strong>Implicit conversion</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;2026-02-15&#39;</span><span class="w">  </span><span class="c1">-- 字串 vs DATETIME、可能失效</span></span></span></code></pre></div><p>應該：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">TIMESTAMP</span><span class="w"> </span><span class="s1">&#39;2026-02-15 00:00:00&#39;</span></span></span></code></pre></div></li>
<li>
<p><strong>OR 跨 partition</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;2026-02-15&#39;</span><span class="w"> </span><span class="k">OR</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="w">  </span><span class="c1">-- partition + non-partition column OR、scan 全部</span></span></span></code></pre></div></li>
<li>
<p><strong>JOIN 不直接 filter partition key</strong>：JOIN 條件不含 partition key、optimizer 估計無法 pruning。</p>
</li>
</ol>
<h2 id="partition-maintenance--add--drop--reorganize--exchange">Partition Maintenance — ADD / DROP / REORGANIZE / EXCHANGE</h2>
<h3 id="add-partition">ADD partition</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202604</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>對 RANGE 簡單、但要 <em>排在 MAXVALUE partition 之前</em>（如果有 <code>p_future</code>、要先 REORGANIZE）。</p>
<h3 id="drop-partition">DROP partition</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">DROP</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202601</span><span class="p">;</span></span></span></code></pre></div><p>直接刪 partition file、毫秒級完成。是 <em>time-series archive 的最大優勢</em> — 對比 <code>DELETE FROM orders WHERE created_at &lt; '...'</code> 跑 hours。</p>
<h3 id="reorganize-partition">REORGANIZE partition</h3>
<p>切分 / 合併 partition：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 切：把 p_future 切成 p202604 + new p_future
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">REORGANIZE</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p_future</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202604</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="s1">&#39;2026-05-01&#39;</span><span class="p">)),</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p_future</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="k">MAXVALUE</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span></span></span></code></pre></div><p>REORGANIZE <em>rewrites partition data</em>、跟 OSC 一樣慢、大 partition 走 gh-ost / pt-osc 模擬（用 ghost table）。</p>
<h3 id="exchange-partition">EXCHANGE partition</h3>
<p>把 partition 跟 <em>獨立 table</em> swap（不複製資料）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 建一個 staging table 跟 partition 同 schema
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders_staging</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders_staging</span><span class="w"> </span><span class="n">REMOVE</span><span class="w"> </span><span class="n">PARTITIONING</span><span class="p">;</span><span class="w">  </span><span class="c1">-- staging 必須是 non-partitioned
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 把 archive partition 的資料 atomic swap 給 staging
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">EXCHANGE</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">p202601</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders_staging</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- 現在 orders_staging 有 p202601 的資料、orders 的 p202601 變空
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1">-- 可以 dump staging 到 S3、或 INSERT 進 archive DB</span></span></span></code></pre></div><p><code>EXCHANGE PARTITION</code> 是 <em>metadata operation</em>、毫秒級完成、不複製資料。Time-series archive 工作流的核心工具。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-pk-必須含-partition-key--schema-設計受限">1. PK 必須含 partition key — Schema 設計受限</h3>
<p>MySQL partition 規則：<strong>PK 必須包含所有 partition key column</strong>。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 錯：PK 沒包含 partition key
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="n">AUTO_INCREMENT</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">  </span><span class="c1">-- 只有 id
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="n">DATETIME</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="n">created_at</span><span class="p">))</span><span class="w"> </span><span class="p">(...);</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- ERROR 1503: A PRIMARY KEY must include all columns in the table&#39;s partitioning function</span></span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 對：PK 包含 partition key
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="n">AUTO_INCREMENT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="n">DATETIME</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">)</span><span class="w">  </span><span class="c1">-- 兩 column 都進 PK
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="p">)</span><span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">TO_DAYS</span><span class="p">(</span><span class="n">created_at</span><span class="p">))</span><span class="w"> </span><span class="p">(...);</span></span></span></code></pre></div><p>修法：</p>
<ul>
<li>接受 PK 是 composite（id + partition_key column）</li>
<li>AUTO_INCREMENT 仍 work、但 INSERT 必須給定 created_at</li>
<li><em>Unique constraint 也受影響</em> — 所有 UNIQUE index 必須含 partition key</li>
</ul>
<p>對 application：原本 <code>WHERE id = X</code> 仍 work、但慢（沒 partition pruning）、必須 <code>WHERE id = X AND created_at &gt;= ...</code> 才高效。</p>
<h3 id="2-global-index-沒原生支援">2. Global index 沒原生支援</h3>
<p>MySQL partitioning <em>沒 global secondary index</em>（PG 有）。每個 partition 各自有自己的 local index、跨 partition 的 unique constraint 必須 <em>包含 partition key</em>。</p>
<p>例：希望 <code>user_id</code> 全表 unique、但 partition by <code>created_at</code>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- MySQL 不允許這樣 — UNIQUE 必須含 created_at
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="n">AUTO_INCREMENT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="n">DATETIME</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w">    </span><span class="k">UNIQUE</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">created_at</span><span class="p">)</span><span class="w">  </span><span class="c1">-- 必須含 created_at、不是純 user_id
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"></span><span class="p">);</span></span></span></code></pre></div><p>對 application：跨 partition 的 unique 需要 <em>application 層處理</em>（INSERT 前 SELECT 檢查）或改用 Vitess <code>lookup_hash</code> Vindex。</p>
<h3 id="3-exchange-partition--schema-必須完全一致">3. EXCHANGE partition — schema 必須完全一致</h3>
<p>EXCHANGE 失敗常見：staging table 跟 partition 的 <em>index / column 順序差一個</em>、<code>ERROR 1736: Tables have different definitions</code>。</p>
<p>修法：</p>
<ul>
<li>建 staging 用 <code>CREATE TABLE staging LIKE orders</code> 而非手寫</li>
<li><code>REMOVE PARTITIONING</code> 後立即 verify schema</li>
<li>跑 OSC 改 schema 時、partition + staging table 同時改、不能漏一個</li>
</ul>
<h3 id="4-orphan-partition--future-partition-預先建忘記延展">4. Orphan partition — Future partition 預先建忘記延展</h3>
<p>部署 cron 每月建下個月 partition、cron 失敗 / pause、下個月 INSERT 無對應 partition、寫入 <code>p_future</code>。<code>p_future</code> 一年累積後變超大、partition pruning 沒效、查最近資料 scan 全表。</p>
<p>修法：</p>
<ul>
<li>監控 <code>p_future</code> partition size、超過 threshold alert</li>
<li>Cron 失敗 alert（不是 silent fail）</li>
<li>不依賴 cron、改成 <em>application 層在 INSERT 前 ensure partition exists</em>（lazy create）</li>
</ul>
<h3 id="5-cross-partition-query-慢">5. Cross-partition query 慢</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">user_id</span><span class="p">;</span></span></span></code></pre></div><p>沒 partition key filter、optimizer 不能 pruning、scan 全部 partition。比 <em>single big table without partition</em> 還慢（因為跨 partition aggregation overhead）。</p>
<p>修法：</p>
<ul>
<li>接受 partition 不是 <em>讀效能</em> 工具、是 <em>write + archive 效能</em> 工具</li>
<li>跨 partition aggregation 改 <em>materialized aggregation table</em>（trigger / scheduled job 維護）</li>
<li>跨 partition reporting 改丟 OLAP DB（BigQuery / Snowflake / ClickHouse）</li>
</ul>
<h2 id="跟-vitess-sharding-對比">跟 Vitess sharding 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL partitioning</th>
          <th>Vitess sharding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>切割範圍</td>
          <td>同 instance 內</td>
          <td>跨 instance（無上限）</td>
      </tr>
      <tr>
          <td>Cross-shard query</td>
          <td>不適用</td>
          <td>VTGate 自動 split + aggregate</td>
      </tr>
      <tr>
          <td>Resharding</td>
          <td>REORGANIZE（rewrite data）</td>
          <td>VReplication 自動</td>
      </tr>
      <tr>
          <td>Operational cost</td>
          <td>低（單 instance 內）</td>
          <td>高（4 component Vitess stack）</td>
      </tr>
      <tr>
          <td>可線性 scale write</td>
          <td>否（單 instance 寫吞吐限）</td>
          <td>是（加 shard）</td>
      </tr>
      <tr>
          <td>Archive 效率</td>
          <td>DROP PARTITION 毫秒級</td>
          <td>不是 archive 工具</td>
      </tr>
  </tbody>
</table>
<p>兩者不衝突、適用不同問題。Partitioning 解決 <em>單 instance archive + write 集中</em>、sharding 解決 <em>跨 instance scale</em>。</p>
<h2 id="跟-postgresql-declarative-partitioning-對比">跟 PostgreSQL declarative-partitioning 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL partitioning</th>
          <th>PostgreSQL declarative-partitioning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Partition type</td>
          <td>RANGE / LIST / HASH / KEY</td>
          <td>RANGE / LIST / HASH</td>
      </tr>
      <tr>
          <td>Sub-partitioning</td>
          <td>RANGE + HASH</td>
          <td>多層 nested 支援更廣</td>
      </tr>
      <tr>
          <td>Global index</td>
          <td>無</td>
          <td>PG 11+ 有</td>
      </tr>
      <tr>
          <td>Partition wise join</td>
          <td>受限</td>
          <td>PG 11+ 強</td>
      </tr>
      <tr>
          <td>Cross-partition unique</td>
          <td>必須含 partition key</td>
          <td>PG 11+ 同限制、但 PG 17+ 部分解除</td>
      </tr>
      <tr>
          <td>Partition attach</td>
          <td>EXCHANGE PARTITION</td>
          <td>ATTACH PARTITION</td>
      </tr>
      <tr>
          <td>操作工具</td>
          <td>gh-ost / pt-osc 對 partition</td>
          <td>pg_partman（成熟）</td>
      </tr>
      <tr>
          <td>Production maturity</td>
          <td>中（5.x 開始有、8.0 強化）</td>
          <td>高（11+ declarative 後成熟）</td>
      </tr>
  </tbody>
</table>
<p>PG partitioning 對 <em>跨 partition unique</em> 跟 <em>partition-wise join</em> 處理較好、是 reporting workload 的優勢。MySQL partitioning 對 <em>archive workflow</em>（DROP / EXCHANGE）較成熟。詳見 <a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">PostgreSQL Declarative Partitioning</a>。</p>
<h2 id="何時用-native-partitioning">何時用 native partitioning</h2>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Time-series workload + archive needs（log / event / order history）</td>
          <td>用 RANGE</td>
      </tr>
      <tr>
          <td>大表 &gt; 1 TB 且 query 多有 time filter</td>
          <td>用 RANGE 加速 prune</td>
      </tr>
      <tr>
          <td>跨 region / 跨業務切分</td>
          <td>用 LIST</td>
      </tr>
      <tr>
          <td>需要 <em>線性 scale write throughput</em></td>
          <td>不用 partition、用 Vitess sharding</td>
      </tr>
      <tr>
          <td>需要 <em>全表 unique constraint</em></td>
          <td>不用 partition、影響太大</td>
      </tr>
      <tr>
          <td>主要做 ad-hoc analytical query</td>
          <td>不用 partition、OLAP DB（ClickHouse / BigQuery）</td>
      </tr>
      <tr>
          <td>小表 &lt; 100 GB</td>
          <td>不必 partition、index 夠用</td>
      </tr>
  </tbody>
</table>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-online-schema-change">跟 Online Schema Change</h3>
<p>對 partitioned table 的 schema change（ALTER COLUMN）必須 <em>每個 partition 都改</em>。gh-ost / pt-osc 對 partitioned table 仍 work、但複雜性增加。詳見 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>。</p>
<h3 id="跟-vitess">跟 Vitess</h3>
<p>Vitess shard 內部可再 partition、單 shard 對應一個 MySQL instance、partition 是 instance 內優化。Vitess <code>vtctldclient PartitionTablet</code> 命令處理 shard-aware partition 操作。詳見 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding</a>。</p>
<h3 id="跟-innodb-tuning">跟 InnoDB Tuning</h3>
<p>每個 partition 是獨立 InnoDB tablespace（<code>innodb_file_per_table=ON</code> 預設）、buffer pool 內 cache 行為跟 single big table 不同。Partition 多時 buffer pool warm-up 時間更長。詳見 <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>。</p>
<h3 id="跟-replication">跟 Replication</h3>
<p>Partition operation（ADD / DROP / EXCHANGE）是 DDL、走 binlog、replica apply 時可能 <em>locking issue</em>（特別是 EXCHANGE 跟 replica running query 衝突）。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>。</p>
<h3 id="跟-query-optimization">跟 Query Optimization</h3>
<p><code>EXPLAIN PARTITIONS</code> 是 partition-aware query optimization 的關鍵工具、看 query 真的命中哪些 partition。詳見 <a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">Query Optimization</a>。</p>
<h2 id="容量規劃要點">容量規劃要點</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Partition 數量上限</td>
          <td>8.0 預設 8192、實務建議 &lt; 1000（管理成本上升）</td>
      </tr>
      <tr>
          <td>單 partition 大小</td>
          <td>10 GB - 100 GB（太小無 partition value、太大 prune 沒效）</td>
      </tr>
      <tr>
          <td>RANGE 時間 partition</td>
          <td>月 / 週 / 日（依資料量）</td>
      </tr>
      <tr>
          <td>HASH partition 數量</td>
          <td>通常 power of 2（8 / 16 / 32 / 64）</td>
      </tr>
      <tr>
          <td>Future partition pre-create</td>
          <td>至少 6 個月 buffer、cron 每月 add 1 個</td>
      </tr>
  </tbody>
</table>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess sharding</a>（跨 instance 切割對比）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change</a>（partition table 的 schema change）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">MySQL Query Optimization</a>（EXPLAIN PARTITIONS）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">MySQL InnoDB Tuning</a>（partition + buffer pool 互動）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/declarative-partitioning/" data-link-title="PostgreSQL declarative partitioning：partition 不是切表、是讓 planner pruning" data-link-desc="Declarative partitioning 的真實價值是 query planner pruning &#43; maintenance scope 縮小、不是「把大表切小」；RANGE / LIST / HASH 取捨、partition key 選法、5 個 production 踩雷（key 選錯不 prune / unique 不 enforce 跨 partition / ATTACH 鎖太久 / partition 數爆 / DETACH 不 reclaim 空間）、跟 autovacuum &#43; index 設計整合">PostgreSQL Declarative Partitioning</a>（PG sibling 對比）</li>
<li><a href="/blog/backend/knowledge-cards/partition/" data-link-title="Partition" data-link-desc="說明事件流如何切分成多個可並行處理的有序片段">Partition 卡片</a></li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/partitioning.html">MySQL Partitioning</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL PITR + Backup Strategy：備份不是「拷貝資料」、是 N 點任意 restore 的能力</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/pitr-backup/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/pitr-backup/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>backup + PITR&lt;/em> — 不是「拷貝資料」、是「N 點任意 restore 的能力」。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;p>「我們每天 mysqldump 一次、放 S3、沒問題吧」是個常見錯誤。問「能不能 restore 到 5 分鐘前」、答案會是 &lt;em>不能&lt;/em>。Dump-based backup 只能 restore 到 &lt;em>dump 那個瞬間&lt;/em>、5 分鐘前的事故無法 recover、必須等下次 dump。&lt;/p>
&lt;p>&lt;strong>真正的 backup strategy 是 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/point-in-time-recovery/" data-link-title="Point-in-Time Recovery" data-link-desc="說明如何用完整備份加上後續變更日誌，把資料庫還原到任意時間點">PITR（point-in-time recovery）&lt;/a>&lt;/strong>：&lt;/p>
&lt;ul>
&lt;li>&lt;em>能 restore 到任意過去時間點&lt;/em>（RPO 取決於 binlog flush 頻率、可接近 0）&lt;/li>
&lt;li>由 &lt;em>full backup 基線&lt;/em> + &lt;em>binlog 連續流&lt;/em>（從 backup 點到目標時間點的 incremental delta）組成&lt;/li>
&lt;li>Restore 過程：先 restore full backup → 再 apply binlog 到目標 timestamp 或 GTID&lt;/li>
&lt;/ul>
&lt;p>這篇 deep article 把 backup &lt;em>拆解成能力&lt;/em>、然後展開達到此能力需要的工具鏈跟工程紀律。&lt;/p>
&lt;h2 id="backup-三層責任">Backup 三層責任&lt;/h2>
&lt;p>PITR 的 &lt;em>能力&lt;/em> 由三層工程責任達成、任一層失效則 PITR 不成立：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">Layer 1: Full Backup（基線）
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl"> ↓ (mysqldump / XtraBackup / MyDumper / LVM snapshot / EBS snapshot)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">Layer 2: Binlog Stream（incremental）
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> ↓ (sync_binlog=1 + binlog 持續流到 backup storage)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">6&lt;/span>&lt;span class="cl"> ↓
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">7&lt;/span>&lt;span class="cl">Layer 3: Restore + Replay 流程
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">8&lt;/span>&lt;span class="cl"> (能 restore full + 能 apply binlog 到目標時間點)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>每層的 &lt;em>backup&lt;/em> 不夠 — 必須有 &lt;em>測試 restore 流程&lt;/em> 才算真的有 backup。「dump 在 S3」加「沒有 verified restore」= no backup。&lt;/p>
&lt;h2 id="tool-1mysqldump--邏輯備份最廣容最慢">Tool 1：mysqldump — 邏輯備份、最廣容、最慢&lt;/h2>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">mysqldump --single-transaction --master-data&lt;span class="o">=&lt;/span>&lt;span class="m">2&lt;/span> --gtid-purged&lt;span class="o">=&lt;/span>ON &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --triggers --routines --events &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="se">&lt;/span> --all-databases &amp;gt; full-backup.sql&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>輸出&lt;/strong>：SQL statement、純文字、可 grep / 編輯。&lt;/p>
&lt;p>&lt;strong>Trade-off&lt;/strong>：&lt;/p>
&lt;ul>
&lt;li>優點：跨 MySQL 版本（5.7 → 8.0 也讀）、跨 cloud / 跨 OS、可選 dump 部分 table&lt;/li>
&lt;li>缺點：&lt;em>極慢&lt;/em>（rebuild 整 DB 從 SQL execute）、大 DB（&amp;gt; 100 GB）不適用、restore 時長 hours+&lt;/li>
&lt;li>&lt;code>--single-transaction&lt;/code>：InnoDB only、用 REPEATABLE READ 拿 consistent snapshot、不 lock 表&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>適合&lt;/strong>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>backup + PITR</em> — 不是「拷貝資料」、是「N 點任意 restore 的能力」。</p></blockquote>
<hr>
<p>「我們每天 mysqldump 一次、放 S3、沒問題吧」是個常見錯誤。問「能不能 restore 到 5 分鐘前」、答案會是 <em>不能</em>。Dump-based backup 只能 restore 到 <em>dump 那個瞬間</em>、5 分鐘前的事故無法 recover、必須等下次 dump。</p>
<p><strong>真正的 backup strategy 是 <a href="/blog/backend/knowledge-cards/point-in-time-recovery/" data-link-title="Point-in-Time Recovery" data-link-desc="說明如何用完整備份加上後續變更日誌，把資料庫還原到任意時間點">PITR（point-in-time recovery）</a></strong>：</p>
<ul>
<li><em>能 restore 到任意過去時間點</em>（RPO 取決於 binlog flush 頻率、可接近 0）</li>
<li>由 <em>full backup 基線</em> + <em>binlog 連續流</em>（從 backup 點到目標時間點的 incremental delta）組成</li>
<li>Restore 過程：先 restore full backup → 再 apply binlog 到目標 timestamp 或 GTID</li>
</ul>
<p>這篇 deep article 把 backup <em>拆解成能力</em>、然後展開達到此能力需要的工具鏈跟工程紀律。</p>
<h2 id="backup-三層責任">Backup 三層責任</h2>
<p>PITR 的 <em>能力</em> 由三層工程責任達成、任一層失效則 PITR 不成立：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Layer 1: Full Backup（基線）
</span></span><span class="line"><span class="ln">2</span><span class="cl">   ↓     (mysqldump / XtraBackup / MyDumper / LVM snapshot / EBS snapshot)
</span></span><span class="line"><span class="ln">3</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">4</span><span class="cl">Layer 2: Binlog Stream（incremental）
</span></span><span class="line"><span class="ln">5</span><span class="cl">   ↓     (sync_binlog=1 + binlog 持續流到 backup storage)
</span></span><span class="line"><span class="ln">6</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">7</span><span class="cl">Layer 3: Restore + Replay 流程
</span></span><span class="line"><span class="ln">8</span><span class="cl">         (能 restore full + 能 apply binlog 到目標時間點)</span></span></code></pre></div><p>每層的 <em>backup</em> 不夠 — 必須有 <em>測試 restore 流程</em> 才算真的有 backup。「dump 在 S3」加「沒有 verified restore」= no backup。</p>
<h2 id="tool-1mysqldump--邏輯備份最廣容最慢">Tool 1：mysqldump — 邏輯備份、最廣容、最慢</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mysqldump --single-transaction --master-data<span class="o">=</span><span class="m">2</span> --gtid-purged<span class="o">=</span>ON <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --triggers --routines --events <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --all-databases &gt; full-backup.sql</span></span></code></pre></div><p><strong>輸出</strong>：SQL statement、純文字、可 grep / 編輯。</p>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>優點：跨 MySQL 版本（5.7 → 8.0 也讀）、跨 cloud / 跨 OS、可選 dump 部分 table</li>
<li>缺點：<em>極慢</em>（rebuild 整 DB 從 SQL execute）、大 DB（&gt; 100 GB）不適用、restore 時長 hours+</li>
<li><code>--single-transaction</code>：InnoDB only、用 REPEATABLE READ 拿 consistent snapshot、不 lock 表</li>
</ul>
<p><strong>適合</strong>：</p>
<ul>
<li>&lt; 100 GB DB</li>
<li>Schema dump（migration / 給 dev clone DB）</li>
<li>跨版本 migrate</li>
<li>配 binlog 做 PITR baseline</li>
</ul>
<p><strong>不適合</strong>：</p>
<ul>
<li>
<blockquote>
<p>500 GB DB（restore 跑 days）</p></blockquote>
</li>
<li>高吞吐 production（dump 跑時 hold MVCC read view、bloat）</li>
</ul>
<h2 id="tool-2percona-xtrabackup--物理備份快production-標準">Tool 2：Percona XtraBackup — 物理備份、快、production 標準</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">xtrabackup --backup --target-dir<span class="o">=</span>/backup/full-2026-05-19 <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --user<span class="o">=</span>backup --password<span class="o">=</span>... <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --slave-info --safe-slave-backup
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># Prepare（apply 內部 redo log、變成可 restore 狀態）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">xtrabackup --prepare --target-dir<span class="o">=</span>/backup/full-2026-05-19</span></span></code></pre></div><p><strong>輸出</strong>：InnoDB 資料檔案的 binary copy。</p>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>優點：<em>極快</em>（直接 copy file、無 SQL execute）、適合 TB-scale DB、restore 跑時間跟 copy file 同</li>
<li>缺點：MySQL 版本綁定（XtraBackup 8.0 不能 restore 5.7 backup）、有 storage engine 限制（只 InnoDB）</li>
<li><em>Incremental backup</em> 支援：基於 LSN（log sequence number）只 copy 變更 page</li>
</ul>
<p><strong>Incremental flow</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Day 1: Full backup</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">xtrabackup --backup --target-dir<span class="o">=</span>/backup/full-day1
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># Day 2: Incremental（only changes since day 1）</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">xtrabackup --backup --target-dir<span class="o">=</span>/backup/inc-day2 <span class="se">\
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="se"></span>  --incremental-basedir<span class="o">=</span>/backup/full-day1
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># Restore: Apply incremental on top of full</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">xtrabackup --prepare --apply-log-only --target-dir<span class="o">=</span>/backup/full-day1
</span></span><span class="line"><span class="ln">10</span><span class="cl">xtrabackup --prepare --apply-log-only --target-dir<span class="o">=</span>/backup/full-day1 <span class="se">\
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="se"></span>  --incremental-dir<span class="o">=</span>/backup/inc-day2
</span></span><span class="line"><span class="ln">12</span><span class="cl">xtrabackup --prepare --target-dir<span class="o">=</span>/backup/full-day1</span></span></code></pre></div><p><strong>適合</strong>：</p>
<ul>
<li>
<blockquote>
<p>100 GB production DB</p></blockquote>
</li>
<li>每日 incremental + 週一次 full（典型 enterprise schedule）</li>
<li>從自管 MySQL 遷 cloud（XtraBackup + rsync 到 cloud restore）</li>
</ul>
<p><strong>不適合</strong>：</p>
<ul>
<li>Schema-only dump（用 mysqldump 更簡單）</li>
<li>跨 major version restore</li>
</ul>
<h2 id="tool-3mydumper--並行邏輯備份">Tool 3：MyDumper — 並行邏輯備份</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mydumper --user<span class="o">=</span>backup --password<span class="o">=</span>... <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>  --threads<span class="o">=</span><span class="m">8</span> --rows<span class="o">=</span><span class="m">100000</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --outputdir<span class="o">=</span>/backup/mydumper-2026-05-19 <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --less-locking</span></span></code></pre></div><p><strong>輸出</strong>：每張 table 一個 <code>.sql</code> file（schema） + 多個 chunked <code>.dat</code> file（資料）。</p>
<p><strong>Trade-off</strong>：</p>
<ul>
<li>優點：<em>並行 dump</em>（per-table thread）、比 mysqldump 快 5-10x、可恢復斷點（resume）</li>
<li>缺點：tooling 不如 mysqldump 普及、需要單獨裝</li>
<li>對應的 <code>myloader</code> restore：也並行、比 mysqldump restore 快 5-10x</li>
</ul>
<p><strong>適合</strong>：</p>
<ul>
<li>100 GB - 1 TB 範圍</li>
<li>中型 production、想要邏輯備份的可讀性 + 並行加速</li>
</ul>
<h2 id="tool-4lvm--ebs-snapshot--物理-file-system-層">Tool 4：LVM / EBS Snapshot — 物理 file system 層</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. Freeze MySQL（讓 write 暫停）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">mysql&gt; FLUSH TABLES WITH READ LOCK<span class="p">;</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># 2. Trigger snapshot（EBS / LVM）</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">aws ec2 create-snapshot --volume-id vol-xxx --description <span class="s2">&#34;mysql-2026-05-19&#34;</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 3. Unfreeze</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl">mysql&gt; UNLOCK TABLES<span class="p">;</span></span></span></code></pre></div><p><strong>Trade-off</strong>：</p>
<ul>
<li>優點：超快（file system 層）、適合 <em>VM-based MySQL</em>（EC2 / on-prem）</li>
<li>缺點：必須 <em>暫停 write</em>（短時間 lock）、不能跨 OS / cloud 移植</li>
<li>AWS RDS / Aurora 全部走這條路（自動 snapshot）</li>
</ul>
<p><strong>適合</strong>：</p>
<ul>
<li>AWS RDS / Aurora（自動）</li>
<li>自管 MySQL on EC2 with EBS（EBS snapshot 結合 mysql freeze）</li>
<li>大 DB 想要 fast backup + fast restore</li>
</ul>
<h2 id="binlog-based-pitr">Binlog-based PITR</h2>
<p>Full backup 加上 binlog 才能達到 PITR。Binlog 是 MySQL replication / CDC / PITR 共用的 source。</p>
<p><strong>配置</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">[mysqld]</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="na">log_bin</span> <span class="o">=</span> <span class="s">mysql-bin</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="na">binlog_format</span> <span class="o">=</span> <span class="s">ROW                  # ROW 必須</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="na">binlog_row_image</span> <span class="o">=</span> <span class="s">FULL              # 完整 row image</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="na">sync_binlog</span> <span class="o">=</span> <span class="s">1                      # 每次 commit fsync binlog（zero loss）</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="na">binlog_expire_logs_seconds</span> <span class="o">=</span> <span class="s">1209600 # 14 天 retention（依需求調）</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="na">gtid_mode</span> <span class="o">=</span> <span class="s">ON                       # GTID 必須、PITR 用 GTID 識別 transaction</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="na">enforce_gtid_consistency</span> <span class="o">=</span> <span class="s">ON</span></span></span></code></pre></div><p><strong>Binlog backup</strong>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 持續 stream binlog 到 backup storage</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">mysqlbinlog --read-from-remote-server --raw --stop-never <span class="se">\
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="se"></span>  --user<span class="o">=</span>replication --password<span class="o">=</span>... <span class="se">\
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="se"></span>  --host<span class="o">=</span>primary.example.com <span class="se">\
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="se"></span>  --result-file<span class="o">=</span>/backup/binlog/ mysql-bin.000001 <span class="p">&amp;</span></span></span></code></pre></div><p><code>--read-from-remote-server</code> + <code>--stop-never</code> 持續從 primary tail binlog、不間斷 stream 到 backup directory。每個 binlog file 寫滿後 close + 開新 file。</p>
<h2 id="restore--pitr-流程">Restore + PITR 流程</h2>
<p>完整 PITR 流程（restore 到 2026-05-19 14:30:00）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Step 1: Restore full backup</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">xtrabackup --copy-back --target-dir<span class="o">=</span>/backup/full-2026-05-18  <span class="c1"># 前一天 full</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># Step 2: 啟動 MySQL（會看到 backup 拿那刻的 GTID set）</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">systemctl start mysqld
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># Step 3: 查 full backup 結束時的 GTID</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">mysql&gt; SHOW MASTER STATUS<span class="p">;</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">+------------------+----------+------------------------------------------+
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="p">|</span> File             <span class="p">|</span> Position <span class="p">|</span> Executed_Gtid_Set                        <span class="p">|</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">+------------------+----------+------------------------------------------+
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="p">|</span> mysql-bin.000150 <span class="p">|</span>     <span class="m">1234</span> <span class="p">|</span> server-uuid:1-12345                      <span class="p">|</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">+------------------+----------+------------------------------------------+
</span></span><span class="line"><span class="ln">14</span><span class="cl">
</span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="c1"># Step 4: Apply binlog 從 backup 之後到目標時間</span>
</span></span><span class="line"><span class="ln">16</span><span class="cl">mysqlbinlog --start-datetime<span class="o">=</span><span class="s2">&#34;2026-05-18 03:00:00&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="se"></span>            --stop-datetime<span class="o">=</span><span class="s2">&#34;2026-05-19 14:30:00&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="se"></span>            /backup/binlog/mysql-bin.000150 <span class="se">\
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="se"></span>            /backup/binlog/mysql-bin.000151 <span class="se">\
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="se"></span>            ...                                <span class="c1"># 列所有需要的 binlog</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">            <span class="p">|</span> mysql -u root -p
</span></span><span class="line"><span class="ln">22</span><span class="cl">
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="c1"># Step 5: 驗證 GTID set 到目標時間點對應的位置</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl">mysql&gt; SHOW MASTER STATUS<span class="p">;</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl"><span class="c1"># Executed_Gtid_Set 應包含到目標時間點的 transaction</span></span></span></code></pre></div><p>對 <em>精確 GTID-based PITR</em>（停在特定 transaction、不是 timestamp）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mysqlbinlog --include-gtids<span class="o">=</span><span class="s1">&#39;server-uuid:1-50000&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="se"></span>            /backup/binlog/mysql-bin.000150 ... <span class="p">|</span> mysql -u root -p</span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-gtid-處理不一致--restore-後-replication-broken">1. GTID 處理不一致 — Restore 後 replication broken</h3>
<p>XtraBackup restore 時 <code>--slave-info</code> 紀錄 GTID purged set、mysqldump 用 <code>--gtid-purged=ON</code>。如果 restore 後沒正確 set <code>gtid_purged</code>、replica re-attach 時 GTID gap error。</p>
<p>修法：</p>
<ul>
<li>XtraBackup restore：用 <code>xtrabackup_binlog_info</code> 內的 GTID set 設 <code>SET GLOBAL gtid_purged='...';</code></li>
<li>mysqldump：dump file 內已有 <code>SET @@GLOBAL.GTID_PURGED='...';</code>、執行 dump 自動 set</li>
<li>Restore 後 <em>先驗證 <code>Executed_Gtid_Set</code></em> 跟 source 預期對齊、再 START SLAVE</li>
</ul>
<h3 id="2-binlog-gap--中間遺漏-file-直接-restore-fail">2. Binlog gap — 中間遺漏 file 直接 restore fail</h3>
<p>Binlog stream 失聯（network blip / disk full）+ binlog rotate、<code>mysql-bin.000156</code> 不在 backup storage 內。PITR 試圖跨過該 file restore、跳過已 commit transaction、結果 <em>資料不一致</em>（不是錯誤、是 <em>silently incorrect</em>）。</p>
<p>修法：</p>
<ul>
<li><em>Binlog stream 必須持續</em>、失聯 → alert</li>
<li>監控 backup storage 內 binlog 連續性（file name 連號、無 gap）</li>
<li>Restore 前 <em>先驗證 binlog 完整性</em>：<code>mysqlbinlog --verify-binlog-checksum *.bin &gt; /dev/null</code></li>
<li>對 missing binlog <em>中止 PITR</em>、不繼續 partial restore</li>
</ul>
<h3 id="3-backup-沒-verify--真事故時才發現-restore-broken">3. Backup 沒 verify — 真事故時才發現 restore broken</h3>
<p>每天備份成功、storage 用了 5 TB、實際 <em>從未 restore 過</em>。事故發生 restore 才知道 backup file corrupt / GTID 錯 / binlog gap、整套無用。</p>
<p>修法：</p>
<ul>
<li><em>自動化 restore test</em>：每週 / 每月在 staging server 跑完整 restore + PITR、跑完 SELECT 比對 production</li>
<li>驗證 restore 後 row count 跟 production 接近、<code>CHECKSUM TABLE</code> 比對主要 table</li>
<li>真的事故時 RTO 才不會 surprise</li>
</ul>
<h3 id="4-rpo-不到-1-分鐘的代價">4. RPO 不到 1 分鐘的代價</h3>
<p>「我要 RPO &lt; 1 分鐘」聽起來合理、但實現需要：</p>
<ul>
<li><code>sync_binlog=1</code>（每 commit fsync、寫吞吐降 10-30%）</li>
<li>Binlog stream 到 <em>獨立 storage</em>（不只是 primary local disk）、cross-region replication（額外 network cost）</li>
<li>Replica 也用 semi-sync 配合（zero binlog loss）</li>
<li>監控 + alert RPO 違反（&lt; 1 分鐘 stream lag）</li>
</ul>
<p><strong>TCO</strong>：~30% 寫吞吐 penalty + 額外 storage / network cost + 7x24 on-call。考慮 <em>real RPO requirement</em> — 多數 application 5 分鐘 RPO 已足夠、追求 1 分鐘 RPO 不划算。</p>
<p>修法：</p>
<ul>
<li>跟 product / business 確認 <em>真 RPO 要求</em></li>
<li><em>RPO budget = 寫吞吐 trade-off + ops cost</em>、不是 free</li>
<li>用 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a> / managed offering 把 RPO 議題 outsource（Aurora &lt; 1 秒 RPO + 自動 cross-AZ）</li>
</ul>
<h3 id="5-encryption-key-沒備份--restore-後解不開資料">5. Encryption key 沒備份 — Restore 後解不開資料</h3>
<p>啟用 <em>encryption at rest</em>（MySQL 8.0+ <code>default_table_encryption=ON</code> + keyring plugin / component；MariaDB 用 <code>innodb_encrypt_tables</code>）後、所有 InnoDB tablespace 都加密。Master key 在 <em>keyring file</em> 或 KMS-backed component。如果 backup 只 backup MySQL data file、沒備 keyring、restore 後資料 <em>encrypted 但無 key、無法讀</em>。</p>
<p>修法：</p>
<ul>
<li><em>Keyring file 跟 data file 分開儲存</em>、但兩者 <em>都要 backup</em></li>
<li>用 <em>KMS-based keyring</em>（AWS KMS / HashiCorp Vault）取代 file-based、key 不在 MySQL server 上</li>
<li>Disaster recovery runbook 紀錄 <em>key recovery 流程</em>、不要假設「重 install MySQL」就能解</li>
</ul>
<h2 id="容量規劃要點">容量規劃要點</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full backup 頻率</td>
          <td>週一次（XtraBackup）或日一次（小 DB）</td>
      </tr>
      <tr>
          <td>Incremental 頻率</td>
          <td>每日（XtraBackup incremental）</td>
      </tr>
      <tr>
          <td>Binlog retention</td>
          <td>14 天（給 PITR window）</td>
      </tr>
      <tr>
          <td>Backup retention</td>
          <td>Full × 4 週 + 月度 archive × 12 個月</td>
      </tr>
      <tr>
          <td>Storage cost</td>
          <td>約 2-3x DB size（full + incremental + binlog）</td>
      </tr>
      <tr>
          <td>Cross-region copy</td>
          <td>必要（local backup 失效時還有 disaster recovery）</td>
      </tr>
      <tr>
          <td>Restore test 頻率</td>
          <td>每週 staging 上跑、每月 production-like 跑</td>
      </tr>
  </tbody>
</table>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>Replication replica 不能取代 backup — replica 上的 DROP TABLE 也會被 replicate、replica 上資料同樣消失。Backup 是 <em>獨立保險</em>。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>。</p>
<h3 id="跟-innodb-tuning">跟 InnoDB Tuning</h3>
<p><code>innodb_flush_log_at_trx_commit=1</code> + <code>sync_binlog=1</code> 是 backup-friendly 的設定（zero loss）、但寫吞吐降。如果為了寫吞吐放寬 durability、必須接受 <em>PITR window</em> 也 widening。詳見 <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>。</p>
<h3 id="跟-aurora-mysql">跟 Aurora MySQL</h3>
<p>Aurora 完全 outsource backup — automatic continuous backup + PITR &lt; 1 秒、不必管 mysqldump / XtraBackup / binlog stream。從 Aurora 遷出時、需要重新建 self-managed backup chain。詳見 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">migrate-to-aurora</a>。</p>
<h3 id="跟-postgresql-pitr">跟 PostgreSQL PITR</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL PITR</th>
          <th>PostgreSQL PITR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Logical backup</td>
          <td>mysqldump / MyDumper</td>
          <td>pg_dump / pg_dumpall</td>
      </tr>
      <tr>
          <td>Physical backup</td>
          <td>XtraBackup</td>
          <td>pg_basebackup / pgBackRest</td>
      </tr>
      <tr>
          <td>Incremental log</td>
          <td>Binary log（binlog）</td>
          <td>WAL (Write-Ahead Log)</td>
      </tr>
      <tr>
          <td>Stream tool</td>
          <td>mysqlbinlog &ndash;read-from-remote-server</td>
          <td>pg_receivewal</td>
      </tr>
      <tr>
          <td>PITR command</td>
          <td>mysqlbinlog &ndash;stop-datetime</td>
          <td>pg_ctl + recovery.conf / standby.signal</td>
      </tr>
      <tr>
          <td>Identifier</td>
          <td>GTID 或 file:position</td>
          <td>LSN（Log Sequence Number）</td>
      </tr>
      <tr>
          <td>Cross-version</td>
          <td>mysqldump（廣容）</td>
          <td>pg_dump（廣容）</td>
      </tr>
  </tbody>
</table>
<p>兩家 PITR 概念類似（full + log replay）、tool name 不同、概念對等。詳見 <a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PostgreSQL PITR + WAL Archiving</a>。</p>
<h2 id="何時-outsource-backup">何時 outsource backup</h2>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AWS 生態 + 不想管 backup ops</td>
          <td>Aurora MySQL（內建 PITR）</td>
      </tr>
      <tr>
          <td>GCP 生態</td>
          <td>Cloud SQL（內建 PITR）</td>
      </tr>
      <tr>
          <td>Azure 生態</td>
          <td>Azure DB for MySQL</td>
      </tr>
      <tr>
          <td>跨雲 + 想自管</td>
          <td>XtraBackup + binlog stream + S3</td>
      </tr>
      <tr>
          <td>規模小、可接受 mysqldump</td>
          <td>mysqldump cron + S3</td>
      </tr>
      <tr>
          <td>規模大、無 cloud</td>
          <td>Percona XtraBackup Enterprise + tape archive</td>
      </tr>
      <tr>
          <td>強合規（HIPAA / PCI-DSS）</td>
          <td>自管 + air-gap backup + audit trail</td>
      </tr>
  </tbody>
</table>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（binlog 跟 PITR 共用 source）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">MySQL InnoDB Tuning</a>（durability + backup 互動）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">migrate-to-aurora</a>（backup outsource）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/pitr-wal-archiving/" data-link-title="PostgreSQL PITR &#43; WAL archiving：從 base backup 到 point-in-time recovery 的完整鏈" data-link-desc="Base backup &#43; WAL archive 構成 PITR 的雙軌資料、archive_command &#43; restore_command 配置、用 pgBackRest / WAL-G 替代手寫腳本、5 個 production 踩雷（archive 靜默失敗 / archive lag / 錯誤 target time / base backup 過期未清 / timeline 分歧 recovery 模糊）、跟 Patroni &#43; monitoring 整合">PostgreSQL PITR + WAL Archiving</a>（PG sibling）</li>
<li>官方：<a href="https://docs.percona.com/percona-xtrabackup/8.0/">Percona XtraBackup</a> / <a href="https://github.com/mydumper/mydumper">MyDumper</a> / <a href="https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html">mysqldump</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Lock Contention：在 staging 重現的 deadlock、production 跑 6 個月才出現</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/lock-contention/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/lock-contention/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 &lt;em>lock contention&lt;/em> — 5 種 lock type + isolation level 互動 + production debug。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;h2 id="開場案例">開場案例&lt;/h2>
&lt;p>Application 跑了 6 個月、staging 100% 重現過的 deadlock 從來沒在 production 出現。某天 traffic 上升 30%、production 開始爆 &lt;code>ER_LOCK_DEADLOCK&lt;/code>、application retry 不夠快、order 大量失敗。&lt;/p>
&lt;p>&lt;code>SHOW ENGINE INNODB STATUS\G&lt;/code> 拉出 deadlock：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">*** (1) TRANSACTION:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">TRANSACTION 12345, ACTIVE 1 sec starting index read
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">mysql tables in use 1, locked 1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">LOCK WAIT 4 lock struct(s), heap size 1136, 3 row lock(s)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">MySQL thread id 100, query id 5000 update orders
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">UPDATE orders SET status = &amp;#39;shipped&amp;#39; WHERE id = 500
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">RECORD LOCKS space id 50 page no 5 n bits 80 index PRIMARY of table `production`.`orders`
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">trx id 12345 lock_mode X locks rec but not gap waiting
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">*** (2) TRANSACTION:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">13&lt;/span>&lt;span class="cl">TRANSACTION 12346, ACTIVE 1 sec starting index read
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">14&lt;/span>&lt;span class="cl">mysql tables in use 1, locked 1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">15&lt;/span>&lt;span class="cl">4 lock struct(s), heap size 1136, 4 row lock(s)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">16&lt;/span>&lt;span class="cl">MySQL thread id 101, query id 5001 update payments
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">17&lt;/span>&lt;span class="cl">UPDATE payments SET captured = 1 WHERE order_id = 500
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">18&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">19&lt;/span>&lt;span class="cl">*** (2) HOLDS THE LOCK(S):
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">20&lt;/span>&lt;span class="cl">RECORD LOCKS space id 50 page no 5 n bits 80 index PRIMARY of table `production`.`orders`
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">21&lt;/span>&lt;span class="cl">trx id 12346 lock_mode X locks rec but not gap
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">22&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">23&lt;/span>&lt;span class="cl">*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">24&lt;/span>&lt;span class="cl">RECORD LOCKS space id 51 page no 10 n bits 80 index idx_order_id of table `production`.`payments`
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">25&lt;/span>&lt;span class="cl">trx id 12346 lock_mode X waiting
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">26&lt;/span>&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">27&lt;/span>&lt;span class="cl">*** WE ROLL BACK TRANSACTION (1)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>兩個 transaction 各自拿了一邊 lock、互相等對方的、deadlock。為什麼 staging 重現過、production 6 個月才爆？因為 &lt;strong>lock contention 是 &lt;em>可能性&lt;/em> 不是 &lt;em>確定性&lt;/em>&lt;/strong> — staging 重現等於確認「程式邏輯有 deadlock risk」、production 6 個月平安等於「concurrency 還沒撞到」。Traffic 上升把 &lt;em>機率乘以 N&lt;/em>、原本每天 0 次變每分鐘 5 次。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> overview 的 implementation-layer deep article。Overview 已說明 MySQL 在 OLTP 譜系的定位、本文聚焦 <em>lock contention</em> — 5 種 lock type + isolation level 互動 + production debug。</p></blockquote>
<hr>
<h2 id="開場案例">開場案例</h2>
<p>Application 跑了 6 個月、staging 100% 重現過的 deadlock 從來沒在 production 出現。某天 traffic 上升 30%、production 開始爆 <code>ER_LOCK_DEADLOCK</code>、application retry 不夠快、order 大量失敗。</p>
<p><code>SHOW ENGINE INNODB STATUS\G</code> 拉出 deadlock：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">*** (1) TRANSACTION:
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">TRANSACTION 12345, ACTIVE 1 sec starting index read
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">mysql tables in use 1, locked 1
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">LOCK WAIT 4 lock struct(s), heap size 1136, 3 row lock(s)
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">MySQL thread id 100, query id 5000 update orders
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">UPDATE orders SET status = &#39;shipped&#39; WHERE id = 500
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">RECORD LOCKS space id 50 page no 5 n bits 80 index PRIMARY of table `production`.`orders`
</span></span><span class="line"><span class="ln">10</span><span class="cl">trx id 12345 lock_mode X locks rec but not gap waiting
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">*** (2) TRANSACTION:
</span></span><span class="line"><span class="ln">13</span><span class="cl">TRANSACTION 12346, ACTIVE 1 sec starting index read
</span></span><span class="line"><span class="ln">14</span><span class="cl">mysql tables in use 1, locked 1
</span></span><span class="line"><span class="ln">15</span><span class="cl">4 lock struct(s), heap size 1136, 4 row lock(s)
</span></span><span class="line"><span class="ln">16</span><span class="cl">MySQL thread id 101, query id 5001 update payments
</span></span><span class="line"><span class="ln">17</span><span class="cl">UPDATE payments SET captured = 1 WHERE order_id = 500
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl">*** (2) HOLDS THE LOCK(S):
</span></span><span class="line"><span class="ln">20</span><span class="cl">RECORD LOCKS space id 50 page no 5 n bits 80 index PRIMARY of table `production`.`orders`
</span></span><span class="line"><span class="ln">21</span><span class="cl">trx id 12346 lock_mode X locks rec but not gap
</span></span><span class="line"><span class="ln">22</span><span class="cl">
</span></span><span class="line"><span class="ln">23</span><span class="cl">*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
</span></span><span class="line"><span class="ln">24</span><span class="cl">RECORD LOCKS space id 51 page no 10 n bits 80 index idx_order_id of table `production`.`payments`
</span></span><span class="line"><span class="ln">25</span><span class="cl">trx id 12346 lock_mode X waiting
</span></span><span class="line"><span class="ln">26</span><span class="cl">
</span></span><span class="line"><span class="ln">27</span><span class="cl">*** WE ROLL BACK TRANSACTION (1)</span></span></code></pre></div><p>兩個 transaction 各自拿了一邊 lock、互相等對方的、deadlock。為什麼 staging 重現過、production 6 個月才爆？因為 <strong>lock contention 是 <em>可能性</em> 不是 <em>確定性</em></strong> — staging 重現等於確認「程式邏輯有 deadlock risk」、production 6 個月平安等於「concurrency 還沒撞到」。Traffic 上升把 <em>機率乘以 N</em>、原本每天 0 次變每分鐘 5 次。</p>
<p>這個 case 揭露 MySQL lock 教學的核心：理解 lock 不只是 <em>debug 跑 deadlock 報錯</em> 的能力、是 <em>讀 query 預測 lock pattern</em> 的能力。</p>
<h2 id="innodb-5-種-lock-類型">InnoDB 5 種 Lock 類型</h2>
<p>InnoDB 不是 <em>簡單 row lock</em>、有 5 個獨立 lock concept：</p>
<h3 id="1-record-lock--鎖-row">1. Record Lock — 鎖 row</h3>
<p><code>SELECT ... FOR UPDATE</code> / UPDATE / DELETE 對 <em>被 match 的 row</em> 加 record lock。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Transaction 1
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">UPDATE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 對 id=100 的 row 加 record lock</span></span></span></code></pre></div><p>Transaction 2 試 <code>UPDATE orders WHERE id = 100</code> 必須等。</p>
<h3 id="2-gap-lock--鎖-row-之間的空隙">2. Gap Lock — 鎖 row 之間的「空隙」</h3>
<p>InnoDB 在 <em>REPEATABLE READ</em> (預設) 下、<code>SELECT ... FOR UPDATE WHERE col &gt; 100</code> 不只 lock 符合的 row、<em>也 lock 該 range 內的「空隙」</em>、防其他 transaction INSERT 進這個 range。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 已存在 orders: id=100, 200, 300
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="mi">300</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">UPDATE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- Lock id=200 + gap lock (100, 200) + gap lock (200, 300)</span></span></span></code></pre></div><p>Transaction 2 試 <code>INSERT INTO orders (id) VALUES (150)</code> 必須等 — 即使 id=150 不存在、gap lock 阻擋 INSERT。</p>
<p><strong>Gap lock 是 deadlock 最常見來源</strong> — application logic 看 row、但 lock 卻 cover row 之外的空隙、難預測。</p>
<h3 id="3-next-key-lock--record--gap-組合">3. Next-Key Lock — Record + Gap 組合</h3>
<p>預設 lock 行為。<code>SELECT ... FOR UPDATE WHERE col = 100</code> 對 id=100 的 record lock + id=100 之前的 gap lock。</p>
<p>Lock 的範圍實際是 <em>半開區間</em> (previous_id, current_id]：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Records: 100, 200, 300
</span></span><span class="line"><span class="ln">2</span><span class="cl">
</span></span><span class="line"><span class="ln">3</span><span class="cl">WHERE id = 100 FOR UPDATE → next-key lock (-inf, 100]
</span></span><span class="line"><span class="ln">4</span><span class="cl">WHERE id = 200 FOR UPDATE → next-key lock (100, 200]
</span></span><span class="line"><span class="ln">5</span><span class="cl">WHERE id = 300 FOR UPDATE → next-key lock (200, 300]
</span></span><span class="line"><span class="ln">6</span><span class="cl">WHERE id BETWEEN 150 AND 250 FOR UPDATE → next-key lock (100, 200] + (200, 300]</span></span></code></pre></div><h3 id="4-insert-intention-lock--insert-之前的-gap-lock">4. Insert Intention Lock — INSERT 之前的 gap lock</h3>
<p><code>INSERT</code> 不直接 lock 整個 gap、而是 <em>insert intention lock</em> — 比 gap lock 弱、允許多個 INSERT 同 gap 並行（不同 id）。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Transaction 1
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">150</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- Transaction 2
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">175</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 同 gap (100, 200)、兩個 INSERT 並行、不阻塞</span></span></span></code></pre></div><p>但如果 Transaction 1 已 hold gap lock（through SELECT FOR UPDATE）、Transaction 2 INSERT 必須等。</p>
<h3 id="5-auto-inc-lock--auto-increment-column-專用">5. Auto-Inc Lock — Auto-Increment column 專用</h3>
<p><code>INSERT INTO orders (id) VALUES (DEFAULT)</code> 取得 auto-increment value 時 lock。Mode：</p>
<ul>
<li><code>innodb_autoinc_lock_mode=0</code>（traditional）：lock 整個 INSERT statement 期間、其他 INSERT 必須等</li>
<li><code>innodb_autoinc_lock_mode=1</code>（consecutive）：lock 短時間（取值期間）、INSERT 1 row 不會阻塞其他</li>
<li><code>innodb_autoinc_lock_mode=2</code>（interleaved、8.0+ 預設（5.7 預設仍是 1））：完全並行、auto-inc value 不保證連續但可並行</li>
</ul>
<p>8.0+ 預設 mode=2、性能高、但 <em>binlog format 必須 ROW</em>（STATEMENT 行為錯）。</p>
<h2 id="isolation-level-對-lock-的決定性影響">Isolation Level 對 Lock 的決定性影響</h2>
<p>InnoDB 4 個 isolation level、lock 行為完全不同：</p>
<table>
  <thead>
      <tr>
          <th>Isolation</th>
          <th>Read 行為</th>
          <th>Lock 範圍</th>
          <th>Default?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>READ UNCOMMITTED</td>
          <td>可讀 dirty data</td>
          <td>純 record lock、無 gap</td>
          <td>否</td>
      </tr>
      <tr>
          <td>READ COMMITTED</td>
          <td>每個 statement 看當下 committed</td>
          <td>純 record lock、無 gap</td>
          <td>否</td>
      </tr>
      <tr>
          <td>REPEATABLE READ</td>
          <td>Transaction 內 snapshot consistent</td>
          <td>Record + gap + next-key</td>
          <td><strong>是</strong></td>
      </tr>
      <tr>
          <td>SERIALIZABLE</td>
          <td>強制 SELECT 變 SELECT &hellip; FOR SHARE</td>
          <td>Record + gap + next-key 加重</td>
          <td>否</td>
      </tr>
  </tbody>
</table>
<p><strong>REPEATABLE READ + Gap lock 是 deadlock 主要來源</strong>：</p>
<ul>
<li>預設 isolation level</li>
<li>為了 <em>保證 repeatable read</em>（同 transaction 內讀同樣資料）、強制 gap lock 防 phantom row</li>
<li>但 gap lock 經常 lock 比預期廣的範圍、deadlock 機率上升</li>
</ul>
<p><strong>改成 READ COMMITTED 的取捨</strong>：</p>
<ul>
<li>優點：無 gap lock、deadlock 大降、寫吞吐上升</li>
<li>缺點：transaction 內讀同 query 結果可能不同（non-repeatable read）</li>
<li>重要：<em>binlog format 必須 ROW</em>（STATEMENT 在 READ COMMITTED 下 replication 行為不一致）</li>
<li>多數 MySQL production 用 READ COMMITTED 跑 OLTP、REPEATABLE READ 留給特殊 case</li>
</ul>
<p><strong>對比 PostgreSQL</strong>：</p>
<ul>
<li>PG 預設 isolation 是 <em>READ COMMITTED</em>（不是 RR）</li>
<li>PG 的 RR 用 <em>snapshot isolation</em>（不靠 gap lock）、deadlock 少</li>
<li>這是 MySQL 跟 PG 在 <em>並行控制 model</em> 的根本差異 — MySQL 用 lock-based、PG 用 MVCC-heavy</li>
</ul>
<h2 id="用-show-engine-innodb-status-讀-lock-狀態">用 SHOW ENGINE INNODB STATUS 讀 lock 狀態</h2>
<p><code>SHOW ENGINE INNODB STATUS\G</code> 是 production debug lock contention 的主要工具：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">------------
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">TRANSACTIONS
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">------------
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">Trx id counter 12350
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">Purge done for trx&#39;s n:o &lt; 12340 undo n:o &lt; 0 state: running but idle
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">History list length 5
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">---TRANSACTION 12345, ACTIVE 30 sec  -- 長 transaction、警訊
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">3 lock struct(s), heap size 1136, 5 row lock(s)
</span></span><span class="line"><span class="ln">10</span><span class="cl">MySQL thread id 100, OS thread handle ..., query id ...
</span></span><span class="line"><span class="ln">11</span><span class="cl">SELECT * FROM orders WHERE id &gt; 100 FOR UPDATE
</span></span><span class="line"><span class="ln">12</span><span class="cl">------- TRX HAS BEEN WAITING 5 SEC FOR THIS LOCK:
</span></span><span class="line"><span class="ln">13</span><span class="cl">RECORD LOCKS space id 50 page no 5 n bits 80 index PRIMARY of table `production`.`orders`
</span></span><span class="line"><span class="ln">14</span><span class="cl">trx id 12345 lock_mode X locks gap before rec  -- gap lock</span></span></code></pre></div><p>關鍵欄位：</p>
<ul>
<li><code>ACTIVE N sec</code>：transaction 跑多久（長 transaction 嫌疑）</li>
<li><code>lock_mode X / S</code>：exclusive / shared lock</li>
<li><code>locks rec but not gap</code> / <code>locks gap before rec</code> / <code>locks rec</code>：是 record / gap / next-key</li>
<li><code>TRX HAS BEEN WAITING N SEC FOR THIS LOCK</code>：等多久、超過幾秒就是 lock contention</li>
</ul>
<p><code>SELECT * FROM information_schema.INNODB_TRX</code> / <code>INNODB_LOCKS</code> (5.7) / <code>performance_schema.data_locks</code> (8.0) 給 <em>structured</em> lock 視圖。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-gap-lock-阻塞-insert--lock-不存在的-row">1. Gap lock 阻塞 INSERT — 「Lock 不存在的 row」</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Transaction 1
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="k">UPDATE</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 假設 user_id=100 沒任何 order、預期沒 lock 任何 row
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- Transaction 2
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">50</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="c1">-- 等！為什麼？</span></span></span></code></pre></div><p>問題：<code>WHERE user_id = 100</code> <em>沒有 record</em> 時、InnoDB 仍 lock <em>user_id=100 應該在的 gap</em>（防 phantom）、Transaction 2 INSERT 進這個 gap 被阻擋。</p>
<p>修法：</p>
<ul>
<li>改 READ COMMITTED isolation</li>
<li>或不用 <code>SELECT ... FOR UPDATE</code> on empty result、改 <em>application 層 check + INSERT</em> pattern</li>
<li>用 <code>INSERT ... ON DUPLICATE KEY UPDATE</code> 或 <code>INSERT IGNORE</code> 避免 SELECT FOR UPDATE</li>
</ul>
<h3 id="2-auto-inc-lock-contention--大量並行-insert">2. Auto-Inc Lock Contention — 大量並行 INSERT</h3>
<p><code>innodb_autoinc_lock_mode=0</code> 或 <code>=1</code> 模式下、大量並行 INSERT 撞 auto-inc lock、寫吞吐 cap。</p>
<p>修法：</p>
<ul>
<li>設 <code>innodb_autoinc_lock_mode=2</code>（interleaved、8.0+ 預設（5.7 預設仍是 1））</li>
<li>確認 <code>binlog_format=ROW</code>（mode=2 必須）</li>
<li>接受 auto-inc value 不連續（id 可能跳號）</li>
</ul>
<h3 id="3-fk-lock-cascading--父子-transaction-互鎖">3. FK Lock Cascading — 父子 transaction 互鎖</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- orders 表有 customer_id FK → customers.id
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1">-- Transaction 1
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">customers</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;...&#39;</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">100</span><span class="p">;</span><span class="w">  </span><span class="c1">-- lock customers row
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- Transaction 2
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">customer_id</span><span class="p">,</span><span class="w"> </span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">50</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="c1">-- FK check 需要 lock customers row id=100、等 Transaction 1</span></span></span></code></pre></div><p>FK 強制 <em>每個 INSERT child 都要 shared lock parent</em>、parent 的任何 UPDATE 都會 lock 所有 child INSERT。</p>
<p>修法：</p>
<ul>
<li>評估 FK 是否真的需要（high-write 場景考慮 application-level enforcement）</li>
<li>短 transaction 縮短 lock 時間</li>
<li>FK 設計時讓 <em>parent UPDATE 少</em> / <em>child INSERT 多</em>（parent 是穩定資料）</li>
</ul>
<h3 id="4-large-transaction-lock-holding--1-個-transaction-拖全-cluster">4. Large Transaction Lock Holding — 1 個 transaction 拖全 cluster</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">BEGIN</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="c1">-- 100K row 的 batch UPDATE
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"></span><span class="k">UPDATE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;archived&#39;</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2024-01-01&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- 跑 5 分鐘、持 100K row 的 lock
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1">-- 其他 transaction 撞到任何被 lock 的 row 都等 5 分鐘
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">COMMIT</span><span class="p">;</span></span></span></code></pre></div><p>長 transaction 是 <em>lock contention 災難</em>。</p>
<p>修法：</p>
<ul>
<li>
<p>把 batch operation <em>拆 chunk</em>（每 chunk 1000 row、commit、繼續）：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">DO</span><span class="w"> </span><span class="err">{</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">  </span><span class="k">START</span><span class="w"> </span><span class="k">TRANSACTION</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">  </span><span class="k">UPDATE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;archived&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">  </span><span class="k">WHERE</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">&#39;2024-01-01&#39;</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s1">&#39;archived&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">  </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">  </span><span class="k">COMMIT</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="err">}</span><span class="w"> </span><span class="n">WHILE</span><span class="w"> </span><span class="n">rows_affected</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span></span></span></code></pre></div></li>
<li>
<p>用 <em>pt-archiver</em> tool（Percona）對 batch UPDATE / DELETE 自動 chunked</p>
</li>
<li>
<p>監控 <code>information_schema.innodb_trx</code> 找出 long-running transaction</p>
</li>
</ul>
<h3 id="5-read-committed--binlog-row-interaction">5. READ COMMITTED + Binlog ROW Interaction</h3>
<p>READ COMMITTED isolation 改善 deadlock、但對 <em>binlog format</em> 有要求：</p>
<ul>
<li><code>binlog_format=STATEMENT</code>：READ COMMITTED 下 transaction 看到不同 snapshot、replicate 後 replica 結果可能 <em>不同於 primary</em>（broken replication semantically）</li>
<li><code>binlog_format=ROW</code>：每個 row event 都 explicit、READ COMMITTED 跟 ROW 兼容、replica 結果一致</li>
<li><code>binlog_format=MIXED</code>：部分 case 仍可能 fall back STATEMENT、不推薦</li>
</ul>
<p>修法：</p>
<ul>
<li>用 READ COMMITTED 時、強制 <code>binlog_format=ROW</code></li>
<li>全 cluster server（primary + replica + Group Replication members）統一 binlog_format</li>
<li>Migration 5.7 STATEMENT → 8.0 ROW 時、isolation 跟 binlog format 一起 review</li>
</ul>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication">跟 Replication</h3>
<p><code>binlog_format=ROW</code> 跟 isolation level 互動已述。Replica apply ROW binlog 時、replica 上 <em>也 acquire 同樣 lock</em>、replica 上的 long query 跟 replication lag 互動。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>。</p>
<h3 id="跟-group-replication">跟 Group Replication</h3>
<p>GR certification phase 跟 row lock 衝突 — write conflict 檢測在 certification、不是 lock。但 <em>local row lock</em> 仍存在、影響 single-instance write throughput。詳見 <a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">Group Replication</a>。</p>
<h3 id="跟-online-schema-change">跟 Online Schema Change</h3>
<p>gh-ost / pt-osc 在 cut-over 階段需要 metadata lock、跟 long-running transaction 衝突。Lock contention deep dive 跟 OSC cut-over 議題密切。詳見 <a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">Online Schema Change Tools</a>。</p>
<h3 id="跟-query-optimization">跟 Query Optimization</h3>
<p>Slow query 持 lock 久、放大 contention。<code>EXPLAIN ANALYZE</code> 看實際執行時間、跟 lock holding time 直接相關。詳見 <a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">Query Optimization</a>。</p>
<h3 id="跟-innodb-tuning">跟 InnoDB Tuning</h3>
<p><code>innodb_lock_wait_timeout=50</code>（預設 50 秒）— lock wait 超時 transaction 自動 rollback、避免無限等。production 建議調短（10-20 秒）、快 fail 給 application retry。詳見 <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>。</p>
<h2 id="跟-postgresql-lock-model-對比">跟 PostgreSQL Lock model 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL InnoDB</th>
          <th>PostgreSQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Concurrency model</td>
          <td>Lock-based（rec / gap / next-key）</td>
          <td>MVCC-heavy（few explicit lock）</td>
      </tr>
      <tr>
          <td>預設 isolation</td>
          <td>REPEATABLE READ</td>
          <td>READ COMMITTED</td>
      </tr>
      <tr>
          <td>Gap lock</td>
          <td>有</td>
          <td>無對應（PG 用 predicate lock for SERIALIZABLE）</td>
      </tr>
      <tr>
          <td>Deadlock 機率</td>
          <td>中-高</td>
          <td>低</td>
      </tr>
      <tr>
          <td>Auto-inc</td>
          <td>內建 + auto-inc lock</td>
          <td>SEQUENCE（無對應 lock 議題）</td>
      </tr>
      <tr>
          <td>Snapshot isolation</td>
          <td>部分（RR 內）</td>
          <td>完整（MVCC 跑全 stack）</td>
      </tr>
  </tbody>
</table>
<p>PG 用 MVCC 跑大部分並行 control、少數 case 才用 explicit lock、整體 deadlock 機率低。MySQL 用 lock-based + MVCC mixed、production 必須懂 lock pattern。</p>
<h2 id="觀測-metric">觀測 metric</h2>
<p>Production 持續 monitor：</p>
<ul>
<li><code>Innodb_row_lock_waits</code> / <code>_time</code> → lock wait 累計</li>
<li><code>Innodb_deadlocks</code> → deadlock 次數（5.7+ 有、之前要 parse SHOW ENGINE）</li>
<li><code>performance_schema.data_lock_waits</code> → 即時 lock wait 視圖（8.0+）</li>
<li><code>information_schema.innodb_trx</code> → long-running transaction</li>
<li><code>slow_query_log</code> → 看 query 是否花太多 time 在 lock wait</li>
</ul>
<p>對 deadlock：把 <code>innodb_print_all_deadlocks=ON</code>、所有 deadlock 寫 error log、不用 <code>SHOW ENGINE</code> 才看到。</p>
<h2 id="何時改-isolation-level">何時改 isolation level</h2>
<table>
  <thead>
      <tr>
          <th>場景</th>
          <th>建議 isolation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>典型 web OLTP、低-中寫吞吐</td>
          <td>REPEATABLE READ（預設）</td>
      </tr>
      <tr>
          <td>高寫吞吐、deadlock 頻繁</td>
          <td>READ COMMITTED</td>
      </tr>
      <tr>
          <td>金融 transaction、需要 strict isolation</td>
          <td>REPEATABLE READ + 仔細 review</td>
      </tr>
      <tr>
          <td>嚴格 serializable（小 case）</td>
          <td>SERIALIZABLE（performance penalty）</td>
      </tr>
      <tr>
          <td>跨 region replication + 強一致</td>
          <td>用 Group Replication / Spanner 而不是 isolation level</td>
      </tr>
  </tbody>
</table>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（binlog format + isolation 互動）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/query-optimization/" data-link-title="MySQL Query Optimization：從 EXPLAIN 看到實際執行、5 條 query 從 5 秒變 50ms 的 anatomy" data-link-desc="MySQL query 慢的根因不在「SQL 寫法」、在「optimizer 選錯 plan」。本文從 5 個常見 production case 開場（5 秒 → 50ms / 30 秒 → 200ms / 8 秒 → 30ms 等）、走 EXPLAIN / EXPLAIN ANALYZE / optimizer trace 三層分析工具、index hint vs optimizer hint 取捨、cardinality estimation 失效時的修法、5 production 踩雷（statistics 過時 / forced index 用錯 / hash join 沒觸發 / range scan 退化 ALL / derived table materialization）">MySQL Query Optimization</a>（slow query → lock contention）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">MySQL InnoDB Tuning</a>（lock_wait_timeout）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a>（cert vs lock）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/online-schema-change-tools/" data-link-title="MySQL Online Schema Change：gh-ost 跟 pt-online-schema-change 兩條完全不同的 ghost table 路徑" data-link-desc="MySQL ALTER TABLE 可能鎖整張表，production 需要 online schema change 流程。gh-ost（GitHub）跟 pt-online-schema-change（Percona）都用 ghost table 解決、但底層機制完全不同：pt-osc 用 trigger 同步、gh-ost 用 binlog stream 同步。本文走兩工具機制對照表 → trigger vs binlog 各自取捨 → 配置 step-by-step → 5 production 踩雷（trigger overhead / binlog 延遲 / FK constraint / hot trigger lock / 切換瞬間 deadlock）→ 何時用哪一個">MySQL Online Schema Change</a>（metadata lock）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PostgreSQL MVCC + Lock Model</a>（PG sibling、為什麼 PG 比 MySQL 少 deadlock — pure MVCC vs MVCC + gap lock）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor page</a>（MVCC vs lock model）</li>
<li><a href="/blog/backend/knowledge-cards/isolation-level/" data-link-title="Isolation Level" data-link-desc="說明資料庫交易隔離級別如何影響並發讀寫結果">Isolation Level 卡片</a></li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html">InnoDB Locking</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL 5.7 → 8.0 Major Version Upgrade：character set / authentication / atomic DDL 三條 paradigm 同時換軌</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/major-version-upgrade/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/major-version-upgrade/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> 內 version upgrade migration playbook、走 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type E paradigm shift 結構。&lt;/p>&lt;/blockquote>
&lt;p>5.7 → 8.0 看起來是 &lt;em>minor bump&lt;/em>（從 5.7.40 升到 8.0.36）、但不是。Oracle 把這個 release boundary 當成 &lt;em>清庫存的機會&lt;/em> — 同時推出 3 個 &lt;em>behavioral paradigm shift&lt;/em>：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Paradigm&lt;/th>
 &lt;th>5.7 default&lt;/th>
 &lt;th>8.0 default&lt;/th>
 &lt;th>影響&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Character set&lt;/td>
 &lt;td>latin1 / utf8（=utf8mb3）&lt;/td>
 &lt;td>utf8mb4&lt;/td>
 &lt;td>string column 儲存 + emoji / 4-byte UTF-8&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Authentication plugin&lt;/td>
 &lt;td>mysql_native_password&lt;/td>
 &lt;td>caching_sha2_password&lt;/td>
 &lt;td>client / library 需要支援新 plugin&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DDL atomicity&lt;/td>
 &lt;td>Non-atomic（crash 留 orphan）&lt;/td>
 &lt;td>Atomic（crash recovery 乾淨）&lt;/td>
 &lt;td>開發信心、crash recovery 行為&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>對應 &lt;em>任意一個&lt;/em> paradigm 升級失誤、production 都會 down。三條同時換、必須 &lt;em>三條都規劃&lt;/em>。&lt;/p>
&lt;p>這條 upgrade 比 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &amp;#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">PostgreSQL major-version-upgrade&lt;/a> 工作量大 — PG major upgrade 主要是 &lt;em>pg_upgrade&lt;/em> 工具流程、MySQL 是 &lt;em>behavioral compatibility audit + ecosystem 全 review&lt;/em>。&lt;/p>
&lt;h2 id="為什麼是-type-e不是-minor-upgrade">為什麼是 Type E（不是 minor upgrade）&lt;/h2>
&lt;p>跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit&lt;/a>：&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> 內 version upgrade migration playbook、走 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> Type E paradigm shift 結構。</p></blockquote>
<p>5.7 → 8.0 看起來是 <em>minor bump</em>（從 5.7.40 升到 8.0.36）、但不是。Oracle 把這個 release boundary 當成 <em>清庫存的機會</em> — 同時推出 3 個 <em>behavioral paradigm shift</em>：</p>
<table>
  <thead>
      <tr>
          <th>Paradigm</th>
          <th>5.7 default</th>
          <th>8.0 default</th>
          <th>影響</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Character set</td>
          <td>latin1 / utf8（=utf8mb3）</td>
          <td>utf8mb4</td>
          <td>string column 儲存 + emoji / 4-byte UTF-8</td>
      </tr>
      <tr>
          <td>Authentication plugin</td>
          <td>mysql_native_password</td>
          <td>caching_sha2_password</td>
          <td>client / library 需要支援新 plugin</td>
      </tr>
      <tr>
          <td>DDL atomicity</td>
          <td>Non-atomic（crash 留 orphan）</td>
          <td>Atomic（crash recovery 乾淨）</td>
          <td>開發信心、crash recovery 行為</td>
      </tr>
  </tbody>
</table>
<p>對應 <em>任意一個</em> paradigm 升級失誤、production 都會 down。三條同時換、必須 <em>三條都規劃</em>。</p>
<p>這條 upgrade 比 <a href="/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">PostgreSQL major-version-upgrade</a> 工作量大 — PG major upgrade 主要是 <em>pg_upgrade</em> 工具流程、MySQL 是 <em>behavioral compatibility audit + ecosystem 全 review</em>。</p>
<h2 id="為什麼是-type-e不是-minor-upgrade">為什麼是 Type E（不是 minor upgrade）</h2>
<p>跑 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit</a>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評</th>
          <th>說明</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>Medium</td>
          <td>SQL 一致、reserved keyword 新增、collation 預設變</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>Medium-High</td>
          <td>binary upgrade flow 簡單、但 ecosystem 工具兼容性 audit 工作量大</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>High</td>
          <td>3 條 default paradigm shift（charset / auth / atomic DDL）</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low</td>
          <td>同 MySQL 引擎、不引新 component</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>Medium-High</td>
          <td>client library / driver / connection string 都可能要改</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low</td>
          <td>部署 topology 不變</td>
      </tr>
  </tbody>
</table>
<p>Paradigm = High + App change = Medium-High → <strong>Type E paradigm shift</strong>。</p>
<p>雖然是 <em>同一個 vendor 的 major version</em>、實際的 <em>application 行為差異</em> 跨越多個 paradigm、6 type 框架仍適用、結構走 partial migration 收斂。</p>
<h2 id="4-phase-upgrade">4-phase upgrade</h2>
<h3 id="phase-1pre-check-audit">Phase 1：Pre-check audit</h3>
<p>8.0 升級前用 <em>MySQL Shell upgrade checker</em> + 手動 audit：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">mysqlsh root@5.7-primary.example.com -- util check-for-server-upgrade</span></span></code></pre></div><p>Upgrade checker 報告：</p>
<ul>
<li><em>Reserved keyword</em> 衝突（5.7 不是 keyword 但 8.0 是、例如 <code>WINDOW</code> / <code>RANK</code> / <code>LATERAL</code>）</li>
<li>舊 character set / collation 使用點（latin1 / utf8mb3）</li>
<li>Deprecated feature 使用（GROUP BY 隱含 ORDER BY 等）</li>
<li>Datatype 變動（DATETIME 行為微差）</li>
</ul>
<p>手動 audit：</p>
<ul>
<li>Application driver / library 版本是否支援 caching_sha2_password</li>
<li>Connection string 內 <code>default-authentication-plugin</code> 設定</li>
<li>ORM / framework 是否假設 utf8 而非 utf8mb4</li>
</ul>
<p>完成標準：寫出 <em>blocker list</em>（必須在升級前修） + <em>warning list</em>（可在升級後處理）。</p>
<h3 id="phase-2shadow-upgrade--replica-升-80">Phase 2：Shadow upgrade — Replica 升 8.0</h3>
<p>從 <em>non-critical replica</em> 升起。先升一個 replica、跑 production traffic（read-only）2-4 週：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. Stop replica</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">systemctl stop mysql
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 2. Backup（XtraBackup）</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">xtrabackup --backup --target-dir<span class="o">=</span>/backup/pre-upgrade
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="c1"># 3. Install MySQL 8.0 binary（apt / yum 升級）</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">apt-get install mysql-server-8.0
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 4. 啟動 8.0、自動 upgrade data dictionary</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">systemctl start mysql
</span></span><span class="line"><span class="ln">12</span><span class="cl">
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="c1"># 5. 8.0 自動跑 server-upgrade（8.0.16+ 內建、mysql_upgrade utility 已 deprecated）</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="c1"># 若 5.7 升 8.0.16 之前 server、才需要手動跑 mysql_upgrade -u root -p</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"># 6. 重新 attach 為 5.7 primary 的 replica（8.0 replica 可 attach 5.7 primary）</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">CHANGE MASTER TO <span class="nv">MASTER_AUTO_POSITION</span><span class="o">=</span>1<span class="p">;</span>
</span></span><span class="line"><span class="ln">18</span><span class="cl">START SLAVE<span class="p">;</span></span></span></code></pre></div><p>跑 production read traffic 觀察：</p>
<ul>
<li>Query result 是否跟 5.7 一致（特別 character set 相關）</li>
<li>Replication lag 是否在 baseline 範圍</li>
<li>8.0-specific feature 是否需要（hash join / window function 等）</li>
</ul>
<h3 id="phase-3promote-80-為-primary">Phase 3：Promote 8.0 為 primary</h3>
<p>確認 shadow replica 穩定後：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. 升其他 replica 到 8.0</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"># （per-replica 跑 Phase 2 流程）</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">
</span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="c1"># 2. Application application 改用 8.0-compatible driver</span>
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 把 connection string 加 default-authentication-plugin=caching_sha2_password</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 或仍用 mysql_native_password（user 端設定）</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># 3. Failover：promote 8.0 replica 為 primary</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 用 Orchestrator / 自管 failover 流程</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># 4. 5.7 primary 變成 8.0 replica、升 5.7 → 8.0</span></span></span></code></pre></div><p>完成標準：所有 server 都是 8.0、application 連 8.0 endpoint 無 error。</p>
<h3 id="phase-4decommission-57--套用-80-paradigm">Phase 4：Decommission 5.7 + 套用 8.0 paradigm</h3>
<p>完成 binary upgrade 不是真正完成 — 還要逐步遷移 paradigm：</p>
<ul>
<li>
<p><strong>Character set 升級</strong>：歷史 latin1 / utf8 table 改 utf8mb4</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">CONVERT</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="nb">CHARACTER</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">utf8mb4</span><span class="w"> </span><span class="k">COLLATE</span><span class="w"> </span><span class="n">utf8mb4_0900_ai_ci</span><span class="p">;</span></span></span></code></pre></div><p>每張 table 走 gh-ost / pt-osc（避免 production 阻塞）</p>
</li>
<li>
<p><strong>Authentication 升級</strong>：逐步把 user 從 <code>mysql_native_password</code> 改 <code>caching_sha2_password</code></p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">USER</span><span class="w"> </span><span class="s1">&#39;app&#39;</span><span class="o">@</span><span class="s1">&#39;%&#39;</span><span class="w"> </span><span class="n">IDENTIFIED</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="n">caching_sha2_password</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="s1">&#39;new_password&#39;</span><span class="p">;</span></span></span></code></pre></div><p>需確認 application driver 已支援新 plugin（多數 modern driver OK、legacy 可能要升級）</p>
</li>
<li>
<p><strong>Reserved keyword 處理</strong>：column / table 名稱跟新 reserved word 衝突的、改名</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">RENAME</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">window</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="n">event_window</span><span class="p">;</span></span></span></code></pre></div></li>
</ul>
<p>多數 org 在 Phase 3 停留更久 — paradigm 升級不是一次 big bang、是漸進。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-authentication-plugin--application-突然連不上">1. Authentication plugin — Application 突然連不上</h3>
<p>升 8.0 後 <em>new user</em> 預設用 caching_sha2_password、舊 application driver（&lt; 5 年版本）不支援、connect error: <code>Authentication plugin 'caching_sha2_password' cannot be loaded</code>。</p>
<p>修法：</p>
<ul>
<li><em>先升 driver</em>：每個 application 升級 mysql-connector-* 到支援 caching_sha2 的版本（多數 modern release 已支援）</li>
<li>短期 workaround：用 <code>mysql_native_password</code>（new user 顯式 create with <code>IDENTIFIED WITH mysql_native_password</code>）</li>
<li>設 <code>default_authentication_plugin=mysql_native_password</code>、強制保留舊 default</li>
</ul>
<h3 id="2-character-set-4-byte-utf-8--emoji-進不去">2. Character set 4-byte UTF-8 — Emoji 進不去</h3>
<p>5.7 latin1 / utf8（=utf8mb3）column 升 8.0 後 <em>仍是 utf8mb3</em>、不會自動升 utf8mb4。Application 寫入 emoji（4-byte UTF-8）會被 <em>truncate / 拒絕</em>。</p>
<p>修法：</p>
<ul>
<li><em>逐 table CONVERT</em>：gh-ost / pt-osc 跑 <code>ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4</code></li>
<li>新建 table 預設用 utf8mb4（<code>character_set_server=utf8mb4</code> 設定）</li>
<li>Application 連線 charset 設定一致（<code>character_set_client / connection / results</code>）</li>
</ul>
<h3 id="3-reserved-keyword--application-query-突然-syntax-error">3. Reserved keyword — Application query 突然 syntax error</h3>
<p>5.7 跑得好的 query：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">window</span><span class="p">,</span><span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span></span></span></code></pre></div><p>8.0 報錯：<code>window</code> 跟 <code>rank</code> 都是 reserved keyword、必須 backtick：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">`</span><span class="n">window</span><span class="o">`</span><span class="p">,</span><span class="w"> </span><span class="o">`</span><span class="n">rank</span><span class="o">`</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="p">;</span></span></span></code></pre></div><p>修法：</p>
<ul>
<li>Phase 1 upgrade checker 已抓出來、Application code review 改 SQL</li>
<li>推薦 <em>predefer table / column 名 backtick</em> policy（一律加 backtick、避免未來 reserved word 衝突）</li>
<li>ORM 多數會自動 backtick、raw SQL 容易踩</li>
</ul>
<h3 id="4-group-replication--新-feature-開了就不能-rollback">4. Group Replication / 新 feature 開了就不能 rollback</h3>
<p>8.0 升級後 <em>誘惑使用 8.0-only feature</em>：</p>
<ul>
<li>Group Replication（5.7 也有但 8.0 更穩）</li>
<li>Resource Group（5.7 沒有）</li>
<li>Histograms（5.7 沒有）</li>
<li>CTE / window function（5.7 沒有）</li>
</ul>
<p>一旦 application 用了這些 feature、不能 rollback 5.7（feature 不存在、query 失敗）。</p>
<p>修法：</p>
<ul>
<li><em>Phase 1-3 期間禁用 8.0-only feature</em>、保留 rollback option</li>
<li><em>Phase 4 完成</em> 且穩定運作 30+ 天後、才開始 evaluate 8.0-only feature</li>
<li>加 8.0-only feature 時 <em>明確記錄不可 rollback</em></li>
</ul>
<h3 id="5-collation-default-變動--sort-order-跟-unique-行為改變">5. Collation default 變動 — Sort order 跟 unique 行為改變</h3>
<p>5.7 utf8mb4 預設 collation = <code>utf8mb4_general_ci</code>、8.0 預設 = <code>utf8mb4_0900_ai_ci</code>。兩者排序行為不一致：</p>
<ul>
<li><code>utf8mb4_general_ci</code>：簡化 collation、不嚴格遵循 Unicode</li>
<li><code>utf8mb4_0900_ai_ci</code>：Unicode 9.0 compliance、accent-insensitive</li>
</ul>
<p>對 <em>已存在的 table</em>、collation 不會被 8.0 升級改變（保留 5.7 設定）。但 <em>新建 table</em> 預設用 0900_ai_ci、UNION / JOIN 跨不同 collation 的 column 可能 error: <code>Illegal mix of collations</code>。</p>
<p>修法：</p>
<ul>
<li>統一 collation：要麼 <em>所有 table 改 0900_ai_ci</em>、要麼 <em>所有 table 保留 general_ci</em></li>
<li>Schema migration 走 OSC 工具</li>
<li>Application 內 sort-dependent logic（leaderboard / search ranking）要驗證新 collation 結果</li>
</ul>
<h2 id="capability-gap57-有但-80-沒有">Capability gap：5.7 有但 8.0 沒有</h2>
<p>少數 8.0 <em>拿走</em> 的能力：</p>
<ul>
<li><strong>Query Cache</strong>：5.7 內建（但已 deprecated）、8.0 <em>完全移除</em>。Query cache 在高並發場景 actually slowing down、移除是好事</li>
<li><strong>InnoDB MEMORY engine</strong>：5.7 部分支援、8.0 限制更多</li>
<li><strong>Some MyISAM optimizations</strong>：8.0 強制 InnoDB-first、MyISAM-specific 工作流 broken</li>
</ul>
<p>對 Query Cache user：升 8.0 前評估是否依賴、考慮改 application-side cache（Redis）。</p>
<h2 id="容量與成本對照">容量與成本對照</h2>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>5.7</th>
          <th>8.0</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cost</td>
          <td>Free (CE) / Enterprise</td>
          <td>Free (CE) / Enterprise</td>
      </tr>
      <tr>
          <td>升級 hosts × 時間</td>
          <td>-</td>
          <td>per-instance ~30 分鐘 binary upgrade</td>
      </tr>
      <tr>
          <td>Application 改動</td>
          <td>-</td>
          <td>driver upgrade + SQL review</td>
      </tr>
      <tr>
          <td>Character set conversion</td>
          <td>-</td>
          <td>per-table OSC、大表小時級</td>
      </tr>
      <tr>
          <td>Ops headcount</td>
          <td>-</td>
          <td>1-2 個 DBA × 2-4 週</td>
      </tr>
      <tr>
          <td>對 production 影響</td>
          <td>-</td>
          <td>Phase 2-3 漸進升級、無大 downtime</td>
      </tr>
  </tbody>
</table>
<p>5.7 → 8.0 upgrade 整體成本是 <em>1-2 個 FTE 月</em> 規模。對中型 deployment（100+ DB）可能更多。</p>
<h2 id="何時不升">何時不升</h2>
<ul>
<li><strong>App 用 Query Cache 重度</strong>：8.0 沒了、要 application 改造</li>
<li><strong>Old driver 不能升</strong>：legacy enterprise application 用 10 年前 driver、driver vendor 已倒、無法升 8.0-compatible</li>
<li><strong>Compliance freeze</strong>：某些金融 / 醫療場景 freeze technology 多年、升級需要重 audit + recertification</li>
<li><strong>5.7 已 EOL（2023-10）後仍堅持不升</strong>：security risk 高、應該 <em>優先升</em></li>
</ul>
<h2 id="跟-postgresql-major-version-upgrade-對比">跟 PostgreSQL Major Version Upgrade 對比</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>MySQL 5.7 → 8.0</th>
          <th>PostgreSQL N → N+1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tool</td>
          <td>binary upgrade + 自動 server-upgrade（8.0.16+；舊版用 mysql_upgrade）</td>
          <td>pg_upgrade（in-place）</td>
      </tr>
      <tr>
          <td>Downtime</td>
          <td>&lt; 5 分鐘 per instance（binary + DD upgrade）</td>
          <td>&lt; 1 分鐘 per instance（pg_upgrade）</td>
      </tr>
      <tr>
          <td>Paradigm shift</td>
          <td>3 條（charset / auth / atomic DDL）</td>
          <td>一般 0-1 條（PG major 多保 compat）</td>
      </tr>
      <tr>
          <td>App 必須改</td>
          <td>多（driver + query）</td>
          <td>少（多數 query 兼容）</td>
      </tr>
      <tr>
          <td>Risk</td>
          <td>高（paradigm 多）</td>
          <td>中-低</td>
      </tr>
      <tr>
          <td>Rollback</td>
          <td>不可（一旦 atomic DDL data 寫入、5.7 不認）</td>
          <td>不可（pg_upgrade 不可逆）</td>
      </tr>
  </tbody>
</table>
<p>PG major upgrade 比 MySQL 簡單。MySQL 5.7 → 8.0 是 <em>特例</em> — Oracle 把多年 deprecated 一次清。8.0 → 8.4 / 9.x 預期更平順。</p>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>8.0 replica 可 attach 5.7 primary（向下兼容）、但 5.7 replica <em>不能 attach 8.0 primary</em>（向上不兼容）。Upgrade 順序必須 <em>replica 先升、primary 後升</em>。詳見 <a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">Replication Topology</a>。</p>
<h3 id="跟-innodb-tuning">跟 InnoDB Tuning</h3>
<p>8.0 InnoDB 改寫了 redo log（atomic、可動態調整）、<code>innodb_log_file_size</code> 升級後可以 <em>online 改</em>、不必停機。詳見 <a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">InnoDB Tuning</a>。</p>
<h3 id="跟-modern-sql-features">跟 Modern SQL Features</h3>
<p>8.0 補 CTE / window / JSON_TABLE / hash join — 是 <em>為什麼要升 8.0</em> 的 driver。詳見 <a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">Modern SQL Features</a>。</p>
<h3 id="跟-group-replication">跟 Group Replication</h3>
<p>GR 在 5.7 有、但 8.0 才成熟。Group Replication 的 <em>MySQL Shell + Router</em> 整套 stack 主要在 8.0 才完整。詳見 <a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">Group Replication</a>。</p>
<h3 id="跟-aurora--planetscale-等-managed">跟 Aurora / PlanetScale 等 managed</h3>
<p>從 5.7 升 8.0 是個好時機 <em>同時評估</em> 是否要遷 Aurora / PlanetScale — 既然要做 paradigm shift、不如一次到位。詳見 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">migrate-to-aurora</a> / <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">migrate-to-planetscale</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/replication-topology/" data-link-title="MySQL Replication Topology：async / semi-sync / GTID 不是三選一、是三個 trade-off 軸的疊加" data-link-desc="MySQL replication 不是「選 async 還是 semi-sync」、是 *durability / latency / consistency* 三個 trade-off 軸的疊加；GTID 是跨 mode 的 infrastructure layer、不是第三種 mode。本文走 3 軸取捨模型 → async / semi-sync 行為對比 → GTID 替代 binlog-position 的好處 → 配置 step-by-step → 5 production 踩雷（lag 暴衝 / semi-sync 退回 async / GTID gap / Loss-Less semi-sync 真的 loss-less / chained replication 雪崩）→ 跟 Aurora MySQL / Vitess / ProxySQL / Orchestrator 整合">MySQL Replication Topology</a>（升級順序 replica-first）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/modern-sql-features/" data-link-title="MySQL 8.0 Modern SQL：CTE / window function / JSON_TABLE 不是「終於跟上 PG」、是進入 SQL 工程深度的入場券" data-link-desc="MySQL 8.0 在 SQL 特性上 *終於補齊* CTE、window function、lateral derived table、JSON_TABLE、hash join 等現代 SQL 特性。本文走 5 個關鍵特性、各自實際 production 場景、跟 PostgreSQL 對應特性的行為差異（特別是 JSON_TABLE vs PG JSONB / jsonb_path_query）、配置 / migration 注意事項、5 production 踩雷（CTE 不 materialize / window function 大量 sort spill / JSON_TABLE 跟 generated column 取捨 / hash join 預設沒開 / recursive CTE 深度上限）">MySQL Modern SQL Features</a>（升 8.0 的主要 driver）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/group-replication/" data-link-title="MySQL Group Replication / InnoDB Cluster：single-primary vs multi-primary mode 對 transaction certification 的影響" data-link-desc="MySQL Group Replication 提供 synchronous multi-primary replication、用 Paxos-like Group Communication Engine（GCE）達成 quorum-based commit。但「multi-primary」不是「single-primary 多開幾個 write 入口」、是 *transaction conflict detection &#43; certification* 整個機制不同。本文走 GR 機制（GCE &#43; certification &#43; applier）、single-primary vs multi-primary mode、InnoDB Cluster 跟 MySQL Shell / Router 整合、5 production 踩雷（cert lag / write conflict / large transaction / network partition / member 加入 catch-up）、何時用 GR 何時用傳統 replication">MySQL Group Replication</a>（8.0 成熟）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/innodb-tuning/" data-link-title="MySQL InnoDB Tuning：為什麼一個 100 GB DB 在 64 GB RAM server 上 query 慢 5 倍" data-link-desc="InnoDB 是 MySQL 預設 storage engine、預設值給 256 MB buffer pool（早期 default）。本文從一個常見痛點開場（DB &gt; RAM 但 server 仍 swap）、走 4 個 critical knob（buffer pool / redo log / flush method / IO capacity）、各自如何影響讀寫吞吐、配置 step-by-step、5 production 踩雷（buffer pool warm-up / log file 大小 / 設 sync_binlog=0 換速度 / IO scheduler / undo log 膨脹）、跟 SSD / NVMe / EBS 的 IO 假設">MySQL InnoDB Tuning</a>（8.0 redo log 改寫）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">migrate-to-aurora</a> / <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">migrate-to-planetscale</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/major-version-upgrade/" data-link-title="PostgreSQL major version upgrade (14 → 17)：為什麼這篇不套 5 type migration" data-link-desc="PostgreSQL major version upgrade 是 *5 type 漏類* 的實證 — source/target 同 vendor、5 維度都 Low 但 *upgrade-specific audit* 是核心；本文結構接近 deep article methodology 的 6-section &#43; 額外 upgrade audit 段；涵蓋 pg_upgrade / logical replication / blue-green 三方法、extension 相容性、5 production 踩雷">PostgreSQL Major Version Upgrade</a>（PG sibling）</li>
<li>方法論：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a>（Type E paradigm shift）</li>
<li>官方：<a href="https://dev.mysql.com/doc/refman/8.0/en/upgrading.html">MySQL 8.0 Upgrade Guide</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-aurora/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-aurora/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration&lt;/a> playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&amp;#43;75% 效能改善的 production 證據">Aurora&lt;/a>。走 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type C operational hybrid 結構。每階段切換用 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate&lt;/a> 把關。&lt;/p>&lt;/blockquote>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Ops 責任&lt;/th>
 &lt;th>自管 MySQL&lt;/th>
 &lt;th>Aurora MySQL&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Storage&lt;/td>
 &lt;td>EBS / local SSD、自己選 + 監控&lt;/td>
 &lt;td>Aurora distributed storage（自動 6 份跨 3 AZ）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replication setup&lt;/td>
 &lt;td>binlog + semi-sync 自己配&lt;/td>
 &lt;td>Storage layer 自動、無 binlog replication&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failover&lt;/td>
 &lt;td>Orchestrator + VIP + fence script&lt;/td>
 &lt;td>Aurora 內建、&amp;lt; 30 秒 RTO&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup&lt;/td>
 &lt;td>mysqldump / Percona XtraBackup&lt;/td>
 &lt;td>自動 continuous backup、PITR&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Parameter tuning&lt;/td>
 &lt;td>my.cnf 自己改&lt;/td>
 &lt;td>Parameter group（部分 knob 鎖）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection limit&lt;/td>
 &lt;td>max_connections 自己設&lt;/td>
 &lt;td>看 instance class、有上限&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Auto scaling&lt;/td>
 &lt;td>不適用&lt;/td>
 &lt;td>Aurora Serverless v2 + read replica auto-scaling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-region&lt;/td>
 &lt;td>自己配 chained replication&lt;/td>
 &lt;td>Aurora Global Database&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Per-month cost&lt;/td>
 &lt;td>EC2 + EBS + 自己管 ops&lt;/td>
 &lt;td>Higher per-GB / per-IOPS、但 ops headcount saving&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>從 &lt;em>MySQL 角度&lt;/em> 看 Aurora MySQL：wire protocol 一致、SQL 一致、ORM 不必改、application 連 endpoint 字串以外幾乎不必動。從 &lt;em>Ops 角度&lt;/em> 看 Aurora MySQL：所有 storage / replication / failover knob 都 &lt;em>看不到也改不了&lt;/em>、整個 ops 心智模型重寫。&lt;/p>
&lt;p>這是 Type C operational hybrid 的典型 signature — &lt;em>schema / paradigm 接近、operational 完全不同&lt;/em>。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor <a href="/blog/backend/knowledge-cards/migration/" data-link-title="Migration" data-link-desc="說明系統如何把資料、流量或結構從舊狀態移到新狀態">migration</a> playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> 跟 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora</a>。走 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> Type C operational hybrid 結構。每階段切換用 <a href="/blog/backend/knowledge-cards/migration-gate/" data-link-title="Migration Gate" data-link-desc="說明遷移流程何時可以進入下一階段或正式切換">migration gate</a> 把關。</p></blockquote>
<table>
  <thead>
      <tr>
          <th>Ops 責任</th>
          <th>自管 MySQL</th>
          <th>Aurora MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Storage</td>
          <td>EBS / local SSD、自己選 + 監控</td>
          <td>Aurora distributed storage（自動 6 份跨 3 AZ）</td>
      </tr>
      <tr>
          <td>Replication setup</td>
          <td>binlog + semi-sync 自己配</td>
          <td>Storage layer 自動、無 binlog replication</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Orchestrator + VIP + fence script</td>
          <td>Aurora 內建、&lt; 30 秒 RTO</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>mysqldump / Percona XtraBackup</td>
          <td>自動 continuous backup、PITR</td>
      </tr>
      <tr>
          <td>Parameter tuning</td>
          <td>my.cnf 自己改</td>
          <td>Parameter group（部分 knob 鎖）</td>
      </tr>
      <tr>
          <td>Connection limit</td>
          <td>max_connections 自己設</td>
          <td>看 instance class、有上限</td>
      </tr>
      <tr>
          <td>Auto scaling</td>
          <td>不適用</td>
          <td>Aurora Serverless v2 + read replica auto-scaling</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>自己配 chained replication</td>
          <td>Aurora Global Database</td>
      </tr>
      <tr>
          <td>Per-month cost</td>
          <td>EC2 + EBS + 自己管 ops</td>
          <td>Higher per-GB / per-IOPS、但 ops headcount saving</td>
      </tr>
  </tbody>
</table>
<p>從 <em>MySQL 角度</em> 看 Aurora MySQL：wire protocol 一致、SQL 一致、ORM 不必改、application 連 endpoint 字串以外幾乎不必動。從 <em>Ops 角度</em> 看 Aurora MySQL：所有 storage / replication / failover knob 都 <em>看不到也改不了</em>、整個 ops 心智模型重寫。</p>
<p>這是 Type C operational hybrid 的典型 signature — <em>schema / paradigm 接近、operational 完全不同</em>。</p>
<h2 id="為什麼是-type-coperational-為主">為什麼是 Type C（operational 為主）</h2>
<p>跑 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit</a>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評</th>
          <th>說明</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>Low</td>
          <td>MySQL wire protocol + SQL 完全一致</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>High</td>
          <td>storage / replication / failover / backup ops 全部轉到 AWS</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Low</td>
          <td>同 OLTP relational paradigm</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Medium</td>
          <td>Aurora 加 storage layer / cluster endpoint / reader endpoint</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>Low</td>
          <td>主要 connection string + connection pool 設定</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low-Medium</td>
          <td>single-region scaling、跨 region 走 Global Database</td>
      </tr>
  </tbody>
</table>
<p>Operational = High（其他 Low） → <strong>Type C operational hybrid</strong>。Migration 路徑用 <em>4-phase drop-in cutover</em> + <em>operational re-onboarding</em>。</p>
<h2 id="drivertco--multi-az-ha--aws-integration">Driver：TCO + Multi-AZ HA + AWS integration</h2>
<p>從自管 MySQL 遷到 Aurora MySQL 的核心 driver：</p>
<ul>
<li><strong>TCO</strong>：自管 MySQL 真實 cost = EC2 + EBS + ops headcount（1-3 個 FTE 撐大 MySQL deployment）。Aurora per-GB / per-IOPS 比 EC2+EBS 貴 30-50%、但省 ops headcount、總帳通常 break-even 或更便宜</li>
<li><strong>Multi-AZ HA</strong>：Aurora storage 自動 6 份跨 3 AZ、failover &lt; 30 秒、不需要自管 Orchestrator + VIP + fence script</li>
<li><strong>AWS ecosystem integration</strong>：跟 Lambda / SAM / CloudFormation / IAM / Secrets Manager 整合、給 cloud-native architecture 加分</li>
<li><strong>Read scaling</strong>：Aurora 最多 15 個 read replica、storage layer 共享（不 replicate data、僅 replicate page cache）、read latency &lt; 10ms inter-replica</li>
</ul>
<p>不適合 <em>已用 Percona Server fork</em> 或 <em>需要 cross-cloud portability</em> 的 org — Aurora MySQL 是 AWS-only、且 fork 自 MySQL 5.7/8.0、跟 Percona 特性不完全一致。</p>
<h2 id="4-phase-migration">4-phase migration</h2>
<h3 id="phase-1aurora-cluster-起來作為-read-replica">Phase 1：Aurora cluster 起來作為 read replica</h3>
<p>最低風險入口：建 Aurora cluster、用 MySQL binlog 把 production 資料 stream 進 Aurora。Application 仍寫自管 MySQL primary、Aurora 作為 <em>external read replica</em>。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. 在 AWS 建 Aurora MySQL cluster</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">aws rds create-db-cluster <span class="se">\
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="se"></span>  --db-cluster-identifier prod-aurora <span class="se">\
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="se"></span>  --engine aurora-mysql <span class="se">\
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="se"></span>  --engine-version 8.0.mysql_aurora.3.04.0 <span class="se">\
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="se"></span>  --master-username admin <span class="se">\
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="se"></span>  --master-user-password ... <span class="se">\
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="se"></span>  --database-name production <span class="se">\
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="se"></span>  --vpc-security-group-ids sg-xxx <span class="se">\
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="se"></span>  --db-subnet-group-name prod-subnet
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># 2. 用 mysqldump 或 Percona XtraBackup 拿 baseline</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">mysqldump --single-transaction --master-data<span class="o">=</span><span class="m">2</span> --triggers --routines --events <span class="se">\
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="se"></span>  --all-databases &gt; baseline.sql
</span></span><span class="line"><span class="ln">15</span><span class="cl">
</span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="c1"># 3. Restore 到 Aurora</span>
</span></span><span class="line"><span class="ln">17</span><span class="cl">mysql -h prod-aurora.cluster-xxx.us-east-1.rds.amazonaws.com -u admin -p &lt; baseline.sql
</span></span><span class="line"><span class="ln">18</span><span class="cl">
</span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="c1"># 4. 設定 Aurora 從自管 MySQL 接 binlog</span>
</span></span><span class="line"><span class="ln">20</span><span class="cl">CALL mysql.rds_set_external_master<span class="o">(</span>
</span></span><span class="line"><span class="ln">21</span><span class="cl">  <span class="s1">&#39;self-managed-primary.example.com&#39;</span>, 3306,
</span></span><span class="line"><span class="ln">22</span><span class="cl">  <span class="s1">&#39;replication_user&#39;</span>, <span class="s1">&#39;password&#39;</span>,
</span></span><span class="line"><span class="ln">23</span><span class="cl">  <span class="s1">&#39;mysql-bin.000123&#39;</span>, 12345, <span class="m">0</span>
</span></span><span class="line"><span class="ln">24</span><span class="cl"><span class="o">)</span><span class="p">;</span>
</span></span><span class="line"><span class="ln">25</span><span class="cl">CALL mysql.rds_start_replication<span class="p">;</span></span></span></code></pre></div><p>完成標準：Aurora replica lag &lt; 1 秒、跟 production primary 同步。</p>
<h3 id="phase-2application-read-切到-aurora-reader-endpoint">Phase 2：Application read 切到 Aurora reader endpoint</h3>
<p>Application 仍寫自管 primary、但讀 query 切到 Aurora reader endpoint：</p>
<ul>
<li>Aurora reader endpoint：<code>prod-aurora.cluster-ro-xxx.us-east-1.rds.amazonaws.com</code></li>
<li>自動 round-robin 多個 read replica</li>
<li>ProxySQL 或 application config 改 read connection string</li>
</ul>
<p>跑 1-2 週、確認：</p>
<ul>
<li>Aurora read latency 跟自管 replica latency 接近（通常 Aurora 略好）</li>
<li>Aurora replication lag 穩定 &lt; 1 秒</li>
<li>Aurora query 結果跟自管 primary 一致（spot-check critical query）</li>
</ul>
<p>完成標準：所有 read traffic 都進 Aurora、no application bug。</p>
<h3 id="phase-3cutover--promote-aurora-primary">Phase 3：Cutover — promote Aurora primary</h3>
<p>Cutover window 內：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. 停 application 寫入（feature flag / scheduled maintenance）</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># 2. 等自管 primary 跟 Aurora 同步完成（檢查 Aurora replica lag = 0）</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 3. 把 Aurora 從 external replica 提升為獨立 primary</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">CALL mysql.rds_stop_replication<span class="p">;</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">CALL mysql.rds_reset_external_master<span class="p">;</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 4. Application 寫 connection string 切到 Aurora writer endpoint</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># prod-aurora.cluster-xxx.us-east-1.rds.amazonaws.com</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="c1"># 5. 開始 application traffic</span></span></span></code></pre></div><p>完成標準：寫入流量 100% 進 Aurora、自管 primary 變 idle。Cutover 通常需要 30-60 分鐘 maintenance window。</p>
<h3 id="phase-4decommission-自管-mysql">Phase 4：Decommission 自管 MySQL</h3>
<p>跑 1-2 週確認 Aurora 穩定後 <em>慢慢退役自管</em>：</p>
<ul>
<li>自管 primary 保留作 <em>cold backup</em>（1-3 個月）、不接 traffic、可隨時 rollback</li>
<li>Replica 一個一個關掉</li>
<li>監控 Aurora cost vs 預估、確認 break-even</li>
</ul>
<p>完成標準：自管 EC2 instance terminate、EBS volume snapshot 後 delete、cost 對比驗證符合預期。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-parameter-group-沒對齊--innodb_flush_log_at_trx_commit-等行為差">1. Parameter group 沒對齊 — <code>innodb_flush_log_at_trx_commit</code> 等行為差</h3>
<p>Aurora 的 <em>parameter group</em> 取代 my.cnf。預設 parameter group 不一定跟自管 MySQL 一致：</p>
<ul>
<li><code>innodb_flush_log_at_trx_commit</code>：自管常設 1（zero loss）、Aurora 預設仍 1 但走 <em>Aurora storage durability</em>（行為等價但不同 mechanism）</li>
<li><code>sync_binlog</code>：自管 1、Aurora <em>沒有 binlog 寫 disk</em> 概念（Aurora 不用 binlog 做 replication、binlog 是 <em>optional output</em>）</li>
<li><code>time_zone</code>：Aurora 預設 UTC、自管常設 local time、TIMESTAMP query 行為可能不同</li>
<li><code>character_set_*</code>：自管常設 utf8mb4、Aurora 預設可能是 latin1（看 cluster create 命令）</li>
</ul>
<p>修法：</p>
<ul>
<li>
<p>Phase 1 完成後 <em>逐 row 對比 parameter group</em>：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">@@</span><span class="k">global</span><span class="p">.</span><span class="n">variable_name</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="p">...</span></span></span></code></pre></div></li>
<li>
<p>建 <em>custom DB cluster parameter group</em>、匹配自管設定</p>
</li>
<li>
<p>重啟 Aurora primary 套 parameter group 改變（部分 parameter 需要重啟）</p>
</li>
</ul>
<h3 id="2-iam-authentication--application-沒準備">2. IAM authentication — application 沒準備</h3>
<p>Aurora 提供 <em>IAM authentication</em>（不用 password、用 AWS IAM role + temporary token）。Application 用 IAM auth 不必管 password rotation、但程式碼必須 <em>call AWS SDK 取 token、放 connection 設定</em>。</p>
<p>如果 Phase 2-3 期間沒 reverse engineer application connection logic、cutover 後 application 仍試用 password auth、Aurora 拒絕、production down。</p>
<p>修法：</p>
<ul>
<li>評估是否啟用 IAM auth — <em>簡單情況保留 password</em>、整合 AWS Secrets Manager 自動 rotation</li>
<li>啟用 IAM 必須 application code 改：
<ul>
<li>Java：<code>com.amazonaws.services.rds.auth.RdsIamAuthTokenGenerator</code></li>
<li>Python：<code>boto3.client('rds').generate_db_auth_token(...)</code></li>
<li>Go：<code>aws-sdk-go-v2/feature/rds/auth</code></li>
</ul>
</li>
<li>Phase 2 期間 application 對 Aurora 用 IAM token、self-managed 仍 password — 雙 path code</li>
</ul>
<h3 id="3-aurora-only-feature-寫進-applicationrollback-成本升高">3. Aurora-only feature 寫進 application、rollback 成本升高</h3>
<p>Migration 過程開發發現 Aurora 有 <em>Aurora-only feature</em>（Backtrack、Performance Insights、Aurora Global Database）、誘惑使用。一旦 application 用了 Aurora-only feature、要 rollback 自管 MySQL 變不可能（feature 不存在、query 失敗）。</p>
<p>常見 Aurora-only feature：</p>
<ul>
<li><em>Backtrack</em>：72 小時內 in-place rollback 整個 DB（不同於 PITR）</li>
<li><em>Aurora ML</em>：SQL function 內接 SageMaker / Comprehend</li>
<li><em>Aurora Parallel Query</em>：analytical query 跨 storage node 並行</li>
<li><em>Aurora Auto Scaling</em>：read replica 數量按 CPU 自動加減</li>
</ul>
<p>修法：</p>
<ul>
<li><em>Phase 1-3 期間禁用 Aurora-only feature</em>、保留 rollback option</li>
<li><em>Phase 4 完成後</em> 才開始 evaluate Aurora-only feature、加進來時 <em>明確記錄不可 rollback decision</em></li>
<li>把 Aurora-only feature 跟 <em>Aurora 特定 cluster</em> 綁定，避免 application 邏輯依賴 Aurora-only</li>
</ul>
<h3 id="4-read-replica-endpoint-behavior--application-不知道-reader-endpoint-round-robin">4. Read replica endpoint behavior — Application 不知道 reader endpoint round-robin</h3>
<p>Aurora reader endpoint（<code>prod-aurora.cluster-ro-xxx</code>）是 <em>DNS-based load balancer</em>、每次 DNS query 給不同 replica IP。Application connection pool 連續開 10 個 connection、可能全部連同一個 replica（DNS cache）、不均勻。</p>
<p>修法：</p>
<ul>
<li>Application connection pool 強制 <em>DNS re-resolve</em>（避免長時間 cache）</li>
<li>或用 <em>RDS Proxy</em>（managed connection pool）放在前面、不直接連 reader endpoint</li>
<li>或用 <em>Route 53 latency-based routing</em> 配 Aurora reader endpoint per AZ、application 連最近 AZ</li>
</ul>
<h3 id="5-region-failover--aurora-global-database-vs-自管-chained-replication">5. Region failover — Aurora Global Database vs 自管 chained replication</h3>
<p>自管 cross-region replication 是 <em>chained replication</em>（primary → region2 replica → region2 cascading replica）。Aurora Global Database 是 <em>storage-level replication</em>（storage page 直接 ship，而非 binlog）、跨 region &lt; 1 秒 lag、failover &lt; 1 分鐘。</p>
<p>但 Aurora Global Database 是 <em>active-passive</em>（primary region 可寫、secondary region 只讀）。如果原本自管已經 cross-region active-active write（用 multi-master 或應用層 sharding）、Aurora Global Database 的寫入模型會成為限制。</p>
<p>修法：</p>
<ul>
<li>評估 cross-region 是 <em>DR</em> 用途還是 <em>active write</em> 用途</li>
<li>純 DR + read scaling：Aurora Global Database 直接 cover</li>
<li>Active-active write：要 <em>Aurora DSQL</em>（2024 新推出、跟 Aurora 不同 product）或 distributed SQL（CockroachDB / Spanner）</li>
</ul>
<h2 id="capability-gap自管-mysql-有但-aurora-沒有">Capability gap：自管 MySQL 有但 Aurora 沒有</h2>
<table>
  <thead>
      <tr>
          <th>能力</th>
          <th>自管 MySQL</th>
          <th>Aurora MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Plugin 自己裝</td>
          <td>任意</td>
          <td>受限（Aurora 只允許官方支援）</td>
      </tr>
      <tr>
          <td>OS-level access</td>
          <td>完整 SSH access</td>
          <td>managed service，無 SSH access</td>
      </tr>
      <tr>
          <td>MySQL 8.0 latest patch</td>
          <td>你決定</td>
          <td>跟 Aurora major version 對應、有滯後</td>
      </tr>
      <tr>
          <td>InnoDB log_file_size</td>
          <td>自己改</td>
          <td>Aurora 內建 storage path</td>
      </tr>
      <tr>
          <td>Custom storage engine</td>
          <td>可（MyRocks / TokuDB）</td>
          <td>只 InnoDB（Aurora optimized）</td>
      </tr>
      <tr>
          <td>Cross-cloud DR</td>
          <td>自配 binlog ship</td>
          <td>Aurora-only (AWS region)</td>
      </tr>
  </tbody>
</table>
<p>評估時必須確認 <em>當前自管功能</em> 沒用到 Aurora 不支援的能力。如果在用 MyRocks 等 storage engine、Aurora migration 不可行。</p>
<h2 id="容量與成本對照">容量與成本對照</h2>
<p>對 100 GB DB、5K WPS、20 個 application instance 的 deployment：</p>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>自管 MySQL（EC2）</th>
          <th>Aurora MySQL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Primary instance</td>
          <td>r5.2xlarge（$0.50/hr）</td>
          <td>db.r6g.2xlarge（$0.83/hr）</td>
      </tr>
      <tr>
          <td>EBS / Aurora storage</td>
          <td>io2 100 GB + 5000 IOPS = ~$70/mo</td>
          <td>Aurora storage 100 GB = ~$10/mo + I/O $0.20/M</td>
      </tr>
      <tr>
          <td>Replica × 3</td>
          <td>3 × r5.2xlarge = $1080/mo</td>
          <td>3 × db.r6g.large = $540/mo</td>
      </tr>
      <tr>
          <td>Backup storage</td>
          <td>S3 + 自己 cron mysqldump ~$50/mo</td>
          <td>Aurora backup 100 GB 免費 + 額外 $0.021/GB</td>
      </tr>
      <tr>
          <td>Ops headcount</td>
          <td>1-2 FTE × $150K = $300-500K/yr</td>
          <td>&lt; 0.5 FTE × $150K = $75K/yr</td>
      </tr>
      <tr>
          <td><strong>Total infra</strong></td>
          <td>~$1500/mo + 大 ops cost</td>
          <td>~$2000-3000/mo + 小 ops cost</td>
      </tr>
  </tbody>
</table>
<p>Pure infra cost Aurora 貴 30-50%、但 <em>ops cost 降幅大過 infra increase</em> — 200 人 eng team 養 1.5 FTE DBA 是 $300K-400K/yr、Aurora 換成 0.3 FTE 是 $60K-100K/yr、差距 $200K+ 抵 infra increase。</p>
<p>小團隊 / 小 deployment Aurora 不一定划算 — 50 人 eng team 沒有 dedicated DBA、自管 MySQL 也只佔某人 20% 時間、Aurora migration 的 ops saving 不存在。</p>
<h2 id="production-casenetflix-aurora-consolidation">Production case：Netflix Aurora consolidation</h2>
<p>MySQL → Aurora migration 的 production 責任是把自管 database operation 轉移成 managed SQL 的契約，而非只搬 schema 與資料。<a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">9.C23 Netflix Aurora consolidation</a> 提供的工程訊號是多套 RDBMS 整併到 Aurora 後，效能、成本與操作責任一起改變。</p>
<p>這個案例要回收到三個操作判準。第一，migration driver 應寫成 operation transfer，例如 backup、failover、storage growth、patching 與 observability 由誰承擔。第二，效能與成本要一起看，因為 Aurora 的 storage / compute / I/O 計費會把原本藏在 DBA 操作裡的成本攤開。第三，整併多套 RDBMS 時要先做 feature inventory，確認 plugin、storage engine、charset、replication topology 與 SQL mode 都能落到 Aurora MySQL 支援範圍。</p>
<p>Netflix case 的 sibling 路由是 <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a> 與 <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a>。若 migration 目標從 managed SQL 變成 multi-region active-active write，應改接 <a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>。</p>
<h2 id="何時維持原路線">何時維持原路線</h2>
<ul>
<li><strong>Cross-cloud portability 是 requirement</strong>：Aurora AWS-only、要 cross-cloud 用 PlanetScale 或 自管</li>
<li><strong>用 Percona Server fork / MyRocks 等非標準 engine</strong>：Aurora 不支援</li>
<li><strong>需要 OS-level customization</strong>：Aurora 完全 managed、無 SSH</li>
<li><strong>規模太小</strong>：&lt; 100 GB / &lt; 1K WPS、自管 MySQL EC2 spot instance 已經夠便宜</li>
<li><strong>規模太大</strong>：&gt; 50 TB single DB / &gt; 100K WPS、Aurora single-instance 仍是 ceiling、考慮 Vitess 或 Aurora DSQL</li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>平行 batch：→ PlanetScale migration playbook（同 MySQL backlog、不同 target paradigm）</li>
<li>上游：<a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a> / <a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a></li>
<li>跨章節：<a href="/blog/backend/09-performance-capacity/capacity-planning/" data-link-title="9.6 容量規劃模型" data-link-desc="peak forecast、headroom budget、growth curve、autoscaling sizing">9.6 容量規劃模型</a> — Aurora cost forecast</li>
<li>既有 case：<a href="/blog/backend/09-performance-capacity/cases/netflix-aurora-consolidation/" data-link-title="9.C23 Netflix：把關聯式 DB 統一到 Aurora、效能 &#43;75%、成本 -28%" data-link-desc="Netflix 把多套關聯式 DB 統一到 Aurora、效能提升 75%、成本下降 28%、串流數十億小時">9.C23 Netflix Aurora consolidation</a> — Netflix 從多套 RDBMS 統一到 Aurora 的 migration evidence</li>
<li>方法論：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a>（Type C operational hybrid 結構說明）</li>
<li>官方：<a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Migrating.html">Aurora MySQL Migration Guide</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL → PlanetScale：managed Vitess + branch-based schema workflow 的 hybrid shift</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL&lt;/a> 跟 PlanetScale。走 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type E paradigm shift 結構。&lt;/p>&lt;/blockquote>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>自管 MySQL&lt;/th>
 &lt;th>PlanetScale&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Sharding&lt;/td>
 &lt;td>自己配 Vitess 或不 shard&lt;/td>
 &lt;td>Vitess 透明（即使單 keyspace 也走 Vitess）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema migration&lt;/td>
 &lt;td>gh-ost / pt-osc 跑 ALTER&lt;/td>
 &lt;td>&lt;strong>Branch + Deploy Request&lt;/strong> workflow&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failover&lt;/td>
 &lt;td>Orchestrator 自管&lt;/td>
 &lt;td>PlanetScale 自動&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Branching&lt;/td>
 &lt;td>不存在概念&lt;/td>
 &lt;td>&lt;strong>DB branch（git-like）+ revert&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection limit&lt;/td>
 &lt;td>max_connections 自己設&lt;/td>
 &lt;td>PlanetScale connection pool / per-plan limit&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Foreign key&lt;/td>
 &lt;td>支援&lt;/td>
 &lt;td>有限支援（Vitess 18+ / 2023 起、需明確啟用）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>SUPER&lt;/code> privilege&lt;/td>
 &lt;td>自己有&lt;/td>
 &lt;td>&lt;strong>無&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-region&lt;/td>
 &lt;td>自己配 binlog ship&lt;/td>
 &lt;td>PlanetScale 內建（Boost feature）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Per-month cost&lt;/td>
 &lt;td>EC2 + EBS + ops&lt;/td>
 &lt;td>per-row-read + per-row-written + storage&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>從 &lt;em>application 連線&lt;/em> 視角：跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">Aurora MySQL migration&lt;/a> 一樣低、connection string 換就完事。從 &lt;em>schema management&lt;/em> 視角：PlanetScale 強推 &lt;em>branch-based workflow&lt;/em> — 改 schema 不再是「跑 gh-ost」、是「開 branch → Deploy Request → review → merge」。整個 schema change 工作流跟 git 同型、跟 application code review 同 workflow。&lt;/p>
&lt;p>這是 &lt;em>workflow + schema-tooling shift&lt;/em> — Aurora 是「同 workflow + managed」、PlanetScale 是「同 protocol + 不同 schema workflow + branch tooling」。Database paradigm（OLTP relational）跟 application change 都 Low、主要 shift 在 DBA / dev 操作介面。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor migration playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL</a> 跟 PlanetScale。走 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> Type E paradigm shift 結構。</p></blockquote>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>自管 MySQL</th>
          <th>PlanetScale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sharding</td>
          <td>自己配 Vitess 或不 shard</td>
          <td>Vitess 透明（即使單 keyspace 也走 Vitess）</td>
      </tr>
      <tr>
          <td>Schema migration</td>
          <td>gh-ost / pt-osc 跑 ALTER</td>
          <td><strong>Branch + Deploy Request</strong> workflow</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>Orchestrator 自管</td>
          <td>PlanetScale 自動</td>
      </tr>
      <tr>
          <td>Branching</td>
          <td>不存在概念</td>
          <td><strong>DB branch（git-like）+ revert</strong></td>
      </tr>
      <tr>
          <td>Connection limit</td>
          <td>max_connections 自己設</td>
          <td>PlanetScale connection pool / per-plan limit</td>
      </tr>
      <tr>
          <td>Foreign key</td>
          <td>支援</td>
          <td>有限支援（Vitess 18+ / 2023 起、需明確啟用）</td>
      </tr>
      <tr>
          <td><code>SUPER</code> privilege</td>
          <td>自己有</td>
          <td><strong>無</strong></td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>自己配 binlog ship</td>
          <td>PlanetScale 內建（Boost feature）</td>
      </tr>
      <tr>
          <td>Per-month cost</td>
          <td>EC2 + EBS + ops</td>
          <td>per-row-read + per-row-written + storage</td>
      </tr>
  </tbody>
</table>
<p>從 <em>application 連線</em> 視角：跟 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">Aurora MySQL migration</a> 一樣低、connection string 換就完事。從 <em>schema management</em> 視角：PlanetScale 強推 <em>branch-based workflow</em> — 改 schema 不再是「跑 gh-ost」、是「開 branch → Deploy Request → review → merge」。整個 schema change 工作流跟 git 同型、跟 application code review 同 workflow。</p>
<p>這是 <em>workflow + schema-tooling shift</em> — Aurora 是「同 workflow + managed」、PlanetScale 是「同 protocol + 不同 schema workflow + branch tooling」。Database paradigm（OLTP relational）跟 application change 都 Low、主要 shift 在 DBA / dev 操作介面。</p>
<h2 id="為什麼是-type-eparadigm--operational--schema-多軸">為什麼是 Type E（Paradigm + Operational + Schema 多軸）</h2>
<p>跑 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit</a>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評</th>
          <th>說明</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>Medium-High</td>
          <td>MySQL wire protocol 一致、FK 有限支援（Vitess 18+）、部分 INSTANT DDL 行為差</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>High</td>
          <td>branch lifecycle、Deploy Request workflow、connection pooler 不同</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>High</td>
          <td>branch-based schema management、跟自管 gh-ost / pt-osc 思維完全不同</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Medium</td>
          <td>PlanetScale CLI / Console / API / connection pooler 都進團隊工具</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>Low</td>
          <td>connection string + 移除 FK 約束</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low-Medium</td>
          <td>Vitess 透明 sharding 即使單 keyspace</td>
      </tr>
  </tbody>
</table>
<p>Paradigm + Operational + Schema 三軸 High。按優先序 Schema &gt; Paradigm &gt; Operational、預設選 Type A。但 <em>讀者最關心</em> 的是 schema workflow paradigm 轉變、不是 schema field translation — Type E 結構更貼合「不收斂、部分 adopt」的真實 migration 流程。</p>
<p>→ <strong>Type E paradigm shift</strong>、4-phase partial migration（多數 org 停 Phase 2-3 hybrid）。</p>
<h2 id="driverbranch-based-workflow--vitess-transparent-sharding--zero-dba">Driver：Branch-based workflow + Vitess transparent sharding + zero DBA</h2>
<p>從自管 MySQL 遷 PlanetScale 的核心 driver 有三條：</p>
<p><strong>Branch-based schema workflow</strong>：</p>
<ul>
<li>改 schema 開 branch（<code>pscale branch create</code>）、在 branch 上跑 ALTER、跑 application code 改、merge 進 main 前 Deploy Request review</li>
<li>Deploy Request 顯示 schema diff、跟 GitHub PR 同概念</li>
<li>Merge 後 PlanetScale 自動跑 <em>no-downtime schema migration</em>（內部 VReplication）</li>
<li>出問題可 <em>revert</em>（48 小時內、用 Vitess VReplication 反向 ship 資料）</li>
</ul>
<p>這條 workflow 對 <em>developer ergonomic</em> 拉力大 — schema change 不再是「DBA 工作」、是「dev 自己處理、跟 code review 同流程」。</p>
<p><strong>Vitess transparent sharding</strong>：</p>
<ul>
<li>PlanetScale 強制每個 cluster 走 Vitess（即使單 keyspace 看似 unsharded）</li>
<li>寫吞吐成長到需要 shard 時、加 shard 是 PlanetScale internal 操作、application 看不到</li>
<li>不用養 Vitess SRE 團隊</li>
</ul>
<p><strong>Zero DBA</strong>：</p>
<ul>
<li>PlanetScale 接管所有 ops（failover / backup / parameter / scaling）</li>
<li>跟 Aurora 同等級「managed」、加上 branch workflow</li>
</ul>
<p>FK 處理：早期 Vitess（&lt; 18）不支援 FK、PlanetScale 對應期間建議全 drop FK + 改 application enforcement。Vitess 18（2023 末）後加 FK 支援、PlanetScale 在合適 plan 內可啟用、但 <em>cross-shard FK</em> 仍受限。Phase 1 audit 重點不再是「全 drop FK」、而是「驗證 FK 行為（特別 cascade / cross-shard）跟自管 MySQL 預期一致」。</p>
<h2 id="4-phase-partial-migration不收斂">4-phase partial migration（不收斂）</h2>
<h3 id="phase-1fk-行為驗證--schema-auditplanetscale-shadow-cluster-起來">Phase 1：FK 行為驗證 + schema audit、PlanetScale shadow cluster 起來</h3>
<p>第一步是 <em>FK 行為驗證</em> + schema layout audit。Vitess 18+ / PlanetScale 已支援 FK、但行為跟自管 MySQL 有差異：</p>
<ul>
<li>列所有 FK：<code>SELECT * FROM information_schema.KEY_COLUMN_USAGE WHERE REFERENCED_TABLE_NAME IS NOT NULL</code></li>
<li>對每個 FK 評估：
<ul>
<li><em>Cross-shard FK</em>：PlanetScale 不允許 FK 跨 shard、parent 跟 child 必須同 shard（透過 Vindex 設計）</li>
<li><em>Cascade 行為</em>：cross-shard DELETE cascade 在 PlanetScale 不執行、改 application 層處理</li>
<li><em>Native FK 啟用 vs application enforcement</em>：依 Vitess 18+ 行為決定保留 FK 或改 app-level</li>
</ul>
</li>
<li><em>PlanetScale shadow cluster</em> 起來、跑 application schema、用 Vitess Connector 從自管 binlog ship 資料</li>
</ul>
<p>工作主要塊：</p>
<ul>
<li>FK 行為 audit + 改 cross-shard cascade（依 FK 數量、weeks 工作量）</li>
<li>Schema dump → PlanetScale import（用 <code>pscale shell</code>）</li>
<li>Vitess Connector 設定 binlog stream</li>
</ul>
<p>完成標準：PlanetScale shadow cluster 有完整 production schema、cross-shard FK 已處理、binlog stream lag &lt; 1 秒。</p>
<h3 id="phase-2read-traffic-切-planetscale">Phase 2：Read traffic 切 PlanetScale</h3>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">Aurora migration</a> Phase 2 同概念：read query 切 PlanetScale connection string、寫入仍自管 MySQL。</p>
<p>差異：</p>
<ul>
<li>PlanetScale connection 有 <em>per-plan rate limit</em>（Scaler Plan: 10K connections、Enterprise: 100K）</li>
<li>必須走 <em>PlanetScale connection pool</em>（不是直接連、有 SSL handshake overhead）</li>
<li>監控 <code>pscale_io_read_query_throttled_total</code> 確認沒撞 plan limit</li>
</ul>
<p>跑 2-4 週、確認：</p>
<ul>
<li>PlanetScale read latency 跟自管 replica latency 接近（PlanetScale Boost cache 可能比自管快）</li>
<li>Vitess Connector stream 穩定</li>
<li>Application 對 PlanetScale row read 量符合 cost forecast</li>
</ul>
<h3 id="phase-3schema-workflow-切-planetscale--write-cutover">Phase 3：Schema workflow 切 PlanetScale + write cutover</h3>
<p>關鍵 paradigm shift：<em>停 gh-ost / pt-osc</em>、改用 PlanetScale branch workflow。</p>
<p>訓練步驟：</p>
<ol>
<li><em>第一個 small schema change</em> 用 PlanetScale branch + Deploy Request 跑</li>
<li>開發團隊熟悉 <code>pscale branch create</code> / <code>pscale deploy-request create</code> CLI</li>
<li>CI integration：把 PlanetScale CLI 加進 deploy pipeline</li>
<li>退役 gh-ost / pt-osc CI integration</li>
</ol>
<p>完成 schema workflow 訓練後 write cutover：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. PlanetScale 把 shadow cluster promote 為 primary（用 PlanetScale console / API）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"># 透過 PlanetScale Console 啟用 production write 或用 `pscale` CLI 對應 promotion 命令</span>
</span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="c1"># （CLI 命令名稱隨 pscale 版本變動、以 pscale --help 為準）</span>
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># 2. Application connection string 切 PlanetScale writer</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"># 自管 → mysql://primary.example.com:3306/production</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># PlanetScale → mysql://...@xxx.connect.psdb.cloud/production?sslaccept=strict</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl">
</span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="c1"># 3. Vitess Connector 反向（PlanetScale → 自管）作為 rollback insurance</span></span></span></code></pre></div><p>完成標準：寫入流量 100% 進 PlanetScale、自管 MySQL 接 PlanetScale binlog（rollback buffer）。</p>
<h3 id="phase-4自管-mysql-退役--保留作-rollback-buffer">Phase 4：自管 MySQL 退役 / 保留作 rollback buffer</h3>
<p>跟 Aurora migration Phase 4 同模式：</p>
<ul>
<li>自管保留 30-90 天作 cold buffer</li>
<li>確認 PlanetScale cost forecast 跟 actual 一致（per-row read / write 計費可能超預期）</li>
<li>確認 branch workflow 在 production team 內 adopt（不是「PlanetScale 在用、但團隊還是用 gh-ost on staging」這種 stuck 狀態）</li>
</ul>
<p>多數 org 在 <em>Phase 3</em> 停留更久（半年-一年）— Vitess Connector 反向 binlog ship 是穩定 rollback path、Phase 4 不急。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-cross-shard-fk--planetscale-跟-native-mysql-行為不同">1. Cross-shard FK — PlanetScale 跟 native MySQL 行為不同</h3>
<p>Vitess 18+ / PlanetScale 已支援 FK、但 <em>cross-shard cascade</em> 不執行。同 shard 內 FK 跟 native MySQL 一致；parent 跟 child 跨 shard 時、<code>ON DELETE CASCADE</code> 在 PlanetScale 不會跨 shard 觸發 child delete、結果 application 看到 <em>orphan row</em>。</p>
<p>修法：</p>
<ul>
<li>Phase 1 audit 出哪些 FK 跨 shard（Vindex 設計決定 parent / child 是否同 shard）</li>
<li>同 shard FK：直接保留、行為跟自管 MySQL 一致</li>
<li>Cross-shard cascade：改 application 層 transaction 內 explicit DELETE child、或 <em>background reconciliation job</em>（定期掃 orphan）</li>
<li>把 <em>parent / child 強制同 shard</em>（用相同 Vindex column）是預防 cross-shard FK 議題的根本解</li>
</ul>
<h3 id="2-deploy-request-思維轉換不到位--團隊仍用跑-alter心智模型">2. Deploy Request 思維轉換不到位 — 團隊仍用「跑 ALTER」心智模型</h3>
<p>DBA / SRE 習慣 <em>直接連 PlanetScale 跑 ALTER</em> —但 PlanetScale 在 production branch 上 <em>禁止 DDL</em>（必須走 Deploy Request）。失敗訊息 <em>not actionable</em>（ERROR: not authorized）、DBA 找不到原因、production maintenance 卡住。</p>
<p>修法：</p>
<ul>
<li>Phase 3 <em>訓練步驟</em> 不能跳：找一個 small schema change 在 staging 走完整 branch workflow、團隊每個 DBA / SRE 都 hands-on 過</li>
<li>在 ops runbook 寫明 <em>production schema change must go through Deploy Request</em>、列 CLI 命令模板</li>
<li>緊急 schema change（事故中）也走 branch + Deploy Request、PlanetScale 可加速 Deploy（不能 bypass workflow）</li>
</ul>
<h3 id="3-schema-diff-邊界--planetscale-看不到-application-level-insert-changes">3. Schema diff 邊界 — PlanetScale 看不到 application-level INSERT changes</h3>
<p>Deploy Request 顯示 <em>schema-level diff</em>（CREATE / ALTER / DROP）、不顯示 <em>data diff</em>。如果 branch 上有 INSERT 進去（測試資料 / seed data）、merge 進 main 時 <em>資料不會搬</em>（只搬 schema）、application 預期有資料但 production 沒。</p>
<p>修法：</p>
<ul>
<li>把 <em>seed data INSERT</em> 放 application migration / fixture、不在 PlanetScale branch 內</li>
<li>用 PlanetScale CLI <em>export branch data</em> 跟 <em>import to main</em>（手動操作）作為 escape hatch</li>
<li>教育團隊：PlanetScale branch = <em>schema branch</em>、不是 git-like <em>data branch</em></li>
</ul>
<h3 id="4-branch-lifecycle-ops-cost--100-個-stale-branch">4. Branch lifecycle ops cost — 100 個 stale branch</h3>
<p>每個 PR 都開一個 PlanetScale branch、PR merge 後忘記刪、累積 100 個 stale branch。每個 branch 佔 storage cost、PlanetScale plan limit 也限制 branch 數量。</p>
<p>修法：</p>
<ul>
<li>CI integration：PR close 自動 <code>pscale branch delete &lt;branch-name&gt;</code></li>
<li>設 <em>branch retention policy</em>（30 天無活動自動刪）</li>
<li>監控 <code>pscale branch list | wc -l</code> 數量、超 threshold alert</li>
<li>把 branch lifecycle 寫進 <em>team playbook</em>（不是 PlanetScale 教、是團隊內部規範）</li>
</ul>
<h3 id="5-無-super-privilege--部分操作不可行">5. 無 <code>SUPER</code> privilege — 部分操作不可行</h3>
<p>PlanetScale connection 拿到的 MySQL user 沒有 <code>SUPER</code> privilege。需要 <code>SUPER</code> 的操作直接失敗：</p>
<ul>
<li><code>SET GLOBAL</code>（不能改 runtime variable）</li>
<li><code>KILL</code> 別人的 query（PlanetScale console 提供 alt 介面）</li>
<li><code>SHOW MASTER STATUS</code> / <code>SHOW SLAVE STATUS</code>（PlanetScale 抽象掉、不暴露）</li>
<li><code>INSTALL PLUGIN</code>（managed、不允許）</li>
<li><code>STOP SLAVE</code> / <code>START SLAVE</code>（Vitess 內部）</li>
</ul>
<p>修法：</p>
<ul>
<li>評估 application 跟 ops tool 是否依賴 <code>SUPER</code> privilege</li>
<li>改用 PlanetScale console / API 等價操作</li>
<li>部分監控 query（<code>SHOW SLAVE STATUS</code>）用 <em>PlanetScale 內建 dashboard</em> 代替</li>
</ul>
<h2 id="schema-translation-主要工作量塊">Schema translation 主要工作量塊</h2>
<p>雖然 Type E 結構不以 schema translation 為主、但 schema diff 在 Phase 1 仍佔多數時間：</p>
<table>
  <thead>
      <tr>
          <th>自管 MySQL</th>
          <th>PlanetScale (Vitess)</th>
          <th>翻譯難度</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FOREIGN KEY constraint</td>
          <td>（無）+ application enforcement</td>
          <td>高</td>
      </tr>
      <tr>
          <td>INSTANT DDL</td>
          <td>部分支援、其他走 Vitess online DDL</td>
          <td>低-中</td>
      </tr>
      <tr>
          <td>Stored procedure</td>
          <td>支援</td>
          <td>低</td>
      </tr>
      <tr>
          <td>Trigger</td>
          <td>支援</td>
          <td>低</td>
      </tr>
      <tr>
          <td>User-defined function</td>
          <td>受限</td>
          <td>中</td>
      </tr>
      <tr>
          <td>INSERT 跨表（CTE）</td>
          <td>支援</td>
          <td>低</td>
      </tr>
      <tr>
          <td>Cross-shard JOIN</td>
          <td>必須用 Vindex（user_id 等 shard key 同表）</td>
          <td>中-高</td>
      </tr>
      <tr>
          <td><code>SUPER</code> 行為</td>
          <td>不支援</td>
          <td>中（ops tool 改）</td>
      </tr>
      <tr>
          <td><code>RELOAD</code> privilege</td>
          <td>不支援</td>
          <td>中</td>
      </tr>
  </tbody>
</table>
<h2 id="容量與成本對照">容量與成本對照</h2>
<p>PlanetScale 計費 <em>很不同</em>：</p>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>自管 MySQL（EC2）</th>
          <th>PlanetScale Scaler Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Per-row read</td>
          <td>不計費</td>
          <td>按量計費、$1 per 1B row read</td>
      </tr>
      <tr>
          <td>Per-row written</td>
          <td>不計費</td>
          <td>按量計費、$1.50 per 1M row write</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>EBS、$0.10/GB-month</td>
          <td>$1.50/GB-month + replication overhead</td>
      </tr>
      <tr>
          <td>Connection limit</td>
          <td>max_connections 自己設</td>
          <td>per-plan limit、可加 Connection pooler</td>
      </tr>
      <tr>
          <td>Branch</td>
          <td>不適用</td>
          <td>每 branch 含 storage cost</td>
      </tr>
      <tr>
          <td>Boost cache</td>
          <td>不適用</td>
          <td>additional cost</td>
      </tr>
      <tr>
          <td>Ops headcount</td>
          <td>1-2 FTE</td>
          <td>&lt; 0.2 FTE</td>
      </tr>
  </tbody>
</table>
<p>PlanetScale 適合 <em>小-中規模 + high developer productivity priority</em>：</p>
<ul>
<li>流量 &lt; 10K WPS：cost 接近自管、developer productivity 顯著提升</li>
<li>流量 10-50K WPS：cost 開始貴、但 ops saving 仍大於 cost increase</li>
<li>流量 &gt; 100K WPS：PlanetScale Enterprise 議價、要 commit pricing</li>
</ul>
<p>對 high-traffic 場景 cost forecast 必須跑 <em>真實 workload trace</em> — PlanetScale 提供 <code>pscale analytics</code> 預估 read / write 量、用 production binlog replay 在 staging 跑、估算 row read / write 計費。</p>
<h2 id="何時不要遷">何時不要遷</h2>
<ul>
<li><strong>FK 是 application core constraint</strong>：cascade DELETE / SET NULL 廣泛使用、application 改不動</li>
<li><strong>大量 <code>SUPER</code>-required ops 自動化</strong>：DBA tools / monitoring 寫死 <code>SUPER</code>、改不動</li>
<li><strong>OS-level customization 需求</strong>：跟 Aurora 一樣、PlanetScale 完全 managed</li>
<li><strong>流量極大 + 預算敏感</strong>：&gt; 100K WPS row read 計費可能比 EC2 貴 5x、需要 Enterprise commit pricing</li>
<li><strong>跨雲 portability 是 requirement</strong>：PlanetScale 跑在自家 cloud（背後 AWS / GCP）、不像自管 Vitess 可跨雲</li>
</ul>
<h2 id="跟-aurora-mysql-對比同-batch-的選擇">跟 Aurora MySQL 對比（同 batch 的選擇）</h2>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Aurora MySQL</th>
          <th>PlanetScale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Type</td>
          <td>C operational hybrid</td>
          <td>E paradigm shift</td>
      </tr>
      <tr>
          <td>工作量主軸</td>
          <td>parameter group + IAM + endpoint</td>
          <td>FK audit + branch workflow</td>
      </tr>
      <tr>
          <td>Sharding</td>
          <td>不 shard、single-region scaling</td>
          <td>Vitess 透明 sharding</td>
      </tr>
      <tr>
          <td>Schema workflow</td>
          <td>仍用 gh-ost / pt-osc</td>
          <td>Branch + Deploy Request</td>
      </tr>
      <tr>
          <td>FK</td>
          <td>支援</td>
          <td>不支援</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>per-hour instance + per-GB storage</td>
          <td>per-row read / write + per-GB storage</td>
      </tr>
      <tr>
          <td>適合規模</td>
          <td>100 GB - 50 TB</td>
          <td>100 GB - 1 PB</td>
      </tr>
      <tr>
          <td>跨雲</td>
          <td>AWS-only</td>
          <td>PlanetScale 背後 AWS / GCP</td>
      </tr>
  </tbody>
</table>
<p>選擇邏輯：</p>
<ul>
<li><em>AWS-heavy ecosystem + 不想 schema workflow paradigm shift</em> → Aurora</li>
<li><em>Developer-first culture + 想 branch-based schema workflow + 接受 FK 限制</em> → PlanetScale</li>
</ul>
<p>兩者不互斥、有 org 用 Aurora 給 OLTP core、PlanetScale 給 newer microservices（branch workflow 帶價值）。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li>平行 batch：→ Aurora MySQL migration playbook（同 batch、不同 paradigm）</li>
<li>上游：<a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a> / <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding 設計</a></li>
<li>跨章節：<a href="/blog/backend/06-reliability/performance-regression-gate/" data-link-title="6.13 Performance Regression Gate" data-link-desc="把效能 baseline 從一次性壓測變成持續對齊的 release gate，涵蓋 baseline 設定、判讀方法、variance 控制與退化定位">6.13 Performance Regression Gate</a> — Deploy Request workflow 對 release gate 的影響</li>
<li>既有 vendor 對照：<a href="/blog/backend/01-database/vendors/aurora/" data-link-title="AWS Aurora" data-link-desc="AWS managed PostgreSQL / MySQL、storage / compute 分離、&#43;75% 效能改善的 production 證據">Aurora vendor page</a> / <a href="https://planetscale.com/">PlanetScale 官方</a></li>
<li>方法論：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a>（Type E paradigm shift 結構說明）</li>
<li>官方：<a href="https://planetscale.com/docs/imports/migrate-from-mysql">PlanetScale Migration Guide</a></li>
</ul>
]]></content:encoded></item><item><title>自管 Vitess → PlanetScale：Vitess component ops outsource、加 schema workflow shift</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-vitess-to-planetscale/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-vitess-to-planetscale/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link 到 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding&lt;/a> 跟 PlanetScale。走 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology&lt;/a> Type C operational hybrid 結構。&lt;/p>&lt;/blockquote>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>元件&lt;/th>
 &lt;th>自管 Vitess&lt;/th>
 &lt;th>PlanetScale&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>VTGate&lt;/td>
 &lt;td>自己部署 + LB&lt;/td>
 &lt;td>Managed、隱藏在 PlanetScale endpoint 後&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>VTTablet&lt;/td>
 &lt;td>自己 per-MySQL deploy&lt;/td>
 &lt;td>Managed&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>VReplication&lt;/td>
 &lt;td>自己 trigger workflow&lt;/td>
 &lt;td>Managed、透過 Console / API&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>VSchema&lt;/td>
 &lt;td>自己維護（YAML / API）&lt;/td>
 &lt;td>Managed、Console UI 編輯&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>MySQL backend&lt;/td>
 &lt;td>自己 EC2 / on-prem&lt;/td>
 &lt;td>Managed (Aurora-like underlying)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema migration&lt;/td>
 &lt;td>gh-ost / pt-osc 或 Vitess online DDL&lt;/td>
 &lt;td>&lt;strong>Branch + Deploy Request workflow&lt;/strong>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failover&lt;/td>
 &lt;td>自己用 VTOrc&lt;/td>
 &lt;td>Managed&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Multi-region&lt;/td>
 &lt;td>自己配 VReplication 跨 region&lt;/td>
 &lt;td>Boost / per-region cluster&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cost model&lt;/td>
 &lt;td>EC2 + EBS + ops headcount&lt;/td>
 &lt;td>Per-row read / write + storage&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>這條 migration 跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">→ Aurora MySQL&lt;/a> 相似（self-managed → managed），但 &lt;em>target 是 Vitess-native managed&lt;/em>、保留 sharding 能力。同時加上 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &amp;#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &amp;#43; Operational &amp;#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">→ PlanetScale from self-managed MySQL&lt;/a> 的 branch workflow paradigm。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是跨 vendor migration playbook、cross-link 到 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding</a> 跟 PlanetScale。走 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a> Type C operational hybrid 結構。</p></blockquote>
<table>
  <thead>
      <tr>
          <th>元件</th>
          <th>自管 Vitess</th>
          <th>PlanetScale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VTGate</td>
          <td>自己部署 + LB</td>
          <td>Managed、隱藏在 PlanetScale endpoint 後</td>
      </tr>
      <tr>
          <td>VTTablet</td>
          <td>自己 per-MySQL deploy</td>
          <td>Managed</td>
      </tr>
      <tr>
          <td>VReplication</td>
          <td>自己 trigger workflow</td>
          <td>Managed、透過 Console / API</td>
      </tr>
      <tr>
          <td>VSchema</td>
          <td>自己維護（YAML / API）</td>
          <td>Managed、Console UI 編輯</td>
      </tr>
      <tr>
          <td>MySQL backend</td>
          <td>自己 EC2 / on-prem</td>
          <td>Managed (Aurora-like underlying)</td>
      </tr>
      <tr>
          <td>Schema migration</td>
          <td>gh-ost / pt-osc 或 Vitess online DDL</td>
          <td><strong>Branch + Deploy Request workflow</strong></td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>自己用 VTOrc</td>
          <td>Managed</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>自己配 VReplication 跨 region</td>
          <td>Boost / per-region cluster</td>
      </tr>
      <tr>
          <td>Cost model</td>
          <td>EC2 + EBS + ops headcount</td>
          <td>Per-row read / write + storage</td>
      </tr>
  </tbody>
</table>
<p>這條 migration 跟 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">→ Aurora MySQL</a> 相似（self-managed → managed），但 <em>target 是 Vitess-native managed</em>、保留 sharding 能力。同時加上 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">→ PlanetScale from self-managed MySQL</a> 的 branch workflow paradigm。</p>
<p>對 <em>已花心力建 Vitess team 但 ops cost 太大</em> 的 org 來說、這條 migration 比 <em>Vitess → distributed SQL</em> 風險低、保留 sharding investment。</p>
<h2 id="為什麼是-type-c不是-type-a-或-type-e">為什麼是 Type C（不是 Type A 或 Type E）</h2>
<p>跑 <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/#%e5%af%ab%e5%89%8d%e7%9a%84-diff-dimension-audit" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">6 維 diff dimension audit</a>：</p>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>評</th>
          <th>說明</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Schema</td>
          <td>Low</td>
          <td>Vitess wire protocol + VSchema 概念一致</td>
      </tr>
      <tr>
          <td>Operational</td>
          <td>High</td>
          <td>4 個 component 的 ops 全部 outsource、branch workflow 是新 paradigm</td>
      </tr>
      <tr>
          <td>Paradigm</td>
          <td>Medium</td>
          <td>Vitess paradigm 不變、但加 branch workflow</td>
      </tr>
      <tr>
          <td>Components</td>
          <td>Low</td>
          <td>同 Vitess engine</td>
      </tr>
      <tr>
          <td>App change</td>
          <td>Low</td>
          <td>Connection string 改、無 schema rewrite</td>
      </tr>
      <tr>
          <td>Topology</td>
          <td>Low</td>
          <td>Vitess sharding 結構保留</td>
      </tr>
  </tbody>
</table>
<p>Operational = High（其他 Low / Medium） → <strong>Type C operational hybrid</strong>。Branch workflow 是 <em>Medium paradigm shift</em> 但不是 dominant — 主要工作量在 <em>operational ownership 轉移</em>。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">自管 MySQL → PlanetScale</a>（Type E paradigm shift）對比：那條 path 是 <em>no-Vitess → Vitess + branch</em>、要學 Vitess 概念 + branch；本條是 <em>已有 Vitess + 加 branch</em>、只學 branch、複雜度低很多。</p>
<h2 id="driverops-headcount--branch-workflow--vitess-feature-加速">Driver：Ops headcount + Branch workflow + Vitess feature 加速</h2>
<p>從自管 Vitess 遷 PlanetScale 的核心 driver：</p>
<p><strong>Ops headcount 削減</strong>：</p>
<ul>
<li>自管 Vitess 通常需要 <em>2-5 個 SRE/DBA 撐 production</em> —VTGate / VTTablet / VReplication / VSchema 各有議題</li>
<li>PlanetScale 把這層全部 outsource、團隊 ops headcount 可降到 &lt; 1 FTE</li>
<li>對 50-200 人 eng team、ops cost saving 是顯著 driver</li>
</ul>
<p><strong>Branch workflow paradigm</strong>：</p>
<ul>
<li>自管 Vitess 仍用 gh-ost / pt-osc 或 Vitess online DDL 跑 schema migration、是 DBA 主導</li>
<li>PlanetScale branch workflow 把 schema migration 變 <em>developer self-service</em>、開 branch / Deploy Request / merge、跟 git workflow 同節奏</li>
<li>對 <em>high-velocity engineering culture</em> 是文化升級</li>
</ul>
<p><strong>Vitess upstream feature</strong>：</p>
<ul>
<li>PlanetScale team 是 Vitess 的主要 contributor、新 feature 通常 PlanetScale 先 ship</li>
<li>自管 Vitess 升級慢、PlanetScale 用戶看到新 feature 早 3-6 個月</li>
</ul>
<p>不適合 <em>跨雲 portability priority high</em> 或 <em>strict on-prem deployment</em> 的 org — PlanetScale 是 cloud-only。</p>
<h2 id="4-phase-migration">4-phase migration</h2>
<h3 id="phase-1topology--vschema-audit">Phase 1：Topology + VSchema audit</h3>
<p>把當前自管 Vitess cluster 完整盤點：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># Vitess cluster topology</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">vtctldclient GetKeyspaces
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">vtctldclient GetShards &lt;keyspace&gt;
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">vtctldclient GetTablets
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># VSchema</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">vtctldclient GetVSchema &lt;keyspace&gt;
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="c1"># 跨 keyspace VReplication workflow</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">vtctldclient GetWorkflows</span></span></code></pre></div><p>對每個 keyspace 檢查：</p>
<ul>
<li><em>Shard 數量</em>：PlanetScale plan 對 shard 數量有 limit（Enterprise 才能超大規模）</li>
<li><em>VSchema features</em>：自管可能用 <em>PlanetScale 不支援的 Vindex</em>（custom Vindex）</li>
<li><em>Foreign key</em>：Vitess 18+（2023 末）才支援 FK、自管 Vitess 大多 &lt; 18、cluster 內已 application-enforced；遷 PlanetScale 後可選擇啟用 native FK（同 shard 內）或繼續 application enforcement</li>
<li><em>Stored procedure / trigger</em>：PlanetScale 受限、確認是否 application 依賴</li>
</ul>
<p>完成標準：寫 <em>blocker list</em>（PlanetScale 不支援的功能）+ <em>compatibility list</em>（功能對應）。</p>
<h3 id="phase-2dual-cluster--binlog-stream">Phase 2：Dual cluster + binlog stream</h3>
<p>PlanetScale 內建 <em>Vitess Connector</em>、從外部 MySQL（包括其他 Vitess cluster）binlog stream import：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 1. 用 PlanetScale CLI 建 cluster</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">pscale database create production --region us-east
</span></span><span class="line"><span class="ln">3</span><span class="cl">
</span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"># 2. Import schema（從自管 Vitess export）</span>
</span></span><span class="line"><span class="ln">5</span><span class="cl">pscale shell production main &lt; schema.sql
</span></span><span class="line"><span class="ln">6</span><span class="cl">
</span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"># 3. 設 Vitess Connector 從自管 cluster import 資料</span>
</span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="c1"># （透過 PlanetScale Console）</span></span></span></code></pre></div><p>Vitess Connector 從自管 VTTablet 的 MySQL primary 讀 binlog、寫進 PlanetScale。Lag 通常 &lt; 1 秒。</p>
<p>跑 1-2 週、確認：</p>
<ul>
<li>Schema 完整 migrate</li>
<li>VSchema 對應正確（Vindex 行為一致）</li>
<li>Lag 穩定</li>
</ul>
<h3 id="phase-3application-read-切-planetscale">Phase 3：Application read 切 PlanetScale</h3>
<p>跟 Aurora migration Phase 2 同概念。Application read query 切 PlanetScale endpoint：</p>
<ul>
<li>連 PlanetScale connection string（<code>xxx.connect.psdb.cloud</code>）</li>
<li>仍寫自管 Vitess、Vitess Connector 同步 PlanetScale</li>
</ul>
<p>跑 2-4 週、驗證：</p>
<ul>
<li>Query result 一致</li>
<li>PlanetScale read latency 接近自管（PlanetScale Boost cache 可能加速）</li>
<li>PlanetScale row read 計費跟預估一致</li>
</ul>
<h3 id="phase-4write-cutover--自管-vitess-退役">Phase 4：Write cutover + 自管 Vitess 退役</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># 1. PlanetScale 把 cluster promote 為 primary（透過 Console）</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"># 透過 PlanetScale Console 啟用 production write 或用 `pscale` CLI 對應 promotion 命令</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># （CLI 命令名稱隨 pscale 版本變動、以 pscale --help 為準）</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># 2. Application 寫 connection string 切 PlanetScale</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="c1"># 自管 Vitess → PlanetScale</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"># 3. Vitess Connector 反向（PlanetScale → 自管）作為 rollback buffer</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="c1"># 4. 跑 1-2 週確認、開始 decommission 自管 Vitess</span></span></span></code></pre></div><p>Decommission 自管 Vitess 是大工程：</p>
<ul>
<li>VTGate / VTTablet pods 一個個關</li>
<li>VReplication workflow 停掉</li>
<li>MySQL backend 保留作 cold backup 1-3 月、然後 EBS snapshot + terminate</li>
</ul>
<p>完成標準：所有 traffic 在 PlanetScale、自管 Vitess 資源全 release、ops headcount confirm 下降。</p>
<h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-vschema-不完全兼容--custom-vindex-必須改">1. VSchema 不完全兼容 — Custom Vindex 必須改</h3>
<p>自管 Vitess 可能用了 <em>custom Vindex</em>（自寫 Go plugin）、PlanetScale 不支援 custom Vindex（只支援 built-in：hash / lookup_hash / unicode 等）。</p>
<p>修法：</p>
<ul>
<li>Phase 1 audit 出所有 custom Vindex</li>
<li>對每個 custom Vindex 評估能否用 built-in 替代</li>
<li>不能替代的、考慮 <em>application 層 logic 取代 Vindex</em>（application 自己算 shard key）</li>
<li>或 <em>暫不遷該 keyspace</em>、保留自管 Vitess 跑 custom Vindex keyspace、其他遷 PlanetScale</li>
</ul>
<h3 id="2-branch-workflow-訓練不到位--dba-仍用vitess-online-ddl心智模型">2. Branch workflow 訓練不到位 — DBA 仍用「Vitess online DDL」心智模型</h3>
<p>自管 Vitess team 習慣 <code>vtctldclient ApplySchema --strategy=vitess</code> 跑 online DDL、遷 PlanetScale 後仍想直接這樣 — 但 PlanetScale production branch 禁止 schema change、必須走 Deploy Request。</p>
<p>修法：</p>
<ul>
<li>Phase 3 <em>訓練步驟</em>：team 每個 DBA / SRE 都跑過完整 branch + Deploy Request workflow</li>
<li>寫 <em>team runbook</em>：production schema change must 走 branch</li>
<li>緊急 schema change（事故中）也走 branch、PlanetScale 可加速 Deploy</li>
</ul>
<h3 id="3-super-privilege-移除--自管-admin-tool-失效">3. SUPER privilege 移除 — 自管 admin tool 失效</h3>
<p>自管 Vitess 用 <code>SUPER</code> privilege 跑 admin script、PlanetScale 沒給 SUPER。常見失效：</p>
<ul>
<li>自寫 monitor script 跑 <code>SHOW SLAVE STATUS</code>、PlanetScale 抽象掉</li>
<li>自寫 backup script 跑 <code>FLUSH TABLES WITH READ LOCK</code>、PlanetScale 不允許</li>
<li>自寫 cleanup script 跑 <code>KILL QUERY</code>、PlanetScale 受限</li>
</ul>
<p>修法：</p>
<ul>
<li>Phase 1 audit 所有 admin script</li>
<li>改用 <em>PlanetScale Console / CLI / API</em> 等價操作</li>
<li>PlanetScale 提供的 monitoring 介面替代自管監控</li>
</ul>
<h3 id="4-connection-limit--planetscale-plan-比預期緊">4. Connection limit — PlanetScale plan 比預期緊</h3>
<p>PlanetScale Scaler Plan: 10K connection、Enterprise: 100K。自管 Vitess VTGate 通常設 50K-200K connection、遷 PlanetScale 後 hit limit。</p>
<p>修法：</p>
<ul>
<li>Phase 1 <em>connection forecast</em>：peak hour 多少 active connection</li>
<li>升 PlanetScale plan（Scaler Pro / Enterprise）</li>
<li>或在 application 端加 connection pool（HikariCP / pgBouncer 等價）降低 connection count</li>
</ul>
<h3 id="5-cost-model-翻盤--per-row-read-計費超預期">5. Cost model 翻盤 — Per-row read 計費超預期</h3>
<p>PlanetScale 計費是 <em>per row read / written</em>。自管 Vitess cost = EC2 + EBS（線性 with infrastructure scale）。遷 PlanetScale 後計費跟 <em>application access pattern</em> 直接相關。</p>
<p>常見 surprise：</p>
<ul>
<li>Heavy analytics query（COUNT *、aggregation）讀大量 row、計費高</li>
<li>N+1 query pattern（application 跑很多小 SELECT）讀很多 row、計費高</li>
<li>Read-heavy workload 沒 Boost cache、每次 query 都 hit billing</li>
</ul>
<p>修法：</p>
<ul>
<li>Phase 1 <em>cost forecast</em>：用 <code>pscale analytics</code> 預估 row read / write 量、估算月帳</li>
<li>Phase 2 期間實際對 PlanetScale 跑 traffic、看實際 billing</li>
<li>Heavy analytics 改 <em>材料化 view</em> / <em>async aggregation</em>、不是每次 query</li>
<li>高 read frequency 開 Boost cache（額外 cost、但比 row read 便宜）</li>
</ul>
<h2 id="capability-mapping">Capability mapping</h2>
<table>
  <thead>
      <tr>
          <th>自管 Vitess</th>
          <th>PlanetScale 對應</th>
          <th>兼容度</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VTGate</td>
          <td>PlanetScale endpoint</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>VTTablet</td>
          <td>PlanetScale managed</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>VReplication</td>
          <td>PlanetScale Console + Deploy Request</td>
          <td>90%（內部使用更受限）</td>
      </tr>
      <tr>
          <td>VSchema</td>
          <td>PlanetScale Console / pscale CLI</td>
          <td>95%（custom Vindex 不支援）</td>
      </tr>
      <tr>
          <td>Vitess online DDL</td>
          <td>Deploy Request workflow</td>
          <td>不同 paradigm、功能等價</td>
      </tr>
      <tr>
          <td>Backup</td>
          <td>PlanetScale 自動</td>
          <td>100%（且更好）</td>
      </tr>
      <tr>
          <td>Failover</td>
          <td>PlanetScale 自動</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>Multi-region</td>
          <td>PlanetScale Boost / per-region cluster</td>
          <td>90%</td>
      </tr>
      <tr>
          <td>Custom plugin</td>
          <td>不支援</td>
          <td>0%</td>
      </tr>
      <tr>
          <td>SUPER privilege</td>
          <td>不支援</td>
          <td>0%</td>
      </tr>
  </tbody>
</table>
<h2 id="容量與成本對照">容量與成本對照</h2>
<p>對 200 人 eng team 用自管 Vitess（10 shard、20 TB 資料、50K WPS）：</p>
<table>
  <thead>
      <tr>
          <th>項目</th>
          <th>自管 Vitess（自管 EC2）</th>
          <th>PlanetScale Scaler Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Infrastructure</td>
          <td>~$15K-25K / mo（EC2 + EBS + LB）</td>
          <td>Variable（per row read / write）</td>
      </tr>
      <tr>
          <td>Ops headcount</td>
          <td>2-3 FTE × $150K / yr = $300K-450K / yr</td>
          <td>&lt; 0.5 FTE × $150K = $75K / yr</td>
      </tr>
      <tr>
          <td>Vitess upgrade cost</td>
          <td>每年 1-2 個 SRE × 2 週</td>
          <td>自動</td>
      </tr>
      <tr>
          <td>Per-row read</td>
          <td>不計費</td>
          <td>$1 per 1B row read</td>
      </tr>
      <tr>
          <td>Per-row written</td>
          <td>不計費</td>
          <td>$1.50 per 1M row write</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>EBS $2K-5K / mo</td>
          <td>$1.50 / GB / mo</td>
      </tr>
      <tr>
          <td><strong>總帳</strong></td>
          <td>~$400K-550K / yr</td>
          <td>~$200K-350K / yr（看 traffic）</td>
      </tr>
  </tbody>
</table>
<p>對中型規模、PlanetScale 通常 break-even 或更便宜。對極大規模（&gt; 200K WPS / &gt; 100 TB）PlanetScale Enterprise 需要 commit pricing、不一定划算。</p>
<h2 id="何時不要遷">何時不要遷</h2>
<ul>
<li><strong>跨雲 / on-prem 是 requirement</strong>：PlanetScale cloud-only</li>
<li><strong>Custom Vindex / 特殊 plugin</strong> 大量使用：兼容度低、改造工作量大</li>
<li><strong>規模極大</strong> &gt; 500K WPS / &gt; 200 TB：PlanetScale plan 對應 Enterprise commit、議價辛苦</li>
<li><strong>強合規 / 資料主權限制</strong>：金融 / 政府 / 醫療場景、PlanetScale 不一定能 cover compliance</li>
<li><strong>既有 Vitess team 強 + ops cost 低</strong>：如果 ops 已經精實、不必為 outsource 而 outsource</li>
</ul>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-vitess-sharding">跟 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess sharding</a></h3>
<p>本 migration 保留 Vitess sharding 概念、application code 視角幾乎不變。Phase 1 audit 是 <em>Vitess concept 對應 PlanetScale concept</em>、不是 <em>拆 Vitess 換 distributed SQL</em>。</p>
<h3 id="跟--planetscale-from-self-managed-mysql">跟 <a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">→ PlanetScale (from self-managed MySQL)</a></h3>
<p>本 migration 是 <em>Vitess → PlanetScale</em>、前者是 <em>MySQL → PlanetScale</em>。差異：</p>
<ul>
<li><em>MySQL → PlanetScale</em> (Type E)：要學 Vitess 概念 + branch workflow + FK 處理</li>
<li><em>Vitess → PlanetScale</em> (Type C)：只學 branch workflow + ops outsource、保留所有 Vitess investment</li>
</ul>
<p>選哪條 path 取決於起點。</p>
<h3 id="跟-major-version-upgrade">跟 <a href="/blog/backend/01-database/vendors/mysql/major-version-upgrade/" data-link-title="MySQL 5.7 → 8.0 Major Version Upgrade：character set / authentication / atomic DDL 三條 paradigm 同時換軌" data-link-desc="MySQL 5.7 → 8.0 三條 default 同時改：charset utf8 → utf8mb4、auth plugin native_password → caching_sha2_password、DDL non-atomic → atomic。本文走 Type E paradigm shift 結構、6 維 audit、4-phase upgrade、5 production 踩雷、何時不要升級。">Major Version Upgrade</a></h3>
<p>從自管 Vitess 上 MySQL 5.7 遷 PlanetScale 也是 <em>同時跨 major version</em>（PlanetScale 跑 8.0+ Vitess）。Application 必須同時處理 5.7 → 8.0 paradigm shift（charset / auth）。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/mysql/" data-link-title="MySQL" data-link-desc="高併發網路服務常用關聯式資料庫、Vitess / PlanetScale 分片生態、GitHub / Shopify / Facebook 規模驗證">MySQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">Vitess Sharding</a>（self-managed source）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/migrate-to-planetscale/" data-link-title="MySQL → PlanetScale：managed Vitess &#43; branch-based schema workflow 的 hybrid shift" data-link-desc="自管 MySQL → PlanetScale 加上 Vitess sharding 跟 branch-based schema workflow。本文走 6 維 audit（Paradigm &#43; Operational &#43; Schema 多軸）、4-phase migration、5 production 踩雷、何時不要遷。">→ PlanetScale from self-managed MySQL</a>（不同起點）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/migrate-to-aurora/" data-link-title="MySQL → Aurora MySQL：storage layer 轉手到 AWS、replication / HA / backup 全部 outsource" data-link-desc="自管 MySQL → Aurora MySQL 是 Type C operational hybrid migration — wire protocol 一致、ops 責任轉到 AWS。本文走 6 維 audit（Operational High）、Aurora storage architecture 衝擊、4-phase migration、5 production 踩雷、何時維持原路線。">→ Aurora MySQL</a>（另一條 self-managed → managed path）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/major-version-upgrade/" data-link-title="MySQL 5.7 → 8.0 Major Version Upgrade：character set / authentication / atomic DDL 三條 paradigm 同時換軌" data-link-desc="MySQL 5.7 → 8.0 三條 default 同時改：charset utf8 → utf8mb4、auth plugin native_password → caching_sha2_password、DDL non-atomic → atomic。本文走 Type E paradigm shift 結構、6 維 audit、4-phase upgrade、5 production 踩雷、何時不要升級。">Major Version Upgrade</a>（5.7 → 8.0 同期考量）</li>
<li>方法論：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration Playbook Methodology</a>（Type C operational hybrid）</li>
<li>官方：<a href="https://planetscale.com/docs/imports">PlanetScale Migration Guide</a> / <a href="https://github.com/planetscale/vitess-operator">Vitess Operator</a></li>
</ul>
]]></content:encoded></item><item><title>MySQL Audit Log + SIEM</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/audit-log-siem/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/audit-log-siem/</guid><description>&lt;p>MySQL audit log + SIEM 的核心責任是把資料庫操作事件轉成可查詢、可保留、可告警的安全證據。Audit log 是可調查的行為紀錄；它要回答誰在何時、從哪裡、對哪個資料物件做了什麼，以及是否符合授權流程。&lt;/p>
&lt;p>本文的判讀錨點是：audit logging 要服務於 investigation 與 compliance。Slow query log、general log、binary log、error log、managed service audit log、plugin audit log 各自承擔不同證據，不應混成同一種 log。&lt;/p>
&lt;h2 id="event-taxonomy">Event Taxonomy&lt;/h2>
&lt;p>Event taxonomy 的核心責任是定義要蒐集哪些資料庫事件。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Event 類型&lt;/th>
 &lt;th>目的&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Login / logout&lt;/td>
 &lt;td>身份與來源追蹤&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Failed access&lt;/td>
 &lt;td>brute force、credential misuse&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DDL&lt;/td>
 &lt;td>schema 變更與 migration evidence&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DCL&lt;/td>
 &lt;td>grant / revoke / role 變更&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sensitive read&lt;/td>
 &lt;td>PII / payment / high-risk table&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Data modification&lt;/td>
 &lt;td>bulk update / delete、admin action&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replication / backup&lt;/td>
 &lt;td>binlog、backup、restore access&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>事件分類要對應 alert。DDL 可以進 release audit；failed login 可以進 security alert；sensitive read 要連到 support ticket 或 break-glass 流程。&lt;/p>
&lt;h2 id="log-sources">Log Sources&lt;/h2>
&lt;p>Log sources 的核心責任是選出合適來源。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Source&lt;/th>
 &lt;th>適合用途&lt;/th>
 &lt;th>風險&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Error log&lt;/td>
 &lt;td>startup、crash、replication error&lt;/td>
 &lt;td>缺少完整 query context&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Slow log&lt;/td>
 &lt;td>performance investigation&lt;/td>
 &lt;td>安全事件覆蓋不足&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>General log&lt;/td>
 &lt;td>debug / short-term tracing&lt;/td>
 &lt;td>volume 大、PII 風險高&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Binary log&lt;/td>
 &lt;td>data change recovery / CDC&lt;/td>
 &lt;td>需要解析、並非 user audit 完整替代&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Audit plugin / managed audit&lt;/td>
 &lt;td>security evidence&lt;/td>
 &lt;td>provider / edition / config 限制&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>General log 在 production 要謹慎使用。它能提供完整 SQL，但 volume、PII 與成本都高；通常只用短時間 incident window 或測試環境。&lt;/p>
&lt;h2 id="siem-pipeline">SIEM Pipeline&lt;/h2>
&lt;p>SIEM pipeline 的核心責任是把 database event 轉成集中查詢與告警。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Pipeline step&lt;/th>
 &lt;th>內容&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Collect&lt;/td>
 &lt;td>log file、managed log export、agent&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Normalize&lt;/td>
 &lt;td>actor、source IP、database、object、action&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Mask&lt;/td>
 &lt;td>移除 SQL literal / PII&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Retain&lt;/td>
 &lt;td>retention、legal hold、storage class&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Alert&lt;/td>
 &lt;td>rule、severity、owner、runbook&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Review&lt;/td>
 &lt;td>periodic access review&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Normalization 要避免把完整 SQL 直接送進 SIEM。對敏感系統，可保留 query fingerprint、table、operation、row count、actor 與 ticket id，而非 literal value。&lt;/p></description><content:encoded><![CDATA[<p>MySQL audit log + SIEM 的核心責任是把資料庫操作事件轉成可查詢、可保留、可告警的安全證據。Audit log 是可調查的行為紀錄；它要回答誰在何時、從哪裡、對哪個資料物件做了什麼，以及是否符合授權流程。</p>
<p>本文的判讀錨點是：audit logging 要服務於 investigation 與 compliance。Slow query log、general log、binary log、error log、managed service audit log、plugin audit log 各自承擔不同證據，不應混成同一種 log。</p>
<h2 id="event-taxonomy">Event Taxonomy</h2>
<p>Event taxonomy 的核心責任是定義要蒐集哪些資料庫事件。</p>
<table>
  <thead>
      <tr>
          <th>Event 類型</th>
          <th>目的</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Login / logout</td>
          <td>身份與來源追蹤</td>
      </tr>
      <tr>
          <td>Failed access</td>
          <td>brute force、credential misuse</td>
      </tr>
      <tr>
          <td>DDL</td>
          <td>schema 變更與 migration evidence</td>
      </tr>
      <tr>
          <td>DCL</td>
          <td>grant / revoke / role 變更</td>
      </tr>
      <tr>
          <td>Sensitive read</td>
          <td>PII / payment / high-risk table</td>
      </tr>
      <tr>
          <td>Data modification</td>
          <td>bulk update / delete、admin action</td>
      </tr>
      <tr>
          <td>Replication / backup</td>
          <td>binlog、backup、restore access</td>
      </tr>
  </tbody>
</table>
<p>事件分類要對應 alert。DDL 可以進 release audit；failed login 可以進 security alert；sensitive read 要連到 support ticket 或 break-glass 流程。</p>
<h2 id="log-sources">Log Sources</h2>
<p>Log sources 的核心責任是選出合適來源。</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>適合用途</th>
          <th>風險</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Error log</td>
          <td>startup、crash、replication error</td>
          <td>缺少完整 query context</td>
      </tr>
      <tr>
          <td>Slow log</td>
          <td>performance investigation</td>
          <td>安全事件覆蓋不足</td>
      </tr>
      <tr>
          <td>General log</td>
          <td>debug / short-term tracing</td>
          <td>volume 大、PII 風險高</td>
      </tr>
      <tr>
          <td>Binary log</td>
          <td>data change recovery / CDC</td>
          <td>需要解析、並非 user audit 完整替代</td>
      </tr>
      <tr>
          <td>Audit plugin / managed audit</td>
          <td>security evidence</td>
          <td>provider / edition / config 限制</td>
      </tr>
  </tbody>
</table>
<p>General log 在 production 要謹慎使用。它能提供完整 SQL，但 volume、PII 與成本都高；通常只用短時間 incident window 或測試環境。</p>
<h2 id="siem-pipeline">SIEM Pipeline</h2>
<p>SIEM pipeline 的核心責任是把 database event 轉成集中查詢與告警。</p>
<table>
  <thead>
      <tr>
          <th>Pipeline step</th>
          <th>內容</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Collect</td>
          <td>log file、managed log export、agent</td>
      </tr>
      <tr>
          <td>Normalize</td>
          <td>actor、source IP、database、object、action</td>
      </tr>
      <tr>
          <td>Mask</td>
          <td>移除 SQL literal / PII</td>
      </tr>
      <tr>
          <td>Retain</td>
          <td>retention、legal hold、storage class</td>
      </tr>
      <tr>
          <td>Alert</td>
          <td>rule、severity、owner、runbook</td>
      </tr>
      <tr>
          <td>Review</td>
          <td>periodic access review</td>
      </tr>
  </tbody>
</table>
<p>Normalization 要避免把完整 SQL 直接送進 SIEM。對敏感系統，可保留 query fingerprint、table、operation、row count、actor 與 ticket id，而非 literal value。</p>
<h2 id="alert-rules">Alert Rules</h2>
<p>Alert rules 的核心責任是把高風險事件變成可行動訊號。</p>
<table>
  <thead>
      <tr>
          <th>Rule</th>
          <th>代表風險</th>
          <th>第一反應</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Admin login outside window</td>
          <td>credential misuse / emergency access</td>
          <td>確認 ticket、限制 session</td>
      </tr>
      <tr>
          <td>Grant / revoke event</td>
          <td>權限邊界變更</td>
          <td>access review</td>
      </tr>
      <tr>
          <td>Drop / truncate table</td>
          <td>destructive DDL</td>
          <td>freeze release、restore decision</td>
      </tr>
      <tr>
          <td>Bulk update / delete</td>
          <td>application bug / misuse</td>
          <td>查 transaction、binlog、backup</td>
      </tr>
      <tr>
          <td>Sensitive table read</td>
          <td>PII exposure</td>
          <td>ticket match、scope review</td>
      </tr>
  </tbody>
</table>
<p>Alert 要有 owner 與 runbook。只把 log 送進 SIEM，缺少 triage rule，incident 時仍然難以快速定位。</p>
<h2 id="retention-and-privacy">Retention and Privacy</h2>
<p>Retention and privacy 的核心責任是讓 audit log 同時可用與合規。Audit log 可能包含帳號、IP、SQL、table name、literal value 與 PII；保存時間越長，保護責任越重。</p>
<p>Retention policy 要定義：</p>
<ol>
<li>保存天數與 storage class。</li>
<li>哪些欄位可被 masked。</li>
<li>誰能查 audit log。</li>
<li>Legal hold 如何覆蓋一般 retention。</li>
<li>Export 到外部 SIEM 的資料邊界。</li>
</ol>
<p>Audit log 本身也要納入 access control。能查敏感 audit 的人，通常也能推斷敏感資料活動。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Audit log + SIEM 完成後，加密與憑證讀 <a href="../encryption-tls-key-management/">Encryption / TLS / Key Management</a>；備份事故讀 <a href="../pitr-backup/">PITR / Backup</a>；安全治理讀 <a href="/blog/backend/07-security-data-protection/data-protection-and-masking-governance/" data-link-title="7.4 資料保護與遮罩治理" data-link-desc="以問題驅動方式整理資料分級、遮罩、匯出與備份治理">Data Protection</a>。</p>
]]></content:encoded></item><item><title>MySQL Cross-buffer Memory Contention</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/cross-buffer-memory-contention/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/cross-buffer-memory-contention/</guid><description>&lt;p>MySQL cross-buffer memory contention 的核心責任是把 MySQL memory tuning 從單一 buffer pool 參數擴展到整體記憶體競爭。InnoDB buffer pool、redo log buffer、sort buffer、join buffer、tmp table、thread stack、connection memory、OS page cache 與 container limit 會共同決定 latency 與 OOM 風險。&lt;/p>
&lt;p>本文的判讀錨點是：MySQL memory 問題常來自&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/per-connection-memory/" data-link-title="Per-Connection Memory" data-link-desc="說明每條連線或每個操作的記憶體用量如何隨並發數放大">「每連線 / 每操作」記憶體&lt;/a>乘上 concurrency，而非只來自全域 buffer pool。調大單一 buffer 前，要先看 workload 與同時執行的 query。&lt;/p>
&lt;h2 id="memory-surfaces">Memory Surfaces&lt;/h2>
&lt;p>Memory surfaces 的核心責任是列出會互相競爭的記憶體來源。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Surface&lt;/th>
 &lt;th>類型&lt;/th>
 &lt;th>風險&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>InnoDB buffer pool&lt;/td>
 &lt;td>global&lt;/td>
 &lt;td>太小造成 read I/O，太大壓縮 OS 空間&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Redo log buffer&lt;/td>
 &lt;td>global&lt;/td>
 &lt;td>大交易 / burst write 需要審查&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sort buffer&lt;/td>
 &lt;td>per session / operation&lt;/td>
 &lt;td>concurrent sort 放大 memory&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Join buffer&lt;/td>
 &lt;td>per session / join&lt;/td>
 &lt;td>missing index 時放大&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Temp table&lt;/td>
 &lt;td>memory / disk&lt;/td>
 &lt;td>group / sort / derived table&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Connection overhead&lt;/td>
 &lt;td>per connection&lt;/td>
 &lt;td>connection storm / thread memory&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>OS page cache&lt;/td>
 &lt;td>system&lt;/td>
 &lt;td>file、backup、binlog、tmp&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Per-session buffer 是最容易誤調的項目。把 sort / join buffer 全域調大，會在高 concurrency 下造成 memory spike。&lt;/p>
&lt;h2 id="contention-signals">Contention Signals&lt;/h2>
&lt;p>Contention signals 的核心責任是把 memory pressure 從 symptom 轉成可排查訊號。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Signal&lt;/th>
 &lt;th>意義&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>OOM / container restart&lt;/td>
 &lt;td>total memory 超出限制&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>swap activity&lt;/td>
 &lt;td>memory pressure 已影響 latency&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Created_tmp_disk_tables 增加&lt;/td>
 &lt;td>memory temp table 不足或 query 太大&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sort_merge_passes 增加&lt;/td>
 &lt;td>sort memory / query shape 問題&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Buffer pool hit rate 下降&lt;/td>
 &lt;td>working set / query pattern 問題&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Threads_connected 高&lt;/td>
 &lt;td>per-connection memory 放大&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Signal 要和 query workload 對照。Temp table 與 sort 問題通常需要 query rewrite、index 或報表隔離，而非只調 memory。&lt;/p>
&lt;h2 id="tuning-order">Tuning Order&lt;/h2>
&lt;p>Tuning order 的核心責任是建立安全調整順序。&lt;/p></description><content:encoded><![CDATA[<p>MySQL cross-buffer memory contention 的核心責任是把 MySQL memory tuning 從單一 buffer pool 參數擴展到整體記憶體競爭。InnoDB buffer pool、redo log buffer、sort buffer、join buffer、tmp table、thread stack、connection memory、OS page cache 與 container limit 會共同決定 latency 與 OOM 風險。</p>
<p>本文的判讀錨點是：MySQL memory 問題常來自<a href="/blog/backend/knowledge-cards/per-connection-memory/" data-link-title="Per-Connection Memory" data-link-desc="說明每條連線或每個操作的記憶體用量如何隨並發數放大">「每連線 / 每操作」記憶體</a>乘上 concurrency，而非只來自全域 buffer pool。調大單一 buffer 前，要先看 workload 與同時執行的 query。</p>
<h2 id="memory-surfaces">Memory Surfaces</h2>
<p>Memory surfaces 的核心責任是列出會互相競爭的記憶體來源。</p>
<table>
  <thead>
      <tr>
          <th>Surface</th>
          <th>類型</th>
          <th>風險</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>InnoDB buffer pool</td>
          <td>global</td>
          <td>太小造成 read I/O，太大壓縮 OS 空間</td>
      </tr>
      <tr>
          <td>Redo log buffer</td>
          <td>global</td>
          <td>大交易 / burst write 需要審查</td>
      </tr>
      <tr>
          <td>Sort buffer</td>
          <td>per session / operation</td>
          <td>concurrent sort 放大 memory</td>
      </tr>
      <tr>
          <td>Join buffer</td>
          <td>per session / join</td>
          <td>missing index 時放大</td>
      </tr>
      <tr>
          <td>Temp table</td>
          <td>memory / disk</td>
          <td>group / sort / derived table</td>
      </tr>
      <tr>
          <td>Connection overhead</td>
          <td>per connection</td>
          <td>connection storm / thread memory</td>
      </tr>
      <tr>
          <td>OS page cache</td>
          <td>system</td>
          <td>file、backup、binlog、tmp</td>
      </tr>
  </tbody>
</table>
<p>Per-session buffer 是最容易誤調的項目。把 sort / join buffer 全域調大，會在高 concurrency 下造成 memory spike。</p>
<h2 id="contention-signals">Contention Signals</h2>
<p>Contention signals 的核心責任是把 memory pressure 從 symptom 轉成可排查訊號。</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>意義</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OOM / container restart</td>
          <td>total memory 超出限制</td>
      </tr>
      <tr>
          <td>swap activity</td>
          <td>memory pressure 已影響 latency</td>
      </tr>
      <tr>
          <td>Created_tmp_disk_tables 增加</td>
          <td>memory temp table 不足或 query 太大</td>
      </tr>
      <tr>
          <td>Sort_merge_passes 增加</td>
          <td>sort memory / query shape 問題</td>
      </tr>
      <tr>
          <td>Buffer pool hit rate 下降</td>
          <td>working set / query pattern 問題</td>
      </tr>
      <tr>
          <td>Threads_connected 高</td>
          <td>per-connection memory 放大</td>
      </tr>
  </tbody>
</table>
<p>Signal 要和 query workload 對照。Temp table 與 sort 問題通常需要 query rewrite、index 或報表隔離，而非只調 memory。</p>
<h2 id="tuning-order">Tuning Order</h2>
<p>Tuning order 的核心責任是建立安全調整順序。</p>
<ol>
<li>先確認 host / container memory limit。</li>
<li>設定 InnoDB buffer pool baseline。</li>
<li>控制 max connections 與 application pool。</li>
<li>用 top query 找 sort / join / temp table 來源。</li>
<li>對特定 session / workload 調 buffer，而非全域放大。</li>
<li>將 analytics / reporting 移到 replica 或 OLAP。</li>
</ol>
<p>這個順序讓全域 memory 先穩定，再處理 query 層問題。若反過來先調大 per-session buffer，壓力會在尖峰流量時爆發。</p>
<h2 id="query-patterns">Query Patterns</h2>
<p>Query patterns 的核心責任是找出 memory heavy 查詢。</p>
<table>
  <thead>
      <tr>
          <th>Pattern</th>
          <th>Memory 風險</th>
          <th>修正方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Large sort</td>
          <td>sort buffer / temp table</td>
          <td>index order、limit、pagination</td>
      </tr>
      <tr>
          <td>Missing join index</td>
          <td>join buffer 放大</td>
          <td>補 index、改 join order</td>
      </tr>
      <tr>
          <td>Big GROUP BY</td>
          <td>tmp table / disk spill</td>
          <td>pre-aggregate、OLAP、covering index</td>
      </tr>
      <tr>
          <td>Large transaction</td>
          <td>undo / lock / memory</td>
          <td>batch、縮短 transaction</td>
      </tr>
      <tr>
          <td>Many idle sessions</td>
          <td>connection memory</td>
          <td>pooler、timeout、max connection</td>
      </tr>
  </tbody>
</table>
<p>Memory tuning 要服務 query design。若 query 本身無界，memory 只會把問題延後到更大資料量。</p>
<h2 id="runbook">Runbook</h2>
<p>Runbook 的核心責任是把 memory incident 分流。</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>操作</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Confirm pressure</td>
          <td>OS memory、swap、OOM、MySQL status</td>
      </tr>
      <tr>
          <td>Identify workload</td>
          <td>processlist、performance schema、top SQL</td>
      </tr>
      <tr>
          <td>Reduce concurrency</td>
          <td>限流、停報表、降 background job</td>
      </tr>
      <tr>
          <td>Protect OLTP</td>
          <td>kill heavy query、切 read replica</td>
      </tr>
      <tr>
          <td>Tune safely</td>
          <td>session-level buffer、index、query</td>
      </tr>
      <tr>
          <td>Retrospective</td>
          <td>pool size、query guard、dashboard</td>
      </tr>
  </tbody>
</table>
<p>OOM 後要保存 evidence：memory limit、MySQL variables、Threads_connected、top queries、tmp table counters、container restart time。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Cross-buffer memory contention 完成後，InnoDB 基礎讀 <a href="../innodb-tuning/">InnoDB Tuning</a>；query 層讀 <a href="../query-optimization/">Query Optimization</a>；lock 與 transaction 壓力讀 <a href="../lock-contention/">Lock Contention</a>。</p>
]]></content:encoded></item><item><title>MySQL Document Store / X Protocol</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/document-store-x-protocol/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/document-store-x-protocol/</guid><description>&lt;p>MySQL Document Store / X Protocol 的核心責任是說明 MySQL 如何在 relational engine 內提供 JSON document workflow。&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/document-store/" data-link-title="Document Store" data-link-desc="說明以 JSON 文件與彈性 schema 提供資料存取的模式，以及它仍需的治理邊界">Document Store&lt;/a> 讓 application 透過 X Protocol 與 CRUD API 操作 collection，但資料仍落在 MySQL 的 storage、transaction、backup 與 permission 模型裡。&lt;/p>
&lt;p>本文的判讀錨點是：Document Store 是 MySQL 內的 document access pattern，而非 MongoDB 等專用 document database 的完整替代。它適合 relational schema 旁邊的 flexible JSON，但不適合把主要資料模型都藏進無治理 JSON。&lt;/p>
&lt;p>官方文件路由的核心責任是固定 X Protocol claim。實作前先查 &lt;a href="https://dev.mysql.com/doc/refman/en/document-store.html">MySQL 8.4 Document Store&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="responsibility-boundary">Responsibility Boundary&lt;/h2>
&lt;p>Responsibility boundary 的核心責任是把 Document Store 和 SQL table 關係說清楚。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>面向&lt;/th>
 &lt;th>Document Store&lt;/th>
 &lt;th>SQL table / JSON column&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Access API&lt;/td>
 &lt;td>X Protocol、CRUD-style API&lt;/td>
 &lt;td>SQL、JSON function&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Storage&lt;/td>
 &lt;td>MySQL InnoDB&lt;/td>
 &lt;td>MySQL InnoDB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Transaction&lt;/td>
 &lt;td>MySQL transaction&lt;/td>
 &lt;td>MySQL transaction&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Governance&lt;/td>
 &lt;td>仍需 backup、role、audit、migration&lt;/td>
 &lt;td>仍需 schema / index review&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query power&lt;/td>
 &lt;td>document-friendly access&lt;/td>
 &lt;td>SQL join、index、optimizer&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Document Store 的價值是降低 flexible object 的開發摩擦。它不免除資料合約、index、migration、backup 與 audit 的責任。&lt;/p>
&lt;h2 id="suitable-use-cases">Suitable Use Cases&lt;/h2>
&lt;p>Suitable use cases 的核心責任是找出 document pattern 的合理位置。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>情境&lt;/th>
 &lt;th>適合原因&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Profile / preference&lt;/td>
 &lt;td>欄位變動快、查詢條件少&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Integration payload&lt;/td>
 &lt;td>需要保存外部 JSON 原文&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Feature flag / config&lt;/td>
 &lt;td>讀多寫少、schema 變化頻繁&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Hybrid relational + JSON&lt;/td>
 &lt;td>主體 relational，局部 flexible&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Prototype&lt;/td>
 &lt;td>先探索欄位，再逐步 relationalize&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Document Store 最適合局部 flexible data。若核心 query 需要大量 join、aggregation、transaction invariant，應把穩定欄位拉回 relational schema。&lt;/p>
&lt;h2 id="query-and-index">Query and Index&lt;/h2>
&lt;p>Query and index 的核心責任是避免 JSON 查詢變成不可觀測黑箱。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>問題&lt;/th>
 &lt;th>審查方向&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>常用 filter&lt;/td>
 &lt;td>是否需要 generated column / functional index&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Sort / pagination&lt;/td>
 &lt;td>是否能走 index&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema drift&lt;/td>
 &lt;td>document version / validation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Large document&lt;/td>
 &lt;td>update amplification、network payload&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Analytics&lt;/td>
 &lt;td>是否應 ETL 到 OLAP / warehouse&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>MySQL JSON 查詢可以從 generated column 建 index。正式服務要把常用 JSON path 寫進 query contract，避免每次都掃完整 document。&lt;/p></description><content:encoded><![CDATA[<p>MySQL Document Store / X Protocol 的核心責任是說明 MySQL 如何在 relational engine 內提供 JSON document workflow。<a href="/blog/backend/knowledge-cards/document-store/" data-link-title="Document Store" data-link-desc="說明以 JSON 文件與彈性 schema 提供資料存取的模式，以及它仍需的治理邊界">Document Store</a> 讓 application 透過 X Protocol 與 CRUD API 操作 collection，但資料仍落在 MySQL 的 storage、transaction、backup 與 permission 模型裡。</p>
<p>本文的判讀錨點是：Document Store 是 MySQL 內的 document access pattern，而非 MongoDB 等專用 document database 的完整替代。它適合 relational schema 旁邊的 flexible JSON，但不適合把主要資料模型都藏進無治理 JSON。</p>
<p>官方文件路由的核心責任是固定 X Protocol claim。實作前先查 <a href="https://dev.mysql.com/doc/refman/en/document-store.html">MySQL 8.4 Document Store</a>；本文最後檢查日是 2026-05-22。</p>
<h2 id="responsibility-boundary">Responsibility Boundary</h2>
<p>Responsibility boundary 的核心責任是把 Document Store 和 SQL table 關係說清楚。</p>
<table>
  <thead>
      <tr>
          <th>面向</th>
          <th>Document Store</th>
          <th>SQL table / JSON column</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Access API</td>
          <td>X Protocol、CRUD-style API</td>
          <td>SQL、JSON function</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>MySQL InnoDB</td>
          <td>MySQL InnoDB</td>
      </tr>
      <tr>
          <td>Transaction</td>
          <td>MySQL transaction</td>
          <td>MySQL transaction</td>
      </tr>
      <tr>
          <td>Governance</td>
          <td>仍需 backup、role、audit、migration</td>
          <td>仍需 schema / index review</td>
      </tr>
      <tr>
          <td>Query power</td>
          <td>document-friendly access</td>
          <td>SQL join、index、optimizer</td>
      </tr>
  </tbody>
</table>
<p>Document Store 的價值是降低 flexible object 的開發摩擦。它不免除資料合約、index、migration、backup 與 audit 的責任。</p>
<h2 id="suitable-use-cases">Suitable Use Cases</h2>
<p>Suitable use cases 的核心責任是找出 document pattern 的合理位置。</p>
<table>
  <thead>
      <tr>
          <th>情境</th>
          <th>適合原因</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Profile / preference</td>
          <td>欄位變動快、查詢條件少</td>
      </tr>
      <tr>
          <td>Integration payload</td>
          <td>需要保存外部 JSON 原文</td>
      </tr>
      <tr>
          <td>Feature flag / config</td>
          <td>讀多寫少、schema 變化頻繁</td>
      </tr>
      <tr>
          <td>Hybrid relational + JSON</td>
          <td>主體 relational，局部 flexible</td>
      </tr>
      <tr>
          <td>Prototype</td>
          <td>先探索欄位，再逐步 relationalize</td>
      </tr>
  </tbody>
</table>
<p>Document Store 最適合局部 flexible data。若核心 query 需要大量 join、aggregation、transaction invariant，應把穩定欄位拉回 relational schema。</p>
<h2 id="query-and-index">Query and Index</h2>
<p>Query and index 的核心責任是避免 JSON 查詢變成不可觀測黑箱。</p>
<table>
  <thead>
      <tr>
          <th>問題</th>
          <th>審查方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>常用 filter</td>
          <td>是否需要 generated column / functional index</td>
      </tr>
      <tr>
          <td>Sort / pagination</td>
          <td>是否能走 index</td>
      </tr>
      <tr>
          <td>Schema drift</td>
          <td>document version / validation</td>
      </tr>
      <tr>
          <td>Large document</td>
          <td>update amplification、network payload</td>
      </tr>
      <tr>
          <td>Analytics</td>
          <td>是否應 ETL 到 OLAP / warehouse</td>
      </tr>
  </tbody>
</table>
<p>MySQL JSON 查詢可以從 generated column 建 index。正式服務要把常用 JSON path 寫進 query contract，避免每次都掃完整 document。</p>
<h2 id="migration-boundary">Migration Boundary</h2>
<p>Migration boundary 的核心責任是讓 document data 可演進。Document 欄位雖然 flexible，但 application 仍會依賴某些 key；這些 key 一旦進入 workflow，就要有版本與 validation。</p>
<p>最小治理：</p>
<ol>
<li>Document version field。</li>
<li>Required key validation at application boundary。</li>
<li>Backfill script for new required key。</li>
<li>Index review for promoted key。</li>
<li>Export / backup restore validation。</li>
</ol>
<p>當 JSON key 變成 join key、permission key 或 reporting key，應評估搬到 relational column。</p>
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go conditions 的核心責任是指出 Document Store 的邊界。</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>建議路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>主要資料都是 nested document</td>
          <td>MongoDB / document database evaluation</td>
      </tr>
      <tr>
          <td>大量 document aggregation</td>
          <td>OLAP / search / document-oriented engine</td>
      </tr>
      <tr>
          <td>JSON path 已成核心 index</td>
          <td>relationalize key 或 generated column</td>
      </tr>
      <tr>
          <td>需要跨 document complex join</td>
          <td>relational schema</td>
      </tr>
      <tr>
          <td>需要 schema governance</td>
          <td>migration + validation</td>
      </tr>
  </tbody>
</table>
<p>Document Store 要服務於 flexible edge，而非取代資料建模。當 flexible area 穩定下來，就把它納入 schema governance。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Document Store / X Protocol 完成後，JSON 與 SQL 能力讀 <a href="../modern-sql-features/">Modern SQL Features</a>；若主要資料模型是 document，讀 <a href="/blog/backend/01-database/vendors/mongodb/" data-link-title="MongoDB" data-link-desc="Document database 代表、Atlas managed、跨雲可用、許多大規模平台從 MongoDB 起家">MongoDB</a>；migration 到 PostgreSQL JSONB 可讀 <a href="../migrate-to-postgresql/">MySQL to PostgreSQL</a>。</p>
]]></content:encoded></item><item><title>MySQL Encryption / TLS / Key Management</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/encryption-tls-key-management/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/encryption-tls-key-management/</guid><description>&lt;p>MySQL encryption / TLS / key management 的核心責任是把資料庫保護拆成儲存加密、傳輸加密、&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/key-management/" data-link-title="Key Management" data-link-desc="說明加密金鑰如何產生、保存、輪替，以及還原時如何依賴金鑰">金鑰生命週期&lt;/a>與連線憑證治理。Encryption 是多層保護設計；它涵蓋 InnoDB tablespace、redo / undo、binary log、backup artifact、client connection 與 keyring。&lt;/p>
&lt;p>本文的判讀錨點是：加密要服務於 threat model。若風險是磁碟遺失，&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/at-rest-encryption/" data-link-title="At-Rest Encryption" data-link-desc="說明資料落到儲存媒介前的加密層，以及它對應的威脅模型">at-rest encryption&lt;/a> 是重點；若風險是網路攔截，TLS 是重點；若風險是內部濫用，還需要 role、audit、masking 與 SIEM。&lt;/p>
&lt;p>官方文件路由的核心責任是固定 MySQL 8.4 security claim。實作前先查 &lt;a href="https://dev.mysql.com/doc/refman/8.2/en/innodb-data-encryption.html">InnoDB data-at-rest encryption&lt;/a>、&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/keyring.html">MySQL keyring&lt;/a> 與 &lt;a href="https://dev.mysql.com/doc/refman/8.4/en/show-binary-log-status.html">SHOW BINARY LOG STATUS&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="protection-layers">Protection Layers&lt;/h2>
&lt;p>Protection layers 的核心責任是把保護面分層。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>層級&lt;/th>
 &lt;th>主要責任&lt;/th>
 &lt;th>Evidence&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>At-rest encryption&lt;/td>
 &lt;td>data file、redo、undo、temp&lt;/td>
 &lt;td>encryption setting、keyring status&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>In-transit TLS&lt;/td>
 &lt;td>client / replica / admin connection&lt;/td>
 &lt;td>TLS mode、certificate、cipher&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup encryption&lt;/td>
 &lt;td>dump、snapshot、physical backup&lt;/td>
 &lt;td>encrypted artifact、restore drill&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Key management&lt;/td>
 &lt;td>key generation、rotation、access&lt;/td>
 &lt;td>KMS / keyring log、rotation record&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Credential governance&lt;/td>
 &lt;td>user password、secret、rotation&lt;/td>
 &lt;td>grant review、secret age&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>這些層級要一起設計。資料檔加密後，backup 若以明文落到 object storage，保護鏈仍然破洞；TLS 開啟後，client 若允許 insecure fallback，也會失去網路保護。&lt;/p>
&lt;h2 id="keyring-boundary">Keyring Boundary&lt;/h2>
&lt;p>Keyring boundary 的核心責任是定義 MySQL 如何取得與保護 encryption key。MySQL 支援 keyring component / plugin 與外部 KMS 整合；managed MySQL 可能由 provider 接管 key storage。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>部署型態&lt;/th>
 &lt;th>key 責任&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Self-managed&lt;/td>
 &lt;td>自行部署 keyring / KMS&lt;/td>
 &lt;td>key file permission、backup、rotation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Managed MySQL&lt;/td>
 &lt;td>provider KMS / customer-managed key&lt;/td>
 &lt;td>region、rotation、audit、restore&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Container lab&lt;/td>
 &lt;td>dev-only keyring&lt;/td>
 &lt;td>避免和 production policy 混用&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Keyring 要進入 backup / restore drill。還原 database 時，只有 data file 而沒有對應 key，restore 會失敗；runbook 要保存 key dependency 與 emergency access。&lt;/p>
&lt;h2 id="tls-policy">TLS Policy&lt;/h2>
&lt;p>TLS policy 的核心責任是讓 client connection、replication connection 與 admin connection 都有明確安全等級。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>連線類型&lt;/th>
 &lt;th>建議檢查&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Application&lt;/td>
 &lt;td>require SSL、verify CA / identity&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Replication&lt;/td>
 &lt;td>source / replica TLS、cert expiry&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Admin&lt;/td>
 &lt;td>bastion / VPN / TLS、least privilege&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Backup tool&lt;/td>
 &lt;td>encrypted transport、secret scope&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>TLS 驗證要包含 certificate rotation。過期憑證造成的 downtime 很常見；runbook 要記錄 CA、server cert、client cert、rotation window 與 reload / restart 條件。&lt;/p></description><content:encoded><![CDATA[<p>MySQL encryption / TLS / key management 的核心責任是把資料庫保護拆成儲存加密、傳輸加密、<a href="/blog/backend/knowledge-cards/key-management/" data-link-title="Key Management" data-link-desc="說明加密金鑰如何產生、保存、輪替，以及還原時如何依賴金鑰">金鑰生命週期</a>與連線憑證治理。Encryption 是多層保護設計；它涵蓋 InnoDB tablespace、redo / undo、binary log、backup artifact、client connection 與 keyring。</p>
<p>本文的判讀錨點是：加密要服務於 threat model。若風險是磁碟遺失，<a href="/blog/backend/knowledge-cards/at-rest-encryption/" data-link-title="At-Rest Encryption" data-link-desc="說明資料落到儲存媒介前的加密層，以及它對應的威脅模型">at-rest encryption</a> 是重點；若風險是網路攔截，TLS 是重點；若風險是內部濫用，還需要 role、audit、masking 與 SIEM。</p>
<p>官方文件路由的核心責任是固定 MySQL 8.4 security claim。實作前先查 <a href="https://dev.mysql.com/doc/refman/8.2/en/innodb-data-encryption.html">InnoDB data-at-rest encryption</a>、<a href="https://dev.mysql.com/doc/refman/8.0/en/keyring.html">MySQL keyring</a> 與 <a href="https://dev.mysql.com/doc/refman/8.4/en/show-binary-log-status.html">SHOW BINARY LOG STATUS</a>；本文最後檢查日是 2026-05-22。</p>
<h2 id="protection-layers">Protection Layers</h2>
<p>Protection layers 的核心責任是把保護面分層。</p>
<table>
  <thead>
      <tr>
          <th>層級</th>
          <th>主要責任</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>At-rest encryption</td>
          <td>data file、redo、undo、temp</td>
          <td>encryption setting、keyring status</td>
      </tr>
      <tr>
          <td>In-transit TLS</td>
          <td>client / replica / admin connection</td>
          <td>TLS mode、certificate、cipher</td>
      </tr>
      <tr>
          <td>Backup encryption</td>
          <td>dump、snapshot、physical backup</td>
          <td>encrypted artifact、restore drill</td>
      </tr>
      <tr>
          <td>Key management</td>
          <td>key generation、rotation、access</td>
          <td>KMS / keyring log、rotation record</td>
      </tr>
      <tr>
          <td>Credential governance</td>
          <td>user password、secret、rotation</td>
          <td>grant review、secret age</td>
      </tr>
  </tbody>
</table>
<p>這些層級要一起設計。資料檔加密後，backup 若以明文落到 object storage，保護鏈仍然破洞；TLS 開啟後，client 若允許 insecure fallback，也會失去網路保護。</p>
<h2 id="keyring-boundary">Keyring Boundary</h2>
<p>Keyring boundary 的核心責任是定義 MySQL 如何取得與保護 encryption key。MySQL 支援 keyring component / plugin 與外部 KMS 整合；managed MySQL 可能由 provider 接管 key storage。</p>
<table>
  <thead>
      <tr>
          <th>部署型態</th>
          <th>key 責任</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Self-managed</td>
          <td>自行部署 keyring / KMS</td>
          <td>key file permission、backup、rotation</td>
      </tr>
      <tr>
          <td>Managed MySQL</td>
          <td>provider KMS / customer-managed key</td>
          <td>region、rotation、audit、restore</td>
      </tr>
      <tr>
          <td>Container lab</td>
          <td>dev-only keyring</td>
          <td>避免和 production policy 混用</td>
      </tr>
  </tbody>
</table>
<p>Keyring 要進入 backup / restore drill。還原 database 時，只有 data file 而沒有對應 key，restore 會失敗；runbook 要保存 key dependency 與 emergency access。</p>
<h2 id="tls-policy">TLS Policy</h2>
<p>TLS policy 的核心責任是讓 client connection、replication connection 與 admin connection 都有明確安全等級。</p>
<table>
  <thead>
      <tr>
          <th>連線類型</th>
          <th>建議檢查</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Application</td>
          <td>require SSL、verify CA / identity</td>
      </tr>
      <tr>
          <td>Replication</td>
          <td>source / replica TLS、cert expiry</td>
      </tr>
      <tr>
          <td>Admin</td>
          <td>bastion / VPN / TLS、least privilege</td>
      </tr>
      <tr>
          <td>Backup tool</td>
          <td>encrypted transport、secret scope</td>
      </tr>
  </tbody>
</table>
<p>TLS 驗證要包含 certificate rotation。過期憑證造成的 downtime 很常見；runbook 要記錄 CA、server cert、client cert、rotation window 與 reload / restart 條件。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SHOW</span><span class="w"> </span><span class="n">VARIABLES</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;require_secure_transport&#39;</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">STATUS</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">&#39;Ssl_cipher&#39;</span><span class="p">;</span></span></span></code></pre></div><p>這些查詢只能提供 connection 層 evidence。正式驗證還要從 client 設定確認 <code>ssl-mode</code> 是否驗證 CA / identity。</p>
<h2 id="backup-and-binlog-encryption">Backup and Binlog Encryption</h2>
<p>Backup and binlog encryption 的核心責任是保護資料離開 primary 後的生命週期。MySQL backup、binlog、logical dump、object storage、replica seed 都可能含敏感資料。</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>保護方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Logical dump</td>
          <td>client-side encryption、storage policy</td>
      </tr>
      <tr>
          <td>Physical backup</td>
          <td>backup tool encryption、KMS</td>
      </tr>
      <tr>
          <td>Binlog</td>
          <td>encrypted storage、restricted access</td>
      </tr>
      <tr>
          <td>Snapshot</td>
          <td>volume encryption、snapshot policy</td>
      </tr>
      <tr>
          <td>Restore copy</td>
          <td>isolated environment、secret scoping</td>
      </tr>
  </tbody>
</table>
<p>Restore drill 要確認加密 artifact 可被解密並啟動。只有成功產出 encrypted backup，還不足以證明災難時能恢復。</p>
<h2 id="rotation-runbook">Rotation Runbook</h2>
<p>Rotation runbook 的核心責任是讓 key、certificate、password 都可定期更換。</p>
<ol>
<li>Inventory：列出 DB user、TLS cert、KMS key、backup key。</li>
<li>Impact：確認哪些 client / replica / backup job 使用它。</li>
<li>Staging：先在 staging 旋轉並跑 smoke test。</li>
<li>Rollout：使用雙憑證 / 雙 secret window。</li>
<li>Validation：查連線、replication、backup、restore。</li>
<li>Cleanup：移除舊 key / cert / secret。</li>
</ol>
<p>Rotation 要設 calendar 與 owner。安全設定長期無人輪替時，incident 後會難以判斷 exposure window。</p>
<h2 id="failure-modes">Failure Modes</h2>
<p>Failure modes 的核心責任是提前列出加密常見事故。</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>判讀訊號</th>
          <th>修正方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TLS fallback</td>
          <td>client 仍可明文連線</td>
          <td>require secure transport、client verify</td>
      </tr>
      <tr>
          <td>Cert expiry</td>
          <td>application connection failure</td>
          <td>rotation alert、dual cert window</td>
      </tr>
      <tr>
          <td>Missing keyring</td>
          <td>restore / startup failure</td>
          <td>key backup、KMS access drill</td>
      </tr>
      <tr>
          <td>Plain backup</td>
          <td>storage artifact 未加密</td>
          <td>backup pipeline policy</td>
      </tr>
      <tr>
          <td>Overbroad secret</td>
          <td>admin / app 共用 credential</td>
          <td>role split、secret rotation</td>
      </tr>
  </tbody>
</table>
<p>安全 runbook 要和 audit log 串接。Key rotation、failed TLS、privilege change、restore access 都應留下可追溯紀錄。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Encryption / TLS / key management 完成後，操作證據讀 <a href="../audit-log-siem/">Audit Log + SIEM</a>；備份恢復讀 <a href="../pitr-backup/">PITR / Backup</a>；資料保護治理讀 <a href="/blog/backend/07-security-data-protection/data-protection-and-masking-governance/" data-link-title="7.4 資料保護與遮罩治理" data-link-desc="以問題驅動方式整理資料分級、遮罩、匯出與備份治理">Data Protection</a>。</p>
]]></content:encoded></item><item><title>MySQL HeatWave OLAP Add-on</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/heatwave-olap-addon/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/heatwave-olap-addon/</guid><description>&lt;p>MySQL HeatWave OLAP add-on 的核心責任是判斷 OLTP database 內建 analytics 加速何時比拆出 OLAP 系統更划算。HeatWave 這類 add-on 的價值是降低資料搬運與平台數量，但它也把 analytics workload、成本、freshness 與 query governance 帶回 MySQL 生態。&lt;/p>
&lt;p>本文的判讀錨點是：OLAP add-on 做的是把分析查詢從 OLTP 路徑&lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/olap-offload/" data-link-title="OLAP Offload" data-link-desc="說明如何把分析型查詢從 OLTP 主庫卸載，以保護線上交易效能">卸載&lt;/a>到專用引擎，解決特定 analytics workload 的 proximity 問題，而非 data warehouse 的完整替代。選型要看資料量、query pattern、freshness、concurrency、成本與團隊能力。&lt;/p>
&lt;p>官方文件路由的核心責任是固定 HeatWave claim。實作前先查 &lt;a href="https://dev.mysql.com/doc/heatwave/en/index.html">MySQL HeatWave User Guide&lt;/a>；本文最後檢查日是 2026-05-22。&lt;/p>
&lt;h2 id="workload-fit">Workload Fit&lt;/h2>
&lt;p>Workload fit 的核心責任是找出 HeatWave 類 OLAP add-on 的合理位置。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>情境&lt;/th>
 &lt;th>適合原因&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>MySQL 資料為主要分析來源&lt;/td>
 &lt;td>減少 ETL / CDC 複雜度&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Dashboard 需要較新資料&lt;/td>
 &lt;td>freshness 比 warehouse batch 更重要&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>分析 query 可被明確界定&lt;/td>
 &lt;td>可控 workload 便於成本與容量管理&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Team 想降低平台數&lt;/td>
 &lt;td>MySQL 生態內完成 transactional + analytics&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>適合的 workload 通常是「MySQL 內資料、分析需求清楚、資料量可控」。若需要跨多資料源、複雜 semantic layer、長期資料湖與 ML feature store，warehouse / lakehouse 仍然更合適。&lt;/p>
&lt;h2 id="boundary-with-oltp">Boundary with OLTP&lt;/h2>
&lt;p>Boundary with OLTP 的核心責任是避免 analytics 壓力影響交易服務。OLTP query 要穩定、低延遲、可預測；OLAP query 常是大掃描、大聚合、長時間。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>審查面&lt;/th>
 &lt;th>問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Resource&lt;/td>
 &lt;td>OLAP 是否隔離 CPU / memory / storage&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Freshness&lt;/td>
 &lt;td>analytic data 和 source 差多久&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Query control&lt;/td>
 &lt;td>誰能跑 heavy query、如何限流&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Cost&lt;/td>
 &lt;td>add-on node、storage、egress&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Incident&lt;/td>
 &lt;td>OLAP 故障是否影響 OLTP&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>OLAP add-on 要有 query admission policy。任何人都能跑任意分析 SQL，會把成本與穩定性風險放大。&lt;/p>
&lt;h2 id="freshness-and-evidence">Freshness and Evidence&lt;/h2>
&lt;p>Freshness and evidence 的核心責任是定義分析結果多新。Dashboard、營運報表、風控、推薦特徵對 freshness 的要求不同。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Freshness 等級&lt;/th>
 &lt;th>適合情境&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>秒到分鐘&lt;/td>
 &lt;td>operational dashboard、風控&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>小時&lt;/td>
 &lt;td>商業報表、營運分析&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>天&lt;/td>
 &lt;td>財務結算、長期趨勢&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Freshness 要被量測。Runbook 要記錄 last load / sync time、query latency、failed refresh、data gap 與 fallback dashboard。&lt;/p>
&lt;h2 id="cost-model">Cost Model&lt;/h2>
&lt;p>Cost model 的核心責任是比較 add-on 和獨立 OLAP 系統。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>成本項&lt;/th>
 &lt;th>HeatWave 類 add-on&lt;/th>
 &lt;th>獨立 warehouse&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Data movement&lt;/td>
 &lt;td>較少 ETL&lt;/td>
 &lt;td>需要 CDC / batch pipeline&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Compute&lt;/td>
 &lt;td>add-on capacity&lt;/td>
 &lt;td>warehouse compute / auto scaling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Storage&lt;/td>
 &lt;td>MySQL ecosystem 內&lt;/td>
 &lt;td>separate storage&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Governance&lt;/td>
 &lt;td>MySQL 權限延伸&lt;/td>
 &lt;td>data platform governance&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Lock-in&lt;/td>
 &lt;td>provider-specific&lt;/td>
 &lt;td>warehouse-specific&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>成本比較要包含人力。少一條 ETL pipeline 可能節省大量維運；但 provider-specific query 與管理模型也會增加 exit cost。&lt;/p></description><content:encoded><![CDATA[<p>MySQL HeatWave OLAP add-on 的核心責任是判斷 OLTP database 內建 analytics 加速何時比拆出 OLAP 系統更划算。HeatWave 這類 add-on 的價值是降低資料搬運與平台數量，但它也把 analytics workload、成本、freshness 與 query governance 帶回 MySQL 生態。</p>
<p>本文的判讀錨點是：OLAP add-on 做的是把分析查詢從 OLTP 路徑<a href="/blog/backend/knowledge-cards/olap-offload/" data-link-title="OLAP Offload" data-link-desc="說明如何把分析型查詢從 OLTP 主庫卸載，以保護線上交易效能">卸載</a>到專用引擎，解決特定 analytics workload 的 proximity 問題，而非 data warehouse 的完整替代。選型要看資料量、query pattern、freshness、concurrency、成本與團隊能力。</p>
<p>官方文件路由的核心責任是固定 HeatWave claim。實作前先查 <a href="https://dev.mysql.com/doc/heatwave/en/index.html">MySQL HeatWave User Guide</a>；本文最後檢查日是 2026-05-22。</p>
<h2 id="workload-fit">Workload Fit</h2>
<p>Workload fit 的核心責任是找出 HeatWave 類 OLAP add-on 的合理位置。</p>
<table>
  <thead>
      <tr>
          <th>情境</th>
          <th>適合原因</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MySQL 資料為主要分析來源</td>
          <td>減少 ETL / CDC 複雜度</td>
      </tr>
      <tr>
          <td>Dashboard 需要較新資料</td>
          <td>freshness 比 warehouse batch 更重要</td>
      </tr>
      <tr>
          <td>分析 query 可被明確界定</td>
          <td>可控 workload 便於成本與容量管理</td>
      </tr>
      <tr>
          <td>Team 想降低平台數</td>
          <td>MySQL 生態內完成 transactional + analytics</td>
      </tr>
  </tbody>
</table>
<p>適合的 workload 通常是「MySQL 內資料、分析需求清楚、資料量可控」。若需要跨多資料源、複雜 semantic layer、長期資料湖與 ML feature store，warehouse / lakehouse 仍然更合適。</p>
<h2 id="boundary-with-oltp">Boundary with OLTP</h2>
<p>Boundary with OLTP 的核心責任是避免 analytics 壓力影響交易服務。OLTP query 要穩定、低延遲、可預測；OLAP query 常是大掃描、大聚合、長時間。</p>
<table>
  <thead>
      <tr>
          <th>審查面</th>
          <th>問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Resource</td>
          <td>OLAP 是否隔離 CPU / memory / storage</td>
      </tr>
      <tr>
          <td>Freshness</td>
          <td>analytic data 和 source 差多久</td>
      </tr>
      <tr>
          <td>Query control</td>
          <td>誰能跑 heavy query、如何限流</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>add-on node、storage、egress</td>
      </tr>
      <tr>
          <td>Incident</td>
          <td>OLAP 故障是否影響 OLTP</td>
      </tr>
  </tbody>
</table>
<p>OLAP add-on 要有 query admission policy。任何人都能跑任意分析 SQL，會把成本與穩定性風險放大。</p>
<h2 id="freshness-and-evidence">Freshness and Evidence</h2>
<p>Freshness and evidence 的核心責任是定義分析結果多新。Dashboard、營運報表、風控、推薦特徵對 freshness 的要求不同。</p>
<table>
  <thead>
      <tr>
          <th>Freshness 等級</th>
          <th>適合情境</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>秒到分鐘</td>
          <td>operational dashboard、風控</td>
      </tr>
      <tr>
          <td>小時</td>
          <td>商業報表、營運分析</td>
      </tr>
      <tr>
          <td>天</td>
          <td>財務結算、長期趨勢</td>
      </tr>
  </tbody>
</table>
<p>Freshness 要被量測。Runbook 要記錄 last load / sync time、query latency、failed refresh、data gap 與 fallback dashboard。</p>
<h2 id="cost-model">Cost Model</h2>
<p>Cost model 的核心責任是比較 add-on 和獨立 OLAP 系統。</p>
<table>
  <thead>
      <tr>
          <th>成本項</th>
          <th>HeatWave 類 add-on</th>
          <th>獨立 warehouse</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data movement</td>
          <td>較少 ETL</td>
          <td>需要 CDC / batch pipeline</td>
      </tr>
      <tr>
          <td>Compute</td>
          <td>add-on capacity</td>
          <td>warehouse compute / auto scaling</td>
      </tr>
      <tr>
          <td>Storage</td>
          <td>MySQL ecosystem 內</td>
          <td>separate storage</td>
      </tr>
      <tr>
          <td>Governance</td>
          <td>MySQL 權限延伸</td>
          <td>data platform governance</td>
      </tr>
      <tr>
          <td>Lock-in</td>
          <td>provider-specific</td>
          <td>warehouse-specific</td>
      </tr>
  </tbody>
</table>
<p>成本比較要包含人力。少一條 ETL pipeline 可能節省大量維運；但 provider-specific query 與管理模型也會增加 exit cost。</p>
<h2 id="no-go-conditions">No-Go Conditions</h2>
<p>No-go conditions 的核心責任是避免把 OLAP add-on 推到資料平台的位置。</p>
<table>
  <thead>
      <tr>
          <th>訊號</th>
          <th>建議路由</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>分析跨多來源</td>
          <td>warehouse / lakehouse</td>
      </tr>
      <tr>
          <td>查詢需要 semantic layer / BI governance</td>
          <td>dedicated analytics platform</td>
      </tr>
      <tr>
          <td>長期歷史資料遠大於 OLTP</td>
          <td>warehouse / object storage</td>
      </tr>
      <tr>
          <td>ML feature / offline training</td>
          <td>feature store / lakehouse</td>
      </tr>
      <tr>
          <td>成本需要獨立 chargeback</td>
          <td>separate OLAP environment</td>
      </tr>
  </tbody>
</table>
<p>HeatWave 類能力適合 MySQL-centered analytics。當分析需求超出單一 OLTP source，資料平台會比 add-on 更清楚。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>HeatWave OLAP add-on 完成後，MySQL query 基礎讀 <a href="../query-optimization/">Query Optimization</a>；資料平台邊界讀 backend analytics / warehouse 章節；若要保留 MySQL OLTP 並外接 CDC，讀 <a href="../binlog-cdc/">Binlog CDC</a>。</p>
]]></content:encoded></item><item><title>MySQL Metadata Lock Deep Dive</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/metadata-lock-deep-dive/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/metadata-lock-deep-dive/</guid><description>&lt;p>MySQL metadata lock deep dive 的核心責任是說明 DDL、transaction 與 table metadata 之間的阻塞關係。MySQL 在查詢 table 時會取得 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/metadata-lock/" data-link-title="Metadata Lock" data-link-desc="說明 DDL 與既有交易如何在 table metadata 層互相排隊與阻塞">metadata lock&lt;/a>；DDL 需要等待既有 metadata lock 釋放，等待中的 DDL 又會阻塞後續查詢，形成 production 常見雪崩。&lt;/p>
&lt;p>本文的判讀錨點是：MDL 事故通常來自 DDL 排隊在長交易後面，並把後續 query 一起擋住。解法要同時處理 long transaction、DDL window、OSC 工具與 observability。&lt;/p>
&lt;h2 id="lock-lifecycle">Lock Lifecycle&lt;/h2>
&lt;p>Lock lifecycle 的核心責任是建立 MDL 心智模型。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>行為&lt;/th>
 &lt;th>MDL 影響&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>SELECT&lt;/code> / DML&lt;/td>
 &lt;td>取得 table metadata lock，交易結束釋放&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Long transaction&lt;/td>
 &lt;td>延長 metadata lock 持有時間&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>ALTER TABLE&lt;/code>&lt;/td>
 &lt;td>等待相容鎖，期間可能阻塞後續 query&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Online schema change&lt;/td>
 &lt;td>仍需 metadata lock 進行切換 / rename&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Idle transaction&lt;/td>
 &lt;td>看似無操作，仍可能持有 metadata lock&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>MDL 的風險在於排隊。當 &lt;code>ALTER TABLE&lt;/code> 等待 long transaction 時，後續新的 query 可能排在 DDL 後面，讓原本小變更變成服務不可用。&lt;/p>
&lt;h2 id="detection">Detection&lt;/h2>
&lt;p>Detection 的核心責任是快速找出誰持鎖、誰等待。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SELECT&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">FROM&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">performance_schema&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="n">metadata_locks&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">WHERE&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">OBJECT_SCHEMA&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s1">&amp;#39;appdb&amp;#39;&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="k">ORDER&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">BY&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">OBJECT_NAME&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">LOCK_STATUS&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>搭配 processlist：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sql" data-lang="sql">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&lt;span class="k">SHOW&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">FULL&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="n">PROCESSLIST&lt;/span>&lt;span class="p">;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Production dashboard 應監控 running DDL、metadata lock wait、long transaction age、threads running、blocked query count 與 replication lag。&lt;/p>
&lt;h2 id="ddl-risk-review">DDL Risk Review&lt;/h2>
&lt;p>DDL risk review 的核心責任是在變更前預測 MDL 風險。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>DDL 類型&lt;/th>
 &lt;th>風險&lt;/th>
 &lt;th>控制方式&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Add nullable column&lt;/td>
 &lt;td>依版本 / algorithm 可能較低&lt;/td>
 &lt;td>staging dry run、algorithm check&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Add index&lt;/td>
 &lt;td>可能長時間操作與切換 lock&lt;/td>
 &lt;td>online DDL / OSC、低峰窗口&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Change column type&lt;/td>
 &lt;td>table rebuild 風險高&lt;/td>
 &lt;td>ghost table / phased migration&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Rename / swap table&lt;/td>
 &lt;td>短暫但關鍵 MDL&lt;/td>
 &lt;td>kill blocker、短窗口&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Drop column / table&lt;/td>
 &lt;td>destructive 且需鎖&lt;/td>
 &lt;td>backup、approval、blocked query watch&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>DDL review 要列出 algorithm、lock mode、預估時間、rollback、kill blocker policy 與 replication impact。&lt;/p>
&lt;h2 id="incident-runbook">Incident Runbook&lt;/h2>
&lt;p>Incident runbook 的核心責任是把 MDL 事故分流。&lt;/p></description><content:encoded><![CDATA[<p>MySQL metadata lock deep dive 的核心責任是說明 DDL、transaction 與 table metadata 之間的阻塞關係。MySQL 在查詢 table 時會取得 <a href="/blog/backend/knowledge-cards/metadata-lock/" data-link-title="Metadata Lock" data-link-desc="說明 DDL 與既有交易如何在 table metadata 層互相排隊與阻塞">metadata lock</a>；DDL 需要等待既有 metadata lock 釋放，等待中的 DDL 又會阻塞後續查詢，形成 production 常見雪崩。</p>
<p>本文的判讀錨點是：MDL 事故通常來自 DDL 排隊在長交易後面，並把後續 query 一起擋住。解法要同時處理 long transaction、DDL window、OSC 工具與 observability。</p>
<h2 id="lock-lifecycle">Lock Lifecycle</h2>
<p>Lock lifecycle 的核心責任是建立 MDL 心智模型。</p>
<table>
  <thead>
      <tr>
          <th>行為</th>
          <th>MDL 影響</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>SELECT</code> / DML</td>
          <td>取得 table metadata lock，交易結束釋放</td>
      </tr>
      <tr>
          <td>Long transaction</td>
          <td>延長 metadata lock 持有時間</td>
      </tr>
      <tr>
          <td><code>ALTER TABLE</code></td>
          <td>等待相容鎖，期間可能阻塞後續 query</td>
      </tr>
      <tr>
          <td>Online schema change</td>
          <td>仍需 metadata lock 進行切換 / rename</td>
      </tr>
      <tr>
          <td>Idle transaction</td>
          <td>看似無操作，仍可能持有 metadata lock</td>
      </tr>
  </tbody>
</table>
<p>MDL 的風險在於排隊。當 <code>ALTER TABLE</code> 等待 long transaction 時，後續新的 query 可能排在 DDL 後面，讓原本小變更變成服務不可用。</p>
<h2 id="detection">Detection</h2>
<p>Detection 的核心責任是快速找出誰持鎖、誰等待。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">performance_schema</span><span class="p">.</span><span class="n">metadata_locks</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">OBJECT_SCHEMA</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;appdb&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">OBJECT_NAME</span><span class="p">,</span><span class="w"> </span><span class="n">LOCK_STATUS</span><span class="p">;</span></span></span></code></pre></div><p>搭配 processlist：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SHOW</span><span class="w"> </span><span class="k">FULL</span><span class="w"> </span><span class="n">PROCESSLIST</span><span class="p">;</span></span></span></code></pre></div><p>Production dashboard 應監控 running DDL、metadata lock wait、long transaction age、threads running、blocked query count 與 replication lag。</p>
<h2 id="ddl-risk-review">DDL Risk Review</h2>
<p>DDL risk review 的核心責任是在變更前預測 MDL 風險。</p>
<table>
  <thead>
      <tr>
          <th>DDL 類型</th>
          <th>風險</th>
          <th>控制方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Add nullable column</td>
          <td>依版本 / algorithm 可能較低</td>
          <td>staging dry run、algorithm check</td>
      </tr>
      <tr>
          <td>Add index</td>
          <td>可能長時間操作與切換 lock</td>
          <td>online DDL / OSC、低峰窗口</td>
      </tr>
      <tr>
          <td>Change column type</td>
          <td>table rebuild 風險高</td>
          <td>ghost table / phased migration</td>
      </tr>
      <tr>
          <td>Rename / swap table</td>
          <td>短暫但關鍵 MDL</td>
          <td>kill blocker、短窗口</td>
      </tr>
      <tr>
          <td>Drop column / table</td>
          <td>destructive 且需鎖</td>
          <td>backup、approval、blocked query watch</td>
      </tr>
  </tbody>
</table>
<p>DDL review 要列出 algorithm、lock mode、預估時間、rollback、kill blocker policy 與 replication impact。</p>
<h2 id="incident-runbook">Incident Runbook</h2>
<p>Incident runbook 的核心責任是把 MDL 事故分流。</p>
<table>
  <thead>
      <tr>
          <th>Step</th>
          <th>操作</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Identify blocker</td>
          <td>查 long transaction / metadata_locks</td>
      </tr>
      <tr>
          <td>Stop new DDL</td>
          <td>暫停 migration pipeline</td>
      </tr>
      <tr>
          <td>Decide kill</td>
          <td>依 owner / transaction age / impact</td>
      </tr>
      <tr>
          <td>Protect app</td>
          <td>降低 traffic、停 heavy endpoint</td>
      </tr>
      <tr>
          <td>Validate</td>
          <td>查 query 恢復、replication lag</td>
      </tr>
      <tr>
          <td>Retrospective</td>
          <td>補 DDL gate、long transaction alert</td>
      </tr>
  </tbody>
</table>
<p>Kill session 是高風險操作。決策要記錄 transaction owner、已執行時間、可能 rollback 成本與業務影響。</p>
<h2 id="osc-interaction">OSC Interaction</h2>
<p>OSC interaction 的核心責任是說明 gh-ost / pt-online-schema-change 仍需要 MDL 管理。Ghost table 工具把大部分 copy 與 backfill 移到旁路，但最後 cutover / rename 仍需要短暫 metadata lock。</p>
<table>
  <thead>
      <tr>
          <th>工具階段</th>
          <th>MDL 風險</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Create ghost table</td>
          <td>低</td>
      </tr>
      <tr>
          <td>Copy / backfill</td>
          <td>主要是 load / replication lag</td>
      </tr>
      <tr>
          <td>Trigger / binlog</td>
          <td>依工具模式不同</td>
      </tr>
      <tr>
          <td>Cutover / rename</td>
          <td>關鍵 MDL window</td>
      </tr>
  </tbody>
</table>
<p>OSC runbook 要在 cutover 前檢查 long transaction。若 blocker 存在，先延後 cutover，而非硬切。</p>
<h2 id="prevention">Prevention</h2>
<p>Prevention 的核心責任是讓 MDL 事故在 release 前被擋下。</p>
<ol>
<li>Long transaction alert。</li>
<li>DDL dry run 與 algorithm / lock mode 記錄。</li>
<li>Migration window 與 kill blocker policy。</li>
<li>OSC cutover pre-check。</li>
<li>Application transaction timeout。</li>
<li>Read-only replica 上先測 schema change。</li>
</ol>
<p>MDL 是 MySQL schema governance 的核心議題。每個 production DDL 都要有 metadata lock plan。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Metadata lock deep dive 完成後，schema change 工具讀 <a href="../online-schema-change-tools/">Online Schema Change Tools</a>；lock 行為讀 <a href="../lock-contention/">Lock Contention</a>；操作演練讀 <a href="../hands-on/online-schema-change-lab/">Online Schema Change Lab</a>。</p>
]]></content:encoded></item><item><title>MySQL Multi-source Replication</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/multi-source-replication/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/multi-source-replication/</guid><description>&lt;p>MySQL multi-source replication 的核心責任是讓一個 replica 從多個 source 接收資料。這種拓撲常用於資料整併、分庫匯總、migration staging、報表集中或多個 bounded context 的 read consolidation。&lt;/p>
&lt;p>本文的判讀錨點是：multi-source replication 是 consolidation pattern，而非 multi-primary conflict resolution。每個 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/replication-channel/" data-link-title="Replication Channel" data-link-desc="說明多來源複製中，每個來源對應的獨立複製通道如何成為隔離單位">replication channel&lt;/a> 要有獨立 source、schema scope、lag、error handling 與 ownership。&lt;/p>
&lt;h2 id="use-cases">Use Cases&lt;/h2>
&lt;p>Use cases 的核心責任是確認 multi-source 解決的是整併需求。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>情境&lt;/th>
 &lt;th>適合條件&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Reporting replica&lt;/td>
 &lt;td>多個 source 匯入同一 read-only target&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Migration staging&lt;/td>
 &lt;td>新平台先接多個 source binlog&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Regional fan-in&lt;/td>
 &lt;td>多區 local DB 匯總到中心&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Shard consolidation&lt;/td>
 &lt;td>多 shard 同 schema 匯入 reporting DB&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Audit / CDC sink&lt;/td>
 &lt;td>變更集中供後續 pipeline 使用&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Multi-source target 通常應 read-only。若 target 同時接受 application write，就要設計 conflict 與 ownership，複雜度會大幅提高。&lt;/p>
&lt;h2 id="channel-design">Channel Design&lt;/h2>
&lt;p>Channel design 的核心責任是把每個 source 隔離成可觀測單位。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>設計項&lt;/th>
 &lt;th>審查問題&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Channel name&lt;/td>
 &lt;td>是否能看出 source / owner / purpose&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Schema scope&lt;/td>
 &lt;td>不同 source 是否寫入不同 schema / table&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>GTID&lt;/td>
 &lt;td>GTID domain / collision policy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Filter&lt;/td>
 &lt;td>replicate-do / ignore 規則是否可審查&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Credential&lt;/td>
 &lt;td>每個 channel 是否獨立 secret&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Lag alert&lt;/td>
 &lt;td>channel-level lag 與 error&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Channel 命名要可讀。Incident 時看到 channel 名稱，就要知道哪個 source、哪個 team、哪個用途與是否可暫停。&lt;/p>
&lt;h2 id="conflict-boundary">Conflict Boundary&lt;/h2>
&lt;p>Conflict boundary 的核心責任是避免多個 source 寫同一份邏輯資料。Multi-source 沒有自動解決業務 conflict 的能力。&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Conflict 類型&lt;/th>
 &lt;th>控制方式&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Primary key collision&lt;/td>
 &lt;td>shard key prefix、schema isolation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Duplicate natural key&lt;/td>
 &lt;td>source namespace、dedupe layer&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Out-of-order update&lt;/td>
 &lt;td>source ownership、event timestamp&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Delete collision&lt;/td>
 &lt;td>tombstone policy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>DDL drift&lt;/td>
 &lt;td>migration coordination&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>最安全的 pattern 是每個 source 寫自己的 schema 或帶 source namespace 的 table。若多 source 寫同一 table，必須先設計 key space 與 conflict policy。&lt;/p>
&lt;h2 id="monitoring">Monitoring&lt;/h2>
&lt;p>Monitoring 的核心責任是讓每個 channel 的狀態可見。&lt;/p></description><content:encoded><![CDATA[<p>MySQL multi-source replication 的核心責任是讓一個 replica 從多個 source 接收資料。這種拓撲常用於資料整併、分庫匯總、migration staging、報表集中或多個 bounded context 的 read consolidation。</p>
<p>本文的判讀錨點是：multi-source replication 是 consolidation pattern，而非 multi-primary conflict resolution。每個 <a href="/blog/backend/knowledge-cards/replication-channel/" data-link-title="Replication Channel" data-link-desc="說明多來源複製中，每個來源對應的獨立複製通道如何成為隔離單位">replication channel</a> 要有獨立 source、schema scope、lag、error handling 與 ownership。</p>
<h2 id="use-cases">Use Cases</h2>
<p>Use cases 的核心責任是確認 multi-source 解決的是整併需求。</p>
<table>
  <thead>
      <tr>
          <th>情境</th>
          <th>適合條件</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reporting replica</td>
          <td>多個 source 匯入同一 read-only target</td>
      </tr>
      <tr>
          <td>Migration staging</td>
          <td>新平台先接多個 source binlog</td>
      </tr>
      <tr>
          <td>Regional fan-in</td>
          <td>多區 local DB 匯總到中心</td>
      </tr>
      <tr>
          <td>Shard consolidation</td>
          <td>多 shard 同 schema 匯入 reporting DB</td>
      </tr>
      <tr>
          <td>Audit / CDC sink</td>
          <td>變更集中供後續 pipeline 使用</td>
      </tr>
  </tbody>
</table>
<p>Multi-source target 通常應 read-only。若 target 同時接受 application write，就要設計 conflict 與 ownership，複雜度會大幅提高。</p>
<h2 id="channel-design">Channel Design</h2>
<p>Channel design 的核心責任是把每個 source 隔離成可觀測單位。</p>
<table>
  <thead>
      <tr>
          <th>設計項</th>
          <th>審查問題</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Channel name</td>
          <td>是否能看出 source / owner / purpose</td>
      </tr>
      <tr>
          <td>Schema scope</td>
          <td>不同 source 是否寫入不同 schema / table</td>
      </tr>
      <tr>
          <td>GTID</td>
          <td>GTID domain / collision policy</td>
      </tr>
      <tr>
          <td>Filter</td>
          <td>replicate-do / ignore 規則是否可審查</td>
      </tr>
      <tr>
          <td>Credential</td>
          <td>每個 channel 是否獨立 secret</td>
      </tr>
      <tr>
          <td>Lag alert</td>
          <td>channel-level lag 與 error</td>
      </tr>
  </tbody>
</table>
<p>Channel 命名要可讀。Incident 時看到 channel 名稱，就要知道哪個 source、哪個 team、哪個用途與是否可暫停。</p>
<h2 id="conflict-boundary">Conflict Boundary</h2>
<p>Conflict boundary 的核心責任是避免多個 source 寫同一份邏輯資料。Multi-source 沒有自動解決業務 conflict 的能力。</p>
<table>
  <thead>
      <tr>
          <th>Conflict 類型</th>
          <th>控制方式</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Primary key collision</td>
          <td>shard key prefix、schema isolation</td>
      </tr>
      <tr>
          <td>Duplicate natural key</td>
          <td>source namespace、dedupe layer</td>
      </tr>
      <tr>
          <td>Out-of-order update</td>
          <td>source ownership、event timestamp</td>
      </tr>
      <tr>
          <td>Delete collision</td>
          <td>tombstone policy</td>
      </tr>
      <tr>
          <td>DDL drift</td>
          <td>migration coordination</td>
      </tr>
  </tbody>
</table>
<p>最安全的 pattern 是每個 source 寫自己的 schema 或帶 source namespace 的 table。若多 source 寫同一 table，必須先設計 key space 與 conflict policy。</p>
<h2 id="monitoring">Monitoring</h2>
<p>Monitoring 的核心責任是讓每個 channel 的狀態可見。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SHOW</span><span class="w"> </span><span class="n">REPLICA</span><span class="w"> </span><span class="n">STATUS</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="n">CHANNEL</span><span class="w"> </span><span class="s1">&#39;source_a&#39;</span><span class="err">\</span><span class="k">G</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SHOW</span><span class="w"> </span><span class="n">REPLICA</span><span class="w"> </span><span class="n">STATUS</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="n">CHANNEL</span><span class="w"> </span><span class="s1">&#39;source_b&#39;</span><span class="err">\</span><span class="k">G</span></span></span></code></pre></div><p>要觀測：</p>
<ol>
<li>IO thread / SQL thread status。</li>
<li>Seconds behind source。</li>
<li>Last IO error / SQL error。</li>
<li>Relay log growth。</li>
<li>GTID executed / retrieved。</li>
<li>Channel credential expiry。</li>
</ol>
<p>Lag 要分 channel 告警。總體 replica 健康不足以定位哪個 source 卡住。</p>
<h2 id="migration-pattern">Migration Pattern</h2>
<p>Migration pattern 的核心責任是把 multi-source 用在可回退的搬遷。</p>
<table>
  <thead>
      <tr>
          <th>Phase</th>
          <th>Evidence</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source audit</td>
          <td>schema、GTID、binlog format</td>
      </tr>
      <tr>
          <td>Target setup</td>
          <td>channel、filter、credential</td>
      </tr>
      <tr>
          <td>Backfill</td>
          <td>dump / load、checksum</td>
      </tr>
      <tr>
          <td>Catch-up</td>
          <td>channel lag、error</td>
      </tr>
      <tr>
          <td>Read test</td>
          <td>report query、row count</td>
      </tr>
      <tr>
          <td>Cutover</td>
          <td>read endpoint switch</td>
      </tr>
      <tr>
          <td>Cleanup</td>
          <td>stop channel、retention、secret</td>
      </tr>
  </tbody>
</table>
<p>Migration target 若只是 reporting，cutover 風險較低；若要成為 new primary，還要處理 write freeze、conflict、application route 與 rollback。</p>
<h2 id="failure-modes">Failure Modes</h2>
<p>Failure modes 的核心責任是把 multi-source 事故分 channel 處理。</p>
<table>
  <thead>
      <tr>
          <th>Failure mode</th>
          <th>判讀訊號</th>
          <th>修正方向</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single channel lag</td>
          <td>某 source 延遲</td>
          <td>查 source load、network、SQL error</td>
      </tr>
      <tr>
          <td>DDL drift</td>
          <td>replication SQL error</td>
          <td>migration coordination</td>
      </tr>
      <tr>
          <td>Key collision</td>
          <td>duplicate key error</td>
          <td>namespace / key rewrite</td>
      </tr>
      <tr>
          <td>Relay log growth</td>
          <td>target apply 慢</td>
          <td>調整 parallel apply、拆 workload</td>
      </tr>
      <tr>
          <td>Credential expired</td>
          <td>IO thread stopped</td>
          <td>rotate secret、resume channel</td>
      </tr>
  </tbody>
</table>
<p>Channel failure 要避免全局操作。只停問題 channel，保留其他 channel，能降低 blast radius。</p>
<h2 id="下一步路由">下一步路由</h2>
<p>Multi-source replication 完成後，基本拓撲讀 <a href="../replication-topology/">Replication Topology</a>；failover 讀 <a href="../orchestrator-failover/">Orchestrator Failover</a>；CDC 與 binlog 讀 <a href="../binlog-cdc/">Binlog CDC</a>。</p>
]]></content:encoded></item></channel></rss>