<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Citus on Tarragon</title><link>https://tarrragon.github.io/blog/tags/citus/</link><description>Recent content in Citus on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Tue, 02 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/citus/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL Citus Distributed：用 extension 把 PG 變成 sharded cluster</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/citus-distributed/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/citus-distributed/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL&lt;/a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 &lt;em>Citus distributed extension&lt;/em> — 把 PG 變成 sharded cluster 的方式。&lt;/p>&lt;/blockquote>
&lt;hr>
&lt;p>當 PG single-primary 寫吞吐撞上單機極限（50K-100K WPS）、選項三條：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Application 層 sharding&lt;/strong>：應用層自管 shard routing&lt;/li>
&lt;li>&lt;strong>Citus&lt;/strong>：PG extension、自動 routing + cross-shard query&lt;/li>
&lt;li>&lt;strong>Distributed SQL&lt;/strong>（CockroachDB / Aurora DSQL / Spanner）：不同 engine&lt;/li>
&lt;/ol>
&lt;p>選 Citus 的核心 driver：&lt;em>保留 PG SQL syntax + extension 生態&lt;/em>。但「應用層幾乎不必改」是樂觀說法 — 實際上 application 必須圍繞 distribution column 重設計（query 加 filter / transaction 限定同 shard / reference table 量控制）、跟 Vitess 比 cross-shard query 自動化弱。代價是 &lt;em>coordinator / worker 部署複雜度 + cross-shard query 限制 + application schema 改造工作量&lt;/em>。&lt;/p>
&lt;p>閱讀本文前可先對齊 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding&lt;/a> 的 shard key、routing、resharding 與 cross-shard query 語意；容量失衡時再接 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition&lt;/a>。&lt;/p>
&lt;p>跟 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess sharding&lt;/a> 的核心差異：Citus 是 &lt;em>PG extension&lt;/em>（PG 自己跑）、Vitess 是 &lt;em>獨立 proxy + tablet 系統&lt;/em>（包 MySQL）。Citus 用 PG 原生機制（FDW / extension hook）、Vitess 是 &lt;em>外部包裝&lt;/em>。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>本文是 <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL</a> overview 的 implementation-layer deep article。Overview 已說明 PG 在 OLTP 譜系的定位、本文聚焦 <em>Citus distributed extension</em> — 把 PG 變成 sharded cluster 的方式。</p></blockquote>
<hr>
<p>當 PG single-primary 寫吞吐撞上單機極限（50K-100K WPS）、選項三條：</p>
<ol>
<li><strong>Application 層 sharding</strong>：應用層自管 shard routing</li>
<li><strong>Citus</strong>：PG extension、自動 routing + cross-shard query</li>
<li><strong>Distributed SQL</strong>（CockroachDB / Aurora DSQL / Spanner）：不同 engine</li>
</ol>
<p>選 Citus 的核心 driver：<em>保留 PG SQL syntax + extension 生態</em>。但「應用層幾乎不必改」是樂觀說法 — 實際上 application 必須圍繞 distribution column 重設計（query 加 filter / transaction 限定同 shard / reference table 量控制）、跟 Vitess 比 cross-shard query 自動化弱。代價是 <em>coordinator / worker 部署複雜度 + cross-shard query 限制 + application schema 改造工作量</em>。</p>
<p>閱讀本文前可先對齊 <a href="/blog/backend/knowledge-cards/database-sharding/" data-link-title="Database Sharding" data-link-desc="說明資料庫如何依 shard key 分散資料、路由請求與承擔跨 shard 查詢成本">Database Sharding</a> 的 shard key、routing、resharding 與 cross-shard query 語意；容量失衡時再接 <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a>。</p>
<p>跟 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess sharding</a> 的核心差異：Citus 是 <em>PG extension</em>（PG 自己跑）、Vitess 是 <em>獨立 proxy + tablet 系統</em>（包 MySQL）。Citus 用 PG 原生機制（FDW / extension hook）、Vitess 是 <em>外部包裝</em>。</p>
<h2 id="citus-架構coordinator--worker">Citus 架構：Coordinator + Worker</h2>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">                ┌─────────────────┐
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   Application  │   Coordinator   │  ← 對外 PG wire protocol、planner、routing
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">                │   (Citus + PG)  │
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">                └────┬─────┬──────┘
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">                     │     │
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">              ┌──────┘     └──────┐
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">              ▼                   ▼
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">        ┌──────────┐         ┌──────────┐
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">        │ Worker 1 │         │ Worker 2 │  ← 各跑 PG + Citus extension
</span></span><span class="line"><span class="ln">10</span><span class="cl">        │  (PG)    │         │  (PG)    │
</span></span><span class="line"><span class="ln">11</span><span class="cl">        │ shard 1,3│         │ shard 2,4│
</span></span><span class="line"><span class="ln">12</span><span class="cl">        └──────────┘         └──────────┘</span></span></code></pre></div><p><strong>Coordinator</strong>：</p>
<ul>
<li>對 application 看起來像 PG（同 port / 同 wire protocol）</li>
<li>接 SQL → Citus planner 把 query 分解 + route 給 worker</li>
<li>不存 data（distributed table 的 shard 在 worker 上）</li>
<li>存 <em>metadata</em>（哪個 shard 在哪個 worker）</li>
</ul>
<p><strong>Worker</strong>：</p>
<ul>
<li>標準 PG instance + Citus extension</li>
<li>各存若干 shard</li>
<li>接 coordinator 來的 query、跑 local execute、回結果</li>
</ul>
<p><strong>Shard</strong>：</p>
<ul>
<li>Distributed table 拆成 N 個 shard（預設 32）</li>
<li>每 shard 是 worker 上的 <em>physical PG table</em>（含 <code>_&lt;shardid&gt;</code> 後綴）</li>
<li>行為跟一般 PG table 一樣、可以直接連 worker 用 PG 工具 access</li>
</ul>
<h2 id="3-種-table-type">3 種 Table Type</h2>
<h3 id="distributed-table--跨-shard-切分">Distributed table — 跨 shard 切分</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 建一般 PG table
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">BIGSERIAL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="k">TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w">  </span><span class="c1">-- PK 必須含 distribution column
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="c1"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 用 Citus 把它變 distributed
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span></span></span></code></pre></div><p><code>user_id</code> 是 <em>distribution column</em> — Citus 用它的 hash 決定 row 屬哪個 shard。<code>PK 必須含 distribution column</code>（跟 MySQL partitioning 同要求）。</p>
<p>跟 Vitess Vindex 對比：</p>
<ul>
<li>Citus：hash distribution column → shard（單一 hash function、不可選 algorithm）</li>
<li>Vitess：Vindex 可選多種（hash / lookup_hash / xxhash / null）</li>
</ul>
<h3 id="reference-table--全-shard-共有">Reference table — 全 shard 共有</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">products</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">price</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_reference_table</span><span class="p">(</span><span class="s1">&#39;products&#39;</span><span class="p">);</span></span></span></code></pre></div><p><code>products</code> 在 <em>每個 worker 都有完整 copy</em>、寫入 coordinator 廣播給所有 worker。</p>
<p>用途：</p>
<ul>
<li>小 lookup table（country code / product category 等）</li>
<li>跨 distributed table JOIN 時、reference table 在每 worker 上、不必 cross-shard</li>
<li>寫入頻率低（廣播 cost 跟 worker 數 linear）</li>
</ul>
<h3 id="local-table--coordinator-上的-pg-table">Local table — Coordinator 上的 PG table</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">audit_log</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="nb">SERIAL</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">event</span><span class="w"> </span><span class="n">JSONB</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 不調用 Citus function、預設留在 coordinator</span></span></span></code></pre></div><p>行為跟一般 PG table 一樣。用於 <em>不需 distribute</em> 的 table（如 admin metadata）。</p>
<h2 id="colocation跨-distributed-table-同-shard-對齊">Colocation：跨 distributed table 同 shard 對齊</h2>
<p>當兩個 distributed table 都用 <em>同 distribution column</em>（例如 <code>user_id</code>）+ 同 shard count、Citus 自動 colocate：</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;user_addresses&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span><span class="w"> </span><span class="n">colocate_with</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="s1">&#39;orders&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Colocate 後：</p>
<ul>
<li><code>user_id = 100</code> 的 orders 跟 user_addresses 在 <em>同一 worker shard</em></li>
<li>JOIN 不跨 worker、效率高</li>
<li>可用 PG 原生 FK constraint（cross-table 但同 shard）</li>
</ul>
<p>Colocate 是 Citus 設計的核心 <em>跨 table 一致性</em> 機制。沒 colocate 的 cross-table query 變 cross-worker、效率大降。</p>
<h2 id="配置-step-by-steplocal-cluster">配置 step-by-step（local cluster）</h2>
<p>Production 用 Citus Cloud（Microsoft 託管）或 Azure Cosmos DB for PostgreSQL（同 engine）。Self-hosted：</p>
<h3 id="step-1coordinator--worker-都裝-pg--citus">Step 1：Coordinator + worker 都裝 PG + Citus</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1"># 在每個 node（coordinator + 2 worker）</span>
</span></span><span class="line"><span class="ln">2</span><span class="cl">apt install postgresql-14
</span></span><span class="line"><span class="ln">3</span><span class="cl">apt install postgresql-14-citus-12.0
</span></span><span class="line"><span class="ln">4</span><span class="cl">
</span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"># postgresql.conf</span>
</span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="nv">shared_preload_libraries</span> <span class="o">=</span> <span class="s1">&#39;citus&#39;</span>
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl">systemctl restart postgresql</span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 在每個 node 跑
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="n">EXTENSION</span><span class="w"> </span><span class="n">citus</span><span class="p">;</span></span></span></code></pre></div><h3 id="step-2coordinator-註冊-worker">Step 2：Coordinator 註冊 worker</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 在 coordinator 跑
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">citus_add_node</span><span class="p">(</span><span class="s1">&#39;worker1.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">5432</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">citus_add_node</span><span class="p">(</span><span class="s1">&#39;worker2.example.com&#39;</span><span class="p">,</span><span class="w"> </span><span class="mi">5432</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="c1">-- 確認
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">citus_get_active_worker_nodes</span><span class="p">();</span></span></span></code></pre></div><h3 id="step-3建-distributed-table">Step 3：建 distributed table</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w">    </span><span class="n">id</span><span class="w"> </span><span class="n">BIGSERIAL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">    </span><span class="n">user_id</span><span class="w"> </span><span class="nb">BIGINT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w">    </span><span class="n">amount</span><span class="w"> </span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w"> </span><span class="k">TIMESTAMP</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">    </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;orders&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;user_id&#39;</span><span class="p">);</span></span></span></code></pre></div><p>Citus 自動把 <code>orders</code> 拆成 32 個 shard（<code>orders_102008</code> 等）、分配到 worker。</p>
<h3 id="step-4application-連-coordinator">Step 4：Application 連 coordinator</h3>
<p>Application connection string 連 coordinator IP / port（不必知道 worker 存在）。</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 從 application 跑 query、Citus 透明 route
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span><span class="w"> </span><span class="n">amount</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">12345</span><span class="p">,</span><span class="w"> </span><span class="mi">50</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="c1">-- → Citus 看 user_id=12345 hash 屬 shard 17、route 給對應 worker
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">12345</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w"></span><span class="c1">-- → Single-shard query、極快
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="c1">-- → Cross-shard aggregation、Citus 並行跑、合併結果</span></span></span></code></pre></div><h2 id="5-個-production-踩雷">5 個 Production 踩雷</h2>
<h3 id="1-distribution-column-選錯--cross-shard-query-變主流">1. Distribution column 選錯 — Cross-shard query 變主流</h3>
<p>選 <code>created_at</code> 或 <code>id</code>（auto increment）作 distribution column、看起來均勻、實際 <em>application query 多以 user_id 為主</em>、變成 <em>每個 query 都 cross-shard</em>、performance 雪崩。</p>
<p>修法：</p>
<ul>
<li><em>Distribution column 選 application 最常 filter / join 的 column</em>（通常是 <code>tenant_id</code> / <code>user_id</code>）</li>
<li>Audit application top query、確認 distribution column 對齊 query pattern</li>
<li>改 distribution column 要 <em>rewrite 所有 shard</em>、像 resharding、大工程</li>
</ul>
<h3 id="2-cross-shard-transaction-限制">2. Cross-shard transaction 限制</h3>
<p>跨多 shard 的 transaction（如：UPDATE 兩個 user_id 不同的 row）Citus 用 <em>2PC</em>（two-phase commit）但有限制：</p>
<ul>
<li>Multi-statement transaction 跨 shard 需明確開 <code>SET citus.multi_shard_modify_mode = 'sequential'</code></li>
<li>部分 isolation level 不保證 serializable across shards</li>
<li>DDL 跨 shard 是 sequential</li>
</ul>
<p>修法：</p>
<ul>
<li>Schema design 避免 cross-shard transaction（同 colocation group 內 transaction 沒問題）</li>
<li>必要 cross-shard 場景明確設 multi-shard mode</li>
<li>對 <em>strict cross-shard consistency</em>、考慮 distributed SQL（CockroachDB / Aurora DSQL）</li>
</ul>
<h3 id="3-reference-table-過大--寫入廣播-cost-爆">3. Reference table 過大 — 寫入廣播 cost 爆</h3>
<p>Reference table 在每 worker 都有 copy、寫入 <em>廣播給所有 worker</em>。Reference table 100K row + 高頻寫入 → 寫一次寫 N worker、cost N x。</p>
<p>修法：</p>
<ul>
<li>Reference table 限 <em>小 + 寫入頻率低</em> 的 lookup data</li>
<li>超大表不該是 reference table、考慮 distributed</li>
<li>監控 reference table 寫入 rate、超 threshold 重新評估</li>
</ul>
<h3 id="4-colocate-沒對齊--隱性-cross-shard-join">4. Colocate 沒對齊 — 隱性 cross-shard JOIN</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 看似可以、實際 cross-shard 慢
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">orders</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">user_addresses</span><span class="w"> </span><span class="n">ua</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">user_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ua</span><span class="p">.</span><span class="n">user_id</span><span class="p">;</span></span></span></code></pre></div><p>若 <code>user_addresses</code> 沒 <code>colocate_with =&gt; 'orders'</code>、兩表 shard 分配獨立、JOIN 跨 worker。</p>
<p>修法：</p>
<ul>
<li>建相關 table 時 <code>colocate_with</code> 對齊</li>
<li>用 <code>SELECT * FROM citus_tables</code> 看 colocation_id、確認對齊</li>
<li>跨非 colocate table 的 JOIN 用 <em>materialized view</em> 或 application 層拆 query 避開</li>
</ul>
<h3 id="5-worker-failover--coordinator-必須知道">5. Worker failover — Coordinator 必須知道</h3>
<p>Worker 故障、Citus 預設 <em>coordinator 看到 query 失敗、不自動 failover</em>。</p>
<p>修法（Citus 11+）：</p>
<ul>
<li>用 <em>shard replication</em>（<code>citus.shard_replication_factor = 2</code>）— 每 shard 在 2 個 worker 有 copy</li>
<li>配 PG streaming replication 在 worker 層、外加 Patroni 管 failover</li>
<li>Coordinator 失敗 → 整個 cluster 失能、coordinator 也要 HA（Patroni）</li>
</ul>
<p>跟 Vitess 對比 Citus 的 HA story 較弱、production 必須認真規劃。</p>
<h2 id="何時用-citus">何時用 Citus</h2>
<table>
  <thead>
      <tr>
          <th>條件</th>
          <th>建議</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Multi-tenant SaaS、tenant_id 為自然 distribution</td>
          <td>是</td>
      </tr>
      <tr>
          <td>寫吞吐 &gt; 50K WPS、單 PG 撐不住</td>
          <td>是</td>
      </tr>
      <tr>
          <td>需要保留 PG SQL + extension（pgvector / TimescaleDB）</td>
          <td>是</td>
      </tr>
      <tr>
          <td>應用 query pattern 80% 都用同一 distribution column</td>
          <td>是</td>
      </tr>
      <tr>
          <td>應用大量 ad-hoc cross-tenant aggregation</td>
          <td>否（cross-shard 慢）</td>
      </tr>
      <tr>
          <td>強 cross-shard consistency 需求</td>
          <td>否（用 CockroachDB）</td>
      </tr>
      <tr>
          <td>想 zero-ops managed</td>
          <td>Azure Cosmos DB for PostgreSQL（同 engine）</td>
      </tr>
  </tbody>
</table>
<h2 id="容量規劃">容量規劃</h2>
<ul>
<li>Coordinator: 中等 CPU + RAM、metadata 不大、不存 data</li>
<li>Worker: per-worker spec 同 single PG production</li>
<li>Shard count: 預設 32、實務常設 worker count × 4-8</li>
<li>Replication factor: production 至少 2</li>
</ul>
<h2 id="跟其他模組整合">跟其他模組整合</h2>
<h3 id="跟-replication-topology">跟 Replication topology</h3>
<p>Coordinator + worker 各跑 PG streaming replication、Citus 不取代 PG replication。Worker failover 用 Patroni / streaming replication。詳見 <a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">Replication Topology</a>。</p>
<h3 id="跟-pg-extensions">跟 PG Extensions</h3>
<p>Citus 跟其他 PG extension 多數兼容（pgvector / TimescaleDB / pg_stat_statements）— 它維持 <em>extension</em> 形態，保留 PostgreSQL 生態接點。詳見 <em>PG Extension Ecosystem</em> 篇（待寫）。</p>
<h3 id="跟-mysql-vitess">跟 MySQL Vitess</h3>
<table>
  <thead>
      <tr>
          <th>維度</th>
          <th>Citus</th>
          <th>Vitess</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>部署模型</td>
          <td>PG extension</td>
          <td>獨立 proxy + tablet</td>
      </tr>
      <tr>
          <td>主要場景</td>
          <td>Multi-tenant SaaS</td>
          <td>超大規模分片</td>
      </tr>
      <tr>
          <td>Cross-shard JOIN</td>
          <td>colocate 對齊 + reference table</td>
          <td>VTGate 自動 split + aggregate</td>
      </tr>
      <tr>
          <td>FK</td>
          <td>同 colocation 內可用</td>
          <td>Vitess 18+ 支援、cross-shard 限制</td>
      </tr>
      <tr>
          <td>HA</td>
          <td>依賴 Patroni + replication factor</td>
          <td>VTOrc + replication</td>
      </tr>
      <tr>
          <td>學習曲線</td>
          <td>中（PG ops 經驗夠）</td>
          <td>高（4 component）</td>
      </tr>
  </tbody>
</table>
<p>Citus 對 <em>PG-native</em> 場景更平順、Vitess 對 <em>MySQL-native</em> 場景更平順、不直接競爭。詳見 <a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess Sharding</a>。</p>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor overview</a></li>
<li><a href="/blog/backend/01-database/vendors/postgresql/replication-topology/" data-link-title="PostgreSQL Replication Topology：async / sync / quorum 三模式跟 LSN &#43; replication slot 的三軸組合" data-link-desc="PostgreSQL streaming replication 不是「sync 或 async」、是 *durability / latency / consistency* 三軸組合 &#43; LSN-based 進度追蹤 &#43; replication slot 治理。本文走 3 軸取捨模型、async / sync / quorum-based sync 行為對比、LSN &#43; replication slot 機制、配置 step-by-step、5 production 踩雷（standby lag 暴衝 / sync standby 退回 async / orphan replication slot / cascading replication 雪崩 / failover 後 timeline 分歧）、跟 Patroni HA &#43; logical replication 整合">PG Replication Topology</a>（per-worker replication）</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/mvcc-lock-model/" data-link-title="PostgreSQL MVCC &#43; Lock Model：為什麼 PG 比 MySQL 少 deadlock、但 vacuum 是別的代價" data-link-desc="PG 用 *MVCC-heavy &#43; 少 explicit lock* 的並行控制、跟 MySQL InnoDB 的 *lock-based*（record / gap / next-key）相反。本文走 MVCC 機制（tuple version &#43; xmin/xmax &#43; visibility）、PG 4 種 lock（row-level / table-level / advisory / predicate）、預測 SERIALIZABLE 行為、5 production 踩雷（idle transaction 卡 vacuum / SELECT FOR UPDATE 跨 transaction / advisory lock 沒釋放 / bloat 不是 vacuum 問題 / predicate lock 在 SSI 下 rollback）、跟 MySQL lock-contention sibling 對比">PG MVCC + Lock Model</a>（cross-shard transaction lock 行為）</li>
<li><a href="/blog/backend/01-database/global-distributed-oltp/" data-link-title="1.11 全球分散式 OLTP" data-link-desc="Spanner / Aurora DSQL / Cosmos DB multi-region write / CockroachDB / TiDB 的全球一致性取捨">1.11 全球分散式 OLTP</a>（Citus vs CockroachDB vs Spanner）</li>
<li><a href="/blog/backend/01-database/vendors/mysql/vitess-sharding/" data-link-title="MySQL Vitess Sharding：VTGate / VTTablet / VReplication / VSchema 四件套協作" data-link-desc="Vitess 不只是 MySQL sharding proxy、是 4 個 component 協作的完整 sharding 系統 — VTGate（query routing layer）、VTTablet（per-MySQL agent）、VReplication（跨 shard 資料移動）、VSchema（sharding metadata）。本文走 4 件套各自責任、keyspace / shard / tablet 架構、shard key 設計（Vindex）、配置 step-by-step、5 production 踩雷（cross-shard transaction / VStream lag / Vindex 不均勻 / resharding 切流 / VReplication 卡住）、跟自管 sharding 跟 PlanetScale 的對比">MySQL Vitess Sharding</a>（sibling、不同實作）</li>
<li><a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor</a>（Azure Cosmos DB for PostgreSQL = managed Citus）</li>
<li>官方：<a href="https://docs.citusdata.com/">Citus Documentation</a> / <a href="https://github.com/citusdata/citus">Citus on GitHub</a></li>
</ul>
]]></content:encoded></item><item><title>Cosmos DB for PostgreSQL：基於 Citus 的分散式 PostgreSQL、跟核心 Cosmos DB 是不同產品、何時選它而非核心 Cosmos 或一般 PG</title><link>https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/cosmos-for-postgresql/</link><pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/cosmos-for-postgresql/</guid><description>&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB&lt;/a> overview 的 deep article、寫作參照 &lt;a href="https://tarrragon.github.io/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">vendor deep article methodology&lt;/a>。Cosmos DB for PostgreSQL 是 Azure 在 2022 把 Citus（PostgreSQL 的分散式 extension）納入後推出的 &lt;em>分散式 PostgreSQL&lt;/em> 託管服務 — 它跑真正的 PostgreSQL engine、支援標準 SQL / JOIN / ACID 交易、把單表水平分片到多個 worker node。它跟本 vendor 頁主講的核心 Cosmos DB（NoSQL、multi-model、RU/s 計費）是 &lt;em>兩個不同產品&lt;/em>、只是共用品牌名稱。本文的主責任是釐清這個定位混淆、再講它的架構與選型判準：何時選它、何時該回核心 Cosmos DB、何時一般 PostgreSQL 就夠。&lt;/p>
&lt;p>本文沒有專屬 production case anchor：Cosmos DB for PostgreSQL 的公開 case 覆蓋稀薄、機制以 Azure / Citus vendor 規格與分散式 PostgreSQL 通用工程展開、選型判準用「scale-out PG vs NoSQL vs single-node PG」這個具體決策驅動。&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Scope warning&lt;/strong>：本文涉及的服務命名、node 規格上限、Citus 版本、PostgreSQL major version 支援屬時間敏感、Azure 服務命名歷史上有變動、實作前以 &lt;a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/">Cosmos DB for PostgreSQL 官方文件&lt;/a> cross-verify。&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>典型觸發場景：team 在 Azure 上跑 PostgreSQL、單機 primary 撐到上限 — write throughput、資料量、或單表太大導致 index / vacuum / query 變慢。看到「Cosmos DB」以為是要把資料搬進 NoSQL、重寫 application 成 document model；或反過來、看到「Cosmos DB for PostgreSQL」以為它就是核心 Cosmos DB 的一個 PostgreSQL API、結果發現它是完全不同的東西。命名混淆讓選型從一開始就走偏。&lt;/p>
&lt;p>讀者徵兆：&lt;/p>
&lt;ul>
&lt;li>「單機 PostgreSQL 撐不住、但 application 是 SQL / JOIN / 交易重、不想重寫成 NoSQL」&lt;/li>
&lt;li>「Cosmos DB for PostgreSQL 跟核心 Cosmos DB 是同一個東西嗎」&lt;/li>
&lt;li>「它跟一般 Azure Database for PostgreSQL 差在哪、什麼時候才需要它」&lt;/li>
&lt;li>「跟 CockroachDB / Aurora / Spanner 這些 distributed SQL 怎麼選」&lt;/li>
&lt;/ul>
&lt;p>真實壓力：SQL workload 撐到單機上限時、選錯方向的成本是年級的。誤以為要遷 NoSQL 而重寫 application 是浪費；誤以為核心 Cosmos DB 有「PostgreSQL 相容」而選錯產品也是浪費。正確的選型要先把這個服務放回它真正的分類 — &lt;em>分散式 SQL&lt;/em>、見 &lt;a href="https://tarrragon.github.io/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">distributed SQL&lt;/a>。&lt;/p>
&lt;h2 id="核心機制citus-based-coordinator-worker-分散式-postgresql">核心機制：Citus-based coordinator-worker 分散式 PostgreSQL&lt;/h2>
&lt;p>Cosmos DB for PostgreSQL 的底層是 Citus、把 PostgreSQL 從單機擴展成 coordinator + worker 的分散式叢集。它的關鍵概念有幾個。&lt;/p></description><content:encoded><![CDATA[<p>本文是 <a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB</a> overview 的 deep article、寫作參照 <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">vendor deep article methodology</a>。Cosmos DB for PostgreSQL 是 Azure 在 2022 把 Citus（PostgreSQL 的分散式 extension）納入後推出的 <em>分散式 PostgreSQL</em> 託管服務 — 它跑真正的 PostgreSQL engine、支援標準 SQL / JOIN / ACID 交易、把單表水平分片到多個 worker node。它跟本 vendor 頁主講的核心 Cosmos DB（NoSQL、multi-model、RU/s 計費）是 <em>兩個不同產品</em>、只是共用品牌名稱。本文的主責任是釐清這個定位混淆、再講它的架構與選型判準：何時選它、何時該回核心 Cosmos DB、何時一般 PostgreSQL 就夠。</p>
<p>本文沒有專屬 production case anchor：Cosmos DB for PostgreSQL 的公開 case 覆蓋稀薄、機制以 Azure / Citus vendor 規格與分散式 PostgreSQL 通用工程展開、選型判準用「scale-out PG vs NoSQL vs single-node PG」這個具體決策驅動。</p>
<blockquote>
<p><strong>Scope warning</strong>：本文涉及的服務命名、node 規格上限、Citus 版本、PostgreSQL major version 支援屬時間敏感、Azure 服務命名歷史上有變動、實作前以 <a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/">Cosmos DB for PostgreSQL 官方文件</a> cross-verify。</p></blockquote>
<h2 id="問題情境">問題情境</h2>
<p>典型觸發場景：team 在 Azure 上跑 PostgreSQL、單機 primary 撐到上限 — write throughput、資料量、或單表太大導致 index / vacuum / query 變慢。看到「Cosmos DB」以為是要把資料搬進 NoSQL、重寫 application 成 document model；或反過來、看到「Cosmos DB for PostgreSQL」以為它就是核心 Cosmos DB 的一個 PostgreSQL API、結果發現它是完全不同的東西。命名混淆讓選型從一開始就走偏。</p>
<p>讀者徵兆：</p>
<ul>
<li>「單機 PostgreSQL 撐不住、但 application 是 SQL / JOIN / 交易重、不想重寫成 NoSQL」</li>
<li>「Cosmos DB for PostgreSQL 跟核心 Cosmos DB 是同一個東西嗎」</li>
<li>「它跟一般 Azure Database for PostgreSQL 差在哪、什麼時候才需要它」</li>
<li>「跟 CockroachDB / Aurora / Spanner 這些 distributed SQL 怎麼選」</li>
</ul>
<p>真實壓力：SQL workload 撐到單機上限時、選錯方向的成本是年級的。誤以為要遷 NoSQL 而重寫 application 是浪費；誤以為核心 Cosmos DB 有「PostgreSQL 相容」而選錯產品也是浪費。正確的選型要先把這個服務放回它真正的分類 — <em>分散式 SQL</em>、見 <a href="/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">distributed SQL</a>。</p>
<h2 id="核心機制citus-based-coordinator-worker-分散式-postgresql">核心機制：Citus-based coordinator-worker 分散式 PostgreSQL</h2>
<p>Cosmos DB for PostgreSQL 的底層是 Citus、把 PostgreSQL 從單機擴展成 coordinator + worker 的分散式叢集。它的關鍵概念有幾個。</p>
<p>它跑 <em>真正的 PostgreSQL</em>。不是 wire-compat、不是 PostgreSQL API on top of NoSQL — 是 PostgreSQL engine 加 Citus extension。標準 SQL、JOIN、ACID 交易、PostgreSQL extension 生態（含部分如 PostGIS）都在。這跟核心 Cosmos DB（自己的 query language、SQL-like 但無 JOIN、RU/s 計費）是根本不同的東西。</p>
<p>架構是 coordinator-worker。coordinator node 接 query、根據 distribution column 把 query 路由 / 拆分到 worker node、worker 存實際的 shard。application 連 coordinator、看起來像連一個 PostgreSQL。</p>
<p>distribution column 是核心設計決策、類比核心 Cosmos DB 的 partition key 之於 NoSQL、也類比 <a href="../partition-key-design/">partition-key-design</a> 講的分散原則。表按 distribution column 的值分片到 worker；同一 distribution column 值的 row 落在同一 shard。JOIN 與交易若在同一 distribution column 值內、可以下推到單一 worker 高效執行（co-location）；跨 distribution column 的 JOIN / 交易要跨 worker 協調、較貴。</p>
<p>表分三種：distributed table（按 distribution column 分片、大表用）、reference table（每個 worker 全複本、小的維度表用、讓 JOIN co-locate）、local table（只在 coordinator）。建模的關鍵是把常一起 JOIN 的大表用 <em>同一 distribution column</em> 分片、達成 co-location。</p>
<h2 id="選型判準三方對照">選型判準：三方對照</h2>
<p>這是本文主判讀段。Cosmos DB for PostgreSQL 的正確位置是「single-node PG 不夠、但 workload 仍是 SQL 範式」的中間地帶。</p>
<p>選 Cosmos DB for PostgreSQL 的條件：</p>
<ul>
<li>workload 是 SQL 範式（關聯 schema、JOIN、交易）、不想 / 不能重寫成 NoSQL</li>
<li>single-node PostgreSQL 已達上限（write throughput / 資料量 / 單表大小）、且資料有好的 distribution column（多租戶的 tenant_id、time-series 的某維度）</li>
<li>工作負載偏向多租戶 SaaS 或 real-time analytics over fresh data — Citus 的典型適配場景</li>
<li>想留在 PostgreSQL 生態（SQL、extension、既有 tooling）而非進 NoSQL</li>
</ul>
<p>回核心 Cosmos DB（NoSQL）的條件：</p>
<ul>
<li>資料形狀已是 document / KV、access pattern 固定、不需要 JOIN 與複雜 SQL</li>
<li>需要 multi-model（document + graph + KV）、5 個 consistency level、turnkey multi-region active-active write</li>
<li>RU/s 容量抽象與 serverless 計費更符合 workload — 見 <a href="../ru-cost-model-sizing/">ru-cost-model-sizing</a></li>
</ul>
<p>一般 Azure Database for PostgreSQL（single-node managed PG）就夠的條件：</p>
<ul>
<li>single-node 還沒到上限 — 多數 OLTP baseline 用 vertical scaling + read replica 就夠、不需要分散式</li>
<li>沒有好的 distribution column — 分散式 PostgreSQL 沒有均勻 distribution column 會 hot worker、好處拿不到、複雜度卻全付</li>
<li>不想承擔 distributed SQL 的複雜度（distribution column 設計、co-location 規劃、跨 shard query 成本）</li>
</ul>
<p>判讀句：先確認 single-node PG 真的到上限、再確認 workload 是 SQL 範式（否則考慮 NoSQL）、最後確認有好的 distribution column。三個都成立、Cosmos DB for PostgreSQL 才是對的；缺任一個、回 single-node PG 或核心 Cosmos DB。</p>
<h3 id="跟其他-distributed-sql-的位置">跟其他 distributed SQL 的位置</h3>
<p>Cosmos DB for PostgreSQL 是 Azure 上、PostgreSQL-native、scale-out（co-location 設計驅動）的 distributed SQL。跟 <a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Spanner</a>（全球 external consistency、自己的 SQL 方言）、<a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a>（跨雲、PostgreSQL wire、自動 range 分散）、Aurora DSQL（AWS、全球 active-active）位置不同：Cosmos DB for PostgreSQL 強在「真 PostgreSQL engine + extension 生態 + co-location 控制」、弱在它的分散需要 distribution column 設計（不像 CockroachDB / Spanner 自動分 range）、且綁 Azure。</p>
<h2 id="操作流程">操作流程</h2>
<h3 id="建叢集與設定-distribution-column">建叢集與設定 distribution column</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1">-- 建 distributed table、按 tenant_id 分片（多租戶 SaaS 典型）
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">tenant_id</span><span class="w">   </span><span class="nb">bigint</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">event_id</span><span class="w">    </span><span class="nb">bigint</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">payload</span><span class="w">     </span><span class="n">jsonb</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="n">created_at</span><span class="w">  </span><span class="n">timestamptz</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="n">now</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_distributed_table</span><span class="p">(</span><span class="s1">&#39;events&#39;</span><span class="p">,</span><span class="w"> </span><span class="s1">&#39;tenant_id&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w"></span><span class="c1">-- 維度小表設 reference table、讓 JOIN co-locate
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">tenants</span><span class="w"> </span><span class="p">(</span><span class="n">tenant_id</span><span class="w"> </span><span class="nb">bigint</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">text</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">create_reference_table</span><span class="p">(</span><span class="s1">&#39;tenants&#39;</span><span class="p">);</span></span></span></code></pre></div><p>驗證：<code>SELECT * FROM citus_tables;</code> 看每張表的 distribution column 與 shard 分布；對 distributed table 的查詢若帶 distribution column filter、<code>EXPLAIN</code> 顯示下推到單一 shard、不帶則 fan-out 到所有 worker。</p>
<h3 id="驗證-co-location">驗證 co-location</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- 同 distribution column 的兩張 distributed table JOIN 應 co-located
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">colocation_id</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">citus_tables</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">colocation_id</span><span class="p">;</span></span></span></code></pre></div><p>驗證：常一起 JOIN 的大表落在同一 colocation group、JOIN 在 worker 本地完成、不跨 worker shuffle。</p>
<h3 id="加-worker-擴容">加 worker 擴容</h3>
<p>加 worker node 後 rebalance shard。驗證：rebalance 後 shard 在新舊 worker 間分布均勻、單一 worker 不再是 hot spot。</p>
<h3 id="rollback-boundary">Rollback boundary</h3>
<p>Cosmos DB for PostgreSQL 是叢集級服務、scale worker 是運維操作、可逆（縮回去）。但 <em>distribution column 一旦選定、改它要重建表 + 重灌資料</em> — 跟核心 Cosmos DB 的 partition key 不可改是同一類不可逆設計、見 <a href="../partition-key-design/">partition-key-design</a>。</p>
<h2 id="失敗模式">失敗模式</h2>
<h3 id="把它跟核心-cosmos-db-當同一產品選">把它跟核心 Cosmos DB 當同一產品選</h3>
<p>選型時把「Cosmos DB for PostgreSQL」當成「核心 Cosmos DB 的 PostgreSQL 介面」、規劃用 RU/s、找 consistency level 設定、結果整套 mental model 對不上 — 因為它是分散式 PostgreSQL、用 node 規格計費、用 PostgreSQL 的交易隔離級別。修法是選型第一步就確認「這是分散式 SQL、不是 NoSQL」、規劃按 PostgreSQL + Citus 的模型走、不要套核心 Cosmos DB 的概念。</p>
<h3 id="沒有好的-distribution-column-硬上分散式">沒有好的 distribution column 硬上分散式</h3>
<p>workload 沒有均勻的 distribution column（例如資料天然集中在少數 tenant）、硬分片後變 hot worker、分散式的好處拿不到、複雜度全付。徵兆是少數 worker CPU / IO 飽和、其他 worker 閒置。修法是選型階段就評估 distribution column 的 cardinality 與均勻度；不均勻時、要嘛留 single-node PG（垂直擴 + read replica）、要嘛重新設計 distribution column（如多租戶用 composite 或對 hot tenant 特殊處理）。</p>
<h3 id="大量跨-shard-query--非-co-located-join">大量跨 shard query / 非 co-located JOIN</h3>
<p>application query 大多不帶 distribution column filter、或常做跨 distribution column 的 JOIN、每個 query fan-out 到所有 worker + shuffle、latency 與成本都差。徵兆是 <code>EXPLAIN</code> 顯示 query 打所有 worker、p99 latency 高。修法是重新設計 schema 讓常一起查的表 co-located、把 distribution column 放進熱 query 的 filter；改不動時、這個 workload 可能不適合 scale-out PG、回 single-node 或考慮其他方案。</p>
<h3 id="該用-nosql-卻選了分散式-pg或反之">該用 NoSQL 卻選了分散式 PG（或反之）</h3>
<p>document / KV、固定 access pattern、不需要 JOIN 的 workload 選了 Cosmos DB for PostgreSQL、付了 SQL / distribution column 設計的複雜度卻沒用到關聯能力 — 這類 workload 核心 Cosmos DB（NoSQL）更自然。反過來、SQL / JOIN / 交易重的 workload 被推去核心 Cosmos DB（NoSQL）要重寫成 document model 也是錯。修法是回到「workload 是 SQL 範式還是 document / KV 範式」的根本判斷、見本文選型判準段與 <a href="../mongodb-api-vs-sql-api/">mongodb-api-vs-sql-api</a> 的範式判讀。</p>
<h3 id="anti-recommendationsingle-node-pg-沒到上限不要上">Anti-recommendation：single-node PG 沒到上限不要上</h3>
<p>分散式 PostgreSQL 帶來 distribution column 設計、co-location 規劃、跨 shard query 成本、rebalance 運維。single-node managed PostgreSQL 加 vertical scaling 與 read replica 能撐的 OLTP baseline 比多數團隊以為的大。沒有觸及 single-node 真實上限（write throughput 飽和、單表大到 maintenance 困難、資料量超出單機）就上分散式、是用複雜度換不存在的容量需求。</p>
<h2 id="容量與觀測">容量與觀測</h2>
<ul>
<li>必看 metric：各 worker node 的 CPU / IO / 連線（找 hot worker）、shard 在 worker 間的分布均勻度、跨 shard query 比例、coordinator 連線數</li>
<li>容量單位：node 規格（不是 RU/s）— 規劃是 coordinator + N worker 的 vCPU / memory / storage、跟核心 Cosmos DB 的 RU 思維完全不同、不要混用 <a href="../ru-cost-model-sizing/">ru-cost-model-sizing</a> 的 RU 模型來估這個服務</li>
<li>distribution column 均勻度是容量上限的真實決定因素 — 跟 <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a> 同模型、hot worker 讓名義叢集容量達不到</li>
<li>回 <a href="/blog/backend/09-performance-capacity/capacity-planning/" data-link-title="9.6 容量規劃模型" data-link-desc="peak forecast、headroom budget、growth curve、autoscaling sizing">9.6 容量規劃模型</a>：scale-out 的有效容量 = node 數 × 單 node 容量 × distribution 均勻度</li>
<li>Alert：單一 worker 飽和（distribution skew）、跨 shard query 比例上升、rebalance 後仍不均</li>
</ul>
<h2 id="邊界與整合">邊界與整合</h2>
<ul>
<li>定位釐清：本服務是 <em>分散式 PostgreSQL</em>、不是核心 Cosmos DB（NoSQL）— 共用品牌名稱、產品不同、選型不要混淆</li>
<li>跟核心 Cosmos DB 的分界：SQL / JOIN / 交易 + 到單機上限 → 本服務；document / KV / multi-model / multi-region active-active → 核心 Cosmos DB、見 <a href="../mongodb-api-vs-sql-api/">mongodb-api-vs-sql-api</a></li>
<li>跟 PostgreSQL vendor 的分界：single-node 沒到上限 → <a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">Azure Database for PostgreSQL / 一般 PG</a>；PostgreSQL 既有的 <a href="/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/" data-link-title="Specialized PostgreSQL Variants" data-link-desc="pgvectorscale、Citus、TimescaleDB、PostGIS、AlloyDB、Cosmos DB for PostgreSQL、serverless PG 等 PostgreSQL 變體的選型邊界">Specialized PostgreSQL Variants</a> 段已把 Cosmos DB for PostgreSQL 列為 Citus-based 變體之一</li>
<li>跟其他 distributed SQL：<a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Spanner</a>（全球強一致）、<a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB</a>（跨雲、自動 range）— 本服務強在真 PostgreSQL engine + co-location 控制、弱在需 distribution column 設計 + 綁 Azure</li>
<li>distribution column 不可改：跟 <a href="../partition-key-design/">partition-key-design</a> 的 partition key 不可改是同類不可逆設計</li>
<li>Knowledge card：<a href="/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">distributed SQL</a> / <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition</a></li>
</ul>
<h2 id="相關連結">相關連結</h2>
<ul>
<li><a href="/blog/backend/01-database/vendors/cosmosdb/" data-link-title="Azure Cosmos DB" data-link-desc="全球分散式 multi-model DB、5 個 consistency levels、Microsoft 自家 dogfood 證據">Cosmos DB vendor overview</a> — 本文是該頁尾 Cosmos DB for PostgreSQL backlog 的深度展開</li>
<li><a href="../mongodb-api-vs-sql-api/">mongodb-api-vs-sql-api</a> — SQL 範式 vs document / KV 範式的根本判讀</li>
<li><a href="/blog/backend/01-database/vendors/postgresql/" data-link-title="PostgreSQL" data-link-desc="多用途 OLTP 主流關聯式資料庫、MVCC、豐富 SQL 特性、是 Aurora / Cosmos DB / Spanner / CockroachDB / Aurora DSQL 的相容目標">PostgreSQL vendor</a> / <a href="/blog/backend/01-database/vendors/postgresql/specialized-pg-variants/" data-link-title="Specialized PostgreSQL Variants" data-link-desc="pgvectorscale、Citus、TimescaleDB、PostGIS、AlloyDB、Cosmos DB for PostgreSQL、serverless PG 等 PostgreSQL 變體的選型邊界">Specialized PostgreSQL Variants</a> — single-node PG 與 Citus 變體定位</li>
<li><a href="/blog/backend/01-database/vendors/spanner/" data-link-title="Google Cloud Spanner" data-link-desc="全球分散式 strong-consistency OLTP、TrueTime API、線性擴展到 10 億 req/sec">Spanner vendor</a> / <a href="/blog/backend/01-database/vendors/cockroachdb/" data-link-title="CockroachDB" data-link-desc="分散式 SQL、PostgreSQL 相容、跨區強一致、Spanner 的開源 / 跨雲替代">CockroachDB vendor</a> — 其他 distributed SQL 對照</li>
<li><a href="../partition-key-design/">partition-key-design</a> — distribution column 不可改的同類設計</li>
<li><a href="/blog/backend/knowledge-cards/distributed-sql/" data-link-title="Distributed SQL" data-link-desc="把 SQL 與交易語意延伸到多節點與多區域的資料庫形態">Distributed SQL 卡片</a> / <a href="/blog/backend/knowledge-cards/hot-partition/" data-link-title="Hot Partition" data-link-desc="說明分散式 KV / OLTP 中、單一 partition 流量遠超其他的容量問題">Hot Partition 卡片</a> — 概念基底</li>
<li>官方：<a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/">Azure Cosmos DB for PostgreSQL</a> / <a href="https://learn.microsoft.com/azure/cosmos-db/postgresql/concepts-distributed-data">Citus distributed tables</a></li>
</ul>
]]></content:encoded></item></channel></rss>