Citus on Tarragon

PostgreSQL Citus Distributed: turning PG into a sharded cluster with an extension

Tue, 19 May 2026 00:00:00 +0000

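This article is the implementation-layer deep dive for the PostgreSQL overview. The overview already covers where PG sits in the OLTP spectrum; this article focuses on the Citus distributed extension and how it turns PG into a sharded cluster.

When write throughput on a single-primary PG hits the single-machine ceiling (roughly 50K-100K WPS), there are three options:

Application-layer sharding: the application manages shard routing itself
Citus: a PG extension with automatic routing and cross-shard queries
Distributed SQL (CockroachDB / Aurora DSQL / Spanner): a different engine

The core driver for choosing Citus is keeping PG SQL syntax and the extension ecosystem. But "the application barely needs to change" is the optimistic version. In practice the application has to be redesigned around the distribution column (queries add a filter on it, transactions stay within one shard, reference tables are kept small), and cross-shard query automation is weaker than in Vitess. The cost is coordinator / worker deployment complexity, cross-shard query limits, and the work of reshaping the application schema.

Before reading, it helps to align on the shard key, routing, resharding, and cross-shard query semantics from Database Sharding; when capacity becomes unbalanced, continue with Hot Partition.

The core difference from MySQL Vitess sharding: Citus is a PG extension (it runs inside PG), while Vitess is a separate proxy + tablet system (wrapping MySQL). Citus uses PG-native mechanisms (extension hooks into the planner and executor); Vitess is an external wrapper.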

Citus architecture: Coordinator + Worker

                ┌─────────────────┐
   Application  │   Coordinator   │  ← exposes the PG wire protocol; planner, routing
                │   (Citus + PG)  │
                └────┬─────┬──────┘
                     │     │
              ┌──────┘     └──────┐
              ▼                   ▼
        ┌──────────┐         ┌──────────┐
        │ Worker 1 │         │ Worker 2 │  ← each runs PG + the Citus extension
        │  (PG)    │         │  (PG)    │
        │ shard 1,3│         │ shard 2,4│
        └──────────┘         └──────────┘

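Coordinator:

Looks like PG to the application (same port, same wire protocol)
Receives SQL; the Citus planner decomposes the query and routes it to workers
Stores no data (the shards of a distributed table live on the workers)
Stores metadata (which shard is on which worker)

Worker:

A standard PG instance plus the Citus extension
Each holds some number of shards
Receives queries from the coordinator, executes them locally, and returns results

Shard:

A distributed table is split into N shards (32 by default)
Each shard is a physical PG table on a worker (its name carries a _ plus shard ID suffix, e.g. orders_102008)
It behaves like an ordinary PG table, so you can connect to the worker directly and access it with PG tools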
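
To see the shard-to-worker mapping on a running cluster, one option is the `citus_shards` view that recent Citus releases ship. A minimal sketch, run on the coordinator, assuming a distributed table named `orders` (the example table used later in this article):

```sql
-- List each shard of a distributed table and the worker that holds it.
SELECT shard_name,
       nodename,
       nodeport,
       pg_size_pretty(shard_size) AS size
FROM citus_shards
WHERE table_name = 'orders'::regclass
ORDER BY shardid;
```

Each row corresponds to one physical table on a worker, which is the table you would find if you connected to that worker directly.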

3 Table Types

Distributed table: split across shards

    -- Create an ordinary PG table
    CREATE TABLE orders (
        id BIGSERIAL,
        user_id BIGINT NOT NULL,
        amount DECIMAL(10,2),
        created_at TIMESTAMP,
        PRIMARY KEY (user_id, id)  -- the PK must include the distribution column
    );

    -- Use Citus to make it distributed
    SELECT create_distributed_table('orders', 'user_id');

user_id is the distribution column: Citus hashes it to decide which shard a row belongs to. The PK must include the distribution column (the same requirement as MySQL partitioning).

Compared with Vitess Vindexes:

Citus: hash of the distribution column → shard (a single hash function, the algorithm is not selectable)
Vitess: several Vindex types to choose from (hash / lookup_hash / xxhash / null)
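
The hash-to-shard mapping can be inspected directly. The sketch below assumes the `orders` table above and uses the Citus helper `get_shard_id_for_distribution_column` together with the `pg_dist_shard` metadata table; the value 12345 is only an illustration.

```sql
-- Which shard would a row with user_id = 12345 land in?
SELECT get_shard_id_for_distribution_column('orders', 12345);

-- The hash range owned by each shard of the table.
SELECT shardid, shardminvalue, shardmaxvalue
FROM pg_dist_shard
WHERE logicalrelid = 'orders'::regclass
ORDER BY shardminvalue::bigint;
```

Each shard owns a contiguous range of hash values, so a row's shard is decided by where the hash of its distribution column falls.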

Reference table: shared by all shards

    CREATE TABLE products (
        id SERIAL PRIMARY KEY,
        name VARCHAR(100),
        price DECIMAL
    );

    SELECT create_reference_table('products');

products has a full copy on every worker; writes go through the coordinator, which broadcasts them to all workers.

Use it for:

Small lookup tables (country codes, product categories, and so on)
JOINs against distributed tables: the reference table is on every worker, so no cross-shard work is needed
Low write frequency (broadcast cost grows linearly with the number of workers)

Local table: a PG table on the coordinator

    CREATE TABLE audit_log (
        id SERIAL PRIMARY KEY,
        event JSONB
    );
    -- No Citus function is called, so the table stays on the coordinator by default

It behaves like an ordinary PG table. Use it for tables that do not need to be distributed (such as admin metadata).

Colocation: aligning shards across distributed tables

When two distributed tables use the same distribution column (for example user_id) and the same shard count, Citus colocates them automatically:

    SELECT create_distributed_table('orders', 'user_id');
    SELECT create_distributed_table('user_addresses', 'user_id', colocate_with => 'orders');

After colocation:

The orders and user_addresses rows for user_id = 100 sit on the same worker, in matching shards
JOINs stay on one worker and are efficient
PG-native FK constraints can be used (across tables, but within the same shard)

Colocation is the core cross-table consistency mechanism in Citus's design. A cross-table query without colocation becomes cross-worker, and performance drops sharply.
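
As a sketch of what colocation allows, the example below adds a foreign key and a shard-local JOIN. The table `order_items` and its columns are hypothetical (they do not appear elsewhere in this article), and the example assumes `orders` is already distributed on `user_id` as shown above.

```sql
-- Hypothetical child table, distributed on the same column as orders.
CREATE TABLE order_items (
    user_id  BIGINT NOT NULL,
    order_id BIGINT NOT NULL,
    line_no  INT    NOT NULL,
    sku      TEXT,
    PRIMARY KEY (user_id, order_id, line_no)
);
SELECT create_distributed_table('order_items', 'user_id', colocate_with => 'orders');

-- The foreign key includes the distribution column, so each shard can
-- enforce it locally against its matching orders shard.
ALTER TABLE order_items
    ADD CONSTRAINT order_items_order_fk
    FOREIGN KEY (user_id, order_id) REFERENCES orders (user_id, id);

-- Co-located join: with a filter on user_id it can run on a single worker.
SELECT o.id, i.sku
FROM orders o
JOIN order_items i ON i.user_id = o.user_id AND i.order_id = o.id
WHERE o.user_id = 100;
```

The design point is that both the key and the join condition carry `user_id`; without it, neither the constraint nor the join can stay inside one shard.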

Step-by-step setup (local cluster)

For production, use a managed offering such as Azure Cosmos DB for PostgreSQL (Microsoft-managed, same engine; the earlier standalone Citus Cloud service has been retired). Self-hosted:

Step 1: install PG + Citus on the coordinator and every worker

    # On every node (coordinator + 2 workers)
    apt install postgresql-14
    apt install postgresql-14-citus-12.0

    # In postgresql.conf (a config setting, not a shell command)
    shared_preload_libraries = 'citus'

    systemctl restart postgresql

    -- Run on every node
    CREATE EXTENSION citus;

Step 2: register the workers on the coordinator

    -- Run on the coordinator
    SELECT citus_add_node('worker1.example.com', 5432);
    SELECT citus_add_node('worker2.example.com', 5432);

    -- Verify
    SELECT * FROM citus_get_active_worker_nodes();

Step 3: create a distributed table

    CREATE TABLE orders (
        id BIGSERIAL,
        user_id BIGINT NOT NULL,
        amount DECIMAL(10,2),
        created_at TIMESTAMP,
        PRIMARY KEY (user_id, id)
    );

    SELECT create_distributed_table('orders', 'user_id');

Citus automatically splits orders into 32 shards (orders_102008 and so on) and assigns them to the workers.

Step 4: point the application at the coordinator

The application's connection string points at the coordinator's IP / port (it does not need to know the workers exist).

    -- Queries run from the application are routed transparently by Citus
    INSERT INTO orders (user_id, amount) VALUES (12345, 50);
    -- → Citus hashes user_id=12345 to a shard (say shard 17) and routes to the worker that holds it

    SELECT * FROM orders WHERE user_id = 12345;
    -- → Single-shard query, very fast

    SELECT count(*) FROM orders;
    -- → Cross-shard aggregation; Citus runs it in parallel and merges the results

5 Production Pitfalls

1. Wrong distribution column: cross-shard queries become the norm

Choosing created_at or id (auto-increment) as the distribution column looks evenly distributed, but if the application mostly queries by user_id, every query becomes cross-shard and performance collapses.

Fix:

Pick the column the application filters and joins on most often (usually tenant_id or user_id)
Audit the application's top queries and confirm the distribution column matches the query pattern
Changing the distribution column rewrites every shard; like resharding, it is a major project
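
For reference, recent Citus releases expose the rewrite as a single function call. A hedged sketch, assuming the `orders` table from this article was originally distributed on some other column and that its primary key already includes `user_id`:

```sql
-- Re-distribute an existing table on user_id.
-- This copies the data into a new set of shards, so plan for the I/O
-- and run it in a maintenance window on large tables.
SELECT alter_distributed_table('orders', distribution_column := 'user_id');
```

The call is short, but the work behind it is the full shard rewrite described above, which is why getting the column right at design time matters.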

2. Cross-shard transaction limits

For a transaction that spans multiple shards (for example, an UPDATE of two rows with different user_id values), Citus uses 2PC (two-phase commit), with limits:

Multi-statement transactions across shards may need an explicit SET citus.multi_shard_modify_mode = 'sequential'
Isolation levels are not guaranteed to be serializable across shards
DDL across shards runs sequentially

Fix:

Design the schema to avoid cross-shard transactions (transactions inside one colocation group are fine)
Where a cross-shard transaction is necessary, set the multi-shard mode explicitly
For strict cross-shard consistency, consider distributed SQL (CockroachDB / Aurora DSQL)
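
The two cases can be contrasted in a short sketch. It assumes the `orders` table from this article; the specific `user_id` and `id` values are made up for illustration.

```sql
-- Single-tenant transaction: both statements filter on the same user_id,
-- so they touch one shard group and commit like a local PG transaction.
BEGIN;
UPDATE orders SET amount = amount + 5 WHERE user_id = 100 AND id = 1;
DELETE FROM orders WHERE user_id = 100 AND id = 2;
COMMIT;

-- Cross-shard transaction: two different user_id values. If their shards
-- live on different workers, the commit is coordinated with 2PC.
BEGIN;
-- Set the mode explicitly, as recommended above; SET LOCAL limits it to this transaction.
SET LOCAL citus.multi_shard_modify_mode TO 'sequential';
UPDATE orders SET amount = amount - 10 WHERE user_id = 100 AND id = 1;
UPDATE orders SET amount = amount + 10 WHERE user_id = 200 AND id = 7;
COMMIT;
```

If most of an application's transactions look like the second block, that is a signal to revisit the distribution column rather than to tune the transaction settings.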

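3. Oversized reference table: broadcast write cost explodes

A reference table has a copy on every worker, and each write is broadcast to all of them. A reference table with 100K rows plus high-frequency writes means every write is applied on N workers, so the cost is N×.

Fix:

Limit reference tables to small, rarely written lookup data
A very large table should not be a reference table; consider distributing it
Monitor the write rate on reference tables and re-evaluate when it passes a threshold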
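
A simple starting point for that review is to list the reference tables and their sizes. This sketch uses the `citus_tables` view on the coordinator; it covers size only, and write rate still has to come from your own monitoring.

```sql
-- List every reference table with its reported size.
SELECT table_name, table_size
FROM citus_tables
WHERE citus_table_type = 'reference'
ORDER BY table_name;
```

Any table here that is both large and frequently written is a candidate for becoming a distributed table instead.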

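4. Misaligned colocation: hidden cross-shard JOINs

    -- Looks fine, but is a slow cross-shard JOIN
    SELECT * FROM orders o JOIN user_addresses ua ON o.user_id = ua.user_id;

If user_addresses was created without colocate_with => 'orders', the two tables' shards are placed independently and the JOIN crosses workers.

Fix:

Align colocate_with when creating related tables
Check colocation_id with SELECT * FROM citus_tables to confirm the tables line up
For JOINs across non-colocated tables, use a materialized view or split the query in the application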
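
The check and one possible repair can be sketched as follows, assuming the `orders` and `user_addresses` tables from this article and a Citus release that provides `alter_distributed_table`:

```sql
-- Check: co-located tables report the same colocation_id.
SELECT table_name, distribution_column, shard_count, colocation_id
FROM citus_tables
WHERE table_name IN ('orders'::regclass, 'user_addresses'::regclass);

-- Repair: rewrite user_addresses so its shards line up with orders.
-- Like changing the distribution column, this rewrites the table's shards.
SELECT alter_distributed_table('user_addresses', colocate_with := 'orders');
```

Running the first query again afterwards should show both tables in the same colocation group.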

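5. Worker failover: the coordinator has to know

When a worker fails, by default the coordinator just sees queries fail; Citus does not fail over automatically.

Fix (Citus 11+):

Use shard replication (citus.shard_replication_factor = 2) so that each shard has a copy on 2 workers
Run PG streaming replication at the worker layer, with Patroni managing failover
If the coordinator fails the whole cluster is unavailable, so the coordinator needs HA too (Patroni)

Compared with Vitess, Citus's HA story is weaker, and production deployments have to plan for it seriously.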

When to use Citus

Condition	Recommendation
Multi-tenant SaaS with tenant_id as the natural distribution column	Yes
Write throughput > 50K WPS, more than a single PG can handle	Yes
Need to keep PG SQL + extensions (pgvector / TimescaleDB)	Yes
80% of the application's queries use the same distribution column	Yes
Heavy ad-hoc cross-tenant aggregation	No (cross-shard is slow)
Strong cross-shard consistency requirements	No (use CockroachDB)
Want zero-ops managed	Azure Cosmos DB for PostgreSQL (same engine)

Capacity planning

Coordinator: moderate CPU + RAM; metadata is small and it stores no data
Worker: per-worker spec the same as a single production PG
Shard count: default 32; in practice often set to worker count × 4-8
Replication factor: at least 2 in production
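
As a sketch of the shard-count rule of thumb: the setting has to be in place before the table is distributed. The worker count of 8 is an assumption for illustration, giving 8 × 8 = 64 shards.

```sql
-- Assumed cluster size: 8 workers, 8 shards per worker.
SET citus.shard_count = 64;

-- Applies to tables distributed after the setting is changed;
-- here orders is assumed not to be distributed yet.
SELECT create_distributed_table('orders', 'user_id');
```

Choosing more shards than workers leaves room to add workers later and rebalance without splitting shards.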

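Integration with other modules

With Replication Topology

The coordinator and each worker run PG streaming replication; Citus does not replace PG replication. Worker failover uses Patroni / streaming replication. See Replication Topology.

With PG Extensions

Citus is compatible with most other PG extensions (pgvector / TimescaleDB / pg_stat_statements). Because it stays an extension itself, it keeps the PostgreSQL ecosystem's integration points. See the PG Extension Ecosystem article (not yet written).

With MySQL Vitess

Dimension	Citus	Vitess
Deployment model	PG extension	Separate proxy + tablet
Main scenario	Multi-tenant SaaS	Very large-scale sharding
Cross-shard JOIN	Colocation + reference tables	VTGate splits and aggregates automatically
FK	Available within a colocation group	Supported in Vitess 18+, with cross-shard limits
HA	Relies on Patroni + replication factor	VTOrc + replication
Learning curve	Medium (PG ops experience is enough)	High (4 components)

Citus is smoother for PG-native scenarios and Vitess for MySQL-native ones; they do not compete directly. See MySQL Vitess Sharding.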

Cosmos DB for PostgreSQL: Citus-based distributed PostgreSQL, a different product from core Cosmos DB, and when to choose it over core Cosmos DB or plain PG

Tue, 02 Jun 2026 00:00:00 +0000

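This article is a deep dive under the Cosmos DB overview, written following the vendor deep article methodology. Cosmos DB for PostgreSQL is the managed distributed PostgreSQL service that Azure introduced in 2022 after bringing Citus (the distributed extension for PostgreSQL) under the Cosmos DB brand. It runs the real PostgreSQL engine, supports standard SQL / JOIN / ACID transactions, and shards a single table horizontally across multiple worker nodes. It and the core Cosmos DB that this vendor page mainly covers (NoSQL, multi-model, RU/s billing) are two different products that only share a brand name. The main job of this article is to clear up that positioning confusion, then cover the architecture and the selection criteria: when to choose it, when to go back to core Cosmos DB, and when plain PostgreSQL is enough.

This article has no dedicated production case anchor. Public cases for Cosmos DB for PostgreSQL are sparse, so the mechanisms are developed from Azure / Citus vendor specifications and general distributed PostgreSQL engineering, and the selection criteria are driven by one concrete decision: scale-out PG vs NoSQL vs single-node PG.

Scope warning: the service naming, node spec limits, Citus version, and supported PostgreSQL major versions covered here are time-sensitive, and Azure service names have changed before. Cross-verify against the official Cosmos DB for PostgreSQL documentation before implementing.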

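Problem context

Typical trigger: a team runs PostgreSQL on Azure and the single-machine primary has hit its limit, whether in write throughput, data volume, or a single table so large that indexes, vacuum, and queries slow down. They see "Cosmos DB" and assume it means moving data into NoSQL and rewriting the application for a document model. Or, the other way round, they see "Cosmos DB for PostgreSQL", assume it is a PostgreSQL API on core Cosmos DB, and then discover it is something else entirely. The naming confusion sends the selection off course from the start.

Reader symptoms:

"Single-machine PostgreSQL can't keep up, but the application is heavy on SQL / JOINs / transactions and we don't want to rewrite it for NoSQL"
"Is Cosmos DB for PostgreSQL the same thing as core Cosmos DB?"
"How does it differ from plain Azure Database for PostgreSQL, and when do we actually need it?"
"How do we choose between it and distributed SQL such as CockroachDB / Aurora / Spanner?"

The real pressure: when a SQL workload reaches the single-machine limit, a wrong turn costs on the order of years. Rewriting the application because you wrongly believed you had to migrate to NoSQL is waste; picking the wrong product because you wrongly believed core Cosmos DB is "PostgreSQL compatible" is waste too. Getting the selection right starts with putting this service back in its real category: distributed SQL (see distributed SQL).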

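Core mechanism: Citus-based coordinator-worker distributed PostgreSQL

Underneath, Cosmos DB for PostgreSQL is Citus, which scales PostgreSQL from a single machine to a distributed cluster of a coordinator plus workers. It has a few key concepts.

It runs real PostgreSQL. It is not wire-compatible emulation and not a PostgreSQL API on top of NoSQL; it is the PostgreSQL engine plus the Citus extension. Standard SQL, JOINs, ACID transactions, and the PostgreSQL extension ecosystem (including some extensions such as PostGIS) are all there. That makes it fundamentally different from core Cosmos DB (its own query language, SQL-like but without JOINs, billed in RU/s).

The architecture is coordinator-worker. The coordinator node receives a query and, based on the distribution column, routes or splits it across the worker nodes; the workers store the actual shards. The application connects to the coordinator, and it looks like connecting to a single PostgreSQL.

The distribution column is the core design decision. It plays the role that the partition key plays in core Cosmos DB's NoSQL model, and it follows the distribution principles covered in partition-key-design. Tables are sharded across workers by the value of the distribution column, and rows with the same value land in the same shard. JOINs and transactions that stay within one distribution column value can be pushed down to a single worker and run efficiently (co-location); JOINs and transactions that span values need cross-worker coordination and cost more.

There are three table types: distributed tables (sharded by the distribution column, for large tables), reference tables (a full copy on every worker, for small dimension tables, so JOINs stay co-located), and local tables (on the coordinator only). The key to modeling is sharding the large tables that are often joined together on the same distribution column so that they are co-located.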

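Selection criteria: a three-way comparison

This is the main decision section of the article. The right place for Cosmos DB for PostgreSQL is the middle ground where single-node PG is no longer enough but the workload is still in the SQL paradigm.

Conditions for choosing Cosmos DB for PostgreSQL:

The workload is in the SQL paradigm (relational schema, JOINs, transactions), and you do not want to, or cannot, rewrite it for NoSQL
Single-node PostgreSQL has reached its limit (write throughput / data volume / single-table size), and the data has a good distribution column (tenant_id for multi-tenant, some dimension for time-series)
The workload leans toward multi-tenant SaaS or real-time analytics over fresh data, which are the typical fits for Citus
You want to stay in the PostgreSQL ecosystem (SQL, extensions, existing tooling) rather than move to NoSQL

Conditions for going back to core Cosmos DB (NoSQL):

The data is already document / KV shaped, access patterns are fixed, and JOINs and complex SQL are not needed
You need multi-model (document + graph + KV), the 5 consistency levels, and turnkey multi-region active-active writes
The RU/s capacity abstraction and serverless billing fit the workload better (see ru-cost-model-sizing)

Conditions where plain Azure Database for PostgreSQL (single-node managed PG) is enough:

Single-node has not reached its limit: most OLTP baselines are fine with vertical scaling + read replicas and do not need distribution
There is no good distribution column: without an even one, distributed PostgreSQL ends up with hot workers, so you get none of the benefit and pay all of the complexity
You do not want to take on distributed SQL complexity (distribution column design, co-location planning, cross-shard query cost)

Decision rule: first confirm that single-node PG has really hit its limit, then confirm the workload is in the SQL paradigm (otherwise consider NoSQL), and finally confirm there is a good distribution column. If all three hold, Cosmos DB for PostgreSQL is the right choice; if any one is missing, go back to single-node PG or core Cosmos DB.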

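Position relative to other distributed SQL

Cosmos DB for PostgreSQL is distributed SQL that is on Azure, PostgreSQL-native, and scale-out (driven by co-location design). It sits in a different place from Spanner (global external consistency, its own SQL dialect), CockroachDB (multi-cloud, PostgreSQL wire protocol, automatic range distribution), and Aurora DSQL (AWS, global active-active). Its strengths are the real PostgreSQL engine, the extension ecosystem, and control over co-location. Its weaknesses are that distribution requires distribution column design (unlike CockroachDB / Spanner, which split ranges automatically) and that it is tied to Azure.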

Operations

Creating the cluster and setting the distribution column

    -- Create a distributed table sharded by tenant_id (typical for multi-tenant SaaS)
    CREATE TABLE events (
        tenant_id   bigint NOT NULL,
        event_id    bigint NOT NULL,
        payload     jsonb,
        created_at  timestamptz DEFAULT now()
    );
    SELECT create_distributed_table('events', 'tenant_id');

    -- Make the small dimension table a reference table so JOINs stay co-located
    CREATE TABLE tenants (tenant_id bigint PRIMARY KEY, name text);
    SELECT create_reference_table('tenants');

Verify: SELECT * FROM citus_tables; shows each table's distribution column and shard layout. For a query on a distributed table, EXPLAIN shows it pushed down to a single shard when it filters on the distribution column, and fanned out to all workers when it does not.
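
The EXPLAIN check described above can be sketched with the `events` table from this section; the tenant value 42 is arbitrary. Citus plans typically report a task count in the distributed scan node, which is the number to compare between the two plans.

```sql
-- Filter on the distribution column: expect the plan to target a single shard.
EXPLAIN SELECT * FROM events WHERE tenant_id = 42;

-- No distribution-column filter: expect one task per shard, spread across all workers.
EXPLAIN SELECT count(*) FROM events;
```

Running the same comparison against the application's hottest queries is a quick way to find the ones that fan out unintentionally.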

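Verifying co-location

    -- Two distributed tables on the same distribution column should be co-located for JOINs
    SELECT colocation_id, count(*)
    FROM citus_tables GROUP BY colocation_id;

Verify: large tables that are often joined together fall in the same colocation group, so the JOIN completes locally on each worker with no cross-worker shuffle.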

Scaling out by adding workers

After adding a worker node, rebalance the shards. Verify: after the rebalance, shards are spread evenly across old and new workers and no single worker remains a hot spot.
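
A minimal sketch of that sequence in SQL, assuming self-hosted Citus and a hypothetical host name `worker3.example.com`. On the managed service, adding the node is done through the service's own tooling rather than `citus_add_node`, so treat the first statement as the self-hosted equivalent.

```sql
-- Register the new worker (self-hosted Citus).
SELECT citus_add_node('worker3.example.com', 5432);

-- Move shards so the new worker takes its share.
SELECT rebalance_table_shards();

-- Verify: shard count and data size per worker.
SELECT nodename,
       count(*) AS shards,
       pg_size_pretty(sum(shard_size)) AS data
FROM citus_shards
GROUP BY nodename
ORDER BY nodename;
```

An even shard count with an uneven data size usually points to skew in the distribution column rather than a rebalance problem.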

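Rollback boundary

Cosmos DB for PostgreSQL is a cluster-level service, and scaling workers is an operational action that is reversible (you can scale back in). But once the distribution column is chosen, changing it means rebuilding the table and reloading the data. This is the same kind of irreversible design decision as the immutable partition key in core Cosmos DB (see partition-key-design).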

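Failure modes

Treating it as the same product as core Cosmos DB

During selection, "Cosmos DB for PostgreSQL" is treated as "the PostgreSQL interface of core Cosmos DB": planning is done in RU/s, people look for consistency level settings, and the whole mental model fails to match. The reason is that this is distributed PostgreSQL, billed by node spec and using PostgreSQL's transaction isolation levels. The fix is to confirm at step one that this is distributed SQL and not NoSQL, plan according to the PostgreSQL + Citus model, and not apply core Cosmos DB concepts.

Forcing distribution without a good distribution column

The workload has no evenly distributed column (for example, data is naturally concentrated in a few tenants). After forced sharding it ends up with hot workers, getting none of the benefit of distribution and paying all of the complexity. The symptom is a few workers saturated on CPU / IO while the others sit idle. The fix is to evaluate the cardinality and evenness of the distribution column during selection. If it is uneven, either stay on single-node PG (vertical scaling + read replicas) or redesign the distribution column (for multi-tenant, use a composite key or handle hot tenants specially).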
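
Both sides of that evaluation can be sketched with plain SQL against the `events` table used earlier in this article; the `citus_shards` view is assumed to be available in the Citus version in use.

```sql
-- Data side: the largest tenants by row count.
SELECT tenant_id, count(*) AS row_count
FROM events
GROUP BY tenant_id
ORDER BY row_count DESC
LIMIT 10;

-- Placement side: the largest shards and the workers that hold them.
SELECT shard_name, nodename, pg_size_pretty(shard_size) AS size
FROM citus_shards
WHERE table_name = 'events'::regclass
ORDER BY shard_size DESC
LIMIT 10;
```

If a handful of tenants dominate the first result and their shards dominate the second, the hot-worker symptom described above is likely to follow.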

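Heavy cross-shard queries / non-co-located JOINs

Most application queries carry no filter on the distribution column, or JOINs frequently span distribution column values. Every query fans out to all workers and shuffles data, so both latency and cost suffer. The symptom is EXPLAIN showing queries hitting every worker and a high p99 latency. The fix is to redesign the schema so that tables queried together are co-located, and to put the distribution column into the filters of hot queries. If that cannot be done, the workload may not suit scale-out PG; go back to single-node or consider another approach.

Choosing distributed PG when NoSQL was the right fit (or the reverse)

A document / KV workload with fixed access patterns and no need for JOINs chose Cosmos DB for PostgreSQL, paying for the complexity of SQL and distribution column design without using any relational capability. For that kind of workload, core Cosmos DB (NoSQL) is the more natural fit. The reverse is also wrong: a workload heavy on SQL / JOINs / transactions pushed to core Cosmos DB (NoSQL) has to be rewritten for a document model. The fix is to return to the basic question of whether the workload is in the SQL paradigm or the document / KV paradigm; see the selection criteria section of this article and the paradigm discussion in mongodb-api-vs-sql-api.

Anti-recommendation: do not adopt it before single-node PG hits its limit

Distributed PostgreSQL brings distribution column design, co-location planning, cross-shard query cost, and rebalance operations. The OLTP baseline that single-node managed PostgreSQL can carry with vertical scaling and read replicas is larger than most teams assume. Going distributed before reaching the real single-node limit (write throughput saturated, a single table too large to maintain, data volume beyond one machine) trades complexity for a capacity need that does not exist.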

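Capacity and observability

Must-watch metrics: CPU / IO / connections on each worker node (to find hot workers), how evenly shards are spread across workers, the share of cross-shard queries, and coordinator connection count
Unit of capacity: node spec (not RU/s). Planning is about vCPU / memory / storage for the coordinator + N workers, which is a completely different mindset from core Cosmos DB's RUs; do not use the RU model from ru-cost-model-sizing to size this service
Evenness of the distribution column is what really sets the capacity ceiling: it is the same model as Hot Partition, and a hot worker keeps the cluster from reaching its nominal capacity
Back to the capacity planning model in 9.6: effective scale-out capacity = number of nodes × per-node capacity × distribution evenness
Alerts: a single worker saturating (distribution skew), a rising share of cross-shard queries, and shards still uneven after a rebalance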
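
The capacity model in the list above can be written out as a formula. This is a sketch under one stated assumption: the text does not define the evenness factor, so here $u$ is taken to be the mean worker load divided by the load on the hottest worker. The numbers in the worked line are illustrative, not measurements.

```latex
C_{\text{eff}} = N \times C_{\text{node}} \times u,
\qquad u = \frac{\bar{\ell}}{\ell_{\max}}, \quad 0 < u \le 1

% Illustrative values: N = 4 workers, C_node = 20{,}000 writes/s, u = 0.6
C_{\text{eff}} = 4 \times 20{,}000 \times 0.6 = 48{,}000 \ \text{writes/s}
\quad (\text{nominal: } 4 \times 20{,}000 = 80{,}000)
```

Under this definition the cluster saturates as soon as its hottest worker does, which is why skew, not node count, ends up setting the usable ceiling.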

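Boundaries and integration

Positioning: this service is distributed PostgreSQL, not core Cosmos DB (NoSQL). They share a brand name but are different products; do not mix them up during selection
Boundary with core Cosmos DB: SQL / JOINs / transactions plus a single-machine limit → this service; document / KV / multi-model / multi-region active-active → core Cosmos DB (see mongodb-api-vs-sql-api)
Boundary with the PostgreSQL vendor page: single-node not yet at its limit → Azure Database for PostgreSQL / plain PG; the existing Specialized PostgreSQL Variants section on the PostgreSQL side already lists Cosmos DB for PostgreSQL as one of the Citus-based variants
Against other distributed SQL: Spanner (global strong consistency), CockroachDB (multi-cloud, automatic ranges). This service is strong on the real PostgreSQL engine and co-location control, and weak on needing distribution column design and being tied to Azure
The distribution column cannot be changed cheaply: it is the same kind of irreversible design decision as the immutable partition key in partition-key-design
Knowledge cards: distributed SQL / Hot Partition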