Sharding on Tarragon

MySQL Vitess Sharding: how VTGate, VTTablet, VReplication, and VSchema work together

Tue, 19 May 2026 00:00:00 +0000

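This article is the implementation-layer deep dive that accompanies the MySQL overview. The overview covers where MySQL sits in the OLTP spectrum; this article focuses on Vitess sharding, a complete sharding system built from four cooperating components.

Problem context: MySQL write throughput hits the single-primary ceiling

A single MySQL primary tops out at roughly 50K-100K writes per second (WPS), depending on schema and hardware. Beyond that level there are three options:

Application-level sharding: each table decides its own partitioning, the application implements the routing logic, and cross-shard queries and migrations are handled by hand
Vitess: a proxy layer routes automatically, cross-shard queries can be split automatically, and resharding is automated
Distributed SQL (CockroachDB / Spanner / Aurora DSQL): a different engine from MySQL, so the application changes drivers

The core drivers for choosing Vitess: it keeps the MySQL wire protocol, the application needs almost no changes, and sharding is transparent. The price is the operational complexity of four components: Vitess takes on the scope of a full distributed system, not just a proxy.

Before reading, it helps to align on the shard key, routing, resharding, and cross-shard query semantics in Database Sharding; for capacity imbalance, continue to Hot Partition.

The four Vitess components: what each one is responsible for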

                        ┌─────────────────┐  ← speaks the MySQL wire protocol
   Application ────→    │     VTGate      │    (proxy + parse + route + aggregate)
                        └────┬─────┬──────┘
                             │     │
                ┌────────────┘     └──────────────┐
                ▼                                 ▼
        ┌──────────────┐                  ┌──────────────┐
        │   VTTablet   │                  │   VTTablet   │
        │ (per-MySQL   │                  │ (per-MySQL   │
        │  sidecar)    │                  │  sidecar)    │
        └─────┬────────┘                  └─────┬────────┘
              │                                 │
              ▼                                 ▼
        ┌──────────────┐                  ┌──────────────┐
        │    MySQL     │                  │    MySQL     │
        │  (Shard -80) │                  │  (Shard 80-) │
        └──────────────┘                  └──────────────┘

   Topology Service (etcd / Consul / ZooKeeper)
   ↑↓ metadata shared by all components
   VSchema: keyspace structure, shard ranges, Vindex definitions

VTGate — query routing layer

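To the application VTGate looks like MySQL (same port, same wire protocol, same query syntax), but it is actually a stateless proxy. For each query, VTGate:

Parses the SQL and extracts the routing key (from the columns in the WHERE clause)
Looks up the VSchema and computes which shard the routing key maps to
Sends the query to that shard's VTTablet
Waits for the responses, aggregates them (for a cross-shard query), and returns the result to the application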
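The four routing steps can be sketched as a toy router. This is a minimal illustration, not VTGate's implementation: the two-shard layout, the function names, and the MD5 stand-in hash are all assumptions made for the example (Vitess's `hash` Vindex uses a different function).

```python
import hashlib

# Hypothetical two-shard layout keyed by the first byte of the keyspace ID.
SHARDS = {"-80": (0x00, 0x80), "80-": (0x80, 0x100)}


def keyspace_id(routing_key: int) -> bytes:
    """Stand-in hash (MD5); NOT the function behind Vitess's `hash` Vindex."""
    return hashlib.md5(str(routing_key).encode()).digest()[:8]


def route(routing_key: int) -> str:
    """Steps 1-2: map a routing key to the shard whose range covers it."""
    first_byte = keyspace_id(routing_key)[0]
    for shard, (low, high) in SHARDS.items():
        if low <= first_byte < high:
            return shard
    raise ValueError("no shard covers this keyspace ID")


def scatter_gather_count(per_shard_counts: dict) -> int:
    """Step 4: merge per-shard COUNT(*) results for a cross-shard query."""
    return sum(per_shard_counts.values())


target = route(42)                                    # single-shard route
total = scatter_gather_count({"-80": 3, "80-": 4})    # merged cross-shard result
```

A query that carries the routing key touches one shard; a query without it has to ask every shard and merge, which is where cross-shard cost comes from.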

Because VTGate is stateless, it scales freely: run N instances behind a load balancer. Most production deployments run 3-10 VTGates per region.

VTTablet — per-MySQL agent

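A VTTablet runs next to every MySQL instance. Its responsibilities:

Marks which MySQL is the primary and reports it to the topology service
Accepts queries from VTGate and forwards them to the local MySQL
Runs the connection pool (a small number of connections between VTGate and VTTablet; VTTablet shares pooled connections to the local MySQL)
Runs the query plan cache and transactional consistency checks
Handles online schema changes (Vitess has built-in OSC)
Works with VTOrc (a fork of Orchestrator) for failover

VTTablet is the only connection point between Vitess and MySQL: a MySQL instance reached directly, without a VTTablet, is outside Vitess management.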

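VReplication: moving data across shards

VReplication is the Vitess engine for moving data across shards, keyspaces, and clusters; underneath it uses the MySQL binlog. Uses:

Resharding: split shard -80 into -40 + 40-80; VReplication routes each binlog event to the matching target shard
Materialized views: precomputed cross-shard aggregations
MoveTables: move tables across keyspaces (schema-level migration)
VStream: CDC, streaming binlog events out (can feed Kafka / Debezium)

VReplication is driven mainly by whoever operates Vitess rather than by the application, but it still affects application behavior directly (write-splitting behavior during resharding).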

VSchema — sharding metadata

VSchema defines how each table in a keyspace is sharded. It is stored as JSON in the topology service. Example:

{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    },
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}

orders.user_id and users.user_id use the same Vindex (hash) on the same column, so orders and users rows with the same user_id land on the same shard and can be joined without crossing shards.
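A quick way to see which tables share a shard for the same key is to group them by their primary Vindex. The sketch below is a hypothetical helper, not a Vitess tool; it assumes the first entry in `column_vindexes` is the table's primary Vindex and, for simplicity, groups by column name as well as Vindex name.

```python
import json
from collections import defaultdict

# The VSchema from the example above, inlined so the sketch is self-contained.
VSCHEMA = json.loads("""
{
  "sharded": true,
  "vindexes": {"hash": {"type": "hash"}},
  "tables": {
    "orders": {"column_vindexes": [{"column": "user_id", "name": "hash"}]},
    "users":  {"column_vindexes": [{"column": "user_id", "name": "hash"}]}
  }
}
""")


def colocation_groups(vschema: dict) -> dict:
    """Group tables by (primary Vindex column, Vindex name)."""
    groups = defaultdict(list)
    for table, spec in vschema["tables"].items():
        primary = spec["column_vindexes"][0]  # first entry = primary Vindex
        groups[(primary["column"], primary["name"])].append(table)
    return {key: sorted(tables) for key, tables in groups.items()}


groups = colocation_groups(VSCHEMA)
```

Tables that fall in the same group can be joined on that column within a single shard; tables in different groups cannot.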

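Vindex: the Vitess sharding function

A Vindex is the function Vitess uses to compute the shard key. Several types are built in:

Vindex type	How it is computed	Suitable for
`hash`	DES null-key hash (not MD5), mapped onto a shard range	Default; even distribution; suits a numeric primary key
`binary_md5`	MD5(binary)	Binary keys
`unicode_loose_xxhash`	xxHash of the case-insensitive Unicode form	String keys
`numeric`	Uses the numeric value directly	Contiguous numeric ranges (suits time-based keys)
`numeric_static_map`	Predefined map	Small enums such as country code / region
`lookup_hash`	Looks up the shard through a lookup table	Access by more than one column; acts as a secondary index

Most common: hash (primary key) plus lookup_hash (secondary access pattern).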

Keyspace / Shard / Tablet 階層

Keyspace (logical database)
   └── Shards
        ├── -80 (first keyspace-ID byte 0x00-0x7f, i.e. 0-127)
        │     ├── Primary tablet (1 MySQL primary)
        │     ├── Replica tablet × 2
        │     └── RDOnly tablet × 1 (analytics)
        └── 80- (first keyspace-ID byte 0x80-0xff, i.e. 128-255)
              ├── Primary tablet
              ├── Replica tablet × 2
              └── RDOnly tablet × 1

Shard ranges are written as hex prefixes of the binary keyspace ID (-80 covers everything below 0x80, 80- covers 0x80 and above), which leaves room to split during resharding (-80 can be split into -40 + 40-80).
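The range notation can be made concrete with a small parser. This is a sketch under a simplifying assumption: it only handles single-byte names such as `-80`, `40-80`, or `80-`, and the function names are made up for the example.

```python
def parse_range(name: str) -> tuple:
    """Return the half-open [start, end) range over the first keyspace-ID byte."""
    start_hex, end_hex = name.split("-")
    start = int(start_hex, 16) if start_hex else 0x00
    end = int(end_hex, 16) if end_hex else 0x100
    return start, end


def format_range(start: int, end: int) -> str:
    """Inverse of parse_range: open ends are written as empty strings."""
    left = "" if start == 0x00 else f"{start:02x}"
    right = "" if end == 0x100 else f"{end:02x}"
    return f"{left}-{right}"


def split_in_two(name: str) -> list:
    """Split a shard range at its midpoint, as a resharding plan would."""
    start, end = parse_range(name)
    middle = (start + end) // 2
    return [format_range(start, middle), format_range(middle, end)]


lower_half = split_in_two("-80")
upper_half = split_in_two("80-")
```

Because ranges are prefixes of a hash space rather than lists of keys, a split never needs to know which keys exist; it only moves the boundary.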

Tablet types:

Primary: the write entry point
Replica: serves read traffic (controlled by Vitess query rules)
RDOnly: analytics / backup / VReplication source only; low SLA; does not take production read traffic

Setup step by step (local cluster)

Production usually deploys with the Kubernetes operator (vitess-operator), but a local cluster is the fastest way to understand the concepts:

# Use vtctldclient (it replaces the older vtctlclient)

# 1. Create the keyspace (still unsharded: a single shard "0")
vtctldclient CreateKeyspace --durability-policy=semi_sync commerce

# 2. Start from one MySQL primary (unsharded) and create the table
vtctldclient ApplySchema --sql="CREATE TABLE orders (id INT PRIMARY KEY, user_id INT)" commerce

# 3. Mark the keyspace as sharded and define the VSchema
vtctldclient ApplyVSchema --vschema='{
  "sharded": true,
  "vindexes": {"hash": {"type": "hash"}},
  "tables": {
    "orders": {
      "column_vindexes": [{"column": "user_id", "name": "hash"}]
    }
  }
}' commerce

# 4. Start resharding: unsharded -> 2 shards (-80, 80-)
#    Reshard works inside one keyspace, so shards are named without a keyspace prefix.
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce create \
  --source-shards="0" \
  --target-shards="-80,80-"

# 5. Wait for the data copy to finish (VReplication is running)
vtctldclient Workflow --keyspace=commerce show --workflow=initial-shard

# 6. SwitchTraffic: read traffic (rdonly, replica) first, then primary
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce switchtraffic \
  --tablet-types="rdonly,replica"
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce switchtraffic \
  --tablet-types="primary"

# 7. Done: clean up the old shard
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce complete

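Real production deployments use the Vitess Kubernetes operator: declare the desired state in a VitessCluster CRD and the operator carries out the steps above.

5 production pitfalls

1. Cross-shard transactions: Vitess does not make them atomic (by default)

Two users' orders live on different shards, so BEGIN; UPDATE orders ... WHERE user_id=1; UPDATE orders ... WHERE user_id=2; COMMIT; spans two shards. By default Vitess does not guarantee atomicity: each shard commits on its own, one may succeed while the other fails, and the application sees partial state.

Fixes:

Avoid cross-shard transactions: design the schema so transaction boundaries fall inside a single shard
Enable atomic two-phase commit (Vitess transaction_mode=TWOPC; experimental, with a large performance penalty)
Workloads that need atomicity at scale should move to distributed SQL (CockroachDB / Spanner) and let the database layer own cross-node consistency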
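The partial-state failure can be reproduced with a toy model. The `Shard` class and `best_effort_commit` function below are hypothetical and simulate only the commit ordering; they do not reflect how Vitess is implemented.

```python
class Shard:
    """Toy shard: buffers updates until commit, and can be told to fail."""

    def __init__(self, name: str, fail_on_commit: bool = False):
        self.name = name
        self.fail_on_commit = fail_on_commit
        self.pending = {}
        self.committed = {}

    def update(self, key, value):
        self.pending[key] = value

    def commit(self):
        if self.fail_on_commit:
            self.pending.clear()  # this shard rolls back its own work
            raise RuntimeError(f"commit failed on shard {self.name}")
        self.committed.update(self.pending)
        self.pending.clear()


def best_effort_commit(shards):
    """Commit shard by shard; a failure does NOT undo earlier commits."""
    done = []
    for shard in shards:
        try:
            shard.commit()
            done.append(shard.name)
        except RuntimeError:
            return done, False
    return done, True


left, right = Shard("-80"), Shard("80-", fail_on_commit=True)
left.update(1, "paid")    # order of user_id=1
right.update(2, "paid")   # order of user_id=2
committed_on, ok = best_effort_commit([left, right])
```

After the run, the first shard holds the update and the second does not, which is exactly the partial state the application has to be designed around (or avoid by keeping the transaction on one shard).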

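2. VStream lag: CDC falls behind during resharding

During resharding VReplication writes a large volume of binlog events, and the VStream the application already uses (feeding Kafka, for example) shares the same binlog stream, so it can lag. Downstream consumers may see data that is 1-2 hours stale.

Fixes:

Pause non-critical VStreams during resharding (analytics ETL can pause; real-time recommendation must stay on)
Confirm binlog disk capacity > estimated binlog volume during resharding × 2 (buffer)
After resharding completes, manually verify that VStream offsets have caught up, and keep the result as cutover evidence

3. Uneven Vindex: hot shard

The default hash Vindex distributes primary keys evenly, but a natural key (country / region / company_id, and so on) can be uneven. With 10 countries where one country carries 80% of traffic, a single shard stays hot permanently.

Fixes:

Composite Vindex: combine the country and user_id columns as the shard key so distribution stays even at user level
Synthetic shard key: the application adds sharding_key = hash(actual_key) % N to control distribution
Monitor per-shard QPS: scrape VTGate / VTTablet metrics with Prometheus (VDiff compares source and target data for a VReplication workflow; it does not report QPS)
Once a hot shard appears, Vitess can fix it by resharding (splitting the hot shard into two smaller shards), but that is a lot of work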
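The effect of a composite key on skew can be checked with synthetic traffic. This sketch uses SHA-256 as a stand-in hash and a made-up workload in which one country produces 80% of requests; neither comes from Vitess.

```python
import hashlib
from collections import Counter


def shard_of(key: str, shard_count: int) -> int:
    """Stand-in hash placement; not a Vitess Vindex."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def hottest_shard_share(keys, shard_count: int) -> float:
    """Fraction of all requests that land on the busiest shard."""
    counts = Counter(shard_of(key, shard_count) for key in keys)
    return max(counts.values()) / sum(counts.values())


# Synthetic workload: 10,000 requests, 80% from one country ("US").
requests = [("US" if i % 10 < 8 else f"C{i % 10}", i) for i in range(10_000)]

by_country = hottest_shard_share([country for country, _ in requests], 4)
by_composite = hottest_shard_share([f"{country}:{user}" for country, user in requests], 4)
```

Sharding on country alone puts at least 80% of requests on one shard; adding user_id to the key brings the busiest shard back to roughly a quarter of the load on four shards.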

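4. Deadlock at the moment resharding switches traffic

In the final SwitchTraffic phase for primary traffic, the old shard is still accepting writes while Vitess switches routing. For a moment the application is connected to both shards, writes for the same user_id can land on both sides, and the result is a deadlock or a lost update.

Fixes:

Prepare ReverseTraffic alongside SwitchTraffic: switch first, and reverse if a problem is confirmed
Switch traffic only in a known quiet period (overnight / weekend early morning)
VTGate --retry-count=2 plus --track-vtgate-deadlock-events: deadlocks are retried automatically and not exposed to the application (confirm both flags against vtgate --help for your Vitess version)
If the switch really fails, use Reshard cancel to return to the old state so the workflow is back in a verifiable state

5. VReplication workflow stuck: protect state before cancelling

A VReplication workflow reaches 50%, then a row fails to parse (schema mismatch / blob larger than the limit). The workflow is stuck, progress stops, and there is no timeout. The whole resharding flow halts.

Fixes:

Run dry-runs against staging data routinely to surface schema and blob boundary problems
When a workflow is stuck, run vtctldclient Workflow show and inspect last_message / row_state
Fix the offending row by hand (directly in MySQL), then resume the workflow
For large clusters, run a SchemaApply audit before starting VReplication to confirm source and target schemas are compatible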

Vitess versus self-managed sharding

Dimension	Vitess	Application-level sharding
Application changes	Almost none (keeps the MySQL wire protocol)	Major (routing logic lives in the application)
Cross-shard query	VTGate splits automatically (with limits)	Handled by the application
Resharding	Automated by VReplication	Hand-written scripts, complex operations
Online schema change	Built in (VReplication-based)	gh-ost / pt-osc
Failover	VTOrc integrated	Self-managed Orchestrator
Operational cost	High (4 components to understand)	Medium (fewer abstractions, but more application logic)
Cross-keyspace shared vindex	Built in (lookup_hash across keyspaces)	Write your own

Operational complexity is the price of Vitess. A 10-20 person SRE team can carry it; for a 5-person team, managed Vitess (PlanetScale) is more realistic.

Integration with other modules

With Replication Topology

Inside each shard Vitess still uses MySQL replication (see Replication Topology): every shard has a primary plus replica and rdonly tablets. The Vitess durability-policy controls whether a primary write waits for a replica ack (semi-sync).

With OSC tools

Vitess uses its own VReplication-based online DDL rather than gh-ost / pt-osc. Vitess online DDL:

# The strategy flag on vtctldclient ApplySchema is --ddl-strategy
vtctldclient ApplySchema --ddl-strategy=vitess \
  --sql="ALTER TABLE orders ADD COLUMN status VARCHAR(20)" commerce

See Online Schema Change Tools for details.

With ProxySQL

Vitess replaces ProxySQL. VTGate itself does connection pooling and query routing, so ProxySQL is no longer needed. Mixing them causes routing conflicts (VTGate expects to decide the shard itself, and ProxySQL competes with it). See ProxySQL configuration.

With Orchestrator

Vitess uses VTOrc (a fork of Orchestrator) for failover, integrated with Vitess topology metadata. A standalone Orchestrator is not used. See Orchestrator failover design.

With PlanetScale (managed Vitess)

PlanetScale is a managed Vitess service: it hides the operational complexity of the four components and adds a branch-based schema workflow. See the PlanetScale migration playbook.

With Aurora MySQL

Aurora and Vitess are different scaling paths:

Aurora: single-region scaling (storage and compute separated, up to ~128 TB)
Vitess: horizontal sharding (no fixed ceiling; scales by adding shards)

The two carry different capacity and operational responsibilities. Consider Vitess only for workloads beyond Aurora's single-region ceiling. See the Aurora vendor page.

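Production case: YouTube / Vitess

In production, Vitess's job is to turn a MySQL shard topology into a database layer that applications can query, migrate, and operate. The engineering signal in the public YouTube / Vitess history is the division of labor among VTGate, VTTablet, VReplication, and VSchema: application queries enter through VTGate, the tablet layer wraps MySQL, VSchema describes routing and sharding rules, and VReplication supports resharding and data movement.

The case reduces to three operating criteria. First, Vitess is a database control plane, not a single proxy; adopting it means taking ownership of the topology service, tablet lifecycle, backup, failover, and schema workflow together. Second, VSchema is an application contract: the shard key, lookup vindexes, and cross-shard queries all shape product feature design. Third, VReplication makes resharding operable, but it still needs a capacity window, backfill monitoring, and a cutover plan.

Vitess's sibling routes are PostgreSQL Citus Distributed and 1.11 Globally Distributed OLTP. Citus keeps the PostgreSQL ecosystem and splits data across a coordinator and workers; CockroachDB / Spanner use distributed SQL to redefine transaction and consistency boundaries. When choosing, first decide whether you are extending a MySQL investment or picking a new global OLTP model.

When to use Vitess

Condition	Assessment
Traffic > 50K WPS, a single primary cannot keep up	In Vitess scope
Large existing MySQL investment, no wish to move to distributed SQL	Yes
A 5-10 person SRE / DBA team	Yes
Traffic < 10K WPS	No (over-engineering; use a single MySQL plus replicas)
5-person team that does not want to staff DBAs	No (use managed PlanetScale)
Strongly consistent multi-region transactions required	No (CockroachDB / Spanner fit)
Complex cross-shard analytics needed	No (pair with BigQuery / Snowflake)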

PostgreSQL Citus Distributed: turning PG into a sharded cluster with an extension

Tue, 19 May 2026 00:00:00 +0000

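This article is the implementation-layer deep dive that accompanies the PostgreSQL overview. The overview covers where PG sits in the OLTP spectrum; this article focuses on the Citus distributed extension and how it turns PG into a sharded cluster.

When PG single-primary write throughput hits the single-node ceiling (50K-100K WPS), there are three options:

Application-level sharding: the application manages shard routing itself
Citus: a PG extension with automatic routing and cross-shard queries
Distributed SQL (CockroachDB / Aurora DSQL / Spanner): a different engine

The core drivers for choosing Citus: it keeps PG SQL syntax and the extension ecosystem. But "the application needs almost no changes" is optimistic. In practice the application has to be redesigned around the distribution column (filters added to queries, transactions confined to one shard, reference tables kept small), and cross-shard query automation is weaker than in Vitess. The price is coordinator / worker deployment complexity, cross-shard query limits, and the work of reshaping the application schema.

Before reading, it helps to align on the shard key, routing, resharding, and cross-shard query semantics in Database Sharding; for capacity imbalance, continue to Hot Partition.

The core difference from MySQL Vitess sharding: Citus is a PG extension (it runs inside PG), while Vitess is a separate proxy plus tablet system (wrapping MySQL). Citus plugs into PG through native extension mechanisms (planner, executor, and utility hooks); Vitess is an external wrapper.

Citus architecture: coordinator + workers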

                ┌─────────────────┐
   Application  │   Coordinator   │  ← speaks the PG wire protocol; planner; routing
                │   (Citus + PG)  │
                └────┬─────┬──────┘
                     │     │
              ┌──────┘     └──────┐
              ▼                   ▼
        ┌──────────┐         ┌──────────┐
        │ Worker 1 │         │ Worker 2 │  ← each runs PG + the Citus extension
        │  (PG)    │         │  (PG)    │
        │ shard 1,3│         │ shard 2,4│
        └──────────┘         └──────────┘

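Coordinator:

Looks like PG to the application (same port, same wire protocol)
Accepts SQL; the Citus planner decomposes the query and routes it to workers
Stores no data (the shards of distributed tables live on workers)
Stores metadata (which shard is on which worker)

Worker:

A standard PG instance plus the Citus extension
Each stores a number of shards
Accepts queries from the coordinator, executes them locally, and returns results

Shard:

A distributed table is split into N shards (default 32)
Each shard is a physical PG table on a worker (named with a _<shard id> suffix)
Behaves like an ordinary PG table; you can connect to a worker directly and access it with PG tools

3 table types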

Distributed table: split across shards

-- Create an ordinary PG table
CREATE TABLE orders (
    id BIGSERIAL,
    user_id BIGINT NOT NULL,
    amount DECIMAL(10,2),
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, id)  -- the PK must include the distribution column
);

-- Turn it into a distributed table with Citus
SELECT create_distributed_table('orders', 'user_id');

user_id is the distribution column: Citus hashes it to decide which shard a row belongs to. The PK must include the distribution column (the same requirement as MySQL partitioning).

Compared with Vitess Vindexes:

Citus: hash of the distribution column → shard (a single hash function; the algorithm is not selectable)
Vitess: several Vindex types to choose from (hash / lookup_hash / xxhash / null)
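Hash distribution can be pictured as cutting a signed 32-bit hash space into equal ranges, one per shard. The sketch below rests on that assumption and uses CRC32 as a stand-in hash; it is not PostgreSQL's hash function, and the function names are hypothetical.

```python
import zlib

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def shard_ranges(shard_count: int = 32):
    """Split the signed 32-bit hash space into equal, contiguous ranges."""
    step = 2**32 // shard_count
    ranges = []
    for i in range(shard_count):
        low = INT32_MIN + i * step
        high = INT32_MAX if i == shard_count - 1 else low + step - 1
        ranges.append((low, high))
    return ranges


def toy_hash(value: int) -> int:
    """Stand-in hash (CRC32 shifted into the signed range); NOT PostgreSQL's."""
    return zlib.crc32(str(value).encode()) - 2**31


def shard_index(value: int, ranges) -> int:
    """Find the shard whose hash range contains the value's hash."""
    hashed = toy_hash(value)
    for index, (low, high) in enumerate(ranges):
        if low <= hashed <= high:
            return index
    raise ValueError("hash outside the int32 range")


ranges = shard_ranges()
index = shard_index(12345, ranges)
```

Every row with the same distribution column value hashes into the same range, which is why a query filtered on that column can be sent to a single shard.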

Reference table: shared across all shards

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    price DECIMAL
);

SELECT create_reference_table('products');

products has a full copy on every worker; a write goes to the coordinator, which broadcasts it to all workers.

Uses:

Small lookup tables (country codes, product categories, and so on)
Joins with distributed tables: the reference table is present on every worker, so no cross-shard traffic is needed
Low write frequency (broadcast cost grows linearly with the number of workers)

Local table: a PG table on the coordinator

CREATE TABLE audit_log (
    id SERIAL PRIMARY KEY,
    event JSONB
);
-- No Citus function is called, so the table stays on the coordinator by default

Behaves like an ordinary PG table. Used for tables that do not need to be distributed (admin metadata, for example).

Colocation: aligning shards across distributed tables

When two distributed tables use the same distribution column (user_id, for example) and the same shard count, Citus colocates them automatically:

SELECT create_distributed_table('orders', 'user_id');
SELECT create_distributed_table('user_addresses', 'user_id', colocate_with => 'orders');

After colocation:

orders and user_addresses rows with user_id = 100 are in the same worker shard
JOINs do not cross workers, so they are efficient
Native PG FK constraints can be used (cross-table but same shard)

Colocation is Citus's core mechanism for cross-table consistency. Without it, a cross-table query becomes a cross-worker query and performance drops sharply.

Setup step by step (local cluster)

Production uses Citus Cloud (hosted by Microsoft) or Azure Cosmos DB for PostgreSQL (same engine). Self-hosted:

Step 1: install PG + Citus on the coordinator and the workers

# On every node (coordinator + 2 workers)
apt install postgresql-14
apt install postgresql-14-citus-12.0

# In postgresql.conf, set:
#   shared_preload_libraries = 'citus'

systemctl restart postgresql

-- Run on every node
CREATE EXTENSION citus;

Step 2: register the workers on the coordinator

-- Run on the coordinator
SELECT citus_add_node('worker1.example.com', 5432);
SELECT citus_add_node('worker2.example.com', 5432);

-- Verify
SELECT * FROM citus_get_active_worker_nodes();

Step 3: create a distributed table

CREATE TABLE orders (
    id BIGSERIAL,
    user_id BIGINT NOT NULL,
    amount DECIMAL(10,2),
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, id)
);

SELECT create_distributed_table('orders', 'user_id');

Citus automatically splits orders into 32 shards (orders_102008 and so on) and assigns them to workers.

Step 4: the application connects to the coordinator

The application's connection string points at the coordinator IP / port (it does not need to know the workers exist).

-- Queries issued by the application; Citus routes them transparently
INSERT INTO orders (user_id, amount) VALUES (12345, 50);
-- → Citus hashes user_id=12345, finds its shard (17 in this illustration), and routes to the worker holding it

SELECT * FROM orders WHERE user_id = 12345;
-- → Single-shard query, very fast

SELECT count(*) FROM orders;
-- → Cross-shard aggregation; Citus runs it in parallel and merges the results

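5 production pitfalls

1. Wrong distribution column: cross-shard queries become the norm

Choosing created_at or id (auto increment) as the distribution column looks evenly distributed, but the application mostly queries by user_id, so every query becomes cross-shard and performance collapses.

Fixes:

Pick the column the application filters / joins on most often (usually tenant_id / user_id)
Audit the application's top queries and confirm the distribution column matches the query pattern
Changing the distribution column rewrites every shard; like resharding, it is a major project

2. Cross-shard transaction limits

For transactions that span multiple shards (for example, an UPDATE of two rows with different user_id values), Citus uses 2PC (two-phase commit), with limits:

Multi-statement transactions across shards may need SET citus.multi_shard_modify_mode = 'sequential' set explicitly
Some isolation levels do not guarantee serializability across shards
Cross-shard DDL is sequential

Fixes:

Design the schema to avoid cross-shard transactions (transactions inside one colocation group are fine)
Where cross-shard is unavoidable, set the multi-shard mode explicitly
For strict cross-shard consistency, consider distributed SQL (CockroachDB / Aurora DSQL)

3. Oversized reference table: write broadcast cost explodes

A reference table has a copy on every worker, and writes are broadcast to all workers. A reference table with 100K rows and frequent writes means each write is written to N workers, at N times the cost.

Fixes:

Restrict reference tables to lookup data that is small and rarely written
Very large tables should not be reference tables; consider distributing them
Monitor the reference table write rate and re-evaluate when it passes a threshold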
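The broadcast cost is simple arithmetic, sketched here with made-up numbers (500 logical writes per second, 8 workers). The function name is hypothetical, and the model ignores any extra copy kept on the coordinator.

```python
def physical_writes_per_sec(logical_wps: float, workers: int, is_reference: bool) -> float:
    """Writes the cluster performs per second for one table.

    A distributed table writes each row to one shard; a reference table
    writes each row to every worker.
    """
    return logical_wps * workers if is_reference else logical_wps


distributed = physical_writes_per_sec(500, workers=8, is_reference=False)
reference = physical_writes_per_sec(500, workers=8, is_reference=True)
amplification = reference / distributed
```

The amplification factor equals the worker count, so adding workers to scale distributed tables makes every reference-table write more expensive.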

4. Colocation not aligned: hidden cross-shard JOIN

-- Looks fine, but is actually a slow cross-shard join
SELECT * FROM orders o JOIN user_addresses ua ON o.user_id = ua.user_id;

If user_addresses was not created with colocate_with => 'orders', the two tables' shards are placed independently and the JOIN crosses workers.

Fixes:

Align with colocate_with when creating related tables
Use SELECT * FROM citus_tables to inspect colocation_id and confirm alignment
For JOINs across non-colocated tables, use a materialized view or split the query in the application

5. Worker failover: the coordinator has to know

When a worker fails, by default the coordinator simply sees queries fail; Citus does not fail over automatically.

Fixes (Citus 11+):

Use shard replication (citus.shard_replication_factor = 2): each shard has a copy on 2 workers
Run PG streaming replication at the worker layer, with Patroni managing failover
If the coordinator fails the whole cluster is unavailable, so the coordinator needs HA too (Patroni)

Compared with Vitess, Citus has a weaker HA story; production deployments must plan it carefully.

When to use Citus

Condition	Recommendation
Multi-tenant SaaS with tenant_id as the natural distribution column	Yes
Write throughput > 50K WPS, a single PG cannot keep up	Yes
Need to keep PG SQL and extensions (pgvector / TimescaleDB)	Yes
80% of application queries use the same distribution column	Yes
Heavy ad-hoc cross-tenant aggregation	No (cross-shard is slow)
Strong cross-shard consistency required	No (use CockroachDB)
Want zero-ops managed	Azure Cosmos DB for PostgreSQL (same engine)

Capacity planning

Coordinator: moderate CPU + RAM; metadata is small and no data is stored
Worker: per-worker spec same as a single production PG
Shard count: default 32; in practice often set to worker count × 4-8
Replication factor: at least 2 in production

Integration with other modules

With Replication Topology

The coordinator and workers each run PG streaming replication; Citus does not replace PG replication. Worker failover uses Patroni / streaming replication. See Replication Topology.

With PG extensions

Citus is compatible with most other PG extensions (pgvector / TimescaleDB / pg_stat_statements): it remains an extension and keeps the PostgreSQL ecosystem's integration points. See the PG Extension Ecosystem article (not yet written).

With MySQL Vitess

Dimension	Citus	Vitess
Deployment model	PG extension	Separate proxy + tablet
Primary scenario	Multi-tenant SaaS	Very large-scale sharding
Cross-shard JOIN	Colocation + reference tables	VTGate splits and aggregates automatically
FK	Available within a colocation group	Supported in Vitess 18+, limited across shards
HA	Relies on Patroni + replication factor	VTOrc + replication
Learning curve	Medium (PG ops experience is enough)	High (4 components)

Citus is smoother for PG-native workloads and Vitess for MySQL-native ones; they do not compete directly. See MySQL Vitess Sharding.

MongoDB Shard Key Selection: hashed vs ranged, and sharding one cluster vs splitting blast radius across clusters

Wed, 27 May 2026 00:00:00 +0000

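The MongoDB shard key is the hardest decision to reverse when a sharded cluster goes live. A wrong shard key was completely irreversible before 5.0; from 5.0 on, reshardCollection can change it, but that still means a long-running operation, extra disk, and a write-pause window. The shard key is not the only horizontal scaling option in production, though: there is also the multi-cluster path (as disclosed by Toyota Connected), and the two solve entirely different problems. This article sets the three shard key properties (cardinality / frequency / monotonicity) against the single-cluster vs multi-cluster choice, and discusses both together with the cross-vendor discipline of partition key reversibility.

It does not repeat the sharding introduction already in the MongoDB vendor overview; it is an implementation-level guide to production design and failure recovery.

MongoDB suitability pre-check: before designing a shard key, confirm the workload is in MongoDB's sweet spot (document shape dominates / where the contract layer belongs / whether cross-cloud hedging is needed). See the three-axis pre-check at the start of schema-design-pattern; it is not repeated here. A sharded cluster is a capacity decision made after choosing MongoDB, not a vendor selection decision.

Problem context: a sharded cluster is not the only way to scale horizontally

Typical trigger: a single replica set is at its limit, writes have pushed the primary to 90% CPU or saturated disk IO, and the working set exceeds RAM. The reflex is to shard, but splitting into clusters is also a path, and the two have completely different triggers:

Sharding one cluster: solves write saturation in a single data domain, where a collection is too large for one replica set
Splitting databases across clusters: solves blast radius / ownership / compliance boundaries, not necessarily a throughput problem

The cost of confusing them: throughput has not hit the wall but blast radius is the issue, and forcing shards raises the cost of aggregation / transactions / $lookup a level while business ownership stays tangled. Or the reverse: throughput has hit the wall but you split clusters, so cross-cluster transactions do not exist and a single collection spread over multiple clusters has to be stitched together in the application.

Symptoms:

The ratio of targeted to scatter-gather queries on mongos is out of balance
One shard's CPU is far above the others, and the balancer cannot move chunks as fast as writes arrive
chunkMigrated is unusually frequent, and sh.status() shows a skewed chunk distribution
Microservice ownership does not line up with collection boundaries, so one microservice's failure hits other services

Case anchor: 9.C38 Toyota Connected discloses that "20 Atlas databases are a business-boundary split, not a throughput split" (single cluster vs multiple clusters). Concrete incident details for hot shards in e-commerce flash sales, new game-region launches, or a large B2B customer monopolizing a chunk await future cases; this article treats them as common failure patterns and does not invent incident numbers.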

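Core mechanics: shard key, chunk, balancer

Three shard key properties determine how a sharded cluster behaves:

Cardinality: the number of distinct shard key values. status: "active" | "inactive" has only two values, so cardinality = 2 and the data cannot spread over more than two chunks
Frequency: whether values are evenly distributed. For country, one or two countries usually carry 80% of global traffic
Monotonicity: whether values increase monotonically. _id (ObjectId), timestamps, and auto-increment IDs are all monotonic

Shard key strategies interact with these properties:

Hashed shard key: a hash function scatters keys so writes are evenly distributed, but range queries become scatter-gather (every shard is asked)
Ranged shard key: nearby key values share a chunk, so range queries are efficient; but a monotonic key with ranged sharding sends every write to the last chunk
Compound shard key (common practice in 5.0+; MongoDB's implementation of a composite partition key): for example { tenantId: 1, _id: "hashed" }, which isolates by tenant first and then hashes to avoid hot spots within a tenant
Zone sharding: pin specific chunks to specific shards (region / compliance / hardware tiers)

A chunk is a logical range MongoDB carves out of a collection (64 MB by default before MongoDB 6.0, 128 MB from 6.0). The balancer moves chunks between shards to even things out. A chunk cannot be split when the shard key has only one value in its range (low cardinality, or a large tenant occupying the range): the chunk cannot split and the balancer cannot spread it out.

reshardCollection (5.0+): works through a temporary collection, re-cut chunks, dual writes, and a cutover. Its duration is proportional to data volume and it needs roughly 1.2x extra disk. It is a second chance after a design mistake, not a free lunch.

Related knowledge cards: database-sharding, hot-partition, partition.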

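Sharding one cluster vs splitting blast radius across clusters

Cross-case synthesis frame (synthesized in this chapter; 9.C38 Toyota discloses the facts, but the case text does not state this frame): a sharded cluster is not the only horizontal scaling path; multiple clusters are another.

Facts disclosed by 9.C38 Toyota Connected:

18B transactions / month ÷ 30 days ÷ 86,400 seconds ≈ 7K txn/sec (basis: monthly rolling average, not an instantaneous peak)
A single MongoDB cluster can comfortably handle that throughput
Toyota's split into 20 Atlas databases is about microservice ownership and blast radius, not throughput
"Each microservice owns its own DB, and a single DB failure does not affect other services"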
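The throughput figure can be re-derived in a few lines. The per-database number at the end assumes traffic is spread evenly across the 20 databases, which the case does not state; it is included only to show the order of magnitude.

```python
MONTHLY_TRANSACTIONS = 18_000_000_000
DAYS_PER_MONTH = 30
SECONDS_PER_DAY = 86_400

# Monthly average, not a peak.
average_tps = MONTHLY_TRANSACTIONS / DAYS_PER_MONTH / SECONDS_PER_DAY

# Assumption for illustration only: an even spread over 20 databases.
per_database_tps = average_tps / 20
```

At roughly 7K transactions per second in total, and a few hundred per database under the even-spread assumption, throughput alone does not explain a 20-way split.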

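The two paths are judged on different conditions:

Path	Trigger	Cost
Sharded cluster (shards)	Write saturation in a single collection; storage outgrowing one replica set; access patterns within one data domain	Aggregation / transaction / `$lookup` costs all rise a level
Multiple clusters (databases)	Microservice ownership boundaries; blast radius isolation; compliance boundaries; risk of mixing different workload shapes	No cross-cluster transactions; cross-DB joins must be done in the application

The two can be combined: each microservice has its own cluster, and each cluster still shards internally where needed. Design documents should avoid leaving readers with the impression that a sharded cluster is the only horizontal scaling option.

Partition key reversibility across vendors

Cross-vendor reversibility SSoT: MongoDB / DynamoDB / Cosmos DB do not sit on the same spectrum. The single source of truth for the cross-vendor comparison is the DB3 entry (the 10-axis three-vendor comparison and the extended subsection for the relevant axis). This section focuses on how MongoDB 5.0+ reshardCollection affects shard key design and does not repeat the full three-vendor comparison.

Vendors differ widely in partition key reversibility:

Vendor	Mechanism	Reversibility	Cost
MongoDB	Shard key (`shardCollection`)	Changeable with `reshardCollection` in 5.0+; completely irreversible before 5.0	Proportional to data volume, ~1.2x disk, dual writes + cutover
DynamoDB	Partition key	Changeable (by backfilling into a new table)	Redesigned access patterns, traffic switch-over cost
Cosmos DB	Partition key	Not changeable (must export, recreate, import)	Full reload, dual-write verification, the highest migration cost

Design documents must state vendor and version, so readers neither assume that partition keys are unchangeable everywhere nor treat MongoDB 5.0+ reshardCollection as a cheap migration.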

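Operating procedure

Step 1: decide the horizontal scaling path. Ask first whether the problem is write saturation in a single data domain or blast radius / ownership, then choose shards or clusters. If both apply, settle the cluster boundaries first and shard inside each cluster.

Step 2: audit access patterns. List every read and write query; mark which must hit a single shard (targeted) and which can tolerate scatter-gather.

Step 3: build a candidate key table. Score each candidate on cardinality / frequency / monotonicity:

Candidate key	Cardinality	Frequency	Monotonicity	Suitable?
`_id` (ObjectId)	Very high	Even	Monotonic	No (monotonic write hot spot)
`tenantId`	Medium	Skewed	No	Depends on tenant distribution
`{ tenantId: 1, _id: "hashed" }`	High	Even	No	Usually suitable
`country`	Very low (~200)	Heavily skewed	No	No

Step 4: dry-run sampling. Sample existing data and run db.coll.aggregate([{$sample:{size:100000}}, {$group:{_id:"$candidateKey", c:{$sum:1}}}, {$sort:{c:-1}}]) to see the distribution; confirm no single key value takes more than 20% of traffic.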
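Once the sampled key values are exported, the 20% check is a frequency count. The sketch below is a hypothetical offline helper operating on a plain list of values; the sample data is made up.

```python
from collections import Counter


def top_key_share(sampled_values):
    """Return the most frequent key value and its share of the sample."""
    counts = Counter(sampled_values)
    value, count = counts.most_common(1)[0]
    return value, count / len(sampled_values)


def passes_skew_check(sampled_values, threshold: float = 0.20) -> bool:
    """True when no single key value exceeds the threshold share."""
    _, share = top_key_share(sampled_values)
    return share <= threshold


# Made-up samples: one tenant dominating vs. ten tenants evenly spread.
skewed = ["t1"] * 30 + ["t2"] * 10 + [f"x{i}" for i in range(60)]
balanced = [f"t{i % 10}" for i in range(100)]

skewed_top = top_key_share(skewed)
skewed_ok = passes_skew_check(skewed)
balanced_ok = passes_skew_check(balanced)
```

A candidate that fails this check on sampled data will produce a chunk that cannot be split further, so it is cheaper to reject it here than after shardCollection.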

Step 5: shardCollection.

sh.enableSharding("shop")
sh.shardCollection("shop.orders", { tenantId: 1, _id: "hashed" })

First replay traffic in staging and confirm chunks are evenly distributed and the targeted query ratio is > 90%.

Step 6: monitor.

sh.status()                              // cluster status
db.orders.getShardDistribution()         // chunk distribution
db.adminCommand({ balancerStatus: 1 })   // balancer status

Step 7: if the wrong key is already live. Evaluate reshardCollection (5.0+) against an application-level dual-write migration:

db.adminCommand({
  reshardCollection: "shop.orders",
  key: { tenantId: 1, region: 1, _id: "hashed" }
})

Once reshardCollection enters cutover it cannot be rolled back; dry-run first to estimate time, disk, and IO impact.

Validation points: targeted query ratio > 90%, coefficient of variation of per-shard QPS < 20%, balancer migration rate keeping up with the write rate.
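Two of the validation points reduce to small calculations over collected metrics. The sketch below assumes per-shard QPS and query counts have already been gathered from monitoring; the function names and sample numbers are illustrative.

```python
from statistics import mean, pstdev


def coefficient_of_variation(per_shard_qps) -> float:
    """Population standard deviation divided by the mean."""
    return pstdev(per_shard_qps) / mean(per_shard_qps)


def targeted_ratio(targeted: int, scatter_gather: int) -> float:
    """Share of queries that were routed to a single shard."""
    return targeted / (targeted + scatter_gather)


def distribution_healthy(per_shard_qps, targeted: int, scatter_gather: int) -> bool:
    """Apply the two thresholds from the validation points."""
    return (coefficient_of_variation(per_shard_qps) < 0.20
            and targeted_ratio(targeted, scatter_gather) > 0.90)


even = distribution_healthy([1000, 1100, 950, 1050], targeted=9500, scatter_gather=500)
hot = distribution_healthy([4000, 300, 350, 350], targeted=9500, scatter_gather=500)
```

The second case has the same targeted ratio as the first but fails on variation, which is the signature of a hot shard hiding behind a healthy-looking cluster average.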

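Rollback boundary: shardCollection is irreversible (completely irreversible before 5.0; changeable through reshardCollection in 5.0+, but at the cost of redoing the work); reshardCollection cannot be rolled back once it enters cutover.

Failure modes

Monotonic key write hot spot: using _id (ObjectId), a timestamp, or an auto-increment ID as a ranged shard key sends every write to the last chunk, so scale-out gains nothing. The fix is a hashed key, or a compound key that scatters the monotonic axis.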
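The hot-spot effect is easy to see in a toy model with four chunks. The boundaries, key range, and MD5 stand-in hash below are assumptions for illustration; MongoDB's hashed index uses its own hash function.

```python
import bisect
import hashlib
from collections import Counter

BOUNDARIES = [1000, 2000, 3000]  # split points for 4 ranged chunks


def ranged_chunk(key: int) -> int:
    """Ranged placement: a key belongs to the chunk whose range contains it."""
    return bisect.bisect_right(BOUNDARIES, key)


def hashed_chunk(key: int, chunks: int = 4) -> int:
    """Hashed placement with a stand-in hash (MD5); not MongoDB's hash."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % chunks


# New inserts with a monotonically increasing key, all past the last boundary.
new_keys = range(3000, 4000)
ranged = Counter(ranged_chunk(key) for key in new_keys)
hashed = Counter(hashed_chunk(key) for key in new_keys)
```

With ranged placement all 1,000 inserts land in the last chunk; with hashed placement they spread over all four, at the cost of range queries turning into scatter-gather.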

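Low-cardinality key: with country as the shard key and one country carrying 80% of traffic, the chunk cannot split further and that shard stays hot permanently. The fix is to add a high-cardinality axis (compound key) so chunks can keep splitting.

Tenant skew: in B2B, a large customer monopolizes a chunk, that tenant's chunk keeps growing, and the balancer cannot move it. The fix is the compound key { tenantId: 1, _id: "hashed" }: isolated by tenant but hashed within the tenant.

Too much scatter-gather: hashed _id was chosen but business queries are mostly range queries by tenantId, so every query hits all shards and p99 degrades linearly with shard count. The fix is a compound key with the most-queried axis first, so targeted queries can reach a single shard.

Resharding stuck in the build phase: not enough disk (1.2x source size needed), IO saturation affecting the online workload, an expected 4 hours that turns into 14. The fix is to add disk first, measure real duration with a staging dry-run, and start in production off-peak.

Zone sharding rules in conflict: compliance rules (data must stay in a region) clash with load-balancing rules, the balancer cannot move chunks, and hot spots harden. The fix is to settle zone rules versus the balancer at design time rather than adding zones afterwards.

Treating a multi-cluster problem as a sharding problem: a blast radius concern is pushed into a sharded cluster, and a single cluster failure still takes out every microservice. What should be split into clusters should be split into clusters, not packed into shards. 9.C38 Toyota discloses that the trigger for 20 DBs at 7K txn/sec was microservice ownership, not throughput.

Over-optimistic cluster scaling estimates: MongoDB cluster scaling is a days-scale concern, not a few console clicks. 9.C36 Coinbase discloses that scaling a cluster took 70 minutes (basis: Coinbase's specific cluster tier, data volume, and Atlas API conditions, from the start of reactive scaling to completion; not a general MongoDB commitment). Predictable traffic must use predictive / scheduled scaling and cannot rely on dynamic horizontal scaling of a sharded cluster to absorb a surge (see connection management and cache layer).

Anti-recommendation:

With writes < 5K WPS, storage < 1 TB, and a single replica set that still copes, do not shard; once sharded, the cost of aggregation, transactions, $lookup, and indexes all rise a level
Shard vs multi-cluster: when throughput has not hit the wall but blast radius / ownership is the issue, go multi-cluster rather than forcing shards (the trigger behind 9.C38 Toyota's 20 DBs at 7K txn/sec)
Cross-case synthesis frame: "not all data belongs in the same MongoDB cluster"; split by microservice ownership / blast radius / compliance boundary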

Capacity and observability

Key metrics:

Shard distribution health: coefficient of variation of per-shard QPS / CPU / disk usage (< 20% is reasonable)
Query routing: ratio of targeted to scatter-gather queries (targeted > 90% is reasonable)
Balancer health: chunk migration rate, balancer round duration
Cluster boundaries: cluster-to-cluster ownership boundaries, ratio of cross-cluster queries

Mongo commands:

sh.status(): overall cluster status
db.coll.getShardDistribution(): a collection's distribution across shards
db.adminCommand({balancerStatus:1}): balancer status
db.serverStatus().sharding: sharding metrics

explain("executionStats") through mongos: each query's output carries executionStats.executionStages.shards[], which shows whether it hit a single shard.

Back to 4.20 observability evidence: treat shard distribution, targeted ratio, and resharding progress as the three-part evidence set.

Back to 9.4 saturation discovery: a hot shard is the textbook example of partition-level saturation.

Back to 9.5 bottleneck localization: cluster-wide CPU may look like only 25% while in fact one of four shards is at 100%.

Boundaries and integration

Sibling deep articles:

schema design pattern: document shape determines the space of shard key choices
aggregation pipeline optimization: $out / $merge limits for cross-shard aggregation
change streams + Kafka: how cluster-wide and collection-level change streams differ on a sharded cluster
connection management and cache layer: cluster scaling is a days-scale concern and has to work together with predictive scaling and the proxy layer

Migration playbook:

To avoid self-managed sharding: move to Atlas and use a managed shard tier
For thorough repartitioning: shard expansion + multi-DC

Cross-references with 1.x: 1.10 KV / Document DB capacity planning lists the shard key as a capacity decision; 1.12 Large-scale DB migration in practice collects resharding failure retrospectives.

Cross-vendor comparison: DynamoDB vendor page (partition key + adaptive capacity + changeable via backfill), Cosmos DB vendor page (partition key not changeable).
