MySQL Vitess Sharding: How VTGate, VTTablet, VReplication, and VSchema Work Together

2026-05-19

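This article is an implementation-layer deep dive that accompanies the MySQL overview. The overview already covers where MySQL sits in the OLTP landscape; this article focuses on Vitess sharding, a complete sharding system built from four cooperating components.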

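Problem: MySQL write throughput hits the single-primary ceiling

A single MySQL primary tops out at roughly 50K-100K writes per second (WPS), depending on schema and hardware. Beyond that level there are three options:

Application-level sharding: each table decides how it is partitioned, the application carries the routing logic, and cross-shard queries and migrations are handled by hand
Vitess: a proxy layer routes automatically, cross-shard queries can optionally be split automatically, and resharding is automated
Distributed SQL (CockroachDB / Spanner / Aurora DSQL): a different engine from MySQL, and the application has to change drivers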

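The core reasons to choose Vitess: it keeps the MySQL wire protocol, the application needs almost no changes, and sharding is transparent. The cost is the operational complexity of four components. Vitess takes responsibility for a complete distributed system, not just a proxy.

Before reading on, it helps to be familiar with the shard key, routing, resharding, and cross-shard query semantics described in Database Sharding; when capacity becomes unbalanced, continue with Hot Partition.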

The four Vitess components and their responsibilities

                        ┌─────────────────┐
   Application ────→    │     VTGate      │  ← speaks the MySQL wire protocol
                        │                 │    (proxy + parse + route + aggregate)
                        └────┬─────┬──────┘
                             │     │
                ┌────────────┘     └──────────────┐
                ▼                                 ▼
        ┌──────────────┐                  ┌──────────────┐
        │   VTTablet   │                  │   VTTablet   │
        │ (per-MySQL   │                  │ (per-MySQL   │
        │  sidecar)    │                  │  sidecar)    │
        └─────┬────────┘                  └─────┬────────┘
              │                                 │
              ▼                                 ▼
        ┌──────────────┐                  ┌──────────────┐
        │    MySQL     │                  │    MySQL     │
        │  (Shard -80) │                  │  (Shard 80-) │
        └──────────────┘                  └──────────────┘

   Topology Service (etcd / Consul / ZooKeeper)
   ↑↓ metadata shared by all components
   VSchema: keyspace structure, shard ranges, Vindex definitions

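VTGate: the query routing layer

To the application, VTGate looks like MySQL (same port, same wire protocol, same query syntax), but it is actually a stateless proxy. For each query, VTGate:

Parses the SQL and finds the routing key (taken from the columns in the WHERE clause)
Looks up the VSchema and computes which shard the routing key maps to
Sends the query to the VTTablet for that shard
Waits for the responses, aggregates them if the query spans shards, and returns the result to the application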
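The four steps above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the regex stands in for a real SQL parser, `keyspace_id` uses MD5 as a stand-in rather than the actual Vitess `hash` Vindex, and the names `SHARD_RANGES`, `shard_for`, and `route` are hypothetical.

```python
import hashlib
import re

# Hypothetical two-shard layout: shard name -> [start, end) over the first keyspace-ID byte.
SHARD_RANGES = {"-80": (0x00, 0x80), "80-": (0x80, 0x100)}


def keyspace_id(value: int) -> bytes:
    """Stand-in hash (MD5 of the 8-byte value); NOT the real Vitess `hash` Vindex."""
    return hashlib.md5(value.to_bytes(8, "big")).digest()


def shard_for(kid: bytes) -> str:
    """Find the shard whose range covers the first byte of the keyspace ID."""
    for name, (start, end) in SHARD_RANGES.items():
        if start <= kid[0] < end:
            return name
    raise LookupError("no shard covers this keyspace ID")


def route(sql: str, routing_column: str = "user_id") -> list[str]:
    """Return the shards a query must visit: one if the routing key is present, all otherwise."""
    match = re.search(rf"\b{re.escape(routing_column)}\s*=\s*(\d+)", sql)
    if match is None:
        return sorted(SHARD_RANGES)  # no routing key: scatter to every shard
    return [shard_for(keyspace_id(int(match.group(1))))]


print(route("SELECT * FROM orders WHERE user_id = 42"))  # a single shard
print(route("SELECT COUNT(*) FROM orders"))              # every shard
```

A query that carries the routing key touches one shard; a query without it fans out to all shards and has to be aggregated, which is the cross-shard case in the last step.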

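Because VTGate is stateless, it scales freely: run N instances behind a load balancer. Most production deployments run 3-10 VTGates per region.

VTTablet: the per-MySQL agent

A VTTablet runs next to every MySQL instance. Its responsibilities:

Records which MySQL instance is the primary and reports it to the topology service
Accepts queries from VTGate and forwards them to the local MySQL
Runs the connection pool (a small number of connections between VTGate and VTTablet; VTTablet shares pooled connections to the local MySQL)
Runs the query plan cache and transactional consistency checks
Handles online schema changes (Vitess has built-in OSC)
Works with VTOrc (a fork of Orchestrator) to perform failover

VTTablet is the only connection point between Vitess and MySQL. Any connection that goes straight to MySQL and bypasses VTTablet is outside Vitess's management.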

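VReplication: moving data across shards

VReplication is the Vitess engine for moving data across shards, keyspaces, and clusters. It is built on the MySQL binlog. Uses:

Resharding: splitting shard -80 into -40 + 40-80; VReplication automatically routes each binlog event to the matching target shard
Materialized views: precomputing cross-shard aggregations
MoveTables: moving tables across keyspaces (schema-level migration)
VStream: CDC, streaming binlog events to external consumers (can feed Kafka / Debezium)

VReplication is used mainly by the people who operate Vitess, yet it directly affects application behavior (resharding involves write-split behavior).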

VSchema: sharding metadata

VSchema defines how each table in a keyspace is sharded. It is JSON stored in the topology service. An example:

{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    },
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}

orders.user_id and users.user_id use the same Vindex (hash) on the same column, so the orders and users rows for a given user_id land on the same shard and can be joined without crossing shards.
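The co-location rule can be checked mechanically against a VSchema document. The sketch below is a simplified illustration, not a Vitess API: `primary_vindex` and `colocated` are hypothetical helpers, and they only compare the first `column_vindexes` entry of each table, which is the rule described in this article.

```python
import json

VSCHEMA = json.loads("""
{
  "sharded": true,
  "vindexes": {"hash": {"type": "hash"}},
  "tables": {
    "orders": {"column_vindexes": [{"column": "user_id", "name": "hash"}]},
    "users":  {"column_vindexes": [{"column": "user_id", "name": "hash"}]}
  }
}
""")


def primary_vindex(vschema: dict, table: str) -> tuple[str, str]:
    """Return (column, vindex type) of the table's first column_vindexes entry."""
    first = vschema["tables"][table]["column_vindexes"][0]
    return first["column"], vschema["vindexes"][first["name"]]["type"]


def colocated(vschema: dict, left: str, right: str) -> bool:
    """True when both tables shard on the same column with the same Vindex type."""
    return primary_vindex(vschema, left) == primary_vindex(vschema, right)


print(colocated(VSCHEMA, "orders", "users"))
```

A check like this could run in CI whenever the VSchema changes, so that a table pair expected to join on a single shard does not silently drift apart.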

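Vindex: the Vitess sharding function

A Vindex is the function Vitess uses to compute shard placement from a column value. Several are built in:

Vindex type	How it is computed	When to use
`hash`	3DES-based null-key hash (not MD5), mapped to a shard range	Default; uniform distribution; suits primary keys
`binary_md5`	MD5(binary)	Binary keys
`unicode_loose_xxhash`	xxHash over case-insensitive (lowercased) Unicode text	String keys
`numeric`	Uses the numeric value directly	Contiguous numeric ranges (suits time-based keys)
`numeric_static_map`	Predefined map	Small enums such as country code or region
`lookup_hash`	Looks up the shard through a lookup table	Sharding by more than one column, when a secondary index is needed

Most common: hash (primary key) plus lookup_hash (secondary access pattern).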
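To make the hash plus lookup pairing concrete, here is a minimal in-memory sketch. It assumes a hypothetical `OrderLookup` class standing in for the lookup table, a two-shard layout, and an MD5 stand-in for the real `hash` Vindex. It illustrates the idea only and does not reproduce how Vitess stores or maintains lookup tables.

```python
import hashlib

SHARD_RANGES = {"-80": (0x00, 0x80), "80-": (0x80, 0x100)}  # hypothetical layout


def keyspace_id(user_id: int) -> bytes:
    """Stand-in hash (MD5); NOT the real Vitess `hash` Vindex."""
    return hashlib.md5(user_id.to_bytes(8, "big")).digest()


def shard_for(kid: bytes) -> str:
    """Map the first keyspace-ID byte to the shard range that covers it."""
    for name, (start, end) in SHARD_RANGES.items():
        if start <= kid[0] < end:
            return name
    raise LookupError("no shard covers this keyspace ID")


class OrderLookup:
    """In-memory stand-in for a lookup table: order_id -> keyspace ID of the owning row."""

    def __init__(self) -> None:
        self._by_order_id: dict[int, bytes] = {}

    def record(self, order_id: int, user_id: int) -> str:
        """Register an order under its user's keyspace ID and return the owning shard."""
        kid = keyspace_id(user_id)
        self._by_order_id[order_id] = kid
        return shard_for(kid)

    def route(self, order_id: int) -> list[str]:
        """Shards to visit for WHERE order_id = ?; empty when the key is unknown."""
        kid = self._by_order_id.get(order_id)
        return [] if kid is None else [shard_for(kid)]


lookup = OrderLookup()
owner = lookup.record(order_id=1001, user_id=42)
print(owner, lookup.route(1001))
```

The primary Vindex decides where the row lives; the lookup entry lets a query on the secondary column reach that one shard instead of scattering.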

Keyspace / shard / tablet hierarchy

Keyspace (logical database)
   └── Shards
        ├── -80 (shard range 0-128)
        │     ├── Primary tablet (1 MySQL primary)
        │     ├── Replica tablet × 2
        │     └── RDOnly tablet × 1 (analytics)
        └── 80- (shard range 128-256)
              ├── Primary tablet
              ├── Replica tablet × 2
              └── RDOnly tablet × 1

Shard ranges are named by hexadecimal keyspace-ID prefixes (-80 covers 0 up to 0x80; 80- covers 0x80 up to the maximum). This leaves room to split during resharding (-80 can be cut into -40 + 40-80).
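The naming scheme is easy to reason about in code. The following sketch handles only one-byte bounds such as -80, 40-80, or 80- (shard names with longer prefixes are outside its scope), and `parse_shard` and `split_shard` are hypothetical helper names, not Vitess functions.

```python
def parse_shard(name: str) -> tuple[int, int]:
    """Turn a one-byte shard name into a half-open range [start, end) over 0..256."""
    start, _, end = name.partition("-")
    low = int(start, 16) if start else 0x00
    high = int(end, 16) if end else 0x100
    return low, high


def _bound(value: int) -> str:
    """Render a range bound; the open ends (0x00 and 0x100) are written as empty strings."""
    return "" if value in (0x00, 0x100) else format(value, "02x")


def split_shard(name: str) -> tuple[str, str]:
    """Split a shard range at its midpoint, as a resharding plan would."""
    low, high = parse_shard(name)
    mid = (low + high) // 2
    return f"{_bound(low)}-{_bound(mid)}", f"{_bound(mid)}-{_bound(high)}"


print(parse_shard("-80"))   # (0, 128)
print(split_shard("-80"))   # ('-40', '40-80')
print(split_shard("80-"))   # ('80-c0', 'c0-')
```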

Tablet types:

Primary: the write entry point
Replica: serves read traffic (controlled by Vitess query rules)
RDOnly: analytics, backups, and VReplication sources only; low SLA; takes no production read traffic

Step-by-step setup (local cluster)

Production usually deploys with the Kubernetes operator (vitess-operator), but a local cluster is the fastest way to understand the concepts:

# Use vtctldclient (it replaces the older vtctlclient)

# 1. Create an unsharded keyspace
vtctldclient CreateKeyspace --durability-policy=semi_sync commerce

# 2. Start from a single MySQL primary (unsharded)
vtctldclient ApplySchema --sql="CREATE TABLE orders (id INT PRIMARY KEY, user_id INT)" commerce

# 3. Mark the keyspace as sharded and define the VSchema
vtctldclient ApplyVSchema --vschema='{
  "sharded": true,
  "vindexes": {"hash": {"type": "hash"}},
  "tables": {
    "orders": {
      "column_vindexes": [{"column": "user_id", "name": "hash"}]
    }
  }
}' commerce

# 4. Start resharding: unsharded -> 2 shards (-80, 80-)
#    Shard names are relative to --target-keyspace
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce create \
  --source-shards="0" \
  --target-shards="-80,80-"

# 5. Wait for the data copy to finish (VReplication is running)
vtctldclient Workflow --keyspace=commerce show --workflow=initial-shard

# 6. SwitchTraffic: RDOnly and Replica first, Primary last
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce switchtraffic \
  --tablet-types="rdonly,replica"
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce switchtraffic \
  --tablet-types="primary"

# 7. Finish and clean up the old shard
vtctldclient Reshard --workflow=initial-shard --target-keyspace=commerce complete

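Real production deployments go through the Vitess Kubernetes operator: you declare the desired state in a VitessCluster CRD and the operator carries out the steps above.

5 production pitfalls

1. Cross-shard transactions: not atomic in Vitess by default

Suppose two users' orders live on different shards. BEGIN; UPDATE orders SET ... WHERE user_id=1; UPDATE orders SET ... WHERE user_id=2; COMMIT; spans both shards. By default Vitess does not guarantee atomicity: each shard commits on its own, one may succeed while the other fails, and the application sees partial state.

Fixes:

Avoid cross-shard transactions: design the schema so that transaction boundaries fall inside a single shard
Enable atomic two-phase commit (Vitess transaction_mode=TWOPC; experimental, with a large performance penalty)
Workloads that need atomicity at scale should move to distributed SQL (CockroachDB / Spanner) and let the database layer own cross-node consistency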
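The partial-state failure is easier to see in a toy model. The sketch below simulates shards as in-memory dictionaries and commits them one after another with no coordinator. The `Shard` class and `best_effort_commit` function are hypothetical and only model the behavior described above, not Vitess internals.

```python
class Shard:
    """Toy shard: a dict of committed rows plus a switch that forces commit failure."""

    def __init__(self, name: str, fail_on_commit: bool = False) -> None:
        self.name = name
        self.fail_on_commit = fail_on_commit
        self.rows: dict[int, str] = {}

    def commit(self, pending: dict[int, str]) -> None:
        if self.fail_on_commit:
            raise RuntimeError(f"commit failed on shard {self.name}")
        self.rows.update(pending)


def best_effort_commit(writes: list[tuple[Shard, dict[int, str]]]) -> list[str]:
    """Commit shard by shard with no coordinator; return the shards that committed."""
    committed = []
    for shard, pending in writes:
        try:
            shard.commit(pending)
        except RuntimeError:
            break  # shards committed earlier stay committed: this is the partial state
        committed.append(shard.name)
    return committed


left, right = Shard("-80"), Shard("80-", fail_on_commit=True)
done = best_effort_commit([(left, {1: "paid"}), (right, {2: "paid"})])
print(done, left.rows, right.rows)  # ['-80'] {1: 'paid'} {}
```

The first shard keeps its write even though the second one failed, which is exactly the state an application has to detect and repair when it runs multi-shard transactions without two-phase commit.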

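2. VStream lag: CDC falls behind during resharding

During resharding, VReplication writes a large volume of binlog events. The VStreams the application already relies on (feeding Kafka and so on) share the same binlog stream and may lag. Downstream consumers can see data that is 1-2 hours stale.

Fixes:

Pause non-critical VStreams during resharding (analytics ETL can pause; real-time recommendation has to stay)
Confirm that binlog disk capacity exceeds the estimated binlog volume during resharding × 2 (buffer)
After resharding completes, manually verify that VStream offsets have caught up, and keep the verification result as cutover evidence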
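The capacity rule in the second item is plain arithmetic. As a worked example under assumed numbers (a steady 20 MB/s of binlog writes over a 6-hour reshard, both invented for illustration), the helpers below compute the required space with the × 2 buffer; `required_binlog_bytes` and `has_headroom` are hypothetical names.

```python
def required_binlog_bytes(write_bytes_per_sec: float,
                          reshard_seconds: float,
                          buffer_factor: float = 2.0) -> float:
    """Estimated binlog volume during resharding, multiplied by the safety buffer."""
    return write_bytes_per_sec * reshard_seconds * buffer_factor


def has_headroom(free_disk_bytes: float,
                 write_bytes_per_sec: float,
                 reshard_seconds: float) -> bool:
    """True when free binlog disk space exceeds the buffered estimate."""
    return free_disk_bytes > required_binlog_bytes(write_bytes_per_sec, reshard_seconds)


# Assumed workload: 20 MB/s of binlog for 6 hours -> 432 GB, so 864 GB with the x2 buffer.
needed = required_binlog_bytes(20_000_000, 6 * 3600)
print(needed / 1e9)                                  # 864.0
print(has_headroom(1_000_000_000_000, 20_000_000, 6 * 3600))  # True with 1 TB free
```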

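3. Uneven Vindex: hot shards

The default hash Vindex distributes primary keys evenly, but a natural key (country, region, company_id, and so on) can distribute unevenly. With 10 countries where one country carries 80% of traffic, a single shard stays permanently hot.

Fixes:

Composite Vindex: combine the country and user_id columns into the shard key so distribution stays even at the user level
Synthetic shard key: have the application add sharding_key=hash(actual_key) % N to control distribution
Monitor per-shard QPS: scrape the metrics that VTGate and VTTablet export with Prometheus
Once a hot shard appears, Vitess can fix it by resharding (splitting the hot shard into two smaller shards), but the effort is substantial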
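Two of these fixes lend themselves to a short sketch: flagging a hot shard from per-shard QPS, and deriving a synthetic shard key from a composite of country and user_id. The function names, the 2× mean threshold, and the sample QPS numbers are assumptions for illustration; SHA-256 is used only because Python's built-in hash() is not stable across processes.

```python
import hashlib


def hot_shards(qps_by_shard: dict[str, float], ratio: float = 2.0) -> list[str]:
    """Return shards whose QPS exceeds `ratio` times the mean across all shards."""
    mean = sum(qps_by_shard.values()) / len(qps_by_shard)
    return sorted(name for name, qps in qps_by_shard.items() if qps > ratio * mean)


def synthetic_shard_key(country: str, user_id: int, buckets: int) -> int:
    """hash(actual_key) % N over a composite key, so one large country still spreads out."""
    digest = hashlib.sha256(f"{country}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets


sample = {"-40": 100, "40-80": 120, "80-c0": 110, "c0-": 900}  # invented numbers
print(hot_shards(sample))  # ['c0-']

# Users of a single country spread over all four buckets instead of one.
print(sorted({synthetic_shard_key("US", uid, 4) for uid in range(1000)}))  # [0, 1, 2, 3]
```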

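4. Deadlock at the moment of traffic cutover during resharding

In the final SwitchTraffic step that cuts over the primary, the old shard is still accepting writes while Vitess switches routing. For a moment the application is connected to both shards, writes for the same user_id may land on both sides, and the result can be a deadlock or a lost update.

Fixes:

Keep ReverseTraffic ready alongside SwitchTraffic: switch first, and reverse if a problem is confirmed
Cut traffic only during a known quiet period (overnight / early weekend morning)
VTGate --retry-count=2 + --track-vtgate-deadlock-events: deadlocks are retried automatically and not exposed to the application
If the cutover truly fails, use Reshard cancel to return to the old state so the workflow is back in a verifiable state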
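Proxy-side retries aside, an application can keep its own bounded retry around transactions that may hit a deadlock during the cutover window. The sketch below is a generic pattern, not a Vitess feature: `DeadlockError` is a hypothetical exception that a driver wrapper would raise for MySQL error 1213, and the attempt count and delays are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DeadlockError(Exception):
    """Hypothetical exception a driver wrapper would raise for MySQL error 1213."""


def run_with_retry(txn: Callable[[], T],
                   attempts: int = 3,
                   base_delay: float = 0.05,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run txn, retrying on DeadlockError with exponential backoff; re-raise on the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


calls = []


def flaky_update() -> str:
    """Fails twice with a deadlock, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise DeadlockError("deadlock found when trying to get lock")
    return "committed"


print(run_with_retry(flaky_update, sleep=lambda _: None), len(calls))  # committed 3
```

The transaction passed in has to be safe to re-run from the start, since a retry replays the whole unit of work.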

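5. VReplication workflow stuck: protect state before you cancel

A VReplication workflow reaches 50%, then hits a row it cannot process (a schema mismatch, or a blob that exceeds the size limit). The workflow gets stuck, progress stops, and there is no timeout. The whole resharding flow halts.

Fixes:

Routinely dry-run against staging data to surface schema and blob boundary problems
When a workflow is stuck, run vtctldclient Workflow show and check last_message / row_state
Fix the offending row by hand (directly in MySQL), then resume the workflow
On large clusters, run a SchemaApply audit before starting VReplication to confirm that the source and target schemas are compatible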

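Vitess versus self-managed sharding

Dimension	Vitess	Application-level sharding
Application changes	Almost none (keeps the MySQL wire protocol)	Major (routing logic lives in the application)
Cross-shard query	VTGate splits automatically (with limits)	Handled by the application
Resharding	Automated by VReplication	Hand-written scripts, complex to operate
Online schema change	Built into Vitess (VReplication-based)	gh-ost / pt-osc
Failover	VTOrc integration	Self-managed Orchestrator
Operational cost	High (4 components to understand)	Medium (fewer abstractions, but more application logic)
Cross-keyspace shared vindex	Built in (lookup_hash across keyspaces)	Write your own

Operational complexity is the price of Vitess. A 10-20 person SRE team can carry it; a 5-person team is more realistically served by managed Vitess (PlanetScale).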

Integration with other modules

With replication topology

Inside each shard Vitess still uses MySQL replication (Replication Topology): every shard has a primary, replicas, and rdonly tablets. The Vitess durability-policy controls whether a primary write waits for a replica ack (semi-sync).

With OSC tools

Vitess does not rely on gh-ost / pt-osc; it uses VReplication-based online DDL. Vitess online DDL:

vtctldclient ApplySchema --ddl-strategy=vitess \
  --sql="ALTER TABLE orders ADD COLUMN status VARCHAR(20)" commerce

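See Online Schema Change Tools for details.

With ProxySQL

Vitess replaces ProxySQL. VTGate does connection pooling and query routing itself, so ProxySQL is no longer needed. Mixing them causes routing conflicts (VTGate expects to decide the shard itself, and ProxySQL competes with it). See ProxySQL configuration for details.

With Orchestrator

Vitess uses VTOrc (a fork of Orchestrator) for failover, integrated with Vitess topology metadata. A standalone Orchestrator is not used. See Orchestrator failover design for details.

With PlanetScale (managed Vitess)

PlanetScale is a managed Vitess service. It hides the operational complexity of the four components and adds a branch-based schema workflow. See the PlanetScale migration playbook for details.

With Aurora MySQL

Aurora and Vitess are different scaling paths:

Aurora: single-region scaling (separated storage and compute, up to ~128 TB)
Vitess: horizontal sharding (no fixed upper limit; scales by adding shards)

The two carry different capacity and operational responsibilities. Consider Vitess only when the workload exceeds Aurora's single-region ceiling. See the Aurora vendor page for details.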

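Production case: YouTube / Vitess

In production, the job of Vitess is to turn a MySQL shard topology into a database layer that applications can query, migrate, and operate. The engineering signal in the public history of YouTube and Vitess is the division of labor among VTGate, VTTablet, VReplication, and VSchema: application queries enter through VTGate, the tablet layer wraps MySQL, VSchema describes the routing and sharding rules, and VReplication supports resharding and data movement.

This case reduces to three operational criteria. First, Vitess is a database control plane, not a single proxy; adopting it means taking ownership of the topology service, tablet lifecycle, backup, failover, and schema workflow together. Second, VSchema is an application contract: shard keys, lookup vindexes, and cross-shard queries all shape product feature design. Third, VReplication makes resharding operable, but it still needs a capacity window, backfill monitoring, and a cutover plan.

The sibling routes to Vitess are PostgreSQL Citus Distributed and 1.11 Globally Distributed OLTP. Citus keeps the PostgreSQL ecosystem and splits data across a coordinator and workers; CockroachDB / Spanner use distributed SQL to redefine transaction and consistency boundaries. When choosing, first decide whether you are extending a MySQL investment or selecting a new global OLTP model.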

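When to use Vitess

Condition	Assessment
Traffic > 50K WPS, a single primary cannot keep up	In scope for Vitess
Large existing MySQL investment, no wish to move to distributed SQL	Yes
A 5-10 person SRE / DBA team	Yes
Traffic < 10K WPS	No (over-engineering; use a single MySQL plus replicas)
5-person team that does not want to staff DBAs	No (use managed PlanetScale)
Multi-region strongly consistent transactions required	No (CockroachDB / Spanner are the right fit)
Complex cross-shard analytics needed	No (pair with BigQuery / Snowflake)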

Tarragon