Topology on Tarragon

MongoDB Shard Expansion + Multi-DC: The Multi-Region Exception to Type F's "No Parallel Run" Claim

Tue, 19 May 2026 00:00:00 +0000

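This is an implementation-layer deep article under the MongoDB overview. It is the third dogfood of #128 Type F "Topology re-layout", and it specifically tests the multi-region rollout exception to self-aware limitation point 3, the "no parallel run needed" claim. This article is a concrete worked counterexample.

Reviewer D's challenge: does Type F really never need a parallel run?

Self-aware limitation point 3 in #128 concedes:

The "no parallel run needed" claim only partly holds: a multi-region rollout (which #128 lists as a Type F scenario) must run in parallel. Both regions serve at the same time and traffic is then cut over; the alternative is a downtime switch. The mechanism is the same as Type A phase 3.

This article demonstrates that limitation in practice. Taking a MongoDB sharded cluster from single-DC to more shards plus a secondary DC does require a parallel run and a traffic cutover, and is partially isomorphic to a Type A phased migration: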

Type F assumption	Single-DC re-sharding (Redis case)	Multi-DC expansion (this article)
Same cluster, different state	yes	yes (same MongoDB cluster)
No schema translation	yes	yes
No parallel run	yes (slot migration completes internally)	no: both DCs serve, then traffic is cut over
No cleanup phase	yes	partial (the old DC is demoted to standby)
Step-by-step + rollback boundary	yes	yes

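→ The Type F anatomy still applies, but "no parallel run" is a sub-scenario condition, not a universal claim.

Two operations combined: adding shards + adding a DC

In practice, mid-sized companies often run two topology changes at once:

Shard expansion: grow the existing 3-shard cluster to 5 shards, with chunk migration evening out the distribution
Multi-DC: go from single-DC (us-east-1) to multi-DC (us-east-1 + us-west-2)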

Diff dimension audit for the two operations:

Dimension	Adding shards (alone)	Multi-DC (alone)	Both together
Schema / API	Low	Low	Low
Operational model	Low	Medium (cross-DC ops)	Medium
Paradigm	Low	Low	Low
Components	Low (more shards, same cluster)	Low	Low
Application change	Low	Low-Medium (cross-DC latency aware)	Low-Medium
Data topology	High (sharding strategy)	High (replication + region)	High (two changes, compound topology)

Both operations are dominated by the same dimension (topology = High), so the combination falls under the Type F multi-axis sub-scenario.

Pre-layout analysis: current and target topology

// 1. Current shard distribution
sh.status();
// Expected: 3 shards, each holding ~33% of chunks, no migration in progress

db.printShardingStatus();
// Look for hot shards and imbalanced chunk distribution

// 2. Replication topology (run against each shard's replica set)
rs.status();
// Health of each replica set's primary/secondaries, replication lag

// 3. Cross-DC network baseline (measure before adding the DC)
// us-east-1 → us-west-2 RTT and bandwidth

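Output of the pre-layout stage:

Current: 3 shards × 1 replica set per shard (3 members) = 9 nodes, all in us-east-1
Target: 5 shards × 1 replica set per shard (5 members: 3 us-east + 2 us-west) = 25 nodes
Migration scope: add 2 shards and add 2 DC members to every shard, +16 nodes in total
Chunk migration estimate: about 40% of chunks must move (each original shard goes from ~33% to 20% of the data, so the two new shards end up holding 2/5 of it)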
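The scope numbers above are simple arithmetic, and it can help to keep them as a small calculation that is rerun when the target shard count changes. The sketch below assumes chunks are evenly balanced before and after the expansion; the function names are illustrative and not part of any MongoDB API.

```javascript
// Fraction of chunks that must move when an evenly balanced cluster
// grows from `from` shards to `to` shards.
function fractionOfChunksToMove(from, to) {
  if (!(from > 0) || to < from) throw new RangeError("expected 0 < from <= to");
  // Each existing shard shrinks from 1/from to 1/to of the data;
  // the new shards end up holding (to - from) / to of it in total.
  return (to - from) / to;
}

// Total node count for a layout of identical replica sets.
function nodeCount(shards, membersPerShard) {
  return shards * membersPerShard;
}

const moved = fractionOfChunksToMove(3, 5);
const addedNodes = nodeCount(5, 5) - nodeCount(3, 3);
console.log(`${Math.round(moved * 100)}% of chunks move, +${addedNodes} nodes`);
```

With the numbers in this article (3 → 5 shards, 3 → 5 members per shard) the result is 40% of chunks and 16 additional nodes.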

Re-layout mechanism

The two mechanisms run in parallel:

Shard expansion mechanism

// 1. Add the new shards to the cluster
sh.addShard("rs-shard4/host10:27017,host11:27017,host12:27017");
sh.addShard("rs-shard5/host13:27017,host14:27017,host15:27017");

// 2. The balancer migrates chunks automatically
sh.startBalancer();
// Watch progress: db.adminCommand({ balancerStatus: 1 })

// 3. Verify shard distribution once it finishes
sh.status();

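Chunk migration runs as a background job throttled by the balancer. Most of the copy does not block production queries, but writes to the chunk being moved are briefly held during the commit step, and CPU / network usage rises by 30-50%.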

Multi-DC expansion mechanism

// 1. Add a us-west-2 member (priority 0) to each shard's replica set
rs.add({
  host: "us-west-2-host:27017",
  priority: 0,                      // cannot become primary
  votes: 1,                         // takes part in elections
  hidden: false,
  tags: { region: "us-west-2" }     // needed so readPreferenceTags in step 4 can match this member
});

// 2. Wait for initial sync to finish (1 hour to 1 day, depending on data size)
rs.printSecondaryReplicationInfo();  // how far each secondary is behind the primary

// 3. Once the secondary is healthy, raise priority or votes
// Do not set priority 1 right away, to avoid an unintended failover

// 4. Cross-DC routing is set in the application through readPreference
const client = new MongoClient(uri, {
  readPreference: 'secondaryPreferred',
  readPreferenceTags: [{ region: 'us-west-2' }, {}],
});

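Key point: multi-DC is added member by member, not as an atomic switch. Each shard gets its member independently, so if the shards are handled one after another the total time = number of shards × initial sync time.

Execution flow (with parallel run + traffic cutover)

Eight steps, including a parallel-run and cutover segment, which is what confirms self-aware limitation point 3 of #128: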

Step	Action	Parallel run?	Rollback boundary
1 Pre-check	Quantify the current topology, confirm cluster health	no	-
2 Add us-east shards	sh.addShard, balancer migrates chunks	no (inside the cluster)	removeShard, chunks migrate back
3 Add us-west members	rs.add a cross-DC member on every shard	no	rs.remove; the initial sync work is discarded
4 Initial sync wait	Wait for all us-west members to catch up	parallel run starts: both DCs serve	-
5 Cross-DC dual-serve	Both DCs serve read traffic (writes are not switched)	yes, parallel run: app uses secondaryPreferred toward us-west	switch readPref back to the us-east primary
6 Traffic cutover	Application traffic in us-west reads from us-west	yes	switch DNS / readPref back
7 Promote us-west (optional)	Raise the us-west member of one shard to priority 1	post-cutover	demote priority back to 0
8 Cleanup	Verify, archive logs, document the new topology	no	-

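Steps 4-6 are the parallel run + traffic cutover. Type F has this exception, and it is isomorphic to the Type A phase 3 mechanism; the "Execution flow per-step" section of the anatomy must therefore include a parallel-run subsection.

Production failure drills

Case 1: Balancer chunk migration collides with production peak

Symptom: after the shards are added the balancer starts migrating chunks, and production write latency p99 jumps from 10ms to 100ms; the application sees a large number of timeouts.

Root cause: the MongoDB balancer runs 24×7 by default, and a chunk migration blocks writes to the chunk while it holds the migration lock. The balancer does not pause itself during production peak hours.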

Fix:

// Restrict the balancer to a low-traffic window
// (the balancer settings document lives in the config database)
sh.setBalancerState(true);
db.getSiblingDB("config").settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },
  { upsert: true }
);

Also lower the chunk size (128MB → 64MB) so that each migration step is smaller and each lock is held for less time.
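MongoDB enforces the activeWindow itself, but runbook automation often needs the same check, for example to refuse to start a manual step outside the window. The helper below is a minimal sketch of that check; the function names are hypothetical, and it assumes the times passed in are already in the same time zone as the configured window.

```javascript
// Convert "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// True when `now` falls inside [start, stop); handles windows that
// wrap past midnight, such as 23:00-06:00.
function inBalancerWindow(now, start, stop) {
  const n = toMinutes(now);
  const s = toMinutes(start);
  const e = toMinutes(stop);
  return s <= e ? n >= s && n < e : n >= s || n < e;
}

console.log(inBalancerWindow("03:30", "02:00", "06:00")); // inside the window
console.log(inBalancerWindow("12:00", "02:00", "06:00")); // outside the window
```

Treating the stop time as exclusive is a design choice of this sketch, so that back-to-back windows do not overlap.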

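Case 2: The oplog window is exceeded during cross-DC initial sync

Symptom: after the us-west member is added, initial sync runs for 4 hours; when it ends the member reports "too stale to catch up" and needs a full re-sync.

Root cause: the MongoDB oplog is a capped collection whose default size is 5% of free disk space. During the 4-hour initial sync the primary wrote more than the oplog retains, so the oplog position the new member needs to resume from has already been overwritten.

Fix:

Grow the oplog beforehand: db.adminCommand({replSetResizeOplog: 1, size: 51200}) raises it to 50GB (size is given in MB) so that it covers the sync window
Off-peak initial sync: run it during low traffic, when the oplog fills more slowly
Manual initial sync via snapshot: seed the new member from a filesystem snapshot of an existing member's data files, so that only a short oplog tail has to be caught up. A mongodump archive is not suitable for seeding a replica set member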
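How large the oplog needs to be follows from the write rate and the expected sync duration. The sketch below is a back-of-the-envelope estimate; the function name, the example write rate of 5000 MB per hour, and the 2x safety factor are all assumptions for illustration, not measured values from this cluster.

```javascript
// Oplog size (in MB) needed to retain everything written while an
// initial sync is running, with a safety margin on top.
function requiredOplogMB(writeMBPerHour, syncHours, safetyFactor = 2) {
  if (writeMBPerHour < 0 || syncHours < 0 || safetyFactor < 1) {
    throw new RangeError("rates and durations must be non-negative, safetyFactor >= 1");
  }
  return Math.ceil(writeMBPerHour * syncHours * safetyFactor);
}

// Hypothetical workload: 5000 MB of oplog per hour, 4-hour initial sync.
const needed = requiredOplogMB(5000, 4);
const configured = 51200; // the replSetResizeOplog size used above, in MB
console.log(`need ${needed} MB, configured ${configured} MB, ok=${needed <= configured}`);
```

If the estimate exceeds the configured size, either grow the oplog further or shorten the sync (for example by seeding from a snapshot).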

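Case 3: Cross-DC read routing is wrong and stale data hits the business

Symptom: after traffic is cut over to us-west, the application occasionally reads data that is 5-30 seconds stale; customers report "I just changed a setting, and after a refresh it changed back".

Root cause: the us-west members are secondaries with 5-30 seconds of replication lag. The application sets readPreference to secondaryPreferred without maxStalenessSeconds, so it can read from a badly lagging member.

Fix: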

const client = new MongoClient(uri, {
  readPreference: 'secondaryPreferred',
  readPreferenceTags: [{ region: 'us-west-2' }, {}],
  maxStalenessSeconds: 90,  // skip secondaries more than 90s behind (90 is the minimum allowed value)
});

// Force primary reads where strict consistency is required
const client_strict = new MongoClient(uri, {
  readPreference: 'primary',  // always read from the us-east primary
});

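The application's read paths have to be split into "accepts stale reads" and "requires fresh reads"; this cannot be a single cluster-level setting. Because maxStalenessSeconds cannot go below 90, reads that cannot tolerate the 5-30 second lag described above belong in the second group.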
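One way to keep that split explicit is to have each call site declare its freshness requirement and derive the driver options from it. The sketch below only builds option objects using the readPreference, readPreferenceTags and maxStalenessSeconds options already shown in this article; the function name and the "fresh" / "stale-ok" labels are hypothetical.

```javascript
const MIN_MAX_STALENESS_SECONDS = 90; // smallest value MongoDB accepts

// Map a declared freshness requirement to read options.
function readOptionsFor(requirement, localRegion) {
  if (requirement === "fresh") {
    // Settings pages, balances, anything the user just changed.
    return { readPreference: "primary" };
  }
  if (requirement === "stale-ok") {
    // Feeds, dashboards, analytics: prefer the local-region secondary.
    return {
      readPreference: "secondaryPreferred",
      readPreferenceTags: [{ region: localRegion }, {}],
      maxStalenessSeconds: MIN_MAX_STALENESS_SECONDS,
    };
  }
  throw new Error(`unknown read requirement: ${requirement}`);
}

console.log(readOptionsFor("fresh", "us-west-2"));
console.log(readOptionsFor("stale-ok", "us-west-2"));
```

Throwing on an unknown label forces every new read path to make the choice deliberately instead of inheriting a default.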

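Case 4: No shard tag-aware routing, and cross-DC traffic blows up cost

Symptom: after a month of multi-DC operation, AWS egress cost rises from $500 / month to $8000 / month; 99% of traffic is still crossing DCs between us-east and us-west.

Root cause: the sharded cluster has no zone sharding, so the application has no way of knowing which chunks live in which DC. Every query goes to the us-east primary by default and cross-DC bandwidth explodes.

Fix: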

// Note: sh.addShardToZone / sh.updateZoneKeyRange are the zone API (MongoDB 3.4+);
// they supersede the older sh.addShardTag / sh.addTagRange helpers

// 1. Assign shards to zones
sh.addShardToZone("rs-shard1", "us-east");
sh.addShardToZone("rs-shard2", "us-east");
sh.addShardToZone("rs-shard3", "us-east");
sh.addShardToZone("rs-shard4", "us-west");
sh.addShardToZone("rs-shard5", "us-west");

// 2. Assign zone key ranges on the collection
// (assumes myapp.events is sharded on { region: 1, _id: 1 })
sh.updateZoneKeyRange(
  "myapp.events",
  { region: "us-east", _id: MinKey },
  { region: "us-east", _id: MaxKey },
  "us-east"
);
sh.updateZoneKeyRange(
  "myapp.events",
  { region: "us-west", _id: MinKey },
  { region: "us-west", _id: MaxKey },
  "us-west"
);

// 3. The balancer then moves chunks to the matching zone

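Zone sharding is a required part of a multi-DC design; without it the egress bill is money wasted.

Case 5: After failover the primary moves across DCs and application connections drop

Symptom: six months into production, us-east-1 has an outage and the primary of one shard moves to a us-west member; the application sees a burst of connection errors for 5-10 seconds.

Root cause: the replica set's default election timeout is 10 seconds, and the application had no retry around server selection, so clients did not reconnect while the primary was changing.

Fix: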

const client = new MongoClient(uri, {
  serverSelectionTimeoutMS: 30000,    // wait up to 30s for an election to finish
  retryWrites: true,
  retryReads: true,
  heartbeatFrequencyMS: 5000,         // detect topology changes sooner
});

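In addition, a multi-DC replica set should use asymmetric priorities: us-east members at priority 2, us-west members at priority 1. The primary then stays in us-east under normal conditions and moves automatically only in a disaster.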

Capacity / cost

Dimension	Single-DC 3-shard	Multi-DC 5-shard	Trade-off
Node count	9	25	~3x infrastructure cost
Storage redundancy	3 replicas	5 replicas (3 east + 2 west)	+2 copies, storage cost +67% per shard
Network egress	Inside the VPC, low	Cross-DC, high (needs zone sharding)	$500 → $8000 / month without zone sharding
Latency p99 (write)	5-10ms	5-15ms (primary stays in us-east)	Slightly higher
Latency p99 (read)	5-10ms	2-5ms (local DC)	Region-local reads get faster
Disaster recovery	RTO 30 minutes (rebuild)	RTO < 1 minute (auto failover)	Markedly better
Operational complexity	Low	High (zone sharding / DR drills)	+1 SRE FTE to maintain

Reading: multi-DC is a DR investment, not a cost optimization; it is only worth it when the availability SLA is above 99.9% or compliance requires it.

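Integration / next steps

Relation to MongoDB → Atlas migration

Self-managed multi-DC is complex, and Atlas reduces multi-cluster + cross-region to UI configuration; if you are going multi-DC, consider migrating straight to Atlas.

Integration with application read patterns

Zone sharding and readPreference are tightly coupled to application logic. They cannot be bolted on afterwards; region-aware routing on the application side should be designed during the multi-DC design stage.

Comparison with Cassandra keyspace re-balance

Cassandra is another typical Type F multi-DC case. It uses NetworkTopologyStrategy with a replication factor per DC, which is conceptually equivalent to MongoDB zone sharding but mechanically completely different. Reviewer D cited Cassandra as the Type F counterexample; this article validates the same point using MongoDB instead.

Open topics

Cross-region active-active: MongoDB does not support multi-primary, so cross-region active-active needs application-level conflict resolution
Comparison with PostgreSQL Citus / CockroachDB multi-region: distributed SQL takes a different design approach to multi-region
Cost optimization: cross-DC egress is a long-term concern and still needs a quarterly review once zone sharding is in place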

Redis Cluster Re-sharding: Source = Target, but the Topology Is Redrawn, in 5 Stages

Tue, 19 May 2026 00:00:00 +0000

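This is an implementation-layer deep article under the Redis overview. It is the third worked example for the "when not to apply it" section of the Migration playbook methodology (capacity re-planning / re-sharding): source and target are the same vendor and the same cluster, but the data topology is redrawn, which falls outside the 5 types.

Source = target, but the topology is redrawn

Migration usually assumes that source and target are different clusters or vendors. Re-sharding is a slot redistribution inside one cluster: source and target are two states of the same Redis Cluster: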

Before re-shard:
  Cluster A: [node1: slots 0-5460] [node2: slots 5461-10921] [node3: slots 10922-16383]
              ~ 33% load           ~ 50% load              ~ 17% load (heavy imbalance)

After re-shard:
  Cluster A: [node1: slots 0-4095] [node2: slots 4096-8191] [node3: slots 8192-12287] [node4: slots 12288-16383]
              ~ 25% load           ~ 25% load              ~ 25% load              ~ 25% load

Source and target are the same cluster; what differs is the slot-to-node mapping. The application connection string, the cluster API and the data model all stay the same. But application behavior during slot migration differs a lot from normal operation, and handling that is the main work of re-sharding.

Running the diff dimension audit on Redis cluster re-sharding:

Dimension	Assessment	Level
Schema / API	Same Redis, no change	Low
Operational model	Same Redis Cluster, operations unchanged	Low
Abstraction / paradigm	Same Redis Cluster, no paradigm difference	Low
Number of components	Still 1 (the cluster)	Low
Application change	Mostly none; the client's cluster mode handles it	Low
Data topology	Redrawn: slot mapping and node count	New axis

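All 5 dimensions are Low, which would map to Type B drop-in; but data topology is a sixth dimension that the 5 types do not have. This article therefore uses a re-sharding-specific structure rather than any of the 5 types.

4 re-sharding drivers

Different drivers call for different re-sharding strategies:

Driver	Trigger	Re-sharding operation
Slot imbalance	A business hotspot hits a subset of slots; a single node sits at 80%+ CPU / memory	Rebalance (redistribute slots, no new node)
Capacity expansion	The whole cluster is approaching its memory / throughput ceiling and needs more nodes	Add node + slot migration (move some slots over from existing nodes)
Node decommission	Old hardware is retired / cloud instance generation changes	Drain (move all of the node's slots away) + remove
Hash tag refactor	The business access pattern changes and co-located key groups need regrouping	Application-side migration (not cluster-level)

The first three are cluster-internal and are done with the redis-cli --cluster tool; the fourth needs application-side dual-write + migration and is not covered here.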

Slot migration mechanism

Redis Cluster has 16384 slots, and each key maps to a slot via CRC16(key) % 16384 (when the key contains a hash tag, only the tag is hashed). The slot migration process:

Source node:     [slot N: MIGRATING to dest]
Dest node:       [slot N: IMPORTING from source]
                 ↓
Source node:     CLUSTER GETKEYSINSLOT N → for each batch of keys, MIGRATE to dest:
                 1. source serializes the value (as DUMP would)
                 2. source sends it to dest
                 3. dest RESTOREs the key
                 4. source DELetes the key once dest acknowledges
                 ↓
Source node:     [slot N: OWNED by dest]
Dest node:       [slot N: OWNED]
                 ↓
Cluster-wide broadcast: slot N belongs to dest

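Client behavior during migration:

Key still on the source (not yet migrated): the source serves it directly
Key already on the dest (migrated): the source replies with an -ASK redirect and the client resends to the dest
New key written to a MIGRATING slot: the source does not hold the key, so it also replies -ASK and the key is created on the dest
No application code change is needed; a cluster-aware client handles -ASK redirects automatically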
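To make the key → slot mapping concrete, here is a minimal sketch of the calculation a cluster-aware client performs: CRC16 (the XMODEM variant, polynomial 0x1021) over the key or its hash tag, modulo 16384. The function names are illustrative; real clients ship their own table-driven implementation, and `sameSlot` is a hypothetical helper for auditing multi-key operations.

```javascript
// CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000.
function crc16(bytes) {
  let crc = 0;
  for (const b of bytes) {
    crc ^= b << 8;
    for (let i = 0; i < 8; i++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

// If the key contains a non-empty {...} section, only that part is hashed.
function hashTagOrKey(key) {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  return close > open + 1 ? key.slice(open + 1, close) : key;
}

function keySlot(key) {
  return crc16(Buffer.from(hashTagOrKey(key), "utf8")) % 16384;
}

// True when every key maps to the same slot (safe for MULTI/EXEC).
function sameSlot(keys) {
  return new Set(keys.map(keySlot)).size <= 1;
}

console.log(keySlot("foo"));
console.log(sameSlot(["{user:123}:profile", "{user:123}:cart"]));
```

Keys that share a hash tag such as {user:123} always land in the same slot, which is what keeps them together through a reshard.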

The redis-cli --cluster tool

In production use the official tool; do not hand-write slot migration:

# 1. Rebalance (redistribute slots; suits imbalance)
redis-cli --cluster rebalance 10.0.0.1:6379 \
  --cluster-use-empty-masters \
  --cluster-threshold 5

# 2. Reshard (explicit source → target; suits capacity expansion)
redis-cli --cluster reshard 10.0.0.1:6379 \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots 4096 \
  --cluster-yes

# 3. Add-node (join a new node to the cluster as an empty master)
redis-cli --cluster add-node 10.0.0.4:6379 10.0.0.1:6379
# To add it as a replica instead, append: --cluster-slave --cluster-master-id <master-node-id>

# 4. Del-node (remove a node; drain its slots first)
redis-cli --cluster del-node 10.0.0.1:6379 <node-id>

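Key points:

--cluster-threshold 5: rebalance only when a node's slot count is more than 5% off its target share, which avoids repeated triggering. The tool balances slot counts, not measured load
--cluster-slots: the total number of slots this reshard moves; a large value means a long-running operation, a small one means more rounds
The cluster keeps serving traffic during rebalance / reshard, but latency rises (migration overhead)

The 5-stage execution flow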

1. Pre-resharding analysis
   - Current slot distribution and load
   - Key count per slot (CLUSTER COUNTKEYSINSLOT) to spot heavy slots
   - Estimated migration time

2. Backup checkpoint
   - BGSAVE on all masters
   - Confirm replicas are keeping up (replication offset diff < 10MB)

3. Execute re-sharding
   - Use the redis-cli --cluster tool
   - Monitor cluster health (CLUSTER INFO + CLUSTER NODES)
   - Compare application-side latency against the baseline during migration

4. Verify
   - Slot distribution matches the expected mapping
   - Application traffic pattern matches the baseline
   - Run a cross-node sanity check

5. Cleanup
   - Reset / release old nodes (if decommissioning)
   - Update monitoring dashboards (Prometheus target / Grafana panel)
   - Document the new topology

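End to end this takes 1-7 days depending on cluster size; the migration step itself is about 1 hour for 10GB and 1-3 days at TB scale.

Production failure drills

Case 1: Application timeouts while the cluster is busy

Symptom: midway through re-sharding the application starts seeing many CLUSTER BUSY errors and OOM warnings, and latency p99 jumps from 5ms to 200-2000ms; some batch operations fail outright.

Root cause: MIGRATE is blocking for each key (DUMP + send + RESTORE + DEL happen atomically). Migrating a large value (a HASH / SORTED SET / LIST with 100K+ entries) can lock the node for seconds, and every other query on it waits.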

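Fix:

Pre-resharding audit: sample keys with MEMORY USAGE, find fat keys over 1MB, and list them for separate handling
MIGRATE timeout: pass --cluster-timeout 10000 (10s) to redis-cli so that a single stuck MIGRATE call fails fast instead of stalling the cluster
Lower the batch size: --cluster-pipeline 1 moves one key per MIGRATE call (default 10), which reduces CPU pressure
Fat key refactor: production should not hold collections with 1M+ entries; refactor and split them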

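Case 2: Replica lag during re-sharding

Symptom: after the reshard completes, replicas serve stale data for several minutes and application reads from replicas return old values.

Root cause: slot migration generates a large volume of DEL + RESTORE commands on the masters; the replication stream balloons, the replicas cannot keep up, and lag accumulates.

Fix:

Before resharding, confirm replica lag is under 5MB; otherwise fix the replica issue first
Throttle migration: lower --cluster-pipeline to slow the write rate on the masters
Application-side read-write split policy: force reads to the master during the reshard and give up replica reads temporarily
Contingency plan: if lag stays above 30s for 5+ minutes, consider pausing the reshard and waiting for the replicas to catch up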
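The contingency rule above ("lag above 30s for 5+ minutes") is easy to get subtly wrong when evaluated by eye, so it is worth encoding. The sketch below assumes lag samples have already been collected elsewhere as { atSec, lagSec } objects ordered by time; the function name and sample shape are hypothetical.

```javascript
// samples: [{ atSec, lagSec }, ...] ordered by time.
// Returns true once lag has stayed above lagLimitSec for sustainSec.
function shouldPauseReshard(samples, lagLimitSec = 30, sustainSec = 300) {
  let breachStart = null;
  for (const { atSec, lagSec } of samples) {
    if (lagSec > lagLimitSec) {
      if (breachStart === null) breachStart = atSec;
      if (atSec - breachStart >= sustainSec) return true;
    } else {
      breachStart = null; // lag recovered, restart the clock
    }
  }
  return false;
}

// One sample per minute, lag stuck at 45s for five minutes.
const sustained = [0, 60, 120, 180, 240, 300].map((atSec) => ({ atSec, lagSec: 45 }));
console.log(shouldPauseReshard(sustained));
```

A single recovered sample resets the timer, so short spikes do not pause the reshard; only a continuous breach does.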

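Case 3: Client-side topology cache stale

Symptom: after the reshard, the application keeps reporting MOVED redirects, and is redirected again 30s later; some clients get connection refused outright (they are connecting to a decommissioned node).

Root cause: cluster-aware clients (lettuce / Jedis cluster mode) keep a topology cache and do not refresh it proactively after a reshard. They refresh once on MOVED, but may keep using the old mapping within the cache TTL.

Fix:

Client config: in lettuce, use clusterTopologyRefreshOptions(...) with a shorter refresh interval (60s) plus enablePeriodicRefresh()
Trigger a refresh after the reshard: the application can issue CLUSTER NODES itself to fetch the latest topology instead of relying on the client library's automatic refresh
Graceful client shutdown / restart: for latency-sensitive services, do a rolling restart of the application pods after the reshard to avoid stale caches
Keep decommissioned nodes up for 5 minutes: do not stop the node immediately, so stale clients get a chance to retry naturally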

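Case 4: Cross-slot transactions fail

Symptom: the application uses MULTI/EXEC across several keys; during the reshard some transactions fail with redirection errors such as MOVED, the whole transaction is rejected, and business logic ends up inconsistent.

Root cause: a Redis Cluster transaction requires all keys to be in the same slot (using a hash tag such as {user:123}). During a reshard, if some keys of the transaction have already moved to the dest while others have not, the node cannot serve the multi-key operation and rejects it until the slot finishes migrating.

Fix:

Pre-resharding audit: grep the application code for MULTI / pipeline usage and confirm every case co-locates its keys with a hash tag
Add application-side retry during the reshard: back off and retry after a transaction failure; it succeeds once the cluster stabilizes
Architecture: for transaction-heavy workloads, consider Redis Sentinel with a single master (no slot concept) instead of Redis Cluster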

Case 5: Monitor visibility gap during reshard

Symptom: during a reshard, the Prometheus dashboard shows mismatched metrics for a node, for example load = 95% while it owns only 6% of slots; the SOC cannot tell whether the node is healthy.

Root cause: the Prometheus exporter computes slot count and traffic load separately. During a reshard the slots have already moved but traffic still hits the source node (stale client caches), so the metrics look contradictory.

Fix:

Mute alerts during the reshard: declare a known maintenance window and silence the Prometheus alerts
Add a reshard-aware metric: quantify in-flight migration with redis_cluster_migration_slots
Annotate the dashboard: so the SOC knows the anomaly is expected while a reshard is running

Capacity / cost

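Dimension	Estimate	Watch for
Slot migration speed	1-10K keys / sec (depends on key size + network)	At TB scale, 10K keys / sec → about 1 day
Application latency impact	p99 +50-200% during migration	Set a latency budget; pause when it is exceeded
Memory / node	Unchanged overall, but +5-15% while keys are temporarily held on both sides	Do not reshard at 90%+ memory
Network bandwidth	Heavy cross-node traffic, ~100-500 Mbps per migration stream	Watch egress cost when resharding across AZs
Recovery time	Rolling back a failed reshard = a reverse reshard (same duration)	Do not reshard during an incident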
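The migration-speed row translates directly into a duration estimate, which is useful for deciding whether a reshard fits a given window. This is a rough sketch under the assumption of a constant migration rate; the function name and the example key count are illustrative, not measurements.

```javascript
// Estimated wall-clock hours to migrate `totalKeys` at a steady rate.
function estimateMigrationHours(totalKeys, keysPerSecond) {
  if (!(keysPerSecond > 0)) throw new RangeError("keysPerSecond must be positive");
  return totalKeys / keysPerSecond / 3600;
}

// Hypothetical cluster: 864 million keys to move at 10K keys / sec.
console.log(estimateMigrationHours(864000000, 10000)); // hours
console.log(estimateMigrationHours(864000000, 1000));  // hours at the slow end
```

At the slow end of the 1-10K keys / sec range the same move takes ten times longer, which is why the rate should be measured on a small reshard before planning a large one.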

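Practical defaults:

Run during low-traffic hours (nights / weekends)
Reshard while throughput utilization is under 50%; do not operate at 80%+
Reserve a rollback window, so that a stuck reshard can be aborted and the original layout restored

Integration / next steps

Relation to Redis → DragonflyDB migration

DragonflyDB is designed to replace a cluster with single-node performance, which makes the re-sharding question disappear. If cluster re-sharding is triggered frequently, evaluate whether migrating to DragonflyDB would be cheaper.

Comparison with Sentinel HA

Sentinel mode has no slot concept, so re-sharding does not apply. Manual sharding by the application may still need a similar topology re-layout, and the application has to handle it itself.

Relation to Redis 7+ Functions / Cluster v2

Redis 7 promotes Cluster v2 and Functions, and parts of the slot migration mechanism are upgraded. Keyspace migration remains the core issue, but the API and monitoring improve.

Open topics

Auto-rebalance via operator: managed Redis offerings such as Redis Enterprise / Aiven provide automatic rebalancing with no manual trigger
Cross-DC slot migration: slot migration in a cross-region cluster has a large latency / cost impact and is usually replaced by application-level sharding rather than done at cluster level
Hash tag governance: enforce hash tags with application code grep / lint to avoid the cross-slot transaction anti-pattern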

PostgreSQL Partition Redesign: When Monthly Partitions Keep Getting Slower

Tue, 19 May 2026 00:00:00 +0000

This is an implementation-layer deep article under the PostgreSQL overview. It is the second dogfood of #127 Type F "Topology re-layout" (the first is Redis cluster re-sharding), and it tests whether the Type F anatomy generalizes across vendors.

Why monthly partitions keep getting slower

At launch, monthly range partitioning was a sensible design: one partition per month, 12 per year, and with WHERE event_time >= '2026-05-01' partition pruning hits a single partition, so queries are fast. After 18 months in production:

Each monthly partition grew from 50GB to 500GB (10x traffic)
A sub-month query, WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15', still scans the whole 500GB month (pruning granularity stops at the month)
Vacuuming one monthly partition takes 6-8 hours and no longer fits the maintenance window
Dropping old partitions to free storage happens on a monthly cadence, but the retention policy requires daily granularity

The partition design needs a redesign, not an "optimization": from monthly range partitions to daily range partitions, taking the partition count from 36 (3-year retention) to 1095.

Diff dimension audit result:

Dimension	Assessment	Level
Schema / API	Same PostgreSQL, same table definition, partition key unchanged	Low
Operational model	Same PostgreSQL operational stack	Low
Paradigm	Same OLTP RDBMS	Low
Components	Still 1 DB	Low
Application change	None (partition pruning is transparent)	Low
Data topology	Partition strategy goes from monthly → daily	High

Five dimensions Low + topology High = Type F "Topology re-layout".

Pre-layout analysis: detecting partition imbalance

Before executing the redesign, quantify the current topology:

-- 1. Size + row count of each partition
SELECT
  child.relname AS partition_name,
  pg_size_pretty(pg_relation_size(child.oid)) AS size,
  child.reltuples::bigint AS estimated_rows,
  pg_stat_get_last_vacuum_time(child.oid) AS last_vacuum
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child ON pg_inherits.inhrelid = child.oid
WHERE parent.relname = 'events'
ORDER BY pg_relation_size(child.oid) DESC;

-- 2. Partition pruning hit rate
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM events
WHERE event_time BETWEEN '2026-05-01' AND '2026-05-15';
-- Expected: 15 small partitions scanned (target: daily) vs 1 partition (current: monthly)
-- Observed: under the monthly design the planner scans the whole month (~500GB) even for a 15-day query

-- 3. Find partition imbalance
SELECT
  to_char(event_time, 'YYYY-MM') AS month,
  count(*) AS row_count
FROM events
GROUP BY 1
ORDER BY 2 DESC;
-- Find hot / cold months and estimate the distribution after the redesign

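Output of the pre-layout stage:

Current topology quantified: 36 monthly partitions, 1.8TB total, largest partition 500GB, smallest 50GB
Hot data distribution: 80% of traffic hits the most recent 3 months
Redesign target: daily partitions, tiered by age: hot daily for the most recent 3 months / cold weekly beyond 3 months / monthly beyond 1 year
Migration scope: the 1095 partitions are not all created up front; they are built in stages according to the retention policy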
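The tiered target can be written down as a small rule that maps the age of a day to its partition granularity, which is handy when generating or auditing partition names. The sketch below assumes "3 months" means 90 days and "1 year" means 365 days; both thresholds and the function name are illustrative.

```javascript
// Partition granularity for data that is `ageDays` old.
function granularityForAge(ageDays) {
  if (!(ageDays >= 0)) throw new RangeError("ageDays must be >= 0");
  if (ageDays <= 90) return "daily";    // hot: most recent 3 months
  if (ageDays <= 365) return "weekly";  // cold: 3 months to 1 year
  return "monthly";                     // archive: older than 1 year
}

for (const age of [10, 200, 400]) {
  console.log(age, granularityForAge(age));
}
```

Keeping the thresholds in one place makes it easier to check that partition creation and retention jobs agree on where each tier starts.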

Re-layout mechanism: online re-layout with ATTACH / DETACH

PostgreSQL cannot change a table's partition strategy in place; the path is a new partition tree plus data movement:

-- 1. Create the new daily-partitioned table (parallel to events)
CREATE TABLE events_daily (
  id bigint,
  event_time timestamptz NOT NULL,
  payload jsonb
) PARTITION BY RANGE (event_time);

-- 2. Pre-create daily partitions for the next 90 days
-- (this SELECT only generates the DDL text; in psql, end it with \gexec instead of ; to run it)
SELECT
  format(
    'CREATE TABLE events_daily_%s PARTITION OF events_daily FOR VALUES FROM (%L) TO (%L)',
    to_char(d, 'YYYY_MM_DD'), d, d + interval '1 day'
  )
FROM generate_series(current_date, current_date + interval '90 days', interval '1 day') AS d;

-- 3. Dual-write phase: every write goes to both events and events_daily
-- (via trigger or application-side)
CREATE OR REPLACE FUNCTION dual_write_events() RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO events_daily VALUES (NEW.*);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_dual_write
AFTER INSERT ON events
FOR EACH ROW EXECUTE FUNCTION dual_write_events();

-- 4. Backfill historical data one partition at a time
-- (only rows written before the trigger was enabled, otherwise they are copied twice)
INSERT INTO events_daily
SELECT * FROM events
WHERE event_time >= '2026-05-01' AND event_time < '2026-05-02';
-- ... one day partition per run, to avoid a long transaction

-- 5. Cutover: rename swap
BEGIN;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_daily RENAME TO events;
DROP TRIGGER events_dual_write ON events_old;
COMMIT;

-- 6. Observe for 1-2 weeks, then DROP events_old

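Key point: the rename swap is a single transaction, so the cutover is instantaneous. Application connections do not need to reconnect, but prepared statement caches may need refreshing.

Execution flow per-step

Five steps, each with a rollback boundary:

Step	Action	Rollback boundary
1 Pre-create partitions	Create events_daily + 90 days of partitions; no production impact	DROP events_daily, no impact
2 Dual-write	Add the trigger to write both sides, observe the diff	DROP trigger; events_daily is left for cleanup
3 Backfill	Backfill historical data day by day, using CHECK constraints to ensure integrity	DROP the backfilled partitions; source events is unaffected
4 Verify	Run sample queries against events vs events_daily and confirm row counts match	Still in dual-write; cutover can be paused if a diff is found
5 Cutover	Rename swap	Hard to reverse: rollback needs a reverse rename + dual-write restart

Step 5 is the hard-to-reverse boundary. Schedule it in a low-traffic maintenance window, and take a backup checkpoint before the cutover.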

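Production failure drills

Case 1: A long transaction during backfill blocks vacuum

Symptom: the backfill runs a 6-hour INSERT INTO events_daily SELECT * FROM events WHERE ...; during that time autovacuum on events reclaims nothing, dead tuples pile up, and production queries slow down.

Root cause: an open PostgreSQL transaction pins the xmin horizon, and vacuum can only reclaim dead tuples that no active transaction can still see. A long backfill is a long open transaction, so vacuum is ineffective.

Fix:

Split the INSERT into batches: break each day's backfill into small batches (100K rows per transaction); every commit releases xmin
Use COPY instead of INSERT: stream COPY (SELECT * FROM events WHERE ...) TO STDOUT into COPY events_daily FROM STDIN. COPY is PostgreSQL's fastest bulk-load path, so each batch holds its transaction open for less time
Run the backfill off a standby: pull the source rows from a standby (for example through logical replication) instead of running long transactions on the primary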
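A simple way to get short transactions is to drive the backfill from the application with one bounded window per transaction. The sketch below generates those windows; it substitutes fixed time windows for the 100K-row batches mentioned above (row-count batching needs a keyset cursor instead), and the function name is hypothetical. Each window would become one INSERT ... WHERE event_time >= from AND event_time < to, committed on its own.

```javascript
// Split one UTC day into half-open [from, to) windows of `hoursPerBatch` hours.
function backfillWindows(day, hoursPerBatch = 1) {
  const start = Date.parse(`${day}T00:00:00Z`);
  if (Number.isNaN(start)) throw new RangeError(`invalid day: ${day}`);
  if (!(hoursPerBatch > 0)) throw new RangeError("hoursPerBatch must be positive");
  const stepMs = hoursPerBatch * 3600 * 1000;
  const end = start + 24 * 3600 * 1000;
  const windows = [];
  for (let t = start; t < end; t += stepMs) {
    windows.push({
      from: new Date(t).toISOString(),
      to: new Date(Math.min(t + stepMs, end)).toISOString(),
    });
  }
  return windows;
}

const windows = backfillWindows("2026-05-01");
console.log(windows.length, windows[0], windows[windows.length - 1]);
```

Half-open windows guarantee that a row on a boundary is copied exactly once, and a failed window can be retried without touching its neighbours.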

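Case 2: Trigger dual-write adds latency for the application

Symptom: after the trigger is added, application write latency p99 rises from 5ms to 25-50ms; high-throughput batch jobs time out.

Root cause: every INSERT fires the trigger function, which runs another INSERT into events_daily. IO doubles, and so does index maintenance.

Fix:

Switch to application-side dual-write: the application code writes both sides explicitly and uses connection-pool batching to spread the IO
Use a logical replication slot: feed events → events_daily through logical replication instead of the trigger to reduce the IO impact
Minimize the dual-write period: keep the trigger enabled only from backfill through verify, and remove it at cutover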

Case 3: Partition pruning misses and the planner still scans every partition

Symptom: after cutover, some application queries go from 200ms to 5000ms; EXPLAIN shows all 1095 partitions being scanned under the Append node.

Root cause: with 1000+ partitions, planning time grows for some queries (including prepared statements that carry no bound on the partition key); or the query uses WHERE event_time = some_function(now()), which does not trigger planning-time pruning.

Fix:

enable_partition_pruning = on is the default; confirm it has not been disabled
PG 11+ runtime pruning: when a prepared statement uses a generic plan, runtime pruning fills the gap
Tiered partition strategy: 1095 daily partitions is too many; switch to a mix of daily for the most recent 90 days and monthly before that to cut the partition count
Planner statistics: run ANALYZE to rebuild statistics; with a large partition tree the planner needs fresh stats

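Case 4: Constraint exclusion fails and cross-partition uniqueness is not enforced

Symptom: after cutover, the same event of a user turns up in several partitions; the unique constraint on (user_id, event_id) was not enforced, and the data audit catches duplicates.

Root cause: a UNIQUE constraint on a PostgreSQL partitioned table must include the partition key, so UNIQUE (user_id, event_id) becomes UNIQUE (user_id, event_id, event_time). That constraint only rejects rows where all three values match; two rows with the same (user_id, event_id) but different event_time values are accepted. Any deduplication the team expected at month scope was never guaranteed by the constraint, and once the data is spread over daily partitions the duplicates land in different partitions and become visible in the audit.

Fix:

Pre-redesign: state explicitly which time scope the uniqueness rule is supposed to cover, and decide whether the constraint's actual scope is acceptable after the redesign
Application-side dedup: enforce cross-partition uniqueness with an application-level lookup (for example a temporary key in Redis written with SETEX)
Fall back to a non-partitioned dedup table: create a separate user_events_dedup table and have the application look it up before writing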
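The application-side dedup option can be illustrated with an in-memory stand-in for the Redis key with TTL. This is a sketch only: the class name is hypothetical, a Map replaces Redis, and a real deployment would need a shared store so that all application instances see the same keys.

```javascript
// Remembers (userId, eventId) pairs for ttlMs and reports repeats.
class DedupWindow {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;          // injectable clock, for tests
    this.seen = new Map();   // key -> expiry timestamp (ms)
  }

  // Returns true the first time a pair is seen within the TTL.
  firstSeen(userId, eventId) {
    const key = JSON.stringify([userId, eventId]);
    const t = this.now();
    const expiry = this.seen.get(key);
    if (expiry !== undefined && expiry > t) return false;
    this.seen.set(key, t + this.ttlMs);
    return true;
  }
}

const dedup = new DedupWindow(60 * 1000);
console.log(dedup.firstSeen("u1", "e1")); // first time
console.log(dedup.firstSeen("u1", "e1")); // duplicate within the TTL
```

The TTL is the explicit "time scope" of uniqueness that the partition-level constraint cannot express, so it should match the scope agreed on in the pre-redesign step.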

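Case 5: Old partitions are dropped too often and shared_buffers cache misses spike

Symptom: once daily partitions are live, a cron job drops events_2025_05_18 (90 days old) in the early hours every day. After the DROP a large part of shared_buffers is invalidated, and application query latency p99 jumps from 10ms to 100-200ms for 30 minutes.

Root cause: PostgreSQL invalidates every shared_buffers page that belongs to a dropped table. After a large partition (10GB+) is dropped the cache hit rate falls from 99% to 60%, and the application waits on disk IO.

Fix:

Run the DROP off-peak: a 3-4 a.m. cron, away from business peaks
Pre-warm the next partition: before the DROP, use pg_prewarm to load the hot partitions into cache
Switch to DETACH + delayed DROP TABLE: DETACH is fast; queue DROP TABLE into a weekly batch to lower the frequency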

Capacity / cost

Dimension	Monthly partition (current)	Daily partition (target)	Trade-off
Partition count	36 (3-year retention)	1095 (3-year retention)	30x partition count, planner cost rises somewhat
Single partition size	50-500GB	1-20GB	Daily partitions are easier to vacuum
DROP old data	Monthly cadence	Daily cadence	Finer retention control
Query latency	50-200ms when many partitions are touched	5-50ms when few partitions are touched	Most queries are faster with daily
Planning time	5-10ms	50-100ms (for generic plans)	Planning overhead up one order of magnitude
Maintenance window	Vacuuming 1 partition takes 6 hours	Vacuuming 1 partition takes 5-30 minutes	Smaller window, can run daily

Reading: daily partitions suit high traffic + frequent day-range queries + fine-grained retention; very large partitions (TB-scale single days) still need sub-partitioning.

Integration / next steps

Integration with autovacuum tuning

Autovacuum behavior after moving to daily partitions:

Each daily partition is autovacuumed independently; tune scale_factor + threshold per partition
autovacuum_max_workers needs to go from 3 to 6-10 (the partition count explodes)
For cold partitions (> 30 days) set autovacuum_enabled = false so they do not waste CPU

Integration with Patroni HA

Partition migration cannot run during a failover; it must run while the cluster is in a stable state. Re-evaluate partition health after a Patroni promote.

Integration with Logical Replication + Debezium

publish_via_partition_root = true makes the publication present changes from the parent's point of view, so CDC consumers do not need a subscription per partition.

Open topics

Archive strategy across daily partitions: archive to S3 cold storage; daily granularity gives finer retention control
pg_partman extension: creates daily partitions automatically, with no cron; confirm Aurora / RDS support first
Sub-partitioning: when traffic grows further, use two-axis partitioning, "daily by time + list by tenant"
