MongoDB on Tarragon

MongoDB → Atlas: Atlas is not "MongoDB plus managed", it is a different product

Tue, 19 May 2026 00:00:00 +0000

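This article is a cross-vendor migration playbook and cross-links to MongoDB and MongoDB Atlas. It is a worked example of the standard form of Type C (operational redesign hybrid) from the Migration playbook methodology. Each phase transition is guarded by a migration gate: the verification conditions between phases are the gates.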

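Atlas is not "MongoDB plus managed", it is a different product

The framing "MongoDB Atlas is the managed version of MongoDB" looks reasonable but is misleading. What really is the same:

Protocol compatible: the MongoDB wire protocol is identical, drivers are unchanged, and mongosh connects the same way as to a self-managed deployment
Storage identical: same WiredTiger storage engine, same document model
API identical: aggregation framework, indexing, and change streams all behave the same

But the operational surface is completely different: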

Operational concept	Self-managed MongoDB	Atlas
Cluster bootstrap	Manual: mongod + replica set config + config servers + shards	Cluster created through UI / API, fully automated
HA	Self-managed replica set + arbiter + priority	Automatic cross-AZ replicas + automatic failover
Backup	Self-managed mongodump + S3 archive	Built-in cloud backup + PITR (configured per region)
Network access	Self-managed VPC + security group + IP allowlist	Atlas private endpoint / VPC peering / IP access list
Authentication	Self-managed mongod internal users / x.509	Atlas Database User + LDAP / SSO / AWS IAM integration
Monitoring	Self-deployed Prometheus + Grafana	Built-in Atlas Performance Advisor + APM
Sizing	Manual instance class + scaling	Auto-tier scaling + tier-based pricing
Patching	Manual + outage window	Automatic (maintenance window configurable)

The bulk of the migration work is not in the data layer, which the drop-in protocol already covers. It is the wholesale replacement of the operational stack: SRE runbooks, monitoring dashboards, access control, IAM integration, and cost estimates all have to be redone. The "Atlas is managed MongoDB" framing underestimates that operational workload.

Running the diff dimension audit:

Dimension	Assessment	Level
Schema / API	MongoDB protocol / API fully compatible	Low
Operational model	HA / backup / monitoring / IAM / network all replaced	High
Abstraction / paradigm	Same document DB	Low
Number of components	Same single cluster	Low
Application change	Connection string / IAM integration change; application logic unchanged	Low/Medium

The dominant dimension is Operational = High while Schema and Paradigm are both Low, which maps to Type C (operational redesign hybrid).

Structure: 4-phase operational + drop-in cutover

The structure is aligned with PostgreSQL → Aurora (also Type C):

Phase 0: Pre-migration audit (1-2 weeks)
  - Workload sizing (IOPS / connection / storage)
  - Application connection pattern audit
  - Compliance requirement audit

Phase 1: Operational infrastructure preparation (2-3 weeks)
  - Atlas cluster creation
  - VPC peering / private endpoint
  - IAM role + Atlas Database User
  - Monitoring + alert
  - Backup retention settings

Phase 2: Data migration (duration depends on dataset size)
  - mongomirror / Atlas Live Migration tool
  - or mongodump → mongorestore (small DBs)

Phase 3: Cutover and verification

Phase 4: Cleanup (self-managed decommission)

4-12 weeks overall, depending on dataset size and how complex the organization's processes are.

Phase 0: Pre-migration audit

Workload sizing → Atlas tier

Self-managed observations:
- Peak IOPS: 8000
- P99 read latency: 5ms
- Connection count peak: 1500
- Storage: 800GB
- Cross-region replication needed: yes

Atlas tier mapping:
- M40 (4 vCPU, 16GB RAM): IOPS 3000, not enough
- M60 (16 vCPU, 64GB RAM): IOPS 6000, still below the 8000 peak
- M80 (32 vCPU, 128GB RAM): IOPS 9000, safe (chosen)
- Storage: 1TB tier (covers 800GB + 25% buffer)
- Cross-region replication add-on

Atlas does not offer free-form instance classes; it offers fixed tiers. When a workload straddles a tier boundary, pick the next tier up rather than squeezing into the lower one.
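The "pick the next tier up" rule can be written down as a small helper. This is a minimal sketch, not an Atlas API: the `TIERS` table reuses the illustrative IOPS figures from the sizing notes above, and the `pickTier` function and its 10% `headroom` default are assumptions made for the example.

```javascript
// Hypothetical tier table: IOPS values are the illustrative figures from the
// sizing notes, not authoritative Atlas limits. Must be sorted ascending.
const TIERS = [
  { name: 'M40', iops: 3000 },
  { name: 'M60', iops: 6000 },
  { name: 'M80', iops: 9000 },
];

// Return the smallest tier whose IOPS covers the observed peak plus headroom.
// Returns null when no listed tier is large enough.
function pickTier(peakIops, headroom = 0.1, tiers = TIERS) {
  const required = peakIops * (1 + headroom);
  const match = tiers.find((tier) => tier.iops >= required);
  return match ? match.name : null;
}

console.log(pickTier(8000)); // the peak from the observations above
```

Rounding up rather than down is deliberate: a tier that only just meets the measured peak leaves no room for growth or bursts.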

Connection pattern audit

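// Application connection pool config
const client = new MongoClient(uri, {
  maxPoolSize: 100,     // ← counts against the Atlas tier-specific connection limit
  minPoolSize: 10,
  maxIdleTimeMS: 60000,
});

Each Atlas tier caps the number of concurrent connections (the figures used in this article, roughly 1,500 for M40 and 3,000 for M80, are illustrative; confirm the current per-tier limits in the Atlas documentation). Many application instances connecting to the same cluster can hit that limit. Compute total connections = pod_count × maxPoolSize ahead of time and compare it with the tier limit.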
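The check described above is simple arithmetic and is easy to automate in a pre-migration script. A minimal sketch follows; `connectionBudget` and its `reservePct` parameter (headroom kept free for monitoring agents, admin shells, and bursts) are hypothetical names introduced for illustration, and the tier limit must come from the Atlas documentation.

```javascript
// Worst-case connection count versus the tier limit.
// reservePct is an assumed safety margin, expressed as a whole percentage.
function connectionBudget({ podCount, maxPoolSize, tierLimit, reservePct = 20 }) {
  const total = podCount * maxPoolSize;                              // every pool fully open
  const usable = Math.floor((tierLimit * (100 - reservePct)) / 100); // limit minus reserve
  return { total, usable, fits: total <= usable };
}

const budget = connectionBudget({ podCount: 100, maxPoolSize: 50, tierLimit: 3000 });
console.log(budget);
```

When `fits` is false, the options are the ones this playbook lists later: a larger tier, a smaller `maxPoolSize`, or a pooling layer in front of the cluster.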

Compliance audit

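Data residency: whether the Atlas deployment region satisfies GDPR / customer contracts
Encryption at rest: enabled by default in Atlas, but the encryption key is Atlas-managed; strict compliance regimes require CMK / BYOK
Audit log: Atlas provides audit logs, which can be exported to S3 / Splunk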

Phase 1: Operational infrastructure preparation

Atlas cluster configuration

# Uses the Terraform mongodbatlas provider
resource "mongodbatlas_cluster" "production" {
  project_id   = var.project_id
  name         = "production-cluster"
  cluster_type = "REPLICASET"

  provider_name               = "AWS"
  provider_region_name        = "US_EAST_1"
  provider_instance_size_name = "M80"

  cloud_backup           = true   # Cloud Backup; backup_enabled refers to the legacy backup service
  pit_enabled            = true   # PITR (requires cloud_backup)
  mongo_db_major_version = "7.0"

  advanced_configuration {
    javascript_enabled           = false
    minimum_enabled_tls_protocol = "TLS1_2"
    no_table_scan                = false
    oplog_size_mb                = 51200
  }
}

# Backup retention
resource "mongodbatlas_cloud_backup_schedule" "production" {
  project_id   = var.project_id
  cluster_name = mongodbatlas_cluster.production.name

  reference_hour_of_day    = 3
  reference_minute_of_hour = 0
  restore_window_days      = 7

  policy_item_daily {
    frequency_interval = 1
    retention_unit     = "days"
    retention_value    = 7
  }
}

VPC peering / private endpoint

Pattern A: VPC Peering
  AWS VPC <──peering──> Atlas project VPC
  - Works across regions; routing tables must be aligned
  - Suits medium / large workloads with a stable network topology

Pattern B: Private Endpoint (Atlas private link)
  AWS VPC ──private link──> Atlas
  - No routing table changes needed
  - Suits complex multi-account / multi-region setups
  - Slightly higher cost

Default to Private Endpoint in production: it is simpler to set up and integrates well with IAM.

Atlas Database User and IAM integration

Pattern A: traditional username / password
  - Create a Database User; the application connects with SCRAM-SHA-256
  - Suits legacy applications

Pattern B: AWS IAM authentication (recommended)
  - Atlas Database User type: "AWS IAM"
  - Application uses an AWS IAM role + the driver's MONGODB-AWS mechanism
  - Token rotates every 15 minutes; refresh must be handled on the application side

Put the IAM authentication migration into the cutover schedule; do not bolt it on afterwards.

Phase 2: Data migration

Atlas Live Migration tool (small to medium)

The Atlas UI has a built-in Live Migration tool:

Source cluster URI (self-managed MongoDB)
Atlas target cluster
The tool automatically runs a full sync + oplog tailing
Final cutover happens inside the cutover window

Datasets under 100GB are straightforward; 100GB-1TB needs batching / a designed collection order.

mongomirror (large)

# mongomirror: source → Atlas
mongomirror \
  --host source-replicaset/host1:27017,host2:27017 \
  --destination atlas-cluster-host:27017 \
  --destinationUsername admin \
  --destinationPassword "$ATLAS_PASSWORD" \
  --ssl

mongomirror runs in two stages:

Initial sync (full dump + restore)
Oplog tailing (continuous CDC)

During cutover the application switches its connection string while mongomirror keeps streaming until the tail is drained.

Phase 3: Cutover + verification

1. Put the application into maintenance mode (block writes)
2. Wait for mongomirror to catch up (oplog gap → 0)
3. Verify collection counts + sample queries on the Atlas side
4. Switch the application connection string to Atlas
5. Lift maintenance mode and monitor for 24-48 hours
6. Keep the self-managed MongoDB as a read-only standby for 1-2 weeks
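Step 3 is the gate that decides whether the connection string may be switched, so it helps to make the count comparison mechanical. The sketch below assumes the per-collection counts have already been gathered on both sides (for example with `countDocuments` while writes are blocked); `diffCollectionCounts` and the shape of its inputs are illustrative assumptions, not part of any MongoDB tool.

```javascript
// Compare per-collection document counts from the source and the Atlas target.
// Both arguments are plain objects of the form { collectionName: count }.
function diffCollectionCounts(sourceCounts, targetCounts) {
  const mismatches = [];
  for (const [name, expected] of Object.entries(sourceCounts)) {
    const actual = targetCounts[name] ?? 0; // a missing collection counts as 0
    if (actual !== expected) {
      mismatches.push({ name, expected, actual });
    }
  }
  return mismatches; // empty array: the count gate passes
}

const mismatches = diffCollectionCounts(
  { orders: 1200, users: 300 },
  { orders: 1200, users: 299 },
);
console.log(mismatches);
```

Matching counts are necessary but not sufficient, which is why the step also calls for sample queries.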

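Production failure drills

Case 1: hitting the Atlas tier connection limit

Symptom: after cutover, the application sees a flood of "Connection refused" errors at peak traffic and Atlas reports that the connection limit has been reached; this never happened in the self-managed phase.

Root cause: the M80 tier connection limit is ~3000, while the application runs 100 pods × maxPoolSize=50 = 5000 connections, which exceeds the limit.

Fix:

Pre-migration calculation: compare total connections with the Atlas tier limit and pick the next tier up if it is exceeded
Lower maxPoolSize: 100 pods × 30 = 3000 sits exactly at the cap, so bursts can still hit it
Add a connection proxy: put a connection pooler between the application and Atlas (for example a mongos layer on a sharded cluster, or a ProxySQL-style proxy)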

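Case 2: application VPC missing from the IP access list, nothing connects after cutover

Symptom: after cutover the application immediately reports connection timeouts and the Atlas dashboard shows zero traffic; it took an hour of troubleshooting to discover that the IP access list was missing one application VPC CIDR.

Root cause: the Atlas IP access list denies everything by default, so every application VPC must be added explicitly; the Phase 1 setup overlooked one VPC (for example a staging account inside a multi-account organization).

Fix:

Pre-cutover connection test: run a sample MongoDB connection from every application VPC and confirm that ping succeeds
Switch to Private Endpoint: no reliance on the IP access list; PrivateLink handles routing automatically
Backup access: keep a bastion host with an allowlisted IP so you can connect directly during an incident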
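Alongside the live connection test, the access list can be checked offline against an inventory of application VPC CIDRs. The following is a minimal IPv4-only sketch; the function names and the idea of feeding it an exported access list are assumptions for illustration. It is deliberately conservative: a CIDR that is only covered by the union of several entries is still reported.

```javascript
// Convert a dotted IPv4 address to an integer.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// Turn "a.b.c.d/n" (or a bare address) into an inclusive integer range.
function parseCidr(cidr) {
  const [ip, bits] = cidr.split('/');
  const prefix = bits === undefined ? 32 : Number(bits);
  const size = 2 ** (32 - prefix);
  const start = Math.floor(ipToInt(ip) / size) * size;
  return { start, end: start + size - 1 };
}

// Return the application CIDRs not fully contained in any single access-list entry.
function uncoveredCidrs(appCidrs, accessList) {
  const allowed = accessList.map(parseCidr);
  return appCidrs.filter((cidr) => {
    const range = parseCidr(cidr);
    return !allowed.some((a) => a.start <= range.start && range.end <= a.end);
  });
}

console.log(uncoveredCidrs(['10.0.0.0/16', '10.2.0.0/16'], ['10.0.0.0/15']));
```

A non-empty result before cutover is exactly the failure described in this case, caught an hour earlier.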

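Case 3: backup retention set too short, caught by a compliance audit

Symptom: three months after cutover, a SOX audit finds backup retention set to 7 days while compliance requires 90; the Atlas config is hurriedly changed to 90 days, but backups from the past three months are already unrecoverable.

Root cause: Atlas backup retention only applies going forward and cannot be extended retroactively; the Phase 1 default configuration never went through compliance review.

Fix:

Run a compliance review before Phase 1: confirm retention / data residency / audit log requirements with the legal and security teams
Set the default retention conservatively (30 / 60 days): it can be lowered later, but raising it does not bring back snapshots that have already expired
Configure PITR and backup retention separately: PITR window 7-30 days, full backups 90-365 days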

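Case 4: IAM token expiry causes an application reconnect storm

Symptom: after production switches to IAM authentication, a wave of connection failures appears every 15 minutes; the Atlas log shows "auth token expired".

Root cause: the AWS IAM token rotates every 15 minutes and the application reconnects with the stale token; the token refresh logic was written incorrectly.

Fix:

// Let the MongoDB driver and the AWS SDK resolve credentials automatically
const { MongoClient } = require('mongodb');

// With MONGODB-AWS and no credentials in the URI, the driver resolves them from
// the environment; when the optional @aws-sdk/credential-providers package is
// installed it uses the AWS default provider chain, so no application-side
// token refresh code is needed.
const uri = process.env.MONGODB_URI;
const client = new MongoClient(uri, {
  authMechanism: 'MONGODB-AWS',
});

Do not hand-roll token rotation; let the vendor SDK abstract it away.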

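Case 5: billing spike, IOPS and backup storage over estimate

Symptom: the first-month Atlas bill is USD 15K against an estimate of USD 8K; the Atlas dashboard shows backup storage and IOPS each 1.5-2x over the estimate.

Root cause:

Atlas backup defaulted to cross-region replication, doubling storage cost
An IOPS-heavy workload can exhaust burst credits within an M tier, and auto-scaling temporarily moves it to a more expensive tier
Cross-region / cross-cloud data transfer charges were not accounted for

Fix:

Pre-migration cost estimate: use self-managed metrics to estimate IOPS / bandwidth and apply Atlas pricing
Single backup region: if cross-region DR is not needed, same-region backup saves 50%
Reserved Instance: prepay 1-3 years for stable workloads and save 30-40%
Use Performance Advisor early: run it in the first week to find inefficient queries and lower IOPS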

Capacity / cost

Dimension	Self-managed MongoDB	Atlas
Cluster cost (M80)	EC2 r6g.4xlarge × 3 ≈ $1.5K / mo	M80 + storage + backup ≈ $3K / mo
Operational FTE	0.5-1.5 FTE	0.1-0.3 FTE
Backup cost	Self-managed S3 + tooling	Built-in + tiered storage
Cross-region DR cost	Manual + 2x infrastructure	1-click + 1.5-2x billing
Time to value	1-3 months (HA + ops setup)	1-2 weeks (cluster ready + IAM)
Migration cost	-	1-3 FTE × 2-3 months

Break-even: at around 200GB / a medium workload, Atlas becomes cheaper than self-managed once its operational savings are amortized over 1-2 years; for TB+ workloads self-managed may still be cheaper, but it requires an ops team.

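Integration / next steps

Comparison with the PostgreSQL → Aurora migration

Both articles are Type C (operational redesign hybrid); they share a template and differ in the details:

On Aurora, RDS Proxy is the recommended practice; on Atlas, Private Endpoint is the more standard choice
On Aurora, IAM authentication is an optional best practice; on Atlas, IAM is the recommended default
Both cost models are complex, and I/O cost is the main source of surprises

Integration with application-side IAM token rotation

Vault dynamic credentials can issue Atlas Database User credentials with a lease lifecycle aligned to the application; this is good practice for high-stakes workloads, but the setup is complex.

Next topics

Atlas Data Federation: query S3 / across regions across Atlas clusters; evaluate this feature if you go multi-region
Atlas Online Archive: automatically archives cold data to S3 while keeping queries transparent; saves storage cost for retention-heavy workloads
Atlas Serverless: suits bursty workloads, not cost-effective for steady ones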

MongoDB Shard Expansion + Multi-DC: the multi-region exception to Type F's "no parallel run needed"

Tue, 19 May 2026 00:00:00 +0000

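This article is an implementation-layer deep article under the MongoDB overview. It is the third dogfood of #128 Type F "Topology re-layout", and it specifically tests the multi-region rollout exception to the "no parallel run needed" claim in point 3 of the self-aware limitations: this article is the concrete counterexample.

Reviewer D's challenge: does Type F really never need a parallel run?

Point 3 of the #128 self-aware limitations concedes:

The "no parallel run needed" claim partly fails: a multi-region rollout (listed by #128 as a Type F scenario) must run in parallel, with both regions serving at once and traffic then shifted; otherwise it is a stop-the-world switch. The mechanism is the same as Type A phase 3.

This article is direct evidence for that concession: taking a MongoDB sharded cluster from single-DC to more shards plus a secondary DC does require a parallel run + traffic shift, and is locally isomorphic to a Type A phased migration: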

Type F assumption	Single-DC re-sharding (Redis case)	Multi-DC expansion (this article)
Same cluster, different state	yes	yes (same MongoDB cluster)
No schema translation	yes	yes
No parallel run	yes (slot migration completes internally)	no: both DCs run together, then traffic shifts
No cleanup phase	yes	partial (old DC role demoted to standby)
Step-by-step + rollback boundary	yes	yes

→ The Type F anatomy still applies, but "no parallel run needed" is a sub-scenario condition, not a universal claim.

Two operations combined: adding shards + adding a DC

In practice, mid-sized companies often run two topology changes at once:

Shard expansion: grow the existing 3-shard cluster to 5 shards, with chunk migration evening out the distribution
Multi-DC: go from single-DC (us-east-1) to multi-DC (us-east-1 + us-west-2)

Diff dimension audit for the two operations:

Dimension	Adding shards (alone)	Multi-DC (alone)	Both together
Schema / API	Low	Low	Low
Operational model	Low	Medium (cross-DC ops)	Medium
Paradigm	Low	Low	Low
Components	Low (more shards, same cluster)	Low	Low
Application change	Low	Low-Medium (cross-DC latency aware)	Low-Medium
Data topology	High (sharding strategy)	High (replication + region)	High (two changes, compound topology)

Both are dominated by topology = High, so the combination falls under the Type F multi-axis sub-scenario.

Pre-layout analysis: current + target topology

// 1. Current shard distribution (run against mongos)
sh.status();
// Expected output: 3 shards, each with ~33% of chunks, no migration in progress

db.printShardingStatus();
// Look for hot shards and imbalanced chunk distribution

// 2. Replication topology (run against each shard's replica set)
rs.status();
// Primary/secondary health and replication lag for each replica set

// 3. Cross-DC network baseline (measure before adding the DC)
// us-east-1 → us-west-2 RTT, bandwidth

Output of the pre-layout stage:

Current: 3 shards × 1 replica set per shard (3 members) = 9 nodes, all in us-east-1
Target: 5 shards × 1 replica set per shard (5 members: 3 us-east + 2 us-west) = 25 nodes
Migration scope: add 2 shards + add 2 DC members per shard, +16 nodes in total
Chunk migration estimate: about 40% of chunks must move (from 33% × 3 to 20% × 5; the two new shards end up holding 20% each)
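The chunk migration estimate follows from one line of arithmetic: if chunks are evenly balanced before and after, the new shards end up holding (target − current) / target of the data, and all of that has to be moved onto them. A minimal sketch, with `chunkShareToMove` as a hypothetical helper name and even balance as the stated assumption:

```javascript
// Fraction of all chunks that must migrate when growing an evenly balanced
// cluster from currentShards to targetShards (also evenly balanced).
function chunkShareToMove(currentShards, targetShards) {
  if (targetShards <= currentShards) return 0; // shrinking is out of scope here
  return (targetShards - currentShards) / targetShards;
}

console.log(chunkShareToMove(3, 5)); // 3 → 5 shards: two new shards at 20% each
```

Multiplying this share by the collection size gives a first estimate of the data volume the balancer will push over the network.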

Re-layout mechanism

The two mechanisms run in parallel:

Shard expansion mechanism

// 1. Add the new shards to the cluster
sh.addShard("rs-shard4/host10:27017,host11:27017,host12:27017");
sh.addShard("rs-shard5/host13:27017,host14:27017,host15:27017");

// 2. The balancer migrates chunks automatically
sh.startBalancer();
// Watch progress: db.adminCommand({balancerStatus: 1})

// 3. Verify shard distribution when done
sh.status();

Chunk migration is a background job throttled by the balancer; it does not block production queries, but CPU / network usage rises 30-50%.

Multi-DC expansion mechanism

// 1. Add a us-west-2 member (priority 0) to each shard's replica set
rs.add({
  host: "us-west-2-host:27017",
  priority: 0,                     // cannot become primary
  votes: 1,                        // participates in elections
  hidden: false,
  tags: { region: "us-west-2" }    // needed for readPreferenceTags in step 4 to match
});

// 2. Wait for initial sync to finish (1 hour to 1 day depending on data volume)
rs.printSecondaryReplicationInfo();

// 3. Once the secondary is healthy, raise priority or votes
// Do not set priority 1 right away, to avoid an unintended failover

// 4. Cross-DC routing is set in the application through readPreference
const client = new MongoClient(uri, {
  readPreference: 'secondaryPreferred',
  readPreferenceTags: [{ region: 'us-west-2' }, {}],
});

Key point: multi-DC means adding members incrementally, not an atomic switch; each shard is extended independently, so when done one shard at a time the total duration = number of shards × initial sync time.

Execution flow (with parallel run + traffic shift)

Eight steps, including a parallel run + traffic shift segment, which tests point 3 of the #128 self-aware limitations:

Step	Action	Parallel run?	Rollback boundary
1 Pre-check	Quantify current topology, confirm cluster health	no	-
2 Add us-east shards	sh.addShard, balancer migrates chunks	no (inside the cluster)	removeShard, migrate chunks back
3 Add us-west members	rs.add a cross-DC member to each shard	no	rs.remove, initial sync work is discarded
4 Initial sync wait	Wait for all us-west members to catch up	parallel run starts: both DCs serve at once	-
5 Cross-DC dual-serve	Both DCs serve read traffic (writes not switched)	yes, parallel run: app uses secondaryPreferred on us-west	switch readPref back to the us-east primary
6 Traffic shift	Application traffic in us-west reads from us-west	yes	switch DNS / readPref back
7 Promote us-west (optional)	Raise the us-west member's priority to 1 on one shard	post-cutover	demote priority back to 0
8 Cleanup	Verify, archive logs, document new topology	no	-

Steps 4-6 are the parallel run + traffic shift. Type F has this exception, and it is isomorphic to the Type A phase 3 mechanism; the "Execution flow per-step" section of the anatomy must include a parallel run subsection.

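Production failure drills

Case 1: balancer chunk migration collides with production peak

Symptom: after shards are added the balancer starts migrating chunks, and production write latency p99 jumps from 10ms to 100ms; the application sees many timeouts.

Root cause: the MongoDB balancer runs 24×7 by default, and a chunk migration blocks writes to that chunk while it holds the migration lock; the balancer does not pause itself during production peak hours.

Fix:

// Restrict the balancer to a low-traffic window (run against mongos;
// the settings collection lives in the config database)
sh.setBalancerState(true);
db.getSiblingDB("config").settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },
  { upsert: true }
);

Also set a smaller chunkSize (128MB → 64MB) so that migration proceeds in finer steps and each lock is held for less time.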

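Case 2: oplog window overrun during cross-DC initial sync

Symptom: after a us-west member is added, initial sync runs for 4 hours and the member then reports "too stale to catch up", requiring a full re-sync.

Root cause: the MongoDB oplog is a capped collection whose default size is 5% of free disk space; during the 4-hour initial sync the primary wrote more than the oplog could retain, so the oplog start point the member needed had already been overwritten.

Fix:

Grow the oplog beforehand: db.adminCommand({replSetResizeOplog: 1, size: 51200}) raises it to 50GB, enough to cover the sync window
Off-peak initial sync: run it during low traffic, when the oplog fills more slowly
Manual initial sync via snapshot: seed the new member from a filesystem snapshot or copied data files of an existing member, which shortens the oplog catch-up to the time elapsed since the snapshot (a mongodump backup cannot be used to seed a replica set member)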

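Case 3: cross-DC reads routed wrongly, stale data hurts the business

Symptom: after traffic shifts to us-west, the application occasionally reads data that is 5-30 seconds stale; customers report "I just changed a setting, and after a refresh it changed back".

Root cause: the us-west members are secondaries with 5-30 seconds of replication lag; the application sets readPreference to secondaryPreferred without maxStalenessSeconds, so it can read from a badly stale member.

Fix:

const client = new MongoClient(uri, {
  readPreference: 'secondaryPreferred',
  readPreferenceTags: [{ region: 'us-west-2' }, {}],
  maxStalenessSeconds: 90,  // cap staleness at 90 seconds (the minimum allowed value)
});

// Force primary for strict-consistency cases
const client_strict = new MongoClient(uri, {
  readPreference: 'primary',  // always read from the us-east primary
});

Application-level read patterns must distinguish "accepts stale reads" from "requires fresh reads"; this is not one cluster-wide setting. Because maxStalenessSeconds cannot go below 90, read-your-own-write paths such as the settings page above belong on the primary client.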
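One way to keep the "accepts stale" versus "requires fresh" distinction explicit is to derive client options from a named requirement instead of scattering option literals. This is a minimal sketch: `readOptionsFor`, the `'fresh'` / `'stale-ok'` labels, and the `region` tag name are assumptions for illustration; the option names themselves are the ones used in the fix above.

```javascript
// Map a read requirement to MongoClient read options.
// 'fresh'    → always read from the primary
// 'stale-ok' → prefer a secondary in the local region, bounded by maxStalenessSeconds
function readOptionsFor(requirement, localRegion) {
  if (requirement === 'fresh') {
    return { readPreference: 'primary' };
  }
  if (requirement === 'stale-ok') {
    return {
      readPreference: 'secondaryPreferred',
      readPreferenceTags: [{ region: localRegion }, {}], // {} = fall back to any member
      maxStalenessSeconds: 90, // the minimum value the setting accepts
    };
  }
  throw new Error(`unknown read requirement: ${requirement}`);
}

console.log(readOptionsFor('stale-ok', 'us-west-2'));
```

The result can be spread into the options of two separate clients, mirroring the `client` / `client_strict` pair above, so each code path states which guarantee it relies on.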

Case 4: no shard tag-aware routing, cross-DC traffic blows up cost

Symptom: after a month of multi-DC operation, AWS egress cost rises from $500 / month to $8000 / month; 99% of traffic still crosses DCs from us-east to us-west.

Root cause: the sharded cluster has no zone sharding, the application does not know which chunks live in which DC, every query goes to the us-east primary by default, and cross-DC bandwidth explodes.

Fix:

// Note: zone API (MongoDB 3.4+); the older sh.addShardTag / sh.addTagRange are deprecated
// Use sh.addShardToZone / sh.updateZoneKeyRange instead

// 1. Assign shards to zones
sh.addShardToZone("rs-shard1", "us-east");
sh.addShardToZone("rs-shard2", "us-east");
sh.addShardToZone("rs-shard3", "us-east");
sh.addShardToZone("rs-shard4", "us-west");
sh.addShardToZone("rs-shard5", "us-west");

// 2. Add zone ranges for the collection (the shard key must start with region)
sh.updateZoneKeyRange(
  "myapp.events",
  { region: "us-east", _id: MinKey },
  { region: "us-east", _id: MaxKey },
  "us-east"
);
sh.updateZoneKeyRange(
  "myapp.events",
  { region: "us-west", _id: MinKey },
  { region: "us-west", _id: MaxKey },
  "us-west"
);

// 3. The balancer redistributes chunks into the matching zones

Zone sharding is a required part of a multi-DC design; without it you pay egress cost for nothing.

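Case 5: cross-DC primary switch after failover, application connections drop

Symptom: six months into production, us-east-1 has an outage and one shard's primary moves to a us-west member; the application sees a burst of connection errors for 5-10 seconds.

Root cause: the replica set election timeout defaults to 10 seconds (electionTimeoutMillis), and the application had no server selection retry configured; clients did not reconnect while the primary was switching.

Fix:

const client = new MongoClient(uri, {
  serverSelectionTimeoutMS: 30000,    // wait up to 30 seconds for the election
  retryWrites: true,
  retryReads: true,
  heartbeatFrequencyMS: 5000,         // detect topology changes more often
});

In addition, a multi-DC replica set should use priority asymmetry: us-east members at priority 2, us-west at priority 1, so there is no switch in normal operation and an automatic switch in a disaster.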
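The priority asymmetry can be applied by rewriting the replica set configuration before passing it to `rs.reconfig`. The helper below is a minimal sketch that works on a plain config object; `withRegionPriorities` is a hypothetical name, and it assumes each member carries a `tags.region` value, which is a convention of this example rather than something MongoDB sets for you.

```javascript
// Return a copy of a replica set config with member priorities set per region.
// Members without a matching tags.region keep their current priority.
function withRegionPriorities(config, priorities) {
  return {
    ...config,
    members: config.members.map((member) => ({
      ...member,
      priority: priorities[member.tags?.region] ?? member.priority,
    })),
  };
}

// Example with an assumed two-region config shape.
const current = {
  _id: 'rs-shard1',
  members: [
    { _id: 0, host: 'east-1:27017', priority: 1, tags: { region: 'us-east-1' } },
    { _id: 1, host: 'west-1:27017', priority: 0, tags: { region: 'us-west-2' } },
  ],
};
console.log(withRegionPriorities(current, { 'us-east-1': 2, 'us-west-2': 1 }));
```

In mongosh the same function could be used as `rs.reconfig(withRegionPriorities(rs.conf(), { 'us-east-1': 2, 'us-west-2': 1 }))`, run once per shard against that shard's primary.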

Capacity / cost

Dimension	Single-DC 3-shard	Multi-DC 5-shard	Trade-off
Node count	9	25	~3x infrastructure cost
Storage redundancy	3 replicas	5 replicas (3 east + 2 west)	+2 copies, storage cost +66%
Network egress	Inside the VPC, low	Cross-DC, high (needs zone sharding)	$500 → $8000 / month if no zone sharding
Latency p99 (write)	5-10ms	5-15ms (primary still in us-east)	Slightly higher
Latency p99 (read)	5-10ms	2-5ms (local DC)	Regional reads get faster with multi-DC
Disaster recovery	RTO 30 minutes (rebuild)	RTO < 1 minute (auto failover)	Markedly better
Operational complexity	Low	High (zone sharding / DR drill)	+1 SRE FTE for maintenance

Reading: multi-DC is a DR investment, not a cost optimization; it is only worth it when the availability SLA is above 99.9% or compliance requires it.

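Integration / next steps

Relation to the MongoDB → Atlas migration

Self-managed multi-DC is highly complex, whereas Atlas reduces multi-cluster + cross-region to UI configuration; if you are heading to multi-DC, consider migrating straight to Atlas.

Integration with application read patterns

Zone sharding + readPreference are tightly coupled to application logic; they cannot be bolted on afterwards, so region-aware routing on the application side should be designed during the multi-DC design stage.

Comparison with Cassandra keyspace re-balance

Cassandra is another typical Type F multi-DC case; it uses NetworkTopologyStrategy + a replication factor per DC, which is conceptually equivalent to MongoDB zone sharding but mechanically entirely different. Reviewer D listed Cassandra as the Type F counterexample; this article validates the point with MongoDB instead.

Next topics

Cross-region active-active: MongoDB does not support multi-primary, so cross-region active-active requires application-level conflict resolution
Comparison with PostgreSQL Citus / CockroachDB multi-region: distributed SQL takes a different approach to multi-region
Cost optimization: cross-DC egress is a long-term concern; even with zone sharding in place it needs a quarterly review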

MongoDB Schema Design Pattern: where the contract layer lives vs embedded / reference

Wed, 27 May 2026 00:00:00 +0000

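Introductory discussions of MongoDB schema design often stop at "embedded vs reference, pick one". Real production problems go well beyond that: the schema flexibility of the document model is a dividend at first, but after half a year in production a single collection holds three generations of schema, and application code carries three layers of if-else to handle missing fields and type drift. At that point the question is not "embed or reference" but who guards the schema contract, and at which layer. This article splits the question into three contract-layer paths (DB-layer validator / app-layer abstraction / hybrid) and discusses them together with the embedded / reference / polymorphic mechanisms and the boundary of time-series collections.

This article does not repeat the applicability conditions of the document model already covered in the MongoDB vendor overview; it is an implementation-layer tutorial on production deployment, schema governance, and failure repair.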

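Problem context: the recoil of document freedom

Three things must be confirmed before judging whether MongoDB fits:

Whether document shape dominates the data: sensor signals / CMS articles / order aggregates, where the shape is inherently polymorphic and evolves with the product, suit the document model; fixed access patterns with settled fields belong in a KV system or SQL instead
Where the contract layer should sit: a DB-layer validator suits stable schemas / collections shared across services; an app-layer abstraction suits fast-evolving schemas / microservices with independent owners; the hybrid suits large production systems
Whether cross-cloud hedging is needed: if the team's future cloud vendor strategy is uncertain, Atlas's cross-cloud support is a selection signal; if you run on a single cloud there is no reason to pay extra for hedging

Once MongoDB is confirmed as the right choice, the symptoms readers actually hit in production are:

The document model's early schema-less dividend gives way, after half a year, to a collection mixing three generations of schema, with the application writing if-else for missing fields and type drift
Subdocuments nest deeper and deeper, a single document passes 1-2MB, and a partial update still loads and rewrites the whole document, putting pressure on both IO and the working set
Over-normalizing in the other direction: orders and order items are split into two collections, one query needs N+1 $lookup calls, and aggregation cost soars
IoT / sensor / event log workloads are written into a regular collection and write throughput hits a wall, without time-series collections ever being considered
$lookup appears on the hot path, document size warnings fire (early warning for the 16MB limit), partial updates generate heavy disk writes, and the schema validation error rate suddenly climbs

Case anchors: 9.C38 Toyota Connected shows vehicle sensor schemas evolving with model / year / regulation, with polymorphic documents and schema governance side by side; 9.C37 Forbes shows 50+ microservices isolated from schema changes by a self-built intermediate abstraction layer; 9.C30 Microsoft 365 shows the document model being kept while shape is governed across vendors. Concrete incident details of an early startup running three generations of MongoDB schema side by side await a future case; this article treats it as a "common failure pattern" for now.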

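Core mechanisms: aggregate root, embedded, reference, polymorphic

The first layer of MongoDB schema design is that the aggregate root determines the atomicity boundary. MongoDB limits write atomicity to a single document; crossing documents requires a multi-document transaction (supported on replica sets since 4.0 and on sharded clusters since 4.2, with a performance cost across shards). The aggregate root is the DDD concept made concrete in MongoDB: data that is read together, written together, and shares one consistency boundary goes into the same document.

Embedded (subdocument / array): writes are atomic and reads fetch everything at once; the cost is that updating a sub-element still rewrites the whole document, so it is a poor fit when sub-elements are written very frequently
Reference (manual _id foreign key + $lookup): document size stays under control, but joins happen in the application or in the aggregation stage; JOIN-heavy workloads on this path turn into N+1
Polymorphic pattern: one collection stores multiple entity types using a type discriminator; MongoDB has no inheritance and relies on schema validators and partial indexes to keep the boundaries
16MB document hard limit: this is a mechanism boundary of MongoDB; the implicit soft limit of keeping the working set in RAM (document size directly affects page cache efficiency) bites much earlier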

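Three contract-layer paths

Cross-case synthesis frame (synthesized in this chapter, shown jointly by Toyota + Forbes): in production, the schema flexibility of the document model must be hedged with schema governance, or "schema freedom" turns into "production data inconsistency" (stated explicitly in the Toyota case). The choice is not "whether to do schema governance" but "at which layer to guard the contract". The three paths:

Path	Implementation mechanism	Applicable conditions
DB-layer contract	MongoDB `$jsonSchema` validator + `validationLevel` + `validationAction`	Stable schema, collection shared by multiple services, DB must reject dirty data
App-layer contract	Self-built API abstraction + middleware schema validation	Fast-evolving schema, microservices with independent owners, need for cross-cloud flexibility
Hybrid	DB layer guards types / required fields, app layer guards business semantics / versions	Large production, multiple owners, cross-team

DB-layer path: in production the $jsonSchema validator is a contract enforcement tool, not a dev-time linter. validationAction: "error" rejects the write outright; "warn" only logs it. validationLevel: "moderate" validates inserts and updates to documents that already conform, and leaves updates to existing non-conforming documents unchecked; "strict" validates every insert and update. This path suits schemas stable enough for a collection to be shared across services.

App-layer path: the pattern shown by 9.C37 Forbes, where 50+ microservices see a stable contract API through a self-built intermediate abstraction layer and DB schema changes stay inside the owning microservice. Forbes can exploit cross-cloud flexibility mainly because the abstraction layer concentrates schema governance in one place: during a cross-cloud migration the abstraction layer stays the same, and the microservices never learn that the underlying DB changed cluster or cloud.

Hybrid path: Atlas Application Services and enterprise schema registries belong here. The DB-layer validator guards the baseline (field types, required fields) and the app-layer abstraction guards the business rules (version fields / compatibility handling / cross-document consistency). The cost is maintaining two layers with high version-synchronization overhead; it suits teams whose production scale truly justifies the complexity.

Which path to choose depends on team size, how widely the collection is shared across services, and how fast the schema evolves.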

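Time-series collection (5.0+)

Time-series collections are MongoDB's vendor-specific mechanism for IoT / sensor / event log / metrics data, with 3-5x the write throughput of a regular collection and better storage compression. Data must follow the three-part shape { timestamp, metadata, measurement }, with the timestamp dominant.

Suitable: high-frequency sensor writes, time series in metrics systems, application event logs. Unsuitable: schemas not centered on a timestamp, workloads needing cross-document updates, workloads needing a polymorphic discriminator.

The 9.C38 Toyota Connected case notes admit that "the 20 Atlas databases are not explicitly stated to use or not use time series collections; for an IoT case this is an important distinction, but the case study does not disclose it". When writing for production this must be stated explicitly: IoT / sensor scenarios should consider time-series collections, the Toyota case does not disclose actual usage, and it must not be written up as "Toyota uses time-series collections".

Related knowledge cards: document-store, transaction-boundary (aggregate boundary = transaction boundary), data-inconsistency.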

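Operational flow

Step 1: access pattern inventory. List the top 10 queries / writes and mark the read-together / write-together sets; this list determines the candidates for embedded vs reference vs polymorphic.

Step 2: contract layer decision.

Condition	Path
Collection shared by many services + stable schema	DB-layer validator
Fast-evolving schema + microservices with independent owners	App-layer abstraction
Large production + multiple owners + cross-team	Hybrid (both together)
IoT / sensor / event log + timestamp dominant	Time-series collection (replaces regular collection)

Step 3: embed criteria: 1:few, synchronized life cycle, expected upper bound < 1MB; reference criteria: 1:many with asymmetric write frequency, references across aggregates.

Step 4: validator configuration for the DB-layer path: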

db.runCommand({
  collMod: "orders",
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["_id", "tenantId", "createdAt", "items"],
      properties: {
        tenantId: { bsonType: "string" },
        createdAt: { bsonType: "date" },
        items: {
          bsonType: "array",
          minItems: 1,
          items: {
            bsonType: "object",
            required: ["sku", "qty"],
            properties: {
              sku: { bsonType: "string" },
              qty: { bsonType: "int", minimum: 1 }
            }
          }
        }
      }
    }
  },
  validationLevel: "moderate",
  validationAction: "warn"
})

Gradual rollout: start with validationLevel: "moderate" + validationAction: "warn" and observe for two weeks; once it is confirmed that the application writes no violating documents, switch to "strict" + "error" to lock it down.
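The two rollout stages differ only in `validationLevel` and `validationAction`, so the `collMod` command can be generated from a stage name and the schema kept in one place. A minimal sketch; `STAGES`, `collModFor`, and the stage names `observe` / `enforce` are hypothetical, while the command fields are the ones shown above.

```javascript
// Validator settings for each rollout stage.
const STAGES = {
  observe: { validationLevel: 'moderate', validationAction: 'warn' },
  enforce: { validationLevel: 'strict', validationAction: 'error' },
};

// Build the collMod command document for a collection, schema, and stage.
function collModFor(collection, schema, stage) {
  const settings = STAGES[stage];
  if (!settings) {
    throw new Error(`unknown rollout stage: ${stage}`);
  }
  return { collMod: collection, validator: { $jsonSchema: schema }, ...settings };
}

const ordersSchema = { bsonType: 'object', required: ['_id', 'tenantId'] };
console.log(collModFor('orders', ordersSchema, 'observe'));
```

In mongosh the result would be passed to `db.runCommand(...)`; moving from `observe` to `enforce` (or rolling back) then changes a single argument and leaves the schema untouched.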

Step 5: abstraction interface for the app-layer path. This is the pattern shown by 9.C37 Forbes: middleware intercepts microservice writes, validates the schema, applies a version field, and keeps the owning microservice's schema changes isolated inside the abstraction.

Step 6: polymorphic + partial index. partialFilterExpression keeps cold branches from incurring index cost:

// $in inside partialFilterExpression requires MongoDB 6.0+
db.events.createIndex(
  { type: 1, timestamp: -1 },
  { partialFilterExpression: { type: { $in: ["click", "purchase"] } } }
)

Step 7: measure document shape. Use bsondump + $bsonSize + collStats:

db.coll.aggregate([
  { $group: {
      _id: null,
      avg: { $avg: { $bsonSize: "$$ROOT" } },
      max: { $max: { $bsonSize: "$$ROOT" } }
  }}
])

Verification points: avgObjSize within the expected range, validator failure rate < SLO, abstraction layer schema mismatch rate traceable.

Rollback boundary: rolling the validator back from strict to moderate is a single command and needs no application code change; changing the abstraction layer version requires a gradual application rollout; schema changes already embedded in documents need a backfill migration script and cannot be reverted in place.

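Failure modes

Unbounded array growth: embedding "all of a user's messages" into the user document pushes the document into the 16MB limit, and writes are rejected outright. The fix is to switch to references: messages in their own collection, indexed by userId.

Hot subdocument update: every write hits the same nested field, WiredTiger's document-level concurrency control degrades into a hot spot, and concurrency that looks multi-core is actually serialized. The fix is to split the hot-written field into a referenced document, or move to a sharded collection to spread the writes (see shard key selection).

$lookup on the hot path: a badly set up reference turns into a join, and p99 latency degrades linearly with collection size. The fix is to denormalize at schema design time, embedding read-together data back into the aggregate root; or write a materialized view with $merge (see aggregation pipeline optimization).

Three schema generations side by side (no contract layer): without a validator or abstraction layer, old fields linger, application code grows three layers of fallback, and new developers cannot tell which fields are live. 9.C38 Toyota shows that the "cost" of the document model's flexibility is that "production must do schema governance", or "schema freedom" becomes "production data inconsistency".

Abstraction layer becomes lock-in: when the app-layer contract is too heavy, the abstraction itself must be rewritten in a cross-vendor migration. This layer should be thin: schema isolation only, no business logic.

Polymorphic full collection scan: the discriminator is not in an index, so a type: "rare" query scans the whole collection. The fix is a partial index covering the hot types; cold types still scan, but only on a cold path.

Time-series collection in the wrong scenario: putting data that is not timestamp-dominant into a time-series collection loses flexibility without gaining the throughput dividend. Time-series collections are a specialized optimization, not a general collection upgrade.

Anti-recommendation:

An early MVP whose access patterns are not yet stable does not need a locked-down schema validator; start with an app-layer abstraction and decide after production stabilizes whether the DB layer should be locked
JOIN-heavy / strongly normalized workloads should go to PostgreSQL JSONB or SQL from the start, rather than into MongoDB followed by $lookup
Cross-case synthesis frame: "not all data belongs in MongoDB"; document-shaped data with frequently changing shape goes in, while KV data with fixed access patterns goes to KV (9.C36 Coinbase shows MongoDB + DynamoDB split by workload)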

Capacity and observability

Key metrics:

Document shape: collStats.avgObjSize, collStats.size vs storageSize (compression ratio)
Contract health: document validation failure rate, abstraction layer schema mismatch rate
Working set pressure: wiredTiger.cache.bytes currently in the cache compared with the estimated working set
Aggregation side effects: profiler slow ops, where $lookup / $unwind appear on the hot path

Mongo commands:

db.coll.stats() shows average document size and storage / index size
db.runCommand({collMod: ..., validator: ...}) changes the validator
db.setProfilingLevel(1, {slowms: 100}) captures slow ops

Back to 4.20 observability evidence: list document size distribution, validator failure rate, abstraction layer schema mismatch, and $lookup locations as the standard evidence set.

Back to 9.5 bottleneck localization: the page fault signal when the working set outgrows RAM correlates strongly with abnormal document size growth.

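Boundaries and integration

Sibling deep articles:

shard key selection: document shape determines the candidate space for shard keys
aggregation pipeline optimization: $lookup and schema references affect each other
connection management and cache layer: how the abstraction layer cooperates with the cache layer

Migration playbook:

When document shape has degraded beyond governance: the MongoDB → PostgreSQL normalization path
Three ways to keep the document model while changing vendor: keep the main DB and add peripherals (Coinbase) / same DB, different hosting (Forbes Atlas) / same model, different vendor (Microsoft 365 Cosmos DB MongoDB API)

Cross-references with 1.x: 1.2 schema design covers general schema evolution principles, of which this article is the MongoDB-specific application; 1.3 transaction boundary aligns with aggregate = atomic boundary.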

MongoDB Shard Key Selection: hashed vs ranged, sharding one cluster vs splitting blast radius across clusters

Wed, 27 May 2026 00:00:00 +0000

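The MongoDB shard key is the hardest decision to reverse when a sharded cluster goes live. A wrongly chosen shard key was effectively irreversible before 5.0 (4.4's refineCollectionShardKey can only append fields); from 5.0 on, reshardCollection can change it, but that is still a long-running operation that needs extra disk and a write-blocking window. Yet the shard key is not the only horizontal scaling option in production: there is also the "multiple clusters" path (shown by Toyota Connected), and the two solve completely different problems. This article sets the three shard key properties (cardinality / frequency / monotonicity) against "single cluster vs multiple clusters" and discusses them together with the cross-vendor discipline of partition key reversibility.

This article does not repeat the sharding introduction already in the MongoDB vendor overview; it is an implementation-layer tutorial on production design and failure repair.

MongoDB applicability pre-check: before designing a shard key, confirm that the workload sits in MongoDB's sweet spot (document shape dominant / where the contract layer sits / whether cross-cloud hedging is needed). See the three-axis pre-check at the start of schema-design-pattern; it is not repeated here. A sharded cluster is a capacity decision made after MongoDB has been chosen, not a vendor selection decision.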

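Problem context: a sharded cluster is not the only way to scale out

Typical trigger: a single replica set is at its limit, writes have pushed the primary to 90% CPU / saturated disk IO, and the working set exceeds RAM. The instinct is to "add shards", but there is also the "add clusters" path, and the two have completely different triggers:

Sharding a single cluster: solves write saturation in a single data domain, where a collection is too large for one replica set
Splitting databases across clusters: solves blast radius / ownership / compliance boundaries, not necessarily a throughput problem

The consequences of confusing the two: when throughput is fine but blast radius is the issue, forcing sharding raises the cost of aggregation / transactions / $lookup by a level while business ownership stays tangled. Or the reverse: when throughput has hit the wall but clusters were split, cross-cluster transactions do not exist and a single collection spread over several clusters has to be stitched together in the application layer.

Symptoms:

The ratio of targeted queries to scatter-gather queries on mongos is out of balance
One shard's CPU is far higher than the others, and the balancer cannot move chunks fast enough to keep up with writes
chunkMigrated events are abnormally frequent, and sh.status() shows a skewed chunk distribution
Microservice ownership does not line up with collection boundaries, so a failure in one microservice hits other services

Case anchor: 9.C38 Toyota Connected shows that "the 20 Atlas databases are split by business boundary, not by throughput" (the single-cluster vs multi-cluster contrast); concrete incident details for hot shards in e-commerce flash sales / new game realm launches / a large B2B customer monopolizing a chunk await future cases, so this article treats them as "common failure patterns" and does not invent incident numbers.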

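Core mechanisms: shard key, chunk, balancer

Three shard key properties determine how a sharded cluster behaves:

Cardinality: the number of distinct shard key values. status: "active" | "inactive" has only two values, cardinality = 2, and can be split into at most two chunks
Frequency: whether values are evenly distributed. For country, one or two countries typically account for 80% of global traffic
Monotonicity: whether values increase monotonically. _id (ObjectId) / timestamps / auto-increment IDs are all monotonic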
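The three properties can be estimated from a sample of candidate key values taken in insertion order. The function below is a rough sketch for comparing candidates, not a MongoDB tool: `profileShardKey` and its output fields are assumptions, and a real assessment should use a representative production sample.

```javascript
// Profile a sample of shard key values (in insertion order).
// cardinality: number of distinct values
// topShare:    share of the sample taken by the most frequent value (frequency skew)
// monotonic:   true when the values never decrease
function profileShardKey(values) {
  const counts = new Map();
  let monotonic = true;
  let top = 0;
  for (let i = 0; i < values.length; i++) {
    const seen = (counts.get(values[i]) ?? 0) + 1;
    counts.set(values[i], seen);
    if (seen > top) top = seen;
    if (i > 0 && values[i] < values[i - 1]) monotonic = false;
  }
  return {
    cardinality: counts.size,
    topShare: values.length ? top / values.length : 0,
    monotonic,
  };
}

console.log(profileShardKey(['active', 'active', 'inactive', 'active']));
console.log(profileShardKey([1001, 1002, 1003, 1004]));
```

A low `cardinality`, a high `topShare`, or `monotonic: true` each flags one of the risks in the list above; a candidate that trips none of them is worth carrying into the access pattern audit.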

三特性決定 shard key 行為：

Hashed shard key：hash function 把 key 打散、寫入分布均勻、但 range query 變 scatter-gather（每個 shard 都問）
Ranged shard key：相同 key 相近 → 同 chunk → range query 高效；但單調 key + ranged → 所有寫打最後 chunk
Compound shard key（5.0+ 是常用做法、對應 Composite Partition Key 的 MongoDB 實作）：例如 { tenantId: 1, _id: "hashed" } — 先 tenant 隔離、再 hash 避免 tenant 內熱點
Zone sharding：把特定 chunk 釘到特定 shard（地域 / 合規 / 硬體分層）

Chunk 是 MongoDB 在 collection 上劃出的 64MB（預設）邏輯區塊。Balancer 在 shard 間搬 chunk 達成均衡。Chunk 不可 split 的條件是 shard key 在該範圍只有一個值（low cardinality / 大 tenant 獨佔範圍）— chunk split 不了、balancer 也搬不開。

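reshardCollection (5.0+; 4.4 only offers refineCollectionShardKey, which appends suffix fields to the existing key): copies the data into a temporary collection under the new key, catches up on writes made during the copy, then cuts over. Duration scales with data size and it needs roughly 1.2x extra disk. It is a second chance after a wrong design, but not a free lunch.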

對應 knowledge card：database-sharding、hot-partition、partition。

單 cluster 切 shard vs 多 cluster 切 blast radius

跨案合成 frame（本章合成、9.C38 Toyota 揭露事實但 case 原文沒提這個 frame）：橫向擴展不是只有「sharded cluster 一條路」、多 cluster 是另一條路。

9.C38 Toyota Connected 揭露事實：

18B transactions / 月 ÷ 30 天 ÷ 86400 秒 ≈ 7K txn/sec（口徑：月度滾動平均、非瞬時尖峰）
單一 MongoDB cluster 完全撐得下這個吞吐
Toyota 切 20 個 Atlas database 不是吞吐切分、是 microservice ownership + blast radius 切分
「每個 microservice 擁有自己的 DB、單一 DB 故障不影響其他服務」

兩條路徑的判讀條件不同：

路徑	Trigger	代價
Sharded cluster（分 shard）	單一 collection 寫入飽和、storage 撐爆單 replica set、access pattern 在同一個資料域內	aggregation / transaction / `$lookup` 成本全部跳一級
多 cluster（分 DB）	微服務 ownership 邊界、blast radius 隔離、合規 boundary、不同 workload shape 共處風險	跨 cluster transaction 不存在、跨 DB join 必須在 application 層做

兩者可以同時用：每個 microservice 有獨立 cluster、cluster 內部該分 shard 還是分。寫設計文件時要避免讓讀者以為「sharded cluster 是唯一橫向擴展選項」。

Partition key 可逆性跨 vendor 對照

跨 vendor 可逆性對照 SSoT：MongoDB / DynamoDB / Cosmos DB 三家可逆性不在同一光譜、跨 vendor 對照的 SSoT 主寫位置在 DB3 entry — 三 vendor 對比 10 軸 + 對應的軸的延伸子段。本段聚焦 MongoDB 5.0+ reshardCollection 對 shard key 設計的影響、不重複展開三 vendor 全光譜比較。

不同 vendor 對 partition key 可逆性紀律完全不在同一光譜：

Vendor	機制	可逆性	成本
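MongoDB	Shard key（`shardCollection`）	Changeable with `reshardCollection` from 5.0; before 5.0 the key cannot be replaced (4.4 can only append suffix fields via `refineCollectionShardKey`)	Proportional to data size, ~1.2x disk, write catch-up + cutover
DynamoDB	Partition key	Not changeable in place; changed by backfilling into a new table	Access pattern redesign, traffic cutover cost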
Cosmos DB	Partition key	不可改（必須 export-recreate-import）	全量重灌、雙寫驗證、最大遷移成本

寫進設計文件時必須附 vendor + 版本、避免讓讀者把三家當「partition key 都不可改」、也避免把 MongoDB 5.0+ 的 reshardCollection 當「便宜遷移」。

操作流程

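Step 1: decide the horizontal scaling path. First ask whether the problem is write saturation in a single data domain or blast radius / ownership, then pick sharding or multiple clusters. If both apply, settle the cluster boundaries first and shard inside each cluster afterwards.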

Step 2：access pattern audit。列出所有讀寫 query、標出哪些 query 必須走 single shard（targeted），哪些 query 不在意 scatter-gather。

Step 3：候選 key 評估表。對每個候選打 cardinality / frequency / monotonicity 三項評分：

候選 key	Cardinality	Frequency	Monotonicity	適合？
`_id`（ObjectId）	極高	均勻	單調	否（單調寫熱）
`tenantId`	中	偏斜	否	視 tenant 分布
`{ tenantId: 1, _id: "hashed" }`	高	均勻	否	通常合適
`country`	極低（~200）	嚴重偏斜	否	否

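Step 4: dry-run sampling. Sample the existing data with db.coll.aggregate([{$sample:{size:100000}}, {$group:{_id:"$candidateKey", c:{$sum:1}}}, {$sort:{c:-1}}]) and inspect the distribution. Confirm that no single key value accounts for more than 20% of the sampled documents; document share stands in for traffic share only when traffic is roughly proportional to document count.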
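The same skew check can be sketched offline once a sample has been exported. This is a minimal illustration, assuming documents are plain dicts; `key_skew_report`, the field names, and the sample data are hypothetical, and the 20% threshold is the one used in this step.

```python
from collections import Counter


def key_skew_report(docs, key_fields, max_share=0.20):
    """Summarise how a candidate shard key distributes over sampled documents."""
    values = [tuple(doc.get(f) for f in key_fields) for doc in docs]
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    top_share = top_count / len(values)
    return {
        "cardinality": len(counts),   # distinct key values in the sample
        "top_value": top_value,       # the most frequent key value
        "top_share": top_share,       # its share of the sample
        "ok": top_share <= max_share,
    }


# Hypothetical sample: one large tenant owns 80 of 100 documents.
sample = (
    [{"tenantId": "big-corp", "orderId": i} for i in range(80)]
    + [{"tenantId": f"t{i}", "orderId": 1000 + i} for i in range(20)]
)
by_tenant = key_skew_report(sample, ["tenantId"])
by_compound = key_skew_report(sample, ["tenantId", "orderId"])
print(by_tenant["top_share"], by_compound["top_share"])
```

With tenantId alone the largest value holds 80% of the sample and fails the check; adding a high-cardinality second field brings the largest share down to 1%, which is the effect the compound key in the table is after.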

Step 5：shardCollection。

sh.enableSharding("shop")
sh.shardCollection("shop.orders", { tenantId: 1, _id: "hashed" })

先在 staging 跑流量重放、確認 chunk 分布平均、targeted query 比例 > 90%。

Step 6：監控。

sh.status()                              // cluster status
db.orders.getShardDistribution()         // data and chunk distribution per shard
db.adminCommand({ balancerStatus: 1 })   // balancer status

Step 7: if the wrong key is already live. Weigh reshardCollection (5.0+) against an application-level dual-write migration:

db.adminCommand({
  reshardCollection: "shop.orders",
  key: { tenantId: 1, region: 1, _id: "hashed" }
})

reshardCollection 進入 cutover 後不能回退、必須 dry-run 估完時間 + 磁碟 + IO 影響再上。

驗證點：targeted query 比例 > 90%、單 shard QPS 變異係數 < 20%、balancer migration 速率追上寫入速率。
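The two numeric checks above can be computed from per-shard counters. The sketch below assumes the QPS figures and query counts have already been collected by some monitoring system; the function names and sample numbers are hypothetical, and the thresholds are the ones stated in the verification points.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0 means perfectly even)."""
    avg = mean(values)
    if avg == 0:
        return 0.0
    return pstdev(values) / avg


def targeted_ratio(targeted: int, scatter_gather: int) -> float:
    """Share of queries that were routed to a single shard."""
    total = targeted + scatter_gather
    return targeted / total if total else 1.0


def shard_health(qps_by_shard, targeted, scatter_gather,
                 max_cv=0.20, min_targeted=0.90):
    """Apply both thresholds and report the underlying numbers."""
    cv = coefficient_of_variation(list(qps_by_shard.values()))
    ratio = targeted_ratio(targeted, scatter_gather)
    return {"cv": cv, "targeted_ratio": ratio,
            "healthy": cv < max_cv and ratio > min_targeted}


balanced = shard_health({"s0": 1000, "s1": 1100, "s2": 950, "s3": 1050}, 9500, 500)
hot = shard_health({"s0": 4000, "s1": 100, "s2": 100, "s3": 100}, 9500, 500)
print(round(balanced["cv"], 3), round(hot["cv"], 3))
```

The second case has the same targeted ratio as the first but fails on the coefficient of variation, which is why the routing ratio alone is not enough to rule out a hot shard.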

Rollback boundary：shardCollection 是不可逆操作（5.0 前完全不可逆、5.0+ 透過 reshardCollection 可改但需重做）；reshardCollection 進入 cutover 後不能回退。

失敗模式

單調 key 寫熱點：_id（ObjectId）/ 時間戳 / 自增 ID 當 ranged shard key → 所有寫進最後 chunk，scale-out 等於零。修法是 hashed key 或 compound key 把單調軸拌散。

低 cardinality key：用 country 當 shard key、某個 country 佔 80% 流量、chunk 無法繼續 split、該 shard 永久熱。修法是加一個高 cardinality 軸（compound key）讓 chunk 可繼續分。

Tenant skew：B2B 場景大客戶獨佔 chunk、且該 tenant 的 chunk 還會繼續長大、balancer 搬不走。修法 compound key { tenantId: 1, _id: "hashed" } — tenant 隔離但 tenant 內 hash 散開。

Scatter-gather 過多：選了 hashed _id 但業務查詢主要是 tenantId 範圍查、每筆 query 打所有 shard、p99 隨 shard 數線性退化。修法 compound key 把常用查詢軸放第一位、targeted query 才能對 single shard。

Resharding 卡在 build 階段：磁碟不夠（需 1.2x source size）、IO 飽和影響線上 workload、預期 4 小時實際跑 14 小時。修法是先擴磁碟、staging 跑 dry-run 量實際耗時、production 在低峰期啟動。

Zone sharding 規則打架：合規規則（資料必須留在某 region）跟負載平衡規則衝突、balancer 無法移動 chunk → 熱點固化。修法是 zone 規則 vs balancer 設計階段就劃清、不要事後加 zone。

誤把多 cluster 當分 shard 解：blast radius 議題塞到 sharded cluster、單 cluster 故障仍打掉全部 microservice。該分 cluster 的就分 cluster、不是塞到 shard。9.C38 Toyota 揭露：7K txn/sec 仍切 20 DB 的 trigger 是 microservice ownership、不是吞吐。

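Over-optimistic cluster scale-out estimates: scaling a MongoDB cluster takes tens of minutes to hours, not a few console clicks. 9.C36 Coinbase discloses a 70-minute cluster scale-out (scope: Coinbase's specific environment, given its cluster tier / data size / Atlas API conditions, measured from the start of reactive scaling to completion; not a general MongoDB commitment). Predictable traffic has to go through predictive / scheduled scaling; dynamic horizontal scaling of a sharded cluster alone cannot absorb a surge (see connection management and cache layer).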

Anti-recommendation：

寫入 < 5K WPS、storage < 1TB、single replica set 還能撐就不該分 shard；分了之後 aggregation、transaction、$lookup、index 成本全部跳一級
shard vs 多 cluster 對照：吞吐沒撞牆但 blast radius / ownership 是議題、走多 cluster 不是強行分 shard（9.C38 Toyota 7K txn/sec 仍切 20 DB 的 trigger）
跨 case 合成 frame：「不是所有資料都該進同一個 MongoDB cluster」、按 microservice ownership / blast radius / 合規邊界切

容量與觀測

關鍵 metric：

Shard 分布健康：每 shard QPS / CPU / disk usage 變異係數（< 20% 合理）
Query 路由：targeted vs scatter-gather query 比例（targeted > 90% 合理）
Balancer 健康：chunk migration rate、balancer round duration
Cluster 邊界：cluster-to-cluster ownership 邊界、跨 cluster query 比例

Mongo command：

sh.status()：cluster 整體狀態
db.coll.getShardDistribution()：collection 在各 shard 的分布
db.adminCommand({balancerStatus:1})：balancer 狀態
db.serverStatus().sharding：sharding metric

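Explain through mongos: explain("executionStats") output lists executionStats.executionStages.shards[], which shows whether a query hit a single shard. The database profiler itself runs on mongod, not on mongos.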

回到 4.20 observability evidence：把 shard distribution、targeted ratio、resharding 進度列為 evidence 三件套。

回到 9.4 saturation discovery：hot shard 是 partition-level saturation 的典型例子。

回到 9.5 bottleneck localization：當整 cluster CPU 看似只用 25%、實際是 1/4 shard 在 100%。

邊界與整合

Sibling deep articles：

schema design pattern — document 形狀決定 shard key 選擇空間
aggregation pipeline optimization — cross-shard aggregation 的 $out / $merge 限制
change streams + Kafka — cluster-wide vs collection-level change stream 在 sharded cluster 的差異
connection management and cache layer — cluster 擴容時間是天級議題、必須跟 predictive scaling / proxy 層配合

Migration playbook：

To avoid self-managed sharding → Atlas with a managed shard tier
For a full repartition → shard expansion + multi-DC

跟 1.x 互引：1.10 KV / Document DB 容量規劃把 shard key 列為 capacity 決策；1.12 大規模 DB 遷移實戰收 resharding 失敗 retrospective。

跨 vendor 對照：DynamoDB vendor page（partition key + adaptive capacity + backfill 可改）、Cosmos DB vendor page（partition key 不可改）。

MongoDB Replica Set Read Preference：DB 層 causal session vs cache 層 freshness token

Wed, 27 May 2026 00:00:00 +0000

MongoDB replica set 在小規模時 read preference 五擇一就夠用、primary 走預設、想分擔 primary 改 secondary — 直觀但會在 production 反噬。讀者真正撞到的議題分兩層：DB 層的 read-your-own-write（同 client 寫完馬上讀讀不到）跟跨層的 read-after-write（write 進 MongoDB、cache 還是舊資料）。前者用 causal consistency session 解、後者要走 freshness token 跨層協議。Coinbase 1.5M reads/sec 不是純 MongoDB 撐出來、是 DB + cache 跨層合成。本文把 read preference 機制 + 跨層協作講清楚。

本文不重複 MongoDB vendor overview 已寫過的 replica set 簡介 — 而是 production 部署 + 跨層協作 + 失敗修復的實作層教學。

進本文前先確認 MongoDB 已通過適配判讀：workload 是否落在 MongoDB 適用區（document shape 主導 / contract layer 該放哪 / 跨雲 hedging 是否需要）— 判讀軸見 schema-design-pattern 開頭 3 軸前置判讀。Read scaling 是 已選 MongoDB 後 的容量決策、判讀通不過時 read preference 修補無法救回 vendor 選錯。

問題情境：read scaling 撞牆的兩種長相

典型觸發場景：primary 寫入飽和、TL 提議「讀都打 secondary」想橫向擴容。改完後幾個 production 徵兆連環出現：

User 看到「我剛下的訂單怎麼還沒出現」— write 進 primary、立刻 read 打 secondary、secondary 還沒 apply 該寫入、user 看到 stale data
跨 region replica set：app server 在 Tokyo、primary 在 Singapore、每筆讀走 70ms 跨海 RTT；改 nearest 後 latency 降但 stale read 出現
Replication lag 在 backup 期間飆到分鐘級、secondary read 拿到幾分鐘前的資料、前端報表時間軸對不上
Failover 期間 read preference 沒寫好、client 一直連舊 primary、SocketTimeout 直到 driver retry 邏輯介入

第二類議題、規模更大：把所有 read 打 secondary、replica 數量加到 5-7 仍撐不住 sustained 高 read（>500K reads/sec）；replication lag 升 + secondary CPU 飽和。這時 read preference 已不夠、必須加 cache + 跨層 freshness 機制。

讀者徵兆：rs.printSecondaryReplicationInfo() 顯示 lag 分鐘級、application log 出現「我剛寫的資料讀不到」客訴、failover 演練後 connection error 持續 30s+、cache hit rate 跟 read latency 反向相關。

Case anchor：9.C36 Coinbase 揭露「document model 撐 1.5M reads/sec 靠 cache + freshness token」、含警示「1.5M reads/sec 是 users 服務 加上 cache 的數字、不是 MongoDB cluster 純讀取數字」。跨 region read preference 改 nearest 後 stale read 的具體 incident 細節需未來 case 補完、本文以「常見 failure pattern」處理。

核心機制

MongoDB read preference + read concern 兩軸

Read preference 五種：

primary（預設）：只打 primary、強一致、primary 飽和時無路可走
primaryPreferred：先 primary、primary 不可用 fallback secondary
secondary：只打 secondary、永遠拒 primary、failover 期間若所有 secondary 都不行就拋錯
secondaryPreferred：先 secondary、secondary 不可用 fallback primary
nearest：不是「最近的 secondary」、是「ping latency 最低的 member」（可能是 primary）；driver 用 latency window（預設 15ms）內隨機挑
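How nearest ends up selecting a primary can be sketched with a simplified model of the latency window. This is an illustration only: the member list, host names, and RTT figures are hypothetical, and real drivers also apply staleness and tag filters before this step.

```python
import random

LOCAL_THRESHOLD_MS = 15  # the default latency window named above


def nearest_candidates(members, local_threshold_ms=LOCAL_THRESHOLD_MS):
    """Members whose RTT is within the window of the fastest member."""
    fastest = min(m["rtt_ms"] for m in members)
    return [m for m in members if m["rtt_ms"] <= fastest + local_threshold_ms]


def pick_nearest(members, rng=random):
    """Pick uniformly at random among the members inside the window."""
    return rng.choice(nearest_candidates(members))


# Hypothetical replica set as seen from an app server in Tokyo.
members = [
    {"host": "tokyo-secondary", "role": "secondary", "rtt_ms": 2},
    {"host": "osaka-primary", "role": "primary", "rtt_ms": 12},
    {"host": "singapore-secondary", "role": "secondary", "rtt_ms": 70},
]
eligible = [m["host"] for m in nearest_candidates(members)]
print(eligible)
```

Both the Tokyo secondary and the Osaka primary fall inside the window, so reads alternate between them at random; the Singapore member is excluded on latency alone, regardless of its role.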

Read concern 是另一軸：

local：讀本地最新（含未確認）、效能最佳、可能讀到後來 rollback 的資料
available：跟 local 類似但對 sharded cluster 有差異
majority：讀到「已寫到多數 member」的資料、寫入 commit 後在多數 member 確認後才看得到
linearizable：強制最新、必須打 primary、最高 latency

Write concern w: "majority" 保證寫入確認後在多數 member 上、但不保證 secondary 馬上 visible — 兩個概念分開。

Causal consistency session（DB 層機制）

Causal consistency session 解的是 單 client 在 MongoDB cluster 內部 的因果一致：

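The client session carries clusterTime + operationTime
Each read in the session sends afterClusterTime set to the session's operationTime; the member that receives the read waits until it has applied that point before answering (the driver does not pick a member based on it)
This yields read-your-own-write (what you just wrote, you can read), provided writes use w: "majority" and reads use readConcern "majority"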

機制只在「同一 client session」內生效。跨 client 的因果一致（A 寫 → B 讀）不在範圍內。

其他輔助機制：

Tag set：member 標 {region: "ap-tokyo", role: "analytics"}、read preference 帶 tag 把流量路由到特定 member
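Hidden / delayed secondary: invisible to client reads and never eligible to become primary (priority 0), although it can still vote; used for backup / DR
Election: after the primary is lost, a majority votes in a new primary; electionTimeoutMillis defaults to 10s and failover typically completes in roughly 10-12s; every read with readPreference primary fails during that window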

Freshness token（cache 層機制）

9.C36 Coinbase 揭露的跨層機制 — 解的是 MongoDB + cache 跨層 的 read-after-write、不是 cluster 內部。對應 Freshness Token 卡片的 application-level 版本協議定義：

觸發條件：直接打 MongoDB 不可能撐 1.5M reads/sec（口徑：users 服務應用層觀察、含 cache、非 MongoDB cluster 純讀取）。Coinbase 在 users 服務前加 Memcached query cache、單 document query 先查 cache。

跨層一致性問題：write 進 MongoDB primary、cache 還是舊資料、client 下次 read 從 cache 拿到舊版。

freshness token 機制：

Write 成功後、server 給 client 一個 token（包含 OCC version / clusterTime）
Client 之後 read 帶這個 token
Server 保證返回的資料版本 ≥ token
若 cache 的版本 < token、bypass cache 直接打 DB

跟 causal consistency session 的關係：兩者解決同一類問題（read-after-write）但作用範圍不同。Causal session 是 DB 層、保證在同一 cluster 內 read-your-own-write；freshness token 是 DB + cache 兩層共用的版本協議、保證跨層 read-your-own-write。

跨層協作三選一

讀者真實系統的 read 一致性需求要選哪層處理：

路徑	適用情境	代價
只用 DB 層（causal session）	無 cache 層、讀寫都直接打 MongoDB cluster	replica scaling 上限約幾十萬 reads/sec
只用 cache 層（freshness token）	有 cache、跨層一致性要求高、application 願改	需設計 token 協議 + cache bypass 邏輯
兩層並用	大規模 OLTP、cluster 內也要 causal、跨 cache 也要 freshness	複雜度最高、但 Coinbase 規模必走此路

對應 knowledge card：stale-read、replication-lag、session-consistency、eventual-consistency。

操作流程

Step 1：read shape 分類。把所有 read 分成四類：

(a) 強一致必須 read-your-own-write（訂單詳情、帳戶餘額）
(b) 容忍秒級 lag（個人資料、商品詳情）
(c) 容忍分鐘級 lag（報表、analytics）
(d) 大規模 read scaling 需 cache + freshness token（用戶資料 / 高頻 product query）

Step 2：依分類對映機制。

分類	Read preference	Read concern	跨層機制
(a)	primary	majority	causal consistency session
(b)	secondaryPreferred	local	monitoring lag alarm
(c)	secondary（tag set）	available	無
(d)	secondaryPreferred	majority	cache + freshness token + bypass

Step 3：driver config（Node.js / Java / Python 都類似）：

mongodb://host1:27017,host2:27017,host3:27017/db?
  replicaSet=rs0&
  readPreference=secondaryPreferred&
  readPreferenceTags=region:ap-tokyo&
  readPreferenceTags=&
  maxStalenessSeconds=90&
  readConcernLevel=majority

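Listing readPreferenceTags more than once forms a fallback chain (try tokyo first, then fall back to any member). maxStalenessSeconds=90 rejects secondaries whose estimated lag exceeds 90s; 90 is also the smallest value the option accepts. The URI is wrapped here for readability; the real connection string is a single line with no whitespace.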
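One way to keep the option order and the repeated readPreferenceTags entries intact is to assemble the string from a list of pairs. This is a small sketch using only the standard library; `build_uri` is a hypothetical helper, and the hosts and option values are the ones from the example above.

```python
from urllib.parse import urlencode


def build_uri(hosts, database, options):
    """Join hosts and options into one connection string (no line breaks)."""
    # A list of pairs keeps repeated keys and their order; ":" stays literal
    # so that tag sets read as region:ap-tokyo.
    query = urlencode(options, safe=":")
    return f"mongodb://{','.join(hosts)}/{database}?{query}"


options = [
    ("replicaSet", "rs0"),
    ("readPreference", "secondaryPreferred"),
    ("readPreferenceTags", "region:ap-tokyo"),  # first choice
    ("readPreferenceTags", ""),                 # empty tag set: any member
    ("maxStalenessSeconds", "90"),
    ("readConcernLevel", "majority"),
]
uri = build_uri(["host1:27017", "host2:27017", "host3:27017"], "db", options)
print(uri)
```

A dict would silently drop the second readPreferenceTags entry, which is the one that provides the fallback, so the list of pairs is the safer shape here.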

Step 4：causal consistency session：

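from pymongo import MongoClient

client = MongoClient(uri)        # uri: the connection string from Step 3
coll = client["db"]["orders"]    # hypothetical collection; doc is the document to insert
with client.start_session(causal_consistency=True) as s:
    coll.insert_one(doc, session=s)
    # This find carries afterClusterTime, so the member that serves it waits until it has applied the insert
    coll.find_one({"_id": doc["_id"]}, session=s)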

Session 結束後因果關係結束、下個 session 不繼承。

Step 5：freshness token 設計（9.C36 Coinbase 模式）：

Write API 返回 {result, version_token} — token 含 OCC version 或 MongoDB clusterTime
Read API 接受 optional If-Version-≥ header / parameter
Cache lookup 比對 cache entry version 跟 token、低於 token 就 invalidate + bypass 到 MongoDB
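At the DB layer, read with readConcern: "majority" from the primary or inside a causally consistent session so that the returned version ≥ token; majority read concern alone does not stop a lagging secondary from returning an older majority-committed version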
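The token protocol in Step 5 can be sketched with in-memory stand-ins. This is not Coinbase's implementation: `VersionedStore` stands in for MongoDB, `FreshnessCache` for the Memcached layer, and an integer version stands in for the OCC version or clusterTime; all names are hypothetical.

```python
class VersionedStore:
    """Stand-in for the database: keeps (value, version) per key."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        _, version = self._data.get(key, (None, 0))
        self._data[key] = (value, version + 1)
        return version + 1  # handed back to the client as its freshness token

    def read(self, key):
        return self._data[key]


class FreshnessCache:
    """Read-through cache that skips entries older than the caller's token."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.hits = 0
        self.store_reads = 0

    def read(self, key, min_version=0):
        entry = self.cache.get(key)
        if entry is not None and entry[1] >= min_version:
            self.hits += 1
            return entry
        # Missing, or stale relative to the token: read the store and refill.
        self.store_reads += 1
        entry = self.store.read(key)
        self.cache[key] = entry
        return entry


store = VersionedStore()
cache = FreshnessCache(store)
store.write("user:1", {"name": "old"})
cache.read("user:1")                       # warms the cache with version 1
token = store.write("user:1", {"name": "new"})
without_token = cache.read("user:1")       # no token: the stale entry is served
with_token = cache.read("user:1", token)   # token forces a read from the store
print(without_token, with_token)
```

The read without a token shows the silent failure described under the failure modes: nothing errors, the caller simply gets the old version. Counting `store_reads` against `hits` gives the bypass rate that the verification points ask to monitor.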

Step 6：staging 驗證。灌入 replication lag（暫停 secondary apply）驗證 application 行為；灌入 stale cache 驗證 token bypass 邏輯；模擬 failover 驗證 driver retry。

驗證點：

rs.printSecondaryReplicationInfo() lag < SLO
driver metric readPreferenceUsageCount 分布符合預期
failover drill 後 read recovery < 15s
cache hit rate vs freshness bypass rate 比例監控

Rollback boundary：read preference 是 driver-side config、可以 hot-swap；causal consistency session 需 application code 改、需灰度；freshness token 是 application + cache + DB 三方協議、回退需協調。

失敗模式

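Read-after-write inconsistency (DB layer): write to the primary, then immediately read from a secondary; the application hits a race condition that looks like "the data disappeared". The fix is a causal consistency session: the read carries afterClusterTime, so the secondary waits until it has applied the write before answering.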

Read-after-write 不一致（跨層）：寫 primary → cache 還是舊資料 → user 看到舊資料。causal session 解不了（cache 在 MongoDB 外）、必須走 freshness token 跨層協議。

Stale read 在 lag 高峰：backup / DDL / 大量寫入導致 secondary lag 分鐘級、secondary read 拿到舊資料。修法設 maxStalenessSeconds 拒舊 member、driver 自動轉到較新的 member 或 primary。

nearest 在跨 region 不穩：latency 抖動讓 driver 在 primary / secondary 跳、寫一致性與 read latency 同時惡化。修法是不要用 nearest 解跨 region 議題、應該用 tag set 明確路由。

Failover 期間 primary read 全失敗：election 10s 內所有 primary read 拋錯。修法改 primaryPreferred + driver retry 邏輯吃掉短暫失敗、application 端配 retry policy。

Tag set 失準：把 region: "ap-tokyo" 的流量路由到 tag 為 tokyo 的 member、但該 member 故障時沒 fallback、流量直接停。修法是 tag 設多層 fallback chain、最後一層留空 tag 表示「任意 member」。

Analytical query 跑 OLTP secondary：secondaryPreferred 把報表打 OLTP secondary、報表 query 拖垮 OLTP read latency。修法是 analytical workload 用 tag set 路由到專屬 analytics secondary、跟 OLTP read 隔離。

Freshness token 漏寫：write 沒帶 token 給 client / client 沒帶 token、token 機制 silently 失效、read 走 cache 拿舊資料。修法 token 必須 e2e 強制（middleware 自動帶 / 自動驗證）、不能靠 application 自覺。

Cache bypass 比例失控：所有 read 都 bypass cache、cache 等於沒裝。修法是 token 失敗率要監控、過高表示 cache invalidation 設計有問題（cache 沒在 write 後 update / invalidate）。

Anti-recommendation：

read-heavy 但有強一致需求的場景不要為了 scale 改 secondary read；該換 SQL + read replica 加 application-level cache、或加 sharding 把 primary 寫散開
大規模 OLTP（>500K reads/sec）想單靠 MongoDB read preference 撐 = 拿不到那個量級。Coinbase 案明示「直接打 MongoDB 不可能撐 1.5M reads/sec」、必須 cache + freshness token

容量與觀測

關鍵 metric：

Replica health：每個 member 的 opcounters 分布、rs.status().members[].optimeDate 推算 lag
Read preference 命中：driver-side readPreferenceTags 命中率
一致性 SLO：stale read 比例（causal consistency 拒絕重試次數）
跨層 freshness：cache hit rate vs freshness bypass rate

Mongo command：

rs.status()：replica set 整體
rs.printSecondaryReplicationInfo()：lag 概況
db.serverStatus().repl：詳細 replication metric
db.adminCommand({replSetGetStatus:1})：完整 status

Application observability：APM 看「同一 session 內 write + read 順序對 latency / error 的影響」、SLO 是 read-your-own-write 命中率；跨層還要看 freshness token 流動完整性（write 是否發 token、read 是否帶 token、cache 是否驗 token）。

Lag alarm：lag > 30s 預警、> 90s 觸發 driver maxStalenessSeconds 自動拒讀。
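The two thresholds can be applied to the optimeDate values mentioned under the key metrics. A minimal sketch, assuming the input is a list shaped like rs.status().members (only name, stateStr, and optimeDate are used); the member names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

WARN_SECONDS = 30    # early warning threshold from the text
REJECT_SECONDS = 90  # matches maxStalenessSeconds=90


def secondary_lag(members):
    """Lag of each secondary relative to the primary's optimeDate, in seconds."""
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def classify(lag_seconds):
    """Map a lag value to the alarm level it should raise."""
    if lag_seconds > REJECT_SECONDS:
        return "reject"
    if lag_seconds > WARN_SECONDS:
        return "warn"
    return "ok"


now = datetime(2026, 5, 27, 12, 0, 0, tzinfo=timezone.utc)
members = [
    {"name": "a:27017", "stateStr": "PRIMARY", "optimeDate": now},
    {"name": "b:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=5)},
    {"name": "c:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=45)},
    {"name": "d:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=200)},
]
levels = {name: classify(lag) for name, lag in secondary_lag(members).items()}
print(levels)
```

Comparing optimeDate values gives an approximation of lag on an idle primary as well as a busy one only if writes are continuous, so a production alarm would usually smooth this over a short window.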

回到 4.20 observability evidence：把 read preference 命中分布、replication lag time series、failover drill recovery time、freshness token bypass rate 列為 evidence。

回到 9.5 bottleneck localization：read latency 異常時要區分 (a) primary 飽和 (b) secondary lag 高 (c) tag routing 把流量集中到單一 member (d) cache hit rate 下降 / bypass 率上升。

邊界與整合

Frame 5：合規邊界 — MongoDB 用 cluster-per-region 吸收

MongoDB / Atlas 沒有 row-level locality 機制（不像 CockroachDB 可把單 row pin 在合規 region）— 跨境合規必須以 cluster-per-region 拓樸吸收：每個合規市場開獨立 cluster、application 層做 routing、不靠 replica set / sharded cluster 機制跨 region。

跨 vendor 對照：

Vendor	合規吸收機制	拓樸特性
MongoDB / Cosmos DB	cluster-per-region（無 row-level locality 等價物）	各 region 獨立 cluster、application 層做市場 routing
Aurora	fleet 拓樸（每市場獨立 cluster、Global Database 在合規場景反指標）	active-passive per market、跨市場不複製
CockroachDB	locality + placement（邏輯一個 cluster + region pinning + Outposts）	單 logical cluster、physical row 鎖在合規 region
DynamoDB	region-pinned Global Tables（按 region 開關 replication、各市場可分離）	仍 active-active、但 replication 範圍可控

MongoDB 在這 frame 的退化點：read preference 機制本身不解合規 — 即使 readPreferenceTags={region:eu} 把流量路由到歐洲 secondary、但 primary 在亞洲時跨境 replication 仍在跑、合規 audit 不會放行 路由層 控制當作 資料邊界 控制。合規市場必須整 cluster 分離、再用 application 層 routing 把 user 帶到對應 cluster。

Atlas 在合規場景的 fit：Atlas global cluster（zone sharding 把 shard 鎖在 region）是「跨 region 但 資料 pin 在 zone」的中介選項、適合 GDPR 軟條款（資料在歐洲 EEA 內可流動）；strict 條款（資料不能離開單一國家）仍須走 cluster-per-region。

Sibling 與 cross-link

Sibling deep articles：

shard key selection — read preference 解決不了 write 飽和、要切 shard
change streams + Kafka — change stream 預設打 primary、放 secondary 的 trade-off
aggregation pipeline optimization — 把 analytical aggregation 路由到專屬 secondary
connection management and cache layer — freshness token 是該篇的核心議題之一、本文聚焦 DB 層 vs cache 層機制對照、不展開 cache 部署架構

Migration playbook：

Cross-region strong consistency requirement → Cosmos DB MongoDB API (5 consistency levels)
Cross-region while staying on native MongoDB → Atlas global cluster

跟 1.x 互引：1.1 高併發資料存取處理 read scaling pattern；1.11 全球分散式 OLTP 處理跨 region 一致性升級路徑。

MongoDB Connection Management and Cache Layer：driver × 部署模型 × cache × predictive scaling

Wed, 27 May 2026 00:00:00 +0000

MongoDB 大規模 OLTP 的真實架構不是「一個 driver pool 直連 cluster」、是 driver / proxy 層 + cache + freshness token 層 + scaling trigger 層三層協作。讀者最常的誤解是「Coinbase 用 MongoDB 撐 1.5M reads/sec」— 實際是這個合成架構撐出來的量級、單靠 MongoDB cluster 拿不到那個數字。本文把三層各自議題跟整合操作流程講清楚、並對 mongobetween 的部署模型適用範圍給出明確邊界。

本文不重複 MongoDB vendor overview 的 Atlas / 容量規劃簡介 — 而是 production 部署 + 跨層協作 + 失敗修復的實作層教學。

問題情境：大規模 OLTP 撞三道牆

MongoDB 部署規模從中型撐到大規模時、會連環撞三道牆：

Connection ceiling：應用層 deploy 規模一上來、單一 MongoDB cluster 看到 connection storm。9.C36 Coinbase 揭露具體：Ruby + GVL + blue-green 部署把 instance 數 ×2、連線數隨之 ×2、單一 cluster 看到 60K connections / 分鐘（口徑：Coinbase 特定環境 CRuby + GVL 部署模型）。MongoDB cluster 的 connection limit 撞牆、新 deploy 連不上、線上服務 cascade 失敗。

Read scaling ceiling：讀者把所有 read 都打 secondary、replica 加到 5-7 仍撐不住 sustained 高 read（>500K reads/sec）。Replication lag 升 + secondary CPU 飽和；單靠 MongoDB cluster 內機制（replica scaling + read preference）拿不到大規模量級。

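Scaling reaction lag: scaling a MongoDB cluster takes tens of minutes to hours; it is not instant. 9.C36 Coinbase discloses ~70 minutes from the start of reactive scaling to completion (scope: Coinbase's specific environment, given its cluster tier / data size / Atlas API conditions; not a general MongoDB commitment). Acting only once the surge has started is too late; predictable traffic needs action ahead of time.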

Surge 形狀又不規則：加密貨幣 surge（隨外部市場波動）/ 媒體爆量（事件驅動）/ IoT 緊急通報（雙模式並存）— 都不適合單純 reactive auto-scaling 接住、必須 predictive + reactive 兩段式。

讀者徵兆：

MongoDB Atlas console 看到 connection count 在 deploy 後 spike 到上限
p99 read latency 在事件時段集體爬
Atlas auto-scaling event log 顯示 triggered too late
Cache hit rate 跟 read latency 反向相關

Case anchor：9.C36 Coinbase 是 rich case，含具體數字（deploy 尖峰 connection event rate ~60K connections / 分鐘 / mongobetween 後 steady-state concurrent connections 由 ~30K 降到 ~2K — 兩者口徑不同、不是同一數字的連續變化；1.5M reads/sec 含 cache / 70 → 25 分鐘擴容）；9.C38 Toyota Connected 雙模式負載敘事（持續 sensor + 緊急事件）、9.C37 Forbes 媒體爆量形狀。

核心機制：三層合成 frame

跨案合成 frame（本章合成、case 原文沒這個 frame）：應用層連 MongoDB cluster 在大規模 production 是 三層協作、不是 driver 一個元件：

層次	角色	9.C36 Coinbase 對應元件
Driver / Proxy	連線多工、應用 process 跟 cluster 的橋接	MongoDB driver + mongobetween proxy
Cache + freshness token	read scaling 主路、跨層一致性協議	Memcached + freshness token + OCC version
Scaling trigger	cluster 擴容啟動時機	ML predictive scaling + reactive fallback

三層缺一都會在大規模時撞牆。本文聚焦這三層如何協作、單一層的深度議題（read preference 機制、schema 治理、aggregation pipeline）推到 sibling。

Driver / Proxy 層

MongoDB driver 原生 connection 模式：driver 在 application process 內維護 connection pool、每個 process 跟 MongoDB cluster 開固定數量 socket。但 driver 沒跨 process pool — 多個 process 共用同一台機器、每個 process 自己一份 pool、cluster 看到的是 N 倍 connection。跟 PostgreSQL 走 pgbouncer 是同樣需求。

Connection storm 的具體 trigger：

部署模型放大 process 數：CRuby + GVL 強制每 CPU core 一 process、blue-green 部署 instance 數 ×2、連線數隨之 ×2（9.C36 Coinbase 揭露：單 cluster 看到 60K connections/min）
微服務數量多：50+ microservice 各自連 cluster、每服務 connection 加總後撞上限（9.C37 Forbes 50+ 微服務情境對照）

mongobetween proxy（Coinbase 自建）：把多 application process 的連線合成少量到 MongoDB cluster 的連線。9.C36 揭露兩個獨立口徑、不是同一數字的連續變化：deploy 尖峰時 connection event rate 是 ~60K connections / 分鐘（unique connection 事件量、rate）；mongobetween 介入後 steady-state concurrent connection 數 由 ~30K 降到 ~2K（瞬時量、前後對比、一個量級）。引用時把 rate 跟瞬時 concurrent count 分開、不要壓成「60K 收斂到 2K」。

Scope warning（必明示）：mongobetween 是 Coinbase 為 Ruby + GVL 需求自建、case 自承「Go / Java / Node.js 應用因原生支援連線多工、通常不需要這層 proxy」。寫進設計文件時不可寫成「MongoDB 在大規模都需要 mongobetween」、要寫成「特定部署模型才需要」。

Cache + freshness token 層

直接打 MongoDB 不可能撐 1.5M reads/sec（口徑：users 服務應用層觀察、含 cache、非 MongoDB cluster 純讀取）。Coinbase 在 users 服務前面加 Memcached query cache、單 document query 先查 cache。

跨層一致性問題：write 進 MongoDB primary、cache 還是舊版、user 下次 read 拿到舊資料。

Freshness Token 機制：

Write 成功後給 client token（含 OCC version / clusterTime）
Client read 帶 token
Server 保證返回的資料版本 ≥ token
必要時 bypass cache 直接打 DB

跟 DB 層 causal consistency session 對照：causal session 解 MongoDB 內 read-your-own-write、freshness token 解 DB + cache 跨層 read-your-own-write。機制細節見 replica set read preference、本文不重複展開。

Scope warning（必明示）：1.5M reads/sec 是 users 服務 + cache 合成數字、不是 MongoDB cluster 純讀取 benchmark。寫進設計文件必須明示口徑、避免讀者把 1.5M reads/sec 當成「MongoDB 單獨能撐」。

Scaling trigger 層

MongoDB cluster 擴容時間：傳統 reactive scaling 起點到完成 ~70 分鐘（9.C36 Coinbase 揭露口徑：含 instance provisioning + 資料 sync + balancer rebalance、特定 Atlas tier / 資料量條件）。

Reactive 為主撐不住快變流量：CPU / queue 觸發 reactive scaling 在 surge 開始時才動、來不及；surge 已經結束擴容才到位。

Predictive scaling 機制（Coinbase 揭露）：

用外部訊號（加密貨幣價格、賽事行程、票務開賣時間）訓練 ML 模型
提前 60 分鐘預測流量
預先擴容
把擴容啟動時間從 70 分鐘壓到 25 分鐘（口徑：trigger 提前、不是擴容本身變快）

Scope warning（必明示）：case 警示「ML 預測有 false positive / false negative、Coinbase 沒揭露準確率、所以仍保留 reactive scaling 作為 safety net」。寫進設計文件要明示兩段式設計、不可寫成「Predictive scaling 取代 reactive scaling」。

對應 knowledge card：connection-pool、stale-read、session-consistency、hot-partition（cache 失效時打穿 DB 的 hot key）。

操作流程

Step 1：connection ceiling audit。量測現有 deploy 在 peak 的 connection count、推算 deploy ×2 / 微服務新增時 connection 走勢；對照 MongoDB cluster 的 hard limit（Atlas tier 決定、典型 1500-32000）。
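The projection in Step 1 is simple multiplication, sketched below. All figures are hypothetical (100 instances, 8 processes each, a pool of 5, a limit of 6000); the 30% headroom is the generic engineering estimate this article uses under the failure modes.

```python
def cluster_connections(instances, processes_per_instance, pool_size, blue_green=False):
    """Upper bound on connections the cluster sees when every pool is full."""
    multiplier = 2 if blue_green else 1  # both environments are live during a deploy
    return instances * processes_per_instance * pool_size * multiplier


def fits(connections, hard_limit, headroom=0.30):
    """True when the estimate stays under the limit minus the reserved headroom."""
    return connections <= hard_limit * (1 - headroom)


HARD_LIMIT = 6000  # hypothetical tier limit

steady = cluster_connections(instances=100, processes_per_instance=8, pool_size=5)
deploying = cluster_connections(100, 8, 5, blue_green=True)
print(steady, fits(steady, HARD_LIMIT))
print(deploying, fits(deploying, HARD_LIMIT))
```

The steady state fits with headroom to spare, but the same fleet exceeds the limit outright during a blue-green deploy, which is the situation the deployment-model table in Step 2 is meant to flag.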

Step 2：部署模型判讀。

部署模型	是否需 proxy 層	原因
CRuby + GVL（process-per-core）	需要	每 core 一 process、連線隨 process 線性升
大量微服務（50+）+ 各自 deploy	需要	微服務 connection 加總撞 cluster limit
Blue-green 部署（雙環境並存）	需要	部署期間連線 ×2、容易撞 cluster ceiling
Go / Java / Node.js 單一 binary + 多 thread	通常不需要	原生 driver pool 跨 thread 共用、收斂效率高

Step 3：proxy 選型。Coinbase mongobetween 是參考實作、社群還有 mongoproxy / DocumentDB 內建 connection multiplexer。自建 proxy 是 Coinbase 規模才合理、中型團隊先評估 Atlas tier 升級。

Step 4：cache layer 設計（read scaling 主路）：

前置 Memcached / Redis、cache key = collection + document id + version
Write API 返回 {result, version_token} — token 含 OCC version 或 MongoDB clusterTime
Read API 接受 optional version token、cache lookup 比對 entry version 跟 token、低於就 invalidate + bypass
DB-layer fallback: read with readConcern: "majority" from the primary or inside a causally consistent session so that the returned version ≥ token

Step 5：predictive scaling 設計（適用「外部訊號可預測流量」）：

識別 driver 訊號：加密貨幣價格 / 賽事行程 / 票務開賣 / 促銷活動 / IoT 緊急事件預警
訓練 ML：用歷史流量 vs 訊號 correlation 訓練、輸出未來 30-60 分鐘流量預測
觸發擴容：預測超 threshold 時主動 trigger Atlas scaling API、不等 reactive metric
保留 reactive safety net：ML failure 時 reactive scaling 仍會接、不可拿掉
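The two-stage design can be reduced to a small decision function. This is a sketch of the control logic only, with hypothetical thresholds and parameter names; it does not call any Atlas API, and the forecast is assumed to come from whatever model the team trains.

```python
def scaling_decision(forecast_rps, current_cpu, capacity_rps,
                     forecast_threshold=0.8, cpu_threshold=0.75):
    """Return which path, if any, should start a scale-up right now.

    forecast_rps is None when the prediction model is unavailable.
    """
    if forecast_rps is not None and forecast_rps > capacity_rps * forecast_threshold:
        return "predictive"   # act ahead of the surge
    if current_cpu > cpu_threshold:
        return "reactive"     # safety net: model missing or wrong
    return "none"


print(scaling_decision(90_000, 0.40, 100_000))  # forecast exceeds 80% of capacity
print(scaling_decision(None, 0.90, 100_000))    # model down, CPU already high
print(scaling_decision(50_000, 0.90, 100_000))  # false negative caught by CPU
print(scaling_decision(50_000, 0.40, 100_000))  # nothing to do
```

Keeping the reactive branch reachable even when a forecast exists is what makes a false negative survivable; removing it would turn every missed prediction into an unhandled surge.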

Step 6：全鏈路驗證。Staging 灌入 deploy ×2 模擬 connection storm、灌入 stale cache 驗證 freshness token bypass、放假流量驗證 predictive scaling trigger。

驗證點：

Connection count 在 deploy 後不爆 cluster limit
Cache hit rate vs freshness bypass rate 比例正常（cache hit > 90% + bypass < 5% 屬通用工程估算、case 未揭露具體數字）
Predictive scaling 領先窗 ≥ 30 分鐘
Reactive scaling 仍保留作 safety

Rollback boundary：

Proxy 層可下線（流量改直連 cluster、但短時 connection storm 風險回來）
Cache 層可下線（read 全部打 DB、需 cluster 容量能撐）
Predictive scaling 可下線（退回純 reactive、但快變 surge 接不住）
三層都要設計 graceful degradation、不是全有全無

失敗模式

Connection storm during deploy：blue-green 部署 instance 數 ×2、connection 隨之爆、新 deploy 連不上 cluster、cascade 失敗。修法是 proxy 層 + cluster connection limit 預留 headroom（典型留 30% buffer、屬通用工程估算）。

Proxy 變成單點瓶頸：mongobetween / pgbouncer 風格 proxy 自己變熱點、proxy 故障時下游全死。修法是 proxy 叢集 + health check + 客戶端 retry、跟 application 同 region 共部署降低 proxy ↔ application 的網路 RTT。

Cache hit rate 崩塌：cache 失效 + 大量 read bypass、DB 突然吃 100% 流量、cluster 飽和。修法是 freshness token 設計時要監控 bypass rate、過高表示 cache invalidation 邏輯有問題、cache 沒在 write 後 update / invalidate。

Freshness token 漏寫：write 沒帶 token / client 沒帶 token、token silently 失效、user 拿到舊資料。修法是 protocol 強制（middleware 攔截 write / read、自動帶 token）、不能靠 application 自覺。

Predictive scaling false positive 浪費容量：ML 預測 surge 但實際沒來、cluster 預先擴容後閒置。接受成本、保留 ML model retraining、定期評估 precision / recall。

Predictive scaling false negative 漏接 surge：ML 沒預測到、cluster 沒提前擴、surge 來時 reactive scaling 開始動但 70 分鐘來不及。修法是 reactive safety net + 服務降級（限流 / 部分 read 降級拿舊資料 + freshness token 告警）。

三層協作脫節：proxy 擋住 connection storm 但 cluster 內部 read scaling 沒設計、application 仍打爆。三層必須一起設計、不是各自獨立。

Anti-recommendation：

中小流量（< 100K reads/sec、單 deploy < 50 instance）不需要這三層；Atlas tier 升級 + cluster 內 replica + 簡單 cache 就夠
mongobetween 風格 proxy 只在 Ruby + GVL / 類似部署模型才必要、Go / Java / Node.js 通常不需要（case 自承）
Predictive scaling 只在外部訊號可預測時有效；無預測訊號的純隨機 surge 還是回 reactive + headroom
大規模 OLTP 不該為了省成本拿掉 cache 層；read scaling 主路就是 cache、單靠 MongoDB cluster 拿不到 1.5M reads/sec 量級

容量與觀測

關鍵 metric：

Connection 層：cluster connection count / Atlas tier limit / proxy 到 cluster 的 connection multiplex 比、deploy 前後 connection 走勢
Cache 層：cache hit rate、freshness token bypass rate、cache key collision rate
Scaling 層：predictive scaling trigger event count / 領先窗、reactive scaling fallback 觸發頻率、實際擴容啟動到完成時間、ML 預測準確率（precision / recall）

Mongo / Atlas command：

db.serverStatus().connections：cluster 當前 connection 統計
db.currentOp({})：看 connection 使用
Atlas API：cluster scaling event log
Proxy admin metric：connection multiplex 比、上下游 latency

Application observability：APM 看 connection acquire latency、cache hit rate time series、freshness token 流動完整性（write 是否發 token、read 是否帶 token、cache 是否驗 token）。

回到 4.20 observability evidence：把 connection storm event、cache hit rate / bypass rate、scaling trigger leadtime 列為跨層 evidence 三件套。

回到 9.5 bottleneck localization：大規模 OLTP 撞牆時要區分 (a) connection ceiling (b) cache hit rate 下降 (c) cluster 內 replica 飽和 (d) scaling 跟不上。

邊界與整合

Sibling deep articles：

replica set read preference — DB 層 causal session 機制、freshness token 跨層協議；本文聚焦三層協作、那篇聚焦 DB 層機制
shard key selection — cluster 擴容是天級議題、是 scaling layer 的 trigger；單 cluster vs 多 cluster 切分
schema design pattern — app-layer abstraction 跟本文 cache + freshness token 同層協作、contract layer 三選一
aggregation pipeline optimization — report dashboard 跑爆 primary 的補位路徑是本文的 cache + read scaling、不是讓 aggregation 自己優化

Migration playbook：

Federated DB 模式（9.C36 Coinbase 揭露：MongoDB + DynamoDB）— 不是「全用 MongoDB」、document-shaped 用 MongoDB、access pattern 固定的 KV 用 DynamoDB；對應 DynamoDB vendor page 跨 vendor 對照
跨雲 hedging（9.C37 Forbes 跨雲彈性）— Atlas 跨 AWS / GCP / Azure 是規避未來雲商鎖定的 selection 訊號

跟 1.x 互引：

1.1 高併發資料存取 — connection storm 通用模式（pgbouncer / mongobetween 對應）
1.10 KV / Document DB 容量規劃 — 三層架構列為大規模 OLTP 容量規劃必看點
9.6 容量規劃模型 — predictive scaling 的 ML 訓練紀律

MongoDB Aggregation Pipeline Optimization：stage 順序、index 配合與 memory 邊界

Wed, 27 May 2026 00:00:00 +0000

MongoDB aggregation pipeline 是 document model 做 analytical query 的主要介面、stage stream 設計直觀但 production 容易踩雷 — 上線時 200ms、半年後資料量翻倍變 8s、加 index 沒用；profiler 顯示 stage 之間在 memory 累積上百 MB temp data。Aggregation pipeline 的最佳化跟 RDBMS 的 SQL planner 完全不同邏輯 — RDBMS 靠 planner 自動重排 join / filter、MongoDB 靠寫 query 的人手動排 stage 順序。本文把 stage 機制、index 配合、memory 邊界、cross-shard 限制講清楚、並對「report dashboard 跑爆 primary」這個常見 anti-pattern 給治理路徑。

本文不重複 MongoDB vendor overview 已寫過的 aggregation 簡介 — 而是 production tuning + 失敗修復的實作層教學。

前置閱讀：MongoDB workload 適配判讀（document shape 主導 / contract layer 該放哪 / 跨雲 hedging 是否需要）見 schema-design-pattern 開頭 3 軸前置判讀。本文聚焦 aggregation pipeline 操作層、是 已選 MongoDB 後 的 query 層工程議題、不重複前置判讀。

問題情境：aggregation 是 hot path 的反模式

典型觸發場景：報表 pipeline 上線時 200ms、半年後資料量翻倍變 8s、加 index 沒用；profiler 顯示 stage 之間在 memory 累積上百 MB temp data。

進一步徵兆：

「OLTP collection 上跑 analytical query」的混合 workload：把 $group + $lookup + $sort 接成長 pipeline、aggregation 把整個 working set 從 cache 擠走
Sharded cluster 上跑 cross-shard aggregation：$group / $sort 必須在 mongos 合併、mongos 變單點瓶頸
$lookup 出現在 hot path：每筆 input doc 都要去另一個 collection 查、嚴格意義上是 N+1
db.serverStatus().metrics.aggStageCounters 飆、executionStats.executionTimeMillis 跟 doc 數線性增長
Profiler 報 usedDisk: true、aggregation OOM kill QueryExceededMemoryLimitNoDiskUseAllowed

Case anchor：report dashboard 跑爆 primary 的具體 incident 細節需未來 case 補完、本文以「常見 anti-pattern」處理、不憑空編造 incident 數字。側面引用 9.C30 Microsoft 365 — 從 MongoDB 把 analytics 分離出來的 driver。

核心機制

Aggregation pipeline 是 stage 序列：每個 stage 接 stream of document、產出 stream of document。Stage 順序直接決定後續 stage 處理量 — 第一個 stage 是 IXSCAN 還是 COLLSCAN、$match 推到前面還是後面、$project 早 drop 還是晚 drop、都會放大或縮小後續 cost。

Optimizer rewrite：MongoDB 會自動把 $match / $project 往前推、把 $sort + $limit 合併成 top-K、但不保證所有 case。用 explain("executionStats") 看 rewrite 後的 effective pipeline、不要靠原始 pipeline 推斷實際執行順序。

Index 配合：pipeline 的 第一個 stage 若是 $match 或 $sort、且能對到 index、就走 IXSCAN。中間 stage 都是 in-memory stream、沒 index 概念。所以 $match 永遠該排第一、配合對應 index。

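Memory boundary: each aggregation stage has a default 100MB memory limit; beyond it the stage needs allowDiskUse: true (the default from 6.0 through allowDiskUseByDefault; earlier versions must set it explicitly). Once disk spill starts, IO slows the aggregation severely, on the order of 50-100x.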

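$lookup on a sharded cluster: the foreign (from) collection cannot be sharded before 5.1; from 5.1 that restriction is lifted. $lookup is at heart a nested loop join, one lookup per input document, and is fast only when the foreign field is indexed (6.0+ can choose a hash join in some cases); on a large unindexed collection it is not usable.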

$facet 平行多 pipeline：但所有 facet 共享同一個 100MB 限制、複雜 facet 容易撞 memory ceiling。

$merge / $out：把結果寫回 collection（pre-computed view / materialized view）— 把 hot analytical query 移出 read path、是治理 anti-pattern 的主要工具。

對應 knowledge card：hot-partition（aggregation 集中讀單 shard 的副作用）、document-store、stale-read（從 secondary 跑 aggregation 的 trade-off）。

操作流程

Step 0：把壞 pipeline 跟好 pipeline 並排。看一個簡化但典型的優化：

// Bad: $lookup before $match, $sort without $limit, $project last
db.orders.aggregate([
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } },
  { $match: { status: "completed", "user.region": "ap-tokyo" } },
  { $sort: { createdAt: -1 } },
  { $project: { _id: 1, total: 1, createdAt: 1 } }
])

// Good: the $match that can move forward goes first, $sort is paired with $limit, $project drops unused fields
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $sort: { createdAt: -1 } },
  { $limit: 100 },
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } },
  { $match: { "user.region": "ap-tokyo" } },
  { $project: { _id: 1, total: 1, createdAt: 1, "user.name": 1 } }
])

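Difference: the bad pipeline runs the lookup over all of orders and only then filters; the good pipeline filters and takes the top 100 first, runs the lookup for just those 100 documents, then filters the lookup result. On a large collection the two commonly differ by 50-100x. The two are not strictly equivalent: because $limit runs before the region filter, the good pipeline returns only the ap-tokyo orders among the newest 100 completed ones, which can be fewer than 100 rows. Use it when "newest N, then filter" is what the page actually needs.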
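The effect of stage order on the number of lookups can be reproduced with plain lists. This is an in-memory model under stated assumptions, not MongoDB's execution engine: `run_bad` and `run_good` are hypothetical helpers, and the synthetic data (10,000 orders, 50 users) is made up.

```python
def run_bad(orders, users):
    """$lookup first: every order triggers one lookup before any filtering."""
    lookups = 0
    joined = []
    for order in orders:
        lookups += 1
        joined.append((order, users[order["userId"]]))
    matched = [o for o, u in joined
               if o["status"] == "completed" and u["region"] == "ap-tokyo"]
    matched.sort(key=lambda o: o["createdAt"], reverse=True)
    return matched, lookups


def run_good(orders, users, limit=100):
    """$match, $sort and $limit first: at most `limit` lookups."""
    completed = [o for o in orders if o["status"] == "completed"]
    completed.sort(key=lambda o: o["createdAt"], reverse=True)
    lookups = 0
    result = []
    for order in completed[:limit]:
        lookups += 1
        if users[order["userId"]]["region"] == "ap-tokyo":
            result.append(order)
    return result, lookups


users = {i: {"region": "ap-tokyo" if i % 2 == 0 else "eu-west"} for i in range(50)}
orders = [
    {"_id": i, "userId": i % 50, "createdAt": i,
     "status": "completed" if i % 10 else "pending"}
    for i in range(10_000)
]
bad_rows, bad_lookups = run_bad(orders, users)
good_rows, good_lookups = run_good(orders, users)
print(bad_lookups, good_lookups, len(bad_rows), len(good_rows))
```

The lookup count drops from one per order to one per limited row, and the rows the second version returns are the newest matching ones from the first version's result, just fewer of them.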

Step 1：拿 explain plan。

db.coll.explain("executionStats").aggregate([...])

Check stages[] for the effective pipeline after rewrite, executionTimeMillis, the totalDocsExamined / nReturned ratio, and whether usedDisk is set.

Step 2：把 $match 推到最前。越早過濾、後續 stage 處理量越小。Optimizer 通常自己會推、但 $lookup 之後的 $match 不會自動推到 $lookup 之前 — 因為 lookup 出的欄位才能被那個 match 用、邏輯依賴。寫 query 時就把能推前的 $match 寫前面。

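Step 3: build a compound index on the $match fields. Make sure executionStages shows IXSCAN rather than COLLSCAN. Compound index order matters: { status: 1, createdAt: -1 } serves { status: ..., createdAt: { $gte: ... } } efficiently, but a query on { createdAt: { $gte: ... } } alone skips the index prefix and cannot use the index that way.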

Step 4：$sort + $limit 寫在一起。Optimizer 才會推 top-K（不需要 full sort、只需要 heap）。單 $sort 不限 limit 會做 full sort、容易撞 memory。
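Why top-K needs only a small heap rather than a full sort can be shown with the standard library. This is an analogy for the idea, not MongoDB's internal implementation; the document shape and the sizes are hypothetical.

```python
import heapq
import random


def top_k(docs, k, key):
    """Keep only the k largest items while streaming; memory stays O(k)."""
    return heapq.nlargest(k, docs, key=key)


rng = random.Random(42)
docs = [{"_id": i, "createdAt": rng.random()} for i in range(100_000)]


def by_created(doc):
    return doc["createdAt"]


# The heap version consumes an iterator and never holds more than k candidates.
via_heap = top_k(iter(docs), 100, by_created)
# The full sort has to materialise and order all 100,000 documents first.
via_full_sort = sorted(docs, key=by_created, reverse=True)[:100]
print(len(via_heap), via_heap == via_full_sort)
```

Both paths return the same 100 documents; the difference is that the heap path holds 100 candidates at a time, which is why pairing $sort with $limit keeps the stage well below the memory limit.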

Step 5：$project 早寫。把不需要的欄位早期 drop、減少後續 stage 處理 doc size。對大 document 特別有效。

Step 6：把 hot analytical pipeline 寫成 materialized view。

db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2026-05-01") } } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $merge: {
      into: "monthly_customer_summary",
      on: "_id",
      whenMatched: "merge",
      whenNotMatched: "insert"
  }}
])

定時更新（cron / 5 分鐘一次）、application 讀 materialized view 而不是即時跑 aggregation。

Step 7：sharded cluster 處理。避免在 hot path 用 cross-shard $lookup / $group、或把這類 query 路由到 analytical replica（用 tag set + read preference）、見 replica set read preference。

驗證點：

executionTimeMillis 在預期 budget 內
totalDocsExamined / nReturned ratio close to 1 (efficient filtering)
無 usedDisk: true
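No stage reports memory use above about 50MB in explain output (for example totalDataSizeSortedBytesEstimate on $sort or maxAccumulatorMemoryUsageBytes on $group), leaving margin under the 100MB limit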

Rollback boundary：pipeline 改寫是 application code 變更、可以灰度；materialized view（$merge）需備份 target collection 才能還原。

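Typical tuning trajectory (200ms → 8s → 250ms)

A common evolution path for a production pipeline:

At launch, 200ms: 100K docs in the collection, $match filters out 95%, $lookup runs only 5K times, and the in-memory $sort handles 5K rows within 100MB
Six months later, 8s: the collection has grown to 2M docs, $match still filters out 95% but now leaves 100K rows, $lookup runs 100K times (5K → 100K is 20x), and the in-memory $sort hits 100MB and starts spilling to disk, degrading IO by roughly 100x
Adding a compound index does not help: the index serves $match, but the stages after $match ($lookup / $sort) run in the in-memory pipeline, where the index cannot help
Fixes that bring it to 250ms: (a) pair $sort + $limit so the optimizer uses top-K and avoids a full sort (b) change the schema to embed and remove the $lookup (see schema design pattern) (c) write the hot pipeline as a $merge materialized view so the application reads the view instead of running the aggregation

Key lesson: aggregation slows down because of how the data shape evolves, not because of the query itself. An index is the first lever on the hot path, but it helps only a $match / $sort in the first stage; later stages depend on stage order, materialized views, and schema denormalization.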

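Failure modes

$lookup on the hot path: a list page queries another collection for every row, and p99 grows linearly with page size. Denormalize at schema design time and embed read-together data back into the aggregate root (see schema design pattern).

$sort without limit and without an index: a full in-memory sort of the collection hits the 100MB limit, then either fails or spills to disk. allowDiskUse: true avoids the failure but degrades IO by roughly 100x. The fix is to build a matching index so the sort runs as an IXSCAN, or to add a limit so it runs as top-K.

Sharded cluster cross-shard aggregation: in the $group stage all partial results flow to mongos for merging, and mongos memory and CPU blow up. The fix is to include the shard key prefix in the group key (so the group completes inside each shard), or to route the query to an analytical replica.

Wrong stage order: putting $lookup before $match means running the lookup against the whole collection and filtering afterwards, so every input doc triggers a lookup. $match should always come first.

Aggregation evicts the working set: the cold scan of an aggregation pushes OLTP hot pages out of cache, and overall query latency degrades together. The fix is to isolate analytical workload from OLTP reads (read preference tag), or to move the analytical workload away (see the anti-recommendation below).

$facet at full load: four facets each run a large pipeline and share the 100MB limit, which is exhausted immediately. The fix is to split them into independent queries instead of forcing them into one $facet.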

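Anti-recommendation:

Running report / BI / analytics workload on the MongoDB primary is an anti-pattern: instead (a) set up an analytical secondary + read preference tag (b) use $merge to write into a reporting collection (c) at the advanced end, use BI Connector / data lake, or move the analytical workload wholesale to ClickHouse / BigQuery
"Report dashboard overloads the primary" is a typical anti-pattern: a BI tool connects directly to the MongoDB primary and runs long pipelines, cache eviction pushes out the OLTP working set, and p99 latency rises across the board during report hours. No concrete incident numbers are available and this article does not invent any; it is described as a common anti-pattern and handed off to the governance path
Aggregation does not solve read scaling: aggregation is a supplement to OLTP, not the main road for read scaling. At large OLTP scale, read scaling goes through cache + freshness token (see connection management and cache layer), not through overloading secondaries with aggregation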

Capacity and observability

Key metrics:

Aggregation operation time distribution
Disk spill count
Share of aggregate in opcounters.command
Change in cache eviction rate during aggregation peaks

Mongo commands:

db.currentOp({ "command.aggregate": { $exists: true } }): aggregations currently running
db.serverStatus().metrics.aggStageCounters: per-stage counters
explain("executionStats"): detailed analysis of a single query

Profiler: db.setProfilingLevel(1, {slowms: 200}), then check the usedDisk flag and numYield.

Back to 4.20 observability evidence: aggregation slow log + cache hit ratio + disk spill rate are the three pieces of evidence for "analytical pressure".

Back to 9.5 bottleneck localization: use explain executionStats to map each pipeline stage to a bottleneck (IXSCAN or COLLSCAN, in-memory or disk spill, shard-local or mongos merge).

Boundaries and integration

Sibling deep articles:

schema design pattern: embedded design removes most $lookup
shard key selection: decides whether an aggregation is shard-local or cross-shard
replica set read preference: the stale read trade-off of running aggregation on a secondary
connection management and cache layer: the cache + read scaling main road when a report dashboard overloads the primary

Migration playbook: when the analytical workload is too large to stay mixed into MongoDB, split it out to Cosmos DB MongoDB API + Synapse, or to DynamoDB + Athena/Glue (with access pattern redesign).

Cross-references with 1.x: 1.10 KV / Document DB capacity planning lists aggregation as a cost dimension of read shape; 1.1 high-concurrency data access covers the "OLTP + analytical in the same cluster" anti-pattern.

MongoDB Change Streams + Kafka Integration: resume tokens, scope selection, and connector governance

Wed, 27 May 2026 00:00:00 +0000

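MongoDB change streams are the native CDC interface since 3.6, essentially an oplog tail wrapped in a cursor API. After an application moves from dual-write (writing MongoDB and also Elasticsearch / Redis / data warehouse itself) to change stream → Kafka → downstream sink, it has a first CDC pipeline, but after a few weeks of continuous operation "downstream missed events" or "duplicate events" show up. The most painful case is a resume token that has expired after a connector restart (the oplog has already rolled over), which forces a full reload of the collection. This article explains the change stream mechanism, Kafka Connector configuration, resume token governance, and scope selection on sharded clusters.

This article does not repeat the change streams introduction already written in the MongoDB vendor overview; it is an implementation-level guide to deploying a production CDC pipeline and repairing its failures.

MongoDB fit pre-check: before designing the CDC pipeline, confirm that the workload sits in MongoDB's fit zone (document shape dominates / where the contract layer belongs / whether cross-cloud hedging is needed). See the three-axis pre-check at the start of schema-design-pattern; it is not repeated here. Change streams are an event-driven integration topic that applies after MongoDB has been chosen.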

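Problem scenario: pitfalls after the first CDC pipeline runs for a few weeks

Typical trigger: after writing MongoDB the application also dual-writes Elasticsearch / Redis / data warehouse, application code accumulates more and more hooks, and compensation logic for failed writes is scattered everywhere. After switching to change stream → Kafka → downstream sink there is a first CDC pipeline, but after a few weeks of continuous operation the following appear:

Downstream missed events or duplicate events
Resume token expired after a connector restart (the oplog has rolled over), forcing a full reload of the collection
On a sharded cluster, collection-level and cluster-wide change streams behave differently, and an application gets different events depending on whether it connects to mongos or to a single shard

Symptoms the reader will see:

MongoDB Kafka Connector logs ChangeStreamHistoryLost or ResumeTokenChanged
Downstream Kafka topic event count does not match source collection write count
Replication oplog lag and change stream consumer lag rise at the same time

Case anchor: concrete incident details of a full reload caused by an expired CDC resume token need to be filled in by a future case; this article handles it with "common failure pattern" + capacity formula and does not invent incident numbers. Side reference: Spotify Kafka → PubSub migration (as a pipeline-level migration comparison).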

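Core mechanism

Change streams are MongoDB's native CDC since 3.6, essentially an oplog tail wrapped in a cursor API. They can be opened at three scopes: collection / database / cluster.

Collection-level: watches changes to a single collection
Database-level: watches all collections in a database
Cluster-wide: watches all databases in the cluster

The oplog is a capped collection; its default size is 5% of free disk space, bounded between 990 MB and 50 GB. A resume token corresponds to an oplog entry's timestamp + UUID + documentKey. The token must point at an entry that is still in the oplog; once the oplog rolls past it, the position cannot be found and the stream fails with ChangeStreamHistoryLost.

Two sides of resume token usage:

_id: carried by every event; the application persists it itself
startAfter / resumeAfter parameter: passed when reopening the cursor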
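The two sides above meet when a consumer restarts: whatever was persisted decides which option goes into the next watch call. The sketch below is a hedged illustration. resumeAfter and startAtOperationTime are existing watch options; the shape of the saved record (resumeToken, lastClusterTime) and the function name are hypothetical, standing in for whatever a durable store returns.

```javascript
// Decide how to reopen a change stream from what was persisted.
// `saved` is a hypothetical record loaded from a durable store.
function watchOptionsFrom(saved) {
  if (saved && saved.resumeToken) {
    // Normal restart: continue right after the last processed event.
    return { resumeAfter: saved.resumeToken };
  }
  if (saved && saved.lastClusterTime) {
    // Token missing or unusable: fall back to a point in time.
    return { startAtOperationTime: saved.lastClusterTime };
  }
  return {}; // no history at all: the stream starts from now
}

console.log(watchOptionsFrom({ resumeToken: { _data: "example-token" } }));
```

The returned object would be passed as the second argument of watch. Either option still fails with ChangeStreamHistoryLost if the referenced position has already left the oplog, which is why oplog sizing matters.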

fullDocument: "updateLookup": by default an update event carries only the delta; with this option the server runs one extra query against the primary to fetch the full doc. Under high-frequency updates the cost is significant (primary load doubles).

Pre-image / post-image (6.0+): gives access to the doc state before the update; requires the collection-level option changeStreamPreAndPostImages: { enabled: true }.

Cluster-wide vs collection-level change stream:

Cluster-wide must go through mongos, and event ordering is global
Collection-level can go directly to a single shard, and ordering holds only within that shard
On a sharded cluster a cluster-wide stream easily turns mongos into a single-point bottleneck (events from all shards converge on mongos)

MongoDB Kafka Connector (Confluent / MongoDB official):

Source connector: change stream → Kafka topic
Sink connector: Kafka topic → MongoDB
At-least-once semantics; the application must handle idempotency

Related knowledge cards: change-data-capture, replication-channel, replication-slot (MongoDB has no slot; listed as a conceptual comparison).

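Procedure

Step 1: scope decision tree.

Scope	When it fits	Cost
Collection-level	Downstream sink for a single collection, single ordering requirement	Multiple collections need multiple connectors
Database-level	Multiple collections share a sink, ordering across collections	Filter cost lands on the connector side
Cluster-wide	Unified audit / replay for the whole cluster	Risk of mongos single-point bottleneck, high event volume

Step 2: oplog sizing. Capacity formula:

oplog size >= peak write rate × average oplog entry size × max acceptable consumer downtime

A typical target is a 24-72 hour recovery window. Example: at a peak of 5K WPS with a tolerated connector downtime of 48 hours, the oplog must hold at least 5K × 86,400 × 2 = 864M entries; multiply by the average oplog entry size, which depends on actual doc size, to get bytes. On Atlas the oplog size can be adjusted directly; on a self-managed cluster use replSetResizeOplog.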
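The sizing arithmetic can be worked through as a small function. This is a sketch under stated assumptions: the 1 KiB average oplog entry size is a placeholder, not a measured value, and the optional headroom factor is an added safety margin that is not part of the formula in the text. Real entry sizes should come from the cluster's own oplog statistics.

```javascript
// Required oplog size in GiB for a given peak write rate and tolerated downtime.
// avgEntryBytes is an assumption; measure it on the real cluster.
function requiredOplogGiB({ peakWritesPerSec, downtimeHours, avgEntryBytes, headroom = 1 }) {
  const entries = peakWritesPerSec * downtimeHours * 3600;
  return (entries * avgEntryBytes * headroom) / 2 ** 30;
}

const gib = requiredOplogGiB({
  peakWritesPerSec: 5000,  // 5K WPS, from the example
  downtimeHours: 48,       // tolerated connector downtime, from the example
  avgEntryBytes: 1024,     // assumed 1 KiB per oplog entry
});
console.log(gib.toFixed(0)); // "824"
```

At these assumed numbers the requirement is far above the 50 GB default ceiling, which is why the default oplog size rarely survives a 48-hour outage on a write-heavy cluster.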

Step 3: Kafka Connector configuration.

{
  "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
  "connection.uri": "mongodb://...",
  "database": "shop",
  "collection": "orders",
  "publish.full.document.only": "true",
  "change.stream.full.document": "updateLookup",
  "copy.existing": "true",
  "copy.existing.namespace.regex": "shop\\.orders",
  "errors.tolerance": "none",
  "offset.flush.interval.ms": "10000"
}

Key fields:

change.stream.full.document: "updateLookup": every update triggers an extra query against the primary to fetch the full doc (be aware of the cost)
copy.existing: "true": on startup the connector first copies the existing collection in full, then switches to the change stream; suited to a first deployment
errors.tolerance: "none": the connector stops on the first error instead of silently dropping the record

Step 4: resume token persistence. The connector stores the token as a source offset in the Kafka Connect offset storage topic, or in an external store; an application that manages its own change stream must write the token to a durable store (not in-memory).

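Step 5: filter pipeline. Change streams accept an aggregation pipeline, which pushes filtering down into MongoDB:

const pipeline = [
  { $match: { "operationType": { $in: ["insert", "update", "delete"] } } },
  // Note: delete events carry no fullDocument, and update events carry it
  // only with fullDocument: "updateLookup", so this stage drops those events.
  { $match: { "fullDocument.region": "ap-tokyo" } }
]
const changeStream = db.orders.watch(pipeline)

Pushing the filter down reduces what the connector has to process, especially on high-frequency collections.

Step 6: downstream idempotency. When the sink receives a Kafka event it uses documentKey._id + clusterTime as the dedup key. At-least-once semantics mean that events from the few minutes before a connector restart will be sent again.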
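The dedup key from Step 6 can be sketched as a thin wrapper around the sink's apply function. Assumptions are labelled in the code: clusterTime is modelled as a plain { t, i } object (the driver's actual Timestamp type may expose these values differently), and an in-memory Set stands in for the durable dedup store a production sink would need.

```javascript
// Build the dedup key from documentKey._id + clusterTime.
// clusterTime is assumed to be a plain { t, i } object here.
function dedupKey(event) {
  const id = JSON.stringify(event.documentKey._id);
  const { t, i } = event.clusterTime;
  return `${id}:${t}:${i}`;
}

// Wrap an apply function so that a replayed event is applied only once.
// The Set is a stand-in; a real sink needs a store that survives restarts.
function makeIdempotentApply(apply, seen = new Set()) {
  return (event) => {
    const key = dedupKey(event);
    if (seen.has(key)) return false; // duplicate delivery: skip side effects
    apply(event);
    seen.add(key); // recorded only after apply succeeded
    return true;
  };
}

const handled = [];
const applyOrderEvent = makeIdempotentApply((e) => handled.push(e));
const orderEvent = { documentKey: { _id: "o-1" }, clusterTime: { t: 1700000000, i: 1 } };
applyOrderEvent(orderEvent);
applyOrderEvent(orderEvent); // replay after a simulated connector restart
console.log(handled.length); // 1
```

Recording the key after the side effect keeps the at-least-once guarantee: a crash between the two steps causes a retry, not a lost event.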

Verification points:

Difference between source collection write count and Kafka topic event count < 0.1%
Resume token age < 50% of oplog retention (healthy state)
A connector restart drill reconnects within 5 minutes
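The token-age check above lends itself to a small alarm function. This is a minimal sketch: the 50% threshold comes from the text, while the status labels and the function name are hypothetical. Token age and oplog window would come from the consumer's own bookkeeping and from the replication info respectively.

```javascript
// Classify resume token age against the oplog retention window.
// Threshold of 50% follows the verification point; labels are illustrative.
function tokenAgeStatus(tokenAgeHours, oplogWindowHours) {
  const ratio = tokenAgeHours / oplogWindowHours;
  if (ratio >= 1) return "lost";   // position has rolled out of the oplog
  if (ratio >= 0.5) return "warn"; // past the healthy half of the window
  return "ok";
}

console.log(tokenAgeStatus(10, 48)); // "ok"
console.log(tokenAgeStatus(30, 48)); // "warn"
```

Alerting at "warn" leaves at least half of the retention window to bring the connector back before a full reload becomes necessary.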

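Rollback boundary: the source connector is read-only and harmless to MongoDB; the sink connector's target must be backed up to be restorable; if the resume token was written incorrectly, fall back with startAtOperationTime to a point in time and replay from there.

Failure modes

Resume token expired (oplog rolled over): the connector was down too long, the position has fallen out of oplog retention, and ChangeStreamHistoryLost follows. A full reload via copy.existing is required, and downstream sees no new data in the meantime. Prevention: oplog sizing with buffer + connector lag alarm + token age monitoring (warn when age > 50% of oplog retention).

updateLookup overloads the primary under high-frequency updates: every update event triggers a query against the primary, doubling primary load. The fix is to switch to collection-level pre/post images (6.0+), which MongoDB records itself at write time, or to have the application assemble the full doc before writing to Kafka, without updateLookup.

Cluster-wide stream overloads mongos on a sharded cluster: events from all shards converge on mongos, which becomes a single-point bottleneck. The fix is to switch to collection-level streams with multiple connectors in parallel, each connecting to mongos but subscribing to a single collection.

At-least-once turns into a duplicate flood: events from the few minutes around the connector restart point are sent again, and a downstream without idempotency repeats side effects (duplicate emails, duplicate charges). The fix is to enforce idempotency at the sink (dedup key written to Redis / DB); never assume "I use at-least-once but duplicates will not actually happen".

Schema drift suddenly breaks the sink: MongoDB writes a new field or changes a type, the sink connector's JSON schema rejects it, and the batch stops in the dead-letter queue. The fix is a validation gate on schema changes (see schema design pattern), a lenient sink schema that accepts unknown fields, or a schema registry to unify versions.

Change stream anomalies during backup / DDL: reIndex / compact / dropCollection trigger special events, the connector does not handle them, and the consumer stops. The fix is explicit connector logic for special events; an unrecognized operation type should at least log a warning rather than stall silently.

Anti-recommendation:

A simple outbox pattern + application transactional write is simpler than change stream + Kafka for low-throughput / single-sink scenarios; not every "needs event notification" scenario requires a CDC pipeline
If the downstream is only an Elasticsearch index in the same region and owned by the same team, writing to an intermediate collection with $merge, or application dual-write + reconciliation, may cost less
Resume token expiry is the most painful incident on this path, and oplog sizing is an investment rather than a cost; do not shrink the oplog to save storage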

Capacity and observability

Key metrics:

Oplog health: oplog write rate and retention time
Change stream health: cursor age, distance of the resume token from the head and tail of the oplog
Connector health: connector lag (Kafka offset compared with source writes)
Downstream health: event count diff (source write count vs sink apply count), distribution of event time → arrival time lag

Mongo commands:

db.getReplicationInfo(): oplog size / time range
db.printReplicationInfo(): oplog summary
db.currentOp({ "op": "getmore", "ns": "local.oplog.rs" }): shows change stream consumer connections

Connector metrics (Kafka Connect JMX): source-record-poll-rate, source-record-write-rate, offset-commit-success-percentage.

Back to 4.20 observability evidence: oplog retention + connector lag + dedup rate are the three pieces of evidence for CDC pipeline health.

Back to 9.5 bottleneck localization: when CDC lag rises, distinguish (a) source oplog written too fast (b) connector processing slow (c) downstream sink slow.

Boundaries and integration

Sibling deep articles:

shard key selection: choosing between cluster-wide and collection-level change streams on a sharded cluster
replica set read preference: the impact of change streams on primary load, and whether they can run on a secondary
schema design pattern: what the schema validator means as a contract for downstream sinks
connection management and cache layer: the role of the CDC sink in a production cross-layer architecture (cache invalidation / federated DB sync)

Migration playbook:

Bulk migration from MongoDB to other sinks goes through → Atlas Migration Service
When migrating off MongoDB, the change stream is the catch-up mechanism (bulk export first, then change stream for the increments)

Cross-references with 1.x: 1.7 schema migration rollout evidence covers CDC pipeline reconciliation under schema drift; 1.9 reconciliation data repair covers the reconciliation flow after CDC goes out of sync.
