MongoDB on Tarragon

DB3 Vendor Selection: Three-Way Document / KV / Multi-Model Selection + Workload Shape Pre-Assessment

Wed, 27 May 2026 00:00:00 +0000

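The core job of DB3 vendor selection is to move the reader from the question "which of MongoDB, DynamoDB, or Cosmos DB should I pick?" to the earlier question "is my workload document, KV, or multi-model?". All three vendors' documentation advertises "scalable, schema-less", but the real trade-offs are decided on three axes: data shape, access-pattern stability, and tolerable consistency. Comparing vendors without first identifying the workload shape is an error at the source. This article is the first stop for DB3 readers: it starts with the three-axis workload-shape pre-assessment, moves through the three migration-path types and the federated-DB view, and ends with a 10-axis comparison of the three vendors.

This article does not expand vendor mechanism details (partition key design, consistency levels, RU sizing, connection management, and so on); those belong to the per-vendor deep articles, and this article cross-links to them after each axis. Nor does it rank the three vendors: each occupies its own position on a workload-by-workload fit spectrum, and framing the comparison as better-versus-worse would mislead readers into collapsing selection onto a single axis.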

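Problem context: the real pressure readers arrive with

The typical triggering pressure falls into two categories:

Category 1: the team is evaluating three document / KV / multi-model NoSQL vendors, every vendor's documentation says "scalable, schema-less", and the actual trade-offs are not visible. Telltale reader questions: "Is my data document-shaped or KV-shaped?", "How should I choose a partition key?", "Where does Atlas differ from the Cosmos DB MongoDB API?", "Will I really use Cosmos DB multi-model, or is it marketing?", "On-demand or provisioned?"

Category 2: an existing PostgreSQL / MySQL workload hits the connection limit (under surge, a pool of 1K-5K is a hidden ceiling; F1.7), and the team wants to switch to KV without knowing whether it fits. Telltale reader questions: "I already have Memcached; do I need a MongoDB cache layer as well?", "Is DynamoDB suitable for OLTP?", "Is moving to NoSQL a silver bullet for connection problems?"

For both categories, the real question is which type the workload itself belongs to, not how the vendors differ. The case anchors cover six distinct angles: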

Polymorphic document workload: 9.C38 Toyota Connected (in-vehicle sensor schema evolves with each vehicle model; 20 Atlas DBs split the blast radius)
Document cross-cloud hedging: 9.C37 Forbes (self-managed → Atlas, 6-month migration, cross-cloud flexibility)
Dogfood signal for switching vendor on the same model: 9.C30 Microsoft 365 (MongoDB → Cosmos DB MongoDB API, driver retained, wire-compat limits)
KV-as-buffer positive use case: 9.C15 Tixcraft (DynamoDB write buffer, 6750x elasticity, slow back-end consumption)
Naturally uniform PK exemplar: 9.C5 Amazon Ads (90M reads/sec annual peak, pure KV pattern)
Real federated-DB system: 9.C36 Coinbase (MongoDB + DynamoDB + Memcached + mongobetween + freshness token)

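Three-axis pre-assessment: workload shape × access pattern × consistency

Before entering the three-vendor comparison, answer one question: which type is your workload? The combination of the three axes determines the vendor shortlist. Comparing vendors before the axes are identified reduces selection to brand preference rather than an engineering decision.

Axis 1: data shape (document / KV / unclear)

The core question for data shape is whether the aggregate root boundary is clear and whether the schema will gain fields as the product evolves. Document fits when the data is naturally polymorphic, fields differ widely between records, and the application accesses data through an aggregate-root pattern. KV fits when the data shape is fixed, access patterns are few (< 5), and access is a fixed lookup by key.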

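Signal	Fitting data model	Corresponding case
Data is naturally polymorphic (different records carry different fields), schema gains and loses fields as the product evolves, aggregate root boundary is clear	Document (MongoDB / Cosmos DB SQL API / MongoDB API)	Toyota sensor schema evolving with vehicle model; Forbes CMS articles with polymorphic fields
Data shape is fixed, access patterns < 5, fixed lookup by key (meeting_id / message_id / user_id)	KV (DynamoDB / Cosmos DB Table API / persistent Redis variants)	Amazon Ads looks up by ad_id, Disney+ looks up the watchlist by user_id, PayPay looks up notifications by message_id
Data shape still being explored, access patterns change frequently, 5+ new query types expected within 6 months	Defer NoSQL selection; use PostgreSQL + JSONB as a bridge	A common reader misjudgment. The cases do not disclose it, but it follows from F1.3 / F1.6: NoSQL assumes stable access patterns, and adopting NoSQL before they stabilize hits the ceiling of single-table design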

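The "defer NoSQL" in the third row is a counter-indicator. The core assumption of NoSQL (DynamoDB single-table design in particular) is that access patterns are known at design time and change little afterwards. When the data model is still being explored and access patterns will grow or shrink substantially within six months, PostgreSQL + JSONB offers far more flexibility than NoSQL: JSONB columns can evolve, ad-hoc queries run in SQL, and it is not too late to choose NoSQL once stable access patterns have been identified.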

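Axis 2: access-pattern stability (pre-assessment of KV fit)

The core question for KV fit is how naturally uniform the partition key is. A non-uniform partition key turns the vendor's advertised "scale infinitely" into "scale until the hot partition": once traffic for a single logical key exceeds that partition's limit, requests are throttled or latency spikes (F1.1).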

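Naturally uniform PK + stable access patterns (meeting_id / player_id / message_id / user_id) → DynamoDB / Cosmos DB Table API fits, and the PK needs no composite-key patching. Amazon Ads sustains 90M reads/sec on ad_id, Zoom uses meeting_id, Capcom uses player_id, PayPay uses message_id, and Disney+ uses user_id. All five cases disclose the same frame: when the business naturally has a uniform key, KV is the most natural choice.
Naturally non-uniform PK (event_id concentrated on one concert, or date concentrated along the time series) → needs a composite key or write sharding as a patch. Tixcraft (9.C15) uses an event_id + user_id_hash composite key to spread a single hot concert's 6750x spike across partitions. That result comes from a uniformly distributed partition key, not from elasticity inherent to DynamoDB (F1.2).
Frequently changing access patterns (exploration phase, < 5 query types today with more to come) → not suitable for DynamoDB single-table design; go back to an RDB. Single-table design encodes access patterns into the PK / SK structure, so adding a new query means changing the schema, changing the schema means reloading data, and the cost does not pay off.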
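The composite-key patch described for the non-uniform case can be sketched as follows. This is a minimal illustration, not Tixcraft's implementation: the `event_id#bucket` key layout, the function names, and the bucket count of 16 are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def sharded_partition_key(event_id: str, user_id: str, shard_count: int = 16) -> str:
    """Build a composite partition key: event_id plus a stable hash bucket of user_id.

    The key layout and shard_count are illustrative assumptions.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % shard_count
    return f"{event_id}#{bucket:02d}"


def key_distribution(event_id: str, user_ids: list[str], shard_count: int = 16) -> Counter:
    """Count how many writes for one hot event land on each composite key."""
    return Counter(sharded_partition_key(event_id, u, shard_count) for u in user_ids)


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    dist = key_distribution("concert-2026", users)
    # Without the bucket suffix all 10,000 writes would share one partition key.
    print(len(dist), "keys; min/max writes per key:", min(dist.values()), max(dist.values()))
```

The trade-off is on the read side: fetching everything for one event now requires querying every bucket and merging the results, which is why this is a patch for write-heavy hot keys rather than a default.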

For further detail on assessing KV fit (hot-partition anti-patterns, composite key design, adaptive capacity), see DynamoDB partition key antipatterns.

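Axis 3: whether eventual consistency is acceptable

The core question for consistency is whether cross-partition or cross-region transactions are part of the product contract. All three vendors support strong consistency within a single partition or single region, but the mechanisms and limits for cross-partition and cross-region transactions differ widely.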

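Eventual / session consistency is acceptable: DynamoDB (eventually consistent reads by default, strong optional), Cosmos DB (5 consistency levels, Session by default), MongoDB (multiple read concern levels). All three work, and the choice depends on the other axes. Most KV / document workloads fall here (social timeline, watchlist, message queue, analytics aggregation).
Strongly consistent cross-partition transactions are required: DynamoDB limits cross-partition transactions (at most 100 actions per transaction, no cross-region support), MongoDB supports multi-document transactions from 4.0 (from 4.2 on sharded clusters) but sharded clusters still carry limitations, and Cosmos DB restricts transactions across logical partitions. None is as natural as SQL / distributed SQL; go back to the DB4 entry point and evaluate Aurora DSQL / Spanner / CockroachDB.
Cross-region active-active writes: the three mechanisms are entirely different. In Cosmos DB, multi-region write and Strong consistency are mutually exclusive settings (a hard CAP trade-off; see the Cosmos DB multi-region write conflict article, the SSoT for this topic). DynamoDB Global Tables use LWW (last-writer-wins) conflict resolution. MongoDB Atlas requires manual conflict handling across regions. The three do not sit on one spectrum, so read the mechanism section of each vendor outline before choosing.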
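To make the LWW behavior concrete, here is a minimal sketch of last-writer-wins merging. It illustrates the general technique only and is not DynamoDB's internal implementation; the `Version` type and the region-name tie-breaker are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # writer's wall-clock time at the moment of the write
    region: str        # used only as a deterministic tie-breaker (assumption)


def lww_merge(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp; break exact ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    us = Version("seat=A1", 1_000, "us-east-1")
    eu = Version("seat=B2", 1_005, "eu-west-1")
    # The earlier write is discarded without any error being raised to its writer.
    print(lww_merge(us, eu))
```

The point for selection is the silent loss: the write from `us-east-1` succeeded locally and then disappeared. Whether that is acceptable is a product-contract question, which is why this axis sits before the vendor comparison.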

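Three migration-path types (cross-case synthesized frame)

This section is a cross-case synthesized frame, not something a single case discloses. It is the common structure extracted from three cases: Coinbase (9.C36), Forbes (9.C37), and Microsoft 365 (9.C30) (F2.1).

Readers usually arrive with an existing system that is evolving, not a greenfield. The three migration paths differ completely in risk, ROI, and applicability, and choosing the wrong path leads to the wrong vendor.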

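Type 1: keep the original DB + add surrounding tools

No vendor change. Add a connection proxy (mongobetween / pgbouncer class), add a cache (Memcached + freshness token), and add predictive scaling. The primary data layer stays put; the application and ops layers are reinforced.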

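Representative case: Coinbase (9.C36) kept MongoDB Atlas, built mongobetween in-house to cut 60K connections/min to ~2K (more than an order of magnitude), used Memcached + freshness tokens to sustain 1.5M reads/sec, and used ML predictive scaling to cut scale-up time from 70 to 25 minutes and trigger it 60 minutes ahead
Path cost: medium (in-house tooling; needs engineering resources to build and operate the proxy, cache layer, and ML model)
Risk: low (the primary data layer is untouched, so rollback is cheap)
ROI: keeps the primary data schema and access patterns while resolving driver, deployment-model, and cache-consistency bottlenecks
Fits when: the MongoDB (or primary DB) data layer holds up, but the application layer is bottlenecked by connection storms, cache misses, or slow scale-up, and the team has the engineering capacity to build and maintain the surrounding tools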

For implementation detail, see MongoDB connection management (per-vendor article; cross-link pending).

Type 2: same DB, different hosting

Self-managed → managed (Atlas / Cosmos DB / DocumentDB), keeping the schema and access patterns, with a migration on the order of 6 months.

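Representative case: Forbes (9.C37) moved self-managed MongoDB → MongoDB Atlas, kept the CMS schema, migrated in 6 months, and disclosed a "25% TCO improvement"
Path cost: medium (dual-write + shadow-read verification, driver behavior differences, rewriting operation runbooks)
Risk: medium (consistency between the two writes during dual-write, choice of cutover moment)
ROI: operation transfer (DBA bandwidth freed for schema design and query tuning) + TCO improvement
Fits when: the self-managed ops burden is heavy (DBA bandwidth consumed by backup, patching, and replica lag) and the team does not want to change the data model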

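Scope warning (Forbes 25% TCO): the "25% TCO improvement" is a figure for Forbes's specific traffic scale (120M MAU, 70+ Atlas regions) and does not generalize. Cite it with its conditions; do not write it as a vendor-neutral conclusion such as "Atlas is 25% cheaper than self-managed". Actual savings depend on your current self-managed split across license, hardware, and ops hours, and on Atlas's pricing tier at your traffic scale.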

Type 3: change vendor, keep the model

MongoDB → Cosmos DB MongoDB API, or MongoDB → DocumentDB. The wire protocol and driver stay the same; the underlying architecture and the ops model are replaced entirely.

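Representative case: Microsoft 365 (9.C30) moved MongoDB → Cosmos DB MongoDB API and kept the MongoDB driver
Path cost: high (dual-write verification per query pattern; wire compatibility ≠ 100% identical behavior; aggregation pipeline and transaction behavior must be verified item by item)
Risk: high (every query pattern may hit an incompatible edge case; the cutover point is hard to choose)
ROI: the cross-vendor gains (Azure ecosystem, multi-model API, global distribution) while keeping application-layer driver code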

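Scope warning (Microsoft 365 dogfood): Microsoft 365 is Microsoft's own dogfood, and the case discloses no concrete throughput, latency, or cost figures (F2.17). Dogfood is a high-weight selection signal (the cloud vendor bets its own flagship product) but not a production benchmark (there are no public numbers to compare against). Cite it explicitly as a "dogfood signal", not as "production proof".

Scope warning (100% wire compat): the "100% wire compatibility" advertised for the Cosmos DB MongoDB API is vendor marketing; in practice it means "compatible under certain query patterns" (F2.9). Migration must be verified by dual-write per query pattern: not by reading the vendor's spec list, but by running the production query corpus and measuring actual behavior. The Phase 0 audit checklist should list unsupported aggregation stages, transaction edge cases, index behavior differences, and the mapping between change streams and Change Feed.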
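A per-query-pattern verification harness can be sketched as below. This is a hypothetical outline under stated assumptions: `run_old` and `run_new` stand in for whatever executes a query against the current and candidate back ends, the corpus is a list of `(pattern_name, query)` pairs sampled from production, and result documents are assumed to be flat.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable

Query = dict[str, Any]
Runner = Callable[[Query], list[dict]]  # hypothetical: executes one query, returns documents


def normalize(docs: list[dict]) -> list[str]:
    """Order-insensitive view of a result set (flat documents assumed)."""
    return sorted(repr(sorted(d.items())) for d in docs)


def shadow_compare(
    corpus: Iterable[tuple[str, Query]], run_old: Runner, run_new: Runner
) -> dict[str, dict[str, int]]:
    """Replay each query on both back ends and tally outcomes per query pattern."""
    report: dict[str, dict[str, int]] = defaultdict(
        lambda: {"match": 0, "mismatch": 0, "error": 0}
    )
    for pattern, query in corpus:
        expected = run_old(query)
        try:
            actual = run_new(query)
        except Exception:
            # An unsupported stage or operator typically surfaces as an error.
            report[pattern]["error"] += 1
            continue
        outcome = "match" if normalize(expected) == normalize(actual) else "mismatch"
        report[pattern][outcome] += 1
    return dict(report)
```

Because `normalize` ignores result order, queries whose contract includes a sort order need a stricter comparison; that choice is deliberately left out of the sketch.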

For the Cosmos DB MongoDB API vs SQL API choice, see Cosmos DB MongoDB API vs SQL API.

Type 4 is outside DB3 scope: paradigm shift to a different engine

KV → SQL or SQL → distributed SQL is a paradigm shift and belongs to the DB4 entry point: Aurora DSQL / Spanner / CockroachDB decision tree. This article covers selection within the three DB3 vendors and does not expand paradigm shifts.

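Fast path for readers arriving from an RDB wall

Readers who hit a PostgreSQL / Aurora connection limit and want to evaluate a KV alternative can route straight to the matching article by wall signal, without first completing the three-axis pre-assessment: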

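Hit the connection limit (hidden ceiling of a 1K-5K pool under surge, saturated by long-lived TCP connections) → use a KV with an HTTP API model (no long-lived connections) directly as a write buffer; go to the "durable queue / write buffer" section of dynamodb/single-table-design-pattern (the Tixcraft 9.C15 path: DynamoDB accepts orders, traditional servers consume slowly), or evaluate Cosmos DB Table API
Hit the single-primary write ceiling (single-leader write throughput limit that read replicas cannot share) → multi-primary distributed SQL path; go to Path A of the DB4 entry point: Aurora DSQL / Spanner / CockroachDB decision tree (DoorDash hit the single-primary write wall at 1.636 M QPS)
Hit a single DB that cannot cope + multiple workload shapes coexisting (read-heavy, write-heavy, and analytics mixed in one DB) → federated DB model; see 9.C36 Coinbase (MongoDB + DynamoDB + Memcached + mongobetween) and 9.C29 Lemino (PostgreSQL → DynamoDB, which discloses the RDB connection limit as a hidden bottleneck)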

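Before entering dynamodb/single-table-design-pattern, confirm the access-pattern stability and natural PK uniformity from axis 1 / axis 2. The connection-limit signal is necessary but not sufficient: the four-axis KV-fit assessment still has to be completed, to avoid the anti-pattern of forcing unstable access patterns into a single table just to solve a connection problem.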

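Federated DB + system role view (cross-case synthesized frame)

This section is also a cross-case synthesized frame (F2.18 + F1.6). The three rich cases (Coinbase, Toyota, Forbes) all disclose that a production system is a combination of a DB and surrounding tools, not a single monolithic DB.

Readers often assume "use X for everything" is the right answer: all MongoDB, or migrate everything to DynamoDB, or switch everything to Cosmos DB. Real production cases disclose two earlier facts: (a) production systems are federated (multiple DBs split by workload), not monolithic; (b) each vendor plays a specific role in the system (control plane vs data plane vs cache), not an all-purpose store.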

Federated DB by workload

Coinbase (9.C36) production setup: MongoDB Atlas (primary document data, identity service) + DynamoDB (some fixed KV workloads) + Memcached (read cache) + mongobetween (connection proxy) + Kinesis (event stream). It is neither "all MongoDB" nor "everything migrated to DynamoDB"; traffic is split by workload shape.

Toyota Connected (9.C38): 20 MongoDB Atlas DBs (microservices split the blast radius) + Lambda + Kinesis + Redis + Kubernetes. The 20 DBs exist not because throughput could not be handled (18B txn/month ≈ 7K txn/sec, which a single cluster can sustain) but because of microservice ownership and blast-radius partitioning (F2.6).
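The monthly-to-per-second conversion behind the "≈ 7K txn/sec" figure can be checked with a few lines. This is a sketch that assumes a 30-day month and yields an average rate; peak traffic will be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_sec(total_per_month: float, days_in_month: int = 30) -> float:
    """Convert a monthly total into an average per-second rate (30-day month assumed)."""
    return total_per_month / (days_in_month * SECONDS_PER_DAY)


if __name__ == "__main__":
    # 18 billion transactions per month, as cited for Toyota Connected.
    print(round(average_rate_per_sec(18e9)))
```

Since this is an average, any sizing conclusion drawn from it should apply a peak-to-average factor appropriate to the workload.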

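Forbes (9.C37): MongoDB Atlas + an intermediate abstraction layer + 50+ microservices. The abstraction layer isolates schema changes so that 50 services do not all depend on DB schema details (F2.3).

The frame common to all three cases: designing a production system on the assumption that "one DB service handles everything", while ignoring the cross-layer responsibilities of cache, queue, proxy, and abstraction layer, runs into hidden bottlenecks such as connection limits, cache misses, and cross-region replication.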

System role: control plane vs data plane

DynamoDB sustaining "nearly infinite" load in surge scenarios is not magic inherent to DynamoDB; it is the result of decoupled system architecture (F1.6):

Control plane (metadata, state, user records): DynamoDB / MongoDB / Cosmos DB fit; traffic is a small-payload, high-QPS pattern
Data plane (video, large BLOBs, media streams): CDN / S3 / object storage, outside DB3 scope; traffic is large-payload and bandwidth-bound
Cache layer: Redis / Memcached / DAX (DynamoDB's companion); forms a cross-layer architecture with the primary DB and handles read peaks plus read-your-own-write consistency

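Three cases disclose the same frame: Zoom sends video metadata to DynamoDB and audio/video to WebRTC / edge servers; Disney+ sends the watchlist to DynamoDB and video streaming to CDN + S3; Capcom sends game state to DynamoDB + DAX and runs game servers on EKS. Pushing media streams into DynamoDB violates the control plane vs data plane separation and makes capacity planning wrong (a 1 KB KV item and a 100 MB media chunk are different workloads).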

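Three-vendor comparison on 10 axes

The table below compares the three vendors on 10 axes at the selection stage. Each axis has a per-vendor deep article that expands the mechanism; this article does not repeat it.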

Axis	MongoDB	DynamoDB	Cosmos DB
Core data model	Document (aggregate root) + aggregation pipeline	KV with optional document fields + GSI / LSI	Multi-model (SQL / MongoDB / Cassandra / Gremlin / Table API)
Deployment topology	Cross-cloud (Atlas on AWS / GCP / Azure) + self-hosted	AWS-only managed	Azure-only managed
Cross-cloud hedging	High (Atlas is cross-cloud; Forbes case)	None (AWS lock-in)	None (Azure lock-in)
Capacity abstraction	Three axes: CPU + IOPS + working-set RAM	WCU/RCU + on-demand/provisioned + adaptive capacity	RU (Request Unit) + 5 consistency levels
Contract layer	DB-layer `$jsonSchema` validator / app-layer abstraction / hybrid	DynamoDB Streams + app-layer validator	DB-layer stored procedure + app-layer validator
Partition / shard key reversibility	Changeable with `reshardCollection` (5.0+), at high cost	Changeable through backfill	Not changeable; export-recreate required
Consistency model	Read concern (local / majority / linearizable) + causal consistency session	Eventually / strongly consistent reads	5-level spectrum (Strong / Bounded staleness / Session / Consistent prefix / Eventual)
Multi-region write	Atlas cross-region with manual conflict handling	Global Tables LWW	Multi-region write (mutually exclusive with Strong; see cosmosdb/multi-region-write-conflict SSoT)
Dogfood signal	None (MongoDB is an independent company; not applicable)	Heavy in-house use at Amazon (9.C5 Amazon Ads / 9.C27 Disney+ etc.)	Microsoft 365 dogfood (9.C30; scope warning: dogfood numbers are not public, so it is a selection signal, not a benchmark)
Multi-model differentiation	Single document model	Single KV-with-document model	The only single service supporting 5 APIs (differentiating value, F2.16)

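Extended notes on the axes

Deployment topology / cross-cloud hedging: topology is a hard trade-off between vendor lock-in and cross-cloud flexibility. Forbes chose Atlas not to save money today (self-managed MongoDB would also work; the TCO improvement is a side effect) but as a hedge while its future cloud strategy was undecided: Atlas deploys on AWS, GCP, and Azure, so changing cloud later does not mean changing DB (F2.10). By contrast, DynamoDB, Cosmos DB, Spanner, and Aurora are all single-cloud lock-in; choosing one means following that cloud vendor's ecosystem. When the team's cloud strategy is settled (deep use of one of AWS / Azure / GCP), a single-cloud vendor is usually the better deal (better IAM integration, deeper ops tooling, a single support channel). Cross-cloud value really holds when the strategy is uncertain or compliance requires multi-cloud.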

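Capacity abstraction: the cost of shifting the team's mental model to a vendor's capacity abstraction can exceed the price difference the vendor advertises (F2.12). MongoDB uses three-axis thinking (CPU + IOPS + working-set RAM), similar to self-managed PostgreSQL / MySQL, so the team's conversion cost is low. DynamoDB uses the WCU/RCU abstraction, which requires learning to estimate how many units each operation consumes, plus choosing among on-demand, provisioned, and adaptive capacity. Cosmos DB uses the Request Unit (RU) abstraction: 1 RU ≈ the cost of a point read of a 1 KB document, a write costs ~5 RU, and a complex query costs hundreds of RU. Engineers must learn to think in RU rather than in CPU, so the team's knowledge-migration cost can be high. For capacity planning, see each vendor's sizing article.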
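As a rough illustration of "thinking in RU", the rule of thumb in the paragraph above (about 1 RU per 1 KB point read, about 5 RU per 1 KB write) can be turned into a first-pass throughput estimate. The linear scaling with document size is a simplifying assumption of this sketch; real RU charges also depend on indexing, consistency level, and query shape.

```python
import math


def estimate_ru_per_sec(
    reads_per_sec: float,
    writes_per_sec: float,
    doc_kb: float = 1.0,
    read_ru_per_kb: float = 1.0,   # rule of thumb from the text above
    write_ru_per_kb: float = 5.0,  # rule of thumb from the text above
) -> float:
    """First-pass RU/s estimate for point reads and writes (linear in size: assumption)."""
    size_units = math.ceil(doc_kb)
    read_ru = reads_per_sec * size_units * read_ru_per_kb
    write_ru = writes_per_sec * size_units * write_ru_per_kb
    return read_ru + write_ru


if __name__ == "__main__":
    # 2,000 point reads/s and 500 writes/s of 1 KB documents.
    print(estimate_ru_per_sec(2_000, 500))
```

An estimate like this is only a starting point for a sizing conversation; measured RU charges from a representative query corpus should replace it before provisioning.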

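Partition / shard key reversibility: the three vendors do not sit on one spectrum, and this is a key item in the access-pattern audit that must precede vendor selection (F2.15). MongoDB's reshardCollection (5.0+) allows a change, but at high cost: it needs a maintenance window or a long-running background migration. DynamoDB's partition key cannot be altered in place; in practice it is changed through backfill (create a new table with the new PK, dual-write to old and new, cut over), which is a lot of ops work but reversible. Cosmos DB's partition key cannot be changed; changing it amounts to export-recreate-import, which for 1 TB+ of data is a large migration project. Irreversibility increases across the three, so choosing Cosmos DB requires a complete access-pattern audit up front; "ship to production and tune later" is not an option.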

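Consistency model: the three mechanisms follow different design philosophies. MongoDB's read concern is a per-operation choice (one client connection can mix levels). DynamoDB's strong vs eventual is a per-read option (the write side is uniformly strongly consistent). Cosmos DB's 5 levels are an account-level default with per-request override, and Strong is mutually exclusive with multi-region write (a hard CAP constraint). By design MongoDB is the most flexible, Cosmos DB the most explicit, and DynamoDB sits in between. For mechanism detail, see Cosmos DB consistency levels engineering and Cosmos DB multi-region write conflict (the SSoT for this topic).

Multi-model differentiation: Cosmos DB is the only cloud-vendor DB that supports 5 APIs in a single service (SQL / MongoDB / Cassandra / Gremlin / Table). By contrast, AWS covers the space with multiple products (DynamoDB for KV, DocumentDB for MongoDB compatibility, Neptune for graph, Keyspaces for Cassandra compatibility), and so does GCP (Firestore + Spanner + Bigtable). The differentiating value of multi-model is less multi-DB operations: a product team maintains one service, one IAM setup, one backup / DR setup, and one monitoring setup. Whether multi-model is really used depends on the team's actual workload: most teams use only 1-2 APIs, and single-model competitors (DynamoDB / MongoDB) may be more focused (F2.16).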

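Failure modes (cross-vendor anti-patterns)

The seven items below are anti-patterns that occur with all three vendors and are common across cases. Vendor-specific anti-patterns (for example, hidden hot partitions under DynamoDB on-demand, or three schema generations coexisting in MongoDB) are covered in the per-vendor deep articles.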

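Anti-pattern 1: treating DynamoDB as OLTP

Signals: access patterns are still in the exploration phase, 5+ query types are still to be added, and strongly consistent cross-partition transactions are part of the product contract. Go back to PostgreSQL / Aurora rather than doubling down on DynamoDB single-table design.

Correct uses of DynamoDB include control-plane KV (Zoom / Disney+ / Capcom) and durable queue / write buffer (the non-OLTP positive use case disclosed by Tixcraft 9.C15, F1.3). DynamoDB accepts the "order" write, which does not take effect immediately; traditional servers (payment, ticket inventory) consume it at a pace they can sustain. This decoupling lets "the front end scale 130x without the back end scaling in step".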

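Anti-pattern 2: treating MongoDB as KV

Signals: access patterns are fixed, the PK is naturally uniform, no aggregation pipeline is needed, and documents are never traversed internally (only root fields are queried).

Switch to DynamoDB / Cosmos DB Table API. MongoDB's overhead in this scenario (document overhead, connection model, an unused aggregation engine) does not pay off: a KV vendor has lower per-item read/write cost and a simpler scaling model.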

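Anti-pattern 3: treating Cosmos DB as a cross-cloud service

Signals: the team is evaluating multi-cloud DR or cross-cloud portability, sees "global distribution" emphasized in the Cosmos DB documentation, and assumes it supports cross-cloud.

Cosmos DB is Azure-only; global distribution means across regions within Azure. For cross-cloud, use MongoDB Atlas instead. Multi-model differentiation is a value inside the Azure ecosystem (F2.16): once you leave Azure, none of Cosmos DB's unique advantages remain.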

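Anti-pattern 4: assuming "use X for everything" instead of a federated DB

Signals: the architecture design assumes "one DB service handles everything" and plans no cache, queue, proxy, or abstraction layer.

Real production systems are federated (Coinbase, Toyota, and Forbes all are). Assuming one DB handles everything runs into hidden bottlenecks: connection limits (the first thing to blow on an RDB under surge, F1.7), cache misses (the DB alone cannot sustain read peaks), and cross-region replication (cross-region consistency handled incorrectly). Design the federated topology and cross-layer responsibilities up front rather than patching afterwards.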

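Anti-pattern 5: misreading dogfood case numbers

Signals: citing dogfood cases such as Microsoft 365 or Amazon Prime Day as production benchmarks and copying their figures as a sizing basis.

Dogfood case numbers are often not public or not applicable to customer-facing use (F2.17 + F1.10). Amazon Prime Day's "90M reads/sec" is the highest single second of the annual peak, not an average; Microsoft 365 gives no numbers at all; Google Spanner's "1 billion req/sec" is the sum across all Google users, not a single-customer quota. When citing them in an architecture document, state explicitly whether the figure is a selection signal (the cloud vendor stakes its own business, which deserves high weight as a vendor signal) or a production benchmark (concrete sizing numbers). The two must not be conflated.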

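Anti-pattern 6: discovering in production that the partition key is irreversible

Signals: Cosmos DB / DynamoDB was chosen without a complete access-pattern audit of the partition key design; after some time in production a hot partition appears and the team wants to change the PK.

The three vendors do not sit on one spectrum (see the comparison table above): MongoDB's shard key can be changed from 5.0 but at high cost, DynamoDB's can be changed through backfill, and Cosmos DB's cannot be changed and requires export-recreate. Before choosing Cosmos DB, complete the access-pattern audit up front: list every expected query with the access frequency of its PK, and confirm that traffic for the hottest PK stays within a single partition's capacity limit (F2.15).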

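Anti-pattern 7: treating wire compatibility as 100% identical behavior

Signals: Cosmos DB MongoDB API or DocumentDB was chosen; on seeing "MongoDB compatible", the team assumes that a working MongoDB driver means compatibility and skips query-pattern verification.

Wire compatibility ≠ 100% identical behavior (F2.9). The "100% wire compatibility" advertised for the Cosmos DB MongoDB API is marketing; in practice it means "compatible under certain query patterns". Unsupported aggregation pipeline stages, transaction edge-case differences, and index behavior differences will all be encountered. Migration must be verified by dual-write per query pattern, not by reading the vendor's spec list.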

Signals that DB3 is the wrong choice (paths up to SQL / distributed SQL)

When any of the four signals below appears, the choice should leave DB3 scope.

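JOIN-heavy + strongly normalized workload: stay on PostgreSQL (including a PostgreSQL + JSONB hybrid) rather than pushing the data into NoSQL and using $lookup. The aggregation pipeline's $lookup performs far worse than a SQL JOIN and has additional restrictions on sharded clusters.
Strongly consistent cross-region transactions are part of the product contract: go to the DB4 entry point and evaluate distributed SQL (CockroachDB / Spanner / Aurora DSQL). Cross-region transactions in all three NoSQL vendors carry limitations and should not be the main path.
High traffic + cross-business fleet governance: the Aurora 200-cluster model (the business-sharding fleet disclosed by 9.C4 DraftKings) may fit better; see Aurora fleet governance. NoSQL fleet-governance tooling (cluster lifecycle, cross-cluster query, unified IAM) is usually less mature than managed SQL.
Data model still being explored + fast-changing access patterns: defer NoSQL selection and use PostgreSQL + JSONB as a bridge. JSONB gives document-like flexibility, SQL gives ad-hoc query power, and it is not too late to choose NoSQL once stable access patterns have been identified.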

Next-step routing (per-vendor outline groups)

Once the reader has identified the workload type (axes 1-3), the migration path (three types), and the system role (federated / control plane), continue into the matching per-vendor group.

MongoDB group

Getting started: schema design pattern (three options for the contract layer: DB-layer validator / app-layer abstraction / hybrid)
Capacity: shard key selection (single-cluster vs multi-cluster blast radius; the Toyota 20-DB model)
Migration: migrate to Atlas (the same-DB, different-hosting type)

DynamoDB group

Getting started: single-table design pattern (access-pattern design + pre-assessment of fit)
Mechanism: consistency model optimization (strong vs eventually consistent trade-offs)

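Cosmos DB group

Getting started: MongoDB API vs SQL API (API model selection, four-layer framing)

Cross-layer architecture (federated DB / cache / proxy)

For cross-layer architecture, see the corresponding per-vendor connection management / cache layer articles (to be written). This article only touches on it in axis 2 and the federated frame and does not expand the mechanisms.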

Entering DB4 evaluation

If you need strongly consistent cross-region SQL or a paradigm shift (KV → distributed SQL or SQL → distributed SQL), go to the DB4 entry point: Aurora DSQL / Spanner / CockroachDB decision tree.

Knowledge card routing

Knowledge cards referenced by this article:

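document-store: core concepts of the document model and the aggregate root boundary
hot-partition: the per-partition capacity limit mechanism in KV vendors
database-sharding: shard key and partition key design
consistency-level: trade-offs among strong / eventual / session
vendor-lock-in: the hedging trade-off between single-cloud and cross-cloud
distributed-sql: the conceptual entry point for leaving DB3 and entering DB4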

MongoDB

Wed, 13 May 2026 00:00:00 +0000

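MongoDB is the de facto standard among document databases. Schema flexibility, the aggregation pipeline, and cross-cloud managed hosting (Atlas) make it the default choice for many startups. Large platforms such as Microsoft 365, early Disney+, and Uber started on MongoDB and later, under workload pressure, migrated some paths to KV or cloud-vendor-specific services (Cosmos DB, DynamoDB).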

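Learning path: document shape and schema governance

The teaching goal of the MongoDB service page is to place the document model, schema flexibility, indexes, the aggregation pipeline, and sharding back in the context of governing data shape. After reading, the reader should be able to judge whether the data suits an aggregate root and understand how schema governance affects long-term maintenance cost.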

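Learning stage	Core question	Corresponding sections
Document shape	Which data suits an aggregate root and nested documents	Positioning, Suitable scenarios
Schema governance	How schema flexibility pairs with validation, versioning, and migration	Capacity planning essentials, Planned implementation topics
Query / index	How indexes, the aggregation pipeline, and ad-hoc queries affect cost	Capacity characteristics, Common pitfalls
Sharding	How shard key, chunks, and the balancer turn data shape into a capacity problem	Capacity planning essentials, Database Sharding
Alternative routing	When to move to PostgreSQL, DynamoDB, Cosmos DB, or search	Unsuitable scenarios, Trade-offs against other vendors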

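Positioning: JSON documents + cross-cloud flexibility

MongoDB is a DB built around the document model. PostgreSQL JSONB suits "SQL first, with a few semi-structured columns"; MongoDB puts BSON documents, the aggregation pipeline, database sharding, and schema governance at the core of its design. Recent versions added time series, change streams, queryable encryption, CSFLE, and similar capabilities.

The core reasons to choose MongoDB: the document model is the primary use case, cross-cloud managed hosting (Atlas) is needed, and the team wants to avoid vendor lock-in (self-managing is also possible).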

Capacity characteristics

Single-instance throughput:

Typical m5.4xlarge: 5K-15K WPS (writes per second; depends on document size and indexes)
High-end instance + tuning: 30K-50K WPS
Beyond this level → sharding

Sharding:

MongoDB natively supports sharded clusters
mongos router + config servers + shards
MongoDB sharding requires actively designing the shard key and considering it together with Hot Partition risk

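Replication:

Replica set (primary + secondaries, async)
Cross-region is usually async
Automatic failover < 30 seconds (built into mongod)

Storage:

A single collection has no official size limit, but changing the shard key was major surgery in older versions (4.4 added refineCollectionShardKey; 5.0+ supports reshardCollection)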

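Suitable scenarios

1. Document model as the primary workload:

Early-stage products whose schema changes frequently
Nested documents that naturally express the domain model (an order with multiple items, a user with multiple preferences)
Corresponding case: 9.C30 Microsoft 365, which migrated from MongoDB to Cosmos DB MongoDB API and kept the document model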

2. Aggregation-pipeline-heavy workload:

Complex $group / $match / $project chains
Reports, analytics, ETL prep
More intuitive than writing complex queries in an RDBMS (for some teams)

3. Cross-cloud managed (Atlas):

MongoDB Atlas spans AWS / GCP / Azure
In contrast to DynamoDB (AWS only), Cosmos DB (Azure only), and Spanner (GCP only)
Suits a multi-cloud strategy and avoids single-vendor lock-in

4. Time series workload (5.0+):

Dedicated optimizations for time series collections
InfluxDB / TimescaleDB remain the more specialized choices, however

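5. Existing MongoDB ecosystem + desire to transfer operational responsibility:

Atlas provides backup, failover, monitoring, and auto-scaling
The team wants to hand MongoDB DBA / SRE operational responsibility to Atlas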

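Unsuitable scenarios

1. Strong ACID multi-document transactions:

MongoDB transactions support multiple documents, but crossing shards affects performance
High-frequency financial transactions are still better served by SQL systems
Alternatives: PostgreSQL, Aurora, Spanner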

2. Complex JOINs:

MongoDB $lookup suits small amounts of adjacent data; JOIN-heavy workloads should go back to SQL systems
At schema design time, denormalize common read paths into the document shape
Alternative: SQL systems for JOIN-heavy workloads

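3. Pure KV + sub-ms latency:

MongoDB's document model adds a layer of BSON parsing compared with KV
Alternatives: Redis, DynamoDB, Bigtable

4. Large-scale OLAP:

Aggregation is adequate for medium data volumes but unsuitable at TB scale
Alternatives: ClickHouse, BigQuery, Spark on Delta Lake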

5. Strict data model + schema enforcement:

MongoDB's schema flexibility can lead to data inconsistency in production
Alternative: a SQL DB (enforced schema) + JSONB columns for semi-structured data

Trade-offs against other vendors

vs Cosmos DB MongoDB API:

MongoDB Atlas: cross-cloud, native MongoDB behavior
Cosmos DB MongoDB API: Azure-only, global distribution + 5 consistency levels
Choose MongoDB Atlas for: cross-cloud, native MongoDB features
Choose Cosmos DB for: the Azure ecosystem, stronger global distribution
Corresponding case: 9.C30 Microsoft 365, which migrated from MongoDB to Cosmos DB MongoDB API, mainly keeping the document model

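vs DynamoDB:

MongoDB: document model, strong aggregation, cross-cloud
DynamoDB: KV / single-table design, AWS integration, five-nines SLA (for Global Tables; 99.99% for single-region tables)
Choose MongoDB for: document-first, cross-cloud
Choose DynamoDB for: KV-first, the AWS ecosystem
See the comparison section of the DynamoDB vendor page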

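vs PostgreSQL JSONB:

MongoDB: document-first, schema-less
PostgreSQL: SQL-first, JSONB as a supplement
Choose MongoDB when: documents make up most of the schema
Choose PostgreSQL JSONB when: data is mostly structured with a few semi-structured columns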

vs Couchbase / CouchDB / Firestore:

Couchbase: a MongoDB alternative with N1QL (SQL-like)
CouchDB: leans small-scale, master-master replication
Firestore: GCP-only, realtime updates
Within this group MongoDB has the broadest ecosystem

vs Elasticsearch as a search alternative:

They belong to different categories: MongoDB is OLTP / document, Elasticsearch is search + analytics
Usually used together: MongoDB as primary, Elasticsearch for full-text search

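Capacity planning essentials

1. Shard key design is the lifeline:

As critical as the DynamoDB partition key
Non-uniform → hot shard, and actual capacity falls short of nominal capacity
5.0+ can reshard, but it is still major surgery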

2. The replica set is the basis of HA:

At least 3 members (1 primary + 2 secondaries)
Secondaries can serve reads (read preference), but watch replication lag
Failover usually takes < 30 seconds

3. Atlas managed service:

Provides auto-scaling, automatic backup, and cross-cloud deployment
Tiers range from M0 (free) to M700 (high-end)
Atlas Online Archive automatically moves old data to cheaper storage

4. Index limits:

At most 64 indexes per collection
Compound indexes are order-sensitive ({a:1, b:1} differs from {b:1, a:1})
TTL indexes automatically expire outdated documents
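The order sensitivity of compound indexes follows from the prefix rule: a query can use an index only through its leading fields. The sketch below models just that rule for equality predicates; it is a deliberately simplified illustration (it ignores sort, range predicates, and the query planner's actual cost model), and the function name is an assumption.

```python
def usable_prefix(index_fields: list[str], equality_fields: set[str]) -> list[str]:
    """Return the leading index fields that a query's equality predicates can use.

    Simplified model of the prefix rule; not the query planner's real logic.
    """
    prefix: list[str] = []
    for field in index_fields:
        if field not in equality_fields:
            break  # the first gap ends the usable prefix
        prefix.append(field)
    return prefix


if __name__ == "__main__":
    # A query filtering only on b can use {b:1, a:1} but not {a:1, b:1}.
    print(usable_prefix(["a", "b"], {"b"}))
    print(usable_prefix(["b", "a"], {"b"}))
```

In practice, confirming index usage means reading the query's explain output rather than reasoning from the rule alone.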

5. Change streams (CDC):

Native change streams are available from 3.6 (4.0 extended them to database and cluster scope)
Connect to Kafka / an event bus for event sourcing

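Anti-recommendations and upgrade routing

MongoDB's schema flexibility lowers early modeling cost and also defers schema governance to production. This section first covers when to stay with the document model, then when to upgrade to Atlas, sharding, Cosmos DB, DynamoDB, or SQL.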

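Mechanism / route	Conditions for keeping the simple design	Upgrade signals	Main reference paths
Single replica set	Document size is stable, working set is under control, primary write capacity is sufficient	Storage / write / working set near the limit, failover drills insufficient	Replication Lag, RPO
Atlas managed	The team can still manage backup, upgrade, monitoring, and scaling	The team wants to hand DBA / SRE responsibility to the platform; cross-cloud deployment and backup become the main pressure	Audit Log, Secret Management
Sharded cluster	A single replica set can still carry capacity and maintenance windows	Shard key is stable, data is divisible by tenant / user / region, hot shards are observable	Database Sharding, Hot Partition
Cosmos DB MongoDB API	Azure is only a deployment option and native MongoDB behavior still matters	Azure global distribution, multi-region write, or RU governance becomes a theme	Cosmos DB vendor
DynamoDB / KV	Queries still need document traversal and aggregation	Access patterns are fixed; sub-10ms p99 and connection-free scaling become themes	DynamoDB vendor
PostgreSQL	Documents are the primary data shape	JOIN-heavy, transaction-heavy, schema constraints are the main value	PostgreSQL vendor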

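The simple path with MongoDB is to write down the document boundary clearly first. Data can evolve flexibly, but the application still has to know which fields are the formal contract and which exist only for a compatibility period, and it must keep version drift in check with validation, migration, and data quality checks.

The upgrade path to sharding should wait until the shard key and query shape are stable enough. Sharding too early amplifies aggregation, transaction, and index costs ahead of time; sharding too late pushes resharding, chunk migration, and balancer pressure into production peak periods.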

Deep articles (completed)

This batch of 6 deep articles is complete and covers MongoDB's core production topics, from schema design to cross-layer production architecture:

Topic	Article	Corresponding production topic
Whether the schema contract belongs in a DB-layer validator or an app-layer abstraction	schema-design-pattern	Toyota polymorphic governance, Forbes abstraction layer
Shard key selection + single-cluster vs multi-cluster blast radius	shard-key-selection	Toyota 20-DB blast radius, reversibility compared with DynamoDB
Read preference + causal sessions with cache-layer freshness tokens	replica-set-read-preference	Combining DB-layer and cache-layer read-after-write consistency
Aggregation pipeline ordering / indexes / memory boundary	aggregation-pipeline-optimization	Governing the anti-pattern of report dashboards overloading the primary
Change streams resume tokens + Kafka connector governance	change-streams-kafka	At-least-once semantics + idempotency + protection against resume token expiry
Driver × deployment × cache × predictive scaling: three-layer cooperation	connection-management-and-cache-layer	Coinbase's trio of mongobetween + freshness tokens + ML predictive scaling

Cross-vendor entry: read DB3 vendor selection first (three-way MongoDB / DynamoDB / Cosmos DB selection + workload shape pre-assessment), then enter this vendor's deep articles.

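Future additions (still to be written)

Index design and covering indexes
Migrating from self-managed MongoDB to Atlas
Migrating from MongoDB to Cosmos DB MongoDB API (keeping the document model)
Migrating from MongoDB to DynamoDB (access patterns need redesign)
Queryable Encryption and CSFLE (client-side field level encryption)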

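Case comparison

Case	Relationship to MongoDB
9.C30 Microsoft 365	Migrated from MongoDB to Cosmos DB MongoDB API; planet-scale analytics
9.C36 Coinbase	MongoDB as the primary data layer; built mongobetween in-house to solve the Ruby connection explosion; users service at 1.5M reads/sec
9.C37 Forbes	Self-managed MongoDB → Atlas on GCP; migrated in 6 months; build time 25→9 minutes; 120M MAU
9.C38 Toyota Connected	Atlas carries telematics for 9 million vehicles; 18 billion transactions per month; emergency signals reach an agent within 3 seconds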

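MongoDB cases are read in three groups:

Continuing to evolve as the production lead (Coinbase, Toyota Connected): the document model carries core OLTP / IoT, paired with connection proxy, cache, and event-driven processing to extend the periphery.
Self-managed → managed migration (Forbes): same document model, different hosting model; ROI concentrates on transferring DBA responsibility and on cross-cloud flexibility, not on performance.
Leaving MongoDB while keeping the API (Microsoft 365): the document model is kept, the underlying engine moves to Cosmos DB MongoDB API, in exchange for Azure global distribution.

When reading a case, distinguish which of the three positions MongoDB occupies (lead / migrated into / migrated out of); the engineering issues each position reveals are entirely different.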

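Common pitfalls

Staying schema-less long term: data inconsistency appears in production and querying becomes hard
Using _id (monotonically increasing) as a ranged shard key: all writes concentrate on the last shard
Overusing $lookup: cross-collection JOIN-heavy workloads should be denormalized at schema design time or moved back to SQL
Too many indexes: write throughput is dragged down; review unused indexes regularly
Reading from secondaries without checking lag: users read stale data
Not planning the Atlas tier upgrade path: the team discovers the tier cannot keep up only when traffic arrives, and emergency upgrades are expensive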

Next-step routing

Full T1 comparison: 01-database vendors index
Parallel: Cosmos DB vendor (MongoDB API replacement), DynamoDB vendor (KV alternative)
Upstream: 1.2 schema design, 1.10 KV / Document DB capacity planning
Downstream: 1.12 Large-scale DB migration in practice (MongoDB migration-out example)
Cross-module: 9.6 Capacity planning model, 9.4 Saturation Discovery (shard key and hot shard)
Official: MongoDB Manual, MongoDB Atlas

MongoDB → Atlas: Atlas is not MongoDB + managed, it is a different product

Tue, 19 May 2026 00:00:00 +0000

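This article is a cross-vendor migration playbook that cross-links to MongoDB and MongoDB Atlas. It is a worked example of the standard form of the Type C operational redesign hybrid in the Migration playbook methodology. Each phase transition is guarded by a migration gate: the verification conditions between the 4 phases are the gates.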

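Atlas is not MongoDB + managed, it is a different product

The framing "MongoDB Atlas is the managed version of MongoDB" looks reasonable but misleads:

Protocol compatible: the MongoDB wire protocol is identical, drivers need no change, and mongosh connects just as it does to self-managed
Storage identical: same WiredTiger storage engine, same document model
API identical: the aggregation framework, indexing, and change streams are all the same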

But the operational surface is completely different:

Operational concept	Self-managed MongoDB	Atlas
Cluster bootstrap	mongod + replica set config + config servers + shards, set up manually	One-click cluster creation through UI / API, fully automated
HA	Self-managed replica set + arbiter + priority	Automatic cross-AZ replicas + automatic failover
Backup	Self-managed mongodump + S3 archive	Built-in cloud backup + PITR (configured per region)
Network access	Self-managed VPC + security group + IP whitelist	Atlas private endpoint / VPC peering / IP access list
Authentication	Self-managed mongod internal users / x.509	Atlas Database User + LDAP / SSO / AWS IAM integration
Monitoring	Self-deployed Prometheus + Grafana	Built-in Atlas Performance Advisor + APM
Sizing	Manual instance class + scaling	Auto-tier scaling + tier-based pricing
Patching	Manual + outage window	Automatic (configurable maintenance window)

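The bulk of the migration work is not in the data layer, which the protocol-level drop-in already covers. It is replacing the whole operational stack: SRE runbooks, monitoring dashboards, access control, IAM integration, and cost estimates all have to be redone. The "Atlas is managed MongoDB" framing underestimates the operational workload.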

Running the diff dimension audit:

Dimension	Assessment	Level
Schema / API	MongoDB protocol / API fully compatible	Low
Operational model	HA / backup / monitoring / IAM / network all replaced	High
Abstraction / paradigm	Same document DB	Low
Number of components	Same single cluster	Low
Application change	Connection string / IAM integration change; application logic unchanged	Low/Medium

The dominant dimension is Operational = High, with Schema and Paradigm both Low, which maps to the Type C operational redesign hybrid.

Structure: 4-phase operational + drop-in cutover

Aligned with the PostgreSQL → Aurora structure (same Type C):

Phase 0: Pre-migration audit (1-2 weeks)
  - Workload sizing (IOPS / connection / storage)
  - Application connection pattern audit
  - Compliance requirement audit

Phase 1: Operational infrastructure preparation (2-3 weeks)
  - Atlas cluster creation
  - VPC peering / private endpoint
  - IAM role + Atlas Database User
  - Monitoring + alert
  - Backup retention settings

Phase 2: Data migration (duration depends on dataset size)
  - mongomirror / Atlas Live Migration tool
  - or mongodump → mongorestore (small DBs)

Phase 3: Cutover and verification

Phase 4: Cleanup (self-managed decommission)

4-12 weeks overall, depending on dataset size and the complexity of organizational process.

Phase 0: Pre-migration audit

Workload sizing → Atlas tier

Self-managed observations:
- Peak IOPS: 8000
- P99 read latency: 5ms
- Connection count peak: 1500
- Storage: 800GB
- Cross-region replication needed: yes

Atlas tier mapping:
- M40 (8 vCPU, 16GB RAM): IOPS 3000, insufficient
- M60 (16 vCPU, 64GB RAM): IOPS 6000, borderline
- M80 (32 vCPU, 128GB RAM): IOPS 9000, safe (chosen)
- Storage: 1TB tier (enough for 800GB + 25% buffer)
- Cross-region replication add-on

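Atlas does not offer free-form instance classes; it offers fixed tiers. When a workload straddles a tier boundary, choose the next tier up rather than pushing the lower tier to its limit.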
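The "next tier up" rule can be expressed as a small lookup. The tier/IOPS pairs below are copied from the sizing example above and are illustrative working values, not Atlas's published limits; the function names and the optional `headroom` parameter are assumptions for this sketch.

```python
from typing import Optional

# (tier, IOPS) pairs taken from the sizing example; illustrative, not published Atlas limits.
TIER_IOPS = [("M40", 3000), ("M60", 6000), ("M80", 9000)]


def pick_tier(peak_iops: int, headroom: float = 0.0, tiers=TIER_IOPS) -> Optional[str]:
    """Return the smallest tier whose IOPS covers peak_iops plus optional headroom."""
    required = peak_iops * (1 + headroom)
    for name, iops in tiers:  # tiers are ordered smallest to largest
        if iops >= required:
            return name
    return None  # no listed tier is large enough


def storage_with_buffer(used_gb: float, buffer: float = 0.25) -> float:
    """Storage to provision: current usage plus a growth buffer."""
    return used_gb * (1 + buffer)


if __name__ == "__main__":
    print(pick_tier(8000))           # peak IOPS from the observations above
    print(storage_with_buffer(800))  # 800 GB + 25% buffer
```

Returning `None` instead of the largest tier keeps the "nothing fits" case visible, which is the point at which sharding or a different topology enters the discussion.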

Connection pattern audit

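// Application connection pool config
import { MongoClient } from "mongodb";

// MONGODB_URI is a placeholder name for wherever the Atlas connection string is stored
const uri = process.env.MONGODB_URI;
const client = new MongoClient(uri, {
  maxPoolSize: 100,     // per-process cap; pod_count × maxPoolSize must fit the Atlas tier's connection limit
  minPoolSize: 10,      // connections kept open even when idle
  maxIdleTimeMS: 60000, // close pooled connections idle for more than 60 s
});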

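Each Atlas tier caps concurrent connections (the working figures used here: M40 ~1500, M80 ~3000), so multiple application instances connecting to Atlas with the same account can hit the limit. Compute total connections = pod_count × maxPoolSize in advance and compare the result with the tier limit.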
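The pre-computation described above fits in a few lines. This is a sketch: the `reserve` fraction held back for monitoring agents and admin sessions is an assumption of the example, and the tier limit should be whatever figure applies to your cluster.

```python
def total_connections(pod_count: int, max_pool_size: int, clients_per_pod: int = 1) -> int:
    """Worst-case connections if every pool in every pod fills up."""
    return pod_count * clients_per_pod * max_pool_size


def fits_tier(pod_count: int, max_pool_size: int, tier_limit: int, reserve: float = 0.1) -> bool:
    """True if the worst case stays under the tier limit minus a reserve (assumed 10%)."""
    return total_connections(pod_count, max_pool_size) <= tier_limit * (1 - reserve)


if __name__ == "__main__":
    # 20 pods with maxPoolSize=100 against a limit of 3000 connections.
    print(total_connections(20, 100), fits_tier(20, 100, 3000))
```

Note that `clients_per_pod` matters in practice: each `MongoClient` instance owns its own pool, so a process that creates several clients multiplies the total.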

Compliance audit

Data residency: whether the Atlas deployment region satisfies GDPR / customer contracts
Encryption at rest: enabled by default in Atlas, but the encryption key is Atlas-managed; strict compliance requires CMK / BYOK
Audit log: Atlas provides audit logs, exportable to S3 / Splunk

Phase 1: Operational infrastructure preparation

Atlas cluster configuration

# Using the Terraform mongodbatlas provider
resource "mongodbatlas_cluster" "production" {
  project_id   = var.project_id
  name         = "production-cluster"
  cluster_type = "REPLICASET"

  provider_name               = "AWS"
  provider_region_name        = "US_EAST_1"
  provider_instance_size_name = "M80"

  cloud_backup           = true   # Cloud Backup (backup_enabled refers to deprecated legacy backup)
  pit_enabled            = true   # PITR; requires cloud_backup = true
  mongo_db_major_version = "7.0"

  advanced_configuration {
    javascript_enabled           = false
    minimum_enabled_tls_protocol = "TLS1_2"
    no_table_scan                = false
    oplog_size_mb                = 51200
  }
}

# Backup retention
resource "mongodbatlas_cloud_backup_schedule" "production" {
  project_id   = var.project_id
  cluster_name = mongodbatlas_cluster.production.name

  reference_hour_of_day    = 3
  reference_minute_of_hour = 0
  restore_window_days      = 7

  policy_item_daily {
    frequency_interval = 1
    retention_unit     = "days"
    retention_value    = 7
  }
}

VPC peering / private endpoint

Pattern A: VPC Peering
  AWS VPC <──peering──> Atlas project VPC
  - Works across regions once the routing tables are aligned
  - Suits medium / large workloads with a stable network topology

Pattern B: Private Endpoint (Atlas private link)
  AWS VPC ──private link──> Atlas
  - No routing table changes required
  - Suits complex multi-account / multi-region setups
  - Slightly higher cost

production default 走 Private Endpoint、設定簡單跟 IAM 整合好。

Atlas Database User 跟 IAM 整合

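Pattern A: traditional username / password
  - Create a Database User; the application connects with SCRAM-SHA-256
  - Suits legacy applications

Pattern B: AWS IAM authentication (recommended)
  - Atlas Database User type: "AWS IAM"
  - The application uses an AWS IAM role with the MongoDB driver's MONGODB-AWS mechanism
  - Temporary credentials rotate (every 15 minutes in this setup), so credential refresh has to work on the application side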

cutover 時間表內加 IAM authentication migration、不要事後補。

Phase 2：Data migration

Atlas Live Migration tool（小到中型）

Atlas UI 內建 Live Migration tool：

Source cluster URI（self-managed MongoDB）
Atlas target cluster
tool 自動 full sync + oplog tailing
Cutover window 內 final cutover

支援 dataset < 100GB 簡單；100GB-1TB 需要分批 / collection 順序設計。

mongomirror（大型）

# mongomirror: source → Atlas
mongomirror \
  --host source-replicaset/host1:27017,host2:27017 \
  --destination atlas-cluster-host:27017 \
  --destinationUsername admin \
  --destinationPassword "$ATLAS_PASSWORD" \
  --ssl

mongomirror 分兩段：

Initial sync（full dump + restore）
Oplog tailing（continuous CDC）

Cutover 期間 application 切 connection string、mongomirror 跟著 stream 收尾。

Phase 3：Cutover + verification

1. Put the application in maintenance mode (block writes)
2. Wait for mongomirror to catch up (oplog gap → 0)
3. Verify collection counts and sample queries on the Atlas side
4. Switch the application connection string to Atlas
5. Lift maintenance mode and monitor for 24-48 hours
6. Keep the self-managed MongoDB as a read-only standby for 1-2 weeks
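The count verification in step 3 can be sketched as a pure comparison. This is a minimal sketch, assuming the per-collection counts were already gathered on both sides (for example with countDocuments); the helper name diffCollectionCounts and the input shape are hypothetical.

```javascript
// Hypothetical helper: compare per-collection document counts from the
// source cluster and the Atlas target. Inputs are plain { name: count } maps.
function diffCollectionCounts(sourceCounts, targetCounts) {
  const names = new Set([...Object.keys(sourceCounts), ...Object.keys(targetCounts)]);
  const mismatches = [];
  for (const name of [...names].sort()) {
    // A collection missing on one side is reported with a null count.
    const source = sourceCounts[name] ?? null;
    const target = targetCounts[name] ?? null;
    if (source !== target) mismatches.push({ name, source, target });
  }
  return mismatches;
}

const mismatches = diffCollectionCounts(
  { orders: 10, users: 5, events: 3 },
  { orders: 10, users: 4 }
);
console.log(mismatches);
```

An empty result is a necessary condition for switching the connection string, not a sufficient one; sample queries on business-critical documents still need to run.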

Production 故障演練

Case 1：Atlas tier connection limit 撞牆

徵兆：cutover 後 application 流量高峰時大量 Connection refused、Atlas 端顯示 connection limit reached；self-managed 階段沒有這問題。

根因：M80 tier connection limit ~3000、application 100 個 pod × maxPoolSize=50 = 5000 connection；超出 limit。

修法：

Pre-migration 計算：total connection 對照 Atlas tier、超出選上一級 tier
降 maxPoolSize：100 pod × 30 = 3000、剛好 cap；但 burst 仍可能撞
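Add a connection proxy: put a connection pooler between the application and Atlas (for example a mongobetween-style proxy, as in the 9.C36 Coinbase case). mongos is the query router of a sharded cluster and does not serve as a general-purpose pooler.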

Case 2：IP whitelist 漏 application VPC、cutover 後完全連不上

徵兆：cutover 後 application 直接報 connection timeout、Atlas dashboard 顯示 zero traffic；troubleshooting 1 小時才發現是 IP access list 漏掉某 application VPC CIDR。

根因：Atlas IP access list 預設 deny all、必須明示加 application VPC；Phase 1 設定漏看某個 VPC（如 multi-account organization 內的 staging account）。

修法：

Pre-cutover 連線測試：每個 application VPC 跑 sample MongoDB 連線、確認 ping 通
改 Private Endpoint：不靠 IP whitelist、用 PrivateLink 自動 routing
Backup access：保留 bastion host with whitelisted IP、incident 期間能直連

Case 3：Backup retention 設不夠、compliance audit 抓到

徵兆：cutover 3 個月後 SOX audit 發現 backup retention 設 7 天、合規要求 90 天；急忙改 Atlas config 設 90 天、但 過去 3 個月 backup 已不可恢復。

根因：Atlas backup retention 是 向前生效、不能回追加；Phase 1 預設配置漏對合規 review。

修法：

Pre-Phase 1 跑 compliance review：跟 legal / security team 確認 retention / data residency / audit log
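Start with a conservatively long retention (treat 30 / 60 days as a floor and use the compliance figure when one exists): shortening retention later is easy, while lengthening it cannot bring back snapshots that have already expired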
PITR 跟 backup retention 分開設：PITR window 7-30 天、full backup 90-365 天

Case 4：IAM token 過期、application 端 reconnect storm

徵兆：production 切到 IAM authentication 後、每 15 分鐘出現一波 connection failure；Atlas log 顯示「auth token expired」。

根因：AWS IAM token 15 分鐘輪換、application 用舊 token 重連失敗；token refresh 邏輯沒寫對。

修法：

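// The MongoDB Node.js driver looks up MONGODB-AWS credentials itself.
// With the optional @aws-sdk/credential-providers package installed it
// resolves them from the standard AWS provider chain (environment variables,
// web identity, ECS / EC2 role) each time it authenticates a new connection,
// so the application does not hold or rotate a token.
const { MongoClient } = require('mongodb');

// uri carries no username / password
const client = new MongoClient(uri, {
  authMechanism: 'MONGODB-AWS',
});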

不要自管 token rotation、用 vendor SDK 抽象掉。

Case 5：Billing 暴漲、IOPS 跟 backup storage 超預估

徵兆：第一個月 Atlas 帳單 $15K USD、預估 $8K；Atlas dashboard 顯示 backup storage 跟 IOPS 各超 1.5-2x 預估。

根因：

Atlas backup 預設 跨 region replicated、storage cost 2x
IOPS-heavy workload 在 M tier 內可能撞 burst credit、auto-tier-up 暫時觸發更貴 tier
Data transfer 跨 region / 跨 cloud 計費沒算

修法：

Pre-migration cost estimate：用 self-managed metrics 估 IOPS / bandwidth、套 Atlas pricing
Backup region 設單一：若不要跨 region DR、設 same-region backup 省 50%
Reserved Instance：穩定 workload 預付 1-3 年、省 30-40%
Performance Advisor 早用：第一週就跑、找 inefficient query 降 IOPS

Capacity / cost

維度	Self-managed MongoDB	Atlas
Cluster cost (M80)	EC2 r6g.4xlarge × 3 ≈ $1.5K / mo	M80 + storage + backup ≈ $3K / mo
Operational FTE	0.5-1.5 FTE	0.1-0.3 FTE
Backup cost	S3 + tooling 自管	內建 + tiered storage
Cross-region DR cost	Manual + 2x infrastructure	1-click + 1.5-2x billing
Time to value	1-3 個月（HA + ops setup）	1-2 週（cluster ready + IAM）
Migration cost	-	1-3 FTE × 2-3 個月

Break-even：~200GB / 中型 workload、Atlas operational savings 平攤 1-2 年後比 self-managed cheaper；TB+ 大型 workload self-managed 仍可能便宜、但需要 ops team。

整合 / 下一步

跟 PostgreSQL → Aurora migration 對照

兩篇都是 Type C operational redesign hybrid、模板共用、細節差：

Aurora 端 RDS Proxy 是推薦做法、Atlas 端 Private Endpoint 更標準
Aurora 端 IAM authentication 是 optional best practice、Atlas IAM 是 推薦預設
兩家 cost model 都複雜、I/O cost 是 surprise 主要來源

跟 Application 端 IAM token rotation 整合

Vault dynamic credential 可 issue Atlas Database User credential、lease lifecycle 對齊 application；對 high-stakes workload 是好做法、但 setup 複雜。

下一步議題

Atlas Data Federation：跨 Atlas 集群 query S3 / 跨 region；如果走 multi-region 評估這 feature
Atlas Online Archive：cold data 自動 archive 到 S3、查 query 透明；對 retention 重的 workload 省 storage cost
Atlas Serverless：burst workload 適合、steady 不划算

MongoDB Shard Expansion + Multi-DC：Type F「不需要 parallel run」的 multi-region 例外

Tue, 19 May 2026 00:00:00 +0000

本文是 MongoDB overview 的 implementation-layer deep article。對應 #128 Type F「Topology re-layout」第 3 個 dogfood、特別驗證 self-aware limitation 第 3 點「不需要 parallel run」claim 的 multi-region rollout 例外 — 本文是反例的具體實證。

Reviewer D 的質疑：Type F 一定不需要 parallel run 嗎

#128 Self-aware limitation 第 3 點承認：

「不需要 parallel run」claim 部分不成立：multi-region rollout（#128 列為 Type F 情境）必須 parallel run — 兩 region 同時跑然後切流量、不然就是停機切換、跟 Type A phase 3 機制相同。

本文是該 claim 的 正面實證 — MongoDB sharded cluster 從 single-DC 加 shard + 加 secondary DC、確實需要 parallel run + 流量切換、跟 Type A phased migration 局部同構：

Type F 假設	Single-DC re-sharding（Redis case）	Multi-DC expansion（本文）
同 cluster 不同 state	yes	yes（同 MongoDB cluster）
不需 schema translation	yes	yes
不需 parallel run	yes（slot migration 內部完成）	no — 兩 DC 同跑後切流量
不需 cleanup phase	yes	partial（舊 DC 角色降為 standby）
Step-by-step + rollback boundary	yes	yes

→ Type F anatomy 仍適用、但「不需 parallel run」是 子情境條件、不是 universal claim。

兩個操作合併：shard 加 + DC 加

實務上中型公司常同時跑兩個 topology 變動：

Shard expansion：現有 3-shard cluster 加到 5-shard、chunk migration 平均分佈
Multi-DC：從 single-DC（us-east-1）加到 multi-DC（us-east-1 + us-west-2）

兩個操作的 diff dimension audit：

維度	Shard 加（單獨）	Multi-DC（單獨）	兩者同跑
Schema / API	Low	Low	Low
Operational model	Low	Medium（跨 DC ops）	Medium
Paradigm	Low	Low	Low
Components	Low（加 shard、同 cluster）	Low	Low
Application change	Low	Low-Medium（cross-DC latency aware）	Low-Medium
Data topology	High（sharding strategy）	High（replication + region）	High（雙變、複合 topology）

兩者主導維度都是 topology = High、組合走 Type F multi-axis 子情境。

Pre-layout analysis：當前 + 目標 topology

// 1. Current shard distribution
sh.status();
// Expected: 3 shards, ~33% of chunks each, no migration in progress.
// db.printShardingStatus() prints the same report; use it to spot a hot shard
// or an imbalanced chunk distribution.

// 2. Replication topology (run on each shard's replica set, not through mongos)
rs.status();
// Health of each replica set's primary / secondaries, replication lag

// 3. Cross-DC network baseline (measure before adding the DC)
// us-east-1 → us-west-2 RTT and bandwidth

Pre-layout 階段 output：

當前：3 shard × 1 replica set per shard (3 member) = 9 node、全在 us-east-1
目標：5 shard × 1 replica set per shard (5 member: 3 us-east + 2 us-west) = 25 node
Migration scope：加 2 shard + 加 2 DC member 每 shard、共 +16 node
Chunk migration estimate: about 40% of the data has to move (each of the 3 existing shards drops from ~33% to 20%, and the 2 new shards take 20% each)
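The migration estimate follows from simple arithmetic. This is a minimal sketch, assuming data is evenly balanced before and after the expansion; the helper name fractionToMove is hypothetical.

```javascript
// Fraction of data that has to move when an evenly balanced cluster grows
// from oldShards to newShards and ends up evenly balanced again.
// Every new shard has to receive 1/newShards of the data.
function fractionToMove(oldShards, newShards) {
  if (newShards <= oldShards) return 0;
  return (newShards - oldShards) / newShards;
}

const moved = fractionToMove(3, 5);
console.log(moved); // 0.4
```

Real clusters are rarely perfectly even, so the result is a lower bound for planning balancer windows rather than a prediction.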

Re-layout 機制

兩個 mechanism 平行進行：

Shard expansion mechanism

// 1. Add the new shards to the cluster
sh.addShard("rs-shard4/host10:27017,host11:27017,host12:27017");
sh.addShard("rs-shard5/host13:27017,host14:27017,host15:27017");

// 2. The balancer migrates chunks automatically
sh.startBalancer();
// Watch progress: db.adminCommand({balancerStatus: 1})

// 3. Verify the shard distribution once it finishes
sh.status();

Chunk migration 是 background job、balancer 控制 throttle；不阻塞 production query、但 CPU / network 上升 30-50%。

Multi-DC expansion mechanism

// 1. Add a us-west-2 member (priority 0) to each shard's replica set
rs.add({
  host: "us-west-2-host:27017",
  priority: 0,           // cannot become primary
  votes: 1,              // takes part in elections
  hidden: false,
  tags: { region: "us-west-2" }  // without this tag the readPreferenceTags in step 4 never match
});

// 2. Wait for initial sync to finish (1 hour to 1 day depending on data size)
rs.printReplicationInfo();

// 3. Once the secondary is healthy, raise priority or votes
// Do not set priority 1 right away, to avoid an unintended failover

// 4. Cross-DC routing is set through readPreference in the application
const client = new MongoClient(uri, {
  readPreference: 'secondaryPreferred',
  readPreferenceTags: [{ region: 'us-west-2' }, {}],
});

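Key point: multi-DC expansion adds members gradually rather than switching atomically. Each shard is extended independently, so when shards are handled one at a time the total duration is roughly shard count × initial sync time.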

Execution flow（含 parallel run + 流量切換）

8 step、包含 parallel run + 切流量 段——驗證 #128 self-aware limitation 第 3 點：

Step	動作	Parallel run?	Rollback boundary
1 Pre-check	量化當前 topology、確認 cluster 健康	no	-
2 加 us-east shard	sh.addShard、balancer migrate chunk	no（cluster 內）	removeShard、chunk migrate 回
3 加 us-west member	對每 shard rs.add 跨 DC member	no	rs.remove、initial sync 投入廢棄
4 Initial sync wait	等所有 us-west member catch up	parallel run starts：兩 DC 同時 serve	-
5 Cross-DC dual-serve	兩 DC 都跑 read traffic（不切 write）	yes、parallel run：app 用 secondary preferred us-west	readPref 切回 us-east primary
6 流量切換	application us-west traffic 走 us-west read	yes	DNS / readPref 切回
7 Promote us-west（optional）	一個 shard 的 us-west member priority 提到 1	post-cutover	demote priority 回 0
8 Cleanup	Verify、archive log、document new topology	no	-

Step 4-6 是 parallel run + 切流量 — Type F 有此例外、跟 Type A phase 3 機制同構；anatomy 中「Execution flow per-step」段必須含 parallel run 子段。

Production 故障演練

Case 1：Balancer 跑 chunk migration 撞 production peak

徵兆：加 shard 後 balancer 開始 migrate chunk、production write latency p99 從 10ms 跳到 100ms；application 端 timeout 大量。

根因：MongoDB balancer 預設 24×7 跑、chunk migrate 是 blocking 操作（migration lock 期間阻塞 write 到該 chunk）；產線高峰時間 balancer 不會自動暫停。

修法：

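// Restrict the balancer to a low-traffic window (run through mongos)
sh.setBalancerState(true);
// The balancer settings live in the config database
db.getSiblingDB("config").settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },
  { upsert: true }
);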

且設 chunkSize 較小（128MB → 64MB）讓 migration 步驟細、單次 lock 時間短。

Case 2：Cross-DC initial sync 期間 oplog 跑出窗口

徵兆：加 us-west member 後、initial sync 跑 4 小時、結束時 member 顯示「too stale to catch up」、需要 full re-sync。

根因：MongoDB oplog 是 capped collection、預設 size 5% disk；4 小時 initial sync 期間 primary 寫入量超出 oplog 保留範圍、member 拿到的 oplog start point 已被覆蓋。

修法：

預先擴 oplog size：db.adminCommand({replSetResizeOplog: 1, size: 51200}) 加到 50GB、覆蓋 sync window
Off-peak initial sync：跑在低流量時間、oplog 寫入較慢
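Manual initial sync via snapshot: restore a filesystem snapshot (or a copy of the data files) of an existing member onto the new host, so it only has to replay the oplog written since the snapshot. A mongodump / mongorestore copy cannot be used to seed a replica set member.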

Case 3：跨 DC read 路由錯誤、stale data 影響業務

徵兆：切流量到 us-west 後、application 偶爾抓到 5-30 秒前的 stale data；customer 報告「明明剛改了 setting、refresh 又變回去」。

根因：us-west member 是 secondary、replication lag 5-30 秒；application readPreference 設 secondaryPreferred 但沒 maxStalenessSeconds、可能讀到嚴重 stale member。

修法：

const client = new MongoClient(uri, {
  readPreference: 'secondaryPreferred',
  readPreferenceTags: [{ region: 'us-west-2' }, {}],
  maxStalenessSeconds: 90,  // 90 is the minimum allowed; it skips badly lagging members but does not make reads fresher than 90 s
});

// For strict-consistency paths, force the primary
const client_strict = new MongoClient(uri, {
  readPreference: 'primary',  // always read from the us-east primary
});

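Read patterns have to be split at the application level into "accepts stale reads" and "requires fresh reads" rather than configured once for the whole cluster. Because maxStalenessSeconds cannot go below 90, the 5-30 second staleness in this case is only fixed by sending read-after-write paths (such as the settings page) to the primary or through a causally consistent session.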
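One way to keep that split explicit in code is a single function that maps a freshness requirement to client options. This is a minimal sketch; the helper name readOptionsFor and the labels 'fresh' / 'stale-ok' are hypothetical, and the returned objects are meant to be merged into the options of two separate MongoClient instances.

```javascript
// Hypothetical mapping from a freshness requirement to MongoClient options.
function readOptionsFor(requirement, localRegion = 'us-west-2') {
  if (requirement === 'fresh') {
    // Read-after-write paths always go to the primary.
    return { readPreference: 'primary' };
  }
  if (requirement === 'stale-ok') {
    return {
      readPreference: 'secondaryPreferred',
      // Prefer members tagged with the local region, then fall back to any member.
      readPreferenceTags: [{ region: localRegion }, {}],
      maxStalenessSeconds: 90, // smallest value MongoDB accepts
    };
  }
  throw new Error(`unknown freshness requirement: ${requirement}`);
}

console.log(readOptionsFor('fresh'));
console.log(readOptionsFor('stale-ok'));
```

Throwing on an unknown label forces each call site to state its requirement instead of silently inheriting a default.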

Case 4：Shard tag-aware routing 沒設、cross-DC traffic 爆 cost

徵兆：multi-DC 跑了 1 個月、AWS egress cost 從 $500 / month 漲到 $8000 / month；99% 流量還是 us-east → us-west 跨 DC。

根因：sharded cluster 沒設 zone sharding、application 不知道哪些 chunk 在哪個 DC、所有 query 預設打 us-east primary、跨 DC bandwidth 爆。

修法：

// Note: zones API (MongoDB 3.4+); the older sh.addShardTag / sh.addTagRange helpers are deprecated
// Use sh.addShardToZone / sh.updateZoneKeyRange instead

// 1. Assign shards to zones
sh.addShardToZone("rs-shard1", "us-east");
sh.addShardToZone("rs-shard2", "us-east");
sh.addShardToZone("rs-shard3", "us-east");
sh.addShardToZone("rs-shard4", "us-west");
sh.addShardToZone("rs-shard5", "us-west");

// 2. Add zone ranges for the collection (assumes the shard key is { region: 1, _id: 1 })
sh.updateZoneKeyRange(
  "myapp.events",
  { region: "us-east", _id: MinKey },
  { region: "us-east", _id: MaxKey },
  "us-east"
);
sh.updateZoneKeyRange(
  "myapp.events",
  { region: "us-west", _id: MinKey },
  { region: "us-west", _id: MaxKey },
  "us-west"
);

// 3. The balancer redistributes chunks into the matching zones

Zone sharding 是 multi-DC 必要設計、不設等於白付 egress cost。

Case 5：Failover 後跨 DC primary 切換、application 連線中斷

徵兆：production 跑 6 個月後、us-east-1 outage、某 shard primary 切到 us-west member；application 5-10 秒內大量 connection error。

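Root cause: the 10-second election timeout (electionTimeoutMillis) is a replica set setting, not a driver default. Until a new primary is elected there is nothing to write to, and the application had no retry around server selection, so operations issued during the switch failed instead of waiting for the new primary.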

修法：

const client = new MongoClient(uri, {
  serverSelectionTimeoutMS: 30000,    // wait up to 30 s for the election (also the driver default; do not lower it)
  retryWrites: true,
  retryReads: true,
  heartbeatFrequencyMS: 5000,         // detect topology changes sooner than the 10 s default
});

且 multi-DC primary 應該設 priority asymmetry：us-east member priority 2、us-west priority 1；正常情況不切換、災難時自動切。
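The priority asymmetry can be sketched as a transformation of the replica set config. This is a minimal sketch, assuming every member carries a tags.region value; the helper name withRegionPriorities and the region names are hypothetical, and the function only builds a config object without talking to a cluster.

```javascript
// Hypothetical helper: return a copy of a replica set config (the shape
// returned by rs.conf()) with priorities assigned by region tag.
function withRegionPriorities(config, priorityByRegion) {
  return {
    ...config,
    members: config.members.map((member) => {
      const region = member.tags && member.tags.region;
      const priority = priorityByRegion[region];
      // Members without a matching region keep their current priority.
      return priority === undefined ? { ...member } : { ...member, priority };
    }),
  };
}

const current = {
  _id: 'rs-shard1',
  members: [
    { _id: 0, host: 'east-a:27017', priority: 1, tags: { region: 'us-east-1' } },
    { _id: 1, host: 'west-a:27017', priority: 0, tags: { region: 'us-west-2' } },
  ],
};
const next = withRegionPriorities(current, { 'us-east-1': 2, 'us-west-2': 1 });
console.log(next.members.map((m) => [m.host, m.priority]));
```

In mongosh the result could be applied with rs.reconfig(next) on the current primary, one shard at a time, after reviewing the diff.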

Capacity / cost

維度	Single-DC 3-shard	Multi-DC 5-shard	Trade-off
Node count	9	25	~3x infrastructure cost
Storage redundancy	3 replica	5 replica (3 east + 2 west)	+2 copy、storage cost +66%
Network egress	內部 VPC、低	Cross-DC、高（需 zone sharding）	$500 → $8000 / month if no zone sharding
Latency p99 (write)	5-10ms	5-15ms（primary 仍 us-east）	略升
Latency p99 (read)	5-10ms	2-5ms (local DC)	Multi-DC 區域 read 加快
Disaster recovery	RTO 30 分鐘（rebuild）	RTO < 1 分鐘（auto failover）	顯著改善
Operational complexity	低	高（zone sharding / DR drill）	+1 SRE FTE 維護

判讀：multi-DC 是 DR 投資、不是 cost optimization；只在 availability SLA > 99.9% 或合規要求 場景值得。

整合 / 下一步

跟 MongoDB → Atlas migration 對位

Self-managed multi-DC 複雜度高、Atlas 把 multi-cluster + cross-region 簡化成 UI 配置；如果走 multi-DC、考慮直接遷 Atlas。

跟 Application read pattern 整合

zone sharding + readPreference 跟 application logic 緊密耦合；不能事後補、應在 multi-DC 設計階段就設計 application 端的 region-aware routing。

跟 Cassandra keyspace re-balance 對比

Cassandra 是另一個 Type F multi-DC 典型 case；用 NetworkTopologyStrategy + replication factor per DC、跟 MongoDB zone sharding 概念對等但 mechanism 完全不同。Reviewer D 把 Cassandra 列為 Type F 反例 — 本文以 MongoDB 替代驗證。

下一步議題

Cross-region active-active：MongoDB 不支援 multi-primary、cross-region active-active 需要 application-level conflict resolution
PostgreSQL Citus / CockroachDB multi-region 對比：distributed SQL 對 multi-region 有不同設計
Cost optimization：跨 DC egress 是 long-term concern、zone sharding 設好後仍要 quarterly review

MongoDB Schema Design Pattern：contract layer 在哪 vs embedded / reference

Wed, 27 May 2026 00:00:00 +0000

MongoDB schema design 的初學討論常停在「embedded vs reference 二選一」。真實 production 議題遠不止此：document model 給的 schema flexibility 在第一年是紅利、跑半年後同 collection 開始混三代 schema、application code 三層 if-else 處理欄位缺失與型別漂移。這時候讀者要解的不是「embed 還是 reference」、是 schema contract 該由誰守、守在哪一層。本文把這個議題拆成三條 contract layer 路徑（DB-layer validator / app-layer abstraction / 混合）、配合 embedded / reference / polymorphic 機制與 time-series collection 邊界一起討論。

本文不重複 MongoDB vendor overview 已寫過的 document model 適用條件 — 而是 production 部署 + schema governance + 失敗修復的實作層教學。

問題情境：document 自由的後座力

MongoDB 適用度的前置判讀有三件事要確認：

document shape 是否主導資料：sensor signal / CMS article / order aggregate 這類「形狀本來就多型 + 隨產品演進」適合 document model；access pattern 固定 + 欄位定型的反而該回 KV 系統或 SQL
contract layer 該放哪：DB-layer validator 適合 schema 穩定 / 跨服務共用 collection 的場景；app-layer abstraction 適合 schema 演進快 / 微服務獨立 owner；混合適合大型 production
跨雲 hedging 是否需要：若團隊未來雲商策略不確定、Atlas 跨雲是 selection 訊號；只在單雲跑就不必為 hedging 多付代價

確認 MongoDB 該用之後，讀者真正在 production 撞到的徵兆：

Document model 早期 schema-less 紅利、跑半年後 collection 同時混三代 schema、application 寫 if-else 處理欄位缺失與型別漂移
子文件越塞越深、單 document 突破 1-2MB、partial update 仍要把整顆 document load + write、IO 跟 working set 雙重壓力
反向過度 normalize：訂單跟訂單 item 拆兩個 collection、單一查詢得 N+1 $lookup、aggregation cost 飆
IoT / sensor / event log workload 寫進 regular collection、寫入吞吐撞牆但沒考慮 time-series collection
$lookup 出現在 hot path、document size warning（16MB 上限預警）、partial update 卻產生大量 disk write、schema validation 報錯比例突然爬升

Case anchor：9.C38 Toyota Connected 揭露車載 sensor schema 隨車型 / 年份 / 規範演進、polymorphic document 與 schema governance 並存；9.C37 Forbes 揭露 CMS 50+ 微服務透過自建中介 abstraction layer 隔離 schema 變動；9.C30 Microsoft 365 揭露 document model 保留 + 跨 vendor 形狀治理。早期 startup MongoDB 三代 schema 並存的具體 incident 細節需未來 case 補完、本文先以「常見 failure pattern」處理。

核心機制：aggregate root、embedded、reference、polymorphic

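The first layer of MongoDB schema design is that the aggregate root decides the atomicity boundary. MongoDB limits write atomicity to a single document; anything that spans documents needs a multi-document transaction (supported on replica sets since 4.0 and on sharded clusters since 4.2, with a performance cost when a transaction crosses shards). The aggregate root is the DDD concept made concrete in MongoDB: data that is read together, written together, and shares one consistency boundary goes into the same document.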

Embedded（subdocument / array）：寫入 atomic、讀取一次到位；代價是 update sub-element 仍要 rewrite 整顆 document，sub-element 寫頻很高時不適合
Reference（手動 _id foreign key + $lookup）：document 大小可控，但 join 在 application 或 aggregation 階段做；JOIN-heavy workload 跑這條路徑會 N+1
Polymorphic pattern：同 collection 用 type discriminator 存多型實體；MongoDB 沒 inheritance、靠 schema validator 與 partial index 維持邊界
16MB document hard limit：是 MongoDB 機制邊界；working set 在 RAM 的隱性軟限制（單 doc 大小直接影響 page cache 效率）更早就會出問題

Contract layer 三條路徑

跨 case 合成 frame（本章合成、Toyota + Forbes 共同揭露）：document model 的 schema flexibility 在 production 必須以 schema governance 對沖、否則「schema 自由」變「production data inconsistency」（Toyota case 明示）。讀者要選的不是「要不要做 schema governance」、是「contract 守在哪一層」。三條路徑：

路徑	實作機制	適用條件
DB-layer contract	MongoDB `$jsonSchema` validator + `validationLevel` + `validationAction`	Schema 穩定、多服務共用 collection、要 DB 擋髒資料
App-layer contract	自建 API abstraction + middleware schema 驗證	Schema 演進快、微服務獨立 owner、跨雲彈性需求
混合	DB 層擋型別 / 必填、app 層擋業務語意 / 版本	大型 production、多 owner、跨團隊

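DB-layer path: in production the $jsonSchema validator is a contract enforcement tool, not a dev-time linter. validationAction: "error" rejects the write; "warn" only logs it. validationLevel: "strict" validates every insert and update; "moderate" validates inserts and updates to documents that already satisfy the schema, and lets updates to existing non-conforming documents through. This path fits schemas that are stable enough for a collection to be shared across services.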

App-layer 路徑：9.C37 Forbes 揭露的模式 — 50+ 微服務透過自建中介 abstraction layer 看到穩定的 contract API、DB schema 變動限制在 owner microservice 內。Forbes 跨雲彈性能用起來、核心原因是 abstraction layer 把 schema 治理收斂到單點、跨雲遷移時 abstraction layer 不變、微服務不知道底層 DB 換 cluster 換雲。

混合路徑：Atlas Application Services、enterprise schema registry 屬此類。DB 層 validator 守底線（欄位型別、必填欄位）、app 層 abstraction 守業務（版本欄位 / 相容處理 / cross-document 一致性）。代價是兩層都要維護、版本同步成本高、適合 production 規模真的撐住這個複雜度的團隊。

讀者選哪條路徑要看：team 規模 / collection 跨服務程度 / schema 演進速度。

Time-series collection (5.0+)

Time-series collection 是 MongoDB 為 IoT / sensor / event log / metrics 設計的 vendor-specific 機制 — 比 regular collection 寫入吞吐高 3-5x、storage 壓縮率更好。資料形狀必須是 { timestamp, metadata, measurement } 三段式、timestamp 主導。

適用情境：sensor signal 高頻寫入、metrics 系統的 time series、application event log。不適用情境：schema 不以 timestamp 為主、需要跨 document update、需要 polymorphic discriminator。

9.C38 Toyota Connected 自承「20 個 Atlas database 沒明確說有沒有用 time series collection — 對 IoT 案例這是重要區分、但 case study 沒揭露」。寫進 production 時必須明示：IoT / sensor 場景該考慮 time-series collection、Toyota case 未揭露實際使用情況、不可寫成「Toyota 使用 time-series collection」。

對應 knowledge card：document-store、transaction-boundary（aggregate boundary = transaction boundary）、data-inconsistency。

操作流程

Step 1：access pattern 盤點。列出 top 10 query / write、標 read together / write together 集合 — 這份清單決定 embedded vs reference vs polymorphic 的候選。

Step 2：contract layer 決策。

條件	路徑
Collection 跨多服務 + schema 穩定	DB-layer validator
Schema 演進快 + 微服務獨立 owner	App-layer abstraction
大型 production + 多 owner + 跨團隊	混合（兩者並用）
IoT / sensor / event log + timestamp 主導	Time-series collection（取代 regular collection）

Step 3：embed 判準 — 1:few、life-cycle 同步、< 1MB 預期上限；reference 判準 — 1:many 寫頻不對稱、跨 aggregate 引用。
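The Step 3 criteria can be written down as a small decision function so that reviews apply them consistently. This is a minimal sketch; the helper name suggestModeling, the input fields, and the threshold of 100 children for "1:few" are hypothetical assumptions, not MongoDB rules.

```javascript
// Hypothetical heuristic encoding the embed / reference criteria above.
function suggestModeling({ maxChildren, sharedLifecycle, expectedDocBytes, asymmetricWrites, crossAggregate }) {
  const ONE_MB = 1024 * 1024;
  const FEW = 100; // assumption: what counts as "1:few" is a team decision
  // References across aggregates, or children written far more often than the parent, stay separate.
  if (crossAggregate || asymmetricWrites) return 'reference';
  if (maxChildren <= FEW && sharedLifecycle && expectedDocBytes < ONE_MB) return 'embed';
  return 'reference';
}

console.log(suggestModeling({
  maxChildren: 5, sharedLifecycle: true, expectedDocBytes: 20000,
  asymmetricWrites: false, crossAggregate: false,
})); // embed
```

The output is a starting point for the access pattern review in Step 1, not a replacement for it.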

Step 4：DB-layer 路徑 validator 配置：

db.runCommand({
  collMod: "orders",
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["_id", "tenantId", "createdAt", "items"],
      properties: {
        tenantId: { bsonType: "string" },
        createdAt: { bsonType: "date" },
        items: {
          bsonType: "array",
          minItems: 1,
          items: {
            bsonType: "object",
            required: ["sku", "qty"],
            properties: {
              sku: { bsonType: "string" },
              qty: { bsonType: "int", minimum: 1 }
            }
          }
        }
      }
    }
  },
  validationLevel: "moderate",
  validationAction: "warn"
})

灰度策略：先 validationLevel: "moderate" + validationAction: "warn" 觀察兩週、確認 application 不寫違規 doc、再切 "strict" + "error" 封死。
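The two rollout stages differ only in two settings, which makes them easy to generate from one schema. This is a minimal sketch; the helper name validatorCommand and the stage labels observe / enforce are hypothetical, and the function only builds the command document that would be passed to db.runCommand.

```javascript
// Hypothetical helper: build the collMod command for each rollout stage.
const ROLLOUT_STAGES = {
  observe: { validationLevel: 'moderate', validationAction: 'warn' },
  enforce: { validationLevel: 'strict', validationAction: 'error' },
};

function validatorCommand(collection, schema, stage) {
  const settings = ROLLOUT_STAGES[stage];
  if (!settings) throw new Error(`unknown rollout stage: ${stage}`);
  return { collMod: collection, validator: { $jsonSchema: schema }, ...settings };
}

const schema = { bsonType: 'object', required: ['_id', 'tenantId'] };
console.log(validatorCommand('orders', schema, 'observe'));
console.log(validatorCommand('orders', schema, 'enforce'));
```

Keeping the schema in one place means the switch from observe to enforce cannot accidentally change the contract itself.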

Step 5：App-layer 路徑 abstraction 介面。9.C37 Forbes 揭露的模式 — middleware 攔截 microservice 寫入、驗 schema、套版本欄位、把 owner microservice 的 schema 變動隔離在 abstraction 內。

Step 6：Polymorphic + partial index — partialFilterExpression 避免冷分支吃 index 成本：

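// Older server versions reject $in inside partialFilterExpression;
// check the supported operator list for your version.
db.events.createIndex(
  { type: 1, timestamp: -1 },
  { partialFilterExpression: { type: { $in: ["click", "purchase"] } } }
)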

Step 7：量測 doc 形狀。用 bsondump + $bsonSize + collStats 量測：

db.coll.aggregate([
  { $group: {
      _id: null,
      avg: { $avg: { $bsonSize: "$$ROOT" } },
      max: { $max: { $bsonSize: "$$ROOT" } }
  }}
])

驗證點：avgObjSize 在預期範圍、validator failure rate < SLO、abstraction layer schema mismatch rate 可追溯。

Rollback boundary：validator 從 strict 退回 moderate 是 single-command、application code 不必改；abstraction layer 換版需 application code 灰度；已 embed 進去的 schema 變更要靠 backfill migration script、無法 in-place 還原。

失敗模式

Unbounded array growth：把「使用者所有訊息」embed 進 user document、document 撞 16MB → 寫入直接 reject。修法是改 reference、訊息獨立 collection、用 userId 索引。

Hot subdocument update：所有寫都打同一個 nested field、wiredTiger document-level lock 退化成熱點，concurrency 看似多核卻被序列化。修法是把熱寫欄位拆 reference document、或改 sharded collection 把寫散開（見 shard key selection）。

$lookup 在 hot path：reference 沒設好變 join、p99 latency 隨 collection 大小線性退化。修法是 schema design 階段 denormalize、把 read-together 資料 embed 回 aggregate root；或 $merge 寫 materialized view（見 aggregation pipeline optimization）。

Schema 三代並存（缺 contract layer）：缺 validator 跟 abstraction layer、舊版欄位殘留、application code 三層 fallback、新 dev onboarding 看不懂哪個欄位是現役。9.C38 Toyota 揭露：document model 的彈性「成本是 production 必須做 schema governance」、否則「schema 自由」變「production data inconsistency」。

Abstraction layer 變成 lock-in：app-layer contract 寫得太重、跨 vendor 遷移時 abstraction 本身要重寫。該層應該薄、只做 schema 隔離、不做業務邏輯。

Polymorphic 全表掃描：discriminator 沒進 index、type: "rare" 查詢全表 scan。修法用 partial index 把熱類型蓋住、冷類型走全表也只是冷路徑。

Time-series collection 用錯場景：把非 timestamp 主導資料塞進 time-series collection、失去 flexibility 又拿不到吞吐紅利。Time-series collection 是專屬優化、不是普適 collection 升級。

Anti-recommendation：

access pattern 還沒穩定的早期 MVP 不需要鎖死 schema validator；先用 app-layer abstraction、production 穩定後再決定 DB 層該不該封死
JOIN-heavy / 強 normalize workload 一開始就該回 PostgreSQL JSONB 或 SQL、不是塞進 MongoDB 再 $lookup
跨案合成 frame：「不是所有資料都該進 MongoDB」、document-shaped + 形狀變化頻繁的進、access pattern 固定的 KV 走 KV（9.C36 Coinbase 揭露 MongoDB + DynamoDB 按 workload 分流）

容量與觀測

關鍵 metric：

Document 形狀：collStats.avgObjSize、collStats.size vs storageSize（壓縮比）
Contract 健康：document validation failure rate、abstraction layer schema mismatch rate
Working set 壓力：wiredTiger.cache.bytes currently in the cache 對比 working set 估算
Aggregation 副作用：profiler slow op、$lookup / $unwind 在 hot path 出現位置

Mongo command：

db.coll.stats() 看 document 平均 / 最大 size、storage / index size
db.runCommand({collMod: ..., validator: ...}) 改 validator
db.setProfilingLevel(1, {slowms: 100}) 抓 slow op

回到 4.20 observability evidence：把 doc size 分布、validator failure rate、abstraction layer schema mismatch、$lookup 出現位置列為 evidence 三件套。

回到 9.5 bottleneck localization：working set 撐爆 RAM 時的 page fault 信號、跟 doc size 異常增長強相關。

邊界與整合

Sibling deep articles：

shard key selection — document 形狀決定 shard key 候選空間
aggregation pipeline optimization — $lookup 與 schema reference 互相牽動
connection management and cache layer — abstraction layer 跟 cache 層協作

Migration playbook：

document 形狀走樣到無法治理時的 → MongoDB → PostgreSQL 拆 normalize 路徑
保留 document model 換 vendor 三型對照 — 保留主 DB 補周邊（Coinbase）/ 同 DB 換託管（Forbes Atlas）/ 同 model 換 vendor（Microsoft 365 Cosmos DB MongoDB API）

跟 1.x 互引：1.2 schema design 處理通用 schema 演進原則、本文是 MongoDB-specific 落地；1.3 transaction boundary 對齊 aggregate = atomic 邊界。

MongoDB Shard Key Selection：hashed vs ranged、單 cluster 切 shard vs 多 cluster 切 blast radius

Wed, 27 May 2026 00:00:00 +0000

MongoDB shard key 是 sharded cluster 上線時最難回頭的決策。Shard key 一旦設定錯、5.0 之前完全不可逆、5.0+ 用 reshardCollection 可改但仍是長時間運算 + 額外磁碟 + 寫入暫停窗口。但 shard key 不是 production 唯一的橫向擴展選項 — 還有「多 cluster」這條路徑（Toyota Connected 揭露），兩者解的問題完全不同。本文把 shard key 三特性（cardinality / frequency / monotonicity）跟「單 cluster vs 多 cluster」對照在一起、配合跨 vendor partition key 可逆性紀律一起討論。

本文不重複 MongoDB vendor overview 已寫過的 sharding 簡介 — 而是 production 設計 + 失敗修復的實作層教學。

MongoDB 適用度前置判讀：進到 shard key 設計前先確認 workload 在 MongoDB 適用區（document shape 主導 / contract layer 該放哪 / 跨雲 hedging 是否需要）— 詳見 schema-design-pattern 開頭 3 軸前置判讀、本篇不重複展開。Sharded cluster 是 已選 MongoDB 後 的容量決策、不是 vendor 選型決策。

問題情境：橫向擴展不是只有 sharded cluster 一條路

典型觸發場景：single replica set 撐到上限、writes 已經把 primary 推到 CPU 90% / disk IO 飽和、working set 超出 RAM。讀者下意識會想到「分 shard」、但同時還有「分 cluster」這條路徑、兩者 trigger 完全不同：

單 cluster 切 shard：解的是 單一資料域寫入飽和、collection 大到單 replica set 撐不住
多 cluster 切 DB：解的是 blast radius / ownership / 合規邊界、不一定是吞吐問題

混淆兩者的後果：吞吐沒撞牆但 blast radius 是議題、強行分 shard → aggregation / transaction / $lookup 成本全部跳一級、業務 ownership 仍混在一起。或反過來：吞吐撞牆但選了分 cluster → 跨 cluster transaction 不存在、單一 collection 跨多 cluster 要在 application 層拼。

讀者徵兆：

mongos 的 targeted query / scatter-gather query 比例失衡
單一 shard CPU 遠高其他 shard、balancer 移 chunk 跟不上寫入速度
chunkMigrated 異常頻繁、sh.status() 顯示 chunk 分布偏斜
微服務 ownership 跟 collection 邊界不對齊、某 microservice 故障打到其他服務

Case anchor：9.C38 Toyota Connected 揭露「20 個 Atlas database 是業務邊界切分、不是吞吐切分」（單 cluster vs 多 cluster 對照）；hot shard 在 e-commerce flash sale / 遊戲開新區 / B2B 大客戶獨佔 chunk 的具體 incident 細節需未來 case 補完、本文以「常見 failure pattern」處理、不憑空編造 incident 數字。

核心機制：shard key、chunk、balancer

Shard key 三特性決定 sharded cluster 行為：

Cardinality（基數）：shard key 的不同值數量。status: "active" | "inactive" 只有兩個值、cardinality = 2、不能分到多 chunk
Frequency（頻率分布）：值的分布是否平均。country 在全球流量中通常一兩個國家佔 80%
Monotonicity（單調性）：值是否單調遞增。_id（ObjectId）/ 時間戳 / 自增 ID 都是單調

三特性決定 shard key 行為：

Hashed shard key：hash function 把 key 打散、寫入分布均勻、但 range query 變 scatter-gather（每個 shard 都問）
Ranged shard key：相同 key 相近 → 同 chunk → range query 高效；但單調 key + ranged → 所有寫打最後 chunk
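Compound shard key (a hashed field inside a compound key is supported from 4.4; this is the MongoDB counterpart of a Composite Partition Key): for example { tenantId: 1, _id: "hashed" }, which isolates by tenant first and then hashes to avoid hotspots inside a tenant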
Zone sharding：把特定 chunk 釘到特定 shard（地域 / 合規 / 硬體分層）

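A chunk is a logical range MongoDB carves out of a collection (default size 64MB before 6.0, 128MB from 6.0 on). The balancer moves chunks between shards to even out the load. A chunk cannot be split when the shard key has only one value in its range (low cardinality, or a large tenant owning the whole range); such a chunk can neither be split nor spread out by the balancer.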

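reshardCollection (5.0+): works through a temporary collection, chunk re-splitting, dual writes, and a cutover; the duration scales with data size and it needs roughly 1.2x extra disk. It is a second chance when the design was wrong, not a free lunch.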

對應 knowledge card：database-sharding、hot-partition、partition。

單 cluster 切 shard vs 多 cluster 切 blast radius

跨案合成 frame（本章合成、9.C38 Toyota 揭露事實但 case 原文沒提這個 frame）：橫向擴展不是只有「sharded cluster 一條路」、多 cluster 是另一條路。

9.C38 Toyota Connected 揭露事實：

18B transactions / 月 ÷ 30 天 ÷ 86400 秒 ≈ 7K txn/sec（口徑：月度滾動平均、非瞬時尖峰）
單一 MongoDB cluster 完全撐得下這個吞吐
Toyota 切 20 個 Atlas database 不是吞吐切分、是 microservice ownership + blast radius 切分
「每個 microservice 擁有自己的 DB、單一 DB 故障不影響其他服務」
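The throughput figure above is a plain average and can be reproduced directly. This is a minimal sketch; the helper name averagePerSecond is hypothetical, and a 30-day month is assumed as in the text.

```javascript
// Average per-second rate from a monthly total, assuming a 30-day month.
function averagePerSecond(totalPerMonth, daysPerMonth = 30) {
  return totalPerMonth / daysPerMonth / 86400;
}

const toyotaRate = averagePerSecond(18e9);
console.log(Math.round(toyotaRate)); // 6944, i.e. roughly 7K txn/sec
```

Because this is a monthly average, peak traffic can be several times higher; the figure supports the ownership argument, not a capacity plan.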

兩條路徑的判讀條件不同：

路徑	Trigger	代價
Sharded cluster（分 shard）	單一 collection 寫入飽和、storage 撐爆單 replica set、access pattern 在同一個資料域內	aggregation / transaction / `$lookup` 成本全部跳一級
多 cluster（分 DB）	微服務 ownership 邊界、blast radius 隔離、合規 boundary、不同 workload shape 共處風險	跨 cluster transaction 不存在、跨 DB join 必須在 application 層做

兩者可以同時用：每個 microservice 有獨立 cluster、cluster 內部該分 shard 還是分。寫設計文件時要避免讓讀者以為「sharded cluster 是唯一橫向擴展選項」。

Partition key 可逆性跨 vendor 對照

跨 vendor 可逆性對照 SSoT：MongoDB / DynamoDB / Cosmos DB 三家可逆性不在同一光譜、跨 vendor 對照的 SSoT 主寫位置在 DB3 entry — 三 vendor 對比 10 軸 + 對應的軸的延伸子段。本段聚焦 MongoDB 5.0+ reshardCollection 對 shard key 設計的影響、不重複展開三 vendor 全光譜比較。

不同 vendor 對 partition key 可逆性紀律完全不在同一光譜：

Vendor	機制	可逆性	成本
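MongoDB	Shard key（`shardCollection`）	5.0+ `reshardCollection` can change it (4.4 only allows adding suffix fields via `refineCollectionShardKey`); irreversible before that	Proportional to data size, ~1.2x disk, dual write + cutover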
DynamoDB	Partition key	Changeable only by backfilling into a new table, not in place	Access pattern redesign, traffic cutover cost
Cosmos DB	Partition key	不可改（必須 export-recreate-import）	全量重灌、雙寫驗證、最大遷移成本

寫進設計文件時必須附 vendor + 版本、避免讓讀者把三家當「partition key 都不可改」、也避免把 MongoDB 5.0+ 的 reshardCollection 當「便宜遷移」。

操作流程

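Step 1: choose the horizontal scaling path. First ask whether the problem is write saturation within a single data domain or blast radius / ownership, then pick sharding or separate clusters. If both apply, settle the cluster boundaries first and shard inside each cluster afterwards.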

Step 2：access pattern audit。列出所有讀寫 query、標出哪些 query 必須走 single shard（targeted），哪些 query 不在意 scatter-gather。

Step 3：候選 key 評估表。對每個候選打 cardinality / frequency / monotonicity 三項評分：

候選 key	Cardinality	Frequency	Monotonicity	適合？
`_id`（ObjectId）	極高	均勻	單調	否（單調寫熱）
`tenantId`	中	偏斜	否	視 tenant 分布
`{ tenantId: 1, _id: "hashed" }`	高	均勻	否	通常合適
`country`	極低（~200）	嚴重偏斜	否	否

Step 4：dry-run 採樣。對既有資料採樣，跑 db.coll.aggregate([{$sample:{size:100000}}, {$group:{_id:"$candidateKey", c:{$sum:1}}}, {$sort:{c:-1}}]) 看分布、確認沒有單一 key value 吃掉 > 20% 流量。
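Once a sample has been pulled, the three shard key properties can be summarized offline. This is a minimal sketch, assuming the sample is an array of candidate key values in arrival order; the helper name keyStats and its output fields are hypothetical.

```javascript
// Hypothetical helper: summarize a sample of candidate shard key values.
function keyStats(values) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) || 0) + 1);

  let top = 0;
  for (const c of counts.values()) top = Math.max(top, c);

  let increasing = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[i - 1]) increasing++;
  }

  return {
    cardinality: counts.size,                                  // distinct values seen
    topShare: values.length ? top / values.length : 0,         // share of the most frequent value
    // Share of adjacent pairs that strictly increase; near 1 means a monotonic key.
    monotonicShare: values.length > 1 ? increasing / (values.length - 1) : 0,
  };
}

console.log(keyStats(['us', 'us', 'us', 'us', 'tw']));
console.log(keyStats([1, 2, 3, 4]));
```

A topShare above 0.2 corresponds to the "single value above 20%" warning in Step 4, and a monotonicShare near 1 flags keys that need hashing.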

Step 5：shardCollection。

sh.enableSharding("shop")
sh.shardCollection("shop.orders", { tenantId: 1, _id: "hashed" })

先在 staging 跑流量重放、確認 chunk 分布平均、targeted query 比例 > 90%。

Step 6：監控。

sh.status()                              // cluster state
db.orders.getShardDistribution()         // data and chunk distribution per shard
db.adminCommand({ balancerStatus: 1 })   // balancer state

Step 7: if the wrong key is already in production, weigh reshardCollection (5.0+) against an application-level dual-write migration:

db.adminCommand({
  reshardCollection: "shop.orders",
  key: { tenantId: 1, region: 1, _id: "hashed" }
})

reshardCollection 進入 cutover 後不能回退、必須 dry-run 估完時間 + 磁碟 + IO 影響再上。

驗證點：targeted query 比例 > 90%、單 shard QPS 變異係數 < 20%、balancer migration 速率追上寫入速率。
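The per-shard variation check can be computed from any per-shard QPS snapshot. This is a minimal sketch using the population standard deviation; the helper name coefficientOfVariation and the sample numbers are hypothetical.

```javascript
// Coefficient of variation: standard deviation divided by the mean.
function coefficientOfVariation(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  if (mean === 0) return 0;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance) / mean;
}

const balanced = coefficientOfVariation([1000, 1100, 900, 1000]);
const hotShard = coefficientOfVariation([4000, 0, 0, 0]);
console.log(balanced < 0.2, hotShard < 0.2); // true false
```

The second sample is the hot-shard case: the cluster-wide average looks modest while one shard carries all of the load.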

Rollback boundary：shardCollection 是不可逆操作（5.0 前完全不可逆、5.0+ 透過 reshardCollection 可改但需重做）；reshardCollection 進入 cutover 後不能回退。

失敗模式

單調 key 寫熱點：_id（ObjectId）/ 時間戳 / 自增 ID 當 ranged shard key → 所有寫進最後 chunk，scale-out 等於零。修法是 hashed key 或 compound key 把單調軸拌散。

低 cardinality key：用 country 當 shard key、某個 country 佔 80% 流量、chunk 無法繼續 split、該 shard 永久熱。修法是加一個高 cardinality 軸（compound key）讓 chunk 可繼續分。

Tenant skew：B2B 場景大客戶獨佔 chunk、且該 tenant 的 chunk 還會繼續長大、balancer 搬不走。修法 compound key { tenantId: 1, _id: "hashed" } — tenant 隔離但 tenant 內 hash 散開。

Scatter-gather 過多：選了 hashed _id 但業務查詢主要是 tenantId 範圍查、每筆 query 打所有 shard、p99 隨 shard 數線性退化。修法 compound key 把常用查詢軸放第一位、targeted query 才能對 single shard。

Resharding 卡在 build 階段：磁碟不夠（需 1.2x source size）、IO 飽和影響線上 workload、預期 4 小時實際跑 14 小時。修法是先擴磁碟、staging 跑 dry-run 量實際耗時、production 在低峰期啟動。

Zone sharding 規則打架：合規規則（資料必須留在某 region）跟負載平衡規則衝突、balancer 無法移動 chunk → 熱點固化。修法是 zone 規則 vs balancer 設計階段就劃清、不要事後加 zone。

誤把多 cluster 當分 shard 解：blast radius 議題塞到 sharded cluster、單 cluster 故障仍打掉全部 microservice。該分 cluster 的就分 cluster、不是塞到 shard。9.C38 Toyota 揭露：7K txn/sec 仍切 20 DB 的 trigger 是 microservice ownership、不是吞吐。

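Over-optimistic scale-out estimates: scaling a MongoDB cluster takes from tens of minutes to days depending on data volume, and is never just a click in the console. 9.C36 Coinbase reports 70 minutes for a cluster scale-out (scope: Coinbase's specific cluster tier, data volume, and Atlas API conditions, measured from the start of reactive scaling to completion; not a general MongoDB commitment). Predictable traffic has to go through predictive / scheduled scaling rather than relying on a sharded cluster to scale out dynamically and absorb a surge (see connection management and cache layer).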

Anti-recommendation：

寫入 < 5K WPS、storage < 1TB、single replica set 還能撐就不該分 shard；分了之後 aggregation、transaction、$lookup、index 成本全部跳一級
shard vs 多 cluster 對照：吞吐沒撞牆但 blast radius / ownership 是議題、走多 cluster 不是強行分 shard（9.C38 Toyota 7K txn/sec 仍切 20 DB 的 trigger）
跨 case 合成 frame：「不是所有資料都該進同一個 MongoDB cluster」、按 microservice ownership / blast radius / 合規邊界切

容量與觀測

關鍵 metric：

Shard 分布健康：每 shard QPS / CPU / disk usage 變異係數（< 20% 合理）
Query 路由：targeted vs scatter-gather query 比例（targeted > 90% 合理）
Balancer 健康：chunk migration rate、balancer round duration
Cluster 邊界：cluster-to-cluster ownership 邊界、跨 cluster query 比例

Mongo command：

sh.status()：cluster 整體狀態
db.coll.getShardDistribution()：collection 在各 shard 的分布
db.adminCommand({balancerStatus:1})：balancer 狀態
db.serverStatus().sharding：sharding metric

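Query routing check: the database profiler cannot be enabled on mongos, so run explain("executionStats") through mongos and inspect executionStats.executionStages.shards[] in the output to see whether the query was served by a single shard.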

回到 4.20 observability evidence：把 shard distribution、targeted ratio、resharding 進度列為 evidence 三件套。

回到 9.4 saturation discovery：hot shard 是 partition-level saturation 的典型例子。

回到 9.5 bottleneck localization：當整 cluster CPU 看似只用 25%、實際是 1/4 shard 在 100%。
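The "25% cluster CPU that is really one shard at 100%" case and the 20% coefficient-of-variation threshold can be checked with a few lines. A minimal sketch: the function names and the sample readings are hypothetical, and the 0.20 threshold is the one stated in this article.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 when the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0

def shard_balance_report(per_shard_load, cv_threshold=0.20):
    """Summarise per-shard load; the cluster average alone hides a hot shard."""
    cv = coefficient_of_variation(per_shard_load)
    return {
        "cluster_avg": mean(per_shard_load),
        "max_shard": max(per_shard_load),
        "cv": cv,
        "balanced": cv < cv_threshold,
    }

hot = shard_balance_report([100.0, 0.0, 0.0, 0.0])     # one of four shards saturated
even = shard_balance_report([50.0, 52.0, 48.0, 50.0])  # roughly even load
print(hot)
print(even)
```

The first report shows an average of 25 next to a maximum of 100, so alerting has to look at the per-shard maximum or the variation, not the cluster average.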

邊界與整合

Sibling deep articles：

schema design pattern — document 形狀決定 shard key 選擇空間
aggregation pipeline optimization — cross-shard aggregation 的 $out / $merge 限制
change streams + Kafka — cluster-wide vs collection-level change stream 在 sharded cluster 的差異
connection management and cache layer: cluster scale-up takes tens of minutes or more, so it has to be paired with predictive scaling and a proxy layer

Migration playbook：

避免自管 sharding 走 → Atlas 用 managed shard tier
徹底重新分區走 shard expansion + multi-DC

跟 1.x 互引：1.10 KV / Document DB 容量規劃把 shard key 列為 capacity 決策；1.12 大規模 DB 遷移實戰收 resharding 失敗 retrospective。

跨 vendor 對照：DynamoDB vendor page（partition key + adaptive capacity + backfill 可改）、Cosmos DB vendor page（partition key 不可改）。

MongoDB Replica Set Read Preference：DB 層 causal session vs cache 層 freshness token

Wed, 27 May 2026 00:00:00 +0000

MongoDB replica set 在小規模時 read preference 五擇一就夠用、primary 走預設、想分擔 primary 改 secondary — 直觀但會在 production 反噬。讀者真正撞到的議題分兩層：DB 層的 read-your-own-write（同 client 寫完馬上讀讀不到）跟跨層的 read-after-write（write 進 MongoDB、cache 還是舊資料）。前者用 causal consistency session 解、後者要走 freshness token 跨層協議。Coinbase 1.5M reads/sec 不是純 MongoDB 撐出來、是 DB + cache 跨層合成。本文把 read preference 機制 + 跨層協作講清楚。

本文不重複 MongoDB vendor overview 已寫過的 replica set 簡介 — 而是 production 部署 + 跨層協作 + 失敗修復的實作層教學。

進本文前先確認 MongoDB 已通過適配判讀：workload 是否落在 MongoDB 適用區（document shape 主導 / contract layer 該放哪 / 跨雲 hedging 是否需要）— 判讀軸見 schema-design-pattern 開頭 3 軸前置判讀。Read scaling 是 已選 MongoDB 後 的容量決策、判讀通不過時 read preference 修補無法救回 vendor 選錯。

問題情境：read scaling 撞牆的兩種長相

典型觸發場景：primary 寫入飽和、TL 提議「讀都打 secondary」想橫向擴容。改完後幾個 production 徵兆連環出現：

User 看到「我剛下的訂單怎麼還沒出現」— write 進 primary、立刻 read 打 secondary、secondary 還沒 apply 該寫入、user 看到 stale data
跨 region replica set：app server 在 Tokyo、primary 在 Singapore、每筆讀走 70ms 跨海 RTT；改 nearest 後 latency 降但 stale read 出現
Replication lag 在 backup 期間飆到分鐘級、secondary read 拿到幾分鐘前的資料、前端報表時間軸對不上
Failover 期間 read preference 沒寫好、client 一直連舊 primary、SocketTimeout 直到 driver retry 邏輯介入

第二類議題、規模更大：把所有 read 打 secondary、replica 數量加到 5-7 仍撐不住 sustained 高 read（>500K reads/sec）；replication lag 升 + secondary CPU 飽和。這時 read preference 已不夠、必須加 cache + 跨層 freshness 機制。

讀者徵兆：rs.printSecondaryReplicationInfo() 顯示 lag 分鐘級、application log 出現「我剛寫的資料讀不到」客訴、failover 演練後 connection error 持續 30s+、cache hit rate 跟 read latency 反向相關。

Case anchor：9.C36 Coinbase 揭露「document model 撐 1.5M reads/sec 靠 cache + freshness token」、含警示「1.5M reads/sec 是 users 服務 加上 cache 的數字、不是 MongoDB cluster 純讀取數字」。跨 region read preference 改 nearest 後 stale read 的具體 incident 細節需未來 case 補完、本文以「常見 failure pattern」處理。

核心機制

MongoDB read preference + read concern 兩軸

Read preference 五種：

primary（預設）：只打 primary、強一致、primary 飽和時無路可走
primaryPreferred：先 primary、primary 不可用 fallback secondary
secondary：只打 secondary、永遠拒 primary、failover 期間若所有 secondary 都不行就拋錯
secondaryPreferred：先 secondary、secondary 不可用 fallback primary
nearest：不是「最近的 secondary」、是「ping latency 最低的 member」（可能是 primary）；driver 用 latency window（預設 15ms）內隨機挑
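The latency-window behaviour of nearest can be sketched as follows. This is a simplified illustration, not driver code: real drivers first filter by mode, tag sets, and staleness, and the member names and RTT values here are invented.

```python
import random

def select_nearest(members, local_threshold_ms=15.0, rng=random):
    """members: list of (name, role, avg_rtt_ms) tuples.

    Keep every member whose RTT is within local_threshold_ms of the fastest
    one, then pick one of them at random. The primary is a valid candidate.
    """
    fastest = min(rtt for _, _, rtt in members)
    window = [m for m in members if m[2] <= fastest + local_threshold_ms]
    return rng.choice(window), window

members = [
    ("tokyo-secondary", "secondary", 2.0),
    ("tokyo-primary", "primary", 10.0),
    ("singapore-secondary", "secondary", 70.0),
]
chosen, window = select_nearest(members, rng=random.Random(0))
print([name for name, _, _ in window], "->", chosen[0])
```

Here the Tokyo primary sits inside the 15ms window, so nearest can send a read to it; the Singapore secondary never qualifies.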

Read concern 是另一軸：

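local: returns the member's most recent data, including writes not yet acknowledged by a majority; best performance, but the data may later be rolled back
available: behaves like local on a replica set; on a sharded cluster it skips the shard-version check, so it has the lowest latency but may return orphaned documents
majority: returns only data that has been written to a majority of members, so a write becomes visible only after a majority has acknowledged it
linearizable: forces the latest data, must be served by the primary, highest latency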

Write concern w: "majority" 保證寫入確認後在多數 member 上、但不保證 secondary 馬上 visible — 兩個概念分開。

Causal consistency session（DB 層機制）

Causal consistency session 解的是 單 client 在 MongoDB cluster 內部 的因果一致：

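The client session tracks clusterTime and operationTime
Each read in the session carries afterClusterTime; the member that serves the read waits until it has applied up to that operationTime before answering (the driver does not pick a member by its applied optime)
This gives read-your-own-writes (you can read what you just wrote), provided the writes use write concern "majority" and the reads use read concern "majority"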

機制只在「同一 client session」內生效。跨 client 的因果一致（A 寫 → B 讀）不在範圍內。

其他輔助機制：

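Tag set: members are tagged, for example {region: "ap-tokyo", role: "analytics"}, and a read preference with tags routes traffic to specific members
Hidden / delayed secondary: priority 0, so it can never become primary (it may still vote); invisible to client reads; used for backup / DR
Election: when the primary becomes unreachable a majority votes for a new primary; electionTimeoutMillis defaults to 10s and MongoDB documents a typical failover of about 12s including detection; reads with read preference primary fail during the election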

Freshness token（cache 層機制）

9.C36 Coinbase 揭露的跨層機制 — 解的是 MongoDB + cache 跨層 的 read-after-write、不是 cluster 內部。對應 Freshness Token 卡片的 application-level 版本協議定義：

觸發條件：直接打 MongoDB 不可能撐 1.5M reads/sec（口徑：users 服務應用層觀察、含 cache、非 MongoDB cluster 純讀取）。Coinbase 在 users 服務前加 Memcached query cache、單 document query 先查 cache。

跨層一致性問題：write 進 MongoDB primary、cache 還是舊資料、client 下次 read 從 cache 拿到舊版。

freshness token 機制：

Write 成功後、server 給 client 一個 token（包含 OCC version / clusterTime）
Client 之後 read 帶這個 token
Server 保證返回的資料版本 ≥ token
若 cache 的版本 < token、bypass cache 直接打 DB

跟 causal consistency session 的關係：兩者解決同一類問題（read-after-write）但作用範圍不同。Causal session 是 DB 層、保證在同一 cluster 內 read-your-own-write；freshness token 是 DB + cache 兩層共用的版本協議、保證跨層 read-your-own-write。

跨層協作三選一

讀者真實系統的 read 一致性需求要選哪層處理：

路徑	適用情境	代價
只用 DB 層（causal session）	無 cache 層、讀寫都直接打 MongoDB cluster	replica scaling 上限約幾十萬 reads/sec
只用 cache 層（freshness token）	有 cache、跨層一致性要求高、application 願改	需設計 token 協議 + cache bypass 邏輯
兩層並用	大規模 OLTP、cluster 內也要 causal、跨 cache 也要 freshness	複雜度最高、但 Coinbase 規模必走此路

對應 knowledge card：stale-read、replication-lag、session-consistency、eventual-consistency。

操作流程

Step 1：read shape 分類。把所有 read 分成四類：

(a) 強一致必須 read-your-own-write（訂單詳情、帳戶餘額）
(b) 容忍秒級 lag（個人資料、商品詳情）
(c) 容忍分鐘級 lag（報表、analytics）
(d) 大規模 read scaling 需 cache + freshness token（用戶資料 / 高頻 product query）

Step 2：依分類對映機制。

分類	Read preference	Read concern	跨層機制
(a)	primary	majority	causal consistency session
(b)	secondaryPreferred	local	monitoring lag alarm
(c)	secondary（tag set）	available	無
(d)	secondaryPreferred	majority	cache + freshness token + bypass

Step 3：driver config（Node.js / Java / Python 都類似）：

mongodb://host1:27017,host2:27017,host3:27017/db?replicaSet=rs0&readPreference=secondaryPreferred&readPreferenceTags=region:ap-tokyo&readPreferenceTags=&maxStalenessSeconds=90&readConcernLevel=majority

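Multiple readPreferenceTags entries form a fallback chain that is tried in order (ap-tokyo first, then the empty tag set, which matches any member). maxStalenessSeconds=90 excludes secondaries whose estimated lag exceeds 90s; 90 is the smallest value the drivers accept.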
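The combined effect of the tag fallback chain and maxStalenessSeconds can be sketched as a pure function. This is an illustration of the selection order only (staleness filter first, then tag sets in order); the member dictionaries and their staleness values are made up, and real drivers estimate staleness from heartbeat data.

```python
def select_by_tags(secondaries, tag_sets, max_staleness_s=90):
    """Return the secondaries eligible for a read.

    secondaries: list of dicts with 'name', 'tags', 'staleness_s'.
    tag_sets: ordered list of dicts; an empty dict matches any member.
    """
    fresh = [m for m in secondaries if m["staleness_s"] <= max_staleness_s]
    for tag_set in tag_sets:  # the first tag set with at least one match wins
        matched = [
            m for m in fresh
            if all(m["tags"].get(k) == v for k, v in tag_set.items())
        ]
        if matched:
            return matched
    return []

secondaries = [
    {"name": "tokyo-1", "tags": {"region": "ap-tokyo"}, "staleness_s": 120},
    {"name": "sgp-1", "tags": {"region": "ap-singapore"}, "staleness_s": 5},
]
with_fallback = select_by_tags(secondaries, [{"region": "ap-tokyo"}, {}])
without_fallback = select_by_tags(secondaries, [{"region": "ap-tokyo"}])
print([m["name"] for m in with_fallback], [m["name"] for m in without_fallback])
```

With the trailing empty tag set the read falls back to the Singapore member when the Tokyo one is too stale; without it no secondary is eligible, and with secondaryPreferred the read would go to the primary.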

Step 4：causal consistency session：

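from pymongo import MongoClient

client = MongoClient("mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")
coll = client["db"]["orders"]  # hypothetical collection; doc is the document being written
with client.start_session(causal_consistency=True) as s:
    coll.insert_one(doc, session=s)
    # This find carries afterClusterTime, so the serving member waits until it has applied the insert
    coll.find_one({"_id": doc["_id"]}, session=s)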

Session 結束後因果關係結束、下個 session 不繼承。

Step 5：freshness token 設計（9.C36 Coinbase 模式）：

Write API 返回 {result, version_token} — token 含 OCC version 或 MongoDB clusterTime
Read API 接受 optional If-Version-≥ header / parameter
Cache lookup 比對 cache entry version 跟 token、低於 token 就 invalidate + bypass 到 MongoDB
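DB-layer read: readConcern: "majority" alone guarantees durability, not recency; to guarantee the returned version ≥ token, read from the primary or pass the token's clusterTime as afterClusterTime (a causal consistency session does this)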
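The token flow in Step 5 can be sketched with in-memory stand-ins. This is not Coinbase's implementation: VersionedStore and FreshnessCache are hypothetical names, a plain integer plays the role of the OCC version, and a dict replaces both MongoDB and Memcached.

```python
class VersionedStore:
    """In-memory stand-in for the database; every write bumps the version."""

    def __init__(self):
        self._docs = {}

    def write(self, key, value):
        version = self._docs.get(key, (0, None))[0] + 1
        self._docs[key] = (version, value)
        return version  # handed back to the client as its freshness token

    def read(self, key):
        return self._docs[key]


class FreshnessCache:
    """Cache that serves an entry only if it is at least as new as the token."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.hits = 0
        self.db_reads = 0  # misses plus token-forced bypasses

    def read(self, key, min_version=0):
        entry = self.cache.get(key)
        if entry is not None and entry[0] >= min_version:
            self.hits += 1
            return entry[1]
        self.db_reads += 1
        version, value = self.store.read(key)
        self.cache[key] = (version, value)  # refresh the cache on bypass
        return value


store = VersionedStore()
cache = FreshnessCache(store)
store.write("user:1", "old")
first = cache.read("user:1")                        # miss, fills the cache
token = store.write("user:1", "new")                # cache is now stale
stale = cache.read("user:1")                        # no token: stale hit
fresh = cache.read("user:1", min_version=token)     # token forces a bypass
print(first, stale, fresh, cache.hits, cache.db_reads)
```

The read without a token returns the old value, which is exactly the silent-failure mode described later for a missing token; the read that carries the token bypasses the cache and refreshes it.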

Step 6：staging 驗證。灌入 replication lag（暫停 secondary apply）驗證 application 行為；灌入 stale cache 驗證 token bypass 邏輯；模擬 failover 驗證 driver retry。

驗證點：

rs.printSecondaryReplicationInfo() lag < SLO
driver metric readPreferenceUsageCount 分布符合預期
failover drill 後 read recovery < 15s
cache hit rate vs freshness bypass rate 比例監控

Rollback boundary：read preference 是 driver-side config、可以 hot-swap；causal consistency session 需 application code 改、需灰度；freshness token 是 application + cache + DB 三方協議、回退需協調。

失敗模式

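Read-after-write inconsistency (DB layer): write to the primary, then immediately read from a secondary; the application's race condition shows up as "data disappeared". The fix is a causal consistency session: reads carry afterClusterTime and the serving member waits until it has applied the write.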

Read-after-write 不一致（跨層）：寫 primary → cache 還是舊資料 → user 看到舊資料。causal session 解不了（cache 在 MongoDB 外）、必須走 freshness token 跨層協議。

Stale read 在 lag 高峰：backup / DDL / 大量寫入導致 secondary lag 分鐘級、secondary read 拿到舊資料。修法設 maxStalenessSeconds 拒舊 member、driver 自動轉到較新的 member 或 primary。

nearest 在跨 region 不穩：latency 抖動讓 driver 在 primary / secondary 跳、寫一致性與 read latency 同時惡化。修法是不要用 nearest 解跨 region 議題、應該用 tag set 明確路由。

Failover 期間 primary read 全失敗：election 10s 內所有 primary read 拋錯。修法改 primaryPreferred + driver retry 邏輯吃掉短暫失敗、application 端配 retry policy。

Tag set 失準：把 region: "ap-tokyo" 的流量路由到 tag 為 tokyo 的 member、但該 member 故障時沒 fallback、流量直接停。修法是 tag 設多層 fallback chain、最後一層留空 tag 表示「任意 member」。

Analytical query 跑 OLTP secondary：secondaryPreferred 把報表打 OLTP secondary、報表 query 拖垮 OLTP read latency。修法是 analytical workload 用 tag set 路由到專屬 analytics secondary、跟 OLTP read 隔離。

Freshness token 漏寫：write 沒帶 token 給 client / client 沒帶 token、token 機制 silently 失效、read 走 cache 拿舊資料。修法 token 必須 e2e 強制（middleware 自動帶 / 自動驗證）、不能靠 application 自覺。

Cache bypass 比例失控：所有 read 都 bypass cache、cache 等於沒裝。修法是 token 失敗率要監控、過高表示 cache invalidation 設計有問題（cache 沒在 write 後 update / invalidate）。

Anti-recommendation：

read-heavy 但有強一致需求的場景不要為了 scale 改 secondary read；該換 SQL + read replica 加 application-level cache、或加 sharding 把 primary 寫散開
大規模 OLTP（>500K reads/sec）想單靠 MongoDB read preference 撐 = 拿不到那個量級。Coinbase 案明示「直接打 MongoDB 不可能撐 1.5M reads/sec」、必須 cache + freshness token

容量與觀測

關鍵 metric：

Replica health：每個 member 的 opcounters 分布、rs.status().members[].optimeDate 推算 lag
Read preference 命中：driver-side readPreferenceTags 命中率
一致性 SLO：stale read 比例（causal consistency 拒絕重試次數）
跨層 freshness：cache hit rate vs freshness bypass rate

Mongo command：

rs.status()：replica set 整體
rs.printSecondaryReplicationInfo()：lag 概況
db.serverStatus().repl：詳細 replication metric
db.adminCommand({replSetGetStatus:1})：完整 status

Application observability：APM 看「同一 session 內 write + read 順序對 latency / error 的影響」、SLO 是 read-your-own-write 命中率；跨層還要看 freshness token 流動完整性（write 是否發 token、read 是否帶 token、cache 是否驗 token）。

Lag alarm：lag > 30s 預警、> 90s 觸發 driver maxStalenessSeconds 自動拒讀。
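The lag thresholds above can be evaluated from the shape of rs.status() output. A minimal sketch: it assumes the caller has already fetched the status document (the sample below is hand-written), and it uses only the name, stateStr, and optimeDate fields of each member.

```python
from datetime import datetime

def secondary_lag_seconds(status):
    """Lag of each secondary relative to the primary's optimeDate."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

def classify_lag(lag_s, warn_s=30, reject_s=90):
    """Map a lag value onto the article's thresholds: warn above 30s, reject above 90s."""
    if lag_s > reject_s:
        return "reject"
    if lag_s > warn_s:
        return "warn"
    return "ok"

status = {
    "members": [
        {"name": "host1", "stateStr": "PRIMARY", "optimeDate": datetime(2026, 5, 27, 0, 2, 0)},
        {"name": "host2", "stateStr": "SECONDARY", "optimeDate": datetime(2026, 5, 27, 0, 1, 55)},
        {"name": "host3", "stateStr": "SECONDARY", "optimeDate": datetime(2026, 5, 27, 0, 1, 15)},
        {"name": "host4", "stateStr": "SECONDARY", "optimeDate": datetime(2026, 5, 27, 0, 0, 0)},
    ]
}
lags = secondary_lag_seconds(status)
levels = {name: classify_lag(lag) for name, lag in lags.items()}
print(lags, levels)
```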

回到 4.20 observability evidence：把 read preference 命中分布、replication lag time series、failover drill recovery time、freshness token bypass rate 列為 evidence。

回到 9.5 bottleneck localization：read latency 異常時要區分 (a) primary 飽和 (b) secondary lag 高 (c) tag routing 把流量集中到單一 member (d) cache hit rate 下降 / bypass 率上升。

邊界與整合

Frame 5：合規邊界 — MongoDB 用 cluster-per-region 吸收

MongoDB / Atlas 沒有 row-level locality 機制（不像 CockroachDB 可把單 row pin 在合規 region）— 跨境合規必須以 cluster-per-region 拓樸吸收：每個合規市場開獨立 cluster、application 層做 routing、不靠 replica set / sharded cluster 機制跨 region。

跨 vendor 對照：

Vendor	合規吸收機制	拓樸特性
MongoDB / Cosmos DB	cluster-per-region（無 row-level locality 等價物）	各 region 獨立 cluster、application 層做市場 routing
Aurora	fleet 拓樸（每市場獨立 cluster、Global Database 在合規場景反指標）	active-passive per market、跨市場不複製
CockroachDB	locality + placement（邏輯一個 cluster + region pinning + Outposts）	單 logical cluster、physical row 鎖在合規 region
DynamoDB	region-pinned Global Tables（按 region 開關 replication、各市場可分離）	仍 active-active、但 replication 範圍可控

MongoDB 在這 frame 的退化點：read preference 機制本身不解合規 — 即使 readPreferenceTags={region:eu} 把流量路由到歐洲 secondary、但 primary 在亞洲時跨境 replication 仍在跑、合規 audit 不會放行 路由層 控制當作 資料邊界 控制。合規市場必須整 cluster 分離、再用 application 層 routing 把 user 帶到對應 cluster。

Atlas 在合規場景的 fit：Atlas global cluster（zone sharding 把 shard 鎖在 region）是「跨 region 但 資料 pin 在 zone」的中介選項、適合 GDPR 軟條款（資料在歐洲 EEA 內可流動）；strict 條款（資料不能離開單一國家）仍須走 cluster-per-region。

Sibling 與 cross-link

Sibling deep articles：

shard key selection — read preference 解決不了 write 飽和、要切 shard
change streams + Kafka — change stream 預設打 primary、放 secondary 的 trade-off
aggregation pipeline optimization — 把 analytical aggregation 路由到專屬 secondary
connection management and cache layer — freshness token 是該篇的核心議題之一、本文聚焦 DB 層 vs cache 層機制對照、不展開 cache 部署架構

Migration playbook：

Cross-region strong consistency requirement → Cosmos DB MongoDB API (5 consistency levels)
Cross-region while keeping native MongoDB → Atlas global cluster

跟 1.x 互引：1.1 高併發資料存取處理 read scaling pattern；1.11 全球分散式 OLTP 處理跨 region 一致性升級路徑。

MongoDB Connection Management and Cache Layer：driver × 部署模型 × cache × predictive scaling

Wed, 27 May 2026 00:00:00 +0000

MongoDB 大規模 OLTP 的真實架構不是「一個 driver pool 直連 cluster」、是 driver / proxy 層 + cache + freshness token 層 + scaling trigger 層三層協作。讀者最常的誤解是「Coinbase 用 MongoDB 撐 1.5M reads/sec」— 實際是這個合成架構撐出來的量級、單靠 MongoDB cluster 拿不到那個數字。本文把三層各自議題跟整合操作流程講清楚、並對 mongobetween 的部署模型適用範圍給出明確邊界。

本文不重複 MongoDB vendor overview 的 Atlas / 容量規劃簡介 — 而是 production 部署 + 跨層協作 + 失敗修復的實作層教學。

問題情境：大規模 OLTP 撞三道牆

MongoDB 部署規模從中型撐到大規模時、會連環撞三道牆：

Connection ceiling：應用層 deploy 規模一上來、單一 MongoDB cluster 看到 connection storm。9.C36 Coinbase 揭露具體：Ruby + GVL + blue-green 部署把 instance 數 ×2、連線數隨之 ×2、單一 cluster 看到 60K connections / 分鐘（口徑：Coinbase 特定環境 CRuby + GVL 部署模型）。MongoDB cluster 的 connection limit 撞牆、新 deploy 連不上、線上服務 cascade 失敗。

Read scaling ceiling：讀者把所有 read 都打 secondary、replica 加到 5-7 仍撐不住 sustained 高 read（>500K reads/sec）。Replication lag 升 + secondary CPU 飽和；單靠 MongoDB cluster 內機制（replica scaling + read preference）拿不到大規模量級。

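Scaling reaction lag: scaling a MongoDB cluster takes tens of minutes or more; it is not instant. 9.C36 Coinbase reports about 70 minutes from the start of reactive scaling to completion (scope: Coinbase's specific environment, for its cluster tier, data volume, and Atlas API conditions; not a general MongoDB commitment). Acting only when the surge starts is too late, so predictable traffic has to be handled ahead of time.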

Surge 形狀又不規則：加密貨幣 surge（隨外部市場波動）/ 媒體爆量（事件驅動）/ IoT 緊急通報（雙模式並存）— 都不適合單純 reactive auto-scaling 接住、必須 predictive + reactive 兩段式。

讀者徵兆：

MongoDB Atlas console 看到 connection count 在 deploy 後 spike 到上限
p99 read latency 在事件時段集體爬
Atlas auto-scaling event log 顯示 triggered too late
Cache hit rate 跟 read latency 反向相關

Case anchor：9.C36 Coinbase 是 rich case，含具體數字（deploy 尖峰 connection event rate ~60K connections / 分鐘 / mongobetween 後 steady-state concurrent connections 由 ~30K 降到 ~2K — 兩者口徑不同、不是同一數字的連續變化；1.5M reads/sec 含 cache / 70 → 25 分鐘擴容）；9.C38 Toyota Connected 雙模式負載敘事（持續 sensor + 緊急事件）、9.C37 Forbes 媒體爆量形狀。

核心機制：三層合成 frame

跨案合成 frame（本章合成、case 原文沒這個 frame）：應用層連 MongoDB cluster 在大規模 production 是 三層協作、不是 driver 一個元件：

層次	角色	9.C36 Coinbase 對應元件
Driver / Proxy	連線多工、應用 process 跟 cluster 的橋接	MongoDB driver + mongobetween proxy
Cache + freshness token	read scaling 主路、跨層一致性協議	Memcached + freshness token + OCC version
Scaling trigger	cluster 擴容啟動時機	ML predictive scaling + reactive fallback

三層缺一都會在大規模時撞牆。本文聚焦這三層如何協作、單一層的深度議題（read preference 機制、schema 治理、aggregation pipeline）推到 sibling。

Driver / Proxy 層

MongoDB driver 原生 connection 模式：driver 在 application process 內維護 connection pool、每個 process 跟 MongoDB cluster 開固定數量 socket。但 driver 沒跨 process pool — 多個 process 共用同一台機器、每個 process 自己一份 pool、cluster 看到的是 N 倍 connection。跟 PostgreSQL 走 pgbouncer 是同樣需求。

Connection storm 的具體 trigger：

部署模型放大 process 數：CRuby + GVL 強制每 CPU core 一 process、blue-green 部署 instance 數 ×2、連線數隨之 ×2（9.C36 Coinbase 揭露：單 cluster 看到 60K connections/min）
微服務數量多：50+ microservice 各自連 cluster、每服務 connection 加總後撞上限（9.C37 Forbes 50+ 微服務情境對照）

mongobetween proxy（Coinbase 自建）：把多 application process 的連線合成少量到 MongoDB cluster 的連線。9.C36 揭露兩個獨立口徑、不是同一數字的連續變化：deploy 尖峰時 connection event rate 是 ~60K connections / 分鐘（unique connection 事件量、rate）；mongobetween 介入後 steady-state concurrent connection 數 由 ~30K 降到 ~2K（瞬時量、前後對比、一個量級）。引用時把 rate 跟瞬時 concurrent count 分開、不要壓成「60K 收斂到 2K」。

Scope warning（必明示）：mongobetween 是 Coinbase 為 Ruby + GVL 需求自建、case 自承「Go / Java / Node.js 應用因原生支援連線多工、通常不需要這層 proxy」。寫進設計文件時不可寫成「MongoDB 在大規模都需要 mongobetween」、要寫成「特定部署模型才需要」。

Cache + freshness token 層

直接打 MongoDB 不可能撐 1.5M reads/sec（口徑：users 服務應用層觀察、含 cache、非 MongoDB cluster 純讀取）。Coinbase 在 users 服務前面加 Memcached query cache、單 document query 先查 cache。

跨層一致性問題：write 進 MongoDB primary、cache 還是舊版、user 下次 read 拿到舊資料。

Freshness Token 機制：

Write 成功後給 client token（含 OCC version / clusterTime）
Client read 帶 token
Server 保證返回的資料版本 ≥ token
必要時 bypass cache 直接打 DB

跟 DB 層 causal consistency session 對照：causal session 解 MongoDB 內 read-your-own-write、freshness token 解 DB + cache 跨層 read-your-own-write。機制細節見 replica set read preference、本文不重複展開。

Scope warning（必明示）：1.5M reads/sec 是 users 服務 + cache 合成數字、不是 MongoDB cluster 純讀取 benchmark。寫進設計文件必須明示口徑、避免讀者把 1.5M reads/sec 當成「MongoDB 單獨能撐」。

Scaling trigger 層

MongoDB cluster 擴容時間：傳統 reactive scaling 起點到完成 ~70 分鐘（9.C36 Coinbase 揭露口徑：含 instance provisioning + 資料 sync + balancer rebalance、特定 Atlas tier / 資料量條件）。

Reactive 為主撐不住快變流量：CPU / queue 觸發 reactive scaling 在 surge 開始時才動、來不及；surge 已經結束擴容才到位。

Predictive scaling 機制（Coinbase 揭露）：

用外部訊號（加密貨幣價格、賽事行程、票務開賣時間）訓練 ML 模型
提前 60 分鐘預測流量
預先擴容
把擴容啟動時間從 70 分鐘壓到 25 分鐘（口徑：trigger 提前、不是擴容本身變快）

Scope warning（必明示）：case 警示「ML 預測有 false positive / false negative、Coinbase 沒揭露準確率、所以仍保留 reactive scaling 作為 safety net」。寫進設計文件要明示兩段式設計、不可寫成「Predictive scaling 取代 reactive scaling」。

對應 knowledge card：connection-pool、stale-read、session-consistency、hot-partition（cache 失效時打穿 DB 的 hot key）。

操作流程

Step 1：connection ceiling audit。量測現有 deploy 在 peak 的 connection count、推算 deploy ×2 / 微服務新增時 connection 走勢；對照 MongoDB cluster 的 hard limit（Atlas tier 決定、典型 1500-32000）。
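The audit in Step 1 is mostly multiplication, so it is worth writing down. A rough sketch under stated assumptions: every process opens its pool to the maximum, monitoring connections are ignored, the instance counts and pool size are invented, and the 30% headroom is the general engineering estimate used later in this article.

```python
def peak_connections(instances, processes_per_instance, max_pool_size, deploy_multiplier=2):
    """Worst-case connections seen by the cluster while blue and green coexist."""
    return instances * processes_per_instance * max_pool_size * deploy_multiplier

def fits_limit(peak, cluster_limit, headroom=0.30):
    """True if the peak stays under the cluster limit minus the reserved headroom."""
    return peak <= cluster_limit * (1 - headroom)

# Hypothetical fleet: 50 instances, 8 processes each (process-per-core), pool of 10.
peak = peak_connections(instances=50, processes_per_instance=8, max_pool_size=10)
print(peak, fits_limit(peak, 16_000), fits_limit(peak, 6_000))
```

In this made-up fleet the deploy peak is 8,000 connections, which fits a 16,000 limit with headroom but not a 6,000 limit; the second case is where a proxy layer or a tier upgrade enters the discussion.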

Step 2：部署模型判讀。

部署模型	是否需 proxy 層	原因
CRuby + GVL（process-per-core）	需要	每 core 一 process、連線隨 process 線性升
大量微服務（50+）+ 各自 deploy	需要	微服務 connection 加總撞 cluster limit
Blue-green 部署（雙環境並存）	需要	部署期間連線 ×2、容易撞 cluster ceiling
Go / Java / Node.js 單一 binary + 多 thread	通常不需要	原生 driver pool 跨 thread 共用、收斂效率高

Step 3：proxy 選型。Coinbase mongobetween 是參考實作、社群還有 mongoproxy / DocumentDB 內建 connection multiplexer。自建 proxy 是 Coinbase 規模才合理、中型團隊先評估 Atlas tier 升級。

Step 4：cache layer 設計（read scaling 主路）：

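Front the cluster with Memcached / Redis; cache key = collection + document id, with the version stored inside the cache entry so it can be compared against the token
Write API returns {result, version_token}; the token carries an OCC version or the MongoDB clusterTime
Read API accepts an optional version token; the cache lookup compares the entry version with the token and, when the entry is older, invalidates it and bypasses to the DB
DB-layer fallback: read from the primary, or pass the token's clusterTime as afterClusterTime; readConcern: "majority" alone guarantees durability, not that the returned version ≥ token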

Step 5：predictive scaling 設計（適用「外部訊號可預測流量」）：

識別 driver 訊號：加密貨幣價格 / 賽事行程 / 票務開賣 / 促銷活動 / IoT 緊急事件預警
訓練 ML：用歷史流量 vs 訊號 correlation 訓練、輸出未來 30-60 分鐘流量預測
觸發擴容：預測超 threshold 時主動 trigger Atlas scaling API、不等 reactive metric
保留 reactive safety net：ML failure 時 reactive scaling 仍會接、不可拿掉
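The two-stage trigger in Step 5 can be expressed as one decision function. A minimal sketch: the thresholds (80% of capacity for the forecast, 75% CPU for the reactive path) are placeholder assumptions, and a forecast of None stands for a failed or missing ML prediction.

```python
def scaling_decision(forecast_rps, current_cpu_pct, capacity_rps,
                     forecast_threshold=0.8, cpu_threshold=75.0):
    """Return which path should trigger a scale-up: 'predictive', 'reactive', or 'none'."""
    # Predictive path: act ahead of the surge when the forecast nears capacity.
    if forecast_rps is not None and forecast_rps > capacity_rps * forecast_threshold:
        return "predictive"
    # Reactive safety net: still fires when the model is missing or wrong.
    if current_cpu_pct > cpu_threshold:
        return "reactive"
    return "none"

decisions = [
    scaling_decision(9_000, 40.0, 10_000),   # forecast says a surge is coming
    scaling_decision(5_000, 90.0, 10_000),   # false negative: CPU already hot
    scaling_decision(None, 90.0, 10_000),    # model unavailable
    scaling_decision(5_000, 40.0, 10_000),   # quiet period
]
print(decisions)
```

Keeping the reactive branch unconditional is the point: the second and third cases are the false-negative and model-failure paths that the case study says must stay covered.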

Step 6：全鏈路驗證。Staging 灌入 deploy ×2 模擬 connection storm、灌入 stale cache 驗證 freshness token bypass、放假流量驗證 predictive scaling trigger。

驗證點：

Connection count 在 deploy 後不爆 cluster limit
Cache hit rate vs freshness bypass rate 比例正常（cache hit > 90% + bypass < 5% 屬通用工程估算、case 未揭露具體數字）
Predictive scaling 領先窗 ≥ 30 分鐘
Reactive scaling 仍保留作 safety

Rollback boundary：

Proxy 層可下線（流量改直連 cluster、但短時 connection storm 風險回來）
Cache 層可下線（read 全部打 DB、需 cluster 容量能撐）
Predictive scaling 可下線（退回純 reactive、但快變 surge 接不住）
三層都要設計 graceful degradation、不是全有全無

失敗模式

Connection storm during deploy：blue-green 部署 instance 數 ×2、connection 隨之爆、新 deploy 連不上 cluster、cascade 失敗。修法是 proxy 層 + cluster connection limit 預留 headroom（典型留 30% buffer、屬通用工程估算）。

Proxy 變成單點瓶頸：mongobetween / pgbouncer 風格 proxy 自己變熱點、proxy 故障時下游全死。修法是 proxy 叢集 + health check + 客戶端 retry、跟 application 同 region 共部署降低 proxy ↔ application 的網路 RTT。

Cache hit rate 崩塌：cache 失效 + 大量 read bypass、DB 突然吃 100% 流量、cluster 飽和。修法是 freshness token 設計時要監控 bypass rate、過高表示 cache invalidation 邏輯有問題、cache 沒在 write 後 update / invalidate。

Freshness token 漏寫：write 沒帶 token / client 沒帶 token、token silently 失效、user 拿到舊資料。修法是 protocol 強制（middleware 攔截 write / read、自動帶 token）、不能靠 application 自覺。

Predictive scaling false positive 浪費容量：ML 預測 surge 但實際沒來、cluster 預先擴容後閒置。接受成本、保留 ML model retraining、定期評估 precision / recall。

Predictive scaling false negative 漏接 surge：ML 沒預測到、cluster 沒提前擴、surge 來時 reactive scaling 開始動但 70 分鐘來不及。修法是 reactive safety net + 服務降級（限流 / 部分 read 降級拿舊資料 + freshness token 告警）。

三層協作脫節：proxy 擋住 connection storm 但 cluster 內部 read scaling 沒設計、application 仍打爆。三層必須一起設計、不是各自獨立。

Anti-recommendation：

中小流量（< 100K reads/sec、單 deploy < 50 instance）不需要這三層；Atlas tier 升級 + cluster 內 replica + 簡單 cache 就夠
mongobetween 風格 proxy 只在 Ruby + GVL / 類似部署模型才必要、Go / Java / Node.js 通常不需要（case 自承）
Predictive scaling 只在外部訊號可預測時有效；無預測訊號的純隨機 surge 還是回 reactive + headroom
大規模 OLTP 不該為了省成本拿掉 cache 層；read scaling 主路就是 cache、單靠 MongoDB cluster 拿不到 1.5M reads/sec 量級

容量與觀測

關鍵 metric：

Connection 層：cluster connection count / Atlas tier limit / proxy 到 cluster 的 connection multiplex 比、deploy 前後 connection 走勢
Cache 層：cache hit rate、freshness token bypass rate、cache key collision rate
Scaling 層：predictive scaling trigger event count / 領先窗、reactive scaling fallback 觸發頻率、實際擴容啟動到完成時間、ML 預測準確率（precision / recall）

Mongo / Atlas command：

db.serverStatus().connections：cluster 當前 connection 統計
db.currentOp({})：看 connection 使用
Atlas API：cluster scaling event log
Proxy admin metric：connection multiplex 比、上下游 latency

Application observability：APM 看 connection acquire latency、cache hit rate time series、freshness token 流動完整性（write 是否發 token、read 是否帶 token、cache 是否驗 token）。

回到 4.20 observability evidence：把 connection storm event、cache hit rate / bypass rate、scaling trigger leadtime 列為跨層 evidence 三件套。

回到 9.5 bottleneck localization：大規模 OLTP 撞牆時要區分 (a) connection ceiling (b) cache hit rate 下降 (c) cluster 內 replica 飽和 (d) scaling 跟不上。

邊界與整合

Sibling deep articles：

replica set read preference — DB 層 causal session 機制、freshness token 跨層協議；本文聚焦三層協作、那篇聚焦 DB 層機制
shard key selection: cluster scale-up takes tens of minutes or more, which is what drives the scaling trigger layer; single cluster vs multi-cluster split
schema design pattern — app-layer abstraction 跟本文 cache + freshness token 同層協作、contract layer 三選一
aggregation pipeline optimization — report dashboard 跑爆 primary 的補位路徑是本文的 cache + read scaling、不是讓 aggregation 自己優化

Migration playbook：

Federated DB 模式（9.C36 Coinbase 揭露：MongoDB + DynamoDB）— 不是「全用 MongoDB」、document-shaped 用 MongoDB、access pattern 固定的 KV 用 DynamoDB；對應 DynamoDB vendor page 跨 vendor 對照
跨雲 hedging（9.C37 Forbes 跨雲彈性）— Atlas 跨 AWS / GCP / Azure 是規避未來雲商鎖定的 selection 訊號

跟 1.x 互引：

1.1 高併發資料存取 — connection storm 通用模式（pgbouncer / mongobetween 對應）
1.10 KV / Document DB 容量規劃 — 三層架構列為大規模 OLTP 容量規劃必看點
9.6 容量規劃模型 — predictive scaling 的 ML 訓練紀律

MongoDB Aggregation Pipeline Optimization：stage 順序、index 配合與 memory 邊界

Wed, 27 May 2026 00:00:00 +0000

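The MongoDB aggregation pipeline is the document model's main interface for analytical queries. The stage-stream design is intuitive but easy to get wrong in production: 200ms at launch, 8s half a year later after the data has grown by an order of magnitude (20x in the worked example later in this article), and adding an index does not help; the profiler shows hundreds of MB of temp data accumulating in memory between stages. Optimizing a pipeline follows a different logic from an RDBMS SQL planner: an RDBMS relies on the planner to reorder joins and filters, while MongoDB relies largely on the query author to order the stages. This article covers stage mechanics, index use, memory boundaries, and cross-shard limits, and gives a governance path for the common "report dashboard overloads the primary" anti-pattern.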

本文不重複 MongoDB vendor overview 已寫過的 aggregation 簡介 — 而是 production tuning + 失敗修復的實作層教學。

前置閱讀：MongoDB workload 適配判讀（document shape 主導 / contract layer 該放哪 / 跨雲 hedging 是否需要）見 schema-design-pattern 開頭 3 軸前置判讀。本文聚焦 aggregation pipeline 操作層、是 已選 MongoDB 後 的 query 層工程議題、不重複前置判讀。

問題情境：aggregation 是 hot path 的反模式

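Typical trigger: a report pipeline runs in 200ms at launch; half a year later the data has grown by an order of magnitude and it takes 8s, and adding an index does not help; the profiler shows hundreds of MB of temp data accumulating in memory between stages.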

進一步徵兆：

「OLTP collection 上跑 analytical query」的混合 workload：把 $group + $lookup + $sort 接成長 pipeline、aggregation 把整個 working set 從 cache 擠走
Sharded cluster 上跑 cross-shard aggregation：$group / $sort 必須在 mongos 合併、mongos 變單點瓶頸
$lookup 出現在 hot path：每筆 input doc 都要去另一個 collection 查、嚴格意義上是 N+1
db.serverStatus().metrics.aggStageCounters 飆、executionStats.executionTimeMillis 跟 doc 數線性增長
The profiler reports usedDisk: true, or the aggregation fails with QueryExceededMemoryLimitNoDiskUseAllowed

Case anchor：report dashboard 跑爆 primary 的具體 incident 細節需未來 case 補完、本文以「常見 anti-pattern」處理、不憑空編造 incident 數字。側面引用 9.C30 Microsoft 365 — 從 MongoDB 把 analytics 分離出來的 driver。

核心機制

Aggregation pipeline 是 stage 序列：每個 stage 接 stream of document、產出 stream of document。Stage 順序直接決定後續 stage 處理量 — 第一個 stage 是 IXSCAN 還是 COLLSCAN、$match 推到前面還是後面、$project 早 drop 還是晚 drop、都會放大或縮小後續 cost。

Optimizer rewrite：MongoDB 會自動把 $match / $project 往前推、把 $sort + $limit 合併成 top-K、但不保證所有 case。用 explain("executionStats") 看 rewrite 後的 effective pipeline、不要靠原始 pipeline 推斷實際執行順序。

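Index use: when the first stage of the pipeline is a $match or $sort that an index can serve, the pipeline starts with an IXSCAN. Later stages consume an in-memory stream and cannot use indexes on the source collection (a $lookup can still use an index on the foreign collection's foreignField). So put the selective $match first, backed by a matching index.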

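Memory boundary: each blocking aggregation stage has a 100MB memory limit by default; beyond that it needs allowDiskUse: true (on by default from 6.0, via allowDiskUseByDefault). Once a stage spills to disk, IO dominates and the aggregation slows sharply (the 50-100x figure used in this article is a rough engineering estimate, not a measured benchmark).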

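$lookup on a sharded cluster: before 5.1 the foreign (from) collection could not be sharded; 5.1+ allows it. Before 6.0 $lookup ran as a nested loop join (indexed when foreignField has an index); 6.0+ can also choose a hash join for a small foreign collection, but there is still no merge join, so joining two large collections remains impractical.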

$facet 平行多 pipeline：但所有 facet 共享同一個 100MB 限制、複雜 facet 容易撞 memory ceiling。

$merge / $out：把結果寫回 collection（pre-computed view / materialized view）— 把 hot analytical query 移出 read path、是治理 anti-pattern 的主要工具。

對應 knowledge card：hot-partition（aggregation 集中讀單 shard 的副作用）、document-store、stale-read（從 secondary 跑 aggregation 的 trade-off）。

操作流程

Step 0：把壞 pipeline 跟好 pipeline 並排。看一個簡化但典型的優化：

// Bad: $lookup runs before $match, and $sort has no $limit
db.orders.aggregate([
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } },
  { $match: { status: "completed", "user.region": "ap-tokyo" } },
  { $sort: { createdAt: -1 } },
  { $project: { _id: 1, total: 1, createdAt: 1 } }
])

// Good: the pushable $match comes first, $sort is paired with $limit, $lookup touches 100 docs
// Note: $limit runs before the region filter, so this returns at most 100 rows, possibly fewer
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $sort: { createdAt: -1 } },
  { $limit: 100 },
  { $lookup: { from: "users", localField: "userId", foreignField: "_id", as: "user" } },
  { $match: { "user.region": "ap-tokyo" } },
  { $project: { _id: 1, total: 1, createdAt: 1, "user.name": 1 } }
])

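Difference: the bad pipeline runs $lookup over all of orders and only then filters; the good pipeline filters first, keeps the top 100, runs $lookup on those 100 documents, and then filters the lookup result. The two are not equivalent: because $limit precedes the region filter, the good pipeline returns the ap-tokyo orders among the newest 100 completed orders, which can be fewer than 100 rows. If the report needs the newest 100 ap-tokyo orders, the region has to be filterable before the $limit, for example by denormalizing it onto the order. On a large collection the gap between the two can reach one to two orders of magnitude (a rough estimate, not a measured figure).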

Step 1：拿 explain plan。

db.coll.explain("executionStats").aggregate([...])

Look at stages[] for the effective pipeline after rewrite, executionTimeMillis, the totalDocsExamined / nReturned ratio, and whether usedDisk is set.

Step 2：把 $match 推到最前。越早過濾、後續 stage 處理量越小。Optimizer 通常自己會推、但 $lookup 之後的 $match 不會自動推到 $lookup 之前 — 因為 lookup 出的欄位才能被那個 match 用、邏輯依賴。寫 query 時就把能推前的 $match 寫前面。

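Step 3: build a compound index on the $match fields. Make sure executionStages shows IXSCAN rather than COLLSCAN. Compound index order matters: { status: 1, createdAt: -1 } is efficient for { status: ..., createdAt: { $gte: ... } } but cannot serve { createdAt: { $gte: ... } } on its own, because the query does not constrain the index's leading field.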
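The prefix rule behind Step 3 can be sketched as a small check. This is a deliberate simplification, not the query planner: it ignores sort direction, range versus equality ordering, and index intersection, and only answers "how much of the index's leading prefix does this query constrain".

```python
def usable_index_prefix(index_fields, query_fields):
    """Return the leading index fields that the query constrains.

    An empty result means the index cannot narrow the scan for this query.
    """
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break  # a gap in the prefix stops index use for later fields
        prefix.append(field)
    return prefix

index = ["status", "createdAt"]
both = usable_index_prefix(index, {"status", "createdAt"})
status_only = usable_index_prefix(index, {"status"})
created_only = usable_index_prefix(index, {"createdAt"})
print(both, status_only, created_only)
```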

Step 4：$sort + $limit 寫在一起。Optimizer 才會推 top-K（不需要 full sort、只需要 heap）。單 $sort 不限 limit 會做 full sort、容易撞 memory。
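Why top-K is cheaper than a full sort can be shown outside MongoDB. The sketch below uses Python's heapq as an analogy for the optimizer's top-K behaviour; it is not how the server implements it, and the sample documents are synthetic.

```python
import heapq

def top_k_newest(docs, k):
    """Top-K via a bounded heap: memory stays proportional to k, not to len(docs)."""
    return heapq.nlargest(k, docs, key=lambda d: d["createdAt"])

def full_sort_newest(docs, k):
    """Full sort then slice: every document has to be held and ordered first."""
    return sorted(docs, key=lambda d: d["createdAt"], reverse=True)[:k]

# Synthetic documents with distinct createdAt values in a scrambled order.
docs = [{"_id": i, "createdAt": (i * 7919) % 10007} for i in range(1000)]
heap_result = top_k_newest(docs, 100)
sort_result = full_sort_newest(docs, 100)
print(len(heap_result), heap_result[0]["createdAt"] >= heap_result[-1]["createdAt"])
```

Both functions return the same 100 documents; the difference is that the heap version never needs more than k entries in memory, which is what keeps a $sort + $limit pair away from the 100MB limit.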

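Step 5: place $project deliberately. The optimizer's dependency analysis already drops fields that later stages do not use, and MongoDB's own guidance is that an early $project rarely improves performance and that $project usually belongs at the end of the pipeline. Add an early $project (or $unset) only when explain shows large unused fields flowing into a blocking stage such as $sort or $group.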

Step 6：把 hot analytical pipeline 寫成 materialized view。

db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2026-05-01") } } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $merge: {
      into: "monthly_customer_summary",
      on: "_id",
      whenMatched: "merge",
      whenNotMatched: "insert"
  }}
])

定時更新（cron / 5 分鐘一次）、application 讀 materialized view 而不是即時跑 aggregation。

Step 7：sharded cluster 處理。避免在 hot path 用 cross-shard $lookup / $group、或把這類 query 路由到 analytical replica（用 tag set + read preference）、見 replica set read preference。

驗證點：

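executionTimeMillis within the expected budget
totalDocsExamined / nReturned ratio close to 1 (efficient filtering)
No usedDisk: true
No blocking stage approaching the 100MB limit (staying under about 50MB of in-memory data is a rule-of-thumb margin)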

Rollback boundary：pipeline 改寫是 application code 變更、可以灰度；materialized view（$merge）需備份 target collection 才能還原。

典型 tuning 過程（200ms → 8s → 250ms）

一個常見的 production pipeline 演化路徑：

上線時 200ms：collection 100K doc、$match 過濾 95%、$lookup 只跑 5K 次、in-memory $sort 處理 5K row 在 100MB 內
半年後 8s：collection 長到 2M doc、$match 仍過濾 95% 但變 100K row、$lookup 跑 100K 次（5K → 100K 是 20x）、$sort 在 in-memory 撞 100MB 開始 disk spill、IO 100x 退化
加 compound index 沒用：index 是給 $match 用的、但 $match 之後的 stage（$lookup / $sort）走的是 in-memory pipeline、index 救不了
修法到 250ms：(a) $sort + $limit 配對讓 optimizer 走 top-K、避免 full sort (b) 改 schema embed 把 $lookup 拿掉（見 schema design pattern）(c) hot pipeline 寫成 $merge materialized view、application 讀 view 不跑 aggregation
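The arithmetic in the evolution path above can be replayed directly. A rough sketch: the 5% kept fraction and the document counts come from the text, while the 2 KB average document size is an assumption added here so that the 100MB sort limit can be compared against something concrete.

```python
SORT_LIMIT_BYTES = 100 * 1024 * 1024  # the default per-stage memory limit

def pipeline_cost(total_docs, kept_fraction, avg_doc_bytes):
    """Estimate work after $match: one $lookup per surviving row, and the bytes $sort must hold."""
    rows = round(total_docs * kept_fraction)
    sort_bytes = rows * avg_doc_bytes
    return {
        "rows_after_match": rows,
        "lookups": rows,
        "sort_bytes": sort_bytes,
        "spills": sort_bytes > SORT_LIMIT_BYTES,
    }

launch = pipeline_cost(100_000, 0.05, avg_doc_bytes=2048)      # assumed 2 KB docs
later = pipeline_cost(2_000_000, 0.05, avg_doc_bytes=2048)
print(launch)
print(later)
print("lookup growth:", later["lookups"] // launch["lookups"], "x")
```

Under these assumptions the launch pipeline sorts about 10MB and stays in memory, while the later one needs about 200MB and spills; the $match selectivity never changed, only the absolute row count did.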

關鍵教訓：aggregation 慢的原因不在 query 本身、在 資料形狀演進。Index 是 hot path 的第一個槓桿、但只對 $match / $sort 第一 stage 有效；後續 stage 要靠 stage 順序、materialized view、schema denormalize 來救。

失敗模式

$lookup 在 hot path：list page 每行去另一 collection 查、p99 隨 page size 線性增。應在 schema design 階段 denormalize、把 read-together 資料 embed 回 aggregate root（見 schema design pattern）。

$sort 不帶 limit + 沒 index：全表 in-memory sort、撞 100MB 限制 → OOM 或 disk spill。allowDiskUse: true 解 OOM 但 IO 100x 退化。修法是建對應 index 走 IXSCAN sort、或限 limit 走 top-K。

Sharded cluster cross-shard aggregation：$group 階段所有 partial result 跑到 mongos 合併、mongos memory + CPU 爆。修法是 group key 包含 shard key prefix（讓 group 在 shard 內完成）、或路由到 analytical replica 跑。

Stage 順序錯：$lookup 放在 $match 前、等於對全表都做 lookup 再過濾、每個 input doc 都觸發 lookup。$match 永遠該排第一。

Aggregation 把 working set 擠走：OLTP 的 hot page 被 aggregation 的 cold scan 擠出 cache、整體 query latency 一起退化。修法是 analytical workload 跟 OLTP read 隔離（read preference tag）、或搬走 analytical（見下面 anti-recommendation）。

$facet 滿載：四個 facet 各跑大 pipeline、共享 100MB 限制立刻爆。修法是拆成獨立 query、不要硬塞 facet。

Anti-recommendation：

報表 / BI / analytics workload 跑 MongoDB primary 是反模式：應該 (a) 設定 analytical secondary + read preference tag (b) 用 $merge 寫到 reporting collection (c) 進階用 BI Connector / data lake / 把 analytical workload 整批搬到 ClickHouse / BigQuery
「report dashboard 跑爆 primary」典型 anti-pattern：BI 工具直連 MongoDB primary 跑長 pipeline、cache eviction 把 OLTP working set 擠走、p99 latency 在報表時段集體升。沒拿到具體 incident 數字、不在本文編造、改寫成「常見 anti-pattern」並推到治理路徑
Aggregation 不能解 read scaling：aggregation 是 OLTP 的補位、不是 read scaling 的主路。Read scaling 在大規模 OLTP 走 cache + freshness token（見 connection management and cache layer）、不是把 aggregation 跑爆 secondary

容量與觀測

關鍵 metric：

Aggregation operation time 分布
Disk spill 次數
opcounters.command 中 aggregate 比例
Cache eviction rate 在 aggregation 高峰時的變化

Mongo command：

db.currentOp({ "command.aggregate": { $exists: true } })：當前 aggregation 在跑
db.serverStatus().metrics.aggStageCounters：stage 級別 counter
explain("executionStats")：單 query 詳細分析

Profiler：db.setProfilingLevel(1, {slowms: 200})、看 usedDisk flag 跟 numYield。

回到 4.20 observability evidence：aggregation slow log + cache hit ratio + disk spill rate 是「analytical 壓力」的 evidence 三件套。

回到 9.5 bottleneck localization：用 explain executionStats 把 pipeline stage 對到瓶頸（IXSCAN 還是 COLLSCAN、in-memory 還是 disk spill、shard-local 還是 mongos merge）。

邊界與整合

Sibling deep articles：

schema design pattern — embedded 設計可消除大部分 $lookup
shard key selection — 決定 aggregation 是 shard-local 還是 cross-shard
replica set read preference — aggregation 跑 secondary 的 stale read trade-off
connection management and cache layer — report dashboard 跑爆 primary 時的 cache + read scaling 主路

Migration playbook：analytical workload 大到不能繼續混在 MongoDB → split 出 → Cosmos DB MongoDB API + Synapse 或 → DynamoDB + Athena/Glue（access pattern 重設計）。

跟 1.x 互引：1.10 KV / Document DB 容量規劃把 aggregation 列為 read-shape 的成本維度；1.1 高併發資料存取處理「OLTP + analytical 同 cluster」的反模式。

MongoDB Change Streams + Kafka 整合：resume token、scope 選擇與 connector 治理

Wed, 27 May 2026 00:00:00 +0000

MongoDB change streams 是 3.6+ 原生 CDC 介面、本質上是 oplog tail 包裝成 cursor API。Application 從 dual-write 模式（自己寫 MongoDB 又寫 Elasticsearch / Redis / data warehouse）換成 change stream → Kafka → downstream sink 後、有了第一版 CDC pipeline、但連續工作幾週後出現「downstream 漏 event」或「duplicate event」；最痛的是 connector restart 後 resume token 過期（oplog 已滾掉）、整個 collection 必須重灌。本文把 change stream 機制、Kafka Connector 配置、resume token 治理、sharded cluster scope 選擇講清楚。

本文不重複 MongoDB vendor overview 已寫過的 change streams 簡介 — 而是 production CDC pipeline 部署 + 失敗修復的實作層教學。

MongoDB 適用度前置判讀：進到 CDC pipeline 設計前先確認 workload 在 MongoDB 適用區（document shape 主導 / contract layer 該放哪 / 跨雲 hedging 是否需要）— 詳見 schema-design-pattern 開頭 3 軸前置判讀、本篇不重複展開。Change streams 是 已選 MongoDB 後 的 event-driven 整合議題。

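Problem context: what goes wrong after the first CDC pipeline has run for a few weeks

Typical trigger: after writing to MongoDB the application also dual-writes to Elasticsearch / Redis / a data warehouse, the application code accumulates more and more hooks, and the compensation logic for failed writes is scattered everywhere. After switching to change stream → Kafka → downstream sink the team has a first-version CDC pipeline, but after a few weeks of continuous operation the following show up:

Downstream misses events or receives duplicate events
After a connector restart the resume token has expired (the oplog has rolled over) and the whole collection must be reloaded
On a sharded cluster, collection-level and cluster-wide change streams behave differently, and an application connected to mongos sees different events from one connected to a single shard

Reader symptoms:

The MongoDB Kafka Connector logs ChangeStreamHistoryLost or ResumeTokenChanged
The downstream Kafka topic event count does not match the source collection write count
Replication oplog lag and change stream consumer lag rise together

Case anchor: the concrete incident details for a full reload caused by an expired CDC resume token are left for a future case. This article works from common failure patterns plus the capacity formula and does not invent incident numbers. The Spotify Kafka → PubSub migration is cited only as a side reference (for pipeline-level migration experience).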

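Core mechanism

Change streams are MongoDB's native CDC since 3.6, in essence an oplog tail wrapped in a cursor API. A stream can be opened at three scopes: collection, database, or cluster.

Collection-level: watches changes to a single collection
Database-level: watches every collection in a database
Cluster-wide: watches every database in the cluster

The oplog is a capped collection. With WiredTiger its default size is 5% of free disk space, bounded below by 990 MB and above by 50 GB. A resume token encodes the position of an oplog entry (cluster timestamp + collection UUID + documentKey) and should be treated as opaque. The token must point to an entry that is still in the oplog; once the oplog has rolled past it, the position can no longer be found and the stream fails with ChangeStreamHistoryLost.

Two sides of resume token usage:

_id: every event carries its resume token in _id; the application stores it itself
startAfter / resumeAfter parameter: pass the stored token when reopening the cursor

fullDocument: "updateLookup": by default an update event carries only the delta. With this option the server performs an extra lookup per update event to return the current full document. Under high-frequency updates the cost is significant: the extra read lands on the node serving the stream, so on the primary the per-update load roughly doubles.

Pre-image / post-image (6.0+): returns the document state before (and after) the update; requires the collection-level option changeStreamPreAndPostImages: { enabled: true }.

Cluster-wide vs collection-level change stream:

Cluster-wide must go through mongos, and event ordering is global
Collection-level can be opened directly against a single shard, in which case it sees only that shard's events and ordering holds only within that shard
On a sharded cluster a cluster-wide stream easily turns mongos into a single bottleneck (events from every shard converge on mongos)

MongoDB Kafka Connector (Confluent / official MongoDB):

Source connector: change stream → Kafka topic
Sink connector: Kafka topic → MongoDB
At-least-once semantics; the application must handle idempotency

Related knowledge cards: change-data-capture, replication-channel, replication-slot (MongoDB has no slots; listed for conceptual comparison).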

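Operational procedure

Step 1: scope decision tree.

Scope	When to use	Cost
Collection-level	One collection feeding a downstream sink; ordering needed within that collection only	Multiple collections need multiple connectors
Database-level	Several collections share a sink; ordering needed across collections	Filter cost sits on the connector side
Cluster-wide	Unified audit / replay for the whole cluster	Risk of mongos becoming a single bottleneck; high event volume

Step 2: oplog sizing. Capacity formula:

oplog size >= peak write rate × average oplog entry size × max acceptable consumer downtime

A typical target is a 24-72 hour recoverable window. Example: at a peak of 5K writes/s with a tolerance of 48 hours of connector downtime, the oplog must hold at least 5K × 86400 × 2 = 864M entries; the size in GB depends on the actual average entry size. On Atlas the oplog size can be adjusted directly; on a self-managed cluster use replSetResizeOplog.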
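
The sizing arithmetic above can be sketched as a small helper. This is a minimal illustration rather than a sizing tool: the function names are hypothetical, and the 512-byte average oplog entry size is an assumed placeholder that has to be replaced with a value measured on the actual cluster.

```javascript
// Sketch: minimum oplog size for a target recovery window.
// avgEntryBytes is an assumption; measure it on your own cluster.
function requiredOplogBytes(peakWritesPerSec, avgEntryBytes, downtimeHours) {
  if (peakWritesPerSec < 0 || avgEntryBytes < 0 || downtimeHours < 0) {
    throw new RangeError("inputs must be non-negative");
  }
  const seconds = downtimeHours * 3600;
  return peakWritesPerSec * avgEntryBytes * seconds;
}

function toGiB(bytes) {
  return bytes / 1024 ** 3;
}

// Example from the text: 5K writes/s and 48 h of tolerated downtime,
// combined with the hypothetical 512-byte average entry.
const exampleBytes = requiredOplogBytes(5000, 512, 48);
console.log(`${toGiB(exampleBytes).toFixed(1)} GiB`);
```

With these assumed inputs the result is in the hundreds of GiB, far above the 50 GB default ceiling, which is why the oplog has to be resized explicitly instead of relying on the default.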

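Step 3: Kafka Connector configuration.

{
  "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
  "connection.uri": "mongodb://...",
  "database": "shop",
  "collection": "orders",
  "publish.full.document.only": "true",
  "change.stream.full.document": "updateLookup",
  "copy.existing": "true",
  "copy.existing.namespace.regex": "shop\\.orders",
  "errors.tolerance": "none",
  "offset.flush.interval.ms": "10000"
}

Key fields:

change.stream.full.document: "updateLookup": an extra lookup per update to fetch the full document (be aware of the cost)
copy.existing: "true": on startup the connector first copies the existing collection in full, then switches to the change stream; suited to an initial deployment. Newer connector releases express this as startup.mode=copy_existing, so check the version you deploy
errors.tolerance: "none": the connector task fails on the first error instead of silently dropping the record. Routing bad records to a dead-letter queue requires errors.tolerance: "all" plus a configured dead-letter topic

Step 4: resume token persistence. The source connector stores the token as its source offset in Kafka Connect offset storage (the offset.storage.topic in distributed mode), not in __consumer_offsets; an application that manages its own change stream must write the token to a durable store (not in-memory).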
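
For the self-managed case in Step 4, the loop below is one possible shape, shown as a sketch under assumptions. `consumeWithResume`, `MemoryTokenStore`, and the `load()` / `save()` store interface are hypothetical names introduced here. `stream` is any async iterable of change events; with a recent Node.js driver it would typically be the cursor returned by `collection.watch(pipeline, { resumeAfter: savedToken })`.

```javascript
// Sketch: handle each event, then persist its resume token.
// Persisting after the side effect gives at-least-once delivery:
// a crash between the two steps replays the last event on restart.
async function consumeWithResume(stream, tokenStore, handleEvent) {
  let processed = 0;
  for await (const event of stream) {
    await handleEvent(event);          // apply the side effect first
    await tokenStore.save(event._id);  // event._id is the resume token
    processed += 1;
  }
  return processed;
}

// In-memory stand-in that only demonstrates the call shape. A production
// store must survive restarts (a database row or a file, not process memory).
class MemoryTokenStore {
  constructor() {
    this.token = null;
  }
  async load() {
    return this.token;
  }
  async save(token) {
    this.token = token;
  }
}
```

Saving the token only after `handleEvent` succeeds is a deliberate choice: it trades possible duplicates for no lost events, which is why the idempotency step later in this procedure is required.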

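Step 5: filter pipeline. Change streams accept an aggregation pipeline that pushes filtering down into MongoDB:

const pipeline = [
  { $match: { operationType: { $in: ["insert", "update", "delete"] } } },
  // delete events carry no fullDocument, so pass them through explicitly
  { $match: { $or: [{ operationType: "delete" }, { "fullDocument.region": "ap-tokyo" }] } }
]
// updateLookup populates fullDocument on update events so the region filter can see it
const changeStream = db.orders.watch(pipeline, { fullDocument: "updateLookup" })

Pushing the filter down reduces the connector's processing load, especially on high-frequency collections.

Step 6: downstream idempotency. When the sink consumes a Kafka event, use documentKey._id + clusterTime as the dedup key: at-least-once semantics mean that events from the few minutes around a connector restart will be redelivered.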
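
The dedup key from Step 6 can be sketched as follows. This is an illustration under assumptions: `dedupKey` and `applyOnce` are hypothetical helpers, `clusterTime` is assumed to reach the sink serialized as `{ t, i }` (seconds and increment), and the `Set` stands in for a durable store such as Redis or a database table.

```javascript
// Sketch: derive a dedup key from documentKey._id + clusterTime.
// Assumes clusterTime has been serialized to { t, i } before the sink reads it.
function dedupKey(event) {
  const id = JSON.stringify(event.documentKey._id);
  const { t, i } = event.clusterTime;
  return `${id}:${t}:${i}`;
}

// Apply the side effect only if this key has not been seen before.
// `seen` is an in-memory Set here; production needs a durable set.
function applyOnce(event, seen, apply) {
  const key = dedupKey(event);
  if (seen.has(key)) {
    return false; // duplicate delivery: skip the side effect
  }
  apply(event);
  seen.add(key);
  return true;
}
```

Recording the key after `apply` keeps the behavior at-least-once if the process dies in between; making the side effect itself idempotent, or writing the key and the effect in one transaction, closes that gap.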

Verification points:

Source collection write count vs Kafka topic event count differ by < 0.1%
Resume token age < 50% of oplog retention (healthy state)
A connector restart drill reconnects within 5 minutes
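
The two numeric verification points can be written as a check that runs from a monitoring job. The sketch below assumes the four inputs already come from your own metrics; the function and field names are hypothetical.

```javascript
// Sketch: relative difference between source writes and topic events.
function countDiffRatio(sourceWrites, topicEvents) {
  if (sourceWrites === 0) {
    return topicEvents === 0 ? 0 : Infinity;
  }
  return Math.abs(sourceWrites - topicEvents) / sourceWrites;
}

// tokenAgeSec: now minus the cluster time of the last persisted resume token.
// oplogWindowSec: the time span currently covered by the oplog.
function tokenAgeRatio(tokenAgeSec, oplogWindowSec) {
  if (oplogWindowSec <= 0) {
    throw new RangeError("oplog window must be positive");
  }
  return tokenAgeSec / oplogWindowSec;
}

// Healthy when the count diff is under 0.1% and the token age is under 50%
// of the oplog window, matching the thresholds listed above.
function pipelineHealthy({ sourceWrites, topicEvents, tokenAgeSec, oplogWindowSec }) {
  return (
    countDiffRatio(sourceWrites, topicEvents) < 0.001 &&
    tokenAgeRatio(tokenAgeSec, oplogWindowSec) < 0.5
  );
}
```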

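Rollback boundary: the source connector is read-only and cannot damage MongoDB; the sink connector's target must be backed up before it can be restored; if a wrong resume token was persisted, fall back to startAtOperationTime and replay from that point in time.

Failure modes

Resume token expired (oplog rolled over): the connector was down too long, the oplog has moved past its retention, and the stream fails with ChangeStreamHistoryLost. A full reload via copy.existing is then required, and downstream sees no new data in the meantime. Prevention: size the oplog with a buffer, alarm on connector lag, and monitor token age (warn when age > 50% of oplog retention).

updateLookup overloads the primary under high-frequency updates: every update event triggers an extra lookup, roughly doubling the load. Fix: switch to collection-level pre/post images (6.0+) so MongoDB records them at write time, or have the application assemble the full document before writing to Kafka, and drop updateLookup.

Cluster-wide stream overloads mongos on a sharded cluster: events from every shard converge on mongos, which becomes a single bottleneck. Fix: switch to collection-level streams with multiple connectors in parallel; each connector still connects through mongos but subscribes to a single collection.

At-least-once turns into a duplicate flood: events from the few minutes after the restart point are redelivered, and a downstream without idempotency repeats side effects (duplicate emails, duplicate charges). Fix: enforce idempotency at the sink (dedup key written to Redis / a DB); never assume "I run at-least-once but duplicates will not really happen".

Schema drift suddenly breaks the sink: MongoDB writes a new field or changes a type, the sink connector's JSON schema rejects it, and the batch is stuck in the dead-letter queue. Fix: gate schema changes with validation (see schema design pattern), run the sink schema in lenient mode so it accepts unknown fields, or add a schema registry to unify versions.

Change stream anomalies during backup / DDL: reIndex / compact / dropCollection trigger special events that the connector does not handle, and the consumer stops. For example, dropping a watched collection produces a drop event followed by an invalidate event that closes a collection-level cursor. Fix: make the connector's handling of special events explicit; for an unrecognized operation type it should at least log a warning rather than stall silently.

Anti-recommendation:

A simple outbox pattern + application transactional write is simpler than change stream + Kafka for low-throughput / single-sink scenarios; not every "needs event notification" scenario needs a CDC pipeline
If the downstream is just an Elasticsearch index owned by the same team in the same region, $merge into an intermediate collection, or application dual-write plus reconciliation, may cost less
Resume token expiry is the most painful incident on this path, and oplog sizing is an investment rather than a cost: do not shrink the oplog to save storage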

Capacity and observability

Key metrics:

Oplog health: oplog write rate and retention window
Change stream health: cursor age, distance of the resume token from the head and tail of the oplog
Connector health: connector lag (Kafka offset vs source writes)
Downstream health: event count diff (source write count vs sink apply count), distribution of event time → arrival time lag

Mongo commands:

db.getReplicationInfo(): oplog size / time range
db.printReplicationInfo(): oplog summary
db.currentOp({ "op": "getmore", "ns": "local.oplog.rs" }): inspect change stream consumer connections

Connector metrics (Kafka Connect JMX): source-record-poll-rate, source-record-write-rate, offset-commit-success-percentage.

Back to 4.20 observability evidence: oplog retention + connector lag + dedup rate are the three pieces of evidence for CDC pipeline health.

Back to 9.5 bottleneck localization: when CDC lag rises, distinguish (a) the source oplog being written too fast, (b) the connector processing slowly, (c) the downstream sink being slow.
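
One way to turn the (a) / (b) / (c) split into a first-pass triage rule is to compare per-stage throughput over the same time window. The sketch below is a heuristic under assumptions, not an established method: all four inputs (events per second) and the `connectorCapacity` baseline are hypothetical metrics you would have to collect yourself, and real incidents often need more signals than rates alone.

```javascript
// Sketch: attribute rising CDC lag to one stage by comparing rates.
// All rates are events/s over the same window; names are illustrative.
function localizeLag({ oplogWriteRate, connectorPublishRate, sinkApplyRate, connectorCapacity }) {
  // (c) the sink applies slower than Kafka receives: topic backlog grows
  if (connectorPublishRate > sinkApplyRate) {
    return "sink";
  }
  // the connector is falling behind the source
  if (oplogWriteRate > connectorPublishRate) {
    // (a) writes exceed what the connector could ever sustain,
    // otherwise (b) the connector is running below its own baseline
    return oplogWriteRate > connectorCapacity ? "source" : "connector";
  }
  return "none";
}
```

The sink check comes first because a slow sink can hide behind a healthy-looking connector; when both stages lag, this sketch reports only the sink.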

Boundaries and integration

Sibling deep articles:

shard key selection: choosing between cluster-wide and collection-level change streams on a sharded cluster
replica set read preference: the effect of change streams on primary load, and whether they can run on a secondary
schema design pattern: what a schema validator means as a contract for downstream sinks
connection management and cache layer: the role of the CDC sink in a production cross-layer architecture (cache invalidation / federated DB sync)

Migration playbook:

Bulk migration from MongoDB to another sink goes through → Atlas Migration Service
When migrating off MongoDB, change streams are the catch-up mechanism (bulk export first, then change streams for the increment)

Cross-references to 1.x: 1.7 schema migration rollout evidence covers CDC pipeline reconciliation during schema drift; 1.9 reconciliation data repair covers the reconciliation flow after CDC goes out of sync.

Migrating from MongoDB / Cassandra into Cosmos DB: protocol-compat API drop-in vs native API paradigm shift, compatibility boundaries, and dual-write cutover

Tue, 02 Jun 2026 00:00:00 +0000

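This article is the migration playbook for the Cosmos DB overview and follows the Migration Playbook writing methodology. The core decision when migrating from MongoDB or Cassandra into Cosmos DB is which path to take: use Cosmos's protocol-compat APIs (MongoDB API / Cassandra API) as a wire-protocol drop-in, leaving drivers and queries largely untouched; or switch to the native SQL API and rewrite the application for the Cosmos native paradigm. The two paths differ in diff dimensions, risk, and irreversibility, which makes this a multi-element migration plan. The article first lays out drivers and no-go conditions, then runs a 6-dimension diff audit that separates the two paths, then covers each path's phase plan, evidence, and cutover.

The API choice itself (the four-layer framing of MongoDB API vs SQL API, dogfood signal, multi-model, cross-cloud hedging) is owned by mongodb-api-vs-sql-api and is not repeated here; this article owns the migration process: how to move data and traffic over safely once the path is chosen.

Case anchor: 9.C30 Microsoft 365 (MongoDB → Cosmos DB MongoDB API, planet-scale, dogfood), 9.C37 Forbes (self-managed → Atlas, 6 months, a schedule reference for a same-DB change of hosting), 9.C36 Coinbase (kept MongoDB and added peripheral components, a reference for "migration is not mandatory"). The Microsoft 365 case acknowledges that it discloses no throughput / latency / cost numbers, so this article does not use it as a benchmark and takes only the migration-path frame from it.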

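Driver: why migrate, and when not to

A valid migration driver is not "Cosmos DB is better" but a concrete pressure: the team is already tied to the Azure ecosystem, needs turnkey global distribution, wants to offload the ops burden of a self-managed MongoDB / Cassandra cluster, or needs multi-model to govern several NoSQL stores in one place. Microsoft 365's driver was planet-scale global distribution + Azure dogfooding, not query performance.

No-go conditions (cases where you should not migrate into Cosmos DB):

Cross-cloud is a core requirement: Cosmos DB runs only on Azure. When cross-cloud flexibility outweighs Azure integration, keep MongoDB on Atlas (the Forbes path, across AWS / GCP / Azure) and keep Cassandra self-managed or on ScyllaDB.
You need the latest native MongoDB / Cassandra features: the server version behind Cosmos DB's protocol-compat APIs lags the native products, and some features behave differently.
Future cloud-provider strategy is undecided: hedging value outweighs current integration; see the exit cost under vendor lock-in.
The existing cluster is good enough with peripheral additions: Coinbase kept MongoDB and added a proxy / cache / predictive scaling, and did not migrate out. Migration is expensive, so first confirm that peripheral additions cannot solve the problem.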

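Diff audit: 6 dimensions that separate the two paths

The differences between source (MongoDB / Cassandra) and target (Cosmos DB) are inventoried along 6 dimensions. The two paths score differently on each dimension, and that is also the basis for determining the migration type.

Dimension	protocol-compat API (MongoDB / Cassandra API)	native SQL API
Schema	Low: document / table shape largely preserved	Medium: remodel into Cosmos native documents
Operational	High: self-managed cluster → managed RU/s + region	High: same as left
Paradigm	Low: still document / wide-column semantics	High: new query model, index policy, RU mindset
Components	Medium: driver kept, some aggregation / CQL must change	High: driver, query layer, ORM all replaced
Application	Medium: connection string, auth, consistency mapping	High: entire data access layer rewritten
Data topology	High: replica set / ring → partition + multi-region	High: same as left

The dominant difference determines the type:

protocol-compat path: the largest differences are operational and data topology while paradigm stays Low; it is a wire-compat drop-in with compatibility gaps. This maps to Type B drop-in (partial): the driver stays, but every query pattern must be verified for compatibility; it is not a blind switch.
native API path: paradigm High + application High makes it a Type E paradigm shift: beyond moving data, the application's entire data access layer must be rewritten.

Rule of thumb: protocol-compat "swaps the underlying storage and operations while keeping the query interface"; native API "swaps the query paradigm as well". Most migrations first take the protocol-compat path to move data and ops over; native API is a second migration considered later, only if the full Cosmos feature set (Change Feed, native stored procedure support, SQL API queries) is needed. Going straight to native API carries significantly higher engineering complexity and risk.

Cassandra-specific differences

Cassandra → Cosmos DB Cassandra API differs from the MongoDB path in one key way: Cassandra data modeling is query-driven (partition key + clustering key map to access patterns). That mindset partly aligns with Cosmos DB's logical partition concept, but Cosmos DB's per-partition RU ceiling (currently about 10,000 RU/s; a vendor spec, cross-verify the current value in the Azure docs at implementation time) and RU billing turn the "wide partition + many clustering rows" design that worked on Cassandra into a hot partition risk. CQL consistency levels (QUORUM / LOCAL_ONE, etc.) must be mapped to Cosmos DB's 5 consistency levels; the semantics are not one-to-one, see consistency-levels-engineering. Support for Cassandra secondary indexes / materialized views in the Cassandra API must be verified item by item (time-sensitive; check the docs).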

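Phase plan

The two paths share the same overall structure. The protocol-compat compatibility audit is lighter; the native API path adds an application rewrite.

protocol-compat path (Type B drop-in)

Phase 0: compatibility audit. Pull the production queries / aggregation pipelines (MongoDB) or CQL statements (Cassandra), compare each one against the feature support list of the corresponding Cosmos DB API, and list what is unsupported or behaves differently.
Phase 1: partition key design. Translate the MongoDB shard key / Cassandra partition key into a Cosmos logical partition key and check it against the 10,000 RU/s ceiling and hot partition risk; see partition-key-design.
Phase 2: bulk export-import. Load the initial data with Data Migration Tool / mongodump / sstable export.
Phase 3: CDC sync. Stream ongoing changes from the source (MongoDB oplog / Cassandra CDC) into Cosmos DB to close the increment accumulated after the initial load.
Phase 4: shadow read. Run production queries on both sides, compare result checksums, and measure the RU baseline on the Cosmos side; see ru-cost-model-sizing.
Phase 5: read cutover. Reads move to Cosmos; writes stay on the source (reversible).
Phase 6: write cutover. Writes move to Cosmos.
Phase 7: cleanup. Retire the source cluster; keep the export and the final checksum.

Extra work on the native API path (Type E paradigm shift)

The native API path inserts an application rewrite stream between Phase 0 and Phase 1, running in parallel with the data migration stream:

Remodel documents (design the Cosmos native shape from the MongoDB documents / Cassandra tables; decide embed vs reference)
Rewrite the data access layer (replace the MongoDB driver / CQL with the Cosmos SQL API SDK and rewrite every query)
Rewrite aggregations (the Cosmos SQL API has no cross-document JOIN, only a self-join within a single item, and its aggregation model differs; some logic moves into the application or is materialized with stored procedures / Change Feed)

This application stream is the main source of risk and schedule on the native API path. It needs its own owner, running in parallel with the data migration stream, and the shadow read phase must compare the results of the rewritten queries against the original queries, not just data consistency.

Schedule reality

Forbes took 6 months for a same-DB change of hosting (self-managed → Atlas, paradigm unchanged), with a mid-sized team running several squads in parallel. A protocol-compat migration into Cosmos DB is more complex than the Forbes type (it adds the RU / partition / region paradigm and the compatibility gaps), and the native API path is another order of magnitude higher (it adds the application rewrite). Using Forbes's 6 months as the baseline for a native API path over-commits from day one.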

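Evidence

Each phase advances on data, not on gut feeling:

Phase 0: the unsupported feature list is exhaustive, and every item has a strategy (rewrite / move to the application layer / accept degradation)
Phase 2-3: row / document counts match, and CDC replication lag has converged to a stable value
Phase 4: query result checksums match (protocol-compat: against the original query results; native API: rewritten query vs original query results), the RU baseline is measured, and aggregation results match item by item
Phase 5-6: error rate, p99 latency, and RU consumption stay within the expected range after cutover
Maps to the dual-write verification in schema-migration-rollout-evidence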
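
The Phase 4 checksum comparison can be sketched as an order-insensitive hash over a result set. This is one possible construction under assumptions, not the method any of the cited cases used: `canonical` and `resultChecksum` are hypothetical helpers, rows are assumed to be plain JSON-compatible objects, and driver-specific types (ObjectId, dates, decimals) would have to be normalized to a common representation on both sides first.

```javascript
const crypto = require("node:crypto");

// Canonical JSON with sorted object keys, so field order does not affect the hash.
function canonical(value) {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const body = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(value[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Hash each row, sort the row hashes, then hash the sorted list.
// Row order is ignored on purpose; field values are not.
function resultChecksum(rows) {
  const rowHashes = rows
    .map((row) => crypto.createHash("sha256").update(canonical(row)).digest("hex"))
    .sort();
  return crypto.createHash("sha256").update(rowHashes.join("\n")).digest("hex");
}
```

Ignoring row order suits queries without a guaranteed sort. For queries where ordering is part of the contract, the sort step should be dropped so that a reordering counts as a mismatch.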

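Cutover

read cutover window: move reads first and leave writes on the source; proceed to write cutover only after read error rate and latency on the Cosmos side meet their targets
write cutover window: read-only freeze < 10 minutes, switch writes, final checksums match
Rollback condition: query error rate above threshold (for example > 1%), RU consumption significantly above the estimate (protocol-compat translation-layer overhead higher than expected), or a result mismatch. If any one holds, roll back to the source; see rollback condition
decision owner: decide in advance who has authority to roll back during cutover; a failed database traffic switch is expensive and should not depend on an on-the-spot judgment
Point of no return: the API kind is set at the account level, chosen at account creation, and cannot be switched afterwards, so protocol-compat and native API are two different accounts. Moving from protocol-compat to native API later means export → new account → import + application rewrite, a second full migration rather than an in-place upgrade. Because of this irreversibility the direction must be decided in Phase 0; it cannot be reversed after cutover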
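
Writing the rollback conditions down as an explicit predicate before cutover makes the decision less dependent on judgment in the moment. The sketch below is illustrative: the 1% error threshold comes from the text, while the 1.5× RU overrun factor is an assumed placeholder for "significantly above the estimate", and all field names are hypothetical.

```javascript
// Sketch: list which rollback conditions currently hold.
// maxRuOverrun = 1.5 is an assumption; agree on the real factor before cutover.
function rollbackReasons(observed, limits = { maxErrorRate: 0.01, maxRuOverrun: 1.5 }) {
  const reasons = [];
  if (observed.errorRate > limits.maxErrorRate) {
    reasons.push("error-rate");
  }
  if (observed.ruConsumed > observed.ruEstimated * limits.maxRuOverrun) {
    reasons.push("ru-overrun");
  }
  if (observed.resultMismatches > 0) {
    reasons.push("result-mismatch");
  }
  return reasons;
}

// Any single condition is enough to roll back to the source.
function shouldRollback(observed, limits) {
  return rollbackReasons(observed, limits).length > 0;
}
```

Returning the list of reasons rather than a bare boolean gives the decision owner something concrete to record in the incident write-back.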

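Cleanup

Confirm the final checksum before retiring the source cluster, and keep the export dump for 90 days as a rollback fallback
Remove the dual-write writer, CDC connector, and shadow read harness
Keep the RU baseline and partition distribution observations on the production dashboard; see ru-cost-model-sizing
incident write-back: write the compatibility gaps and translation-layer cost surprises back into the runbook for future migrations of the same kind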

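Failure modes

Assuming wire-compat = 100% identical behavior

The protocol-compat APIs are compatible for certain query patterns, not universally. Some MongoDB aggregation stages ($graphLookup / $facet, etc.) and some Cassandra CQL features behave differently or are unsupported in the corresponding API; sample data in a dev environment does not reveal this, and it only breaks in production. Fix: in Phase 0 pull every production query and verify it one by one, and in Phase 4 compare checksums in shadow read; never assume compatibility.

Copying the shard key / partition key as-is

Using the MongoDB shard key or Cassandra partition key directly as the Cosmos logical partition key ignores the 10,000 RU/s per-partition ceiling. A partition that was merely wide on Cassandra becomes a hot partition on Cosmos and gets throttled. Fix: in Phase 1 re-evaluate against the Cosmos partition ceiling and, where necessary, use a synthetic / composite key to force distribution; see partition-key-design and Hot Partition.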
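
A synthetic key of the kind mentioned in the fix can be sketched as a deterministic bucket suffix. This is a minimal illustration under assumptions: `syntheticPartitionKey` and `minBuckets` are hypothetical helpers, the 10,000 RU/s default mirrors the figure quoted above and should be cross-checked against current Azure documentation, and the 0.7 headroom factor is an arbitrary placeholder.

```javascript
const crypto = require("node:crypto");

// Sketch: spread one wide source partition over N logical partitions by
// appending a bucket derived from a stable hash of the row id.
function syntheticPartitionKey(sourceKey, rowId, buckets) {
  const digest = crypto.createHash("sha256").update(String(rowId)).digest();
  const bucket = digest.readUInt32BE(0) % buckets;
  return `${sourceKey}#${bucket}`;
}

// Smallest bucket count that keeps the estimated peak RU/s per logical
// partition under limitRu * headroom. Both defaults are assumptions.
function minBuckets(peakRuPerSourceKey, limitRu = 10000, headroom = 0.7) {
  return Math.max(1, Math.ceil(peakRuPerSourceKey / (limitRu * headroom)));
}
```

The cost of this design is on the read side: fetching everything for one source key now fans out across all of its buckets, so the bucket count should stay as small as the RU ceiling allows.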

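Underestimating the second migration to native API as an "upgrade"

After going live on protocol-compat, the team wants native capabilities such as Change Feed / SQL queries and assumes that "upgrading to the SQL API" is a settings change. In reality it is a second complete migration: new account + full data migration + application rewrite. Fix: decide the end-state direction in Phase 0. If the end state definitely requires native features and the team can absorb the rewrite, go straight to the native API path instead of migrating in two stages.

Mapping consistency levels incorrectly

Assuming that CQL QUORUM / MongoDB read concern majority is equivalent to some Cosmos level is a mistake; the semantics are not one-to-one. Fix: follow consistency-levels-engineering and map read-after-write and ordering requirements scenario by scenario, rather than translating consistency names literally.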

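Boundaries and integration

Primary comparison SSoT: mongodb-api-vs-sql-api owns the API choice and the three-type migration path classification; this article owns the migration process after the choice is made
Sibling deep articles: partition-key-design (shard / partition key translation), ru-cost-model-sizing (translation-layer RU overhead and baseline), consistency-levels-engineering (read concern / CQL consistency mapping), change-feed-cdc (native Change Feed is available only on the native API and is one of the feature drivers for the native path)
Not migrating, for contrast: Coinbase kept MongoDB with peripheral additions; confirm that peripheral additions cannot solve the problem before migrating
Cross-cloud, for contrast: Forbes stayed on Atlas across clouds; a cross-cloud requirement is a no-go for Cosmos DB
Common migration model: 1.12 large-scale DB migration in practice
Knowledge card: vendor lock-in / Hot Partition
Back to overview: the "migrating from MongoDB / Cassandra" backlog item in the Cosmos DB vendor overview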
