MongoDB → Atlas:Atlas 不是 MongoDB + managed、是另一個 product
本文是跨 vendor migration playbook、cross-link 到 MongoDB 跟 MongoDB Atlas。本文是 Migration playbook methodology Type C operational redesign hybrid 的標準形態實證。每階段切換用 migration gate 把關 — 4 phase 之間的驗證條件就是 gate。
Atlas 不是 MongoDB + managed、是另一個 product
「MongoDB Atlas 是 MongoDB 的 managed 版本」這個 framing 看似合理、實際誤導:
- Protocol 相容:MongoDB wire protocol 一致、driver 不改、
mongosh連線跟 self-managed 一樣 - Storage 一致:WiredTiger storage engine 一樣、document model 一樣
- API 一致:Aggregation framework、indexing、change stream 都一樣
但 operational surface 完全不同:
| Operational concept | Self-managed MongoDB | Atlas |
|---|---|---|
| Cluster bootstrap | mongod + replica set config + cfgsvr + shard 手動 | UI / API 一鍵建集群、全自動 |
| HA | Replica set 自管 + arbiter + priority | 自動跨 AZ replica + automatic failover |
| Backup | mongodump + S3 archive 自管 | 內建 cloud backup + PITR(按 region 設) |
| Network access | VPC + security group + IP whitelist 自管 | Atlas private endpoint / VPC peering / IP access list |
| Authentication | mongod 內部 user / x.509 自管 | Atlas Database User + 整合 LDAP / SSO / AWS IAM |
| Monitoring | Self-deploy Prometheus + grafana | Atlas Performance Advisor + APM 內建 |
| Sizing | Manual instance class + scale | Auto-tier scaling + tier-based pricing |
| Patching | Manual + outage window | Automatic(可配置 maintenance window) |
Migration 主要工作不在 資料層 — protocol drop-in 已 cover;是 operational stack 全換:SRE runbook、monitoring dashboard、access control、IAM 整合、cost 預估全要重做。「Atlas 是 managed MongoDB」這個 framing 低估了 operational 工作量。
| 維度 | 評估 | 等級 |
|---|---|---|
| Schema / API | MongoDB protocol / API 完全相容 | Low |
| Operational model | HA / backup / monitoring / IAM / network 全換 | High |
| Abstraction / paradigm | 同 document DB | Low |
| Number of components | 同 1 個 cluster | Low |
| Application change | Connection string / IAM 整合改、application logic 不改 | Low/Medium |
主導維度 Operational = High、Schema / Paradigm 都 Low — 對映 Type C operational redesign hybrid。
結構:4-phase operational + drop-in cutover
跟 PostgreSQL → Aurora 結構對齊(同 Type C):
1Phase 0:Pre-migration audit(1-2 週)
2 - Workload sizing(IOPS / connection / storage)
3 - Application connection pattern audit
4 - Compliance requirement audit
5
6Phase 1:Operational infrastructure 準備(2-3 週)
7 - Atlas cluster 建立
8 - VPC peering / private endpoint
9 - IAM role + Atlas Database User
10 - Monitoring + alert
11 - Backup retention 設定
12
13Phase 2:Data migration(取決於 dataset 大小)
14 - mongomirror / Atlas Live Migration tool
15 - 或 mongodump → mongorestore(小 DB)
16
17Phase 3:Cutover 跟 verification
18
19Phase 4:Cleanup(self-managed decommission)整體 4-12 週、依 dataset 大小跟 organization 流程複雜度。
Phase 0:Pre-migration audit
Workload sizing → Atlas tier
1Self-managed observations:
2- Peak IOPS: 8000
3- P99 read latency: 5ms
4- Connection count peak: 1500
5- Storage: 800GB
6- Cross-region replication needed: yes
7
8Atlas tier mapping:
9- M40 (8 vCPU, 16GB RAM): IOPS 3000、不夠
10- M60 (16 vCPU, 64GB RAM): IOPS 6000、邊界
11- M80 (32 vCPU, 128GB RAM): IOPS 9000、安全(選此)
12- Storage: 1TB tier(足夠 800GB + 25% buffer)
13- Cross-region replication add-onAtlas 不是 自由 instance class、是 固定 tier;workload 跨 tier 邊界時要選 上一級 而不是 push 下一級。
Connection pattern audit
1// Application connection pool config
2const client = new MongoClient(uri, {
3 maxPoolSize: 100, // ← Atlas 端 tier-specific connection limit
4 minPoolSize: 10,
5 maxIdleTimeMS: 60000,
6});Atlas tier 對 single user connection 有限制(M40 ~1500、M80 ~3000);多 application instance 跑同帳號連 Atlas 可能撞 limit。預先計算 total connection = pod_count × maxPoolSize、對照 tier limit。
Compliance audit
- Data residency:Atlas 部署 region 是否符合 GDPR / 客戶合約
- Encryption at rest:Atlas 預設 enable、但 encryption key 是 Atlas-managed — 合規嚴格要用 CMK / BYOK
- Audit log:Atlas 提供 audit log、export 到 S3 / Splunk
Phase 1:Operational infrastructure 準備
Atlas cluster 配置
1# 用 Terraform mongodbatlas provider
2resource "mongodbatlas_cluster" "production" {
3 project_id = var.project_id
4 name = "production-cluster"
5 cluster_type = "REPLICASET"
6
7 provider_name = "AWS"
8 provider_region_name = "US_EAST_1"
9 provider_instance_size_name = "M80"
10
11 backup_enabled = true
12 pit_enabled = true # PITR
13 mongo_db_major_version = "7.0"
14
15 advanced_configuration {
16 javascript_enabled = false
17 minimum_enabled_tls_protocol = "TLS1_2"
18 no_table_scan = false
19 oplog_size_mb = 51200
20 }
21}
22
23# Backup retention
24resource "mongodbatlas_cloud_backup_schedule" "production" {
25 project_id = var.project_id
26 cluster_name = mongodbatlas_cluster.production.name
27
28 reference_hour_of_day = 3
29 reference_minute_of_hour = 0
30 restore_window_days = 7
31
32 policy_item_daily {
33 frequency_interval = 1
34 retention_unit = "days"
35 retention_value = 7
36 }
37}VPC peering / private endpoint
1Pattern A: VPC Peering
2 AWS VPC <──peering──> Atlas project VPC
3 - 跨 region 跑、routing table 對齊
4 - 適合中型 / 大型 workload、stable network topology
5
6Pattern B: Private Endpoint (Atlas private link)
7 AWS VPC ──private link──> Atlas
8 - 不需要 routing table 改
9 - 適合 multi-account / multi-region 複雜場景
10 - Cost 略高production default 走 Private Endpoint、設定簡單跟 IAM 整合好。
Atlas Database User 跟 IAM 整合
1Pattern A: 傳統 username / password
2 - 設 Database User、application 用 SCRAM-SHA-256 連
3 - 適合 legacy application
4
5Pattern B: AWS IAM authentication(推薦)
6 - Atlas Database User type: "AWS IAM"
7 - Application 用 AWS IAM role + Atlas SDK
8 - Token 15 分鐘輪換、application 自管 refreshcutover 時間表內加 IAM authentication migration、不要事後補。
Phase 2:Data migration
Atlas Live Migration tool(小到中型)
Atlas UI 內建 Live Migration tool:
- Source cluster URI(self-managed MongoDB)
- Atlas target cluster
- tool 自動 full sync + oplog tailing
- Cutover window 內 final cutover
支援 dataset < 100GB 簡單;100GB-1TB 需要分批 / collection 順序設計。
mongomirror(大型)
1# Mongomirror: source → atlas
2mongomirror \
3 --host source-replicaset/host1:27017,host2:27017 \
4 --destination atlas-cluster-host:27017 \
5 --destinationUsername admin \
6 --destinationPassword $ATLAS_PASSWORD \
7 --sslmongomirror 分兩段:
- Initial sync(full dump + restore)
- Oplog tailing(continuous CDC)
Cutover 期間 application 切 connection string、mongomirror 跟著 stream 收尾。
Phase 3:Cutover + verification
11. Application 端設 maintenance mode(block write)
22. Wait mongomirror catch up(oplog gap → 0)
33. 驗證 Atlas 端 collection count + sample query
44. Application connection string 切到 Atlas
55. 解除 maintenance、monitor 24-48 小時
66. Self-managed mongo read-only standby 1-2 週Production 故障演練
Case 1:Atlas tier connection limit 撞牆
徵兆:cutover 後 application 流量高峰時大量 Connection refused、Atlas 端顯示 connection limit reached;self-managed 階段沒有這問題。
根因:M80 tier connection limit ~3000、application 100 個 pod × maxPoolSize=50 = 5000 connection;超出 limit。
修法:
- Pre-migration 計算:total connection 對照 Atlas tier、超出選上一級 tier
- 降 maxPoolSize:100 pod × 30 = 3000、剛好 cap;但 burst 仍可能撞
- 加 connection proxy:在 application 跟 Atlas 之間放 connection pooler(如 mongos sharded 或 ProxySQL-style proxy)
Case 2:IP whitelist 漏 application VPC、cutover 後完全連不上
徵兆:cutover 後 application 直接報 connection timeout、Atlas dashboard 顯示 zero traffic;troubleshooting 1 小時才發現是 IP access list 漏掉某 application VPC CIDR。
根因:Atlas IP access list 預設 deny all、必須明示加 application VPC;Phase 1 設定漏看某個 VPC(如 multi-account organization 內的 staging account)。
修法:
- Pre-cutover 連線測試:每個 application VPC 跑 sample MongoDB 連線、確認 ping 通
- 改 Private Endpoint:不靠 IP whitelist、用 PrivateLink 自動 routing
- Backup access:保留 bastion host with whitelisted IP、incident 期間能直連
Case 3:Backup retention 設不夠、compliance audit 抓到
徵兆:cutover 3 個月後 SOX audit 發現 backup retention 設 7 天、合規要求 90 天;急忙改 Atlas config 設 90 天、但 過去 3 個月 backup 已不可恢復。
根因:Atlas backup retention 是 向前生效、不能回追加;Phase 1 預設配置漏對合規 review。
修法:
- Pre-Phase 1 跑 compliance review:跟 legal / security team 確認 retention / data residency / audit log
- 預設 retention 設保守值(30 / 60 天)、之後可降不能升
- PITR 跟 backup retention 分開設:PITR window 7-30 天、full backup 90-365 天
Case 4:IAM token 過期、application 端 reconnect storm
徵兆:production 切到 IAM authentication 後、每 15 分鐘出現一波 connection failure;Atlas log 顯示「auth token expired」。
根因:AWS IAM token 15 分鐘輪換、application 用舊 token 重連失敗;token refresh 邏輯沒寫對。
修法:
1// 用 Atlas SDK + AWS SDK 整合、自動 token refresh
2const { MongoClient } = require('mongodb');
3const { fromIni } = require('@aws-sdk/credential-providers');
4
5const credentials = fromIni({ profile: 'production' });
6const client = new MongoClient(uri, {
7 authMechanism: 'MONGODB-AWS',
8 // SDK 自動 refresh token
9});不要自管 token rotation、用 vendor SDK 抽象掉。
Case 5:Billing 暴漲、IOPS 跟 backup storage 超預估
徵兆:第一個月 Atlas 帳單 $15K USD、預估 $8K;Atlas dashboard 顯示 backup storage 跟 IOPS 各超 1.5-2x 預估。
根因:
- Atlas backup 預設 跨 region replicated、storage cost 2x
- IOPS-heavy workload 在 M tier 內可能撞 burst credit、auto-tier-up 暫時觸發更貴 tier
- Data transfer 跨 region / 跨 cloud 計費沒算
修法:
- Pre-migration cost estimate:用 self-managed metrics 估 IOPS / bandwidth、套 Atlas pricing
- Backup region 設單一:若不要跨 region DR、設 same-region backup 省 50%
- Reserved Instance:穩定 workload 預付 1-3 年、省 30-40%
- Performance Advisor 早用:第一週就跑、找 inefficient query 降 IOPS
Capacity / cost
| 維度 | Self-managed MongoDB | Atlas |
|---|---|---|
| Cluster cost (M80) | EC2 r6g.4xlarge × 3 ≈ $1.5K / mo | M80 + storage + backup ≈ $3K / mo |
| Operational FTE | 0.5-1.5 FTE | 0.1-0.3 FTE |
| Backup cost | S3 + tooling 自管 | 內建 + tiered storage |
| Cross-region DR cost | Manual + 2x infrastructure | 1-click + 1.5-2x billing |
| Time to value | 1-3 個月(HA + ops setup) | 1-2 週(cluster ready + IAM) |
| Migration cost | - | 1-3 FTE × 2-3 個月 |
Break-even:~200GB / 中型 workload、Atlas operational savings 平攤 1-2 年後比 self-managed cheaper;TB+ 大型 workload self-managed 仍可能便宜、但需要 ops team。
整合 / 下一步
跟 PostgreSQL → Aurora migration 對照
兩篇都是 Type C operational redesign hybrid、模板共用、細節差:
- Aurora 端 RDS Proxy 是推薦做法、Atlas 端 Private Endpoint 更標準
- Aurora 端 IAM authentication 是 optional best practice、Atlas IAM 是 推薦預設
- 兩家 cost model 都複雜、I/O cost 是 surprise 主要來源
跟 Application 端 IAM token rotation 整合
Vault dynamic credential 可 issue Atlas Database User credential、lease lifecycle 對齊 application;對 high-stakes workload 是好做法、但 setup 複雜。
下一步議題
- Atlas Data Federation:跨 Atlas 集群 query S3 / 跨 region;如果走 multi-region 評估這 feature
- Atlas Online Archive:cold data 自動 archive 到 S3、查 query 透明;對 retention 重的 workload 省 storage cost
- Atlas Serverless:burst workload 適合、steady 不划算
相關連結
- Source vendor:MongoDB
- 平行 migration playbook (Type C):PostgreSQL → Aurora
- 平行 migration playbook:Splunk → Elastic(Type A schema 差) / Kafka ↔ NATS(Type E paradigm shift)
- Methodology:Migration playbook methodology(本文驗證 Type C 標準形態)
#backend #database #mongodb #atlas #managed #migration #type-c