This article is a cross-vendor migration playbook that cross-links Prometheus and the Grafana Stack (Grafana Cloud Metrics, Mimir-backed). Running the 6-dimension audit from migration-playbook-methodology gives Operational = High, which maps to a Type C operational-redesign hybrid.

Feature / ops / cost: three-way comparison

| Dimension | Self-managed Prometheus | Grafana Cloud Metrics |
| --- | --- | --- |
| Storage backend | Local disk + remote_write (optional) | Mimir + S3 (auto cold tier) |
| Retention | Local TSDB, 15 days by default | 13 months by default, extendable |
| HA | Two Prometheus + sidecar | Built-in multi-AZ |
| Cardinality limit | Self-managed limit + recording rule | 1.5M active series / tier, quota for scale-up |
| Query API | PromQL + Prometheus HTTP API | Compatible (edge cases: see Case 3) |
| Alert | Alertmanager, self-managed | Grafana Cloud Alerting |
| Dashboard | Grafana, self-managed | Grafana Cloud (included) |
| Long-term storage | Thanos / Cortex / Mimir, self-managed | Mimir built in |
| Cost (mid-tier) | $500-2000 / mo + ops FTE | $300-1500 / mo (billed by series) |
| Operational FTE | 0.3-0.8 | 0.05-0.15 |

6-dimension diff audit

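| Dimension | Level |
| --- | --- |
| Schema / API | Low (PromQL + API compatible) |
| Operational | High (HA / retention / scaling all managed) |
| Paradigm | Low (same Prometheus metric paradigm) |
| Components | Low |
| Application change | Low (change the remote_write endpoint) |
| Data topology | Low |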

Operational = High → Type C standard.

Why migrate: three drivers (retention / ops / vendor consolidation)

| Driver | Trigger |
| --- | --- |
| Retention | Prometheus local TSDB defaults to 15 days; long-term retention means self-managing Thanos / Cortex / Mimir |
| Ops FTE | Self-managing Prometheus + Alertmanager + Grafana adds up to 0.5-1 FTE |
| Vendor consolidation | Already on Grafana Cloud (logs / traces); adding metrics unifies the stack |

Operational redesign

| Concept | Self-managed | Grafana Cloud Metrics |
| --- | --- | --- |
| Cluster bootstrap | Helm chart + manual config | One-click creation in the UI |
| HA | Two-Prometheus configuration | Built-in multi-AZ Mimir |
| Long-term retention | Thanos / Cortex / Mimir, self-managed | Built-in (S3-backed) |
| Cardinality control | Manual recording rule + relabel | Adaptive sampling + cardinality limit |
| Alerting | Alertmanager, self-managed | Grafana Cloud Alerting (integrated) |
| Dashboard | Grafana self-host | Grafana Cloud (included in the free tier) |

Migration in 4 phases (after a Phase 0 audit)

Phase 0: Audit

  • List every Prometheus job / scrape config
  • Count active series (the billing basis for the Mimir tier)
  • Estimate retention requirements
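The active-series count in the audit above can be read from Prometheus' own TSDB statistics. The sketch below is a minimal illustration that assumes you have already fetched and JSON-decoded the response of the `/api/v1/status/tsdb` endpoint; `summarize_tsdb_status` and the sample payload are hypothetical, and the payload is trimmed to the two fields the function reads.

```python
from typing import Any


def summarize_tsdb_status(status: dict[str, Any], top_n: int = 5) -> dict[str, Any]:
    """Summarize a decoded /api/v1/status/tsdb response.

    Returns the head-block series count (the active series) and the
    metric names that contribute the most series.
    """
    data = status["data"]
    total = int(data["headStats"]["numSeries"])
    per_metric = [
        (entry["name"], int(entry["value"]))
        for entry in data.get("seriesCountByMetricName", [])
    ]
    # Largest contributors first, so the audit starts with the worst offenders.
    per_metric.sort(key=lambda pair: pair[1], reverse=True)
    return {"active_series": total, "top_metrics": per_metric[:top_n]}


# Hypothetical payload, for illustration only.
sample_status = {
    "status": "success",
    "data": {
        "headStats": {"numSeries": 98_500},
        "seriesCountByMetricName": [
            {"name": "http_request_duration_seconds_bucket", "value": 41_000},
            {"name": "container_cpu_usage_seconds_total", "value": 12_300},
            {"name": "up", "value": 100},
        ],
    },
}
print(summarize_tsdb_status(sample_status, top_n=2))
```

Note that the endpoint reports only the top entries per category, so the per-metric list is a starting point for the audit rather than a complete inventory.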

Phase 1: Grafana Cloud setup

  • Set up the account + organization
  • Create an API key for remote_write
  • Enable the Grafana Cloud Mimir endpoint

Phase 2: Dual-write

```yaml
# prometheus.yml
remote_write:
  - url: https://prometheus-prod-XX-prod-us-central-0.grafana.net/api/prom/push
    basic_auth:
      username: <INSTANCE_ID>
      password: <API_KEY>
    write_relabel_configs:
      # Optional: drop high-cardinality before sending
      - source_labels: [__name__]
        regex: 'high_card_metric_.*'
        action: drop
```

Run dual-write for 4-8 weeks and confirm that query results match and that cost stays within the estimate.
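One way to make "query results match" checkable is to run the same instant query against both backends and diff the two result vectors. The sketch below assumes both responses follow the Prometheus HTTP API instant-query shape (a `result` list whose entries carry `metric` and `value`), and that label sets are identical on both sides (no extra external labels added along the way). `compare_instant_vectors` is a hypothetical helper, and the 1% tolerance is an arbitrary illustrative choice.

```python
import math
from typing import Any


def series_key(metric: dict[str, str]) -> tuple[tuple[str, str], ...]:
    """Turn a label set into a hashable, sortable key."""
    return tuple(sorted(metric.items()))


def compare_instant_vectors(
    local: list[dict[str, Any]],
    cloud: list[dict[str, Any]],
    rel_tol: float = 0.01,
) -> dict[str, list]:
    """Diff two instant-vector results (the `data.result` lists).

    Values arrive as strings in the API response, so they are parsed to
    float. NaN never compares as close, so NaN samples are reported as
    mismatches and need a manual look.
    """
    a = {series_key(s["metric"]): float(s["value"][1]) for s in local}
    b = {series_key(s["metric"]): float(s["value"][1]) for s in cloud}
    shared = a.keys() & b.keys()
    return {
        "only_local": sorted(a.keys() - b.keys()),
        "only_cloud": sorted(b.keys() - a.keys()),
        "value_mismatch": sorted(
            k for k in shared if not math.isclose(a[k], b[k], rel_tol=rel_tol)
        ),
    }
```

Running this for the queries behind the critical dashboards, at a fixed evaluation timestamp on both sides, gives a per-query report that can be tracked over the dual-write window.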

Phase 3: Cutover

  • Point dashboards / alerts at the Grafana Cloud endpoint
  • Stop the application layer and the self-managed Grafana instance from querying the self-managed Prometheus

Phase 4: Cleanup

  • Stop scraping on the self-managed Prometheus
  • Keep 1-2 months of historical query capability (via an archive snapshot)
  • Decommission

Production failure drills

Case 1: Cardinality explosion, cost spike

Symptom: in week 2 of dual-write, the Grafana Cloud series count grew from the estimated 100K to 800K, and cost rose 8×.

Root cause: application-level high-cardinality labels (user_id / request_id) were not dropped and were scraped in.

Fix

  1. Drop unbounded labels with write_relabel_configs
  2. Redesign application metrics around fixed-bucket histograms instead of unbounded labels
  3. Set a Mimir cardinality limit as a guard, plus an alert
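To see why a single unbounded label dominates the series count, it helps to count distinct label sets with and without it. The sketch below is a hypothetical illustration: `count_series` is not part of any tool, and the workload (3 routes × 200 users) is invented for the example.

```python
from typing import Iterable, Mapping


def count_series(
    label_sets: Iterable[Mapping[str, str]],
    drop: Iterable[str] = (),
) -> int:
    """Count distinct series after removing the labels named in `drop`."""
    dropped = set(drop)
    distinct = {
        tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        for labels in label_sets
    }
    return len(distinct)


# Hypothetical workload: every (route, user_id) pair becomes its own series.
samples = [
    {"__name__": "http_requests_total", "route": route, "user_id": str(uid)}
    for route in ("/login", "/cart", "/pay")
    for uid in range(200)
]
print(count_series(samples))                    # series with user_id kept
print(count_series(samples, drop=["user_id"]))  # series once user_id is gone
```

One caveat on the relabel route: removing a label at write time makes previously distinct series collapse onto the same label set, so removing the label at the source (fix 2) is generally the cleaner option where it is feasible.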

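Case 2: Recording rules with no counterpart

Symptom: after cutover, some Grafana dashboard panels are empty; they turn out to use a Prometheus-side recording rule (job:request_count:rate5m) that has no matching rule on the Grafana Cloud side.

Root cause: recording rules are defined and evaluated on the Prometheus server; remote_write carries samples, not rule definitions, so Grafana Cloud does not evaluate them unless the rules are set up there as well.

Fix

  1. Export all recording rules and import them into Grafana Cloud Mimir
  2. Or switch to raw queries + Grafana query templates and stop depending on recording rules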
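Before cutover, the gap can be found mechanically by scanning dashboard queries for recording-rule names and checking them against the rules that exist on the target side. The sketch below assumes rules follow the colon-separated naming convention (as in `job:request_count:rate5m`); rules named without colons will not be detected. `recording_rule_refs` and `missing_rules` are hypothetical helpers, and the regex is a heuristic, not a PromQL parser.

```python
import re
from typing import Iterable

# Quoted strings are blanked first so values like instance="host:9090"
# are not mistaken for rule names.
_DOUBLE = r'"(?:\\.|[^"\\])*"'
_SINGLE = r"'(?:\\.|[^'\\])*'"
_STRING = re.compile(_DOUBLE + "|" + _SINGLE)

# An identifier with at least one colon. The lookbehind rejects matches
# that start mid-token, e.g. the "m:1m" inside a [5m:1m] subquery range.
_RULE_NAME = re.compile(
    r"(?<![A-Za-z0-9_:])[A-Za-z_][A-Za-z0-9_]*(?::[A-Za-z0-9_]+)+"
)


def recording_rule_refs(expr: str) -> set[str]:
    """Return colon-style metric names referenced in a PromQL expression."""
    return set(_RULE_NAME.findall(_STRING.sub('""', expr)))


def missing_rules(dashboard_exprs: Iterable[str], cloud_rules: Iterable[str]) -> list[str]:
    """List referenced rule names that are not defined on the target side."""
    referenced: set[str] = set()
    for expr in dashboard_exprs:
        referenced |= recording_rule_refs(expr)
    return sorted(referenced - set(cloud_rules))
```

Feeding it the `expr` fields from exported dashboard JSON and the rule names already loaded on the Grafana Cloud side yields the list of rules that still need to be migrated.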

Case 3: Subtle PromQL behavior differences

Symptom: some queries that run fine on self-managed Prometheus return slightly different results after switching to Grafana Cloud Mimir.

Root cause: Mimir behaves slightly differently from Prometheus in some edge cases (empty result handling / staleness marker timing); most queries match, and < 1% of queries are affected.

Fix

  1. Validate with dual queries before cutover, comparing the critical dashboards
  2. Rewrite affected queries with more robust PromQL patterns
  3. Document a known-incompatibility list

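Case 4: Alert routing changes

Symptom: after cutover, PagerDuty / Slack receive no alerts; the Alertmanager-side webhooks were never switched over.

Root cause: alert logic moves from self-managed Alertmanager to Grafana Cloud Alerting, so routing / contact configuration has to be rebuilt from scratch.

Fix

  1. Rebuild alerts + routing on the Grafana Cloud side before cutover
  2. Run both alert pipelines for 1-2 weeks and confirm that alerts arrive through Grafana Cloud
  3. Switch routing at cutover and run one SOC drill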
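During the dual-pipeline window, parity can be checked by comparing the alerts each pipeline delivered over the same period. The sketch below assumes each alert has been reduced to its label dictionary and that a small set of labels identifies "the same alert" across both systems; `IDENTITY_LABELS`, `alert_identity`, and `missing_alerts` are hypothetical names, and the chosen labels are only an example.

```python
from typing import Iterable, Mapping

# Assumed identity: the two pipelines may attach different extra labels,
# so only these are compared.
IDENTITY_LABELS = ("alertname", "severity", "instance")


def alert_identity(
    labels: Mapping[str, str],
    identity_labels: Iterable[str] = IDENTITY_LABELS,
) -> tuple[tuple[str, str], ...]:
    """Reduce an alert's labels to the fields used for matching."""
    return tuple((name, labels.get(name, "")) for name in identity_labels)


def missing_alerts(
    reference: Iterable[Mapping[str, str]],
    candidate: Iterable[Mapping[str, str]],
    identity_labels: Iterable[str] = IDENTITY_LABELS,
) -> list[tuple[tuple[str, str], ...]]:
    """Alerts seen by the reference pipeline but not by the candidate."""
    names = tuple(identity_labels)
    seen = {alert_identity(labels, names) for labels in candidate}
    wanted = {alert_identity(labels, names) for labels in reference}
    return sorted(wanted - seen)
```

With the self-managed Alertmanager as `reference` and Grafana Cloud Alerting as `candidate`, an empty result over the 1-2 week window is the signal that routing is ready to switch.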

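Case 5: Historical data not queryable

Symptom: after cutover, the SOC wants to query an event from 6 months ago, but Grafana Cloud only holds 2 months of data (from the start of dual-write).

Root cause: Grafana Cloud only has data from the point dual-write began; the earlier historical data in self-managed Prometheus was never backfilled.

Fix

  1. During Phase 2, use promtool tsdb dump + mimirtool to load the self-managed historical data into Mimir
  2. Or keep the self-managed Prometheus read-only for 6 months (for historical queries)
  3. Long-term: retention counts from cutover; historical data is a one-time backfill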

Capacity / cost

| Dimension | Self-managed | Grafana Cloud Metrics |
| --- | --- | --- |
| Compute (100 hosts, 100K series) | $500-1000 / mo + ops | $300-800 / mo |
| Operational FTE | 0.3-0.8 = $3K-8K | 0.05-0.15 = $500-1500 |
| Long-term retention | Thanos / Cortex / Mimir, self-managed | Built-in, 13 months |
| Total (mid-tier) | $4K-9K / mo (incl. FTE) | $1K-2.5K / mo |
| Migration cost | - | 1-2 FTE × 1-2 months |
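The table's totals and the one-time migration cost can be combined into a rough payback estimate. The sketch below assumes an FTE costs about $10K per month, which is the rate implied by the table (0.3 FTE = $3K); `monthly_total` and `payback_months` are hypothetical helpers, and the inputs in the usage lines are illustrative values picked from within the table's ranges.

```python
FTE_MONTHLY_COST = 10_000.0  # assumption: implied by "0.3 FTE = $3K" in the table


def monthly_total(
    compute: float,
    fte_fraction: float,
    fte_monthly_cost: float = FTE_MONTHLY_COST,
) -> float:
    """Monthly cost = infrastructure or subscription spend + operational FTE."""
    return compute + fte_fraction * fte_monthly_cost


def payback_months(
    migration_fte: float,
    migration_months: float,
    self_managed_total: float,
    cloud_total: float,
    fte_monthly_cost: float = FTE_MONTHLY_COST,
) -> float:
    """Months of savings needed to recover the one-time migration effort."""
    saving = self_managed_total - cloud_total
    if saving <= 0:
        raise ValueError("no monthly saving, so the migration never pays back")
    migration_cost = migration_fte * migration_months * fte_monthly_cost
    return migration_cost / saving


# Illustrative inputs from within the table's ranges.
before = monthly_total(compute=750, fte_fraction=0.5)
after = monthly_total(compute=500, fte_fraction=0.125)
print(before, after, payback_months(1, 2, before, after))
```

The estimate is linear and ignores the dual-write period, during which both stacks are paid for at once, so it is a lower bound on the real payback time.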

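Integration / next steps

How this relates to the Datadog → Grafana Stack migration

There are two Grafana Stack routes:

  • Self-host (Mimir + Loki + Tempo) on K8s: open source, self-managed
  • Grafana Cloud: SaaS, operational simplification

This article covers "self-managed Prometheus → Grafana Cloud" and is complementary; a two-stage migration (self-host → Cloud) ends up looking much like "Datadog → Grafana Cloud".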

Integrating with OpenTelemetry

The OTel Collector can ship to Mimir (metrics), Loki (logs), and Tempo (traces) at the same time; moving to OTel as part of the migration avoids repeating this work at the next vendor switch.

Related links