Mimir on Tarragon

Operating the LGTM Stack as a Whole: Loki + Grafana + Tempo + Mimir

Mon, 22 Jun 2026 00:00:00 +0000

This is the vendor deep-dive article for Grafana Stack; it expands on the component-composition section of the overview. Readers new to Grafana Stack should start with the Grafana Stack service page.

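Positioning

Grafana Stack (LGTM = Loki + Grafana + Tempo + Mimir) is a complete option for a self-hosted observability platform. Loki, Mimir, and Tempo each store and query one class of signal, and Grafana sits on top as the query and visualization layer. Understanding each component's responsibility boundary, deployment mode, and failure characteristics is what prevents the black-box situation of "four components installed, and nobody knows which one is broken".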

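Division of responsibility across the four components

Component	Signal type	Query language	Storage backend	Role
Loki	Log	LogQL	Object storage + index (TSDB; BoltDB in older versions)	Log aggregation, a replacement for grep
Mimir	Metric	PromQL	Object storage	Scalable long-term storage for Prometheus
Tempo	Trace	TraceQL	Object storage	Trace storage, span search
Grafana	Visualization	-	-	Dashboards, alerts, data sources

Grafana is the query / visualization layer; Loki / Mimir / Tempo are the storage / query layer. Grafana stores no observability data itself; it connects to data sources (Loki / Mimir / Tempo / Prometheus / Elasticsearch) to query and render.

The four components are deployed independently, scale independently, and each exposes its own health metrics. A failure in one does not take down the others: when Loki is down, Grafana's metric dashboards and trace queries keep working, and only the log panels report errors.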
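
Because each component reports its own health, a per-component readiness probe is one way to answer "which one is broken". The following is a minimal sketch, not an official tool: it assumes Loki, Mimir, and Tempo are reachable on their `/ready` endpoints and Grafana on `/api/health`, and the host names in the usage note are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Readiness paths: Loki, Mimir and Tempo serve /ready; Grafana serves /api/health.
READY_PATHS = {
    "loki": "/ready",
    "mimir": "/ready",
    "tempo": "/ready",
    "grafana": "/api/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while a component is still starting
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout


def check_stack(base_urls: Dict[str, str],
                fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Probe every component separately so one failure cannot hide the others."""
    results = {}
    for name, base in base_urls.items():
        url = base.rstrip("/") + READY_PATHS[name]
        results[name] = fetch(url) == 200
    return results
```

Calling `check_stack({"loki": "http://loki:3100", "grafana": "http://grafana:3000"})` returns one boolean per component; the `fetch` parameter exists only so the probe can be replaced in tests.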

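Deployment modes

Monolithic mode

Each backend (Loki, Mimir, Tempo) runs all of its internal services in a single process / container. It suits small scale (a few GB of logs per day, a few hundred thousand metric series, a small number of traces). Deployment is the simplest of the options: one docker-compose file or Helm chart brings up the whole stack.

The limitation is that nothing inside a backend scales independently: when log ingest is heavy but query load is light, monolithic mode cannot give more resources to Loki's ingest path alone.

Microservices mode

Each component is split into separate deployments, each with its own autoscaling. Loki splits into distributor / ingester / querier / compactor; Mimir splits into similar services; Tempo has a corresponding layering.

It suits medium to large scale. Deployment and operational complexity rise significantly: every sub-service of every component needs its own health check, autoscaling configuration, and, where stateful, persistent volume.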

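Selection criteria

Condition	Recommended mode
Team < 5 people, daily logs < 10 GB	Monolithic
Need to scale one signal type independently	Microservices
Prefer not to self-manage, budget allows	Grafana Cloud
Prometheus already in place, only logs / traces needed	Add Loki + Tempo incrementally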

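Common failure modes

Loki: ingester OOM

Loki ingesters hold log chunks in memory, so they are prone to OOM under high traffic. The trigger is a sudden spike in log volume (an error storm after a deploy, or a service switched to debug log level).

Metrics to read: loki_ingester_memory_chunks, process_resident_memory_bytes. Remediation: tune the chunk flush settings so chunks are written to object storage more often and memory pressure drops, add ingester replicas, or rate-limit log volume at the pipeline layer (OTel Collector).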

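Mimir: compactor stuck

The Mimir compactor merges the blocks written by ingesters. When it stalls, the block count keeps growing, queries have to scan more blocks, and latency rises.

Metrics to read: cortex_compactor_runs_completed_total stops increasing while cortex_bucket_blocks_count keeps growing. Remediation: check object storage write permissions and latency, give the compactor more resources (CPU / memory), or pause ingestion temporarily so the compactor can catch up.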
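
The two-signal rule above (completed runs flat, block count rising) can be written down as a small check. This is a sketch under assumptions: the inputs are samples of the two metrics already fetched over the same window, oldest first, and the `min_samples` threshold is a hypothetical default rather than a recommended value.

```python
from typing import Sequence


def compactor_stalled(runs_completed: Sequence[float],
                      block_counts: Sequence[float],
                      min_samples: int = 3) -> bool:
    """Flag a stall when the completed-runs counter is flat and blocks keep growing.

    runs_completed: samples of cortex_compactor_runs_completed_total, oldest first.
    block_counts:   samples of cortex_bucket_blocks_count over the same window.
    A counter reset (compactor restart) also looks flat here, so treat a positive
    result as a prompt to investigate, not as proof.
    """
    if len(runs_completed) < min_samples or len(block_counts) < min_samples:
        return False  # too little data to judge
    runs_flat = runs_completed[-1] <= runs_completed[0]
    blocks_growing = all(later > earlier
                         for earlier, later in zip(block_counts, block_counts[1:]))
    return runs_flat and blocks_growing
```

Requiring both conditions avoids alerting on a quiet period in which no new blocks arrive and the compactor legitimately has nothing to do.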

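Tempo: trace not found

A lookup by trace ID returns "trace not found" even though the trace was emitted. Common causes: Tempo's bloom filter / compacted block index does not yet include the trace (there is a delay between ingestion and queryability), or the trace has been deleted by the retention policy.

How to diagnose: check whether the trace's timestamp falls inside the retention window, check tempo_ingester_traces_created_total to confirm ingestion is healthy, and check that the compactor is running normally.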
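
The first diagnostic step, comparing the trace timestamp against retention and ingestion delay, is simple date arithmetic. The sketch below is illustrative only: the retention and delay values are hypothetical inputs you would read from your own Tempo configuration, and the returned strings are just hints for the next check.

```python
from datetime import datetime, timedelta, timezone


def triage_trace_not_found(trace_time: datetime, now: datetime,
                           retention: timedelta,
                           queryable_delay: timedelta) -> str:
    """Classify a 'trace not found' result by the age of the trace."""
    age = now - trace_time
    if age < timedelta(0):
        return "timestamp is in the future: check clock skew"
    if age > retention:
        return "outside retention: trace was likely deleted"
    if age < queryable_delay:
        return "recently ingested: may not be queryable yet, retry later"
    return "inside retention and past the delay: check ingester and compactor health"


# Example with assumed settings: 14 days of retention, 5 minutes until queryable.
now = datetime(2026, 6, 22, 12, 0, tzinfo=timezone.utc)
print(triage_trace_not_found(now - timedelta(days=30), now,
                             timedelta(days=14), timedelta(minutes=5)))
```

Only the last branch points at Tempo itself; the first three are explained by the trace's age alone.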

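Grafana: dashboard provisioning drift

When dashboards are managed through provisioning (YAML / JSON files), a dashboard edited by hand in the UI is overwritten at the next provisioning sync. A team member adjusts a panel in the UI, and the change disappears after the next Grafana restart.

Remediation: route every dashboard change through a git → provisioning pipeline (GitOps) and use the UI only for temporary tweaks and exploration. Set allowUiUpdates to false in the provisioning config so that all changes are forced through git.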

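Dashboard Provisioning

How dashboards are managed determines long-term maintenance cost. Creating dashboards by hand in the UI is the fastest way to start, but as the number of dashboards grows it leads to inconsistent versions, no way to roll back, and unclear ownership.

Infrastructure as Code

Dashboard JSON lives in a git repo and is synced to Grafana through provisioning. Changes go through PR review, have version history, and can be rolled back.

Grafana's provisioning mechanism reads a YAML config that specifies where the dashboard JSON comes from. The built-in provider reads a local file path, so dashboards fetched over HTTP or an API have to be written to disk first. In a Helm chart deployment, the dashboard JSON goes in a ConfigMap or a persistent volume.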
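
To make the provisioning config concrete, here is a small sketch that renders a file-based dashboard provider with allowUiUpdates turned off. The field names follow Grafana's dashboard provider format; the provider name, folder, and path are hypothetical, and a real pipeline would more likely keep this YAML in git than generate it.

```python
import json


def render_dashboard_provider(name: str, path: str, folder: str = "") -> str:
    """Render a Grafana dashboard provider file that rejects UI edits.

    json.dumps is used only to quote the string values; a double-quoted JSON
    string is also a valid YAML scalar.
    """
    lines = [
        "apiVersion: 1",
        "providers:",
        f"  - name: {json.dumps(name)}",
        f"    folder: {json.dumps(folder)}",
        "    type: file",
        "    allowUiUpdates: false",
        "    options:",
        f"      path: {json.dumps(path)}",
    ]
    return "\n".join(lines) + "\n"


print(render_dashboard_provider("services", "/var/lib/grafana/dashboards",
                                folder="Services"))
```

With this provider in place, saving a provisioned dashboard from the UI is refused, which pushes edits back to the git → provisioning path.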

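Grafonnet / Jsonnet

Use Grafonnet (Grafana's Jsonnet library for dashboards as code) to generate dashboard JSON. It fits scenarios with many similar dashboards: one dashboard per service, identical in structure but different in data source and labels.

Grafonnet has a steeper learning curve than writing JSON directly, but it starts to pay back in maintenance efficiency once there are more than about 20 dashboards.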
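
The "same structure, different labels" idea can be shown without Jsonnet. The sketch below is not Grafonnet; it is the same templating approach written in Python, with a hypothetical metric name (`http_requests_total`), a hypothetical `service` label, and a data source UID you would substitute with your own.

```python
import json


def service_dashboard(service: str, datasource_uid: str) -> dict:
    """Build one dashboard dict from a shared template, varying only the service."""
    selector = f'{{service="{service}"}}'
    return {
        "uid": f"svc-{service}",
        "title": f"{service} overview",
        "panels": [
            {
                "type": "timeseries",
                "title": "Request rate",
                "datasource": {"type": "prometheus", "uid": datasource_uid},
                "targets": [
                    {"expr": f"sum(rate(http_requests_total{selector}[5m]))"}
                ],
                "gridPos": {"x": 0, "y": 0, "w": 24, "h": 8},
            }
        ],
    }


# One JSON document per service, ready to be committed next to the provisioning config.
for svc in ["checkout", "search"]:
    print(json.dumps(service_dashboard(svc, "mimir-prod"), indent=2)[:60], "...")
```

A layout change is then made once in the template and regenerated for every service, which is the maintenance benefit the paragraph above describes.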

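Next-step routing

Grafana Stack service page: overview and day-to-day operations
Prometheus service page: the upstream metric source for Mimir
OTel Collector deployment modes: the ingestion entry point for LGTM
4.11 telemetry pipeline: governance of each pipeline layer
4.18 operating model: ownership of dashboards / alerts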

Self-managed Prometheus → Grafana Cloud Metrics: feature × ops × cost comparison

Tue, 19 May 2026 00:00:00 +0000

This is a cross-vendor migration playbook that cross-links Prometheus and Grafana Stack (Grafana Cloud Metrics, Mimir-backed). Running the 6-dimension audit from migration-playbook-methodology maps it to Operational = High → Type C operational redesign hybrid.

Feature / ops / cost: three-dimension comparison

Dimension	Self-managed Prometheus	Grafana Cloud Metrics
Storage backend	Local disk + remote_write (optional)	Mimir + S3 (auto cold tier)
Retention	TSDB local, 15 days by default	13 months by default, extendable
HA	Two Prometheus + sidecar	Built-in multi-AZ
Cardinality limit	Self-managed limits + recording rules	1.5M active series / tier, quota scales up
Query API	PromQL + Prometheus HTTP API	Compatible (minor edge-case differences, see Case 3)
Alert	Alertmanager self-managed	Grafana Cloud Alerting
Dashboard	Grafana self-managed	Grafana Cloud (included)
Long-term storage	Self-managed Thanos / Cortex / Mimir	Mimir built in
Cost (mid-tier)	$500-2000 / mo + ops FTE	$300-1500 / mo (billed by series)
Operational FTE	0.3-0.8	0.05-0.15

Running the 6-dimension diff audit:

Dimension	Level
Schema / API	Low (PromQL + API compatible)
Operational	High (HA / retention / scaling fully managed)
Paradigm	Low (same Prometheus metric paradigm)
Components	Low
Application change	Low (change the remote_write endpoint)
Data topology	Low

Operational = High → Type C standard.

Why migrate: three drivers (retention / ops / vendor consolidation)

Driver	Trigger
Retention	Prometheus local TSDB defaults to 15 days; long-term retention requires self-managed Thanos / Cortex / Mimir
Ops FTE	Self-managing Prometheus + Alertmanager + Grafana adds up to 0.5-1 FTE
Vendor consolidation	Grafana Cloud is already in use (logs / traces); adding metrics unifies the stack

Operational redesign

Concept	Self-managed	Grafana Cloud Metrics
Cluster bootstrap	Helm chart + manual config	Created from the UI in one click
HA	Two-Prometheus setup	Built-in multi-AZ Mimir
Long-term retention	Self-managed Thanos / Cortex / Mimir	Built-in (S3-backed)
Cardinality control	Manual recording rules + relabeling	Adaptive Metrics + cardinality limit
Alerting	Self-managed Alertmanager	Grafana Cloud Alerting (integrated)
Dashboard	Self-hosted Grafana	Grafana Cloud (included in the free tier)

Migration: Phase 0 to Phase 4

Phase 0: Audit

List every Prometheus job / scrape config
Count active series (the billing basis for the Mimir tier)
Estimate retention requirements
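
For the active-series count, one option is to ask the existing Prometheus through its HTTP query API and break the result down by job. The sketch below assumes the standard `/api/v1/query` response shape; the query counts every series, which can be expensive on a large server, and the base URL in the usage note is hypothetical.

```python
import json
from urllib.parse import urlencode

# Counts all current series grouped by scrape job.
SERIES_BY_JOB_QUERY = 'count by (job) ({__name__=~".+"})'


def build_query_url(base_url: str, query: str = SERIES_BY_JOB_QUERY) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


def series_by_job(response_body: str) -> dict:
    """Parse an instant-query response into {job: series_count}."""
    doc = json.loads(response_body)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    counts = {}
    for sample in doc["data"]["result"]:
        job = sample["metric"].get("job", "<none>")
        # Each vector sample is [timestamp, "value-as-string"].
        counts[job] = int(float(sample["value"][1]))
    return counts
```

Fetching `build_query_url("http://prometheus:9090")` and passing the body to `series_by_job` gives a per-job breakdown; summing the values gives the number to compare against the billing tier.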

Phase 1: Grafana Cloud setup

Account + organization setup
API key for remote_write
Enable the Grafana Cloud Mimir endpoint

Phase 2: Dual-write

# prometheus.yml
remote_write:
  - url: https://prometheus-prod-XX-prod-us-central-0.grafana.net/api/prom/push
    basic_auth:
      # Grafana Cloud metrics instance ID (value left blank in the source)
      username: 
      # Grafana Cloud API key / access token (value left blank in the source)
      password: 
    write_relabel_configs:
      # Optional: drop high-cardinality metrics before sending
      - source_labels: [__name__]
        regex: 'high_card_metric_.*'
        action: drop

Run for 4-8 weeks and confirm that query results match on both sides and that cost stays within the estimate.
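
"Query results match" needs a tolerance, because the two backends rarely return bit-identical floats. A minimal sketch of the comparison, assuming the results of the same query have already been fetched from both sides and flattened into `{series_key: value}` dictionaries; the 1% tolerance is a hypothetical default.

```python
import math
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[float], Optional[float]]


def compare_series(local: Dict[str, float], remote: Dict[str, float],
                   rel_tol: float = 0.01) -> List[Mismatch]:
    """Return series that are missing on one side or differ beyond rel_tol."""
    mismatches: List[Mismatch] = []
    for key in sorted(set(local) | set(remote)):
        a, b = local.get(key), remote.get(key)
        if a is None or b is None:
            mismatches.append((key, a, b))  # present on one side only
        elif not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9):
            mismatches.append((key, a, b))
    return mismatches
```

An empty list for every query behind the critical dashboards is a reasonable exit condition for the dual-write phase; series present on only one side usually point at a relabel rule rather than a backend difference.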

Phase 3: Cutover

Point dashboards / alerts at the Grafana Cloud endpoint
Stop the application layer and the self-managed Grafana instance from querying self-managed Prometheus

Phase 4: Cleanup

Stop scraping on self-managed Prometheus
Keep 1-2 months of historical query capability (using an archive snapshot)
Decommission

Production failure drills

Case 1: Cardinality explodes, cost spikes

Symptom: in week 2 of dual-write, Grafana Cloud series grow from the estimated 100K to 800K and cost goes up 8x.

Root cause: application-level high-cardinality labels (user_id / request_id) were not dropped and were scraped in.

Fix:

Drop unbounded labels with write_relabel_configs
Redesign application metrics to use fixed-bucket histograms instead of unbounded labels
Set a Mimir cardinality limit as a guard, plus an alert
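
Before writing drop rules it helps to know which labels are actually unbounded. The sketch below counts distinct values per label over a sample of series; it assumes the label sets have already been exported as dictionaries, and the threshold is a hypothetical cut-off to tune per environment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def label_cardinality(series: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count distinct values seen for each label name."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def unbounded_labels(series: Iterable[Dict[str, str]], threshold: int) -> List[str]:
    """Return label names whose distinct-value count exceeds threshold."""
    card = label_cardinality(series)
    return sorted(name for name, count in card.items() if count > threshold)


sample = [{"method": "GET", "user_id": str(i)} for i in range(5)]
print(unbounded_labels(sample, threshold=3))
```

Labels that come out of this check are the candidates for write_relabel_configs drop rules or for a redesign of the metric itself.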

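Case 2: Recording rules missing on the target

Symptom: after cutover some Grafana dashboard panels are empty; they turn out to use a Prometheus-side recording rule (job:request_count:rate5m) that has no counterpart in Grafana Cloud.

Root cause: recording rule definitions live on the Prometheus server. remote_write can forward the series a rule produces only while that server keeps evaluating it; the rule definitions themselves are not carried over, so Grafana Cloud needs its own recording rules.

Fix:

Export all recording rules and import them into Grafana Cloud Mimir
Or switch to raw queries + Grafana query templates and drop the dependency on recording rules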
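
Finding the affected panels before cutover is mostly a text search. The sketch below relies on the common level:metric:operations naming convention, where a colon in a metric name marks a recording rule; it assumes panel expressions and the list of rules already defined on the target have been exported, and it can produce false positives for colons inside quoted label values.

```python
import re
from typing import Iterable, List, Set

# A metric name containing at least one colon; the lookbehind prevents matching
# the tail of a subquery range such as [5m:1m].
RULE_NAME = re.compile(r"(?<![A-Za-z0-9_:])[A-Za-z_][A-Za-z0-9_]*(?::[A-Za-z0-9_]+)+")


def referenced_rules(expr: str) -> Set[str]:
    """Return recording-rule-style names used in one PromQL expression."""
    return set(RULE_NAME.findall(expr))


def missing_rules(panel_exprs: Iterable[str], target_rules: Set[str]) -> List[str]:
    """Return rule names used by dashboards but not defined on the target."""
    used: Set[str] = set()
    for expr in panel_exprs:
        used |= referenced_rules(expr)
    return sorted(used - target_rules)
```

Running this over every dashboard's queries gives the list of rules to import before any panel is switched to the new data source.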

Case 3: Subtle PromQL behavior differences

Symptom: some queries that ran fine on self-managed Prometheus return slightly different results after switching to Grafana Cloud Mimir.

Root cause: Mimir behaves slightly differently from Prometheus in some edge cases (empty result handling / staleness marker timing). Most queries match; fewer than 1% are affected.

Fix:

Validate with dual queries before cutover, comparing the critical dashboards
Rewrite affected queries using more robust PromQL patterns
Document a known-incompatibility list

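Case 4: Alert routing changes

Symptom: after cutover PagerDuty / Slack receive no alerts; the Alertmanager-side webhooks turn out not to have been switched over.

Root cause: alert logic moved from self-managed Alertmanager to Grafana Cloud Alerting, which means routing / contact configuration has to be rebuilt from scratch.

Fix:

Rebuild alerts + routing in Grafana Cloud before cutover
Run both alert pipelines for 1-2 weeks and confirm the Grafana Cloud side fires and delivers
Switch routing at cutover and run one SOC drill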

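Case 5: Historical data cannot be queried

Symptom: after cutover the SOC wants to query an event from 6 months ago, but Grafana Cloud only has 2 months of data (since dual-write began).

Root cause: Grafana Cloud only has data from the start of dual-write; the earlier historical data in self-managed Prometheus was never backfilled.

Fix:

During Phase 2, load the self-managed historical data into Mimir with promtool tsdb dump + mimirtool
Or keep self-managed Prometheus read-only for 6 months (for historical queries)
Long-term: retention is counted from cutover, and historical data is a one-time backfill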

Capacity / cost

Dimension	Self-managed	Grafana Cloud Metrics
Compute (100 hosts, 100K series)	$500-1000 / mo + ops	$300-800 / mo
Operational FTE	0.3-0.8 = $3K-8K / mo	0.05-0.15 = $500-1500 / mo
Long-term retention	Self-managed Thanos / Cortex / Mimir	Built-in 13 months
Total (mid-tier)	$4K-9K / mo (incl. FTE)	$1K-2.5K / mo
Migration cost	-	1-2 FTE × 1-2 months

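Integration / next steps

Relation to the Datadog → Grafana Stack migration

There are two Grafana Stack routes:

Self-host (Mimir + Loki + Tempo) on K8s: open source, self-managed
Grafana Cloud: SaaS, operational simplification

This article covers "self-managed Prometheus → Grafana Cloud" and is complementary; a two-stage run (self-host → Cloud) ends up similar to "Datadog → Grafana Cloud".

Integration with OpenTelemetry

OTel Collector can ship to Mimir (metrics) + Loki (logs) + Tempo (traces) at the same time. Adopting OTel as part of the migration avoids repeating the work at the next vendor switch.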

Datadog → Grafana Stack: breaking a $50K/month bill down into self-hosted observability

Tue, 19 May 2026 00:00:00 +0000

This is a cross-vendor migration playbook that cross-links Datadog (source) and Grafana Stack (target). Compared with the previous three migrations (Splunk → Elastic phased / Redis → DragonflyDB drop-in / PostgreSQL → Aurora hybrid), this one is a cost-driven multi-tool migration: it does not swap one product for another, it splits an all-in-one SaaS into five dedicated OSS / cloud components.

Breaking down the $50K/month bill: see where the money goes before deciding how to migrate

For a mid-size SaaS (100-500 hosts, 5K-50K metric series, TB-level logs per day), the monthly Datadog bill looks like this:

Billing item	Average unit price	Mid-size SaaS estimate / month
Infrastructure host	$15-23 / host	200 hosts × $20 = $4,000
APM host	$31 / host	100 hosts × $31 = $3,100
Custom metrics	$0.05 / series ($5 / 100 series)	30K series × $0.05 = $1,500
Log ingest	$0.10 / GB ingested	50 TB × $0.10 / GB = $5,000
Log retention (15-day)	$1.27 / million events	5 billion events × $1.27 = $6,350
Log indexing	$1.70 / million events	5 billion events × $1.70 = $8,500
Network	$5 / host	200 hosts × $5 = $1,000
RUM / Session	$1.50 / 1000 sessions	3M sessions × $1.50 = $4,500
Synthetics	$5 / 10K test runs	50K tests = $25
Total	-	$34,000 / month (conservative estimate)
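
The bill is plain multiplication, so it is worth recomputing before using it in a business case. The sketch below reproduces the total from the line items in integer cents to avoid floating-point noise. One assumption is stated loudly: the quantities for log events and RUM sessions are taken as 5 billion events and 3 million sessions, because those are the volumes that the per-line dollar amounts and the $34,000 total imply.

```python
# (item, quantity, unit price in cents per unit of quantity)
LINE_ITEMS = [
    ("Infrastructure host", 200, 2000),
    ("APM host", 100, 3100),
    ("Custom metrics (series)", 30_000, 5),
    ("Log ingest (GB)", 50_000, 10),
    ("Log retention (million events)", 5_000, 127),
    ("Log indexing (million events)", 5_000, 170),
    ("Network (host)", 200, 500),
    ("RUM (thousand sessions)", 3_000, 150),
    ("Synthetics (10K test runs)", 5, 500),
]


def line_totals_cents(items):
    """Return {item: monthly cost in cents}."""
    return {name: qty * price for name, qty, price in items}


def log_share(totals):
    """Fraction of the bill that is log ingest, retention and indexing."""
    log_cost = sum(cost for name, cost in totals.items() if name.startswith("Log"))
    return log_cost / sum(totals.values())


totals = line_totals_cents(LINE_ITEMS)
print(f"total: ${sum(totals.values()) / 100:,.2f}")
print(f"log share: {log_share(totals):.0%}")
```

The three log lines together account for more than half of the total, which is why the log stream usually dominates the savings estimate.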

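Scaled up to a production footprint of 500 hosts / 100 TB of logs, the bill lands in the $80K-150K / month range. A Grafana stack (self-hosted on K8s + some Grafana Cloud services) with equivalent capacity typically costs $8K-30K / month, roughly a 2.5-5x cost reduction.

But cost is not the only driver. Others:

Multi-cloud / hybrid: Datadog is centralized; Grafana can be deployed in a distributed way that satisfies data residency
OpenTelemetry-first: the Grafana stack is OTel-native, while Datadog still relies on a vendor-specific agent
Long-term retention: 1-year retention on Loki with an S3 cold tier is 10-50x cheaper than Datadog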

Five responsibilities, five components: not a one-product replacement

Datadog is an all-in-one SaaS: a single agent + a single UI cover 5 responsibilities. The Grafana stack hands them to 5 dedicated components:

Responsibility	Datadog	Grafana Stack counterpart
Metric	Datadog metrics	Mimir (Prometheus-compatible long-term)
Log	Datadog Logs	Loki (label-indexed logs)
Trace	Datadog APM	Tempo (trace-only object storage)
Dashboard	Datadog dashboards	Grafana
Agent / shipper	Datadog Agent	Alloy (OTel-based collector) + Grafana Agent / Promtail (legacy)

The migration is five independent streams, not a single cutover. SREs need to take apart the mental model of "one agent covers everything".

Migration structure: each component phased on its own, staggered overall

Unlike the previous three migrations, which are linear flows, this one is 5 parallel migration streams + cross-stream coordination:

          Phase 0     Phase 1               Phase 2         Phase 3
          Audit       Deploy                Dual-ship       Cutover
Metric    [audit]──→  [deploy Mimir]──→     [dual-ship]──→  [cutover]
APM       [audit]──→  [deploy Tempo]──→     [dual-ship]──→  [cutover]
Log       [audit]──→  [deploy Loki]──→      [dual-ship]──→  [cutover]
Dashboard [audit]──→  [deploy Grafana]──→   [rebuild]──→    [cutover]
Alert     [audit]──→  [deploy Alertmgr]──→  [parallel]──→   [cutover]

Each stream runs its own dual-ship + cutover and need not be synchronized. Typically Metric goes first (cardinality issues surface fastest), then Log, and APM last (trace correlation depends most on dashboards / alerts).

Agent migration: Datadog Agent → OTel Collector / Alloy

Datadog Agent is a vendor-specific binary; pull it out and replace it with OpenTelemetry Collector / Grafana Alloy:

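// Alloy config (Alloy configuration syntax, HCL-like)
// Discover pods; the scrape and log components below reference this one.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "k8s_pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.internal/api/v1/push"
  }
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.production.receiver]
}

// Assumed Loki push endpoint; the host name is hypothetical.
loki.write "production" {
  endpoint {
    url = "https://loki.internal/loki/api/v1/push"
  }
}

otelcol.receiver.otlp "default" {
  grpc {}
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// Assumed Tempo OTLP gRPC endpoint; the host name is hypothetical.
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.internal:4317"
  }
}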

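Running dual shippers during the migration is standard practice:

Datadog Agent and Alloy run side by side (agent resource usage doubles in the short term)
Ship from the same host to both backends and check for consistency
Gradually disable the metric / log / APM submodules of Datadog Agent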

Production failure drills

Case 1: Cardinality explodes, series surge on the Mimir side

Symptom: 30K series on the Datadog side become 500K series once shipped to Mimir, and the Mimir ingesters OOM.

Root cause: Datadog internally applies automatic aggregation and low-cardinality enforcement to tags; Prometheus / Mimir count every unique label set as one series, so high-cardinality labels in application code (user_id / request_id) blow up directly.

Fix:

Audit phase: run topk(100, count by (__name__) ({__name__=~".+"})) to find high-cardinality metrics
Drop high-cardinality labels: relabel rules in Alloy / OTel Collector drop user_id and other unbounded labels
Switch to histogram buckets: high cardinality usually comes from label combinations; use fixed-bucket histograms instead
Turn metrics into logs where appropriate: a request ID is trace context and should not be a metric label

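Case 2: Log volume cost estimate is off

Symptom: one month after deploying Loki, the S3 bill is 2x the estimate; both object storage and query GB-scanned exceed expectations.

Root cause: Datadog applies automatic sampling / aggregation to logs and bills on indexed events; Loki ingests everything raw into S3 storage and is billed by actual bytes. Raw log volume is 3-10x higher than indexed events.

Fix:

Ingest-side sampling: sample debug / info logs in Alloy / Promtail and ingest only warn / error in full
Log structure: JSON logs compress better than text logs, cutting Loki's S3 footprint by about 50%
Retention tiers: hot for 7 days on S3 Standard / cold for 1 year on S3 Glacier, controlled by a retention budget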
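
The ingest-side sampling rule (keep warn / error in full, sample the rest) can be sketched as a pure function. This is an illustration of the policy, not Alloy or Promtail configuration: the level names and the hash-based sampling are assumptions, chosen so that the same line always gets the same decision on every host.

```python
import zlib

# Levels that are always ingested in full.
ALWAYS_KEEP = {"warn", "warning", "error", "fatal"}


def keep_log(level: str, line: str, sample_rate: float) -> bool:
    """Decide whether to ship a log line.

    sample_rate is the fraction of debug / info lines to keep (0.0 to 1.0).
    The decision hashes the line content, so it is deterministic and needs no
    shared state between shippers.
    """
    if level.lower() in ALWAYS_KEEP:
        return True
    bucket = zlib.crc32(line.encode("utf-8")) % 10_000
    return bucket < int(sample_rate * 10_000)


kept = sum(keep_log("info", f"request {i} ok", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 info lines")
```

A content hash means identical repeated lines are either all kept or all dropped; hashing a request or trace ID instead would keep whole requests together.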

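Case 3: Datadog dashboards do not convert directly to Grafana

Symptom: the migration plan assumed "automatic dashboard conversion"; in practice, running Datadog API export → Grafana import leaves 80% of dashboards with missing widgets or mismatched metrics.

Root cause:

Datadog query syntax is not directly compatible with the PromQL used by Grafana / Mimir
Datadog widget types (top-list / hostmap) have no direct Grafana equivalent
Tag-based aggregation maps to Prometheus labels, but the syntax differs

Fix:

Accept the rebuild: production-grade dashboards must be rebuilt by hand; do not expect automatic conversion
Prioritize: rebuild the 30% that are SOC-facing / production-critical first and deprecate the rest
Add 4-6 weeks to the migration window: dashboard rebuild is an underestimated effort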

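Case 4: Alert routing logic changes, PagerDuty integration breaks

Symptom: after cutover alerts do not reach PagerDuty, and the SOC only notices half an hour later. The webhook configuration on the alert side is correct, but the payload format differs from Datadog's, so the PagerDuty-side rules filter it out.

Root cause:

The Datadog alert payload contains event_type=alert, and the PagerDuty integration routes on it
The default Alertmanager payload has a different structure
The PagerDuty rules were written against the Datadog event schema, so Alertmanager events do not match

Fix:

Pre-cutover test: dry-run Alertmanager → PagerDuty and send a test alert to verify
PagerDuty Service: create a separate Grafana-source Service instead of sharing the Datadog Service
Alertmanager template: use a webhook with a custom JSON template so the payload is close to the Datadog structure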
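
A pre-cutover dry-run is easier when the payload mapping is explicit. The sketch below turns an Alertmanager webhook body into PagerDuty Events API v2 events, as a hypothetical relay might; the choice of which labels feed `summary`, `source`, and `severity` is an assumption to adapt, and Alertmanager's built-in PagerDuty receiver is an alternative to any custom relay.

```python
from typing import Dict, List

# Severities accepted by the PagerDuty Events API v2.
SEVERITIES = {"critical", "error", "warning", "info"}


def to_pagerduty_events(webhook: Dict, routing_key: str) -> List[Dict]:
    """Map each alert in an Alertmanager webhook payload to one Events v2 event."""
    events = []
    for alert in webhook.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        resolved = alert.get("status") == "resolved"
        event = {
            "routing_key": routing_key,
            "event_action": "resolve" if resolved else "trigger",
            # The fingerprint is stable per alert, so trigger and resolve pair up.
            "dedup_key": alert.get("fingerprint", ""),
        }
        if not resolved:
            severity = labels.get("severity", "error")
            event["payload"] = {
                "summary": annotations.get("summary",
                                           labels.get("alertname", "alert")),
                "source": labels.get("instance", "alertmanager"),
                "severity": severity if severity in SEVERITIES else "error",
                "custom_details": labels,
            }
        events.append(event)
    return events
```

Feeding a recorded webhook body through this function and diffing the output against what the PagerDuty rules expect surfaces schema mismatches before a real incident does.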

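Case 5: SLO definitions and monitor types do not line up

Symptom: a Datadog SLO reports 99.9% availability; after moving to Grafana SLO + Mimir the actual 9X% figure does not match. The SOC compares 5 SLOs on dashboards and 4 of them are off by 0.1-0.3%.

Root cause:

Datadog computes the SLO over the time window with an internal query; Grafana SLO uses a formula written in PromQL
Datadog's handling of missing data for success_rate differs from the PromQL default
Time bucket boundaries are handled differently

Fix:

Redefine the SLO in PromQL: do not try to "copy" it, "redefine" it, and write the PromQL expression carefully
Accept ±0.1% drift: run production-critical SLOs dual-track for 1-2 months and tune the PromQL until drift is acceptable
SLO migration is not a subset of dashboard migration: treat it as its own stream and leave more time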
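
The missing-data point is easy to underestimate, so here is a worked illustration. The sketch computes a time-slice availability under three hypothetical missing-data policies; it does not claim to reproduce what Datadog or Grafana SLO do internally, only that the policy choice alone moves the result by tenths of a percent.

```python
from typing import List, Optional, Tuple

Bucket = Optional[Tuple[int, int]]  # (good requests, total requests) or None


def slice_availability(buckets: List[Bucket], threshold: float,
                       missing: str = "skip") -> float:
    """Share of time buckets whose success ratio meets threshold.

    missing: "skip" drops empty buckets from the denominator,
             "good" counts them as meeting the objective,
             "bad"  counts them as violating it.
    """
    good = counted = 0
    for bucket in buckets:
        if bucket is None:
            if missing == "skip":
                continue
            counted += 1
            if missing == "good":
                good += 1
            continue
        ok, total = bucket
        counted += 1
        if total > 0 and ok / total >= threshold:
            good += 1
    return good / counted if counted else float("nan")


# 1000 buckets: 996 healthy, 2 degraded, 2 with no data at all.
window = [(100, 100)] * 996 + [(90, 100)] * 2 + [None] * 2
for policy in ("skip", "good", "bad"):
    print(policy, f"{slice_availability(window, 0.99, policy):.4%}")
```

With two empty buckets out of 1000, the "good" and "bad" policies differ by 0.2 percentage points, which is the same order as the drift described above.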

Capacity / cost comparison

Dimension	Datadog	Grafana Stack (self-hosted on K8s)
Setup cost	Low (SaaS)	Medium-high (K8s deploy + storage backend)
Operational cost (200 hosts)	$34K / month	$8-12K / month (incl. S3 + K8s)
Operational cost (500 hosts)	$80-150K / month	$15-30K / month
Operational FTE	0.1-0.3	1-2 FTE (K8s + storage + Grafana operator)
Long-term retention	$1.27 / million events for 15+ days	S3 + Loki: ~$0.02 / GB / month
Multi-cloud / hybrid	Constrained by Datadog regions	Deploy anywhere
Vendor lock-in	High	Low (OSS + OTel)
Time to value	1-2 weeks	4-8 weeks
Migration cost (one-time)	-	1-3 FTE × 3 months

Break-even point: at around 150 hosts, self-hosted becomes cheaper when amortized over 3 years; below 100 hosts, SaaS has the better ROI.

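Integration / next steps

Aligning with OpenTelemetry

The migration is an opportunity for an OTel-first transition:

Application code uses the OTel SDK, avoiding Datadog SDK lock-in
Trace context propagation uses W3C Trace Context
A future backend change requires no further application changes

Comparison with Splunk → Elastic

Both are cost-driven SaaS migrations, but the details differ:

Splunk → Elastic is in the SIEM domain, and schema translation is the core issue
Datadog → Grafana is a multi-tool split, and agent + dashboard rebuild is the core issue
Shared pattern: dual-ship → parallel run → cutover

Reverse migration (Grafana Stack → Datadog)

It exists but is rare. The main motivation is operational complexity reduction (not wanting to self-manage Mimir / Loki); the schema mapping runs in the opposite direction, and the agent switches back to Datadog Agent.

Next topics

Grafana Cloud hybrid: some components (Tempo) on Grafana Cloud SaaS, the rest self-hosted, as a hybrid architecture
OpenTelemetry Collector vs Alloy trade-offs: both are OTel-based; Alloy is Grafana's own distribution of the OTel Collector
Vector vs Alloy vs Fluentd: the log shipper battleground, comparing cost / features / OTel integration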
