Datadog on Tarragon

Datadog Cost Governance and Agent Configuration

Mon, 22 Jun 2026 00:00:00 +0000

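This is a vendor deep-dive article on Datadog that expands on the cost and Agent sections of the overview. Readers new to Datadog should start with the Datadog service page.

Positioning

Datadog is a fully managed observability platform covering metrics, logs, traces, profiling, RUM, and synthetic monitoring. The core trade-off of a managed offering is zero operational burden in exchange for cost that scales with usage: the more you use, the more you pay. Billing also spans many dimensions (host, custom metric, log ingestion, span, indexed span), so cost governance requires understanding the pricing model of each one.

Pricing model overview

Datadog's main billing dimensions:

Dimension	Billing unit	Common source of runaway cost
Infrastructure host	Per host per month	Auto-scaling makes the host count fluctuate
Custom metrics	Per unique time series per month	Label explosion (the same problem as cardinality)
Log ingestion	Per GB ingested per month	Debug log level left on
Log indexed retention	Per million events × retention days per month	Default retention too long
APM host + indexed span	Per host per month + per million spans	No sampling configured, everything collected
Profiling	Per host per month (APM add-on)	Stacks on top of overall cost

Most runaway Datadog bills trace back to custom metrics and log ingestion. They scale directly with cardinality and log volume respectively, and both can grow quickly.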

Custom metrics cost control

What counts as a custom metric

Datadog counts each unique combination of metric name and tag values as one time series. http_requests_total{service=checkout, method=GET, status=200} and http_requests_total{service=checkout, method=POST, status=500} are two time series.

The Cartesian product of tag values determines the series count. 5 services × 4 methods × 5 statuses = 100 series. Adding a region tag (3 values) makes it 300. Adding an endpoint tag (50 normalized paths) makes it 15,000.
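The multiplication can be checked with a few lines of Python. This is a minimal sketch: the function name `series_count` is hypothetical, and the result is an upper bound, since not every tag combination necessarily occurs in practice.

```python
from math import prod


def series_count(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric name: the product of
    the number of distinct values of every tag."""
    return prod(tag_cardinalities.values())


tags = {"service": 5, "method": 4, "status": 5}
base = series_count(tags)                                   # 5 * 4 * 5
with_region = series_count({**tags, "region": 3})           # one more tag, 3 values
with_endpoint = series_count({**tags, "region": 3, "endpoint": 50})

print(base, with_region, with_endpoint)  # 100 300 15000
```

Each new tag multiplies the total rather than adding to it, which is why a single unbounded tag such as user_id dominates the bill.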

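Control strategies

Tag allowlist: the same logic as a Prometheus label allowlist. Keep only tags with query value, such as service, method, and status_class (2xx/4xx/5xx). Remove user_id, request_id, and full URLs.

Metrics without Limits: a Datadog feature that filters tags after ingestion and before query. All tags are ingested, but only the selected tags are indexed and billed. It suits the case where you collect everything but query only some dimensions.

DogStatsD aggregation: the DogStatsD server in the Datadog Agent pre-aggregates at the Agent layer, rolling the client's per-request metrics up into per-interval summaries. This reduces the number of data points sent to Datadog. Because it runs on the Agent, it is a pre-aggregation mechanism at a different location from recording rules at the TSDB layer.

Usage attribution: Datadog's Usage Attribution feature breaks custom metric cost down by service / team tag so that each team can see its own metric cost. See 4.15 cost attribution.

Reading the signals

The Metric Summary page in the Datadog UI shows the tag cardinality of each metric name. Review the top 20 highest-cardinality metrics regularly (monthly) to check for unexpected tag explosion.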

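Log ingestion cost control

Index strategy

Datadog log billing has two layers: ingestion (billed on arrival) and indexing (billed by retention days once indexed). You can ingest all logs but index only a subset. Non-indexed logs can be viewed in the 15-minute Live Tail window and are not visible afterwards (unless archived to S3/GCS for rehydration).

A workable tiering:

Error / warning logs: index, 30-day retention
Info logs (critical path): index, 7-day retention
Debug logs: ingest only, do not index (for Live Tail); or do not send at all
Access logs (high volume): do not index, archive to S3, rehydrate when needed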
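The tiering above can be written down as a small decision function, for example to review a routing policy before encoding it in Datadog index and archive settings. This is a sketch only: `LogTier` and `tier_for` are hypothetical names, and treating non-critical info logs like debug logs is an assumption the list does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogTier:
    index: bool          # whether the log is indexed
    retention_days: int  # indexed retention; 0 when not indexed
    archive: bool        # whether to archive (e.g. to S3) for later rehydration


def tier_for(level: str, critical_path: bool = False, access_log: bool = False) -> LogTier:
    """Map a log's level and kind to the tier described in the list above."""
    level = level.lower()
    if access_log:
        # High-volume access logs: no index, archive, rehydrate on demand.
        return LogTier(index=False, retention_days=0, archive=True)
    if level in ("error", "warning", "warn"):
        return LogTier(index=True, retention_days=30, archive=False)
    if level == "info" and critical_path:
        return LogTier(index=True, retention_days=7, archive=False)
    # Debug logs (and, by assumption, non-critical info): ingest only.
    return LogTier(index=False, retention_days=0, archive=False)


print(tier_for("error"))
print(tier_for("info", critical_path=True))
print(tier_for("info", access_log=True))
```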

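Exclusion filter

A Datadog index exclusion filter lets logs matching a pattern enter the ingestion pipeline but skip indexing. For example, health check access logs (path:/health) arrive at hundreds per second but have no debugging value, so an exclusion filter keeps them from consuming index quota.

How the log pipeline relates to Datadog logs

The collector in the 4.11 telemetry pipeline can filter logs before they are sent to Datadog: low-value logs are dropped outright and never reach Datadog ingestion, which saves even the ingestion fee. This saves more than a Datadog exclusion filter, because excluded logs are still billed for ingestion.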

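Agent deployment configuration

Agent deployment modes

Mode	Where it runs	When to use
Host agent	One agent per VM	Traditional VM deployments
DaemonSet agent	One agent per K8s node	Standard K8s deployment
Sidecar agent	One agent per pod	When strict isolation is required
Cluster agent	One per K8s cluster	Collecting cluster-level metrics

Most K8s deployments combine the DaemonSet agent with the Cluster Agent. The DaemonSet agent collects node-level and pod-level metrics / logs / traces; the Cluster Agent collects cluster-level metadata and events.

Agent health

The Agent itself needs monitoring: when an Agent fails, what Datadog sees is "data disappeared", not "the Agent is down".

Signals (emitted by the Agent itself):

datadog.agent.running: whether the Agent process is alive
datadog.agent.check_run: whether each integration check is running normally
datadog.dogstatsd.packets.dropped: packets dropped when the DogStatsD buffer is full

When an Agent goes down, dashboards show gaps in the data. If all hosts show a gap at the same time, the problem is likely on the Datadog backend side; if only specific hosts show a gap, the problem is the Agent on those hosts.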
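The gap heuristic can be sketched as a small triage function. The input shape (a mapping from host name to whether data arrived in the window) and the name `classify_gap` are assumptions for illustration; in practice the input would come from a query on a metric such as datadog.agent.running.

```python
def classify_gap(hosts_reporting: dict[str, bool]) -> str:
    """hosts_reporting maps host name -> whether data arrived in the window."""
    if not hosts_reporting:
        return "no hosts"
    silent = sorted(h for h, ok in hosts_reporting.items() if not ok)
    if not silent:
        return "healthy"
    if len(silent) == len(hosts_reporting):
        # Every host went quiet at once: unlikely to be individual Agents.
        return "all hosts silent: suspect the backend or a shared network path"
    return "agent issue on: " + ", ".join(silent)


print(classify_gap({"web-1": True, "web-2": False, "db-1": True}))
# agent issue on: web-2
```

The "all hosts silent" branch is only a hint: a shared proxy or egress failure produces the same picture as a backend incident.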

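Common Agent failures

CPU / memory over-consumption: the Agent runs too many integration checks, or DogStatsD receives too many custom metrics. Fix: reduce the number of checks, tune the DogStatsD aggregation interval, or upgrade the Agent (newer versions are usually more resource-efficient).

Log collection lag: the Agent's log tailer falls behind and logs reach Datadog late. The usual cause is a mismatch between the log rotation settings and the Agent's tail settings, or a sudden surge in log volume that exceeds the Agent's processing capacity.

Network connectivity: network problems between the Agent and the Datadog intake endpoint. The Agent buffers data and retries, but drops data once the buffer (100MB by default) is full. In environments with unreliable networking (edge locations, restricted networks), enlarge the buffer or configure a proxy.

Integration with OTel

Datadog supports OpenTelemetry: you can instrument with the OTel SDK, run the OTel Collector, and send the data to the Datadog backend. This decouples instrumentation from the vendor at the cost of some Datadog-native features (for example, Watchdog anomaly detection needs metadata from the Datadog Agent).

The choice of integration mode corresponds to the case analysis in 4.C7 Datadog OTel migration practice; cost and semantic alignment during the dual-track period are the main challenges.

Where to go next

Datadog service page: overview and day-to-day operations
4.7 cardinality: the full strategy for cardinality governance
4.15 cost attribution: organizational governance of cost attribution
4.C7 Datadog OTel migration: an integration case for Datadog and OTel
OpenTelemetry: vendor-neutral instrumentation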

Datadog OTLP Ingestion and OTel Integration

Tue, 23 Jun 2026 00:00:00 +0000

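This is a vendor deep-dive article on Datadog that expands on the "OTLP ingestion" section of the overview. Readers new to Datadog should start with the Datadog service page.

Problem scenario

Two situations lead a team to need Datadog's OTLP ingestion:

The team already uses Datadog APM but wants new services or new languages to use the OTel SDK to avoid vendor lock-in. The Datadog SDK covers a limited set of languages (Go / Java / Python / Ruby / Node / .NET / PHP / C++); for services written in Rust, Elixir, or Kotlin Multiplatform, the OTel SDK has broader coverage.

In the other situation, the team has been running OTel + Jaeger or OTel + Grafana and now wants to move visualization to Datadog without re-instrumenting. OTLP ingestion lets traces / metrics / logs produced by the OTel SDK flow straight into Datadog with no application code changes.

Core concepts

The Datadog Agent's OTLP receiver

Datadog Agent 6.32+ ships a built-in OTLP receiver that accepts both gRPC (port 4317) and HTTP (port 4318). After receiving OTLP data, the Agent converts it to Datadog's internal format and sends it through the same pipeline as the Datadog SDK (sampling, tagging, forwarding to the Datadog backend).

This means OTLP-path data is handled in the Datadog UI the same way as Datadog SDK-path data: the same APM trace waterfall, the same service map, the same error tracking. The difference lies in metadata completeness (see feature parity below).

OTLP support across the three signals

Signal	OTLP support	Maps to in Datadog
Traces	Full (OTLP gRPC / HTTP)	APM traces, service map, error tracking
Metrics	Full (OTLP gRPC / HTTP)	Custom metrics (billed per metric)
Logs	Limited (Agent 7.54+ supports OTLP logs)	Datadog Logs (billed by ingestion volume)

OTLP support is most mature for traces, followed by metrics, with logs the newest. A common practice in mixed environments is to send traces + metrics over OTLP and keep logs on the Datadog Agent's native log collection (file tailing / container stdout).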

Datadog SDK vs OTel SDK feature parity

Feature	Datadog SDK	OTel SDK → Datadog
Distributed tracing	Yes	Yes (full)
Continuous profiling	Yes	No (Datadog proprietary)
ASM (Application Security)	Yes	No (requires the Datadog library)
CI Visibility	Yes	No
Dynamic instrumentation	Yes	No
Runtime metrics (GC, threads)	Automatic	Requires manually configured OTel metric instrumentation
Log correlation (trace_id injected into logs)	Automatic	Requires manual configuration (MDC / context propagation)
Unified service tagging	Automatic (`DD_SERVICE` / `DD_ENV` / `DD_VERSION`)	Requires resource attribute mapping

Takeaway: if the team needs profiling / ASM / CI Visibility, the services concerned still need the Datadog SDK. Other services can use the OTel SDK + OTLP ingestion, and the two coexist in the same Datadog org.

Step-by-step configuration

Datadog Agent OTLP settings

# datadog.yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

After restarting the Agent, run datadog-agent status to confirm that the OTLP receiver has started.

OTel SDK endpoint configuration

# Environment variables (language-agnostic)
export OTEL_EXPORTER_OTLP_ENDPOINT="http://datadog-agent:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="checkout-api"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.version=1.2.3"

Resource attribute → Datadog tag mapping

The Datadog Agent automatically converts OTel resource attributes into Datadog tags:

OTel resource attribute	Datadog tag	Notes
`service.name`	`service`	The core of Datadog unified service tagging
`deployment.environment`	`env`	Required; otherwise environment filtering in the Datadog UI does not work
`service.version`	`version`	Used for deployment tracking
`host.name`	`host`	Usually set by the Agent automatically; no need to set it by hand
`container.name`	`container_name`	Set automatically in K8s environments

If the resource attributes do not include deployment.environment, Datadog files the traces under env:none, where they are nearly invisible in the APM UI. This is the most common OTLP onboarding problem.
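A pre-deploy check for this problem can be sketched in a few lines. The code below only mirrors the mapping table and the env:none fallback described above; it is not the Agent's implementation, and the function names are hypothetical.

```python
# Mapping copied from the table above.
ATTRIBUTE_TO_TAG = {
    "service.name": "service",
    "deployment.environment": "env",
    "service.version": "version",
    "host.name": "host",
    "container.name": "container_name",
}


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string: 'k1=v1,k2=v2'."""
    pairs = (item.split("=", 1) for item in raw.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def to_datadog_tags(resource_attributes: dict[str, str]) -> dict[str, str]:
    """Apply the table's mapping; fall back to env 'none' when unset."""
    tags = {
        ATTRIBUTE_TO_TAG[key]: value
        for key, value in resource_attributes.items()
        if key in ATTRIBUTE_TO_TAG
    }
    tags.setdefault("env", "none")  # the onboarding pitfall described above
    return tags


attrs = parse_resource_attributes("service.name=checkout-api,service.version=1.2.3")
print(to_datadog_tags(attrs))
# {'service': 'checkout-api', 'version': '1.2.3', 'env': 'none'}
```

A CI step could fail the build whenever the resulting env tag is "none".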

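OTel Collector → Datadog (alternative path)

If you do not want applications to connect directly to the Datadog Agent, you can put an OTel Collector in between:

# otel-collector-config.yaml
# The pipeline below references an otlp receiver and a batch processor,
# so both must be declared for the config to load.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]

The OTel Collector's datadog exporter sends data directly to the Datadog backend (bypassing the Agent). It suits teams that already run OTel Collector infrastructure and do not want to deploy a Datadog Agent on every node.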

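Failure modes and boundaries

Resource attribute mapping mismatch

OTel service.name values are often written in dot notation (for example com.example.checkout), while Datadog services are usually named with hyphens (for example checkout-api). If the two do not match, the same service appears as multiple nodes on the Datadog APM service map (one from the OTel path, one from the Datadog SDK path).

Fix: standardize service.name naming. If both SDKs coexist, set the OTel SDK resource attribute to exactly the same value as the Datadog SDK's DD_SERVICE.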

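Metric naming convention differences

OTel metrics are named according to the OTel semantic conventions (for example http.server.request.duration), whereas metrics produced by the Datadog SDK and integrations follow Datadog's own names. Datadog metric names may contain dots, so OTel names are generally kept as they are on ingestion rather than rewritten. If a team has metrics produced by both the Datadog SDK and the OTel SDK, the two can show up as duplicates in Datadog (same meaning, different names).

Fix: use the OTel Collector's metricstransform processor to unify names before export, or merge them in Datadog with a metric alias.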

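Limits of log correlation on the OTLP path

The Datadog SDK automatically injects dd.trace_id and dd.span_id into application logs (for example Python logging, Java MDC). The OTel SDK does not: log correlation has to be configured manually by injecting the trace_id from the OTel context into the logging framework.

Without log correlation, Datadog's trace → log navigation does not work. The fix depends on the language: Java uses MDC + the OTel Java agent's log context instrumentation; Python uses opentelemetry-instrumentation-logging; Go requires manually taking the trace ID from the span context and writing it to a log field.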
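What manual injection amounts to can be sketched with the standard library alone. Two things here are assumptions rather than confirmed behavior. First, the `current_span` context variable is a stand-in for the OTel context; with the OTel SDK installed the IDs would come from the current span's context instead. Second, the conversion in `to_dd_ids` (decimal string of the low 64 bits of the trace ID) is the convention commonly used for Datadog correlation, and should be checked against the Datadog documentation for the Agent version in use.

```python
import contextvars
import logging

# Stand-in for the OTel span context: (trace_id, span_id) as integers.
current_span = contextvars.ContextVar("current_span", default=None)


def to_dd_ids(trace_id: int, span_id: int) -> tuple[str, str]:
    """Assumed convention: low 64 bits of the trace ID, both as decimal strings."""
    return str(trace_id & 0xFFFFFFFFFFFFFFFF), str(span_id)


class TraceContextFilter(logging.Filter):
    """Attach dd_trace_id / dd_span_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_span.get()
        record.dd_trace_id, record.dd_span_id = to_dd_ids(*ctx) if ctx else ("0", "0")
        return True  # never drop the record, only enrich it


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s dd.trace_id=%(dd_trace_id)s dd.span_id=%(dd_span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Simulate being inside a span, then log.
current_span.set((0x0123456789ABCDEF0123456789ABCDEF, 0x00F067AA0BA902B7))
logger.info("order placed")
```

For Python services, the opentelemetry-instrumentation-logging package named above is the packaged route; this sketch only shows the mechanism.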

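Capacity and cost

Billing on the OTLP path is the same as on the Datadog SDK path:

Signal	Billing unit	OTLP vs Datadog SDK
APM traces	Per ingested span	Same
Metrics	Per custom metric (unique metric name × tag combination)	Same
Logs	Per ingested GB	Same

The cost difference is not in ingestion pricing but in feature access. Using the OTel SDK means losing Profiling / ASM / CI Visibility, which require the Datadog SDK. If the team needs them, going OTLP means additionally deploying the Datadog SDK on core services, and the maintenance cost of two SDKs may exceed that of simply using the Datadog SDK everywhere.

Rule of thumb: if more than 80% of services do not need Profiling / ASM, OTLP plus the Datadog SDK on a few services is a reasonable hybrid. If all core services need Profiling, using the Datadog SDK everywhere is simpler.

Integration and next steps

Datadog service page: overview and day-to-day operations
Datadog cost governance: Agent configuration and cost control
4.C7 Datadog OTel migration: a governance case for moving from the Datadog SDK to an OTel-compatible mode
OpenTelemetry Collector deployment modes: the OTel Collector → Datadog alternative path
← New Relic migration: the bridging role OTLP plays in a New Relic → Datadog migration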

New Relic → Datadog: APM schema mapping + agent replacement + dashboard rebuild

Tue, 19 May 2026 00:00:00 +0000

This is a cross-vendor migration playbook that cross-links New Relic and Datadog. Running the 6-dimension audit from migration-playbook-methodology gives Schema = High (NRQL ↔ Datadog query, different APM agents), which maps to a Type A phased translation.

Problem scenario

A mid-size SaaS has run New Relic for 3-5 years. Production observability is saturated and the team has found several problems: cost has surged (per-host APM + custom events + synthetics), APM traces are not granular enough for Kubernetes-native workloads, and the PagerDuty / Slack integrations exist but have high latency. Meanwhile Datadog offers deep integration between K8s monitoring and APM, and its cost model is more predictable at the 100-500 host scale.

On evaluating the migration, the team finds that New Relic → Datadog is not "just swap the agent": the APM schema, the NRQL query language, custom dashboards, and synthetic monitoring rules all have to be re-mapped, and the agent binary on the application side has to be replaced entirely. This is a Type A migration with a large schema gap, not a drop-in.

Why migrate: three drivers (cost / K8s-native / vendor consolidation)

Driver	Trigger
Cost	New Relic per-host pricing + custom events + synthetics add up to too much; Datadog is more economical on K8s, where a single host runs many containers
K8s-native	The Datadog agent has deeper support for K8s sidecar / DaemonSet / autodiscovery
Vendor consolidation	Already using Datadog for logs / metrics; unifying APM on the same vendor lowers tool-switching cost

Reverse drivers (Datadog → New Relic):

New Relic's integrated package for full-stack observability (APM + browser + mobile + synthetic) still leads
Organizations heavily invested in NRQL and New Relic University training do not switch

Schema mapping

New Relic concept	Datadog equivalent
APM agent (NR Java / Python / Node)	Datadog agent + APM tracer library
NRQL query	Datadog query (Metric / Log / Trace)
Synthetic monitor	Datadog Synthetic Tests
Custom event	Datadog custom metric / log event
NRQL alert condition	Datadog monitor
New Relic dashboard	Datadog dashboard (needs rebuild)
Apdex score	Datadog APM `apm.service.errors` + `apm.service.latency`
Distributed trace	Datadog APM trace (OpenTelemetry-compatible)

Phase 0: Audit + classify

List every application and its NR agent version
List every NRQL alert / dashboard / synthetic monitor
Estimate monthly cost and compare against Datadog

Phase 1: Schema mapping + Datadog cluster setup

Request a Datadog organization / set up IAM integration
VPC peering / private link (if using a self-hosted agent)

Phase 2: Translation pipeline (3-tier)

Tier 1: Datadog-side import tool (API-based NRQL → Datadog query conversion, covers ~40-60%)
Tier 2: LLM-assisted (remaining queries / dashboards)
Tier 3: manual (synthetics / complex correlation)

Phase 3: Parallel run (dual-agent, 4-8 weeks)

Both agents run on the same application and emit metrics / traces / logs to both backends; the SOC compares detection coverage, alerts, and dashboards for consistency.

Phase 4: Cutover + cleanup

Switch the agent on the application side
Downgrade / cancel the New Relic license
Decommission timeline of 3-6 months (to retain the ability to query history)

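Production failure drills

Case 1: NRQL does not map directly to a Datadog query

Symptom: the NRQL query SELECT count(*) FROM Transaction FACET name WHERE duration > 5 SINCE 1 hour ago has to be split into a metric query + filter + group by on the Datadog side. The translation is semantically equivalent but the syntax is completely different, and SOC analysts face a steep learning curve.

Fix:

Translation scripts + LLM assistance, keeping a runbook table of the literal NRQL alongside the Datadog query
SOC training, 1-2 weeks hands-on
Turn some queries into Datadog dashboard widgets rather than querying directly

Case 2: Synthetic monitors fail to map

Symptom: NR Synthetics runs 100+ ping / browser / API tests. After switching to Datadog Synthetics, the "Browser Test" that corresponds to step-based monitors turns out to be complex to configure, and setup effort is 2-3x the estimate.

Fix:

Run sample synthetics before cutover to estimate the real setup cost
Migrate critical synthetics first and evaluate the rest for retirement
Automate with the Datadog API + Terraform rather than building by hand in the UI

Case 3: The cost model inverts

Symptom: in the first month after cutover the Datadog bill is 30% higher than NR's. The breakdown shows three items over estimate: log retention, custom metric series, and log indexing.

Fix:

The pre-migration Datadog cost estimate must include log indexing pricing (billed per indexed event), not just ingest
Scrub PII and sample debug logs on the application side to reduce ingested GB
Control custom metric cardinality (tag combinations blow up the series count)

Case 4: Automatic dashboard conversion fails, 80% rebuilt by hand

Symptom: running NR dashboards through the Datadog import tool leaves 80% of widgets missing or wrongly mapped. The team estimated 2 weeks for the dashboard rebuild; it actually took 6-8 weeks.

Fix:

Accept the rebuild: production dashboards must be rebuilt by hand; do not expect automatic conversion
Prioritize: rebuild the SOC-critical 30% first and deprecate the rest
Add 4-6 weeks to the migration window: dashboard rebuild is an underestimated effort

Case 5: Cross-platform metric naming differences

Symptom: the NR metric Apdex/Apdex has no equivalent in Datadog, so metric names hard-coded in application code stop working, and every alert query on an NR-specific metric fails.

Fix:

Before cutover, list all NR-specific metrics and change application code to OpenTelemetry-style metric names
Rebuild queries on the Datadog side using application-level metric names rather than vendor-specific ones
Long term: name metrics according to the OpenTelemetry semantic conventions to avoid vendor lock-in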

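Capacity / cost

Dimension	New Relic	Datadog
Pricing model	per-host + custom event / synthetic	per-host APM + log indexing + custom metric
K8s-friendly	Medium; autodiscovery exists but is complex to configure	High; K8s-native autodiscovery is first-class
Migration cost	-	2-4 FTE × 2-3 months
Operational FTE	0.3-0.6	0.3-0.6 (comparable)

Integration / next steps

Relation to the Datadog → Grafana Stack migration

Two follow-on paths once on Datadog:

Stay on Datadog after the switch (stable for multiple years)
Switch to Datadog, then move again to Grafana Stack to save cost (multi-tool split, Type D)

Most organizations have already spent 2-3 months on the first NR → Datadog round and will not switch again right away; expect at least 1-2 years of stability.

Alignment with OpenTelemetry

Use the migration to move applications onto OTel as well, so that the next vendor switch does not repeat the work.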

Datadog → Grafana Stack: breaking a $50K/month bill down into self-hosted observability

Tue, 19 May 2026 00:00:00 +0000

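This is a cross-vendor migration playbook that cross-links Datadog (source) and Grafana Stack (target). Compared with the previous three migrations (Splunk → Elastic phased / Redis → DragonflyDB drop-in / PostgreSQL → Aurora hybrid), this one is a cost-driven multi-tool migration: rather than swapping one product, it splits an all-in-one SaaS into five dedicated OSS / cloud components.

Breaking down the $50K/month bill: see where the money goes, then decide how to migrate

For a mid-size SaaS (100-500 hosts, 5K-50K metric series, TB-scale logs per day), the monthly Datadog bill looks like this: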

Line item	Average unit price	Mid-size SaaS estimate / month
Infrastructure host	$15-23 / host	200 hosts × $20 = $4,000
APM host	$31 / host	100 hosts × $31 = $3,100
Custom metrics	$5 / 100 series	30K series × $0.05 = $1,500
Log ingest	$0.10 / GB ingested	50TB × $0.10 / GB = $5,000
Log retention (15-day)	$1.27 / million events	5G events × $1.27 / million = $6,350
Log indexing	$1.70 / million events	5G events × $1.70 / million = $8,500
Network	$5 / host	200 hosts × $5 = $1,000
RUM / Session	$1.50 / 1000 sessions	3M sessions × $1.50 / 1000 = $4,500
Synthetics	$5 / 10K test runs	50K tests = $25
Total	-	$34,000 / month (conservative estimate)
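The total can be re-derived with a short script, which also makes it easy to plug in your own volumes. The quantities below are the ones at which the table's row totals hold (30K series at $0.05 each, 5 billion log events, 3 million RUM sessions); the unit prices are the table's illustrative figures, not a quote of Datadog's price list.

```python
# (label, quantity in billing units, unit price in USD)
LINE_ITEMS = [
    ("Infrastructure host", 200, 20.00),      # hosts
    ("APM host", 100, 31.00),                 # hosts
    ("Custom metrics", 30_000, 0.05),         # series
    ("Log ingest", 50_000, 0.10),             # GB (50 TB)
    ("Log retention (15-day)", 5_000, 1.27),  # million events
    ("Log indexing", 5_000, 1.70),            # million events
    ("Network", 200, 5.00),                   # hosts
    ("RUM / Session", 3_000, 1.50),           # thousand sessions
    ("Synthetics", 5, 5.00),                  # blocks of 10K test runs
]


def monthly_bill(items):
    """Return {label: cost} with each line rounded to cents."""
    return {label: round(qty * price, 2) for label, qty, price in items}


bill = monthly_bill(LINE_ITEMS)
total = sum(bill.values())

for label, cost in bill.items():
    print(f"{label:<24} ${cost:>10,.2f}")
print(f"{'Total':<24} ${total:>10,.2f}")
```

The log lines (ingest, retention, indexing) together account for well over half of the total, which is why the log stream deserves the closest look before migrating.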

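Scaling up to a production footprint of 500 hosts / 100TB of logs puts the bill in the $80K-150K / month range. A Grafana stack (self-hosted on K8s + some Grafana Cloud services) at equivalent capacity usually costs $8K-30K / month, a 2.5-5x cost reduction.

But cost is not the only driver. Others:

Multi-cloud / hybrid: Datadog is centralized; Grafana can be deployed in a distributed fashion to meet data residency requirements
OpenTelemetry-first: the Grafana stack is OTel-native; Datadog still relies on a vendor-specific agent
Long-term retention: running 1-year retention on Loki with an S3 cold tier is 10-50x cheaper than Datadog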

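Five responsibilities, five components: not a one-for-one product swap

Datadog is an all-in-one SaaS: a single agent + a single UI covering 5 responsibilities. The Grafana stack hands those responsibilities to 5 dedicated components:

Responsibility	Handled in Datadog by	Grafana Stack equivalent
Metric	Datadog metric	Mimir (Prometheus-compatible long-term storage)
Log	Datadog Logs	Loki (label-indexed log)
Trace	Datadog APM	Tempo (trace-only object storage)
Dashboard	Datadog dashboard	Grafana
Agent / shipper	Datadog Agent	Alloy (OTel-based collector) + Grafana Agent / Promtail

The migration is five independent streams, not a single cutover. SREs need to let go of the "one agent covers everything" mental model.

Migration structure: each component phased on its own, staggered overall

Unlike the previous three migrations, which were linear, this one is 5 parallel migration streams plus cross-stream coordination: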

           Phase 0           Phase 1            Phase 2          Phase 3
           Audit             Deploy             Dual-ship        Cutover
Metric    [audit]──→        [deploy Mimir]──→ [dual-ship]──→  [cutover]
APM       [audit]──→        [deploy Tempo]──→ [dual-ship]──→  [cutover]
Log       [audit]──→        [deploy Loki]──→  [dual-ship]──→  [cutover]
Dashboard [audit]──→        [deploy Grafana]──→ [rebuild]──→   [cutover]
Alert     [audit]──→        [deploy Alertmgr]──→ [parallel]──→ [cutover]

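Each stream does its own dual-ship + cutover and does not need to be synchronized with the others. Usually Metric goes first (cardinality problems surface fastest), then Log, and finally APM (trace correlation depends most on dashboards / alerts).

Agent migration: Datadog Agent → OTel Collector / Alloy

The Datadog Agent is a vendor-specific binary; pull it out and replace it with the OpenTelemetry Collector / Grafana Alloy: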

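// alloy config (HCL-like syntax)
// The discovery, loki.write, and otelcol.exporter.otlp components are declared
// here because the other blocks reference them; the loki.internal and
// tempo.internal endpoints are placeholders.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "k8s_pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.internal/api/v1/push"
  }
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.production.receiver]
}

loki.write "production" {
  endpoint {
    url = "https://loki.internal/loki/api/v1/push"
  }
}

otelcol.receiver.otlp "default" {
  grpc {}
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.internal:4317"
  }
}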

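Running dual shippers during the migration is standard practice:

The Datadog Agent and Alloy run side by side (capacity doubles in the short term)
The same host ships to both backends while you observe consistency
Gradually disable the Datadog Agent's metric / log / APM submodules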

Production failure drills

Case 1: Cardinality blow-up, series count surges on the Mimir side

Symptom: 30K series on the Datadog side become 500K series once shipped to Mimir, and the Mimir ingester OOMs.

Root cause: Datadog internally applies automatic aggregation and low-cardinality enforcement to tags. Prometheus / Mimir count every unique label set as one series, so high-cardinality labels in application code (user_id / request_id) blow up directly.

Fix:

Audit phase: run topk(100, count by (__name__) ({__name__=~".+"})) to find high-cardinality metrics
Drop high-cardinality labels: relabel rules in Alloy / the OTel collector drop unbounded labels such as user_id
Switch to histogram buckets: high cardinality usually comes from label combinations; use fixed-bucket histograms instead
Move metrics to logs where appropriate: a request ID is trace context and should not be a metric label
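The effect of dropping an unbounded label can be simulated before touching the collector config. This is a toy model, assuming a series is identified by its full label set the way Prometheus / Mimir count it; the function names and sample data are made up for illustration.

```python
UNBOUNDED_LABELS = frozenset({"user_id", "request_id"})


def drop_labels(labels: dict[str, str], drop=UNBOUNDED_LABELS) -> dict[str, str]:
    """Return a copy of the label set without the unbounded labels."""
    return {k: v for k, v in labels.items() if k not in drop}


def count_series(samples: list[dict[str, str]]) -> int:
    """One series per unique label set (metric name included as __name__)."""
    return len({frozenset(s.items()) for s in samples})


# 1,000 requests from 1,000 different users on a single service.
samples = [
    {"__name__": "http_requests_total", "service": "checkout", "user_id": str(i)}
    for i in range(1000)
]

before = count_series(samples)
after = count_series([drop_labels(s) for s in samples])
print(before, after)  # 1000 1
```

Running the same count over an export of real label sets during the audit phase gives a concrete series estimate for Mimir before any data is shipped.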

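Case 2: Log volume cost estimate is off

Symptom: one month after deploying Loki, the S3 bill is 2x the estimate; both object storage and query GB-scanned exceed expectations.

Root cause: Datadog applies automatic sampling / aggregation to logs and bills on indexed events. Loki ingests raw logs in full into S3 cold storage and is billed by actual bytes. Raw log volume is 3-10x higher than indexed events.

Fix:

Ingest-side sampling: sample debug / info logs in Alloy / Promtail and ingest only warn / error in full
Log structure: JSON logs compress better than text logs, cutting Loki's S3 footprint by 50%
Retention tiers: hot for 7 days on S3 Standard / cold for 1 year on S3 Glacier, controlled by a retention budget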
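The ingest-side sampling rule can be prototyped to estimate how much volume it removes. In production this logic would live in the shipper's pipeline configuration rather than in Python; the sample rates below are hypothetical, and hashing the line makes the decision deterministic so that replays keep the same lines.

```python
import zlib

# Hypothetical per-level sample rates; levels not listed are kept in full.
SAMPLE_RATES = {"debug": 0.01, "info": 0.10}


def keep(level: str, line: str) -> bool:
    """Decide whether a log line is ingested."""
    rate = SAMPLE_RATES.get(level.lower(), 1.0)  # warn / error: keep everything
    if rate >= 1.0:
        return True
    # Stable hash of the line mapped to a bucket in [0, 10000).
    bucket = zlib.crc32(line.encode("utf-8")) % 10_000
    return bucket < rate * 10_000


lines = [("debug", f"cache miss key={i}") for i in range(10_000)]
kept = sum(keep(level, text) for level, text in lines)
print(f"kept {kept} of {len(lines)} debug lines")
```

Multiplying the kept fraction by the raw GB per level gives a first-order estimate of the Loki ingest volume, and from there the S3 bill.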

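Case 3: Datadog dashboards cannot be converted directly to Grafana

Symptom: the migration plan assumed "automatic dashboard conversion". In practice, running Datadog API export → Grafana import left 80% of dashboards with missing widgets or metrics that did not line up.

Root cause:

Datadog query syntax is not directly compatible with the PromQL used by Grafana / Mimir
Datadog widget types (top-list / hostmap) have no Grafana equivalent
Tag-based aggregation corresponds to Prometheus labels, but the syntax differs

Fix:

Accept the rebuild: production-grade dashboards must be rebuilt by hand; do not expect automatic conversion
Prioritize: rebuild the 30% that is SOC-facing / production-critical first and deprecate the rest
Add 4-6 weeks to the migration window: dashboard rebuild is an underestimated effort

Case 4: Alert routing logic changes, PagerDuty integration breaks

Symptom: after cutover, alerts do not reach PagerDuty and the SOC only notices half an hour later. The webhook on the alerting side is configured correctly, but the payload format differs from Datadog's and the PagerDuty rules filter it out.

Root cause:

The Datadog alert payload contains event_type=alert, which the PagerDuty integration uses for routing
Alertmanager's default payload has a different structure
The PagerDuty rules were written against the Datadog event schema, so Alertmanager events do not match

Fix:

Pre-cutover test: dry-run Alertmanager → PagerDuty and send a test alert to verify
PagerDuty Service: create a separate Grafana-source Service rather than sharing the Datadog Service
Alertmanager template: use a webhook with a custom JSON template so the payload is close to the Datadog structure

Case 5: SLO definitions and monitor types do not line up

Symptom: a Datadog SLO reports 99.9% availability; after moving to Grafana SLO + Mimir, the actual 9X% figures do not match. The SOC compares 5 SLOs on dashboards and 4 of them are off by 0.1-0.3%.

Root cause:

Datadog computes SLOs over the time window with internal queries; Grafana SLO uses formulas written in PromQL
Datadog handles missing data for success_rate differently from PromQL's defaults
Time bucket boundaries are handled differently

Fix:

Redefine the SLO in PromQL: do not try to "copy" it; redefine it and write the PromQL expression carefully
Accept ±0.1% drift: run production-critical SLOs dual-track for 1-2 months and tune the PromQL to an acceptable drift
SLO migration is not a subset of dashboard migration: treat it as its own stream and leave more time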
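How a missing-data policy alone can move an SLO figure is easy to show with a toy calculation. This sketch compares two hypothetical policies (count missing time buckets as good, or exclude them from the denominator); it does not claim that either one is what Datadog or Grafana SLO actually implements.

```python
from typing import Optional


def availability(buckets: list[Optional[bool]], missing_as_good: bool) -> float:
    """buckets: True = good, False = bad, None = no data in that time bucket."""
    good = sum(1 for b in buckets if b is True)
    bad = sum(1 for b in buckets if b is False)
    missing = sum(1 for b in buckets if b is None)
    if missing_as_good:
        good += missing  # policy A: missing buckets count as good
    total = good + bad   # policy B: missing buckets are left out entirely
    return good / total if total else 1.0


# 1,000 time buckets: 948 good, 2 bad, 50 with no data.
buckets = [True] * 948 + [False] * 2 + [None] * 50

a = availability(buckets, missing_as_good=True)   # 998 / 1000
b = availability(buckets, missing_as_good=False)  # 948 / 950
print(f"{a:.4%} vs {b:.4%}, drift {a - b:.4%}")
```

With more missing buckets or a tighter target the gap widens, which is one reason to compare both systems on the same raw data during the dual-track period before blaming the PromQL.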

Capacity / cost comparison

Dimension	Datadog	Grafana Stack (self-hosted on K8s)
Setup cost	Low (SaaS)	Medium-high (K8s deploy + storage backend)
Operational cost (200 hosts)	$34K / month	$8-12K / month (incl. S3 + K8s)
Operational cost (500 hosts)	$80-150K / month	$15-30K / month
Operational FTE	0.1-0.3	1-2 FTE (K8s + storage + Grafana operator)
Long-term retention	$1.27 / million events for 15+ days	S3 + Loki: ~$0.02 / GB / month
Multi-cloud / hybrid	Limited by Datadog regions	Deploy anywhere
Vendor lock-in	High	Low (OSS + OTel)
Time to value	1-2 weeks	4-8 weeks
Migration cost (one-time)	-	1-3 FTE × 3 months

Break-even point: at roughly 150 hosts, self-hosted becomes cheaper once amortized over 3 years; below 100 hosts, SaaS has the better ROI.

Integration / next steps

Alignment with OpenTelemetry

The migration is an opportunity for an OTel-first transformation:

Application code uses the OTel SDK, avoiding Datadog SDK lock-in
Trace context propagation uses W3C Trace Context
Future backend changes need no further application changes

Comparison with Splunk → Elastic

Both are cost-driven SaaS migrations, but the details differ:

Splunk → Elastic is in the SIEM space, where schema translation is the core issue
Datadog → Grafana is a multi-tool split, where agent + dashboard rebuild is the core issue
Common pattern: dual-ship → parallel run → cutover

Reverse migration (Grafana Stack → Datadog)

It exists but is rare. The main motive is reducing operational complexity (not wanting to self-manage Mimir / Loki); schema mapping runs in the opposite direction and the agent goes back to the Datadog Agent.

Topics for follow-up

Grafana Cloud hybrid: some components (Tempo) on Grafana Cloud SaaS, the rest self-hosted, in a mixed architecture
OpenTelemetry Collector vs Alloy: both are OTel-based; Alloy is Grafana's own distribution
Vector vs Alloy vs Fluentd: the log shipper battleground, comparing cost / features / degree of OTel integration
