Datadog on Tarragon

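Mapping Event Types to Commercial Solutions

Fri, 19 Jun 2026 00:00:00 +0000

Each commercial monitoring product has its own event taxonomy. Understanding how each one classifies events, and how those classes correspond to the four event categories (event / error / metric / lifecycle), is what lets you map a self-hosted setup's events correctly during integration and avoid dropped or misclassified data.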

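Sentry

Sentry's core concept is error tracking, though it has expanded into performance monitoring and session replay.

Four categories	Sentry equivalent	Notes
Event	Breadcrumb	User actions are recorded in the breadcrumb trail and attached to errors
Error	Event (Exception type)	Sentry's core. Automatic capture plus manual captureException
Metric	Transaction + Span	The unit of measurement in performance monitoring
Lifecycle	Breadcrumb (navigation)	App lifecycle is recorded as navigation/system breadcrumbs

Sentry's design assumes that the error is the protagonist and every other event is context for it. Events and lifecycle both attach to error reports as breadcrumbs, so there is little ability to view them on their own. By default only the most recent 100 breadcrumbs are kept and they cannot be queried independently: they are an attachment to the error report, not a standalone event store. The Transaction + Span that metrics map to, by contrast, have their own Performance page, a separate UI entry point from errors. If the primary need is behavior analytics rather than error tracking, Sentry's breadcrumb model may fall short.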

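Firebase Crashlytics + Analytics

Firebase splits error tracking and behavior analytics into two separate products.

Four categories	Firebase equivalent	Notes
Event	Analytics custom event	A GA4 event, with parameters carrying extra attributes
Error	Crashlytics exception	Fatal and non-fatal exceptions are handled separately
Metric	Analytics event + parameters	Numeric values go in event parameters (no native metric type)
Lifecycle	Analytics auto events	screen_view, app_open, and similar events are collected automatically

Firebase's defining trait is that Crashlytics and Analytics run independently: error data lives in the Crashlytics console, behavior data in the Analytics console. There is no native metric support, so numeric values have to be recorded in the parameters of an Analytics event (for example event: 'page_load', parameters: {duration_ms: 320}) and aggregated by hand in the BigQuery export at query time. Correlating the two consoles is manual: set a user ID as a Crashlytics custom key, then look up behavior in Analytics with the same ID.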

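Datadog RUM

Datadog Real User Monitoring approaches client-side monitoring from a full-stack APM perspective.

Four categories	Datadog RUM equivalent	Notes
Event	Action	User interactions (click, tap, scroll), captured automatically or manually
Error	Error	JS exceptions, network errors, custom errors
Metric	Long Task + custom	Long tasks are captured automatically; custom metrics use the global context
Lifecycle	View	Entering and leaving a page or screen, with automatic detection of SPA route changes

Datadog RUM's defining trait is deep integration with backend APM. A client-side action can be correlated with a server-side trace, forming a complete chain from button click to database query. Self-hosted setups usually cannot achieve cross-layer correlation at this depth.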

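Integration strategy

Mapping principles when integrating a commercial solution:

Self-hosted event names are the source of truth. Vendor event names are a mapping of the self-hosted names, not a replacement. Keep the mapping logic in a single adapter layer so that switching vendors only means changing the adapter.

Do not reshape the self-hosted taxonomy to fit a vendor. That Sentry records events as breadcrumbs does not mean the self-hosted setup should demote events to an appendage of errors. The self-hosted four-category taxonomy is semantically correct; a vendor's taxonomy reflects its own product design.

Deduplicate when integrating several vendors at once. Sending errors to both Sentry and Crashlytics produces duplicates. Control which category goes to which vendor in the adapter layer so the same event does not appear on multiple dashboards.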
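
The three principles above can be sketched as a small adapter layer. This is a minimal illustration under stated assumptions, not a definitive implementation: the `Category` enum, the `ROUTES` table, the sink names, and the vendor-side type names are all hypothetical, and real vendor SDK calls would take the place of the sink callables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class Category(Enum):
    EVENT = "event"
    ERROR = "error"
    METRIC = "metric"
    LIFECYCLE = "lifecycle"


@dataclass(frozen=True)
class SelfHostedEvent:
    name: str            # self-hosted name: the source of truth
    category: Category
    payload: dict


# Hypothetical routing table: each category goes to exactly one vendor,
# so the same event never shows up on two dashboards.
ROUTES: Dict[Category, Tuple[str, str]] = {
    Category.EVENT: ("datadog", "action"),
    Category.ERROR: ("sentry", "exception"),
    Category.METRIC: ("datadog", "long_task"),
    Category.LIFECYCLE: ("datadog", "view"),
}


class VendorAdapter:
    """Single place where self-hosted events are mapped to vendor events."""

    def __init__(self, sinks: Dict[str, Callable[[dict], None]]):
        self._sinks = sinks  # vendor name -> function that ships one event

    def dispatch(self, event: SelfHostedEvent) -> str:
        vendor, vendor_type = ROUTES[event.category]
        self._sinks[vendor]({
            **event.payload,
            "vendor_type": vendor_type,   # the vendor's own classification
            "source_name": event.name,    # the self-hosted name is preserved
        })
        return vendor
```

Swapping a vendor then means editing `ROUTES` and the sink wiring; the code that emits `SelfHostedEvent` objects does not change.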

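Next steps

Definition of the four categories → Full definition of the four event categories
In-depth vendor comparison → Module 6: Commercial Solution Comparison
Event naming conventions → Event naming conventions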

Datadog RUM

Fri, 19 Jun 2026 00:00:00 +0000

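Division of labor with Backend 04: this article covers Datadog's full-stack tracing, the four RUM event types, and session replay from the client-side RUM angle. Server-side APM platform governance (agent configuration, cost governance, OTel-compatible migration, migrating from New Relic or Grafana Stack) is covered in the Backend 04 Datadog vendor page.

Datadog Real User Monitoring (RUM) approaches client-side monitoring from a full-stack APM perspective. Its core feature is that user actions on the client can be correlated with traces on the server, forming a complete request chain from button click to database query.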

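Full-stack tracing

The Datadog RUM SDK injects trace context headers into outgoing HTTP requests. The server-side Datadog APM agent reads those headers and links the server-side trace to the client-side action.

This is especially useful when debugging a "slow API" problem: the client shows that a button took 3 seconds to respond, and drilling in reveals a server-side trace in which a database query accounts for 2.8 seconds. Self-hosted setups typically cannot reach this depth of cross-layer correlation, and Sentry's tracing, while it can link frontend and backend spans, is not backed by a full APM and infrastructure platform.

The prerequisite is that the server side also uses Datadog APM. If the server uses a different APM (New Relic, Elastic APM), the client-server correlation has to be implemented by hand or bridged with OpenTelemetry.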

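Four RUM event types

Datadog RUM collects four event types, each corresponding to one of the self-hosted four categories (see the Module 1 commercial solution mapping):

View: a page or screen being loaded and left. SPA route changes are detected automatically. Corresponds to lifecycle.

Action: a user interaction. Clicks, taps, and scrolls are captured automatically, and custom actions can be recorded manually. Corresponds to event.

Error: JS exceptions, network errors, and custom errors. Corresponds to error.

Long Task: a task that runs longer than 50 ms and blocks the main thread. Corresponds to metric.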

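Pricing

Datadog RUM is billed by session count (one session is one user visit). This differs from Sentry's per-event billing: session billing makes cost more predictable, because a single visit that triggers a flood of events does not make the bill spike.

The full Datadog suite (RUM + APM + Logs + Infrastructure) is relatively expensive and fits teams that already use Datadog for server-side monitoring. Using RUM alone while the server runs another solution gives up the full-stack tracing advantage.

Full-stack tracing is Datadog RUM's main differentiator, but if only behavior analytics is needed rather than APM, Mixpanel / Amplitude are lighter options. The positioning difference from Sentry is that Sentry focuses on error tracking while Datadog focuses on full-stack correlation. The self-hosted vs commercial decision table compares the options systematically along user scale and feature requirements.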

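Module 6: Commercial Solution Comparison

Fri, 19 Jun 2026 00:00:00 +0000

Answers the question "when should you switch from self-hosted to a commercial solution?"

Chapters to be written

Self-hosted vs commercial decision table (user count / network scope / feature requirements / compliance requirements)
Sentry in depth (architecture of error + performance + session replay)
Firebase suite (integration of Crashlytics + Analytics + Remote Config)
Datadog RUM (the client-side view of full-stack APM)
Mixpanel / Amplitude (dedicated behavior analytics vs general-purpose monitoring)
Deployment spectrum (four paths: BaaS + Serverless / PaaS / fully self-hosted / commercial SaaS)

Cross-category references

→ monitoring Module 8, commercial use: the core selling point of commercial solutions is behavior analytics
→ backend 04 observability: comparison of server-side commercial solutions (Datadog / New Relic)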

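Datadog Cost Governance and Agent Configuration

Mon, 22 Jun 2026 00:00:00 +0000

This is a Datadog vendor deep article that expands the cost and Agent sections of the overview. Readers new to Datadog should start with the Datadog service page.

Positioning

Datadog is a fully managed observability platform covering metrics, logs, traces, profiling, RUM, and synthetic monitoring. The core trade-off of a managed offering is zero operations in exchange for cost proportional to usage: the more you use, the more you pay, and there are many billing dimensions (host, custom metric, log ingestion, span, indexed span). Cost governance requires understanding the pricing model of each one.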

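Pricing model overview

Datadog's main billing dimensions:

Dimension	Billing unit	Common source of runaway cost
Infrastructure host	Per host per month	Auto-scaling makes host counts fluctuate
Custom metrics	Per unique time series per month	Label explosion (the cardinality problem)
Log ingestion	Per GB ingested per month	Debug log level left on
Log indexed retention	Per million events, scaled by retention days, per month	Default retention too long
APM host + indexed span	Per host per month + per million spans	No sampling configured, everything collected
Profiling	Per host per month (APM add-on)	Stacks on top of overall cost

Most runaway Datadog bills trace back to custom metrics and log ingestion. Both scale directly with cardinality and log volume, and both can grow fast.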

Custom metrics cost control

What counts as a custom metric

Datadog counts each unique combination of metric name and tags as one time series. http_requests_total{service=checkout, method=GET, status=200} and http_requests_total{service=checkout, method=POST, status=500} are two time series.

The Cartesian product of tag values determines the series count. 5 services × 4 methods × 5 statuses = 100 series. Adding a region tag (3 values) makes 300. Adding an endpoint tag (50 normalized paths) makes 15,000.
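
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the upper bound (every tag combination actually occurring); the `series_count` helper is a hypothetical name, not a Datadog API.

```python
from math import prod


def series_count(tag_cardinalities: dict) -> int:
    """Upper bound on time series for one metric name:
    the product of the number of distinct values per tag."""
    return prod(tag_cardinalities.values())


tags = {"service": 5, "method": 4, "status": 5}
print(series_count(tags))   # 100

tags["region"] = 3
print(series_count(tags))   # 300

tags["endpoint"] = 50
print(series_count(tags))   # 15000
```

Each added tag multiplies the count rather than adding to it, which is why a single unbounded tag is enough to dominate the bill.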

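Control strategies

Tag allowlist: the same logic as a Prometheus label allowlist. Keep only tags with query value, such as service, method, and status_class (2xx/4xx/5xx). Remove user_id, request_id, and full URLs.

Metrics without Limits: a Datadog feature that filters tags after ingestion and before query. All tags are collected, but only selected tags are indexed and billed. It fits the "collect everything, query only some dimensions" scenario.

DogStatsD aggregation: the DogStatsD server in the Datadog Agent pre-aggregates at the Agent layer, collapsing per-request metrics from clients into per-interval summaries. This reduces the number of data points sent to Datadog. DogStatsD aggregation runs on the Agent, so it is a pre-aggregation mechanism at a different location from recording rules at the TSDB layer.

Usage attribution: Datadog's Usage Attribution feature breaks custom metric cost down by service / team tag so that teams can see their own metric cost. This corresponds to 4.15 cost attribution.

Signals to watch

The Metric Summary page in the Datadog UI shows the tag cardinality of each metric name. Review the top 20 highest-cardinality metrics on a regular schedule (monthly) and check for unexpected tag explosions.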

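Log ingestion cost control

Index strategy

Datadog log billing has two layers: ingestion (billed on arrival) and indexing (billed by retention days once indexed). You can ingest every log but index only some of them. Non-indexed logs are visible in the 15-minute Live Tail window and are gone after that, unless they are archived to S3/GCS for rehydration.

A workable tiering:

Error / warning logs: index, 30-day retention
Info logs (critical path): index, 7-day retention
Debug logs: ingest only, do not index (for Live Tail); or do not send at all
Access logs (high volume): do not index, archive to S3, rehydrate when needed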
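
The tiering above can be written down as a small policy function, which is handy for reviewing or unit-testing the rules before encoding them in Datadog index settings. This is a sketch only: `LogPolicy` and `policy_for` are hypothetical names, and treating info logs outside the critical path as not indexed is an assumption, since the list does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogPolicy:
    index: bool
    retention_days: Optional[int]  # None when the log is not indexed
    archive: bool                  # archive to object storage for rehydration


def policy_for(level: str, critical_path: bool = False,
               access_log: bool = False) -> LogPolicy:
    level = level.lower()
    if access_log:
        # High-volume access logs: skip the index, keep an archive.
        return LogPolicy(index=False, retention_days=None, archive=True)
    if level in ("error", "warning", "warn"):
        return LogPolicy(index=True, retention_days=30, archive=False)
    if level == "info" and critical_path:
        return LogPolicy(index=True, retention_days=7, archive=False)
    # Debug logs, and (by assumption) non-critical info logs: ingest only.
    return LogPolicy(index=False, retention_days=None, archive=False)
```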

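Exclusion filter

Datadog's index exclusion filters let logs matching a pattern enter the ingestion pipeline but skip the index. Example: health-check access logs (path:/health) arrive hundreds of times per second but have no debugging value, so an exclusion filter keeps them from consuming index quota.

How the log pipeline relates to Datadog logs

The collector side of the 4.11 telemetry pipeline can filter logs before they are sent to Datadog: low-value logs are dropped outright and never reach Datadog ingestion, which saves even the ingestion fee. This saves more than a Datadog exclusion filter, because excluded logs are still billed for ingestion.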

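Agent deployment configuration

Agent deployment modes

Mode	Where it runs	When to use
Host agent	One agent per VM	Traditional VM deployments
DaemonSet agent	One agent per K8s node	Standard K8s deployment
Sidecar agent	One agent per pod	When strict isolation is required
Cluster agent	One per K8s cluster	Collecting cluster-level metrics

Most K8s deployments combine DaemonSet agents with a Cluster Agent. The DaemonSet agent collects node-level and pod-level metrics, logs, and traces; the Cluster Agent collects cluster-level metadata and events.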

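Reading Agent health

The Agent itself needs to be monitored. When an Agent fails, what Datadog shows is "the data disappeared", not "the Agent is down".

Signals to watch (shipped with the Agent):

datadog.agent.running: whether the Agent process is alive
datadog.agent.check_run: whether each integration check is running normally
datadog.dogstatsd.packets.dropped: packets dropped when the DogStatsD buffer is full

When an Agent goes down, dashboards show a gap in the data. If every host has a gap at the same time, the problem is on the Datadog backend; if only specific hosts have a gap, the problem is the Agent on those hosts.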
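
The gap rule above is simple enough to encode as a first-pass triage helper. A minimal sketch, assuming you already have a per-host flag saying whether data arrived during the window; `locate_gap` is a hypothetical name and the returned strings are illustrative.

```python
def locate_gap(hosts_reporting: dict) -> str:
    """hosts_reporting maps host name -> True if data arrived in the window."""
    silent = sorted(host for host, ok in hosts_reporting.items() if not ok)
    if not silent:
        return "healthy"
    if len(silent) == len(hosts_reporting):
        # Every host went quiet at once: look at the backend first.
        return "all hosts silent: suspect Datadog backend"
    # Only some hosts went quiet: look at the Agent on those hosts.
    return "agent issue on: " + ", ".join(silent)
```

For example, `locate_gap({"web-1": True, "web-2": False})` points at the Agent on `web-2`.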

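Common Agent failures

CPU / memory over-consumption: the Agent runs too many integration checks, or DogStatsD receives too many custom metrics. Fix: reduce the number of checks, tune the DogStatsD aggregation interval, or upgrade the Agent (newer versions are usually more resource-efficient).

Log collection lag: the Agent's log tailing falls behind and log arrival latency at Datadog grows. The usual cause is a mismatch between log rotation settings and the Agent's tail settings, or a sudden log burst that exceeds the Agent's processing capacity.

Network connectivity: network problems between the Agent and the Datadog intake endpoint. The Agent buffers data and retries, but once the buffer is full (100 MB by default) it drops data. In unstable network environments (edge locations, restricted networks), enlarge the buffer or configure a proxy.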

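Integration with OTel

Datadog supports OpenTelemetry: you can use OTel SDK instrumentation plus the OTel Collector and send the data to the Datadog backend. This decouples instrumentation from the vendor, but it gives up some Datadog-native features (for example, Watchdog anomaly detection needs metadata from the Datadog Agent).

The choice of integration mode is covered by the case analysis in 4.C7 Datadog OTel migration practice. The main challenges are cost and semantic alignment during the dual-track period.

Next steps

Datadog service page: overview and day-to-day operations
4.7 cardinality: the complete cardinality governance strategy
4.15 cost attribution: organizational governance of cost attribution
4.C7 Datadog OTel migration: a case study of Datadog and OTel integration
OpenTelemetry: vendor-neutral instrumentation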

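Datadog OTLP Ingestion and OTel Integration

Tue, 23 Jun 2026 00:00:00 +0000

This is a Datadog vendor deep article that expands the "OTLP ingestion" section of the overview. Readers new to Datadog should start with the Datadog service page.

Problem context

Two situations lead a team to need Datadog's OTLP ingestion:

The team already uses Datadog APM, but wants new services or new languages to use the OTel SDK to avoid vendor lock-in. The Datadog SDK covers a limited set of languages (Go / Java / Python / Ruby / Node / .NET / PHP / C++); for services written in Rust, Elixir, or Kotlin Multiplatform, the OTel SDK has wider coverage.

The other situation is a team that has been running OTel + Jaeger or OTel + Grafana and now wants to move visualization to Datadog without re-instrumenting. OTLP ingestion lets the traces / metrics / logs produced by the OTel SDK flow straight into Datadog with no application code changes.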

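Core concepts

The Datadog Agent's OTLP receiver

Datadog Agent 6.32+ / 7.32+ ships with a built-in OTLP receiver that accepts both gRPC (port 4317) and HTTP (port 4318). After receiving OTLP data, the Agent converts it to Datadog's internal format and sends it through the same pipeline as the Datadog SDK (sampling, tagging, forwarding to the Datadog backend).

This means data on the OTLP path is handled in the Datadog UI the same way as data on the Datadog SDK path: the same APM trace waterfall, the same service map, the same error tracking. The difference is metadata completeness (see feature parity below).

OTLP support across the three signals

Signal	OTLP support	Maps to in Datadog
Traces	Full (OTLP gRPC / HTTP)	APM traces, service map, error tracking
Metrics	Full (OTLP gRPC / HTTP)	Custom metrics (billed per metric)
Logs	Limited (OTLP logs supported from Agent 7.54+)	Datadog Logs (billed by ingestion volume)

OTLP support is most mature for traces, then metrics, with logs the newest. A common setup in mixed environments is to send traces and metrics over OTLP and collect logs through the Datadog Agent's native log collection (file tailing / container stdout).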

Datadog SDK vs OTel SDK feature parity

Feature	Datadog SDK	OTel SDK → Datadog
Distributed tracing	Yes	Yes (full)
Continuous profiling	Yes	No (Datadog proprietary)
ASM (Application Security)	Yes	No (requires the Datadog library)
CI Visibility	Yes	No
Dynamic instrumentation	Yes	No
Runtime metrics (GC, threads)	Automatic	Requires manually configured OTel metric instrumentation
Log correlation (trace_id injected into logs)	Automatic	Requires manual configuration (MDC / context propagation)
Unified service tagging	Automatic (`DD_SERVICE` / `DD_ENV` / `DD_VERSION`)	Requires resource attribute mapping

Reading the table: if the team needs profiling / ASM / CI Visibility, the affected services still need the Datadog SDK. Other services can use the OTel SDK with OTLP ingestion, and both can coexist in the same Datadog org.

Configuration step by step

Datadog Agent OTLP settings

# datadog.yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

After restarting the Agent, run datadog-agent status to confirm that the OTLP receiver has started.

OTel SDK endpoint configuration

# Environment variables (language-agnostic)
export OTEL_EXPORTER_OTLP_ENDPOINT="http://datadog-agent:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="checkout-api"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.version=1.2.3"

Resource attribute → Datadog tag mapping

The Datadog Agent automatically converts OTel resource attributes into Datadog tags:

OTel resource attribute	Datadog tag	Notes
`service.name`	`service`	The core of Datadog unified service tagging
`deployment.environment`	`env`	Required; without it the environment filter in the Datadog UI does not work
`service.version`	`version`	Used for deployment tracking
`host.name`	`host`	Usually set by the Agent; no need to set manually
`container.name`	`container_name`	Set automatically in K8s environments

If the resource attributes do not include deployment.environment, Datadog files the trace under env:none, where it is close to invisible in the APM UI. This is the most common OTLP onboarding problem.
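
A pre-deploy check can catch the env:none problem before traces are sent. The sketch below mirrors the mapping table above; it is not the Agent's actual implementation, and `to_datadog_tags` and `missing_required` are hypothetical helper names.

```python
# Mapping copied from the table above.
ATTRIBUTE_TO_TAG = {
    "service.name": "service",
    "deployment.environment": "env",
    "service.version": "version",
    "host.name": "host",
    "container.name": "container_name",
}


def to_datadog_tags(resource_attributes: dict) -> dict:
    """Approximate the tags a resource would end up with in Datadog."""
    tags = {
        ATTRIBUTE_TO_TAG[key]: value
        for key, value in resource_attributes.items()
        if key in ATTRIBUTE_TO_TAG
    }
    # Without deployment.environment the trace lands under env:none.
    tags.setdefault("env", "none")
    return tags


def missing_required(resource_attributes: dict) -> list:
    """Attributes that should be set before shipping to Datadog."""
    required = ("service.name", "deployment.environment")
    return [key for key in required if not resource_attributes.get(key)]
```

Running `missing_required` against the parsed value of `OTEL_RESOURCE_ATTRIBUTES` in CI is one way to fail fast on an incomplete configuration.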

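OTel Collector → Datadog (alternative path)

If you do not want applications to connect directly to the Datadog Agent, you can put an OTel Collector in between:

# otel-collector-config.yaml
receivers:
  otlp:                      # receiver referenced by the pipeline below
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]

The OTel Collector's datadog exporter sends data straight to the Datadog backend, bypassing the Agent. It fits teams that already run OTel Collector infrastructure and do not want a Datadog Agent on every node.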

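Failures and boundaries

Resource attribute mapping misalignment

OTel service.name values often use dot notation (for example com.example.checkout), while Datadog service names conventionally use hyphens (for example checkout-api). If the naming is inconsistent, the same service shows up as several nodes on the Datadog APM service map (one from the OTel path, one from the Datadog SDK path).

Fix: standardize service.name. If both SDKs coexist, set the OTel SDK resource attribute to exactly the same value as the Datadog SDK's DD_SERVICE.

Metric naming convention differences

OTel metrics use dot notation (http.server.request.duration), while Datadog defaults to underscores (http_server_request_duration). The Agent converts automatically (dot → underscore), but if the team has both Datadog SDK metrics and OTel SDK metrics, the two can end up duplicated in Datadog (same semantics, different names).

Fix: use the OTel Collector's metricstransform processor to unify names before export, or merge them in Datadog with a metric alias.

Limits of log correlation on the OTLP path

The Datadog SDK automatically injects dd.trace_id and dd.span_id into application logs (for example Python logging, Java MDC). The OTel SDK does not; log correlation has to be configured manually by injecting the trace_id from the OTel context into the logging framework.

When log correlation is missing, the trace → log jump in Datadog stops working. The fix depends on the language: Java uses MDC with the OTel Java agent's log context instrumentation; Python uses opentelemetry-instrumentation-logging; Go requires manually reading the trace ID from the span context and writing it to a log field.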
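
For the manual case, the core of the work is converting OTel's hex IDs into the fields Datadog correlates on and attaching them to every log record. The sketch below uses only the standard library and takes the current IDs from a caller-supplied function; in a real service that function would read the active OTel span context. It assumes Datadog matches on the lower 64 bits of the trace ID rendered as a decimal string, which should be checked against the current Datadog documentation; `to_datadog_ids` and `TraceContextFilter` are hypothetical names.

```python
import logging


def to_datadog_ids(trace_id_hex: str, span_id_hex: str) -> dict:
    """Convert OTel hex IDs to decimal strings.

    Assumption: correlation uses the lower 64 bits of the 128-bit trace ID.
    """
    return {
        "dd.trace_id": str(int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF),
        "dd.span_id": str(int(span_id_hex, 16)),
    }


class TraceContextFilter(logging.Filter):
    """Attach dd_trace_id / dd_span_id attributes to every log record."""

    def __init__(self, get_ids):
        super().__init__()
        # get_ids() returns (trace_id_hex, span_id_hex), or None outside a span.
        self._get_ids = get_ids

    def filter(self, record: logging.LogRecord) -> bool:
        ids = self._get_ids()
        if ids:
            fields = to_datadog_ids(*ids)
        else:
            fields = {"dd.trace_id": "0", "dd.span_id": "0"}
        record.dd_trace_id = fields["dd.trace_id"]
        record.dd_span_id = fields["dd.span_id"]
        return True  # never drop the record, only enrich it
```

A log formatter can then reference `%(dd_trace_id)s` and `%(dd_span_id)s`, or a JSON formatter can emit them under the `dd.trace_id` / `dd.span_id` keys.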

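Capacity and cost

Billing on the OTLP path is the same as on the Datadog SDK path:

Signal	Billing unit	OTLP vs Datadog SDK
APM traces	Per ingested span	Same
Metrics	Per custom metric (unique metric name × tag combination)	Same
Logs	Per ingested GB	Same

The cost difference is not in ingestion pricing but in feature access. Using the OTel SDK gives up Profiling / ASM / CI Visibility, which require the Datadog SDK. If the team needs those features, going OTLP means deploying the Datadog SDK on core services anyway, and the maintenance cost of two SDKs can exceed that of using the Datadog SDK everywhere.

Rule of thumb: if more than 80% of services do not need Profiling / ASM, OTLP for most services plus the Datadog SDK for a few is a reasonable hybrid. If all core services need Profiling, using the Datadog SDK everywhere is simpler.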

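Integration and next steps

Datadog service page: overview and day-to-day operations
Datadog cost governance: Agent configuration and cost control
4.C7 Datadog OTel migration: a governance case study of moving from the Datadog SDK to an OTel-compatible mode
OpenTelemetry Collector deployment modes: the OTel Collector → Datadog alternative path
← New Relic migration: the bridging role OTLP plays in a New Relic → Datadog migration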

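New Relic → Datadog: APM schema mapping + agent replacement + dashboard rebuild

Tue, 19 May 2026 00:00:00 +0000

This is a cross-vendor migration playbook that cross-links New Relic and Datadog. Running the six-dimension audit from migration-playbook-methodology yields Schema = High (NRQL ↔ Datadog query, different APM agents), which maps to a Type A phased translation.

Problem context

A mid-sized SaaS has run New Relic for 3-5 years, its production observability is saturated, and the team has found several problems: cost has ballooned (per-host APM + custom events + synthetics), APM traces are not granular enough for Kubernetes-native workloads, and the PagerDuty / Slack integrations exist but with high latency. Over the same period Datadog has integrated K8s monitoring and APM deeply, and its cost model is more predictable at the 100-500 host scale.

During evaluation it becomes clear that New Relic → Datadog is not "just swap the agent". The APM schema, the NRQL query language, custom dashboards, and synthetic monitoring rules all have to be re-mapped, and the agent on the application side has to be replaced with a completely different binary. This is a Type A migration with a large schema gap, not a drop-in.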

Why migrate: three drivers (cost / K8s-native / vendor consolidation)

Driver	Trigger
Cost	New Relic per-host pricing + custom events + synthetics add up to a blowout; Datadog is cheaper when one K8s host runs many containers
K8s-native	The Datadog agent goes deeper on K8s sidecar / DaemonSet / autodiscovery
Vendor consolidation	Datadog is already used for logs / metrics; unifying APM on one vendor lowers tool-switching cost

Reverse drivers (Datadog → New Relic):

New Relic still leads on its bundled full-stack observability package (APM + browser + mobile + synthetics)
Organizations heavily invested in NRQL and New Relic University training do not switch

Schema mapping

New Relic concept	Datadog equivalent
APM agent (NR Java / Python / Node)	Datadog agent + APM tracer library
NRQL query	Datadog query (Metric / Log / Trace)
Synthetic monitor	Datadog Synthetic Tests
Custom event	Datadog custom metric / log event
NRQL alert condition	Datadog monitor
New Relic dashboard	Datadog dashboard (needs rebuild)
Apdex score	Datadog APM `apm.service.errors` + `apm.service.latency`
Distributed trace	Datadog APM trace (OpenTelemetry-compatible)

Phase 0: Audit + classify

List every application and its NR agent version
List every NRQL alert / dashboard / synthetic monitor
Estimate monthly cost and compare with Datadog

Phase 1: Schema mapping + Datadog setup

Request the Datadog organization / set up IAM integration
VPC peering / private link (if using a self-hosted agent)

Phase 2: Translation pipeline (3-tier)

Tier 1: Datadog-side import tool (API-based NRQL → Datadog query conversion, covers ~40-60%)
Tier 2: LLM-assisted (remaining queries / dashboards)
Tier 3: manual (synthetics / complex correlation)

Phase 3: Parallel run (dual-agent, 4-8 weeks)

Both agents run on the same application, emitting metrics / traces / logs to both sides, while the SOC compares detection coverage, alerts, and dashboards for consistency.

Phase 4: Cutover + cleanup

Switch the agent on the application side
Downgrade / cancel the New Relic license
Decommission timeline of 3-6 months (to keep historical queries available)

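Production failure drills

Case 1: NRQL does not map directly to a Datadog query

Symptom: the NRQL SELECT count(*) FROM Transaction FACET name WHERE duration > 5 SINCE 1 hour ago has to be split on the Datadog side into a metric query + filter + group by. The translation is semantically equivalent but the syntax is completely different, and the learning curve for SOC analysts is steep.

Fix:

Translation script + LLM assistance, keeping a runbook table of the literal NRQL alongside the Datadog query
SOC training, 1-2 weeks hands-on
Turn some queries into Datadog dashboard widgets instead of direct queries

Case 2: Synthetic monitor mapping fails

Symptom: NR Synthetics runs 100+ ping / browser / API tests. After switching to Datadog Synthetics, the "Browser Test" that step-based monitors map to turns out to be complex to configure, and setup effort is 2-3 times the estimate.

Fix:

Run sample synthetics before cutover to estimate the real setup cost
Migrate critical synthetics first and evaluate the rest for retirement
Automate with the Datadog API + Terraform instead of building by hand in the UI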

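Case 3: The cost model inverts

Symptom: in the first month after cutover the Datadog bill is 30% higher than NR. The breakdown shows three items over estimate: log retention, custom metric series, and log indexing.

Fix:

Pre-migration Datadog cost estimates must include log indexing pricing (billed per indexed event), not just ingest
Scrub PII and sample debug logs on the application side to reduce ingested GB
Control custom metric cardinality (tag combinations inflate the series count)

Case 4: Automatic dashboard conversion fails, 80% rebuilt by hand

Symptom: running NR dashboards through the Datadog import tool leaves 80% of widgets missing or wrongly mapped. The team estimated 2 weeks for the dashboard rebuild; it took 6-8 weeks.

Fix:

Accept the rebuild: production dashboards must be rebuilt by hand; do not expect automatic conversion
Prioritize: rebuild the SOC-critical 30% first and deprecate the rest
Add 4-6 weeks to the migration window: dashboard rebuild is consistently underestimated

Case 5: Cross-platform metric naming differences

Symptom: the NR metric Apdex/Apdex has no equivalent in Datadog, metric names hard-coded in application code stop working, and every alert query against NR-specific metrics fails.

Fix:

Before cutover, list every NR-specific metric and change application code to OpenTelemetry-style metric names
Rebuild the Datadog queries using application-level metric names rather than vendor-specific ones
Long term: name metrics with OpenTelemetry semantic conventions to avoid vendor lock-in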

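Capacity / cost

Dimension	New Relic	Datadog
Pricing model	Per-host + custom event / synthetics	Per-host APM + log indexing + custom metric
K8s-friendly	Medium; autodiscovery exists but is complex to configure	High; K8s-native autodiscovery is first-class
Migration cost	-	2-4 FTE × 2-3 months
Operational FTE	0.3-0.6	0.3-0.6 (comparable)

Integration / next steps

Relation to the Datadog → Grafana Stack migration

Two follow-on paths once on Datadog:

Switch to Datadog and keep using it (stable for multiple years)
Switch to Datadog and later move to Grafana Stack to cut cost (multi-tool split, Type D)

Most organizations have already spent 2-3 months on the first NR → Datadog round and will not switch again right away; expect at least 1-2 years of stability.

Aligning with OpenTelemetry

Use the migration to move applications to OTel as well, so the next vendor switch does not repeat the same work.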

Datadog → Grafana Stack: breaking a $50K/month bill down into self-hosted observability

Tue, 19 May 2026 00:00:00 +0000

This is a cross-vendor migration playbook that cross-links Datadog (source) and Grafana Stack (target). Compared with the previous three migrations (Splunk → Elastic phased / Redis → DragonflyDB drop-in / PostgreSQL → Aurora hybrid), this one is a cost-driven multi-tool migration: it does not replace one product with another, it splits a one-stop SaaS into five dedicated OSS / cloud components.

Breaking down the $50K/month bill: see where the money goes first, then decide how to migrate

The monthly Datadog bill for a mid-sized SaaS (100-500 hosts, 5K-50K metric series, TB-level logs per day) looks like this:

Line item	Average unit price	Mid-sized SaaS estimate / month
Infrastructure host	$15-23 / host	200 hosts × $20 = $4,000
APM host	$31 / host	100 hosts × $31 = $3,100
Custom metrics	$0.05 / series ($5 / 100 series)	30K series × $0.05 = $1,500
Log ingest	$0.10 / GB ingested	50 TB × $0.10 = $5,000
Log retention (15-day)	$1.27 / million events	5G events × $1.27 = $6,350
Log indexing	$1.70 / million events	5G events × $1.70 = $8,500
Network	$5 / host	200 × $5 = $1,000
RUM / Session	$1.50 / 1,000 sessions	3M sessions × $1.50 = $4,500
Synthetics	$5 / 10K test runs	50K tests = $25
Total	-	$34,000 / month (conservative estimate)

Scaled up to a 500-host / 100 TB log production environment, the range is $80K-150K / month. A Grafana stack (self-hosted on K8s plus some Grafana Cloud services) at equivalent capacity usually runs $8K-30K / month, a 2.5-5x cost reduction.
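
The bill is easier to audit when the line items are recomputed from unit prices. The sketch below does that using quantities chosen so that each product reproduces the dollar figure in the table; in particular it assumes 5 billion log events and 3 million RUM sessions, and a custom metric price of $0.05 per series. The names `LINE_ITEMS` and `monthly_bill` are hypothetical.

```python
# (line item, quantity in billing units, unit price in USD)
LINE_ITEMS = [
    ("infrastructure_host", 200, 20.00),
    ("apm_host", 100, 31.00),
    ("custom_metric_series", 30_000, 0.05),
    ("log_ingest_gb", 50_000, 0.10),                # 50 TB
    ("log_retention_million_events", 5_000, 1.27),  # assumes 5 billion events
    ("log_indexing_million_events", 5_000, 1.70),   # assumes 5 billion events
    ("network_host", 200, 5.00),
    ("rum_thousand_sessions", 3_000, 1.50),         # assumes 3 million sessions
    ("synthetics_10k_runs", 5, 5.00),               # 50K test runs
]


def monthly_bill(items):
    """Return {line item: monthly cost in USD}, rounded to cents."""
    return {name: round(qty * price, 2) for name, qty, price in items}


bill = monthly_bill(LINE_ITEMS)
total = round(sum(bill.values()), 2)

# Print the largest line items first to see where the money goes.
for name, cost in sorted(bill.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} ${cost:>10,.2f}")
print(f"{'total':32s} ${total:>10,.2f}")
```

Under these assumptions the three log line items account for more than half of the total, which is why log volume is the first thing to examine when planning the migration.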

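Cost is not the only driver, though. Others:

Multi-cloud / hybrid: Datadog is centralized, while Grafana can be deployed in a distributed way to meet data residency requirements
OpenTelemetry-first: the Grafana stack is OTel-native, while Datadog still relies on a vendor-specific agent
Long-term retention: running 1-year retention on Loki with an S3 cold tier is 10-50x cheaper than Datadog

Five responsibilities, five components: this is not replacing one product

Datadog is a one-stop SaaS: a single agent and a single UI cover five responsibilities. The Grafana stack hands those responsibilities to five dedicated components:

Responsibility	Handled in Datadog by	Grafana Stack equivalent
Metric	Datadog metric	Mimir (Prometheus-compatible long-term storage)
Log	Datadog Logs	Loki (label-indexed logs)
Trace	Datadog APM	Tempo (trace-only object storage)
Dashboard	Datadog dashboard	Grafana
Agent / shipper	Datadog Agent	Alloy (OTel-based collector) + Grafana Agent / Promtail

The migration is five independent streams, not a single cutover. SREs have to unlearn the "one agent covers everything" mental model.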

Migration structure: each component is phased on its own, the whole is staggered

Unlike the previous three migrations, which were linear, this one is five parallel migration streams plus cross-stream coordination:

          Phase 0           Phase 1            Phase 2          Phase 3
          Audit             Deploy             Dual-ship        Cutover
Metric    [audit]──→        [deploy Mimir]──→ [dual-ship]──→  [cutover]
APM       [audit]──→        [deploy Tempo]──→ [dual-ship]──→  [cutover]
Log       [audit]──→        [deploy Loki]──→  [dual-ship]──→  [cutover]
Dashboard [audit]──→        [deploy Grafana]──→ [rebuild]──→   [cutover]
Alert     [audit]──→        [deploy Alertmgr]──→ [parallel]──→ [cutover]

Each stream does its own dual-ship and cutover and does not need to be synchronized with the others. Typically Metric migrates first (cardinality problems surface fastest there), then Log, and APM last (trace correlation depends most on dashboards and alerts).

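Agent migration: Datadog Agent → OTel Collector / Alloy

The Datadog Agent is a vendor-specific binary. Pull it out and replace it with the OpenTelemetry Collector or Grafana Alloy:

// Alloy config (HCL-like syntax)
// Discovery component referenced by the scrape and log sources below.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "k8s_pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.internal/api/v1/push"
  }
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.production.receiver]
}

// Assumed Loki endpoint; replace with your own.
loki.write "production" {
  endpoint {
    url = "https://loki.internal/loki/api/v1/push"
  }
}

otelcol.receiver.otlp "default" {
  grpc {}
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// Assumed Tempo OTLP endpoint; replace with your own.
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.internal:4317"
  }
}

Running dual shippers during the migration is standard practice:

Datadog Agent and Alloy run side by side (capacity doubles in the short term)
The same host ships to both backends so consistency can be observed
Gradually disable the Datadog Agent's metric / log / APM submodules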

Production failure drills

Case 1: Cardinality explodes and series count surges on Mimir

Symptom: 30K series on the Datadog side become 500K after shipping to Mimir, and the Mimir indexer runs out of memory.

Root cause: Datadog internally applies automatic aggregation and low-cardinality enforcement to tags. Prometheus / Mimir count every unique label set as one series, so high-cardinality labels in application code (user_id / request_id) blow up directly.

Fix:

During the audit phase, run topk(100, count by (__name__) ({__name__=~".+"})) to find high-cardinality metrics
Drop high-cardinality labels: use relabel rules in Alloy / OTel Collector to drop unbounded labels such as user_id
Switch to histogram buckets: high cardinality usually comes from label combinations, so use fixed-bucket histograms instead
Move some metrics to logs where appropriate: a request ID is trace context and should not be a metric label
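
The effect of dropping unbounded labels can be estimated offline before touching relabel rules. The sketch below counts unique label sets the way a Prometheus-style backend would, with and without the offending labels; `UNBOUNDED_LABELS`, `drop_unbounded`, and `count_series` are hypothetical names and the sample data is synthetic.

```python
UNBOUNDED_LABELS = {"user_id", "request_id"}


def drop_unbounded(labels: dict, unbounded=UNBOUNDED_LABELS) -> dict:
    """Remove labels whose value space is unbounded."""
    return {k: v for k, v in labels.items() if k not in unbounded}


def count_series(samples, relabel=lambda labels: labels) -> int:
    """Each unique label set is one series."""
    return len({frozenset(relabel(labels).items()) for labels in samples})


# Synthetic sample: 1,000 requests, 2 methods, a unique request_id each.
samples = [
    {"service": "checkout", "method": "GET" if i % 2 else "POST",
     "request_id": f"req-{i}"}
    for i in range(1000)
]

print(count_series(samples))                  # 1000 series before relabeling
print(count_series(samples, drop_unbounded))  # 2 series after dropping request_id
```

The same comparison run on a real label dump from the audit phase gives a rough idea of how much series count a relabel rule would actually save.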

Case 2: Log volume cost estimate is off

Symptom: one month after deploying Loki the S3 bill is 2x the estimate; both object storage and query GB scanned exceed expectations.

Root cause: Datadog applies automatic sampling / aggregation to logs and bills on indexed events. Loki ingests the full raw volume into S3 cold storage and is billed on actual bytes. Raw log volume is 3-10x higher than indexed events.

Fix:

Ingest-side sampling: sample debug / info logs in Alloy / Promtail and ingest warn / error in full
Log structure: JSON logs compress better than text logs, reducing Loki S3 size by 50%
Retention tiers: hot for 7 days on S3 Standard, cold for 1 year on S3 Glacier, controlled by a retention budget
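
Ingest-side sampling is usually configured in the shipper itself, but the decision rule is easy to prototype. The sketch below keeps all warn / error logs and a deterministic fraction of everything else, hashing a caller-chosen key so the same key is always kept or always dropped. `keep_log`, the default 10% rate, and the choice of CRC32 are assumptions made for illustration.

```python
import zlib


def keep_log(level: str, sample_key: str, sample_rate: float = 0.1) -> bool:
    """Decide whether a log line should be ingested.

    warn / error and above are always kept; other levels are sampled
    deterministically on sample_key (for example a request or trace ID).
    """
    if level.lower() in ("warn", "warning", "error", "critical"):
        return True
    bucket = zlib.crc32(sample_key.encode("utf-8")) % 10_000
    return bucket < int(sample_rate * 10_000)
```

Sampling on a request or trace ID rather than at random keeps every line of a sampled request together, which makes the surviving debug logs far more useful.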

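Case 3: Datadog dashboards cannot be converted directly to Grafana

Symptom: the migration plan assumed "automatic dashboard conversion". In practice, Datadog API export → Grafana import left 80% of dashboards with missing widgets or mismatched metrics.

Root cause:

Datadog query syntax is not directly compatible with the PromQL used by Grafana / Mimir
Some Datadog widget types (top-list / hostmap) have no Grafana equivalent
Tag-based aggregation corresponds to Prometheus labels, but the syntax differs

Fix:

Accept the rebuild: production-grade dashboards must be rebuilt by hand; do not expect automatic conversion
Prioritize: rebuild the 30% that the SOC uses or that is production-critical first, and deprecate the rest
Add 4-6 weeks to the migration window: dashboard rebuild is consistently underestimated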

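Case 4: Alert routing logic changes and the PagerDuty integration breaks

Symptom: after cutover, alerts are not delivered to PagerDuty and the SOC only notices half an hour later. The webhook on the alert side is configured correctly, but the payload format differs from Datadog's and PagerDuty rules filter it out.

Root cause:

The Datadog alert payload contains event_type=alert, and the PagerDuty integration routes on it
The default Alertmanager payload has a different structure
PagerDuty rules were written against the Datadog event schema, so Alertmanager events do not match

Fix:

Pre-cutover test: dry-run Alertmanager → PagerDuty and send a test alert to verify
PagerDuty Service: create a separate Grafana-source Service instead of sharing the Datadog one
Alertmanager template: use a webhook with a custom JSON template so the payload is close to Datadog's structure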

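Case 5: SLO definitions do not line up with monitor types

Symptom: a Datadog SLO reports 99.9% availability, but after moving to Grafana SLO + Mimir the measured figure does not match. The SOC compares 5 SLOs on dashboards and 4 of them are off by 0.1-0.3%.

Root cause:

Datadog computes SLOs over a time window with an internal query; Grafana SLO uses a formula written in PromQL
Datadog handles missing data in success_rate differently from the PromQL default
Time bucket boundaries are handled differently

Fix:

Redefine the SLO in PromQL: do not try to "copy" it, redefine it, and write the PromQL expression with care
Accept ±0.1% drift: run production-critical SLOs on dual track for 1-2 months and tune the PromQL to an acceptable drift
SLO migration is not a subset of dashboard migration: treat it as its own stream and leave more time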
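
How much missing-data handling alone can move an SLO figure is easy to demonstrate. The sketch below computes a time-slice availability over the same buckets under two policies: counting empty buckets as good, or excluding them. It does not claim which policy Datadog or a given PromQL expression follows; `time_slice_availability` and the sample buckets are hypothetical.

```python
def time_slice_availability(buckets, threshold=0.999,
                            missing_counts_as_good=True) -> float:
    """buckets: list of (successful_requests, total_requests) per time slice.

    A slice is good when its success ratio meets the threshold.
    A slice with total == 0 has no data; the flag decides how it is treated.
    """
    good = counted = 0
    for ok, total in buckets:
        if total == 0:
            if missing_counts_as_good:
                good += 1
                counted += 1
            continue  # excluded entirely under the other policy
        counted += 1
        if ok / total >= threshold:
            good += 1
    return good / counted if counted else 1.0


# 8 healthy slices, 1 degraded slice, 1 slice with no traffic.
buckets = [(1000, 1000)] * 8 + [(900, 1000)] + [(0, 0)]

print(time_slice_availability(buckets, missing_counts_as_good=True))   # 0.9
print(time_slice_availability(buckets, missing_counts_as_good=False))  # 8/9
```

Same data, two defensible answers: this is why the fix is to redefine the SLO explicitly rather than expect the numbers to carry over.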

Capacity / cost comparison

Dimension	Datadog	Grafana Stack (self-hosted on K8s)
Setup cost	Low (SaaS)	Medium-high (K8s deploy + storage backend)
Operational cost (200 hosts)	$34K / month	$8-12K / month (including S3 + K8s)
Operational cost (500 hosts)	$80-150K / month	$15-30K / month
Operational FTE	0.1-0.3	1-2 FTE (K8s + storage + Grafana operator)
Long-term retention	$1.27 / million events for 15+ days	S3 + Loki: ~$0.02 / GB / month
Multi-cloud / hybrid	Limited by Datadog regions	Deploy anywhere
Vendor lock-in	High	Low (OSS + OTel)
Time to value	1-2 weeks	4-8 weeks
Migration cost (one-time)	-	1-3 FTE × 3 months

Break-even point: at roughly 150 hosts, self-hosted becomes cheaper once amortized over 3 years; below 100 hosts, SaaS has the better ROI.

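Integration / next steps

Aligning with OpenTelemetry

The migration is an opportunity for an OTel-first transition:

Application code uses the OTel SDK, avoiding Datadog SDK lock-in
Trace context propagation uses W3C Trace Context
Future backend changes do not require touching the application again

Comparison with Splunk → Elastic

Both are cost-driven SaaS migrations, but the details differ:

Splunk → Elastic is in the SIEM domain, where schema translation is the core issue
Datadog → Grafana is a multi-tool split, where rebuilding agents and dashboards is the core
Shared pattern: dual-ship → parallel run → cutover

Reverse migration (Grafana Stack → Datadog)

It exists but is rare. The main motive is reducing operational complexity (not wanting to run Mimir / Loki yourself). The schema mapping runs in the opposite direction and the agent goes back to the Datadog Agent.

Further topics

Grafana Cloud hybrid: some components (Tempo) on Grafana Cloud SaaS, others self-hosted, in a mixed architecture
OpenTelemetry Collector vs Alloy: both are OTel-based; Alloy is Grafana's own distribution of the OpenTelemetry Collector
Vector vs Alloy vs Fluentd: the log shipper landscape, compared on cost / features / degree of OTel integration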

Datadog Continuous Profiler

Fri, 15 May 2026 00:00:00 +0000

The core responsibility of Datadog Continuous Profiler is to connect production profiles to SaaS APM, deployment markers, service tags, and the release regression workflow. It suits teams already using Datadog APM / metrics / logs; the point is to line up slow requests, resource saturation, deploy versions, and profile diffs in a single operating interface.

Positioning

Datadog Continuous Profiler is the production profiling add-on to Datadog APM. It sits on the same plane as Datadog Logs / Metrics / Traces and shares their service tag, env tag, version tag, and query bar.

Compared with OSS profilers such as Pyroscope / Parca, Datadog Continuous Profiler takes the ecosystem-bundled route: the profiler is not billed separately, it goes into the business unit budget together with APM hosts, and profile data cross-links with trace_id, deploy markers, and log queries in the same interface. OSS profilers take the standalone deployment route: the profile store is self-managed (ClickHouse / object storage), and wiring to the other observability planes is up to you (Grafana correlation, hand-written trace_id mapping). The difference lies in query continuity across signals and in where the cost is attributed within the organization; the flame graph visualization itself is similar.

This positioning connects Datadog Continuous Profiler to 9.9 Performance Improvement Loop and 4.9 Continuous Profiling. Its value is lowering the handoff cost of profile diffs; its price is SaaS cost, agent configuration, data retention, and vendor constraints.

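Shortest path to a diagnosis

To judge whether a Datadog Continuous Profiler deployment is healthy, check at least four things:

Whether Agent / SDK profiling is actually enabled: a running Datadog Agent does not mean the profiler is on. Each language needs profiling_enabled=true in SDK init or the environment variable DD_PROFILING_ENABLED=true, and Go / Java / Python / Node / Ruby / .NET each differ in how profiling is enabled and which profile types (CPU / heap / goroutine / lock / wall time) are covered
Service / version / env tag discipline: a profile without service + env + version tags cannot be diffed, and release markers will not line up. CI should inject the git SHA or release tag into DD_VERSION, and the deploy pipeline should call the deployment marker API
Sampling rate and production coverage: by default profiles are collected in 60-second intervals, so low-traffic services or short-lived tasks may never have their hot path sampled. For ultra-low-latency or bursty workloads, evaluate whether sampling can still catch a regression signal
Profile ingestion cost / retention: profiles are billed per APM host, but profile event volume grows with the number of services and the sampling rate, and default retention is 7 days (custom retention is billed separately). Large deployments need service-level enable/disable governance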

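Where it fits

Release regression triage fits Datadog Continuous Profiler. When a canary or release candidate regresses on p99, CPU, memory, or cost per request, the team can use deployment markers to compare profiles before and after the release and find the call stacks that got wider.

APM-to-profile drilldown fits Datadog Continuous Profiler. A slow request can be traced down from service, endpoint, trace, or span to the profile, telling engineers whether the latency is DB, network, runtime, serialization, lock, or a CPU hot path.

Multi-language SaaS teams fit Datadog Continuous Profiler. If a team maintains Go, Java, Python, Ruby, Node.js, or .NET services at the same time, a SaaS profiler can manage them with one set of tags, dashboards, and permissions.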

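Selection criteria

Criterion	What Datadog offers	What you still need to provide
APM integration	Traces, services, endpoints, and profiles can be linked	Discipline around service tags and deploy labels
Deployment marker	Before/after profile diffs are easy to set up	Integration with the release pipeline and version tagging
SaaS operation	Low self-management cost, easy cross-team querying	Cost governance, data retention, and vendor constraints
Multi-language support	One operating interface across runtimes	Per-language agent overhead and coverage differences

The value of APM integration comes from continuity of context. Metrics tell you CPU went up, traces tell you an endpoint got slower, profiles tell you which code path got more expensive; Datadog's advantage is putting these signals into the same query and dashboard flow.

The value of deployment markers comes from release gates. If a profile diff can be aligned with commit, version, environment, and canary cohort, it can serve as evidence for the 6.13 Performance Regression Gate.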

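Core trade-off table

Trade-off	Datadog Continuous Profiler	Pyroscope	Parca
Deployment model	SaaS only, tied to Datadog Agent / APM	OSS self-hosted / Grafana Cloud SaaS	OSS self-hosted (Polar Signals SaaS optional)
Billing model	Billed with APM hosts (profiles are not metered separately)	OSS free / Grafana Cloud by ingestion	OSS free / SaaS per host
Profile collection	Language SDK (pull sampling)	SDK + eBPF agent	eBPF-first, language-agnostic
Trace correlation	Strong: trace_id links automatically to the flame graph	Medium: OTel trace_id has to be wired by hand	Weak: eBPF-profile oriented, shallower trace integration
Visuals / workflow	APM service view + Profile diff + Code Hotspot in IDE	Grafana flame graph + diff, same UI as Loki / Tempo	Clean Parca UI, oriented to plain profile exploration
Multi-language support	Official SDKs for Go / Java / Python / Node / Ruby / .NET / PHP	Same, plus community SDKs; eBPF covers native binaries	eBPF-only, language-independent but symbol resolution is harder
Vendor lock-in	High: profiles are tied to the APM workflow, and leaving means rebuilding dashboards	Low: OSS, relatively open profile format	Low: OSS, pprof-compatible format
Best fit	Datadog-heavy orgs already using APM / logs / metrics	Already on the Grafana stack, want to save on licenses	eBPF-first, low-overhead always-on

The core case for choosing Datadog Continuous Profiler: Datadog is already the observability backbone, APM trace ↔ profile drilldown must be a first-class workflow, SaaS billing is acceptable, and the SDK overhead trade-off is acceptable. If Datadog is not the existing platform, bringing it in purely for profiling is usually not cost-effective; go with Pyroscope / Parca instead.

The difference from one-off runtime profilers (pprof, async-profiler run by hand) is the time dimension. One-off profilers fit local investigation or investigation during an incident; continuous profilers fit baselines, release diffs, and long-term regression governance. The two are complementary, not mutually exclusive.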

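Advanced topics

APM trace ↔ profile correlation: the Datadog SDK injects the trace_id into profile sample labels, so every span in the APM trace view can jump straight to "the flame graph while this span was executing". For a trace with abnormal p99 latency, this means seeing not just span wait times but where CPU / lock / allocation time actually went during that span. It requires SDK version support and correctly connected trace context propagation; old SDKs or hand-written instrumentation easily break the chain.

Endpoint profiling: profiles are sliced by HTTP endpoint / RPC method rather than showing only the service-wide hot path. A newly added endpoint with little traffic can still be examined for its own CPU / allocation cost without being diluted by the service's main traffic. This is especially useful for detecting regressions in multi-tenant APIs, A/B test endpoints, and internal admin endpoints.

Code Hotspot in IDE: the Datadog IDE plugin (IntelliJ / VS Code) overlays hot lines from production profiles directly onto source code, so an engineer reviewing a PR can see that "this function accounts for 12% of service CPU in production". It lowers the cognitive cost of going from flame graph to the matching source line, and shortens the "production signal → code change" feedback loop in 9.9 Performance Improvement Loop.

Profile diff (baseline vs candidate): Datadog has a built-in diff view. Pick two time windows or two version tags and see directly which flame graph frames got wider or narrower. This is the core evidence for the 6.13 Performance Regression Gate: after a canary runs for 30 minutes, a baseline vs candidate diff report is pulled automatically, and promotion is blocked when it exceeds the threshold.

Notebooks correlation: Datadog Notebooks can place profile flame graphs, APM traces, metric charts, and log queries in a single document. Writing one notebook for an incident post-mortem or release review is more traceable than scattered dashboard tabs, and it also fits the evidence package convention.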
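
The gate logic behind the profile diff item above can be sketched independently of any vendor API. The example assumes you have already exported, for baseline and candidate, each frame's share of total samples; how that export is obtained from Datadog is outside this sketch, and `regressed_frames`, `gate`, and the 5-percentage-point threshold are hypothetical.

```python
def regressed_frames(baseline: dict, candidate: dict,
                     threshold: float = 0.05) -> dict:
    """Inputs map frame name -> share of total samples (0..1).

    Returns frames whose share grew by more than `threshold`,
    mapped to the size of the increase.
    """
    regressions = {}
    for frame, share in candidate.items():
        delta = share - baseline.get(frame, 0.0)
        if delta > threshold:
            regressions[frame] = round(delta, 4)
    return regressions


def gate(baseline: dict, candidate: dict, threshold: float = 0.05) -> bool:
    """True means the candidate may be promoted."""
    return not regressed_frames(baseline, candidate, threshold)


baseline = {"json.encode": 0.10, "db.query": 0.30}
candidate = {"json.encode": 0.22, "db.query": 0.31}

print(regressed_frames(baseline, candidate))  # {'json.encode': 0.12}
print(gate(baseline, candidate))              # False
```

Comparing shares rather than absolute sample counts keeps the gate insensitive to traffic differences between the baseline and canary windows.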

Troubleshooting and quick failure triage

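SDK overhead too high in production: the profiler's default overhead is < 2% CPU, but with wall-time profiling and allocation profiling fully enabled it can reach 5%+. Fix: measure on one canary host and enable each profile type separately rather than turning everything on at once.
Sampling rate too low / false negatives: a short-lived job (< 60s) or low-traffic service may never be sampled during its whole lifetime, so the hot path is invisible. Fix: switch to event-triggered profiling (on-demand profiling API) or raise the sampling rate for that service.
Profile has no version tag / cannot be diffed: the deploy pipeline does not inject DD_VERSION, so release markers do not line up. Fix: add the CI environment variable, let the dd-trace SDK read the git commit SHA automatically, and verify in staging that the diff view shows the version.
Trace ↔ profile drilldown link broken: the SDK version is too old, or trace context is not propagated through async code or queue handlers. Fix: upgrade the SDK, add the missing trace context propagation, and use a known slow trace to check that you can jump to the flame graph.
Profiling cost spike: a new service turned on profiling, or a service's profile events surged (an exception path being sampled repeatedly). Fix: check profiled host hours on the Datadog usage dashboard, temporarily disable profiling on the suspect service and watch the cost curve, then tune the sampling rate.
Flame graph symbol resolution fails / shows ? frames: debug symbols are missing, the binary is stripped, or the language runtime version is unsupported. Fix: keep symbols at build time and check the SDK version against the runtime version compatibility table.
Lock profile shows no contention: in some languages (Go / Java) lock profiling needs extra flags (DD_PROFILING_BLOCK_ENABLED / DD_PROFILING_LOCK_ENABLED). They are off by default and must be enabled explicitly before the lock contention flame graph appears.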
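For the missing version tag case in the list above, a pre-deploy check can catch the problem before profiles become impossible to diff. The sketch below assumes the pipeline exposes its deploy environment as a mapping. `DD_VERSION` comes from the text above, `DD_SERVICE` and `DD_ENV` are the companion unified service tagging variables, and `missing_profiler_tags` is a hypothetical helper name.

```python
# Environment variables expected on every profiled service.
REQUIRED_TAGS = ("DD_SERVICE", "DD_ENV", "DD_VERSION")


def missing_profiler_tags(env):
    """Return the required variables that are unset or blank in the given environment."""
    return [name for name in REQUIRED_TAGS if not env.get(name, "").strip()]


# A deploy environment where CI forgot to inject the version.
deploy_env = {"DD_SERVICE": "checkout-api", "DD_ENV": "staging", "DD_VERSION": ""}

problems = missing_profiler_tags(deploy_env)
```

In a pipeline, the same function could be called with `os.environ` and the deploy step failed when the returned list is non-empty.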

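Operating cost

The main cost of Datadog Continuous Profiler is data volume and retention. Profile samples, tag cardinality, the number of services, the number of environments, and retention all affect both the bill and the query experience.

Agent cost comes from runtime differences. Profiler support, overhead, observable dimensions, and limitations differ by language, so during rollout use a canary service to measure CPU, memory, latency, and profile completeness.

Vendor cost comes from data and workflow lock-in. Once profile diffs, release markers, APM drilldown, and the incident workflow all live in Datadog, a later platform switch means rebuilding the tag schema, dashboards, retention, and gate integration.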

Evidence Package

Datadog Continuous Profiler results should be written back to the evidence package. The minimum fields are service, version, environment, deploy marker, profile type, time range, comparison baseline, profile diff link, overhead estimate, known gap, and owner.

Field	Datadog evidence source
Source	profiler view, profile diff, APM link
Time range	baseline / candidate profile window
Query link	Datadog profile, trace, dashboard link
Data quality	service tag, version tag, sampling status
Confidence	production coverage, agent overhead
Known gap	runtime coverage, tag drift, retention limit

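The core purpose of the evidence package is to make release regressions traceable. A reviewer should be able to open the profile diff straight from a failed gate and see which service, version, endpoint, or call stack changed the resource cost.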
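The minimum fields listed in this section could be captured as a simple record with a completeness check. This is one possible shape, not a prescribed schema: the field names are snake_case renderings of the list above, all values are placeholders, and `incomplete_fields` is a hypothetical helper.

```python
from dataclasses import asdict, dataclass


@dataclass
class EvidencePackage:
    """Minimum profiler evidence for one release comparison (all fields as plain strings)."""
    service: str
    version: str
    environment: str
    deploy_marker: str
    profile_type: str
    time_range: str
    comparison_baseline: str
    profile_diff_link: str
    overhead_estimate: str
    known_gap: str
    owner: str


def incomplete_fields(package):
    """Names of fields left blank, in declaration order."""
    return [name for name, value in asdict(package).items() if not value.strip()]


package = EvidencePackage(
    service="checkout-api",
    version="a1b2c3",
    environment="production",
    deploy_marker="deploy-placeholder",
    profile_type="cpu",
    time_range="candidate canary window",
    comparison_baseline="previous stable version",
    profile_diff_link="https://example.invalid/profile-diff",  # placeholder URL
    overhead_estimate="measured on one canary host",
    known_gap="none recorded",
    owner="",  # left blank on purpose to show the check
)

gaps = incomplete_fields(package)
```

Requiring an explicit value even for "known gap" keeps a blank field from being read as "no gaps".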

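Case write-back

Datadog Continuous Profiler is well suited to writing back release regression and APM integration cases. It connects to the profile noise reduction in 9.C23 Netflix Aurora consolidation, the low-latency hot path localization in 9.C25 Tubi feature store, the z1d single-thread hot path analysis in 9.C3 Coinbase ultra-low latency exchange, the per-service profile diffs in 9.C7 Lyft 100+ microservices, and the observability pipeline integration in the Datadog OTel migration practice.

What matters in these cases is context alignment. When the Datadog Profiler page cites a case, it should translate the case into service tags, deploy markers, profile diffs, trace drilldowns, and release gate evidence. For example, under Coinbase's sub-ms target, profiles must be aligned with RAFT consensus and the placement group topology to explain why a hot path appears only in certain epochs.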

Next-step routing

Upstream: 9.9 Performance Improvement Loop
Upstream: 9.8 Performance Observability
Cross-module: 4.9 Continuous Profiling
Parallel: Pyroscope
Parallel: Parca
Official: Datadog Continuous Profiler documentation
