<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Datadog on Tarragon</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/</link><description>Recent content in Datadog on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 01 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/index.xml" rel="self" type="application/rss+xml"/><item><title>Datadog 成本治理與 Agent 配置</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/cost-governance-agent-config/</link><pubDate>Mon, 22 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/cost-governance-agent-config/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog&lt;/a> 的 vendor deep article，深化 overview 的成本跟 Agent 段。初次接觸 Datadog 的讀者建議先讀 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog 服務頁&lt;/a>。&lt;/p>&lt;/blockquote>
&lt;h2 id="定位">定位&lt;/h2>
&lt;p>Datadog 是全託管觀測平台，涵蓋 metrics、logs、traces、profiling、RUM、synthetic monitoring。託管方案的核心取捨是「零運維但成本跟用量成正比」— 用得越多付得越多，而且計價維度多（host、custom metric、log ingestion、span、indexed span），成本治理需要理解每個維度的計價模型。&lt;/p>
&lt;h2 id="計價模型概覽">計價模型概覽&lt;/h2>
&lt;p>Datadog 的主要計價維度：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>維度&lt;/th>
 &lt;th>計價方式&lt;/th>
 &lt;th>常見失控來源&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Infrastructure host&lt;/td>
 &lt;td>每 host/月&lt;/td>
 &lt;td>Auto-scaling 造成 host 數量波動&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Custom metrics&lt;/td>
 &lt;td>每 unique time series/月&lt;/td>
 &lt;td>Label 爆炸（同 cardinality 問題）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Log ingestion&lt;/td>
 &lt;td>每 GB ingested/月&lt;/td>
 &lt;td>Debug log level 忘記關&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Log indexed retention&lt;/td>
 &lt;td>每 million events × 天/月&lt;/td>
 &lt;td>預設 retention 太長&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>APM host + indexed span&lt;/td>
 &lt;td>每 host/月 + 每 million span&lt;/td>
 &lt;td>Sampling 沒設、全收&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Profiling&lt;/td>
 &lt;td>每 host/月（APM 加購）&lt;/td>
 &lt;td>整體成本疊加&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>多數 Datadog 成本失控的根因是 custom metrics 跟 log ingestion — 兩者跟 cardinality 跟 log volume 直接相關，成長可以很快。&lt;/p>
&lt;h2 id="custom-metrics-成本控制">Custom Metrics 成本控制&lt;/h2>
&lt;h3 id="什麼算-custom-metric">什麼算 custom metric&lt;/h3>
&lt;p>Datadog 把每個 unique 的 metric name + tag 組合算一個 time series。&lt;code>http_requests_total{service=checkout, method=GET, status=200}&lt;/code> 跟 &lt;code>http_requests_total{service=checkout, method=POST, status=500}&lt;/code> 是兩個 time series。&lt;/p>
&lt;p>Tag 的笛卡爾積決定 series 數量。5 個 service × 4 個 method × 5 個 status = 100 個 series。加一個 &lt;code>region&lt;/code> tag（3 個值）就變 300 個。加一個 &lt;code>endpoint&lt;/code> tag（50 個 normalized path）就變 15,000 個。&lt;/p>
&lt;h3 id="控制策略">控制策略&lt;/h3>
&lt;p>&lt;strong>Tag 白名單&lt;/strong>：跟 Prometheus 的 label 白名單邏輯相同。只保留有查詢價值的 tag — service、method、status_class（2xx/4xx/5xx）。移除 user_id、request_id、完整 URL。&lt;/p>
&lt;p>&lt;strong>Metrics without Limits&lt;/strong>：Datadog 的功能 — 在 ingestion 之後、query 之前過濾 tag。所有 tag 都收但只 index / 計費特定 tag。適合「收全量但只查部分維度」的場景。&lt;/p></description><content:encoded><![CDATA[<blockquote>
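<p>This article is a vendor deep dive for <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a>, expanding the cost and Agent sections of the overview. Readers new to Datadog should start with the <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog service page</a>.</p></blockquote>
<h2 id="定位">Positioning</h2>
<p>Datadog is a fully managed observability platform covering metrics, logs, traces, profiling, RUM, and synthetic monitoring. The core trade-off of a managed offering is "zero operations, but cost grows with usage": the more you send, the more you pay. Billing also spans many dimensions (host, custom metric, log ingestion, span, indexed span), so cost governance requires understanding the pricing model of each dimension.</p>
<h2 id="計價模型概覽">Pricing model overview</h2>
<p>Datadog's main billing dimensions:</p>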
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Billing basis</th>
          <th>Common source of runaway cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Infrastructure host</td>
          <td>Per host per month</td>
          <td>Auto-scaling makes the host count fluctuate</td>
      </tr>
      <tr>
          <td>Custom metrics</td>
          <td>Per unique time series per month</td>
          <td>Tag explosion (the cardinality problem)</td>
      </tr>
      <tr>
          <td>Log ingestion</td>
          <td>Per GB ingested per month</td>
          <td>Debug log level left enabled</td>
      </tr>
      <tr>
          <td>Log indexed retention</td>
          <td>Per million events, scaled by retention days, per month</td>
          <td>Default retention set too long</td>
      </tr>
      <tr>
          <td>APM host + indexed span</td>
          <td>Per host per month + per million spans</td>
          <td>No sampling configured; everything is kept</td>
      </tr>
      <tr>
          <td>Profiling</td>
          <td>Per host per month (APM add-on)</td>
          <td>Stacks on top of the overall bill</td>
      </tr>
  </tbody>
</table>
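<p>Most runaway Datadog bills trace back to custom metrics and log ingestion. The first scales with cardinality and the second with log volume, and both can grow quickly.</p>
<h2 id="custom-metrics-成本控制">Custom metrics cost control</h2>
<h3 id="什麼算-custom-metric">What counts as a custom metric</h3>
<p>Datadog counts each unique combination of metric name and tag values as one time series. <code>http_requests_total{service=checkout, method=GET, status=200}</code> and <code>http_requests_total{service=checkout, method=POST, status=500}</code> are two time series.</p>
<p>The Cartesian product of tag values sets the upper bound on the series count. 5 services × 4 methods × 5 statuses = 100 series. Adding a <code>region</code> tag (3 values) makes it 300. Adding an <code>endpoint</code> tag (50 normalized paths) makes it 15,000. Only combinations that actually report data are counted, so the real number can be lower than the product.</p>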
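<p>The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name <code>max_series</code> and the tag counts are illustrative, taken from the example in the text, and the result is an upper bound rather than a billed figure.</p>

```python
from math import prod


def max_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric name.

    The bound is the product of the number of distinct values per tag.
    """
    return prod(tag_cardinalities.values())


# Tag counts from the example in the text (illustrative values).
base = {"service": 5, "method": 4, "status": 5}
with_region = {**base, "region": 3}
with_endpoint = {**with_region, "endpoint": 50}

print(max_series(base))           # 100
print(max_series(with_region))    # 300
print(max_series(with_endpoint))  # 15000
```

<p>Running the same function before adding a new tag gives a quick estimate of how much a proposed tag multiplies the series count.</p>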
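<h3 id="控制策略">Control strategies</h3>
<p><strong>Tag allowlist</strong>: the same logic as a Prometheus label allowlist. Keep only tags with query value: service, method, status_class (2xx/4xx/5xx). Drop user_id, request_id, and full URLs.</p>
<p><strong>Metrics without Limits</strong>: a Datadog feature that filters tags after ingestion and before query. Every tag is still ingested, but only the tags you select are indexed and queryable, and billing follows the indexed series (ingested volume may be billed separately, so check your plan). It fits the "collect everything, query only some dimensions" scenario.</p>
<p><strong>DogStatsD aggregation</strong>: the DogStatsD server in the Datadog Agent pre-aggregates at the Agent layer, turning per-request metrics from clients into per-interval summaries. This reduces the number of data points sent to Datadog, but it does not by itself reduce the number of distinct time series. Because it runs on the Agent, it is a pre-aggregation mechanism at a different position from a <a href="/blog/backend/knowledge-cards/recording-rule/" data-link-title="Recording Rule" data-link-desc="說明把 query-time 聚合計算推到寫入時的 pre-aggregation 機制">recording rule</a> at the TSDB layer.</p>
<p><strong>Usage attribution</strong>: Datadog's <a href="https://docs.datadoghq.com/account_management/billing/usage_attribution/">Usage Attribution</a> feature breaks custom metric cost down by service / team tag so each team can see its own metric cost. See <a href="/blog/backend/04-observability/cost-attribution/" data-link-title="4.15 Cost Attribution / Chargeback" data-link-desc="把 observability 成本拆到團隊、產品、環境維度">4.15 cost attribution</a>.</p>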
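<p>The tag allowlist strategy can be sketched on the application side, before a metric is emitted. This is a hedged illustration, not a Datadog API: <code>ALLOWED_TAGS</code>, <code>status_class</code>, and <code>reduce_tags</code> are hypothetical names, and the allowlist simply mirrors the tags named in the text.</p>

```python
# Hypothetical allowlist mirroring the tags named in the text.
ALLOWED_TAGS = ("service", "method", "status_class")


def status_class(status: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 404 becomes '4xx'."""
    return f"{status // 100}xx"


def reduce_tags(tags: dict[str, str]) -> dict[str, str]:
    """Return only allowlisted tags, replacing the raw status with its class."""
    reduced = dict(tags)
    if "status" in reduced:
        reduced["status_class"] = status_class(int(reduced.pop("status")))
    return {key: value for key, value in reduced.items() if key in ALLOWED_TAGS}


raw = {
    "service": "checkout",
    "method": "GET",
    "status": "404",
    "user_id": "u-1",
    "request_id": "r-9",
}
print(reduce_tags(raw))
```

<p>Collapsing the raw status into a class and dropping per-user and per-request tags is what keeps the product of tag values small.</p>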
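<h3 id="判讀指標">Indicators to watch</h3>
<p>The Metric Summary page in the Datadog UI shows the tag cardinality of each metric name. Review the top 20 highest-cardinality metrics on a regular schedule (monthly) to catch unexpected tag explosions.</p>
<h2 id="log-ingestion-成本控制">Log ingestion cost control</h2>
<h3 id="index-策略">Index strategy</h3>
<p>Datadog log billing has two layers: ingestion (billed on arrival) and indexing (billed by retention days once indexed). You can ingest every log but index only part of them. Logs that are not indexed remain visible in Live Tail for about 15 minutes and then disappear, unless they are archived to S3/GCS and rehydrated later.</p>
<p>A workable tiering:</p>
<ul>
<li><strong>Error / warning logs</strong>: index, 30-day retention</li>
<li><strong>Info logs (critical path)</strong>: index, 7-day retention</li>
<li><strong>Debug logs</strong>: ingest only, no index (for Live Tail); or do not send them at all</li>
<li><strong>Access logs (high volume)</strong>: no index, archive to S3, rehydrate when needed</li>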
</ul>
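<p>The tiering above can be written down as a small policy function, which makes the rules reviewable and testable. This is a sketch under stated assumptions: <code>LogPolicy</code> and <code>policy_for</code> are hypothetical names, and treating non-critical info logs as unindexed is an assumption that the list above does not cover.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPolicy:
    index: bool
    retention_days: int  # 0 when the log is not indexed
    archive: bool


def policy_for(level: str, source: str = "app", critical_path: bool = False) -> LogPolicy:
    """Map a log to the tier described in the list above."""
    level = level.lower()
    if source == "access":
        # High-volume access logs: archive only, rehydrate on demand.
        return LogPolicy(index=False, retention_days=0, archive=True)
    if level in ("error", "warning"):
        return LogPolicy(index=True, retention_days=30, archive=False)
    if level == "info" and critical_path:
        return LogPolicy(index=True, retention_days=7, archive=False)
    # Debug logs, plus (by assumption) info logs off the critical path.
    return LogPolicy(index=False, retention_days=0, archive=False)


print(policy_for("error"))
print(policy_for("info", critical_path=True))
print(policy_for("info", source="access"))
```

<p>In practice the same rules would be expressed as Datadog index filters and archive settings; the function is only a way to agree on the tiers first.</p>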
<h3 id="exclusion-filter">Exclusion filter</h3>
<p>Datadog's index exclusion filters let logs matching a pattern go through the ingestion pipeline but skip the index. Example: health check access logs (<code>path:/health</code>) arrive at hundreds per second yet have no debugging value; an exclusion filter keeps them out of the index quota.</p>
<h3 id="log-pipeline-跟-datadog-log-的對應">How the log pipeline maps to Datadog logs</h3>
<p>The collector tier described in <a href="/blog/backend/04-observability/telemetry-pipeline/" data-link-title="4.11 Telemetry Pipeline 架構" data-link-desc="把 log / metric / trace 的 agent → collector → ingest → storage → query 分層治理">4.11 telemetry pipeline</a> can filter logs before they are sent to Datadog: low-value logs are dropped outright and never reach Datadog ingestion, so even the ingestion fee is avoided. This saves more than a Datadog exclusion filter, because excluded logs are still billed for ingestion.</p>
<h2 id="agent-部署配置">Agent deployment configuration</h2>
<h3 id="agent-部署模式">Agent deployment modes</h3>
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Where it runs</th>
          <th>When to use it</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Host agent</td>
          <td>One agent per VM</td>
          <td>Traditional VM deployments</td>
      </tr>
      <tr>
          <td>DaemonSet agent</td>
          <td>One agent per Kubernetes node</td>
          <td>Standard Kubernetes deployment</td>
      </tr>
      <tr>
          <td>Sidecar agent</td>
          <td>One agent per pod</td>
          <td>When strict isolation is required</td>
      </tr>
      <tr>
          <td>Cluster agent</td>
          <td>One per Kubernetes cluster</td>
          <td>Collecting cluster-level metrics</td>
      </tr>
  </tbody>
</table>
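<p>Most Kubernetes deployments combine the DaemonSet agent with the Cluster Agent. The DaemonSet agent collects node-level and pod-level metrics, logs, and traces; the Cluster Agent collects cluster-level metadata and events.</p>
<h3 id="agent-健康判讀">Reading Agent health</h3>
<p>The Agent itself needs monitoring: when an Agent fails, what Datadog sees is "data disappeared", not "the Agent is down".</p>
<p>Indicators (shipped with the Agent):</p>
<ul>
<li><code>datadog.agent.running</code>: whether the Agent process is alive</li>
<li><code>datadog.agent.check_run</code>: whether each integration check is running normally</li>
<li><code>datadog.dogstatsd.packets.dropped</code>: packets dropped when the DogStatsD buffer is full</li>
</ul>
<p>When an Agent dies, dashboards show a gap in the data. If every host goes silent at the same moment, suspect a shared cause first: the Datadog backend, or something all Agents depend on such as a proxy, the egress network path, or the API key. If only specific hosts show a gap, the problem is the Agent on those hosts.</p>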
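<p>The fleet-wide versus host-specific distinction can be expressed as a simple triage rule. The sketch below is illustrative only: <code>triage_gap</code> is a hypothetical helper, and the 90% threshold for "effectively the whole fleet" is an assumption, not a Datadog default.</p>

```python
def triage_gap(silent_hosts: set[str], all_hosts: set[str], fleet_ratio: float = 0.9) -> str:
    """Classify a data gap by how much of the fleet stopped reporting.

    fleet_ratio is an assumed threshold for treating the gap as fleet-wide.
    """
    if not all_hosts or not silent_hosts:
        return "no-gap"
    ratio = len(silent_hosts.intersection(all_hosts)) / len(all_hosts)
    if ratio >= fleet_ratio:
        # Backend, proxy, egress path, or API key: something shared.
        return "shared-cause"
    # Only some hosts are silent: inspect the Agent on those hosts.
    return "host-agent"


fleet = {"web-1", "web-2", "web-3", "web-4"}
print(triage_gap({"web-2"}, fleet))
print(triage_gap(fleet, fleet))
```

<p>The list of silent hosts would come from a monitor on Agent heartbeat data; how that list is produced is outside this sketch.</p>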
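<h3 id="常見-agent-故障">Common Agent failures</h3>
<p><strong>CPU / memory over-consumption</strong>: the Agent runs too many integration checks, or DogStatsD receives too many custom metrics. Fixes: reduce the number of checks, tune the DogStatsD aggregation interval, or upgrade the Agent (newer versions usually use fewer resources).</p>
<p><strong>Log collection delay</strong>: the Agent's log tailing falls behind, so logs reach Datadog later. The usual causes are a mismatch between log rotation settings and the Agent's tail settings, or a sudden spike in log volume beyond what the Agent can process.</p>
<p><strong>Network connectivity</strong>: network problems between the Agent and the Datadog intake endpoint. The Agent buffers data and retries, but once the buffer is full (100 MB by default) it drops data. In environments with unstable networks (edge locations, restricted networks), enlarge the buffer or configure a proxy.</p>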
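<p>How long a buffer survives an outage is simple arithmetic, and it is worth doing before choosing a buffer size. This sketch assumes the 100 MB figure quoted above and a hypothetical outbound rate; <code>seconds_until_drop</code> is an illustrative helper, not an Agent setting.</p>

```python
def seconds_until_drop(buffer_mb: float, outbound_kb_per_s: float) -> float:
    """Time an outage can last before a full buffer starts dropping data."""
    if not outbound_kb_per_s > 0:
        raise ValueError("outbound_kb_per_s must be positive")
    return buffer_mb * 1024 / outbound_kb_per_s


# Assumed numbers: the 100 MB buffer from the text, 512 KB/s of telemetry.
print(seconds_until_drop(100, 512))  # 200.0
```

<p>With these assumed numbers the buffer covers a little over three minutes, which shows why edge locations with longer outages need a larger buffer or a local relay.</p>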
<h2 id="跟-otel-的整合">Integration with OTel</h2>
<p>Datadog supports OpenTelemetry: you can instrument with the OTel SDK, run the OTel Collector, and send the data to the Datadog backend. This decouples instrumentation from the vendor, at the cost of some Datadog-native features (for example, Watchdog anomaly detection depends on metadata from the Datadog Agent).</p>
<p>The choice of integration mode lines up with the case analysis in <a href="/blog/backend/04-observability/cases/datadog-otel-migration-practice/" data-link-title="4.C7 Datadog：OTel 相容遷移實務" data-link-desc="APM 採集從專有代理轉向 OTel 相容模式的治理案例。">4.C7 Datadog OTel migration practice</a>: cost during the dual-run period and semantic alignment are the main challenges.</p>
<h2 id="下一步路由">Where to go next</h2>
<ul>
<li><a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog service page</a>: overview and day-to-day operations</li>
<li><a href="/blog/backend/04-observability/cardinality-cost-governance/" data-link-title="4.7 Cardinality 治理與成本邊界" data-link-desc="把 metric / log / trace 的 cardinality 與成本作為平台一級治理議題">4.7 cardinality</a>: the full cardinality governance strategy</li>
<li><a href="/blog/backend/04-observability/cost-attribution/" data-link-title="4.15 Cost Attribution / Chargeback" data-link-desc="把 observability 成本拆到團隊、產品、環境維度">4.15 cost attribution</a>: organizational governance of cost attribution</li>
<li><a href="/blog/backend/04-observability/cases/datadog-otel-migration-practice/" data-link-title="4.C7 Datadog：OTel 相容遷移實務" data-link-desc="APM 採集從專有代理轉向 OTel 相容模式的治理案例。">4.C7 Datadog OTel migration</a>: a case study of Datadog and OTel integration</li>
<li><a href="/blog/backend/04-observability/vendors/opentelemetry/" data-link-title="OpenTelemetry" data-link-desc="可觀測性開放標準、SDK 與 Collector">OpenTelemetry</a>: vendor-neutral instrumentation</li>
</ul>
]]></content:encoded></item><item><title>Datadog OTLP Ingestion 與 OTel 整合</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/otlp-ingestion-otel-integration/</link><pubDate>Tue, 23 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/otlp-ingestion-otel-integration/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog&lt;/a> 的 vendor deep article，深化 overview「OTLP ingestion」段。初次接觸 Datadog 的讀者建議先讀 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog 服務頁&lt;/a>。&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>兩種觸發情境會讓團隊需要 Datadog 的 OTLP ingestion：&lt;/p>
&lt;p>團隊已經使用 Datadog APM，但新服務或新語言想用 OTel SDK 避免 vendor lock-in。Datadog SDK 覆蓋的語言有限（Go / Java / Python / Ruby / Node / .NET / PHP / C++），如果服務用 Rust / Elixir / Kotlin multiplatform，OTel SDK 的覆蓋更廣。&lt;/p>
&lt;p>另一種情境是團隊原本用 OTel + Jaeger 或 OTel + Grafana，現在想把 visualization 遷到 Datadog 但不想重新 instrument。OTLP ingestion 讓 OTel SDK 產出的 traces / metrics / logs 直接送進 Datadog，不改 application code。&lt;/p>
&lt;h2 id="核心概念">核心概念&lt;/h2>
&lt;h3 id="datadog-agent-的-otlp-receiver">Datadog Agent 的 OTLP receiver&lt;/h3>
&lt;p>Datadog Agent 6.32+ 內建 OTLP receiver，接受 gRPC（port 4317）和 HTTP（port 4318）兩種 protocol。Agent 收到 OTLP 資料後轉換成 Datadog 內部格式，走跟 Datadog SDK 相同的 pipeline（sampling、tagging、forwarding to Datadog backend）。&lt;/p>
&lt;p>這代表 OTLP path 的資料在 Datadog UI 裡跟 Datadog SDK path 的資料一樣被處理 — 相同的 APM trace waterfall、相同的 service map、相同的 error tracking。差異在 metadata 完整度（見下方 feature parity）。&lt;/p>
&lt;h3 id="三種-signal-的-otlp-支援度">三種 signal 的 OTLP 支援度&lt;/h3>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Signal&lt;/th>
 &lt;th>OTLP 支援&lt;/th>
 &lt;th>到 Datadog 的對應&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Traces&lt;/td>
 &lt;td>完整（OTLP gRPC / HTTP）&lt;/td>
 &lt;td>APM traces、service map、error tracking&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Metrics&lt;/td>
 &lt;td>完整（OTLP gRPC / HTTP）&lt;/td>
 &lt;td>Custom metrics（按 metric 計費）&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Logs&lt;/td>
 &lt;td>有限（Agent 7.54+ 支援 OTLP logs）&lt;/td>
 &lt;td>Datadog Logs（按 ingestion volume 計費）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Traces 的 OTLP 支援最成熟、metrics 次之、logs 最新。混合環境常見做法是 traces + metrics 走 OTLP、logs 走 Datadog Agent 的原生 log collection（file tailing / container stdout）。&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a vendor deep dive for <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a>, expanding the "OTLP ingestion" section of the overview. Readers new to Datadog should start with the <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog service page</a>.</p></blockquote>
<h2 id="問題情境">Problem scenarios</h2>
<p>Two situations typically lead a team to need Datadog's OTLP ingestion:</p>
<p>The team already uses Datadog APM, but wants new services or new languages on the OTel SDK to avoid vendor lock-in. The Datadog SDK covers a limited set of languages (Go / Java / Python / Ruby / Node / .NET / PHP / C++); for services written in Rust, Elixir, or Kotlin Multiplatform, the OTel SDK has broader coverage.</p>
<p>The other situation is a team that already runs OTel + Jaeger or OTel + Grafana and now wants to move visualization to Datadog without re-instrumenting. OTLP ingestion lets the traces / metrics / logs produced by the OTel SDK flow straight into Datadog with no application code changes.</p>
<h2 id="核心概念">Core concepts</h2>
<h3 id="datadog-agent-的-otlp-receiver">The Datadog Agent's OTLP receiver</h3>
<p>Datadog Agent 6.32+ (and 7.32+ on the 7.x line) ships a built-in OTLP receiver that accepts both gRPC (port 4317) and HTTP (port 4318). After receiving OTLP data, the Agent converts it to Datadog's internal format and sends it through the same pipeline as Datadog SDK data (sampling, tagging, forwarding to the Datadog backend).</p>
<p>As a result, data from the OTLP path is handled in the Datadog UI the same way as data from the Datadog SDK path: the same APM trace waterfall, the same service map, the same error tracking. The difference is in metadata completeness (see feature parity below).</p>
<h3 id="三種-signal-的-otlp-支援度">OTLP support for the three signals</h3>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>OTLP support</th>
          <th>What it maps to in Datadog</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Traces</td>
          <td>Full (OTLP gRPC / HTTP)</td>
          <td>APM traces, service map, error tracking</td>
      </tr>
      <tr>
          <td>Metrics</td>
          <td>Full (OTLP gRPC / HTTP)</td>
          <td>Custom metrics (billed per custom metric)</td>
      </tr>
      <tr>
          <td>Logs</td>
          <td>Limited (OTLP logs supported from Agent 7.54+)</td>
          <td>Datadog Logs (billed by ingestion volume)</td>
      </tr>
  </tbody>
</table>
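<p>OTLP support is most mature for traces, followed by metrics, with logs the newest. A common setup in mixed environments sends traces and metrics over OTLP and keeps logs on the Datadog Agent's native log collection (file tailing / container stdout).</p>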
<h3 id="datadog-sdk-vs-otel-sdk-feature-parity">Datadog SDK vs OTel SDK feature parity</h3>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Datadog SDK</th>
          <th>OTel SDK → Datadog</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distributed tracing</td>
          <td>Yes</td>
          <td>Yes (full)</td>
      </tr>
      <tr>
          <td>Continuous profiling</td>
          <td>Yes</td>
          <td>No (Datadog proprietary)</td>
      </tr>
      <tr>
          <td>ASM (Application Security)</td>
          <td>Yes</td>
          <td>No (requires the Datadog library)</td>
      </tr>
      <tr>
          <td>CI Visibility</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Dynamic instrumentation</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Runtime metrics (GC, threads)</td>
          <td>Automatic</td>
          <td>Requires manually configured OTel metric instrumentation</td>
      </tr>
      <tr>
          <td>Log correlation (trace_id injected into logs)</td>
          <td>Automatic</td>
          <td>Requires manual configuration (MDC / context propagation)</td>
      </tr>
      <tr>
          <td>Unified service tagging</td>
          <td>Automatic (<code>DD_SERVICE</code> / <code>DD_ENV</code> / <code>DD_VERSION</code>)</td>
          <td>Requires resource attribute mapping</td>
      </tr>
  </tbody>
</table>
<p>How to read this: services that need profiling / ASM / CI Visibility still require the Datadog SDK. Other services can use the OTel SDK with OTLP ingestion, and both can coexist in the same Datadog org.</p>
<h2 id="配置-step-by-step">Configuration step by step</h2>
<h3 id="datadog-agent-otlp-設定">Datadog Agent OTLP settings</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># datadog.yaml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">otlp_config</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">receiver</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">protocols</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">grpc</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">endpoint</span><span class="p">:</span><span class="w"> </span><span class="m">0.0.0.0</span><span class="p">:</span><span class="m">4317</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">http</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">endpoint</span><span class="p">:</span><span class="w"> </span><span class="m">0.0.0.0</span><span class="p">:</span><span class="m">4318</span></span></span></code></pre></div><p>After restarting the Agent, run <code>datadog-agent status</code> to confirm that the OTLP receiver has started. Note that binding to <code>0.0.0.0</code> listens on every interface, so restrict access at the network level where needed.</p>
<h3 id="otel-sdk-endpoint-配置">OTel SDK endpoint configuration</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Environment variables (language-agnostic)</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">OTEL_EXPORTER_OTLP_ENDPOINT</span><span class="o">=</span><span class="s2">&#34;http://datadog-agent:4317&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">OTEL_EXPORTER_OTLP_PROTOCOL</span><span class="o">=</span><span class="s2">&#34;grpc&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">OTEL_SERVICE_NAME</span><span class="o">=</span><span class="s2">&#34;checkout-api&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">OTEL_RESOURCE_ATTRIBUTES</span><span class="o">=</span><span class="s2">&#34;deployment.environment=production,service.version=1.2.3&#34;</span></span></span></code></pre></div><h3 id="resource-attribute--datadog-tag-mapping">Resource attribute → Datadog tag mapping</h3>
<p>The Datadog Agent automatically converts OTel resource attributes into Datadog tags:</p>
<table>
  <thead>
      <tr>
          <th>OTel resource attribute</th>
          <th>Datadog tag</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>service.name</code></td>
          <td><code>service</code></td>
          <td>The core of Datadog unified service tagging</td>
      </tr>
      <tr>
          <td><code>deployment.environment</code></td>
          <td><code>env</code></td>
          <td>Required; without it, environment filtering in the Datadog UI does not work</td>
      </tr>
      <tr>
          <td><code>service.version</code></td>
          <td><code>version</code></td>
          <td>Used for deployment tracking</td>
      </tr>
      <tr>
          <td><code>host.name</code></td>
          <td><code>host</code></td>
          <td>Usually set by the Agent; no manual setting needed</td>
      </tr>
      <tr>
          <td><code>container.name</code></td>
          <td><code>container_name</code></td>
          <td>Set automatically in Kubernetes environments</td>
      </tr>
  </tbody>
</table>
<p>If the resource attributes do not set <code>deployment.environment</code>, Datadog files the traces under <code>env:none</code>, where they are nearly invisible in the APM UI. This is the most common OTLP onboarding problem.</p>
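<p>One way to catch a missing environment before deployment is to build the <code>OTEL_RESOURCE_ATTRIBUTES</code> value from a checked dictionary. This is a minimal sketch: <code>build_resource_attributes</code> and the <code>REQUIRED</code> list are hypothetical, and treating <code>service.version</code> as required is an assumption made here for deployment tracking.</p>

```python
# Attributes this sketch treats as mandatory (an assumption, not a Datadog rule).
REQUIRED = ("deployment.environment", "service.version")


def build_resource_attributes(attrs: dict[str, str]) -> str:
    """Render attrs as an OTEL_RESOURCE_ATTRIBUTES value, failing fast on gaps."""
    missing = [key for key in REQUIRED if not attrs.get(key)]
    if missing:
        raise ValueError("missing resource attributes: " + ", ".join(missing))
    return ",".join(f"{key}={value}" for key, value in attrs.items())


value = build_resource_attributes(
    {"deployment.environment": "production", "service.version": "1.2.3"}
)
print(value)
```

<p>Calling this in a deploy script or CI check turns the <code>env:none</code> problem into a build failure instead of a missing dashboard.</p>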
<h3 id="otel-collector--datadogalternative-path">OTel Collector → Datadog (alternative path)</h3>
<p>If you do not want applications to connect directly to the Datadog Agent, you can place an OTel Collector in between:</p>





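<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># otel-collector-config.yaml</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">receivers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">otlp</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">protocols</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">grpc</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">endpoint</span><span class="p">:</span><span class="w"> </span><span class="m">0.0.0.0</span><span class="p">:</span><span class="m">4317</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">http</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">endpoint</span><span class="p">:</span><span class="w"> </span><span class="m">0.0.0.0</span><span class="p">:</span><span class="m">4318</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">processors</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">batch</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">exporters</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">datadog</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">api</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">${DD_API_KEY}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">site</span><span class="p">:</span><span class="w"> </span><span class="l">datadoghq.com</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">service</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">pipelines</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">traces</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">receivers</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">otlp]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">processors</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">batch]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">exporters</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="l">datadog]</span></span></span></code></pre></div><p>The OTel Collector's <code>datadog</code> exporter sends data straight to the Datadog backend without going through the Agent. Every component named in a pipeline must also be declared at the top level, which is why <code>receivers</code> and <code>processors</code> appear alongside <code>exporters</code>. The exporter ships in the Collector contrib distribution rather than the core build. This path suits teams that already run OTel Collector infrastructure and do not want a Datadog Agent on every node.</p>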
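<h2 id="故障與邊界">Failure modes and boundaries</h2>
<h3 id="resource-attribute-mapping-不對齊">Resource attribute mapping misalignment</h3>
<p>OTel's <code>service.name</code> is a free-form string, and teams often give it a different style (for example a dotted name such as <code>com.example.checkout</code>) than the value used by services on the Datadog SDK (for example <code>checkout-api</code>). If the two values differ, the same service shows up as multiple nodes in the Datadog APM service map: one from the OTel path and one from the Datadog SDK path.</p>
<p>Fix: standardize the <code>service.name</code> naming. If both SDKs coexist, set the OTel SDK resource attribute to exactly the same value as the Datadog SDK's <code>DD_SERVICE</code>.</p>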
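<p>A mismatch between the two names is easy to detect mechanically if each deployable unit's environment is available as data. The sketch below assumes a hypothetical inventory dictionary keyed by unit name; <code>find_mismatches</code> is illustrative and not part of any Datadog or OTel tooling.</p>

```python
def find_mismatches(services: dict[str, dict[str, str]]) -> dict[str, tuple[str, str]]:
    """Return units whose DD_SERVICE and OTEL_SERVICE_NAME are both set but differ.

    services maps a deployable unit to its environment variables.
    """
    mismatches = {}
    for unit, env in services.items():
        dd_name = env.get("DD_SERVICE")
        otel_name = env.get("OTEL_SERVICE_NAME")
        if dd_name and otel_name and dd_name != otel_name:
            mismatches[unit] = (dd_name, otel_name)
    return mismatches


# Hypothetical inventory, e.g. exported from deployment manifests.
inventory = {
    "checkout": {"DD_SERVICE": "checkout-api", "OTEL_SERVICE_NAME": "com.example.checkout"},
    "cart": {"DD_SERVICE": "cart-api", "OTEL_SERVICE_NAME": "cart-api"},
    "search": {"OTEL_SERVICE_NAME": "search-api"},
}
print(find_mismatches(inventory))
```

<p>Units that set only one of the two variables are skipped, since they run a single SDK and cannot produce duplicate nodes by this route.</p>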
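<h3 id="metric-naming-convention-差異">Metric naming convention differences</h3>
<p>OTel metrics follow semantic-convention names such as <code>http.server.request.duration</code>, while metrics emitted by the Datadog SDK and integrations use Datadog's own names. Datadog metric names also allow dots, so do not assume a dot-to-underscore rewrite; confirm the resulting name in Metric Summary. When a team has metrics from both the Datadog SDK and the OTel SDK, the two can show up as duplicates in Datadog: the same meaning under different names.</p>
<p>Fix: use the OTel Collector's <code>metricstransform</code> processor to unify names before export, or merge them in Datadog with a metric alias.</p>
<h3 id="log-correlation-在-otlp-path-的限制">Log correlation limits on the OTLP path</h3>
<p>The Datadog SDK automatically injects <code>dd.trace_id</code> and <code>dd.span_id</code> into application logs (for example Python logging or Java MDC). The OTel SDK does not do this by default: log correlation has to be configured manually by injecting the <code>trace_id</code> from the OTel context into the logging framework.</p>
<p>Without log correlation, the trace → log jump in Datadog stops working. The fix depends on the language: Java uses MDC with the OTel Java agent's log context instrumentation; Python uses <code>opentelemetry-instrumentation-logging</code>; Go requires reading the trace ID from the span context by hand and writing it to a log field.</p>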
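<p>The manual injection can be illustrated with the Python standard library alone. This is a sketch, not the behavior of <code>opentelemetry-instrumentation-logging</code>: <code>TraceContextFilter</code> and <code>make_logger</code> are hypothetical names, the trace ID is supplied by a callable so the example runs without the OTel SDK installed, and the exact attribute name Datadog needs for correlation should be confirmed against its documentation.</p>

```python
import io
import logging
from typing import Callable, Optional


class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every log record as a hex string."""

    def __init__(self, get_trace_id: Callable[[], Optional[int]]):
        super().__init__()
        # With the OTel SDK installed, this callable could wrap
        # opentelemetry.trace.get_current_span().get_span_context().trace_id
        self._get_trace_id = get_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id = self._get_trace_id()
        # 32 hex characters; all zeros when there is no active trace.
        record.trace_id = format(trace_id, "032x") if trace_id else "0" * 32
        return True


def make_logger(name: str, get_trace_id: Callable[[], Optional[int]], stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceContextFilter(get_trace_id))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger("checkout-demo", lambda: 0xABC, buffer)
log.info("payment accepted")
print(buffer.getvalue(), end="")
```

<p>Putting the filter on the handler keeps application code unchanged: every record that passes through the handler carries the trace ID, whether or not the caller knows about tracing.</p>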
<h2 id="容量與成本">Capacity and cost</h2>
<p>The OTLP path is billed the same way as the Datadog SDK path:</p>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>Billing unit</th>
          <th>OTLP vs Datadog SDK</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>APM traces</td>
          <td>Per ingested span</td>
          <td>Same</td>
      </tr>
      <tr>
          <td>Metrics</td>
          <td>Per custom metric (unique metric name × tag combination)</td>
          <td>Same</td>
      </tr>
      <tr>
          <td>Logs</td>
          <td>Per ingested GB</td>
          <td>Same</td>
      </tr>
  </tbody>
</table>
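<p>The cost difference is not in ingestion pricing but in <strong>feature access</strong>. Using the OTel SDK means giving up Profiling / ASM / CI Visibility, which require the Datadog SDK. If the team needs those features, going OTLP means deploying the Datadog SDK on core services anyway, and the maintenance cost of running two SDKs may exceed that of using the Datadog SDK everywhere.</p>
<p>Rule of thumb: if &gt; 80% of services do not need Profiling / ASM, OTLP for most services plus the Datadog SDK for a few is a reasonable hybrid. If the core services all need Profiling, using the Datadog SDK everywhere is simpler.</p>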
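<p>The rule of thumb can be turned into a small decision helper for discussion purposes. This is a sketch under assumptions: <code>Service</code> and <code>recommend_mode</code> are hypothetical, the order in which the two rules are checked is a choice made here, and the text does not say what to do in between, so that case is reported as undecided.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    core: bool
    needs_native: bool  # needs Profiling / ASM / CI Visibility


def recommend_mode(services: list[Service], threshold: float = 0.8) -> str:
    """Apply the 80% rule of thumb; the rule order is an assumption of this sketch."""
    if not services:
        return "undecided"
    share_without = sum(not s.needs_native for s in services) / len(services)
    if share_without > threshold:
        return "hybrid"  # OTLP for most, Datadog SDK for the few
    core = [s for s in services if s.core]
    if core and all(s.needs_native for s in core):
        return "datadog-sdk-everywhere"
    return "undecided"


mostly_plain = [Service(f"svc-{i}", core=False, needs_native=False) for i in range(9)]
mostly_plain.append(Service("checkout", core=True, needs_native=True))
print(recommend_mode(mostly_plain))
```

<p>The output is a starting point for a conversation, not a verdict: the maintenance cost of two SDKs still has to be weighed by the team that will carry it.</p>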
<h2 id="整合與下一步">Integration and next steps</h2>
<ul>
<li><a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog service page</a>: overview and day-to-day operations</li>
<li><a href="../cost-governance-agent-config/">Datadog cost governance</a>: Agent configuration and cost control</li>
<li><a href="/blog/backend/04-observability/cases/datadog-otel-migration-practice/" data-link-title="4.C7 Datadog：OTel 相容遷移實務" data-link-desc="APM 採集從專有代理轉向 OTel 相容模式的治理案例。">4.C7 Datadog OTel migration</a>: a governance case study of moving from the Datadog SDK to an OTel-compatible mode</li>
<li><a href="/blog/backend/04-observability/vendors/opentelemetry/collector-deployment-patterns/" data-link-title="OTel Collector 部署模式：agent / gateway / sidecar 與 pipeline 設計" data-link-desc="說明 OpenTelemetry Collector 三種部署位置的責任分工、receivers/processors/exporters pipeline 設計，以及 collector 失效、記憶體壓力與 backpressure 的故障演練與容量邊界">OpenTelemetry Collector deployment patterns</a>: the OTel Collector → Datadog alternative path</li>
<li><a href="../migrate-from-new-relic/">← New Relic migration</a>: the bridging role OTLP plays in a New Relic → Datadog migration</li>
</ul>
]]></content:encoded></item><item><title>New Relic → Datadog：APM schema 對位 + agent 替換 + dashboard 重建</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-from-new-relic/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-from-new-relic/</guid><description>&lt;blockquote>
&lt;p>本文是跨 vendor migration playbook、cross-link &lt;a href="https://newrelic.com/">New Relic&lt;/a> 跟 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog&lt;/a>。跑 &lt;a href="https://tarrragon.github.io/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6 維 audit&lt;/a> 後對映 &lt;em>Schema = High（NRQL ↔ Datadog query、APM agent 不同）→ Type A phased translation&lt;/em>。&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>中型 SaaS 跑 New Relic 3-5 年、production observability 飽和、團隊發現幾個問題：cost 暴漲（per-host APM + custom event + synthetic）、APM trace 對 Kubernetes-native workload 不夠細、跟 PagerDuty / Slack integration 雖然有但 latency 偏高。同期 Datadog 在 K8s monitoring + APM 端深度整合、cost model 在 100-500 host 規模更可預測。&lt;/p>
&lt;p>評估遷移時、發現 New Relic → Datadog 不是「換個 agent 就好」 — APM schema、NRQL 查詢語言、custom dashboard、synthetic monitoring rule 全部要 &lt;em>重新對位&lt;/em>；application code 端的 agent 也要 &lt;em>完全換 binary&lt;/em>。是 Type A 高 schema 差 migration、不是 drop-in。&lt;/p>
&lt;h2 id="為什麼遷cost--k8s-native--vendor-consolidation-三條-driver">為什麼遷：cost / k8s-native / vendor consolidation 三條 driver&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Driver&lt;/th>
 &lt;th>觸發場景&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>Cost&lt;/strong>&lt;/td>
 &lt;td>New Relic per-host pricing + custom event + synthetic 加總爆、Datadog 在 K8s 場景單 host 多 container 更划算&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>K8s-native&lt;/strong>&lt;/td>
 &lt;td>Datadog agent 對 K8s sidecar / DaemonSet / autodiscovery 更深&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>Vendor consolidation&lt;/strong>&lt;/td>
 &lt;td>已用 Datadog log / metric、APM 統一 vendor 降工具切換 cost&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;p>Reverse drivers (Datadog → New Relic):&lt;/p>
&lt;ul>
&lt;li>New Relic's integrated &lt;em>full-stack observability&lt;/em> bundle (APM + browser + mobile + synthetic) still leads&lt;/li>
&lt;li>Organizations heavily invested in NRQL and New Relic University training tend not to switch&lt;/li>
&lt;/ul>
&lt;h2 id="schema-對位">Schema mapping&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>New Relic concept&lt;/th>
 &lt;th>Datadog equivalent&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>APM agent (NR Java / Python / Node)&lt;/td>
 &lt;td>Datadog agent + APM tracer library&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>NRQL query&lt;/td>
 &lt;td>Datadog query (Metric / Log / Trace)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Synthetic monitor&lt;/td>
 &lt;td>Datadog Synthetic Tests&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Custom event&lt;/td>
 &lt;td>Datadog custom metric / log event&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>NRQL alert condition&lt;/td>
 &lt;td>Datadog monitor&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>New Relic dashboard&lt;/td>
 &lt;td>Datadog dashboard (need rebuild)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Apdex score&lt;/td>
 &lt;td>Datadog APM &lt;code>apm.service.errors&lt;/code> + &lt;code>apm.service.latency&lt;/code>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Distributed trace&lt;/td>
 &lt;td>Datadog APM trace（OpenTelemetry-compatible）&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="phase-0audit--classify">Phase 0：Audit + classify&lt;/h2>
&lt;ul>
&lt;li>列所有 application 跟對應 NR agent version&lt;/li>
&lt;li>列所有 NRQL alert / dashboard / synthetic monitor&lt;/li>
&lt;li>估每月 cost 跟 Datadog 對比&lt;/li>
&lt;/ul>
&lt;h2 id="phase-1schema-對位--datadog-cluster-建置">Phase 1：Schema 對位 + Datadog cluster 建置&lt;/h2>
&lt;ul>
&lt;li>Datadog organization 申請 / IAM integration&lt;/li>
&lt;li>VPC peering / private link (如果用 self-hosted agent)&lt;/li>
&lt;/ul>
&lt;h2 id="phase-2translation-pipeline-3-tier">Phase 2：Translation pipeline (3-tier)&lt;/h2>
&lt;ul>
&lt;li>Tier 1: Datadog 端 import tool（API-based NRQL → Datadog query 轉換、cover ~40-60%）&lt;/li>
&lt;li>Tier 2: LLM-assisted（剩餘 query / dashboard）&lt;/li>
&lt;li>Tier 3: manual (synthetic / complex correlation)&lt;/li>
&lt;/ul>
&lt;h2 id="phase-3parallel-run-dual-agent-4-8-週">Phase 3：Parallel run (dual-agent 4-8 週)&lt;/h2>
&lt;p>Both agents run against the same application and emit metrics / traces / logs to both backends, while the SOC compares detection coverage, alerts, and dashboards for consistency.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links <a href="https://newrelic.com/">New Relic</a> and <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a>. Running the <a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">migration-playbook-methodology 6-dimension audit</a> maps it to <em>Schema = High (NRQL ↔ Datadog query, different APM agents) → Type A phased translation</em>.</p></blockquote>
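<h2 id="問題情境">Problem scenario</h2>
<p>A mid-size SaaS has run New Relic for 3-5 years and its production observability is saturated. The team runs into several problems: cost has surged (per-host APM + custom event + synthetic), APM traces are not granular enough for Kubernetes-native workloads, and the PagerDuty / Slack integrations exist but add noticeable latency. Over the same period, Datadog has integrated K8s monitoring and APM deeply, and its cost model is more predictable at the 100-500 host scale.</p>
<p>During evaluation it becomes clear that New Relic → Datadog is not a matter of "just swapping the agent": the APM schema, the NRQL query language, custom dashboards, and synthetic monitoring rules all have to be <em>re-mapped</em>, and the agent on the application side has to be <em>replaced with a different binary</em>. This is a Type A high-schema-difference migration, not a drop-in.</p>
<h2 id="為什麼遷cost--k8s-native--vendor-consolidation-三條-driver">Why migrate: three drivers (cost / K8s-native / vendor consolidation)</h2>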
<table>
  <thead>
      <tr>
          <th>Driver</th>
          <th>Trigger scenario</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Cost</strong></td>
          <td>New Relic per-host pricing + custom event + synthetic add up to more than budgeted; Datadog is more cost-effective in K8s setups with many containers per host</td>
      </tr>
      <tr>
          <td><strong>K8s-native</strong></td>
          <td>The Datadog agent has deeper support for K8s sidecar / DaemonSet / autodiscovery</td>
      </tr>
      <tr>
          <td><strong>Vendor consolidation</strong></td>
          <td>Datadog is already used for logs / metrics; unifying APM under one vendor lowers tool-switching cost</td>
      </tr>
  </tbody>
</table>
<p>Reverse drivers (Datadog → New Relic):</p>
<ul>
<li>New Relic's integrated <em>full-stack observability</em> bundle (APM + browser + mobile + synthetic) still leads</li>
<li>Organizations heavily invested in NRQL and New Relic University training tend not to switch</li>
</ul>
<h2 id="schema-對位">Schema mapping</h2>
<table>
  <thead>
      <tr>
          <th>New Relic concept</th>
          <th>Datadog equivalent</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>APM agent (NR Java / Python / Node)</td>
          <td>Datadog agent + APM tracer library</td>
      </tr>
      <tr>
          <td>NRQL query</td>
          <td>Datadog query (Metric / Log / Trace)</td>
      </tr>
      <tr>
          <td>Synthetic monitor</td>
          <td>Datadog Synthetic Tests</td>
      </tr>
      <tr>
          <td>Custom event</td>
          <td>Datadog custom metric / log event</td>
      </tr>
      <tr>
          <td>NRQL alert condition</td>
          <td>Datadog monitor</td>
      </tr>
      <tr>
          <td>New Relic dashboard</td>
          <td>Datadog dashboard (need rebuild)</td>
      </tr>
      <tr>
          <td>Apdex score</td>
          <td>Datadog APM <code>apm.service.errors</code> + <code>apm.service.latency</code></td>
      </tr>
      <tr>
          <td>Distributed trace</td>
          <td>Datadog APM trace（OpenTelemetry-compatible）</td>
      </tr>
  </tbody>
</table>
<h2 id="phase-0audit--classify">Phase 0: Audit + classify</h2>
<ul>
<li>List every application and its NR agent version</li>
<li>List every NRQL alert / dashboard / synthetic monitor</li>
<li>Estimate the current monthly cost and compare it against a Datadog estimate</li>
</ul>
<h2 id="phase-1schema-對位--datadog-cluster-建置">Phase 1: Schema mapping + Datadog cluster setup</h2>
<ul>
<li>Request a Datadog organization / set up IAM integration</li>
<li>VPC peering / private link (if using self-hosted agents)</li>
</ul>
<h2 id="phase-2translation-pipeline-3-tier">Phase 2: Translation pipeline (3-tier)</h2>
<ul>
<li>Tier 1: Datadog-side import tool (API-based NRQL → Datadog query conversion, covers ~40-60%)</li>
<li>Tier 2: LLM-assisted (remaining queries / dashboards)</li>
<li>Tier 3: manual (synthetic / complex correlation)</li>
</ul>
<h2 id="phase-3parallel-run-dual-agent-4-8-週">Phase 3: Parallel run (dual-agent, 4-8 weeks)</h2>
<p>Both agents run against the same application and emit metrics / traces / logs to both backends, while the SOC compares detection coverage, alerts, and dashboards for consistency.</p>
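<p>The parallel-run comparison can be sketched in a few lines of Python. This is a minimal illustration, assuming the alerts fired on each side during the same window have been exported as (service, alert name) pairs; the function name and sample data are hypothetical and not part of either vendor's API.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def compare_alert_coverage(nr_alerts, dd_alerts):
    """Compare alert sets exported from New Relic and Datadog.

    Each input is an iterable of (service, alert_name) tuples
    collected over the same time window.
    """
    nr, dd = set(nr_alerts), set(dd_alerts)
    matched = nr.intersection(dd)
    return {
        "matched": sorted(matched),
        "missing_in_datadog": sorted(nr - dd),
        "extra_in_datadog": sorted(dd - nr),
        # Share of New Relic alerts that also fired on Datadog.
        "coverage": len(matched) / len(nr) if nr else 1.0,
    }


# Hypothetical sample export from one day of dual-agent operation.
nr_alerts = [("checkout", "high_latency"), ("checkout", "error_rate"), ("search", "high_latency")]
dd_alerts = [("checkout", "high_latency"), ("search", "high_latency"), ("search", "cpu")]
report = compare_alert_coverage(nr_alerts, dd_alerts)
print(report["coverage"], report["missing_in_datadog"])
</code></pre></div>
<p>Entries under <code>missing_in_datadog</code> are the ones to chase before cutover; entries under <code>extra_in_datadog</code> usually point at thresholds that were translated too aggressively.</p>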
<h2 id="phase-4cutover--cleanup">Phase 4: Cutover + cleanup</h2>
<ul>
<li>Switch the agent on the application side</li>
<li>Downgrade / cancel the New Relic license</li>
<li>Decommission timeline of 3-6 months (to retain historical query access)</li>
</ul>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1nrql-不直接對位-datadog-query">Case 1: NRQL does not map directly to Datadog queries</h3>
<p><strong>Symptom</strong>: the NRQL query <code>SELECT count(*) FROM Transaction WHERE duration &gt; 5 FACET name SINCE 1 hour ago</code> has to be split into a metric query + filter + group by on the Datadog side. The translated result is semantically equivalent but the syntax is entirely different, and SOC analysts face a steep learning curve.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Translation script + LLM assistance, keeping a side-by-side table of the original NRQL and the Datadog query (as a runbook)</li>
<li>SOC training: 1-2 weeks hands-on</li>
<li>Move some queries into <em>Datadog dashboard widgets</em> so analysts do not have to write queries directly</li>
</ol>
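<h3 id="case-2synthetic-monitor-對位失敗">Case 2: Synthetic monitor mapping fails</h3>
<p><strong>Symptom</strong>: NR Synthetic runs 100+ ping / browser / API tests. After switching to Datadog Synthetic, the "Browser Test" that corresponds to <em>step-based</em> monitors turns out to be complex to configure, and setup effort is 2-3x the estimate.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Run a sample of synthetics before cutover to estimate the real setup cost</li>
<li>Migrate critical synthetics first; evaluate retiring the rest</li>
<li>Automate with the Datadog API + Terraform instead of building by hand in the UI</li>
</ol>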
<h3 id="case-3cost-模型反轉">Case 3: Cost model reversal</h3>
<p><strong>Symptom</strong>: in the first month after cutover the Datadog bill is 30% higher than NR. The breakdown shows three items over estimate: <em>log retention + custom metric series + log indexing</em>.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>The pre-migration Datadog cost estimate must include <em>log indexing pricing</em> (billed per indexed event), not ingest alone</li>
<li>Scrub PII and sample debug logs on the application side to reduce ingested GB</li>
<li>Control custom metric cardinality (tag combinations multiply the series count)</li>
</ol>
<h3 id="case-4dashboard-自動轉失敗人工-rebuild-80">Case 4: Dashboard auto-conversion fails, 80% rebuilt by hand</h3>
<p><strong>Symptom</strong>: running NR dashboards through the Datadog import tool leaves 80% of widgets missing or mis-mapped. The team estimated 2 weeks for the dashboard rebuild; it actually took 6-8 weeks.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Accept the rebuild</strong>: production dashboards must be rebuilt by hand; do not count on automatic conversion</li>
<li><strong>Prioritize</strong>: rebuild the SOC-critical 30% first and deprecate the rest</li>
<li><strong>Extend the migration window by 4-6 weeks</strong>: dashboard rebuild effort is routinely underestimated</li>
</ol>
<h3 id="case-5cross-platform-metric-命名差">Case 5: Cross-platform metric naming differences</h3>
<p><strong>Symptom</strong>: the NR metric <code>Apdex/Apdex</code> has no counterpart in Datadog, so application code with hard-coded metric names stops working, and every alert query on an NR-specific metric fails.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li>Before cutover, list every NR-specific metric and switch application code to OpenTelemetry-style metric names</li>
<li>Rebuild queries on the Datadog side using application-level metric names rather than vendor-specific ones</li>
<li>Long term: follow OpenTelemetry semantic conventions for metric naming to avoid vendor lock-in</li>
</ol>
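<p>Apdex itself is a formula over latency samples, so a vendor-neutral version can be recomputed from raw data on any backend. The sketch below uses the standard Apdex definition (satisfied at or below T, tolerating up to 4T); the threshold, the sample values, and the choice to count errored requests as frustrated are assumptions for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def apdex(latencies_s, threshold_s, error_count=0):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  latency at or below T
    tolerating: latency above T and at or below 4T
    Errored requests are counted as frustrated (assumption).
    """
    satisfied = sum(1 for x in latencies_s if threshold_s >= x)
    tolerating = sum(1 for x in latencies_s if x > threshold_s and 4 * threshold_s >= x)
    total = len(latencies_s) + error_count
    if total == 0:
        return None
    return (satisfied + tolerating / 2) / total


# Hypothetical request latencies in seconds, with T = 0.5 s and 2 failed requests.
samples = [0.1, 0.3, 0.4, 0.9, 1.5, 2.5]
score = apdex(samples, threshold_s=0.5, error_count=2)
print(score)  # 0.5
</code></pre></div>
<p>Here 3 samples are satisfied, 2 are tolerating, and 3 (one slow request plus two errors) are frustrated, giving (3 + 1) / 8 = 0.5.</p>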
<h2 id="capacity--cost">Capacity / cost</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>New Relic</th>
          <th>Datadog</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pricing model</td>
          <td>per-host + custom event / synthetic</td>
          <td>per-host APM + log indexing + custom metric</td>
      </tr>
      <tr>
          <td>K8s-friendly</td>
          <td>Medium; autodiscovery exists but is complex to configure</td>
          <td>High; K8s-native autodiscovery is first-class</td>
      </tr>
      <tr>
          <td>Migration cost</td>
          <td>-</td>
          <td>2-4 FTE × 2-3 months</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.3-0.6</td>
          <td>0.3-0.6 (comparable)</td>
      </tr>
  </tbody>
</table>
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-datadog--grafana-stack-migration-對位">Relation to the <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack migration</a></h3>
<p>Two follow-up paths after landing on Datadog:</p>
<ul>
<li>Switch to Datadog and <em>stay</em> (stable for multiple years)</li>
<li>Switch to Datadog and <em>then move to Grafana Stack</em> to save cost (multi-tool split, Type D)</li>
</ul>
<p>Most organizations have already spent 2-3 months on the first NR → Datadog round and will not switch again right away; expect at least 1-2 years of stability.</p>
<h3 id="跟-opentelemetry-對齊">Aligning with OpenTelemetry</h3>
<p>Use the migration to move applications onto OTel as well, so the next vendor switch does not repeat the same work.</p>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Target vendor：<a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a></li>
<li>Parallel migration playbooks (Type A): <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic Security</a> / <a href="/blog/backend/01-database/vendors/mysql/migrate-to-postgresql/" data-link-title="MySQL → PostgreSQL：從 SQL dialect diff 跑出來的 Type A 6-phase migration" data-link-desc="MySQL → PostgreSQL 是 Type A 高 schema 差 migration 的標準形態 — SQL dialect / collation / case sensitivity / replication 模型差異主導；用 pgloader / AWS DMS / 自管 dual-write 三條 path、5 個 production 踩雷（auto_increment vs SERIAL / charset 跟 collation / case sensitivity / index syntax / triggers）">MySQL → PostgreSQL</a></li>
<li>Parallel migration playbook (Type D counterpart): <a href="/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/" data-link-title="Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability" data-link-desc="Datadog 五層計費（host APM / metric / log ingest / log retention / RUM）拆解、對位 Grafana Stack（Mimir / Loki / Tempo / Grafana / Alloy）的 5 層責任；OTel-based agent migration、5 個 production 踩雷（cardinality 爆 / log volume cost / dashboard 不直接轉 / alert routing 換邏輯 / SLO definition 差異）、cost reality check">Datadog → Grafana Stack</a></li>
<li>Methodology：<a href="/blog/posts/migration-playbook-%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84stage-0-variant-%E8%A6%8F%E5%8A%83%E6%8A%8A-collapse-%E7%8E%87%E5%BE%9E-60-%E9%99%8D%E5%88%B0-0/" data-link-title="Migration Playbook 方法論的演化紀錄：Stage 0 variant 規劃把 collapse 率從 60% 降到 0%" data-link-desc="跨 vendor migration playbook 需要獨立寫作方法論的依據，以及這套方法論從三輪 batch dogfood 中演化出來的驗證證據。">Migration playbook methodology</a></li>
</ul>
]]></content:encoded></item><item><title>Datadog → Grafana Stack：把 $50K/month bill 拆解到 self-hosted observability</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/migrate-to-grafana-stack/</guid><description>&lt;blockquote>
&lt;p>This article is a cross-vendor migration playbook that cross-links &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog&lt;/a> (source) and &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack&lt;/a> (target). Compared with the previous three migrations (&lt;a href="https://tarrragon.github.io/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic&lt;/a> phased / &lt;a href="https://tarrragon.github.io/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &amp;#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB&lt;/a> drop-in / &lt;a href="https://tarrragon.github.io/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &amp;#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 
對位">PostgreSQL → Aurora&lt;/a> hybrid), this one is a &lt;em>cost-driven multi-tool migration&lt;/em>: rather than swapping one product for another, it splits an &lt;em>all-in-one SaaS&lt;/em> into &lt;em>five dedicated OSS / cloud components&lt;/em>.&lt;/p></description><content:encoded><![CDATA[<blockquote>
<p>This article is a cross-vendor migration playbook that cross-links <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a> (source) and <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a> (target). Compared with the previous three migrations (<a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a> phased / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a> drop-in / <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a> hybrid), this one is a <em>cost-driven multi-tool migration</em>: rather than swapping one product for another, it splits an <em>all-in-one SaaS</em> into <em>five dedicated OSS / cloud components</em>.</p></blockquote>
<h2 id="50kmonth-bill-拆解先看錢花在哪再決定怎麼遷">Breaking down the $50K/month bill: see where the money goes, then decide how to migrate</h2>
<p>A Datadog monthly bill for a mid-size SaaS (100-500 hosts, 5K-50K metric series, TB-scale logs per day) looks like this:</p>
<table>
  <thead>
      <tr>
          <th>Line item</th>
          <th>Average unit price</th>
          <th>Mid-size SaaS estimate / month</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Infrastructure host</td>
          <td>$15-23 / host</td>
          <td>200 host × $20 = $4,000</td>
      </tr>
      <tr>
          <td>APM host</td>
          <td>$31 / host</td>
          <td>100 host × $31 = $3,100</td>
      </tr>
      <tr>
          <td>Custom metrics</td>
          <td>$0.05 / series ($5 / 100 series)</td>
          <td>30K series × $0.05 = $1,500</td>
      </tr>
      <tr>
          <td>Log ingest</td>
          <td>$0.10 / GB ingested</td>
          <td>50TB × $0.10 = $5,000</td>
      </tr>
      <tr>
          <td>Log retention（15-day）</td>
          <td>$1.27 / million events</td>
          <td>5G events × $1.27 / million = $6,350</td>
      </tr>
      <tr>
          <td>Log indexing</td>
          <td>$1.70 / million events</td>
          <td>5G events × $1.70 / million = $8,500</td>
      </tr>
      <tr>
          <td>Network</td>
          <td>$5 / host</td>
          <td>200 × $5 = $1,000</td>
      </tr>
      <tr>
          <td>RUM / Session</td>
          <td>$1.50 / 1000 session</td>
          <td>3M sessions × $1.50 / 1000 = $4,500</td>
      </tr>
      <tr>
          <td>Synthetics</td>
          <td>$5 / 10K test runs</td>
          <td>50K test = $25</td>
      </tr>
      <tr>
          <td>Total</td>
          <td>-</td>
          <td><strong>$34,000 / month</strong> (conservative estimate)</td>
      </tr>
  </tbody>
</table>
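<p>The table's arithmetic can be re-checked with a short Python sketch. Unit prices are the ones listed above; the quantities are those implied by the subtotals, which means assuming roughly 5 billion log events and 3 million RUM sessions per month. The dictionary keys are made-up labels for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># (quantity, unit price in USD); quantities are expressed in the pricing unit.
LINE_ITEMS = {
    "infrastructure_host": (200, 20.0),   # hosts
    "apm_host": (100, 31.0),              # hosts
    "custom_metrics": (30_000, 0.05),     # series
    "log_ingest": (50_000, 0.10),         # GB (50 TB)
    "log_retention_15d": (5_000, 1.27),   # million events (assumed 5 billion)
    "log_indexing": (5_000, 1.70),        # million events (assumed 5 billion)
    "network": (200, 5.0),                # hosts
    "rum_sessions": (3_000, 1.50),        # thousand sessions (assumed 3 million)
    "synthetics": (5, 5.0),               # units of 10K test runs
}


def monthly_bill(items):
    """Return per-item subtotals and the overall monthly total."""
    subtotals = {name: qty * price for name, (qty, price) in items.items()}
    return subtotals, sum(subtotals.values())


subtotals, total = monthly_bill(LINE_ITEMS)
for name, cost in sorted(subtotals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {cost:10,.0f}")
print(f"total {total:,.0f}")
</code></pre></div>
<p>Sorted this way, log indexing and log retention sit at the top and together make up a bit over 40% of the total, which is why the log stream gets so much attention in the rest of this playbook.</p>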
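<p>Scaled up to a production footprint of 500 hosts / 100TB of logs, the bill lands in the $80K-150K / month range. A Grafana stack of equivalent capacity (self-hosted on K8s plus some Grafana Cloud services) typically costs $8K-30K / month, a <em>2.5-5x cost reduction</em>.</p>
<p>Cost is not the only driver, though. Others:</p>
<ul>
<li><strong>Multi-cloud / hybrid</strong>: Datadog is centralized, while Grafana can be deployed in a distributed way to meet data residency requirements</li>
<li><strong>OpenTelemetry-first</strong>: the Grafana stack is OTel-native, whereas Datadog still centers on a vendor-specific agent</li>
<li><strong>Long-term retention</strong>: Loki on an S3 cold tier with 1-year retention is 10-50x cheaper than Datadog</li>
</ul>
<h2 id="五個責任五個-component不是替換一個產品">Five responsibilities, five components: not a one-product replacement</h2>
<p>Datadog is an <em>all-in-one SaaS</em>: a single agent and a single UI cover five responsibilities. The Grafana stack splits them across five dedicated components:</p>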
<table>
  <thead>
      <tr>
          <th>Responsibility</th>
          <th>Handled by (Datadog)</th>
          <th>Grafana Stack equivalent</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Metric</td>
          <td>Datadog metric</td>
          <td>Mimir（Prometheus-compatible long-term）</td>
      </tr>
      <tr>
          <td>Log</td>
          <td>Datadog Logs</td>
          <td>Loki（label-indexed log）</td>
      </tr>
      <tr>
          <td>Trace</td>
          <td>Datadog APM</td>
          <td>Tempo（trace-only object storage）</td>
      </tr>
      <tr>
          <td>Dashboard</td>
          <td>Datadog dashboard</td>
          <td>Grafana</td>
      </tr>
      <tr>
          <td>Agent / shipper</td>
          <td>Datadog Agent</td>
          <td>Alloy (OTel-based collector; successor to Grafana Agent / Promtail)</td>
      </tr>
  </tbody>
</table>
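<p>The migration is <em>five independent streams</em>, not a single cutover. SREs need to unlearn the "one agent covers everything" mental model.</p>
<h2 id="migration-結構每個-component-各自-phased整體-staggered">Migration structure: each component phased on its own, staggered overall</h2>
<p>Unlike the previous three migrations, which were linear flows, this one consists of <em>5 parallel migration streams</em> plus cross-stream coordination:</p>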





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">           Phase 0           Phase 1            Phase 2          Phase 3
</span></span><span class="line"><span class="ln">2</span><span class="cl">           Audit             Deploy             Dual-ship        Cutover
</span></span><span class="line"><span class="ln">3</span><span class="cl">Metric    [audit]──→        [deploy Mimir]──→ [dual-ship]──→  [cutover]
</span></span><span class="line"><span class="ln">4</span><span class="cl">APM       [audit]──→        [deploy Tempo]──→ [dual-ship]──→  [cutover]
</span></span><span class="line"><span class="ln">5</span><span class="cl">Log       [audit]──→        [deploy Loki]──→  [dual-ship]──→  [cutover]
</span></span><span class="line"><span class="ln">6</span><span class="cl">Dashboard [audit]──→        [deploy Grafana]──→ [rebuild]──→   [cutover]
</span></span><span class="line"><span class="ln">7</span><span class="cl">Alert     [audit]──→        [deploy Alertmgr]──→ [parallel]──→ [cutover]</span></span></code></pre></div><p>Each stream runs its own dual-ship + cutover and does not need to be synchronized with the others. Typically <em>Metric goes first</em> (cardinality issues surface fastest), then Log, and APM last (trace correlation depends most on dashboards / alerts).</p>
<h2 id="agent-migrationdatadog-agent--otel-collector--alloy">Agent migration：Datadog Agent → OTel Collector / Alloy</h2>
<p>The Datadog Agent is a vendor-specific binary; it is pulled out and replaced with the OpenTelemetry Collector / Grafana Alloy:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml">// Alloy config (Alloy's own HCL-like syntax, not YAML; comments use //)
// Assumption: the Loki and Tempo endpoints below are placeholders.

// Discover pods once; both the metric and log pipelines reuse the targets.
discovery.kubernetes &#34;pods&#34; {
  role = &#34;pod&#34;
}

// Metrics: scrape pods and remote-write to Mimir.
prometheus.scrape &#34;k8s_pods&#34; {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write &#34;mimir&#34; {
  endpoint {
    url = &#34;https://mimir.internal/api/v1/push&#34;
  }
}

// Logs: tail pod logs through the Kubernetes API and push to Loki.
loki.source.kubernetes &#34;pods&#34; {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.production.receiver]
}

loki.write &#34;production&#34; {
  endpoint {
    url = &#34;https://loki.internal/loki/api/v1/push&#34;
  }
}

// Traces: accept OTLP over gRPC and forward to Tempo.
otelcol.receiver.otlp &#34;default&#34; {
  grpc {}
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp &#34;tempo&#34; {
  client {
    endpoint = &#34;tempo.internal:4317&#34;
  }
}
</code></pre></div><p>During migration, running a <em>dual-shipper</em> is standard practice:</p>
<ul>
<li>Datadog Agent and Alloy run side by side (agent resource usage doubles in the short term)</li>
<li>Ship from the same host to both backends and check for consistency</li>
<li>Gradually disable the metric / log / APM submodules of the Datadog Agent</li>
</ul>
<h2 id="production-故障演練">Production failure drills</h2>
<h3 id="case-1cardinality-爆mimir-端-series-暴增">Case 1: Cardinality explosion, series count surges in Mimir</h3>
<p><strong>Symptom</strong>: 30K series on the Datadog side become 500K series once shipped to Mimir, and the Mimir ingesters OOM.</p>
<p><strong>Root cause</strong>: Datadog performs <em>automatic aggregation</em> and <em>low-cardinality enforcement</em> on tags internally. Prometheus / Mimir count <em>every unique label set</em> as one series, so high-cardinality labels in application code (user_id / request_id) blow up the series count directly.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Audit phase</strong>: run <code>topk(100, count by (__name__) ({__name__=~&quot;.+&quot;}))</code> to find high-cardinality metrics</li>
<li><strong>Drop high-cardinality labels</strong>: use <code>relabel</code> rules in Alloy / OTel Collector to drop unbounded labels such as user_id</li>
<li><strong>Switch to histogram buckets</strong>: high cardinality usually comes from label combinations; use fixed-bucket histograms instead</li>
<li><strong>Move metrics to logs where appropriate</strong>: a request ID is trace context and should not be a metric label</li>
</ol>
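<p>Why one unbounded label multiplies the series count can be shown with a small Python simulation. It counts unique label sets the way Prometheus / Mimir identify a series; the metric name, routes, and user count are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def count_series(samples, drop_labels=()):
    """Count unique (metric name, label set) pairs, i.e. distinct series."""
    unique = set()
    for metric_name, labels in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        unique.add((metric_name, kept))
    return len(unique)


# Hypothetical traffic: 3 routes x 1000 users, each request tagged with user_id.
samples = [
    ("http_requests_total", {"route": route, "user_id": str(uid)})
    for route in ("/cart", "/checkout", "/search")
    for uid in range(1000)
]
print(count_series(samples))                           # 3000
print(count_series(samples, drop_labels={"user_id"}))  # 3
</code></pre></div>
<p>Dropping the single <code>user_id</code> label takes the count from 3000 series to 3, which is the effect a relabel rule at the collector is meant to achieve before the data reaches Mimir.</p>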
<h3 id="case-2log-volume-cost-預估失準">Case 2: Log volume cost estimate is off</h3>
<p><strong>Symptom</strong>: one month after deploying Loki, the S3 bill is 2x the estimate; both object storage and query GB scanned exceed expectations.</p>
<p><strong>Root cause</strong>: Datadog applies automatic sampling / aggregation to logs and bills on indexed events. Loki does <em>full raw ingest</em> + S3 cold storage and is billed by actual bytes. Raw log volume is 3-10x the indexed event volume.</p>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Ingest-side sampling</strong>: sample debug / info logs in Alloy / Promtail and ingest warn / error in full</li>
<li><strong>Log structure</strong>: JSON logs compress better than text logs, cutting Loki S3 size by about 50%</li>
<li><strong>Retention tier</strong>: hot for 7 days on S3 Standard / cold for 1 year on S3 Glacier, governed by a retention budget</li>
</ol>
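<p>The sampling rule in the first fix can be sketched as a small Python filter. This only illustrates the decision logic, not Alloy or Promtail configuration; the sample rates and the choice of CRC32 as the hash are assumptions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import zlib

# Hypothetical rates; any level not listed (warn, error, ...) is kept in full.
SAMPLE_RATES = {"debug": 0.01, "info": 0.10}


def should_ingest(level, message):
    """Keep all warn/error lines; keep a deterministic fraction of the rest."""
    rate = SAMPLE_RATES.get(level.lower(), 1.0)
    if rate >= 1.0:
        return True
    # Hash the message so the decision is stable across restarts and hosts.
    bucket = zlib.crc32(message.encode("utf-8")) % 10_000
    return rate * 10_000 > bucket


lines = [("info", f"request {i} ok") for i in range(10_000)] + [("error", "db timeout")]
kept = [(lvl, msg) for lvl, msg in lines if should_ingest(lvl, msg)]
print(len(kept), "of", len(lines), "lines kept")
</code></pre></div>
<p>Hashing the message rather than calling a random generator keeps the decision reproducible, at the price that identical messages are always kept or always dropped together.</p>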
<h3 id="case-3datadog-dashboard-不能直接轉-grafana">Case 3: Datadog dashboards cannot be converted directly to Grafana</h3>
<p><strong>Symptom</strong>: the migration plan assumed "automatic dashboard conversion". In practice, running Datadog API export → Grafana import left 80% of dashboards with missing widgets or metrics that did not line up.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>Datadog query syntax is not directly compatible with the PromQL used by Grafana / Mimir</li>
<li>Datadog widget types (top-list / hostmap) have no direct Grafana equivalent</li>
<li>Tag-based aggregation maps to Prometheus labels, but the syntax differs</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Accept the rebuild</strong>: production-grade dashboards must be rebuilt by hand; do not count on automatic conversion</li>
<li><strong>Prioritize</strong>: rebuild the <em>SOC-facing / production-critical</em> 30% first and deprecate the rest</li>
<li><strong>Extend the migration window by 4-6 weeks</strong>: dashboard rebuild effort is routinely underestimated</li>
</ol>
<h3 id="case-4alert-routing-換邏輯pagerduty-integration-不通">Case 4: Alert routing logic changes, PagerDuty integration breaks</h3>
<p><strong>Symptom</strong>: after cutover, alerts stop reaching PagerDuty and the SOC only notices half an hour later. The webhook on the alerting side is configured correctly, but the payload format differs from Datadog's and gets filtered out by PagerDuty rules.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>The Datadog alert payload includes <code>event_type=alert</code>, which the PagerDuty integration uses for routing</li>
<li>The default Alertmanager payload has a different structure</li>
<li>PagerDuty rules were written against the Datadog event schema, so Alertmanager events do not match</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Pre-cutover test</strong>: dry-run Alertmanager → PagerDuty and send a test alert to verify delivery</li>
<li><strong>PagerDuty Service</strong>: create a separate Grafana-source Service instead of sharing the Datadog Service</li>
<li><strong>Alertmanager template</strong>: define a custom JSON template via webhook so the payload is close to the Datadog structure</li>
</ol>
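<p>For the pre-cutover dry run it helps to see the payload shape before anything is sent. The Python sketch below maps one alert from an Alertmanager webhook body to a PagerDuty Events API v2 event. The field names follow those two formats; the severity fallback, the sample alert, and the placeholder routing key are assumptions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json

# PagerDuty accepts critical / error / warning / info; unknown values fall back to error.
SEVERITY_MAP = {"critical": "critical", "warning": "warning", "info": "info"}


def to_pagerduty_event(alert, routing_key):
    """Map one Alertmanager webhook alert to a PagerDuty Events API v2 body."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    resolved = alert.get("status") == "resolved"
    return {
        "routing_key": routing_key,
        "event_action": "resolve" if resolved else "trigger",
        # Reuse the Alertmanager fingerprint so trigger and resolve pair up.
        "dedup_key": alert.get("fingerprint", ""),
        "payload": {
            "summary": annotations.get("summary", labels.get("alertname", "alert")),
            "source": labels.get("instance", "alertmanager"),
            "severity": SEVERITY_MAP.get(labels.get("severity", ""), "error"),
            "custom_details": labels,
        },
    }


# Hypothetical alert as it would appear in the webhook's "alerts" list.
alert = {
    "status": "firing",
    "fingerprint": "a1b2c3",
    "labels": {"alertname": "HighLatency", "severity": "critical", "instance": "checkout-1"},
    "annotations": {"summary": "p99 latency above 2s on checkout"},
}
event = to_pagerduty_event(alert, routing_key="PLACEHOLDER_INTEGRATION_KEY")
print(json.dumps(event, indent=2))
</code></pre></div>
<p>Printing the event and comparing it field by field with what the existing PagerDuty rules expect shows a mismatch before cutover rather than half an hour after it.</p>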
<h3 id="case-5slo-definition-跟-monitor-type-對不上">Case 5: SLO definitions and monitor types do not line up</h3>
<p><strong>Symptom</strong>: a Datadog SLO reports 99.9% availability; after moving to Grafana SLO + Mimir the measured figure no longer matches. The SOC compares 5 SLOs on dashboards and 4 differ by 0.1-0.3%.</p>
<p><strong>Root cause</strong>:</p>
<ul>
<li>Datadog computes SLOs over the time window with an internal query; Grafana SLO uses a formula written in PromQL</li>
<li>Datadog handles missing data for <code>success_rate</code> differently from the PromQL default</li>
<li>Time bucket boundaries are handled differently</li>
</ul>
<p><strong>Fix</strong>:</p>
<ol>
<li><strong>Redefine the SLO in PromQL</strong>: do not try to "copy" it; redefine it and write the PromQL expression carefully</li>
<li><strong>Accept ±0.1% drift</strong>: run production-critical SLOs dual-track for 1-2 months and tune the PromQL until drift is acceptable</li>
<li><strong>SLO migration is not a subset of dashboard migration</strong>: treat it as its own stream and allow more time</li>
</ol>
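<p>How much the missing-data policy alone can move an SLO is easy to see in a toy calculation. The Python sketch below computes a time-slice SLO under three policies for empty buckets; it illustrates the effect only and is not the algorithm either Datadog or Grafana uses. The bucket counts are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def time_slice_slo(buckets, threshold, missing="exclude"):
    """Fraction of time buckets whose success ratio meets the threshold.

    buckets: success ratio per bucket, or None when the bucket has no data.
    missing: "exclude" skips empty buckets, "good" counts them as passing,
             "bad" counts them as failing.
    """
    good = counted = 0
    for ratio in buckets:
        if ratio is None:
            if missing == "exclude":
                continue
            counted += 1
            good += 1 if missing == "good" else 0
        else:
            counted += 1
            good += 1 if ratio >= threshold else 0
    return good / counted if counted else None


# 1000 one-minute buckets: 996 healthy, 2 degraded, 2 with no data.
buckets = [1.0] * 996 + [0.90] * 2 + [None] * 2
for policy in ("exclude", "good", "bad"):
    print(policy, round(100 * time_slice_slo(buckets, 0.99, policy), 2))
</code></pre></div>
<p>With identical input, treating the two empty buckets as passing or failing shifts the result by 0.2 percentage points, the same order of magnitude as the drift described above.</p>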
<h2 id="capacity--cost-對照">Capacity / cost comparison</h2>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Datadog</th>
          <th>Grafana Stack（self-hosted on K8s）</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Setup cost</td>
          <td>Low (SaaS)</td>
          <td>Medium-high (K8s deploy + storage backend)</td>
      </tr>
      <tr>
          <td>Operational cost (200 host)</td>
          <td>$34K / month</td>
          <td>$8-12K / month (including S3 + K8s)</td>
      </tr>
      <tr>
          <td>Operational cost (500 host)</td>
          <td>$80-150K / month</td>
          <td>$15-30K / month</td>
      </tr>
      <tr>
          <td>Operational FTE</td>
          <td>0.1-0.3</td>
          <td>1-2 FTE（K8s + storage + Grafana operator）</td>
      </tr>
      <tr>
          <td>Long-term retention</td>
          <td>$1.27 / million event for 15+ day</td>
          <td>S3 + Loki：~$0.02 / GB / month</td>
      </tr>
      <tr>
          <td>Multi-cloud / hybrid</td>
          <td>受 Datadog region 限</td>
          <td>自由部署</td>
      </tr>
      <tr>
          <td>Vendor lock-in</td>
          <td>高</td>
          <td>低（OSS + OTel）</td>
      </tr>
      <tr>
          <td>Time to value</td>
          <td>1-2 weeks</td>
          <td>4-8 weeks</td>
      </tr>
      <tr>
          <td>Migration cost (one-time)</td>
          <td>-</td>
          <td>1-3 FTE × 3 months</td>
      </tr>
  </tbody>
</table>
<p><strong>Break-even point</strong>: at around 150 hosts, self-hosted becomes cheaper once costs are amortized over 3 years; below 100 hosts, SaaS gives the better ROI.</p>
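<p>The amortization behind a break-even estimate can be sketched as a simple sum. This is not the model used to derive the figure above. The infra and FTE inputs are midpoints of the 200-host row in the table, and <code>FTE_MONTHLY</code> is a hypothetical loaded cost per engineer-month that should be replaced with a real number.</p>

```python
def amortized_total(monthly_infra: float, ops_fte: float, fte_monthly_cost: float,
                    months: int = 36, migration_fte_months: float = 0.0) -> float:
    """Total cost over the window: infra + ongoing staff + one-time migration effort."""
    recurring = (monthly_infra + ops_fte * fte_monthly_cost) * months
    one_time = migration_fte_months * fte_monthly_cost
    return recurring + one_time


FTE_MONTHLY = 15_000  # hypothetical loaded cost per engineer-month, not from the table

# 200-host row, using midpoints of the table's ranges.
saas = amortized_total(34_000, ops_fte=0.2, fte_monthly_cost=FTE_MONTHLY)
self_hosted = amortized_total(10_000, ops_fte=1.5, fte_monthly_cost=FTE_MONTHLY,
                              migration_fte_months=2 * 3)  # 2 FTE for 3 months

print(f"SaaS: ${saas:,.0f}  self-hosted: ${self_hosted:,.0f}")
```

<p>With these inputs, self-hosted comes out roughly 5% lower over 3 years. The result is dominated by staffing rather than infrastructure, so a higher FTE cost or a second operations engineer flips the outcome.</p>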
<h2 id="整合--下一步">Integration / next steps</h2>
<h3 id="跟-opentelemetry-對齊">Aligning with OpenTelemetry</h3>
<p>The migration is an opportunity for an <em>OTel-first transition</em>:</p>
<ul>
<li>Instrument application code with the OTel SDK to avoid Datadog SDK lock-in</li>
<li>Propagate trace context with W3C Trace Context</li>
<li>A future backend change then requires no application changes</li>
</ul>
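<p>To make the W3C Trace Context point concrete, here is a minimal stdlib-only sketch of the <code>traceparent</code> header, which is what crosses service boundaries regardless of backend. In a real service the OTel SDK's propagators do this for you. The helper names are hypothetical, and the parser only accepts the version <code>00</code> four-field layout.</p>

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = int(flags, 16) % 2 == 1  # lowest bit of the flags byte
    return TraceParent(trace_id, parent_id, sampled)


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace id, fresh span id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
print(outgoing)
```

<p>Because the header format is vendor-neutral, any backend that understands it can stitch the spans together, which is what makes the later backend swap an infrastructure change only.</p>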
<h3 id="跟-splunk--elastic-對照">Comparison with <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic</a></h3>
<p>Both articles describe a <em>cost-driven SaaS migration</em>, but the details differ:</p>
<ul>
<li>Splunk → Elastic sits in the SIEM domain, where schema translation is the core problem</li>
<li>Datadog → Grafana splits one platform into multiple tools, so rebuilding agents and dashboards is the core work</li>
<li>Shared pattern: dual-ship → parallel run → cutover</li>
</ul>
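<h3 id="反向遷移grafana-stack--datadog">Reverse migration (Grafana Stack → Datadog)</h3>
<p>It happens, but rarely. The main driver is <em>operational complexity reduction</em> (teams that no longer want to run Mimir / Loki themselves); the schema mapping runs in the opposite direction and the agent goes back to the Datadog Agent.</p>
<h3 id="下一步議題">Topics for follow-up</h3>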
<ul>
<li><strong>Hybrid with Grafana Cloud</strong>: run some components (Tempo) on Grafana Cloud SaaS and self-host the rest</li>
<li><strong>OpenTelemetry Collector vs Alloy</strong>: both are OTel-based; Alloy is Grafana's own distribution of the OpenTelemetry Collector</li>
<li><strong>Vector vs Alloy vs Fluentd</strong>: the log shipper contest, compared on cost, features, and OTel integration</li>
</ul>
<h2 id="相關連結">Related links</h2>
<ul>
<li>Source vendor: <a href="/blog/backend/04-observability/vendors/datadog/" data-link-title="Datadog" data-link-desc="All-in-one SaaS 觀測平台、APM / Logs / Metrics / RUM / Security">Datadog</a></li>
<li>Target vendor: <a href="/blog/backend/04-observability/vendors/grafana-stack/" data-link-title="Grafana Stack" data-link-desc="Grafana / Loki / Tempo / Mimir / Pyroscope 全棧">Grafana Stack</a></li>
<li>Parallel vendors: <a href="/blog/backend/04-observability/vendors/elastic-stack/" data-link-title="Elastic Stack" data-link-desc="ELK：Elasticsearch / Logstash / Kibana &#43; Beats / APM">Elastic Stack</a> / <a href="/blog/backend/04-observability/vendors/opentelemetry/" data-link-title="OpenTelemetry" data-link-desc="可觀測性開放標準、SDK 與 Collector">OpenTelemetry</a></li>
<li>Parallel migration playbooks: <a href="/blog/backend/07-security-data-protection/vendors/splunk/migrate-to-elastic-security/" data-link-title="Splunk → Elastic Security Detection Rule Migration：6 段 phased playbook 跟 5 大踩雷" data-link-desc="從 Splunk Enterprise Security 遷到 Elastic Security 的 detection rule translation playbook：SPL ↔ KQL/ES|QL schema 對位、AI-assisted translation pipeline、parallel run 比對、cutover routing、5 個 production 踩雷（macro 沒對應 / time zone 差異 / summary index 不對位 / alert dedup key 衝突 / 過早 decommission）、capacity / cost 對照">Splunk → Elastic Security</a> / <a href="/blog/backend/02-cache-redis/vendors/redis/migrate-to-dragonflydb/" data-link-title="Redis → DragonflyDB：drop-in 相容下的容量躍升 &#43; 5 個踩雷" data-link-desc="DragonflyDB 號稱 Redis drop-in 替代、單機 throughput 25x、記憶體效率 30% 提升；遷移流程簡單但有 5 個 production 踩雷（RDB 版本差 / Lua 腳本不全支援 / Pub-Sub fanout 行為差異 / Cluster mode 兼容度 / Modules 不支援）、跟 Sentinel / Cluster 模式對位">Redis → DragonflyDB</a> / <a href="/blog/backend/01-database/vendors/postgresql/migrate-to-aurora/" data-link-title="PostgreSQL → Aurora Migration：protocol 相容、operational 重設計" data-link-desc="Aurora 號稱 PostgreSQL-compatible 但 operational model 不同（storage decouple / cluster endpoint / instance class / 自家備份）；遷移流程是混合（protocol drop-in &#43; operational phased）、5 個 production 踩雷（extension 不支援 / replication slot 不直通 / autovacuum 行為差 / IAM 認證強制 / cost model 換算）、跟 Patroni / read replica / DR 對位">PostgreSQL → Aurora</a></li>
<li>Methodology: <a href="/blog/posts/vendor-%E6%B7%B1%E5%BA%A6%E6%8A%80%E8%A1%93%E6%96%87%E7%AB%A0%E6%96%B9%E6%B3%95%E8%AB%96%E7%9A%84%E6%BC%94%E5%8C%96%E7%B4%80%E9%8C%84%E5%90%8C-vendor-%E7%B3%BB%E5%88%97%E7%9A%84%E9%96%8B%E5%A0%B4%E8%BC%AA%E6%9B%BF%E9%A9%97%E8%AD%89/" data-link-title="Vendor 深度技術文章方法論的演化紀錄：同 vendor 系列的開場輪替驗證" data-link-desc="vendor overview 飽和後要寫單一功能深度文章、需要選題與結構依據時回來。這套方法論的驗證來源與 cadence variant 在高風險場景（同 vendor sub-tool 系列）的實證。">Writing methodology for vendor deep-dive technical articles</a></li>
</ul>
]]></content:encoded></item></channel></rss>