OpenTelemetry on Tarragon

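Schema comparison with OpenTelemetry

Fri, 19 Jun 2026 00:00:00 +0000

OpenTelemetry, with OTLP as its wire protocol, is the industry standard for server-side observability. It defines the data format and transport protocol for three signals: traces, metrics, and logs. A self-hosted event schema differs from OTLP in design goals, complexity, and the scenarios it fits.

Design goals

OTLP

OTLP aims to be "a unified observability standard across languages, frameworks, and vendors". It supports distributed tracing (trace context propagation), multi-dimensional metrics (histogram, summary, exponential histogram), and structured logs.

The OTLP data model assumes server-side infrastructure: a collector (such as the OTel Collector) routes and transforms data, and backends (such as Jaeger, Prometheus, Grafana) store and visualize it.

Self-hosted event schema

The self-hosted schema aims to be "the minimum viable structure for client-side monitoring". The infrastructure it assumes is an HTTP endpoint, a JSONL file, and grep. It needs no distributed tracing (the client is usually a single service) and no multi-dimensional metrics (counters and gauges fit in the event's data field).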

Concrete differences

Dimension	OTLP	Self-hosted event schema
Signal types	Three separate signals (Trace / Metric / Log)	One unified event format plus a type field
Transport format	Protobuf (HTTP/gRPC)	JSON (HTTP POST)
Trace context	SpanID / TraceID / ParentSpanID	Session ID (no distributed tracing)
Metric model	Sum / Gauge / Histogram / Summary	Numeric values in the data field
Resource	Structured resource attributes	source field
Schema complexity	High (full Protobuf definitions)	Low (JSON Schema, 6 core fields)

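What the self-hosted schema simplifies

No distributed tracing

OTLP's trace signal uses TraceID and SpanID to correlate a request across services. Client-side monitoring usually does not need this: the app is a single service, so there is no cross-service request chain.

The self-hosted schema uses a session ID to correlate the events of one usage session, which is enough to answer "what did the user do during this operation".

No Protobuf

OTLP encodes data with Protobuf, which is efficient (binary format, schema checked at compile time). But Protobuf requires schema files (.proto), code generation, and a Protobuf package for the SDK's language.

The self-hosted schema uses JSON: human-readable, grep-friendly, and free of extra tooling. JSON is less efficient than Protobuf (text format, larger payloads), but at client-side monitoring volumes (tens to hundreds of events per minute) the difference is not a bottleneck.

Simplified metric model

OTLP's metric signal supports histogram (bucketed distribution), summary (percentiles), and exponential histogram (adaptive buckets). These models make sense for high-frequency server-side metric collection.

The self-hosted schema records a metric as a numeric value in the event's data field ({"type": "metric", "name": "connect.duration", "data": {"value_ms": 320}}). Statistical analysis happens at query time on the collector side; the schema layer does no aggregation.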
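As a minimal sketch of "aggregate at query time", the Python below reads JSONL lines shaped like the inline example and computes a nearest-rank percentile of value_ms. The function name `percentile_ms` and the sample values are assumptions made for illustration; they are not part of the schema.

```python
import json
import math


def percentile_ms(jsonl_lines, name, pct):
    """Nearest-rank percentile of data.value_ms for one metric name."""
    values = []
    for line in jsonl_lines:
        event = json.loads(line)
        if event.get("type") == "metric" and event.get("name") == name:
            values.append(event["data"]["value_ms"])
    if not values:
        return None
    values.sort()
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(values)))
    return values[rank - 1]


# Hypothetical sample data: four raw events, one per line, as a JSONL file would hold.
sample = [
    json.dumps({"type": "metric", "name": "connect.duration", "data": {"value_ms": v}})
    for v in (320, 180, 950, 240)
]
p50 = percentile_ms(sample, "connect.duration", 50)
p95 = percentile_ms(sample, "connect.duration", 95)
```

Because every raw value is kept, any percentile can be computed later without deciding bucket boundaries up front; the cost is that the query scans all events.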

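When to switch to OTLP

When the following signals appear, the self-hosted schema's simplifications may become limits:

Correlation with server-side traces is needed: a client-side action must be tied to a server-side trace ("the full path from the user's button click to the database query"). This requires OTLP's trace context propagation.

Event volume exceeds what JSONL can handle: at thousands of events per second, JSON parsing and grep queries over JSONL become the bottleneck. A pipeline of OTLP + OTel Collector + time-series DB handles higher throughput.

Multiple backends are needed: sending data to Prometheus (metrics), Jaeger (traces), and Elasticsearch (logs) at the same time. The OTel Collector supports multi-backend routing natively; a self-hosted setup has to implement it.

Switching strategy: keep the SDK-level API unchanged (init / event / error / metric) and change only the underlying transport and encoding. Moving from JSON POST to OTLP export then requires no code changes from SDK users.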
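The switching strategy can be sketched as a facade whose exporter is injected. Everything here is hypothetical: the `Telemetry` class, both exporter functions, and the "OTLP-like" record shape (which is not the real OTLP encoding) exist only to show that callers keep the same `event` / `metric` calls while the transport changes.

```python
import json
from typing import Callable


class Telemetry:
    """Hypothetical SDK facade: callers use event()/metric(); the exporter is swappable."""

    def __init__(self, source: str, export: Callable[[dict], None]):
        self._source = source
        self._export = export

    def event(self, name: str, data: dict) -> None:
        self._export({"type": "event", "name": name, "source": self._source, "data": data})

    def metric(self, name: str, value_ms: float) -> None:
        self._export({"type": "metric", "name": name, "source": self._source,
                      "data": {"value_ms": value_ms}})


def jsonl_exporter(sink: list) -> Callable[[dict], None]:
    """Stand-in for today's JSON POST: serialize each record as one JSON line."""
    return lambda record: sink.append(json.dumps(record))


def otlp_like_exporter(sink: list) -> Callable[[dict], None]:
    """Stand-in for a future OTLP export: reshape the record, leave callers untouched."""
    def export(record: dict) -> None:
        sink.append({"resource": {"service.name": record["source"]},
                     "name": record["name"], "attributes": record["data"]})
    return export


lines, records = [], []
# The calling code is identical; only the injected exporter differs.
Telemetry("app", jsonl_exporter(lines)).metric("connect.duration", 320)
Telemetry("app", otlp_like_exporter(records)).metric("connect.duration", 320)
```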

Where to go next

Full definition of the self-hosted schema → event.schema.json complete field reference
Server-side observability → backend 04 Observability
Collector design → Module 4 Collector design

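OTel Collector deployment modes: agent / gateway / sidecar and pipeline design

Tue, 16 Jun 2026 00:00:00 +0000

This is a vendor deep article on OpenTelemetry that expands the "Collector deployment modes" section of the overview. Readers new to OpenTelemetry should read the OpenTelemetry service page first and then return here. The commands were verified on 2026-06-16 against otel/opentelemetry-collector-contrib:0.154.0 running in Docker.

Between the telemetry an application produces and the backend that finally stores it, there needs to be an intermediary layer, and the OTel Collector is that layer. The application only emits data to the collector over OTLP; the collector receives, processes, and forwards it, so the two sides are decoupled. The first decision when deploying the collector is where it sits (same host, central gateway, or pod sidecar), not configuration detail. Placement determines buffering capacity, when enrichment happens, and the blast radius of a failure.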

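Problem: three costs of sending telemetry straight to the backend

When applications send telemetry to the backend directly through a vendor SDK, three problems show up as scale grows. The first is coupling: every service hard-codes a backend's endpoint and credentials, so changing backends means changing and redeploying every service. The second is the lack of a buffer: when the backend is briefly unavailable, telemetry is simply lost, because applications do not keep a retry queue for observability data. The third is scattered enrichment: each service adds its own resource attributes and does its own sampling, so standards are hard to unify.

The collector gathers these three concerns into one intermediary layer. Applications know only the collector's OTLP endpoint, and changing backends means changing only collector configuration; the collector has queues and retries; enrichment and sampling happen in one place. Where that layer sits, however, determines how much of each problem it solves.

With few services and a single, stable backend, sending directly to the backend is a reasonable starting point, because the three costs stay manageable at small scale. The collector is the upgrade for scale: introduce it when the backend has to change, when service count grows to the point that enrichment must be unified, or when sampling requirements appear.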

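Core concept: division of responsibility across three placements

There are three collector placements. They differ in how close the collector is to the application and how many sources it aggregates.

Agent mode puts the collector on the same host or the same K8s node (DaemonSet) as the application. Its job is local buffering and host-level enrichment: the application emits telemetry to the co-located collector over localhost, with very low latency and no network hop, and the collector adds resource attributes that are only known locally, such as host name and container id. The value of the agent is being closest to the application: once the application has emitted telemetry it no longer cares what happens next, and buffering and retries are carried by the local collector.

The agent solves "close to the application, no data loss", but it sees only its own machine, so processing that needs a global view does not fit there. Gateway mode fills that gap: the collector is deployed centrally as a standalone service cluster that receives telemetry from many agents or applications and handles the work that needs a global view: tail-based sampling (the decision requires the complete trace), cross-source routing (different telemetry to different backends), and central rate limiting and cost control. The value of the gateway is centralized decisions, placing in this layer the processing that is only possible after streams converge.

Sidecar mode runs the collector in K8s as a sidecar container that shares the application pod's lifecycle. Its job resembles the agent's (local buffering, pod-level enrichment); the difference is that the isolation granularity is the pod rather than the node. It sits closer to a single pod than a DaemonSet agent does (shared pod network, starts and stops with the pod), which suits cases needing per-pod configuration or strong isolation. The cost is one extra collector's resource overhead per pod.

A common deployment combines two tiers: agents (DaemonSet) do local buffering and host enrichment, then forward to a gateway cluster for tail sampling and routing. The agent covers "close to the application, no data loss", the gateway covers "processing that needs a global view", and each tier keeps to its own job.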

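Pipeline model: receivers / processors / exporters

Wherever it is placed, the collector has the same internal pipeline model: telemetry enters through receivers, is transformed by processors, and leaves through exporters. service.pipelines wires the three together per signal type (traces / metrics / logs). Below is a minimal verifiable configuration; its three blocks (receivers / processors / exporters) correspond to the three pipeline stages, and each is explained afterwards. This configuration was verified in Docker to start cleanly and pass data end to end (validate --config returned 0, and after sending 5 traces the debug exporter printed the spans in full):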

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
    spike_limit_mib: 64
  batch:
    timeout: 5s
    send_batch_size: 1024
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]

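receivers define how data comes in; OTLP (gRPC 4317 / HTTP 4318) is the standard entry point. processors define how data is transformed, and their order matters: memory_limiter goes first to stop memory from blowing up, and batch goes after it to gather scattered spans into batches before sending, which reduces downstream request count. The 256 / 64 MiB values here are demo sizes; in production, set them in proportion to the container memory limit (a common practice is limit_mib at 80% of container memory and spike at 20-25% of the limit). exporters define where data goes; in production this is OTLP to a backend or a vendor exporter, while debug is used here to verify flow. service.pipelines is the wiring that actually takes effect: only components attached to a pipeline run, and a component that is defined but not attached to any pipeline does nothing.

Processor order is a common pitfall. memory_limiter must come first so it can inspect and refuse data before it reaches later processors. batch comes after it, because if batch ran first, telemetry would already have accumulated into large batches inside the batch processor; by the time the memory limit triggered, pressure would be higher and refusing would be less effective. When sampling is needed, head sampling can live in the agent-tier pipeline, but tail sampling must live in the gateway tier (it needs the complete trace after streams converge), and all spans of one trace must be routed to the same gateway instance (using a load balancing exporter keyed on trace ID). Otherwise each gateway node sees only fragments and the tail decision is still incomplete.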
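The trace-ID routing requirement can be sketched in a few lines. This is an illustration only: `pick_gateway` and the gateway names are hypothetical, and plain modulo hashing stands in for whatever routing the actual load balancing exporter performs. The property it shows is that the choice depends on the trace ID alone, so every span of a trace lands on the same instance.

```python
import hashlib


def pick_gateway(trace_id: str, gateways: list[str]) -> str:
    """Map a trace ID to one gateway so all spans of a trace meet on the same instance."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer, then take the modulo.
    return gateways[int.from_bytes(digest[:8], "big") % len(gateways)]


gateways = ["gateway-0", "gateway-1", "gateway-2"]  # hypothetical instance names
first_span_target = pick_gateway("4bf92f3577b34da6a3ce929d0e0e4736", gateways)
later_span_target = pick_gateway("4bf92f3577b34da6a3ce929d0e0e4736", gateways)
```

Plain modulo hashing remaps most traces when the gateway list changes size, which is one reason to prefer a purpose-built component over a hand-rolled router.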

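Production failure drills

The blast radius of a collector failure depends on the deployment mode, so think it through before choosing a placement. In agent mode, a collector crash on one node affects only that node's applications, and applications sending to localhost can fail fast. In gateway mode, a gateway cluster outage affects every upstream agent, so the gateway must run multiple replicas behind load balancing and must not be a single point of failure. In sidecar mode, the blast radius is narrower than the agent's (only the application in the same pod), but every pod is its own failure point, so with many pods the odds that some are failing at any moment go up. Drills should inject "a single agent down" and "gateway cluster unavailable" separately, and confirm that the former stays contained and the latter is absorbed by agent-tier buffering.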

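Memory pressure is the most common collector failure. When telemetry flows in faster than the exporter sends it out, data accumulates inside the collector and memory climbs; without protection the process is OOM-killed and everything it was holding is lost. The memory_limiter processor is the line of defence. It checks memory periodically (check_interval) and reacts at two thresholds: above the soft limit (limit_mib minus spike_limit_mib) it starts refusing new data and returns errors upstream, leaving spike_limit_mib of headroom for bursts between checks; above the hard limit (limit_mib) it keeps refusing and also forces a garbage collection. If spike_limit_mib is omitted the processor falls back to its default of 20% of limit_mib, so set it explicitly and size it for the largest burst expected within one check_interval rather than relying on the default. In the drill, feed data faster than the exporter can drain it and confirm that memory_limiter steps in at the soft limit and the collector survives instead of being OOM-killed.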
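The two thresholds reduce to simple arithmetic. The sketch below is not collector code; `memory_limiter_state` and its return labels are hypothetical, and it only classifies a memory reading against the soft limit (limit_mib minus spike_limit_mib) and the hard limit (limit_mib), using the demo values from the configuration above.

```python
def memory_limiter_state(used_mib: int, limit_mib: int, spike_limit_mib: int) -> str:
    """Classify current memory use against the two memory_limiter thresholds."""
    soft_limit_mib = limit_mib - spike_limit_mib
    if used_mib >= limit_mib:
        return "hard_limit_exceeded"
    if used_mib >= soft_limit_mib:
        return "soft_limit_exceeded"
    return "accept"


# With limit_mib=256 and spike_limit_mib=64 the soft limit is 192 MiB.
states = [memory_limiter_state(used, 256, 64) for used in (100, 200, 260)]
```

In a drill, the reading to watch is how long the collector stays between the two limits: a collector that jumps straight past the hard limit has too little spike headroom for its check interval.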

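Backpressure must be verified end to end. When the backend slows down and the exporter queue fills, the collector's OTLP receiver pushes back on upstream (at the gRPC layer it refuses with resource-exhausted). In agent mode this backpressure reaches the application's OTLP exporter, and the application SDK's queue fills as well. What the SDK does at that point depends on exporter configuration: confirm that the queue-full policy is drop rather than block, so that telemetry is discarded instead of stalling business threads (defaults differ across language SDKs, so do not assume drop). The drill should confirm that the chain "backend slow → collector pushes back → application drops telemetry while business traffic is unaffected" holds, so that pressure on the observability system does not feed back into the main flow.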
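A drop-on-full policy can be sketched with a bounded queue from the standard library. `DroppingBuffer` is a hypothetical stand-in, not any SDK's real exporter queue; it shows the behaviour to verify in the drill: when the queue is full, the caller returns immediately and a drop counter goes up instead of the thread blocking.

```python
import queue


class DroppingBuffer:
    """Bounded buffer that drops new items when full instead of blocking the caller."""

    def __init__(self, max_items: int):
        self._queue = queue.Queue(maxsize=max_items)
        self.dropped = 0

    def offer(self, item) -> bool:
        try:
            # put_nowait raises queue.Full immediately rather than waiting for space.
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def size(self) -> int:
        return self._queue.qsize()


buffer = DroppingBuffer(max_items=2)
accepted = [buffer.offer(f"span-{i}") for i in range(3)]
```

Exposing the drop count as a metric is what makes this policy safe to run: telemetry loss becomes visible without ever slowing the business path.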

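Observed signal	Interpretation	Action
Collector container restarts frequently with OOM	memory_limiter threshold too high or not enabled	Lower limit_mib and confirm spike_limit_mib is set
Exporter queue depth stays saturated	Downstream backend slow or unavailable	Check backend status; confirm exporter retry and timeout settings
Receiver refused-spans count rising	memory_limiter is refusing data; collector is under pressure	Check for abnormal upstream traffic; consider scaling the gateway or lowering the sampling rate
All gateways unavailable, agent buffers start dropping	Global telemetry outage	Confirm gateway replicas and load balancing; check agent queue and drop policy
Telemetry reaches the backend late but is not lost	batch processor batching normally	Normal behaviour; confirm the batch timeout is as expected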

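Capacity / cost boundaries

Agents and gateways have different cost curves, so choose against your scale. The cost of agents (DaemonSet) is a fixed overhead of one collector per node: total overhead grows linearly with node count, but each collector handles only local traffic, so per-instance load stays manageable. The cost of a gateway is a central cluster: fewer instances, but each carries the converged total traffic, so capacity planning and horizontal scaling follow total telemetry throughput.

The cost reading for a two-tier architecture: give the agent tier a minimal configuration (enough for buffering and enrichment, with a small limit_mib) and concentrate heavy processing (tail sampling, heavy routing) in the gateway, so that gateway scaling tracks total traffic and agent overhead tracks node count. Placing tail sampling in the agent tier by mistake is a common cost error: the agent cannot see complete traces, cannot do tail sampling correctly, and wastes memory on every node.

Gateway-tier processors are an effective place to intercept high-cardinality attributes: before telemetry reaches the backend, use the attributes / transform processors to remove or reduce high-cardinality labels (user id or request id used as a metric label), which is cheaper than governing them after they land in the backend. High-cardinality attributes blow up cost in the downstream backend, so this is another cost line to stop at the collector. It aligns with 4.7 Cardinality governance and cost boundaries.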
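To make the cardinality effect concrete, the sketch below counts distinct label sets before and after dropping high-cardinality keys. It models the idea in plain Python and is not collector configuration; the attribute names in `HIGH_CARDINALITY_KEYS` and the sample data are assumptions.

```python
# Hypothetical attribute names treated as high-cardinality in this example.
HIGH_CARDINALITY_KEYS = {"user.id", "request.id"}


def strip_high_cardinality(attributes: dict) -> dict:
    """Drop attributes whose values are effectively unique per user or request."""
    return {k: v for k, v in attributes.items() if k not in HIGH_CARDINALITY_KEYS}


def distinct_series(points: list) -> int:
    """Each distinct label set becomes its own time series in a metrics backend."""
    return len({tuple(sorted(p.items())) for p in points})


# 1000 data points for one route, each tagged with a different user id.
points = [{"http.route": "/checkout", "user.id": f"u{i}"} for i in range(1000)]
series_before = distinct_series(points)
series_after = distinct_series([strip_high_cardinality(p) for p in points])
```

One route produces a thousand series when user id is a label and a single series once it is removed, which is the difference the backend bill reflects.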

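Integration / next steps

Collector deployment mode is the first decision in adopting OTel; downstream of it are sampling strategy and backend selection. Once the agent + gateway two-tier layout is chosen, tail sampling design attaches to the gateway-tier pipeline, and which backend the exporter points at goes back to the vendor portability reading in "when to move to another service".

Signal governance and data quality in the pipeline go back to 4.11 Telemetry Pipeline architecture and 4.17 Telemetry Data Quality; cardinality interception goes back to 4.7 Cardinality governance and cost boundaries.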

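Datadog OTLP Ingestion and OTel integration

Tue, 23 Jun 2026 00:00:00 +0000

This is a vendor deep article on Datadog that expands the "OTLP ingestion" section of the overview. Readers new to Datadog should read the Datadog service page first.

Problem

Two situations lead a team to need Datadog's OTLP ingestion:

The team already uses Datadog APM, but wants new services or new languages to use the OTel SDK to avoid vendor lock-in. The Datadog SDK covers a limited set of languages (Go / Java / Python / Ruby / Node / .NET / PHP / C++); for services written in Rust, Elixir, or Kotlin Multiplatform, the OTel SDK has broader coverage.

The other situation is a team that has been running OTel + Jaeger or OTel + Grafana and now wants to move visualization to Datadog without re-instrumenting. OTLP ingestion lets traces / metrics / logs produced by the OTel SDK flow straight into Datadog with no application code changes.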

Core concepts

The Datadog Agent's OTLP receiver

Datadog Agent 6.32+ has a built-in OTLP receiver that accepts both gRPC (port 4317) and HTTP (port 4318). After receiving OTLP data, the Agent converts it to Datadog's internal format and sends it through the same pipeline as the Datadog SDK (sampling, tagging, forwarding to the Datadog backend).

This means data on the OTLP path is handled in the Datadog UI the same way as data on the Datadog SDK path: the same APM trace waterfall, the same service map, the same error tracking. The difference is metadata completeness (see feature parity below).

OTLP support for the three signals

Signal	OTLP support	Maps to in Datadog
Traces	Full (OTLP gRPC / HTTP)	APM traces, service map, error tracking
Metrics	Full (OTLP gRPC / HTTP)	Custom metrics (billed per metric)
Logs	Limited (Agent 7.54+ supports OTLP logs)	Datadog Logs (billed by ingestion volume)

OTLP support is most mature for traces, then metrics, with logs the newest. A common practice in mixed environments is to send traces + metrics over OTLP and logs through the Datadog Agent's native log collection (file tailing / container stdout).

Datadog SDK vs OTel SDK feature parity

Feature	Datadog SDK	OTel SDK → Datadog
Distributed tracing	Yes	Yes (full)
Continuous profiling	Yes	No (Datadog proprietary)
ASM (Application Security)	Yes	No (requires the Datadog library)
CI Visibility	Yes	No
Dynamic instrumentation	Yes	No
Runtime metrics (GC, threads)	Automatic	Requires manual OTel metric instrumentation
Log correlation (trace_id injected into logs)	Automatic	Requires manual setup (MDC / context propagation)
Unified service tagging	Automatic (`DD_SERVICE` / `DD_ENV` / `DD_VERSION`)	Requires resource attribute mapping

Reading: if the team needs profiling / ASM / CI Visibility, the affected services still need the Datadog SDK. Other services can use the OTel SDK + OTLP ingestion, and the two coexist in the same Datadog org.

Configuration step by step

Datadog Agent OTLP settings

# datadog.yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

After restarting the Agent, run datadog-agent status to confirm that the OTLP receiver has started.

OTel SDK endpoint configuration

# Environment variables (language-agnostic)
export OTEL_EXPORTER_OTLP_ENDPOINT="http://datadog-agent:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="checkout-api"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.version=1.2.3"

Resource attribute → Datadog tag mapping

The Datadog Agent automatically converts OTel resource attributes into Datadog tags:

OTel resource attribute	Datadog tag	Notes
`service.name`	`service`	Core of Datadog unified service tagging
`deployment.environment`	`env`	Required; otherwise environment filtering in the Datadog UI breaks
`service.version`	`version`	Used for deployment tracking
`host.name`	`host`	Usually added by the Agent; no need to set manually
`container.name`	`container_name`	Added automatically in K8s environments

If deployment.environment is not set as a resource attribute, Datadog files the trace under env:none, where it is nearly invisible in the APM UI. This is the most common OTLP onboarding problem.
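The mapping table and the env:none fallback can be mirrored in a small pre-deployment check. This sketch is not Agent code: `to_datadog_tags` and `parse_resource_attributes` are hypothetical helpers that reproduce the table above so that a missing deployment.environment is caught before traces are sent.

```python
# Mapping copied from the table above.
OTEL_TO_DATADOG = {
    "service.name": "service",
    "deployment.environment": "env",
    "service.version": "version",
    "host.name": "host",
    "container.name": "container_name",
}


def parse_resource_attributes(raw: str) -> dict:
    """Parse an OTEL_RESOURCE_ATTRIBUTES style string of comma-separated key=value pairs."""
    pairs = (item.split("=", 1) for item in raw.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def to_datadog_tags(resource_attributes: dict) -> dict:
    """Translate known resource attributes; a missing environment ends up as env:none."""
    tags = {dd: resource_attributes[otel]
            for otel, dd in OTEL_TO_DATADOG.items() if otel in resource_attributes}
    tags.setdefault("env", "none")
    return tags


configured = to_datadog_tags(
    parse_resource_attributes("deployment.environment=production,service.version=1.2.3"))
missing_env = to_datadog_tags({"service.name": "checkout-api"})
```

A CI step that fails when the computed env tag is "none" catches the onboarding problem before anyone goes looking for missing traces.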

OTel Collector → Datadog (alternative path)

If you do not want applications to connect to the Datadog Agent directly, you can put an OTel Collector in between:

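# otel-collector-config.yaml
# The pipeline references otlp and batch, so both must be defined.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]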

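The OTel Collector's datadog exporter sends data straight to the Datadog backend (bypassing the Agent). It suits teams that already run OTel Collector infrastructure and do not want a Datadog Agent on every node.

Failures and boundaries

Resource attribute mapping out of alignment

OTel service.name values sometimes follow a dotted, package-style convention (such as com.example.checkout), while Datadog services are usually named with hyphens (such as checkout-api). If the two are not aligned, the same service appears as multiple nodes in the Datadog APM service map (one from the OTel path, one from the Datadog SDK path).

Fix: standardize service.name naming. If both SDKs coexist, set the OTel SDK's resource attribute to exactly the same value as the Datadog SDK's DD_SERVICE.

Metric naming convention differences

OTel metrics use dot notation (http.server.request.duration), while Datadog defaults to underscores (http_server_request_duration). The Agent converts automatically (dot → underscore), but if the team has metrics from both the Datadog SDK and the OTel SDK, the two can end up duplicated in Datadog (same meaning, different names).

Fix: use the OTel Collector's metricstransform processor to unify names before export, or merge them in Datadog with a metric alias.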

Limits of log correlation on the OTLP path

The Datadog SDK automatically injects dd.trace_id and dd.span_id into application logs (for example Python logging, Java MDC). The OTel SDK does not; log correlation has to be set up manually by injecting trace_id from the OTel context into the logging framework.

Without log correlation, Datadog's trace → log navigation does not work. The fix depends on the language: Java uses MDC plus the OTel Java agent's log context instrumentation; Python uses opentelemetry-instrumentation-logging; Go requires reading the trace ID from the span context by hand and writing it to a log field.
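For cases where the instrumentation package is not an option, a manual injection can be sketched with the standard logging module. Two assumptions are labeled here: Datadog is assumed to correlate on the 64-bit decimal form of the IDs (the low 64 bits of OTel's 128-bit trace ID), which should be checked against current Datadog guidance, and `TraceContextFilter` with its `current_ids` callback is hypothetical. In a real application the callback would read the IDs from the active OTel span context.

```python
import logging


def to_datadog_ids(trace_id: int, span_id: int) -> dict:
    """Assumed conversion: keep the low 64 bits of the trace ID, render both IDs as decimal."""
    return {"dd.trace_id": str(trace_id & 0xFFFFFFFFFFFFFFFF), "dd.span_id": str(span_id)}


class TraceContextFilter(logging.Filter):
    """Attach correlation IDs to every log record passing through a handler or logger."""

    def __init__(self, current_ids):
        super().__init__()
        # current_ids: callable returning (trace_id, span_id) as ints for the active span.
        self._current_ids = current_ids

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = self._current_ids()
        ids = to_datadog_ids(trace_id, span_id)
        record.dd_trace_id = ids["dd.trace_id"]
        record.dd_span_id = ids["dd.span_id"]
        return True  # never suppress the record, only enrich it


# Fixed IDs stand in for a real span context in this example.
log_filter = TraceContextFilter(lambda: ((1 << 64) + 5, 7))
record = logging.LogRecord("app", logging.INFO, "app.py", 1, "order placed", None, None)
log_filter.filter(record)
```

A formatter can then reference %(dd_trace_id)s and %(dd_span_id)s so the IDs appear in each log line.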

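Capacity and cost

Billing on the OTLP path is the same as on the Datadog SDK path:

Signal	Billing unit	OTLP vs Datadog SDK
APM traces	Per ingested span	Same
Metrics	Per custom metric (unique metric name × tag combination)	Same
Logs	Per ingested GB	Same

The cost difference lies in feature access, not ingestion pricing. Using the OTel SDK gives up Profiling / ASM / CI Visibility, which require the Datadog SDK. If the team needs these features, going OTLP means deploying the Datadog SDK on core services anyway, and the maintenance cost of two SDKs can exceed that of using the Datadog SDK everywhere.

Rule of thumb: if more than 80% of services do not need Profiling / ASM, OTLP plus the Datadog SDK on a few services is a reasonable hybrid. If all core services need Profiling, using the Datadog SDK throughout is simpler.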

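Integration and next steps

Datadog service page: overview and daily operations
Datadog cost governance: Agent configuration and cost control
4.C7 Datadog OTel migration: a governance case for moving from the Datadog SDK to an OTel-compatible mode
OpenTelemetry Collector deployment modes: the OTel Collector → Datadog alternative path
← New Relic migration: the bridging role OTLP plays in a New Relic → Datadog migration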

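4.20 LLM tracing and observability

Tue, 12 May 2026 00:00:00 +0000

LLM tracing encodes every LLM call, tool call, memory op, and handoff as a structured span, standardized with the OpenTelemetry GenAI semantic conventions. It is the de facto standard for debugging, cost monitoring, and quality monitoring of production LLM applications. The string logging of a traditional web app cannot capture the key questions of an LLM application: why the agent took that path, how the reasoning trace unfolded, why a tool call retried three times, why token consumption is 3× higher than expected. This chapter breaks down how LLM tracing works, the OTel GenAI semconv, three main use cases (cost / latency / failure), and the production eval loop into actionable engineering practice.

Chapter goals

After this chapter you should be able to:

Explain how LLM tracing differs from traditional logging.
Design a span structure with the OpenTelemetry GenAI semantic conventions.
Use traces for cost / latency monitoring and failure debugging.
Feed production traces back into LLM-as-judge as a quality loop.
Decide for your own application between self-hosted and SaaS observability platforms.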

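Why traditional logging is not enough

The debugging questions of an LLM application are too abstract for traditional logging:

Scenario	What logging shows	What is actually needed
Why the agent chose tool A over tool B	One line: `tool=A`	Full reasoning trace + context at the time + tool list
Why token cost is high	`tokens=15234`	Input / output / cached token breakdown + per-turn accumulation
Why TTFT is 5 seconds	`ttft=5012ms`	Prefill and cache misses, prompt length, queue time
Why a tool retried three times	`tool error retry`	Each error message + the LLM's interpretation + retry strategy
Why the agent looped forever	Masses of repeated logs	Context at each iteration + why it never decided to terminate

LLM tracing encodes this information directly through structured spans, parent-child relationships, and standardized attributes.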

OpenTelemetry GenAI semantic conventions

The OTel GenAI semconv is a trace schema that has been under standardization through 2024-2025. Core concepts:

Trace (one user query, from arrival to response)
  ├── Span: gen_ai.agent.invocation (agent loop iteration 1)
  │     ├── Span: gen_ai.client.operation (LLM call 1)
  │     │     attrs: model, temperature, input_tokens, output_tokens, cache_read
  │     ├── Span: gen_ai.tool.execution (tool: read_file)
  │     │     attrs: tool_name, input, output, duration
  │     └── Span: gen_ai.memory.read (retrieval)
  │           attrs: query, top_k, similarity_scores
  ├── Span: gen_ai.agent.invocation (iteration 2)
  │     └── ...
  └── Span: gen_ai.agent.terminate
        attrs: reason, total_tokens, total_cost

Main attribute categories:

Category	Attribute prefix	Typical content
Model	`gen_ai.request.*`	model, temperature, top_p, max_tokens, stream
Usage	`gen_ai.usage.*`	input_tokens, output_tokens, cached_tokens
Response	`gen_ai.response.*`	finish_reason, id
Tool	`gen_ai.tool.*`	name, parameters, result
Memory	`gen_ai.memory.*`	operation, store, query, hits
Cost	`gen_ai.cost.*`	usd, currency (vendor-specific)

Implementation outline (Python example):

from opentelemetry import trace
from openinference.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

# llm_client is a placeholder for your own LLM client object.
with tracer.start_as_current_span("gen_ai.client.operation") as span:
    span.set_attribute(SpanAttributes.LLM_MODEL_NAME, "claude-sonnet-4-6")
    # Temperature is recorded under the gen_ai.request.* prefix as a plain string key.
    span.set_attribute("gen_ai.request.temperature", 0.7)

    response = llm_client.chat(messages=...)

    # Token usage is read from the response and attached to the same span.
    span.set_attribute(SpanAttributes.LLM_TOKEN_COUNT_PROMPT, response.usage.input_tokens)
    span.set_attribute(SpanAttributes.LLM_TOKEN_COUNT_COMPLETION, response.usage.output_tokens)
    span.set_attribute("gen_ai.usage.cached_tokens", response.usage.cache_read_tokens or 0)

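In practice most teams use framework auto-instrumentation (LangChain, LlamaIndex, and the Anthropic SDK all have OTel integrations) and do not hand-write spans.

Use case 1: Cost monitoring

Traces are the core of cost monitoring for LLM applications: token usage attributes are built in, so nothing has to be computed separately.

Implementation pattern: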

1. Record input_tokens / output_tokens / cached_tokens on the trace side
2. The observability platform applies a per-model pricing table to compute USD
3. Aggregate by:
   - User (which user burns the most)
   - Endpoint (which API path costs the most)
   - Feature (which feature uses the most tokens)
   - Time (which day spiked)
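The pattern above can be sketched in a few lines. All figures are hypothetical: the model name, the prices in `PRICE_PER_MTOK`, and the span records are made up for illustration, and the code assumes cached_tokens is a subset of input_tokens, which varies by vendor.

```python
from collections import defaultdict

# Hypothetical USD prices per million tokens; real prices vary by vendor and date.
PRICE_PER_MTOK = {"model-a": {"input": 3.0, "cached": 0.5, "output": 15.0}}


def span_cost_usd(span: dict) -> float:
    """Price one LLM span, charging cached input tokens at the cached rate."""
    price = PRICE_PER_MTOK[span["model"]]
    cached = span.get("cached_tokens", 0)
    uncached = span["input_tokens"] - cached  # assumes cached tokens are counted in input
    total = (uncached * price["input"] + cached * price["cached"]
             + span["output_tokens"] * price["output"])
    return total / 1_000_000


def cost_by(spans: list, key: str) -> dict:
    """Sum span cost per value of one dimension (user, endpoint, feature, day)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span_cost_usd(span)
    return dict(totals)


spans = [
    {"user": "u1", "model": "model-a", "input_tokens": 2_000_000,
     "cached_tokens": 1_000_000, "output_tokens": 1_000_000},
    {"user": "u2", "model": "model-a", "input_tokens": 1_000_000, "output_tokens": 0},
]
per_user = cost_by(spans, "user")
```

Keeping the pricing table outside the trace means a price change only requires recomputing from stored token counts, not re-instrumenting.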

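Typical dashboard metrics:

Metric	What it tells you
Total cost / day	Overall spending trend
Cost per user	Finds power users or abuse
Cost per request	Average cost of a single request; a basis for alerts
Cached / total token ratio	Prompt cache hit rate
Output / input token ratio	Output inflation; whether generation length is reasonable

Use case 2: Latency / failure debug

A trace naturally encodes a latency tree, so it can pinpoint which span is stuck: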

User query → response total: 5.2s
├── Agent iteration 1: 4.8s
│   ├── LLM call (claude): 4.2s     ← most of the time is here
│   │   - prefill: 3.8s             ← prefill too long; check whether the prompt needs caching
│   │   - generation: 0.4s
│   ├── tool: read_file: 0.5s
│   └── memory: retrieval: 0.1s
└── Agent iteration 2: 0.4s

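This trace shows, without guessing, that about 90% of the LLM call (3.8 s of 4.2 s, roughly 73% of the whole request) is spent in prefill, and that enabling prompt caching is the fix to try.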
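The same reading can be computed rather than eyeballed. This sketch uses a hypothetical dict layout for a span and a helper named `dominant_child`; the durations are the ones from the latency tree above.

```python
def dominant_child(span: dict) -> tuple:
    """Return (name, share) for the child that takes the largest part of its parent's time."""
    child = max(span["children"], key=lambda c: c["duration_s"])
    return child["name"], child["duration_s"] / span["duration_s"]


# Durations taken from the LLM call in the latency tree.
llm_call = {
    "name": "LLM call (claude)",
    "duration_s": 4.2,
    "children": [
        {"name": "prefill", "duration_s": 3.8},
        {"name": "generation", "duration_s": 0.4},
    ],
}
name, share_of_llm_call = dominant_child(llm_call)
share_of_request = 3.8 / 5.2  # the same prefill time against the whole request
```

Reporting the share against both the parent span and the whole request avoids overstating how much a fix to one span can save end to end.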

Failure debug:

User query → response: ERROR
├── Agent iteration 1: success
│   └── LLM call: tool_call(run_bash, cmd="rm -rf /")
├── Agent iteration 2: failure
│   └── tool: run_bash: REJECTED by permission system
└── Agent fallback: error response

The trace shows that the user query led the LLM to issue a dangerous tool call and the permission system correctly blocked it. The error comes from that rejection, not from the LLM misbehaving on its own.

This corresponds to the 6.2 tool use permission model and the hands-on permission-boundary reading.

Use case 3: Production trace → eval loop

Production traces are the best data source for LLM-as-judge:

Production users
   ↓ generate traces
Trace storage (LangSmith / Phoenix / Langfuse)
   ↓ filter (e.g. traces with a user thumbs-down)
   ↓ sample N
LLM-as-judge eval
   ↓ rubric scoring
Find systematic problems (which query classes have poor quality)
   ↓
Change the system prompt / tool / agent loop
   ↓
A/B test on production traces

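This is the concrete implementation of the "in-house benchmark" mentioned in 4.14 benchmarking: production traces are the most realistic benchmark dataset.

Choosing among mainstream platforms

Platform	Type	Strengths	Best fit
LangSmith	SaaS (LangChain family)	Strong auto-instrumentation, complete UI	LangChain / LangGraph users
Phoenix	OSS + SaaS (Arize family)	OpenInference standard, self-hostable	Teams wanting self-host + OTel native
Langfuse	OSS + SaaS	Strong open source, good cost monitoring	Cost / eval focus, self-hostable
Braintrust	SaaS	Eval and tracing in one	Teams with heavy eval workflows
Datadog APM	SaaS	Integrates with traditional APM	Already on Datadog, want unified monitoring
Logfire	SaaS (Pydantic)	Simple, Python-first	Python-first, lightweight
Self-host OTel + Jaeger	OSS	Fully self-hosted, cheapest	Privacy-sensitive, cost-sensitive, technically strong teams

Reading:

Individual / low traffic: a SaaS free tier (LangSmith / Langfuse / Phoenix) is enough
Privacy-sensitive (user data cannot leave the machine): self-host (Langfuse / Phoenix self-hosted, or OTel + Jaeger)
Existing observability stack: use OTel with the existing Datadog / Grafana and do not add another layer
Eval-heavy: Braintrust and Langfuse have strong eval features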

Relationship to 4.5 production resource

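4.5 covers the 6 dimensions of production resource (concurrency / latency / cost / storage / observability / reliability). Observability is only touched on there and is expanded in this chapter: after 4.5 the reader knows observability is needed, and this chapter supplies how to do it.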

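Design failure modes

Over-instrumentation: every internal function gets a span, trace overhead is high, and production ends up noisy

Mitigation: focus on LLM-related operations and cross-service boundaries; internal logic does not need tracing

PII / sensitive data written into span attributes: user prompts and API keys become visible to the SaaS platform

Mitigation: pass span attributes through a PII filter, hash or mask sensitive data, and combine with 6.4 cross-cloud boundaries

No sampling: tracing 100% of production blows up storage and cost

Mitigation: production sample rate < 10%, with 100% capture of errors and outliers

No trace retention period: traces keep accumulating, and old traces nobody reads still cost storage

Mitigation: an explicit retention policy (for example 7-30 days hot, then archive or delete)

Traces not linked to metrics: traces are samples and metrics are aggregates, and debugging needs both together

Mitigation: also emit cost / latency as metrics (Prometheus and similar), and use traces for debugging specific instances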
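Two of the mitigations above, sampling and attribute masking, can be sketched with the standard library. The names are hypothetical (`keep_trace`, `mask_attributes`, and the keys in `SENSITIVE_KEYS`), and the sampling decision hashes the trace ID so that the same trace always gets the same answer; real platforms ship their own samplers and redaction hooks.

```python
import hashlib

# Hypothetical attribute names that must never leave the process in clear text.
SENSITIVE_KEYS = {"user.prompt", "api_key"}


def keep_trace(trace_id: str, is_error: bool, sample_rate: float = 0.05) -> bool:
    """Always keep error traces; keep a deterministic fraction of the rest."""
    if is_error:
        return True
    bucket = int.from_bytes(hashlib.sha256(trace_id.encode()).digest()[:8], "big")
    # bucket is uniform over [0, 2**64); keep it if it falls in the lowest sample_rate share.
    return bucket < sample_rate * 2**64


def mask_attributes(attributes: dict) -> dict:
    """Replace sensitive values with a short hash so spans stay comparable but unreadable."""
    masked = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            masked[key] = "sha256:" + hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:
            masked[key] = value
    return masked


safe = mask_attributes({"user.prompt": "my card number is ...", "gen_ai.request.model": "m"})
```

Hashing low-entropy values such as short prompts is still guessable by brute force, so for those cases dropping the attribute entirely is the safer choice.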

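When tracing is not needed

Pure demo / personal tinkering: string logs are enough
A single LLM call with no agent loop: simple enough to debug by grepping logs
Extremely privacy-sensitive without self-hosting: trace content flowing to a SaaS crosses a boundary, so assess the risk
Per-request tracing overhead exceeds the benefit: in ultra-low-latency scenarios, check whether it is worth it

What will and will not go out of date

What will not go out of date:

The fundamental difference between LLM tracing and traditional logging
The framing of structured spans with parent-child relationships
The three main use cases: cost monitoring / latency debug / failure debug
The concept of a trace → eval loop
The 5 design failure modes

What will change:

The specific attribute names in the OTel GenAI semconv (still stabilizing)
Mainstream SaaS platforms (1-2 new entrants per year)
Auto-instrumentation coverage (still expanding)
How integration with specific frameworks works

Next chapter: 4.21 LLM-as-judge evaluation methods, turning production traces into a systematic eval loop.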