<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Tracing on Tarragon</title><link>https://tarrragon.github.io/blog/tags/tracing/</link><description>Recent content in Tracing on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Mon, 22 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/tracing/index.xml" rel="self" type="application/rss+xml"/><item><title>4.20 LLM tracing and observability</title><link>https://tarrragon.github.io/blog/llm/04-applications/llm-tracing-and-observability/</link><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/llm/04-applications/llm-tracing-and-observability/</guid><description>&lt;p>&lt;a href="https://tarrragon.github.io/blog/llm/knowledge-cards/llm-tracing/" data-link-title="LLM Tracing" data-link-desc="把 LLM 應用的每次 LLM call / tool call / memory op 編成結構化 span、用 OpenTelemetry GenAI semantic conventions 標準化">LLM tracing&lt;/a> encodes every LLM call, tool call, memory op, and handoff as a structured span, standardized with the OpenTelemetry GenAI semantic conventions, and is the de facto standard for debug, cost, and quality monitoring in production LLM applications. The string logging used in traditional web apps cannot answer the questions that matter in an LLM application: why the agent took that path, how the reasoning trace was derived, why a tool call retried three times, why token consumption is 3× higher than expected. This chapter breaks down how LLM tracing works, the OTel GenAI semconv, the three main use cases (cost / latency / failure), and the production eval loop into actionable engineering practice.&lt;/p>
&lt;h2 id="本章目標">Chapter goals&lt;/h2>
&lt;p>After reading this chapter, you should be able to:&lt;/p>
&lt;ol>
&lt;li>Explain how LLM tracing differs from traditional logging.&lt;/li>
&lt;li>Design a span structure with the OpenTelemetry GenAI semantic conventions.&lt;/li>
&lt;li>Use traces for cost / latency monitoring and failure debugging.&lt;/li>
&lt;li>Feed production traces back into &lt;a href="https://tarrragon.github.io/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">LLM-as-judge&lt;/a> to close the quality loop.&lt;/li>
&lt;li>Decide whether your own application should use a self-hosted or a SaaS observability platform.&lt;/li>
&lt;/ol>
&lt;h2 id="traditional-logging-為什麼不夠">Why traditional logging falls short&lt;/h2>
&lt;p>The questions you need to answer when debugging an LLM application are too high-level for traditional logging to capture:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Scenario&lt;/th>
 &lt;th>What logging shows&lt;/th>
 &lt;th>What you actually need&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Why the agent chose tool A over tool B&lt;/td>
 &lt;td>&lt;code>tool=A&lt;/code>, a single line&lt;/td>
 &lt;td>Full reasoning trace + current context + tool list&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Why token cost is high&lt;/td>
 &lt;td>&lt;code>tokens=15234&lt;/code>&lt;/td>
 &lt;td>Input / output / cached token breakdown + per-turn accumulation&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Why TTFT is 5 seconds&lt;/td>
 &lt;td>&lt;code>ttft=5012ms&lt;/code>&lt;/td>
 &lt;td>Prefill time and cache misses, prompt length, queue time&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Why a tool retried three times&lt;/td>
 &lt;td>&lt;code>tool error retry&lt;/code>&lt;/td>
 &lt;td>Each error message + the LLM's interpretation + retry strategy&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Why the agent is stuck in an infinite loop&lt;/td>
 &lt;td>Many repeated log lines&lt;/td>
 &lt;td>Context at each iteration + why it did not decide to terminate&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
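&lt;p>LLM tracing encodes this information directly through structured spans, parent-child relationships, and standardized attributes.&lt;/p></description><content:encoded><![CDATA[<p><a href="/blog/llm/knowledge-cards/llm-tracing/" data-link-title="LLM Tracing" data-link-desc="把 LLM 應用的每次 LLM call / tool call / memory op 編成結構化 span、用 OpenTelemetry GenAI semantic conventions 標準化">LLM tracing</a> encodes every LLM call, tool call, memory op, and handoff as a structured span, standardized with the OpenTelemetry GenAI semantic conventions, and is the de facto standard for debug, cost, and quality monitoring in production LLM applications. The string logging used in traditional web apps cannot answer the questions that matter in an LLM application: why the agent took that path, how the reasoning trace was derived, why a tool call retried three times, why token consumption is 3× higher than expected. This chapter breaks down how LLM tracing works, the OTel GenAI semconv, the three main use cases (cost / latency / failure), and the production eval loop into actionable engineering practice.</p>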
<h2 id="本章目標">Chapter goals</h2>
<p>After reading this chapter, you should be able to:</p>
<ol>
<li>Explain how LLM tracing differs from traditional logging.</li>
<li>Design a span structure with the OpenTelemetry GenAI semantic conventions.</li>
<li>Use traces for cost / latency monitoring and failure debugging.</li>
<li>Feed production traces back into <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">LLM-as-judge</a> to close the quality loop.</li>
<li>Decide whether your own application should use a self-hosted or a SaaS observability platform.</li>
</ol>
<h2 id="traditional-logging-為什麼不夠">Why traditional logging falls short</h2>
<p>The questions you need to answer when debugging an LLM application are too high-level for traditional logging to capture:</p>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>What logging shows</th>
          <th>What you actually need</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Why the agent chose tool A over tool B</td>
          <td><code>tool=A</code>, a single line</td>
          <td>Full reasoning trace + current context + tool list</td>
      </tr>
      <tr>
          <td>Why token cost is high</td>
          <td><code>tokens=15234</code></td>
          <td>Input / output / cached token breakdown + per-turn accumulation</td>
      </tr>
      <tr>
          <td>Why TTFT is 5 seconds</td>
          <td><code>ttft=5012ms</code></td>
          <td>Prefill time and cache misses, prompt length, queue time</td>
      </tr>
      <tr>
          <td>Why a tool retried three times</td>
          <td><code>tool error retry</code></td>
          <td>Each error message + the LLM's interpretation + retry strategy</td>
      </tr>
      <tr>
          <td>Why the agent is stuck in an infinite loop</td>
          <td>Many repeated log lines</td>
          <td>Context at each iteration + why it did not decide to terminate</td>
      </tr>
  </tbody>
</table>
<p>LLM tracing encodes this information directly through structured spans, parent-child relationships, and standardized attributes.</p>
<h2 id="opentelemetry-genai-semantic-conventions">OpenTelemetry GenAI semantic conventions</h2>
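<p>OTel GenAI semconv is a trace schema that was still being standardized across 2024-2025. The tree below is an illustrative layout; exact span and attribute names depend on the semconv version you target. Core concepts:</p>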





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Trace (one user query, from arrival to response)
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  ├── Span: gen_ai.agent.invocation (agent loop iteration 1)
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  │     ├── Span: gen_ai.client.operation (LLM call 1)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">  │     │     attrs: model, temperature, input_tokens, output_tokens, cache_read
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">  │     ├── Span: gen_ai.tool.execution (tool: read_file)
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">  │     │     attrs: tool_name, input, output, duration
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">  │     └── Span: gen_ai.memory.read (retrieval)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">  │           attrs: query, top_k, similarity_scores
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">  ├── Span: gen_ai.agent.invocation (iteration 2)
</span></span><span class="line"><span class="ln">10</span><span class="cl">  │     └── ...
</span></span><span class="line"><span class="ln">11</span><span class="cl">  └── Span: gen_ai.agent.terminate
</span></span><span class="line"><span class="ln">12</span><span class="cl">        attrs: reason, total_tokens, total_cost</span></span></code></pre></div><p>Main attribute categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Attribute prefix</th>
          <th>Typical contents</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Model</td>
          <td><code>gen_ai.request.*</code></td>
          <td>model, temperature, top_p, max_tokens, stream</td>
      </tr>
      <tr>
          <td>Usage</td>
          <td><code>gen_ai.usage.*</code></td>
          <td>input_tokens, output_tokens, cached_tokens</td>
      </tr>
      <tr>
          <td>Response</td>
          <td><code>gen_ai.response.*</code></td>
          <td>finish_reason, id</td>
      </tr>
      <tr>
          <td>Tool</td>
          <td><code>gen_ai.tool.*</code></td>
          <td>name, parameters, result</td>
      </tr>
      <tr>
          <td>Memory</td>
          <td><code>gen_ai.memory.*</code></td>
          <td>operation, store, query, hits</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td><code>gen_ai.cost.*</code></td>
          <td>usd, currency (vendor-specific)</td>
      </tr>
  </tbody>
</table>
<p>Implementation sketch (Python example):</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kn">from</span> <span class="nn">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="n">tracer</span> <span class="o">=</span> <span class="n">trace</span><span class="o">.</span><span class="n">get_tracer</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># llm_client is a placeholder for your own client object</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="k">with</span> <span class="n">tracer</span><span class="o">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s2">&#34;gen_ai.client.operation&#34;</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">    <span class="n">span</span><span class="o">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s2">&#34;gen_ai.request.model&#34;</span><span class="p">,</span> <span class="s2">&#34;claude-sonnet-4-6&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="n">span</span><span class="o">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s2">&#34;gen_ai.request.temperature&#34;</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">
</span></span><span class="line"><span class="ln">10</span><span class="cl">    <span class="n">response</span> <span class="o">=</span> <span class="n">llm_client</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="n">messages</span><span class="o">=...</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">11</span><span class="cl">
</span></span><span class="line"><span class="ln">12</span><span class="cl">    <span class="n">span</span><span class="o">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s2">&#34;gen_ai.usage.input_tokens&#34;</span><span class="p">,</span> <span class="n">response</span><span class="o">.</span><span class="n">usage</span><span class="o">.</span><span class="n">input_tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl">    <span class="n">span</span><span class="o">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s2">&#34;gen_ai.usage.output_tokens&#34;</span><span class="p">,</span> <span class="n">response</span><span class="o">.</span><span class="n">usage</span><span class="o">.</span><span class="n">output_tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="c1"># The cached-token key varies across semconv versions; check the one you target</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="n">span</span><span class="o">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s2">&#34;gen_ai.usage.cached_tokens&#34;</span><span class="p">,</span> <span class="n">response</span><span class="o">.</span><span class="n">usage</span><span class="o">.</span><span class="n">cache_read_tokens</span> <span class="ow">or</span> <span class="mi">0</span><span class="p">)</span></span></span></code></pre></div><p>In practice, most teams rely on framework auto-instrumentation (LangChain, LlamaIndex, and the Anthropic SDK all have OTel integrations), so spans rarely need to be written by hand.</p>
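<h2 id="use-case-1cost-monitoring">Use case 1: Cost monitoring</h2>
<p>Traces are the core of cost monitoring for LLM applications: token usage is already recorded as span attributes, so nothing has to be counted separately.</p>
<p>Implementation pattern:</p>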





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">1. The trace records input_tokens / output_tokens / cached_tokens
</span></span><span class="line"><span class="ln">2</span><span class="cl">2. The observability platform converts them to USD with a per-model pricing table
</span></span><span class="line"><span class="ln">3</span><span class="cl">3. Aggregate by:
</span></span><span class="line"><span class="ln">4</span><span class="cl">   - User (which user spends the most)
</span></span><span class="line"><span class="ln">5</span><span class="cl">   - Endpoint (which API path is most expensive)
</span></span><span class="line"><span class="ln">6</span><span class="cl">   - Feature (which feature consumes the most tokens)
</span></span><span class="line"><span class="ln">7</span><span class="cl">   - Time (which day spiked)</span></span></code></pre></div><p>Typical dashboard metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>What it tells you</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total cost / day</td>
          <td>Overall spending trend</td>
      </tr>
      <tr>
          <td>Cost per user</td>
          <td>Finds power users or abuse</td>
      </tr>
      <tr>
          <td>Cost per request</td>
          <td>Average cost of a single request; a good basis for alerts</td>
      </tr>
      <tr>
          <td>Cached / total token ratio</td>
          <td><a href="/blog/llm/knowledge-cards/prompt-cache/" data-link-title="Prompt Cache" data-link-desc="重複出現的 prompt prefix 在推論伺服器或 LLM 服務端被 cache、後續 query 跳過 prefill、大幅降 cost 跟 TTFT">Prompt cache</a> hit rate</td>
      </tr>
      <tr>
          <td>Output / input token ratio</td>
          <td>Output expansion; shows whether generation length is reasonable</td>
      </tr>
  </tbody>
</table>
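<p>The pricing and aggregation steps of the implementation pattern above can be sketched in a few lines of Python. This is a minimal sketch, not any platform's implementation: the <code>PRICING</code> table, the model name <code>example-model</code>, the prices, and the flat span dictionaries are all hypothetical, and it assumes cached tokens are reported as part of <code>input_tokens</code>, which differs between vendors.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from collections import defaultdict

# Hypothetical prices in USD per million tokens; real prices vary by vendor and date.
PRICING = {
    "example-model": {"input": 3.00, "output": 15.00, "cached": 0.30},
}


def span_cost_usd(span: dict) -> float:
    """Price one LLM-call span from its token-usage attributes."""
    price = PRICING[span["model"]]
    cached = span.get("cached_tokens", 0)
    # Assumption: cached tokens are included in input_tokens and billed at the cached rate.
    uncached = span["input_tokens"] - cached
    total = (
        uncached * price["input"]
        + cached * price["cached"]
        + span["output_tokens"] * price["output"]
    )
    return total / 1_000_000


def cost_by(spans: list, key: str) -> dict:
    """Sum span cost grouped by an attribute such as user, endpoint, or feature."""
    totals = defaultdict(float)
    for span in spans:
        group = span[key]
        totals[group] += span_cost_usd(span)
    return dict(totals)


spans = [
    {"model": "example-model", "user": "alice", "input_tokens": 10_000,
     "cached_tokens": 8_000, "output_tokens": 500},
    {"model": "example-model", "user": "bob", "input_tokens": 2_000,
     "output_tokens": 1_000},
]
print(cost_by(spans, "user"))
</code></pre></div>
<p>Grouping by <code>endpoint</code> or <code>feature</code> works the same way once those keys are recorded on the span.</p>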
<h2 id="use-case-2latency--failure-debug">Use case 2: Latency / failure debug</h2>
<p>A trace naturally encodes a latency tree, so you can pinpoint which span is slow:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">User query → response total: 5.2s
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── Agent iteration 1: 4.8s
</span></span><span class="line"><span class="ln">3</span><span class="cl">│   ├── LLM call (claude): 4.2s     ← most of the time is spent here
</span></span><span class="line"><span class="ln">4</span><span class="cl">│   │   - prefill: 3.8s             ← prefill is too slow; check whether the prompt should be cached
</span></span><span class="line"><span class="ln">5</span><span class="cl">│   │   - generation: 0.4s
</span></span><span class="line"><span class="ln">6</span><span class="cl">│   ├── tool: read_file: 0.5s
</span></span><span class="line"><span class="ln">7</span><span class="cl">│   └── memory: retrieval: 0.1s
</span></span><span class="line"><span class="ln">8</span><span class="cl">└── Agent iteration 2: 0.4s</span></span></code></pre></div><p>This trace shows that about 90% of the LLM call (3.8s of 4.2s) is spent in prefill, so enabling the prompt cache is the likely fix. No guessing is required.</p>
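<p>Reading the tree can also be automated. The sketch below assumes a hypothetical nested-dictionary representation of the spans rather than any tracing SDK's data model; it follows the slowest child at each level and computes how much of a parent's time a child consumes.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Span durations in seconds, copied from the latency tree above.
trace = {
    "name": "agent iteration 1", "duration": 4.8,
    "children": [
        {"name": "llm call", "duration": 4.2, "children": [
            {"name": "prefill", "duration": 3.8, "children": []},
            {"name": "generation", "duration": 0.4, "children": []},
        ]},
        {"name": "tool: read_file", "duration": 0.5, "children": []},
        {"name": "memory: retrieval", "duration": 0.1, "children": []},
    ],
}


def slowest_path(span: dict) -> list:
    """Follow the slowest child at each level down to a leaf span."""
    path = [span["name"]]
    while span["children"]:
        span = max(span["children"], key=lambda child: child["duration"])
        path.append(span["name"])
    return path


def share(child_seconds: float, parent_seconds: float) -> float:
    """Fraction of the parent's duration spent in the child."""
    return child_seconds / parent_seconds


print(slowest_path(trace))
print(round(share(3.8, 4.2), 2))  # prefill share of the LLM call
print(round(share(3.8, 5.2), 2))  # prefill share of the whole request
</code></pre></div>
<p>For the numbers above, prefill accounts for about 90% of the LLM call and about 73% of the whole 5.2s request.</p>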
<p>Failure debug:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">User query → response: ERROR
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── Agent iteration 1: success
</span></span><span class="line"><span class="ln">3</span><span class="cl">│   └── LLM call: tool_call(run_bash, cmd=&#34;rm -rf /&#34;)
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── Agent iteration 2: failure
</span></span><span class="line"><span class="ln">5</span><span class="cl">│   └── tool: run_bash: REJECTED by permission system
</span></span><span class="line"><span class="ln">6</span><span class="cl">└── Agent fallback: error response
</span></span><span class="line"><span class="ln">7</span><span class="cl">
</span></span><span class="line"><span class="ln">8</span><span class="cl">Reading the trace: the tool call was blocked by the permission system. The LLM did not misbehave at random; the user query triggered a dangerous tool call, and the permission system correctly blocked it.</span></span></code></pre></div><p>This matches the reading in <a href="/blog/llm/06-security/tool-use-permission-model/" data-link-title="6.2 tool use 與 MCP server 的權限模型" data-link-desc="個人 dev 場景下 tool use / MCP server 的副作用權限：檔案系統 / shell / 網路存取邊界、第三方 MCP 信任、副作用的可逆性">6.2 tool use permission model</a> and <a href="/blog/llm/01-local-llm-services/hands-on/permission-boundary/" data-link-title="Hands-on：Ollama 改檔案 / 寫程式碼的權限邊界在哪" data-link-desc="四組對照實驗：Ollama 自己沒 FS / shell 權限、wrapper 才有；--dry-run / --confirm / --auto 三檔審查粒度的取捨">hands-on permission-boundary</a>.</p>
<h2 id="use-case-3production-trace--eval-loop">Use case 3: Production trace → eval loop</h2>
<p>Production traces are the best data source for <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">LLM-as-judge</a>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Production users
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">   ↓ generate traces
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">Trace storage (LangSmith / Phoenix / Langfuse)
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">   ↓ filter (e.g. traces with a user thumbs-down)
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">   ↓ sample N
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">LLM-as-judge eval
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">   ↓ rubric scoring
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">Find systematic problems (which query types have poor quality)
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">10</span><span class="cl">Change the system prompt / tool / agent loop
</span></span><span class="line"><span class="ln">11</span><span class="cl">   ↓
</span></span><span class="line"><span class="ln">12</span><span class="cl">A/B test on production traces</span></span></code></pre></div><p>This is the concrete implementation of the "in-house benchmark" mentioned in <a href="/blog/llm/04-applications/benchmarking-and-evaluation/" data-link-title="4.14 Benchmarking 與評估方法論" data-link-desc="判讀 model card benchmark 數字、做自己工作流的 in-house benchmark、量測本地推論速度的完整方法論">4.14 benchmarking</a>: production traces are the most realistic benchmark dataset.</p>
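<p>The filter and sample steps of the loop can be sketched as follows. The trace dictionaries and the <code>feedback</code> field are hypothetical stand-ins for whatever your trace store exports, and the judge call itself is left out.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import random


def select_for_eval(traces: list, n: int, seed: int = 0) -> list:
    """Keep thumbs-down traces, then draw up to n of them at random."""
    flagged = [t for t in traces if t.get("feedback") == "thumbs_down"]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(flagged, min(n, len(flagged)))


traces = [
    {"id": "t1", "feedback": "thumbs_up"},
    {"id": "t2", "feedback": "thumbs_down"},
    {"id": "t3", "feedback": "thumbs_down"},
    {"id": "t4"},  # no feedback recorded
]
batch = select_for_eval(traces, n=5)
print(sorted(t["id"] for t in batch))
</code></pre></div>
<p>The selected batch would then be scored against the rubric by the judge model, and the scores grouped by query type to surface systematic problems.</p>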
<h2 id="主流平台選型">Choosing among the mainstream platforms</h2>
<table>
  <thead>
      <tr>
          <th>Platform</th>
          <th>Type</th>
          <th>Strengths</th>
          <th>Best for</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LangSmith</td>
          <td>SaaS (LangChain ecosystem)</td>
          <td>Strong auto-instrumentation, complete UI</td>
          <td>LangChain / LangGraph users</td>
      </tr>
      <tr>
          <td>Phoenix</td>
          <td>OSS + SaaS (Arize ecosystem)</td>
          <td>OpenInference standard, self-hostable</td>
          <td>Teams that want self-hosting + OTel native</td>
      </tr>
      <tr>
          <td>Langfuse</td>
          <td>OSS + SaaS</td>
          <td>Strong open-source offering, good cost monitoring</td>
          <td>Cost / eval focused, self-hostable</td>
      </tr>
      <tr>
          <td>Braintrust</td>
          <td>SaaS</td>
          <td>Eval + tracing in one product</td>
          <td>Teams with heavy eval workflows</td>
      </tr>
      <tr>
          <td>Datadog APM</td>
          <td>SaaS</td>
          <td>Integrates with traditional APM</td>
          <td>Already on Datadog and want unified monitoring</td>
      </tr>
      <tr>
          <td>Logfire</td>
          <td>SaaS (Pydantic)</td>
          <td>Minimal, Python-first</td>
          <td>Python-first, lightweight setups</td>
      </tr>
      <tr>
          <td>Self-host OTel + Jaeger</td>
          <td>OSS</td>
          <td>Fully self-hosted, cheapest</td>
          <td>Privacy-sensitive, cost-sensitive, technically strong teams</td>
      </tr>
  </tbody>
</table>
<p>How to choose:</p>
<ol>
<li><strong>Personal / low traffic</strong>: a SaaS free tier (LangSmith / Langfuse / Phoenix) is enough</li>
<li><strong>Privacy-sensitive (user data must not leave the machine)</strong>: self-host (self-hosted Langfuse / Phoenix, or OTel + Jaeger)</li>
<li><strong>Existing observability stack</strong>: use OTel with your existing Datadog / Grafana instead of adding another layer</li>
<li><strong>Eval-heavy</strong>: Braintrust and Langfuse have strong eval features</li>
</ol>
<h2 id="跟-49-production-resource-的關係">Relationship to <a href="/blog/llm/04-applications/production-resource-planning/" data-link-title="4.9 Production 部署的資源評估原理" data-link-desc="從本地單 user 到 production multi-tenant：concurrent users、cost model、observability、SLA、capacity planning 的設計取捨">4.9 production resource</a></h2>
<p>4.9 covers the six dimensions of production resources (concurrency / latency / cost / storage / observability / reliability). Observability is only touched on there and is expanded in this chapter: after 4.9 the reader knows that observability is needed, and this chapter fills in how to do it.</p>
<h2 id="設計失敗模式">Design failure modes</h2>
<ol>
<li><strong>Over-instrumentation</strong>: adding a span to every internal function inflates trace overhead and fills production traces with noise</li>
</ol>
<p><strong>Mitigation</strong>: focus on LLM-related operations and cross-service boundaries; internal logic does not need tracing</p>
<ol start="2">
<li><strong>PII / sensitive data written into span attributes</strong>: user prompts and API keys become visible to the SaaS platform</li>
</ol>
<p><strong>Mitigation</strong>: run span attributes through a PII filter, hash or mask sensitive data, and combine this with <a href="/blog/llm/06-security/cross-cloud-local-data-boundary/" data-link-title="6.4 跨雲端 / 本地的資料邊界" data-link-desc="個人 dev 場景下混用雲端 LLM 跟本地 LLM 時的 prompt 洩漏點：Continue.dev 多 provider 設定、隱私資料流、按敏感度分流的判讀">6.4 cross-cloud boundary</a></p>
<ol start="3">
<li><strong>No sampling</strong>: tracing 100% of production traffic blows up storage and cost</li>
</ol>
<p><strong>Mitigation</strong>: keep the production sample rate below 10% and capture errors / outliers at 100%</p>
<ol start="4">
<li><strong>No trace retention period</strong>: traces keep piling up, and old traces that nobody reads still incur storage cost</li>
</ol>
<p><strong>Mitigation</strong>: set an explicit retention policy (e.g. 7-30 days hot, then archive or delete)</p>
<ol start="5">
<li><strong>Traces not linked to metrics</strong>: traces are samples, metrics are aggregates, and debugging needs both</li>
</ol>
<p><strong>Mitigation</strong>: also emit cost / latency as metrics (Prometheus, etc.) and use traces to debug specific instances</p>
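<p>Mitigations 2 and 3 above can be sketched as two small functions. This is a minimal illustration under stated assumptions: the attribute names in <code>SENSITIVE_KEYS</code>, the 5% rate, and the 5-second outlier threshold are hypothetical, and a truncated hash is only a placeholder for a real PII filter. Because the keep decision looks at the finished trace, it corresponds to tail-based sampling.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import hashlib
import random

SENSITIVE_KEYS = {"user.prompt", "api_key"}  # hypothetical attribute names


def redact(attributes: dict) -> dict:
    """Replace sensitive attribute values with a short SHA-256 digest."""
    cleaned = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            cleaned[key] = "sha256:" + digest[:12]
        else:
            cleaned[key] = value
    return cleaned


def should_keep(trace: dict, rate: float = 0.05, slow_ms: float = 5000.0, rng=random) -> bool:
    """Always keep errors and slow outliers; sample everything else at `rate`."""
    if trace.get("error") or trace.get("duration_ms", 0.0) >= slow_ms:
        return True
    return rate > rng.random()


print(redact({"api_key": "secret-value", "model": "example-model"}))
print(should_keep({"error": True}))
</code></pre></div>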
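<h2 id="何時不需要-tracing">When you do not need tracing</h2>
<ol>
<li><strong>Pure demo / personal tinkering</strong>: string logs are enough</li>
<li><strong>A single LLM call with no agent loop</strong>: simple enough to debug by grepping logs</li>
<li><strong>Extremely privacy-sensitive without self-hosting</strong>: trace content flowing to a SaaS crosses a data boundary, so assess the risk</li>
<li><strong>Per-request tracing overhead exceeds the benefit</strong>: in ultra-low-latency scenarios, check whether it is worth it</li>
</ol>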
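<h2 id="何時過時--何時不過時">What will and will not go out of date</h2>
<p><strong>Parts that will not go out of date</strong>:</p>
<ul>
<li>The fundamental difference between LLM tracing and traditional logging</li>
<li>The framing of structured spans + parent-child relationships</li>
<li>The three main use cases: cost monitoring / latency debug / failure debug</li>
<li>The trace → eval loop concept</li>
<li>The five design failure modes</li>
</ul>
<p><strong>Parts that will change</strong>:</p>
<ul>
<li>Specific attribute names in the OTel GenAI semconv (still stabilizing)</li>
<li>The mainstream SaaS platforms (one or two new entrants each year)</li>
<li>Auto-instrumentation coverage (still expanding)</li>
<li>How integration with specific frameworks works</li>
</ul>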
<h2 id="下一章">Next chapter</h2>
<p>Next chapter: <a href="/blog/llm/04-applications/llm-as-judge/" data-link-title="4.21 LLM-as-Judge 評估方法" data-link-desc="LLM 評估 LLM 的 production eval 方法：rubric design、pairwise / direct scoring、三大 bias 緩解、跟 trace 串接的閉環、calibration">4.21 LLM-as-judge evaluation methods</a>, which turns production traces into a systematic eval loop.</p>
]]></content:encoded></item><item><title>4.24 Client-to-Server End-to-End Observability Integration</title><link>https://tarrragon.github.io/blog/backend/04-observability/client-server-trace-integration/</link><pubDate>Mon, 22 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/client-server-trace-integration/</guid><description>&lt;p>The core responsibility of client-to-server end-to-end observability integration is to let the full path of one user action, from browser click through server processing to response rendering, be tied together by a single trace ID. &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/client-side-monitoring/" data-link-title="4.10 Client-side / Synthetic / RUM" data-link-desc="補 server-side 看不到的 user perceived 訊號">4.10 Client-side / Synthetic / RUM&lt;/a> covers the concepts and vendor positioning; this article walks through the implementation chain for one concrete scenario. &lt;a href="https://tarrragon.github.io/blog/monitoring/03-sdk-design/" data-link-title="模組三：SDK 設計模式" data-link-desc="跨平台 SDK 的自動攔截、手動上報、攢批送出、離線 buffer 設計">Monitoring module 03: SDK design&lt;/a> covers how to instrument the client; this article covers how the server receives and integrates that data.&lt;/p>
&lt;h2 id="完整鏈路">The full chain&lt;/h2>
&lt;p>Taking a user clicking Checkout in a web app as the example, a single action produces this observability chain:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln"> 1&lt;/span>&lt;span class="cl">Browser: user clicks &amp;#34;checkout&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 2&lt;/span>&lt;span class="cl">  → RUM SDK creates the client span (type: resource / xhr)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 3&lt;/span>&lt;span class="cl">  → HTTP POST /api/checkout + W3C traceparent header
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 4&lt;/span>&lt;span class="cl">    → Server middleware extracts the trace context
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 5&lt;/span>&lt;span class="cl">    → Server creates a child span (checkout-handler)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 6&lt;/span>&lt;span class="cl">      → DB query span (order insert)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 7&lt;/span>&lt;span class="cl">      → Cache span (inventory check)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 8&lt;/span>&lt;span class="cl">      → Queue span (event publish)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln"> 9&lt;/span>&lt;span class="cl">    → Server returns 200 + response body
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">10&lt;/span>&lt;span class="cl">  → Browser receives the response → resource timing ends
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">11&lt;/span>&lt;span class="cl">  → RUM SDK closes the client span (records duration + status)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">12&lt;/span>&lt;span class="cl">  → Unified trace waterfall: the client span is the root, server spans are children&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Every segment of the chain needs the trace context to be propagated correctly. If any segment breaks, the trace waterfall shows orphaned spans: the trace seen on the server and the trace seen on the client become two unrelated records.&lt;/p>
&lt;h2 id="trace-context-propagation">Trace context propagation&lt;/h2>
&lt;h3 id="w3c-traceparent-header">W3C traceparent header&lt;/h3>
&lt;p>W3C Trace Context is the vendor-neutral standard propagation format. The header looks like this:&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">              │  │                                │                  │
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl">              │  trace-id (32 hex)                 parent-id (16 hex) flags
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl">              version&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When the RUM SDK issues an XHR / fetch request, it injects &lt;code>traceparent&lt;/code> into the request headers. The server's trace SDK extracts the trace-id and parent-id from the header and creates a child span.&lt;/p></description><content:encoded><![CDATA[<p>The core responsibility of client-to-server end-to-end observability integration is to let the full path of one user action, from browser click through server processing to response rendering, be tied together by a single trace ID. <a href="/blog/backend/04-observability/client-side-monitoring/" data-link-title="4.10 Client-side / Synthetic / RUM" data-link-desc="補 server-side 看不到的 user perceived 訊號">4.10 Client-side / Synthetic / RUM</a> covers the concepts and vendor positioning; this article walks through the implementation chain for one concrete scenario. <a href="/blog/monitoring/03-sdk-design/" data-link-title="模組三：SDK 設計模式" data-link-desc="跨平台 SDK 的自動攔截、手動上報、攢批送出、離線 buffer 設計">Monitoring module 03: SDK design</a> covers how to instrument the client; this article covers how the server receives and integrates that data.</p>
<h2 id="完整鏈路">The full chain</h2>
<p>Taking a user clicking Checkout in a web app as the example, a single action produces this observability chain:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln"> 1</span><span class="cl">Browser: user clicks &#34;checkout&#34;
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">  → RUM SDK creates the client span (type: resource / xhr)
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">  → HTTP POST /api/checkout + W3C traceparent header
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">    → Server middleware extracts the trace context
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    → Server creates a child span (checkout-handler)
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">      → DB query span (order insert)
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">      → Cache span (inventory check)
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">      → Queue span (event publish)
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    → Server returns 200 + response body
</span></span><span class="line"><span class="ln">10</span><span class="cl">  → Browser receives the response → resource timing ends
</span></span><span class="line"><span class="ln">11</span><span class="cl">  → RUM SDK closes the client span (records duration + status)
</span></span><span class="line"><span class="ln">12</span><span class="cl">  → Unified trace waterfall: the client span is the root, server spans are children</span></span></code></pre></div><p>Every segment of the chain needs the trace context to be propagated correctly. If any segment breaks, the trace waterfall shows orphaned spans: the trace seen on the server and the trace seen on the client become two unrelated records.</p>
<h2 id="trace-context-propagation">Trace context propagation</h2>
<h3 id="w3c-traceparent-header">W3C traceparent header</h3>
<p>W3C Trace Context is the vendor-neutral standard propagation format. The header looks like this:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
</span></span><span class="line"><span class="ln">2</span><span class="cl">              │  │                                │                  │
</span></span><span class="line"><span class="ln">3</span><span class="cl">              │  trace-id (32 hex)                 parent-id (16 hex) flags
</span></span><span class="line"><span class="ln">4</span><span class="cl">              version</span></span></code></pre></div><p>When the RUM SDK issues an XHR / fetch request, it injects <code>traceparent</code> into the request headers. The server's trace SDK extracts the trace-id and parent-id from the header and creates a child span.</p>
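<p>The field layout can be checked with a small parser. The sketch below is written for illustration only and handles just version <code>00</code>; the helper name <code>parse_traceparent</code> is hypothetical, and in practice the OTel propagator does this work for you.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from typing import Optional

HEX_DIGITS = set("0123456789abcdef")


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a version-00 traceparent value; return None if it is malformed."""
    parts = header.strip().split("-")
    if len(parts) != 4:
        return None
    version, trace_id, parent_id, flags = parts
    lengths = (len(version), len(trace_id), len(parent_id), len(flags))
    if version != "00" or lengths != (2, 32, 16, 2):
        return None
    if not all(set(part).issubset(HEX_DIGITS) for part in parts):
        return None
    # An all-zero trace-id or parent-id is invalid per the W3C spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": int(flags, 16) % 2 == 1,  # lowest bit is the sampled flag
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(header))
</code></pre></div>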
<h3 id="client-端注入">Client-side injection</h3>
<p>How each RUM SDK injects the header:</p>
<table>
  <thead>
      <tr>
          <th>SDK</th>
          <th>Injection mechanism</th>
          <th>Configuration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Datadog RUM</td>
          <td>Patches XHR / fetch automatically; injects <code>x-datadog-*</code> plus, optionally, <code>traceparent</code></td>
          <td><code>allowedTracingUrls</code> lists the domains that may receive the headers</td>
      </tr>
      <tr>
          <td>Sentry browser</td>
          <td>Patches fetch / XHR automatically; injects <code>sentry-trace</code> + <code>baggage</code> plus, optionally, <code>traceparent</code></td>
          <td><code>tracePropagationTargets</code> lists the target URLs</td>
      </tr>
      <tr>
          <td>OTel browser SDK</td>
          <td>Injects <code>traceparent</code> through <code>XMLHttpRequestInstrumentation</code> / <code>FetchInstrumentation</code></td>
          <td><code>propagateTraceHeaderCorsUrls</code> lists the cross-origin URLs that should receive the header</td>
      </tr>
  </tbody>
</table>
<p>All three follow the same pattern: trace headers are injected only for the domains you configure. A URL that is not on the allowlist, such as a third-party API, gets no trace header, which avoids information leakage.</p>
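<p>The allowlist pattern can be sketched as a single decision function. This is an illustration of the idea, not any SDK's actual matching logic: <code>should_inject</code> is a hypothetical name, string entries are treated as exact origins, and the rule that same-origin requests are always traced is an assumption that differs between SDKs.</p>

```python
import re
from typing import Pattern, Sequence, Union
from urllib.parse import urlsplit

Matcher = Union[str, Pattern[str]]


def should_inject(url: str, page_origin: str, allowlist: Sequence[Matcher]) -> bool:
    """Decide whether a trace header may be attached to a request to `url`."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if origin == page_origin:
        return True  # assumption: same-origin requests are always traced
    for matcher in allowlist:
        if isinstance(matcher, str):
            # String entries are compared as exact origins, not as URL prefixes.
            if origin == matcher:
                return True
        elif matcher.search(url):
            return True
    return False  # third-party URL: send the request without trace headers
```

<p>Comparing whole origins rather than URL prefixes keeps a look-alike host such as <code>api.example.com.evil.io</code> from matching an entry for <code>api.example.com</code>.</p>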
<h3 id="server-端提取">Server-side extraction</h3>
<p>The server's trace SDK (OTel auto-instrumentation or a vendor agent) extracts the trace context from the headers of the incoming request:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="c1"># OTel Python example: auto-instrumentation handles this for you.</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="c1"># No manual extraction is needed; the middleware reads the traceparent header,</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="c1"># and the span it creates inherits the client&#39;s trace-id and parent-id.</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="c1"># Manual extraction (when auto-instrumentation is not in use)</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="kn">from</span> <span class="nn">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="kn">from</span> <span class="nn">opentelemetry.propagate</span> <span class="kn">import</span> <span class="n">extract</span>
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">
</span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="n">tracer</span> <span class="o">=</span> <span class="n">trace</span><span class="o">.</span><span class="n">get_tracer</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">10</span><span class="cl">
</span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="c1"># request is the incoming request object from your web framework (assumed here)</span>
</span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="n">ctx</span> <span class="o">=</span> <span class="n">extract</span><span class="p">(</span><span class="n">carrier</span><span class="o">=</span><span class="n">request</span><span class="o">.</span><span class="n">headers</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="k">with</span> <span class="n">tracer</span><span class="o">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s2">&#34;checkout-handler&#34;</span><span class="p">,</span> <span class="n">context</span><span class="o">=</span><span class="n">ctx</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">14</span><span class="cl">    <span class="c1"># server logic</span>
</span></span><span class="line"><span class="ln">15</span><span class="cl">    <span class="k">pass</span></span></span></code></pre></div><h3 id="cors-限制">CORS constraints</h3>
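<p><code>traceparent</code> is not a CORS-safelisted request header, so attaching it to a cross-origin request makes the browser send a preflight <code>OPTIONS</code> request first. The server has to allow the trace headers explicitly in its preflight response:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">Access-Control-Allow-Headers: traceparent, tracestate, sentry-trace, baggage</span></span></code></pre></div><p>CORS is the most common reason client-server trace linking breaks. If the preflight response does not list <code>traceparent</code> in <code>Access-Control-Allow-Headers</code>, the browser does not strip the header and send the request anyway: the preflight fails and the actual request is never sent. To avoid breaking calls this way, RUM SDKs inject trace headers only into allowlisted URLs. An endpoint left off the allowlist therefore receives requests with no trace context, and the span the server creates becomes a new root, disconnected from the client span.</p>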
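<p>The browser-side check behind a preflight can be modelled in a few lines. This is a simplified sketch: <code>preflight_passes</code> is a hypothetical helper, the safelist is a subset of the real one, and the value restrictions on safelisted headers are ignored.</p>

```python
from typing import Iterable

# Headers a browser may send cross-origin without a preflight (simplified subset
# of the CORS-safelisted request headers; value restrictions are ignored here).
SAFELISTED = {"accept", "accept-language", "content-language", "content-type"}


def preflight_passes(request_headers: Iterable[str], allow_headers: str) -> bool:
    """Model the browser's check of the Access-Control-Allow-Headers value."""
    allowed = {h.strip().lower() for h in allow_headers.split(",") if h.strip()}
    if "*" in allowed:
        return True  # wildcard; browsers do not honour it for credentialed requests
    needed = {h.lower() for h in request_headers} - SAFELISTED
    # Every non-safelisted header must be listed, otherwise the request is blocked.
    return needed.issubset(allowed)
```

<p>With the header value shown above, a request carrying <code>traceparent</code> and <code>baggage</code> passes; with <code>Access-Control-Allow-Headers: traceparent</code> alone, the same request is blocked because <code>baggage</code> is not listed.</p>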
<h2 id="跨層-correlation-設計">Cross-layer correlation design</h2>
<h3 id="trace-id-串接">Linking by trace ID</h3>
<p>A shared trace-id is the most basic form of correlation. Every span under the same trace-id (client and server) can be laid out in time order in the trace backend's waterfall view, showing the full path of the request.</p>
<h3 id="session-跟-transaction-的-mapping">Mapping sessions to transactions</h3>
<p>A RUM SDK session (one visit by a user) contains multiple user actions, and each action may trigger multiple HTTP requests. The mapping looks like this:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">RUM session
</span></span><span class="line"><span class="ln">2</span><span class="cl">  └── user action (click &#34;checkout&#34;)
</span></span><span class="line"><span class="ln">3</span><span class="cl">        ├── HTTP request /api/checkout  →  server transaction (trace)
</span></span><span class="line"><span class="ln">4</span><span class="cl">        ├── HTTP request /api/inventory →  server transaction (trace)
</span></span><span class="line"><span class="ln">5</span><span class="cl">        └── client-side rendering time</span></span></code></pre></div><p>Both Datadog RUM and Sentry let you jump from a session replay to the matching server trace. The mapping works because the RUM event records the trace-id, which is then joined against the same trace-id in the server trace backend.</p>
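<p>The join on trace-id can be sketched as follows. The event fields (<code>session_id</code>, <code>trace_id</code>) and the function name are assumptions made for illustration; real RUM exports use vendor-specific schemas.</p>

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def link_sessions_to_traces(
    rum_events: Iterable[dict], server_trace_ids: Set[str]
) -> Dict[str, List[str]]:
    """Group server traces under the RUM session whose events recorded their trace-id."""
    linked: Dict[str, List[str]] = defaultdict(list)
    for event in rum_events:
        trace_id = event.get("trace_id")
        # Keep only events whose trace-id also exists in the server trace backend.
        if trace_id in server_trace_ids:
            linked[event["session_id"]].append(trace_id)
    return dict(linked)
```

<p>Events without a trace-id (pure client-side rendering time, for instance) simply drop out of the join, which matches the tree above: not every node under a user action has a server transaction.</p>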
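<h3 id="breadcrumbs-跟-server-log-的時間對齊">Aligning breadcrumbs with server logs in time</h3>
<p>The breadcrumbs a RUM SDK collects (the sequence of user actions: page view → button click → form submit) need timestamps that can be compared with server-side logs. That only works when the gap between client and server clocks stays within an acceptable range (typically &lt; 1s).</p>
<p>Server clocks synchronized over NTP are usually accurate. The client (browser) relies on the system time of the user's device, which can be off by seconds or even minutes. To reduce the effect of clock skew, RUM SDKs usually record relative timing (an offset from the start of the session) rather than absolute timestamps.</p>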
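<p>When absolute client timestamps are all you have, one way to approximate the skew is the NTP-style offset formula applied to a single request round trip. This technique is brought in here for illustration and is not something the SDKs above are documented to do; the function names are hypothetical, and the estimate assumes roughly symmetric network delay.</p>

```python
def estimate_clock_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Estimate (server clock - client clock) in seconds, NTP style.

    t0: client clock when the request was sent
    t1: server clock when the request arrived
    t2: server clock when the response was sent
    t3: client clock when the response arrived
    Assumes the network delay is roughly the same in both directions.
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def to_server_time(client_ts: float, offset: float) -> float:
    """Shift a client-side timestamp (e.g. a breadcrumb) onto the server timeline."""
    return client_ts + offset
```

<p>For example, with a client clock 5 seconds behind, 50ms of one-way delay and 100ms of server processing, the four timestamps 100.0, 105.05, 105.15 and 100.2 give an offset of about 5.0 seconds.</p>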
<h3 id="error-correlation">Error correlation</h3>
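<p>A client-side JS error and a server-side 5xx can be two faces of the same problem. Ways to correlate them:</p>
<ul>
<li><strong>Same trace-id</strong>: the client error occurs while handling the response to an HTTP request, and that request's trace-id matches the trace-id of the server-side 500. This is direct correlation.</li>
<li><strong>Time window + endpoint</strong>: the client error carries no trace-id (for example, a CORS block kept the request from being sent), so fuzzy correlation is done by time window plus endpoint pattern.</li>
<li><strong>Client reports an error but the server shows nothing abnormal</strong>: a client-side rendering error (JSON parse failure, type error) is invisible to the server and has to be analyzed in RUM on its own.</li>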
</ul>
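<p>The three paths above can be sketched as one classification function. The record fields (<code>trace_id</code>, <code>endpoint</code>, <code>ts</code>), the function name, and the 5-second default window are all assumptions made for illustration.</p>

```python
from typing import Sequence


def correlate(client_error: dict, server_errors: Sequence[dict], window_s: float = 5.0) -> str:
    """Return which of the three correlation paths applies to a client error."""
    trace_id = client_error.get("trace_id")
    if trace_id is not None and any(e.get("trace_id") == trace_id for e in server_errors):
        return "trace-id"  # direct correlation
    endpoint = client_error.get("endpoint")
    for e in server_errors:
        same_endpoint = endpoint is not None and e.get("endpoint") == endpoint
        close_in_time = window_s >= abs(e["ts"] - client_error["ts"])
        if same_endpoint and close_in_time:
            return "time-window"  # fuzzy correlation
    return "client-only"  # nothing matching on the server: analyze in RUM alone
```

<p>The order matters: an exact trace-id match is tried first because it is unambiguous, and the fuzzy match is only a fallback that can produce false positives on busy endpoints.</p>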
<h2 id="evidence-package-整合">Evidence package integration</h2>
<p>When client-side signals are added to the <a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a>, a few extra things need to be recorded:</p>
<table>
  <thead>
      <tr>
          <th>Field</th>
          <th>Client-side addition</th>
          <th>Why it is needed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source</td>
          <td>Label as &ldquo;RUM&rdquo; or &ldquo;Synthetic&rdquo;</td>
          <td>Distinguishes server-side metrics from client-side metrics</td>
      </tr>
      <tr>
          <td>Latency</td>
          <td>Client perceived latency (DNS + network + server + rendering)</td>
          <td>The gap versus server-side latency is the DNS, network and rendering time</td>
      </tr>
      <tr>
          <td>Known gap</td>
          <td>Inconsistent trace sampling</td>
          <td>Client and server may sample independently, so a given request is not guaranteed to appear on both sides</td>
      </tr>
      <tr>
          <td>Confidence</td>
          <td>Client clock skew can affect timestamp accuracy</td>
          <td>Records the accuracy limits of client timestamps</td>
      </tr>
  </tbody>
</table>
<p>The gap between client perceived latency and server-side latency is an observability signal in its own right. A gap that holds steady at, say, 50ms is normal network overhead; a gap that suddenly jumps from 50ms to 500ms means something went wrong in the network or the CDN, and that is a problem the server-side dashboard cannot see at all.</p>
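<p>Tracking that gap can be sketched as below. The function names, the 50ms baseline, and the factor of 3 are illustrative assumptions; in practice the baseline would come from your own historical data.</p>

```python
from statistics import median
from typing import Sequence


def latency_gap_ms(client_ms: Sequence[float], server_ms: Sequence[float]) -> float:
    """Median of (client perceived latency - server latency) over paired requests."""
    return median(c - s for c, s in zip(client_ms, server_ms))


def gap_is_anomalous(gap_ms: float, baseline_ms: float = 50.0, factor: float = 3.0) -> bool:
    """Flag a gap that has grown well beyond its usual level."""
    return gap_ms > baseline_ms * factor
```

<p>The median is used instead of the mean so that a single slow mobile connection does not trigger the flag on its own. The pairs are assumed to be the same requests on both sides, matched by trace-id.</p>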
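<h2 id="失敗場景判讀">Reading failure scenarios</h2>
<table>
  <thead>
      <tr>
          <th>Failure signal</th>
          <th>Interpretation</th>
          <th>Next step</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Client span exists but the server span is missing</td>
          <td>The trace context never reached the server: either the URL is not on the SDK's tracing allowlist (no header is injected, so the server span starts a separate trace), or the CORS preflight rejected the trace header and the request was blocked</td>
          <td>Check the RUM SDK's <code>allowedTracingUrls</code> setting; check that <code>Access-Control-Allow-Headers</code> includes <code>traceparent</code></td>
      </tr>
      <tr>
          <td>Server is healthy but client perceived latency is high</td>
          <td>Network latency or slow client rendering</td>
          <td>Look at the RUM resource timing breakdown (DNS / TCP / TLS / TTFB / download / render)</td>
      </tr>
      <tr>
          <td>Client error with no matching server request</td>
          <td>The request was never sent: blocked by client-side validation, or the network was offline</td>
          <td>Use RUM breadcrumbs to confirm whether the request went out; check the <code>navigator.onLine</code> state</td>
      </tr>
      <tr>
          <td>Inconsistent trace sampling</td>
          <td>The client sampled a request that the server did not sample</td>
          <td>Unify the sampling decision with head-based sampling (the decision is made at the start of the trace and propagated downstream)</td>
      </tr>
      <tr>
          <td>Client and server error counts do not match</td>
          <td>The client count includes JS rendering errors (invisible to the server); the server count includes errors from background jobs that are not user-facing</td>
          <td>Look at them separately: compare API errors through trace correlation, and classify non-API errors on each side independently</td>
      </tr>
  </tbody>
</table>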
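<p>Head-based sampling can be sketched in three small functions: decide once at the trace root, encode the decision in the <code>traceparent</code> flags, and have downstream services follow the flag. The function names are hypothetical, and comparing the low 56 bits of the trace-id against the ratio is one possible deterministic rule, not the behaviour of a specific SDK.</p>

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Make the sampling decision once, at the trace root.

    Deterministic in the trace-id: the low 56 bits are compared with the ratio,
    so the same trace-id always yields the same decision.
    """
    bucket = int(trace_id[-14:], 16)
    return ratio * 2**56 > bucket


def build_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Encode the decision in the flags field so downstream services can follow it."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def downstream_should_record(traceparent: str) -> bool:
    """Downstream side: follow the propagated flag instead of sampling again."""
    flags = traceparent.rsplit("-", 1)[1]
    return int(flags, 16) % 2 == 1
```

<p>Because the server reads the flag rather than rolling its own dice, a request the client sampled is always recorded on the server too, which removes the mismatch described in the table.</p>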
<h2 id="vendor-整合模式">Vendor integration patterns</h2>
<table>
  <thead>
      <tr>
          <th>Combination</th>
          <th>How they link</th>
          <th>Limitations</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Datadog RUM + Datadog APM</td>
          <td>Native: client and server traces in the same Datadog org are linked automatically</td>
          <td>Both sides need a Datadog plan</td>
      </tr>
      <tr>
          <td>Sentry browser + Sentry server</td>
          <td>Native: <code>sentry-trace</code> header propagation</td>
          <td>Performance monitoring requires a paid Sentry plan</td>
      </tr>
      <tr>
          <td>OTel browser SDK + OTel server SDK</td>
          <td>W3C <code>traceparent</code>, the vendor-neutral standard</td>
          <td>The browser SDK is newer, and its instrumentation coverage is less mature than on the server side</td>
      </tr>
      <tr>
          <td>Mixed (Sentry browser + Datadog server)</td>
          <td>Manual bridging: make sure both sides support W3C <code>traceparent</code></td>
          <td>The trace context format must match on both sides; session-level correlation has to be built in-house</td>
      </tr>
  </tbody>
</table>
<p>Same-vendor combinations link most naturally. Cross-vendor combinations get trace-level correlation as long as both sides support W3C Trace Context, but session-level features (session replay → server trace) are only available within a single vendor.</p>
<h2 id="交接路由">Hand-off routing</h2>
<ul>
<li><a href="/blog/backend/04-observability/client-side-monitoring/" data-link-title="4.10 Client-side / Synthetic / RUM" data-link-desc="補 server-side 看不到的 user perceived 訊號">4.10 Client-side / Synthetic / RUM</a>: conceptual positioning and vendor selection</li>
<li><a href="/blog/backend/04-observability/tracing-context/" data-link-title="4.3 tracing 與 context link" data-link-desc="整理 trace id、span 與跨服務 context propagation">4.3 Tracing Context</a>: server-side trace context design</li>
<li><a href="/blog/backend/04-observability/checkout-api-evidence-package/" data-link-title="4.22 Checkout API Evidence Package 實作示範" data-link-desc="用 checkout 路徑示範 evidence package 如何交接給 release gate 與 incident decision。">4.22 Checkout API Evidence Package</a>: integrating evidence into the release gate</li>
<li><a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20 Observability Evidence Package</a>: the standard for evidence fields</li>
<li><a href="/blog/monitoring/03-sdk-design/" data-link-title="模組三：SDK 設計模式" data-link-desc="跨平台 SDK 的自動攔截、手動上報、攢批送出、離線 buffer 設計">Monitoring 03 SDK design</a>: client-side SDK instrumentation design</li>
<li><a href="/blog/monitoring/06-commercial-comparison/" data-link-title="模組六：商業方案對照" data-link-desc="Sentry / Crashlytics / Datadog RUM / Mixpanel — 自架 vs 商業的功能和成本取捨">Monitoring 06 commercial options</a>: comparison of the client-side capabilities of Sentry and Datadog RUM</li>
<li><a href="/blog/monitoring/telemetry-data-dual-use/" data-link-title="監控資料的雙重用途：行為分析與訊號治理" data-link-desc="同一份 event data 如何同時服務行為分析（funnel / cohort / attribution）和訊號治理（cardinality / cost / signal governance）— 格式交叉、治理衝突與分流架構">The dual use of monitoring data</a>: how the same event data serves both behavioral analytics and signal governance</li>
</ul>
]]></content:encoded></item></channel></rss>