<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Grouping on Tarragon</title><link>https://tarrragon.github.io/blog/tags/grouping/</link><description>Recent content in Grouping on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Wed, 24 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/tags/grouping/index.xml" rel="self" type="application/rss+xml"/><item><title>Sentry Error Grouping 與 Fingerprinting 策略</title><link>https://tarrragon.github.io/blog/backend/04-observability/vendors/sentry/error-grouping-fingerprinting/</link><pubDate>Mon, 22 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/04-observability/vendors/sentry/error-grouping-fingerprinting/</guid><description>&lt;blockquote>
&lt;p>本文是 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/sentry/" data-link-title="Sentry" data-link-desc="Error tracking 主流、APM / Profiling / Session Replay 擴展">Sentry&lt;/a> 的 vendor deep article，深化 overview「Issue grouping / fingerprint」段。初次接觸 Sentry 的讀者建議先讀 &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/vendors/sentry/" data-link-title="Sentry" data-link-desc="Error tracking 主流、APM / Profiling / Session Replay 擴展">Sentry 服務頁&lt;/a>。&lt;/p>&lt;/blockquote>
&lt;h2 id="問題情境">問題情境&lt;/h2>
&lt;p>Error grouping 決定 Sentry 的使用體驗。Grouping 太粗（不同 bug 被合併成同一個 issue），團隊會漏掉新問題；grouping 太細（同一個 bug 被拆成數百個 issue），issue list 變成 noise。理解 Sentry 的 grouping 演算法跟自訂 fingerprint 機制，才能讓 issue list 反映真實的 bug 數量而非 error event 數量。&lt;/p>
&lt;h2 id="預設-grouping-演算法">預設 Grouping 演算法&lt;/h2>
&lt;h3 id="stack-trace-為主">Stack trace 為主&lt;/h3>
&lt;p>Sentry 的預設 grouping 策略以 exception type + stack trace 為核心。兩個 error event 會被歸到同一個 issue，如果它們的 exception type 相同、且 stack trace 的「相關 frame」相同。&lt;/p>
&lt;p>「相關 frame」是 Sentry 的判定結果 — 它會過濾掉標準函式庫、框架內部 frame 跟已知 noise frame，只留下 application code frame。這個過濾邏輯叫 stack trace rules，由 Sentry 的 grouping 引擎自動決定。&lt;/p>
&lt;h3 id="grouping-版本">Grouping 版本&lt;/h3>
&lt;p>Sentry 的 grouping 演算法有多個版本（稱為 grouping config）。新建的 project 自動用最新版（截至 2024 年是 &lt;code>newstyle:2023-01-11&lt;/code>），舊 project 可能還在用舊版。升級 grouping config 會改變 issue 的歸屬 — 之前合併的 event 可能被拆開，之前分開的可能合併。&lt;/p>
&lt;p>確認目前的 grouping config：Project Settings → General Settings → Event Grouping。升級前先用 Sentry 的 grouping preview 功能測試影響範圍。&lt;/p>
&lt;h3 id="非-exception-事件">非 exception 事件&lt;/h3>
&lt;p>沒有 stack trace 的事件（&lt;code>capture_message&lt;/code>、breadcrumb-only event、CSP violation）用 message 內容做 grouping。相同 message template 的事件歸到同一個 issue。&lt;/p>
&lt;p>message 中如果包含動態值（user ID、request ID、timestamp），Sentry 會嘗試辨識並忽略動態部分。但辨識不完美 — 如果 message 格式不一致，同一種錯誤可能被拆成多個 issue。&lt;/p>
&lt;h2 id="自訂-fingerprint">自訂 Fingerprint&lt;/h2>
&lt;h3 id="何時需要自訂">何時需要自訂&lt;/h3>
&lt;p>預設 grouping 不夠用的常見場景：&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>場景&lt;/th>
 &lt;th>問題&lt;/th>
 &lt;th>Fingerprint 解法&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>外部 API timeout&lt;/td>
 &lt;td>不同 caller 的 stack trace 不同，但根因相同&lt;/td>
 &lt;td>用 &lt;code>{{ default }}&lt;/code> + error type 做 fingerprint&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Database connection error&lt;/td>
 &lt;td>每個 query 的 stack trace 不同&lt;/td>
 &lt;td>用 error message pattern 做 fingerprint&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>前端 minified code&lt;/td>
 &lt;td>source map 缺失導致 frame 不穩定&lt;/td>
 &lt;td>先修 source map 上傳，而非硬 fingerprint&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Rate limit / 429 error&lt;/td>
 &lt;td>大量 429 拆成數百個 issue&lt;/td>
 &lt;td>用 HTTP status code 做 fingerprint&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h3 id="server-side-fingerprint-rules">Server-side fingerprint rules&lt;/h3>
&lt;p>在 Project Settings → Issue Grouping → Fingerprint Rules 設定。語法：&lt;/p></description><content:encoded><![CDATA[<blockquote>
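<p>This is a vendor deep-dive article for <a href="/blog/backend/04-observability/vendors/sentry/" data-link-title="Sentry" data-link-desc="Error tracking 主流、APM / Profiling / Session Replay 擴展">Sentry</a> that expands on the "Issue grouping / fingerprint" section of the overview. Readers new to Sentry should start with the <a href="/blog/backend/04-observability/vendors/sentry/" data-link-title="Sentry" data-link-desc="Error tracking 主流、APM / Profiling / Session Replay 擴展">Sentry service page</a>.</p></blockquote>
<h2 id="問題情境">Problem context</h2>
<p>Error grouping shapes the day-to-day experience of using Sentry. If grouping is too coarse (different bugs merged into one issue), the team misses new problems; if it is too fine (one bug split into hundreds of issues), the issue list turns into noise. Understanding Sentry's grouping algorithm and the custom fingerprint mechanism is what lets the issue list reflect the real number of bugs rather than the number of error events.</p>
<h2 id="預設-grouping-演算法">Default grouping algorithm</h2>
<h3 id="stack-trace-為主">Stack trace first</h3>
<p>Sentry's default grouping strategy is built around exception type plus stack trace. Two error events land in the same issue when their exception types match and the "relevant frames" of their stack traces match.</p>
<p>Which frames count as "relevant" is decided by Sentry: it filters out standard library frames, framework-internal frames, and known noise frames, and keeps only application code frames. This filtering logic is called stack trace rules and is applied automatically by Sentry's grouping engine.</p>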
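<h3 id="grouping-版本">Grouping versions</h3>
<p>Sentry's grouping algorithm comes in several versions, called grouping configs. New projects automatically use the latest one (as of 2024, <code>newstyle:2023-01-11</code>), while older projects may still be on an earlier version. Upgrading the grouping config changes which issue an event belongs to: events that used to be merged may be split, and events that used to be separate may be merged.</p>
<p>To check the current grouping config, open Project Settings → Issue Grouping (the same page that hosts Fingerprint Rules; the exact menu label can differ between Sentry versions). Before upgrading, use Sentry's grouping preview to gauge the impact.</p>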
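<h3 id="非-exception-事件">Non-exception events</h3>
<p>Events without a stack trace (<code>capture_message</code>, breadcrumb-only events, CSP violations) are grouped by message content. Events sharing the same message template land in the same issue.</p>
<p>When a message contains dynamic values (user ID, request ID, timestamp), Sentry tries to detect and ignore the dynamic parts. Detection is imperfect, though: if the message format is inconsistent, the same kind of error can be split across multiple issues.</p>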
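<p>One way to keep message grouping stable is to hold the message template constant and pass dynamic values as parameters. The sketch below uses only Python's standard <code>logging</code> module to show that the template (<code>record.msg</code>) is identical across calls while the rendered text differs. <code>TemplateCollector</code> is a hypothetical stand-in for whatever forwards log records to Sentry, and whether a given SDK integration groups on the template is an assumption to verify for your setup.</p>

```python
import logging


class TemplateCollector(logging.Handler):
    """Hypothetical handler that records the raw template and the rendered text."""

    def __init__(self):
        super().__init__()
        self.templates = []
        self.rendered = []

    def emit(self, record):
        self.templates.append(record.msg)          # unformatted template
        self.rendered.append(record.getMessage())  # template with args applied


collector = TemplateCollector()
logger = logging.getLogger("grouping-demo")
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(collector)

for user_id in (12345, 67890):
    # Pass the dynamic value as an argument instead of formatting it into the string.
    logger.error("User %s not found", user_id)

unique_templates = set(collector.templates)
unique_rendered = set(collector.rendered)
```

<p>Both calls share one template but produce two rendered strings, which is why grouping on the template is more stable than grouping on the final text.</p>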
<h2 id="自訂-fingerprint">Custom fingerprints</h2>
<h3 id="何時需要自訂">When to customize</h3>
<p>Common scenarios where default grouping falls short:</p>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Problem</th>
          <th>Fingerprint fix</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>External API timeout</td>
          <td>Stack traces differ per caller, but the root cause is the same</td>
          <td>Fingerprint on <code>{{ default }}</code> + error type</td>
      </tr>
      <tr>
          <td>Database connection error</td>
          <td>Every query has a different stack trace</td>
          <td>Fingerprint on the error message pattern</td>
      </tr>
      <tr>
          <td>Minified frontend code</td>
          <td>Missing source maps make frames unstable</td>
          <td>Fix source map upload first instead of forcing a fingerprint</td>
      </tr>
      <tr>
          <td>Rate limit / 429 error</td>
          <td>A flood of 429s is split into hundreds of issues</td>
          <td>Fingerprint on the HTTP status code</td>
      </tr>
  </tbody>
</table>
<h3 id="server-side-fingerprint-rules">Server-side fingerprint rules</h3>
<p>Configure these under Project Settings → Issue Grouping → Fingerprint Rules. Syntax:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl"># Group every ConnectionError into one issue
</span></span><span class="line"><span class="cl">error.type:ConnectionError -&gt; connection-error
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Group a specific message pattern into one issue
</span></span><span class="line"><span class="cl">message:&#34;Rate limit exceeded*&#34; -&gt; rate-limit
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Group every error from a specific module
</span></span><span class="line"><span class="cl">module:payment.gateway.* -&gt; payment-gateway-error
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"># Combined conditions
</span></span><span class="line"><span class="cl">error.type:TimeoutError module:external.api.* -&gt; external-api-timeout</span></span></code></pre></div><p>Rule order matters when an event matches more than one rule: Sentry evaluates fingerprint rules from top to bottom and applies the first rule that matches, so place the most specific rules first.</p>
<h3 id="sdk-side-fingerprint">SDK-side fingerprint</h3>
<p>Set <code>event.fingerprint</code> in the SDK's <code>before_send</code> callback:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">sentry_sdk</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">before_send</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">hint</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># exc_info is a (type, value, traceback) tuple when an exception was captured</span>
</span></span><span class="line"><span class="cl">    <span class="n">exc_info</span> <span class="o">=</span> <span class="n">hint</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;exc_info&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Compare the class name instead of substring-matching the stringified tuple</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">exc_info</span> <span class="ow">and</span> <span class="n">exc_info</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;ConnectionError&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">event</span><span class="p">[</span><span class="s2">&#34;fingerprint&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;connection-error&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">event</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">sentry_sdk</span><span class="o">.</span><span class="n">init</span><span class="p">(</span><span class="n">dsn</span><span class="o">=</span><span class="s2">&#34;...&#34;</span><span class="p">,</span> <span class="n">before_send</span><span class="o">=</span><span class="n">before_send</span><span class="p">)</span></span></span></code></pre></div><p>Differences between SDK-side and server-side fingerprinting:</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>Server-side rules</th>
          <th>SDK-side fingerprint</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Where configured</td>
          <td>Sentry Web UI</td>
          <td>Code</td>
      </tr>
      <tr>
          <td>Rollout speed</td>
          <td>Takes effect immediately</td>
          <td>Requires a deploy</td>
      </tr>
      <tr>
          <td>Visibility</td>
          <td>Visible to and editable by the whole team</td>
          <td>Scattered across the codebase</td>
      </tr>
      <tr>
          <td>Complex logic</td>
          <td>Pattern matching only</td>
          <td>Arbitrary program logic</td>
      </tr>
  </tbody>
</table>
<p>Prefer server-side rules: they are centrally managed and take effect immediately. Reserve SDK-side fingerprints for logic that server-side rules cannot express.</p>
<h3 id="-default--組合">Combining with <code>{{ default }}</code></h3>
<p>Inside a fingerprint, <code>{{ default }}</code> stands for Sentry's default grouping result. Combine it with custom values:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl"># Default grouping, further split by environment
</span></span><span class="line"><span class="cl">fingerprint: [&#34;{{ default }}&#34;, &#34;{{ environment }}&#34;]</span></span></code></pre></div><p>With this, the same bug in staging and in production becomes two separate issues that can be tracked independently.</p>
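<p>The same split can be written on the SDK side. This is a minimal sketch, assuming the hook is registered as <code>before_send</code> in <code>sentry_sdk.init</code> and that the event dict carries an <code>environment</code> key by the time the hook runs. <code>split_by_environment</code> is a hypothetical name.</p>

```python
def split_by_environment(event, hint):
    """before_send-style hook: keep default grouping, add environment as a dimension."""
    environment = event.get("environment") or "unknown"
    # "{{ default }}" asks Sentry to substitute its own default grouping result here.
    event["fingerprint"] = ["{{ default }}", environment]
    return event


# Plain dicts stand in for real events so the hook can be exercised without the SDK.
staging = split_by_environment({"environment": "staging"}, {})
production = split_by_environment({"environment": "production"}, {})
```

<p>Because the fingerprint list differs only in its second element, events that share default grouping still separate by environment.</p>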
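<h2 id="merge-與-unmerge">Merge and unmerge</h2>
<h3 id="事後修正">Fixing things after the fact</h3>
<p>When grouping is off, Sentry offers after-the-fact corrections:</p>
<p><strong>Merge</strong>: select several issues and merge them into one. The merged issue keeps all events but only one issue ID. Use it when default grouping is too fine (one bug split into several issues).</p>
<p><strong>Unmerge</strong>: pick a subset of events within an issue and split them out into a new issue. Use it when default grouping is too coarse (different bugs lumped into one issue).</p>
<h3 id="mergeunmerge-的限制">Limits of merge/unmerge</h3>
<p>Merge and unmerge are both band-aids: they only affect existing events, and incoming events are still grouped by the original logic. If the root cause is grouping that is too coarse or too fine, fix the fingerprint rule instead of merging and unmerging indefinitely.</p>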
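<p>Order of operations:</p>
<ol>
<li>Notice that grouping is off</li>
<li>Use merge/unmerge on the existing issues first (stop the bleeding)</li>
<li>Analyze the root cause: unstable stack traces, dynamic values in messages, or a missing fingerprint rule</li>
<li>Add a fingerprint rule as the permanent fix</li>
<li>Verify that incoming events are grouped correctly</li>
</ol>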
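<h2 id="grouping-不準的判讀">Diagnosing inaccurate grouping</h2>
<h3 id="太細的訊號">Signs that grouping is too fine</h3>
<ul>
<li>The issue list contains many issues with similar titles but different IDs</li>
<li>Many issues appear with only 1-2 occurrences each</li>
<li>Errors triggered by the same user action are scattered across several issues</li>
</ul>
<p>Common causes: dynamic values in the message (user ID, timestamp, request path), missing source maps (frontend), and generated-code frames in the stack trace.</p>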
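<p>The first two signals can be checked mechanically against an exported issue list. The sketch below is a rough heuristic, not a Sentry feature: the <code>id</code> / <code>title</code> / <code>event_count</code> fields, the thresholds, and the function names are all assumptions made for illustration.</p>

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Collapse digit runs so titles that differ only by IDs compare equal."""
    return re.sub(r"\d+", "{N}", title)


def find_fragmented_groups(issues, min_issues=3, max_events=2):
    """Return normalized titles shared by at least min_issues low-volume issues."""
    buckets = defaultdict(list)
    for issue in issues:
        # Only low-volume issues are candidates for "same bug, split too finely".
        if max_events >= issue["event_count"] >= 1:
            buckets[normalize_title(issue["title"])].append(issue["id"])
    return {title: ids for title, ids in buckets.items() if len(ids) >= min_issues}


# Hypothetical sample data.
issues = [
    {"id": 1, "title": "User 12345 not found", "event_count": 1},
    {"id": 2, "title": "User 67890 not found", "event_count": 2},
    {"id": 3, "title": "User 424242 not found", "event_count": 1},
    {"id": 4, "title": "Payment declined", "event_count": 120},
]
fragmented = find_fragmented_groups(issues)
```

<p>Any title returned here is a candidate for a fingerprint rule or a message template fix. Whether its issues really share a root cause still needs a human check.</p>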
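<h3 id="太粗的訊號">Signs that grouping is too coarse</h3>
<ul>
<li>An issue's event count keeps growing, but the event details look like different problems</li>
<li>An issue regresses right after being resolved, but the new events have a different cause</li>
<li>The team ignored a catch-all issue that still hides bugs needing real attention</li>
</ul>
<p>Common causes: an overly generic exception type (<code>RuntimeError</code>, <code>Exception</code>), or a fingerprint rule that is too coarse (merging every error in a module into one issue).</p>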
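<h2 id="大量-unique-errors-的治理">Governing large volumes of unique errors</h2>
<h3 id="問題issue-爆量">Problem: issue explosion</h3>
<p>Once a project has more than a few thousand issues, the issue list stops being actionable. An on-call engineer who opens Sentry to 2000 unresolved issues effectively has no triage.</p>
<h3 id="治理策略">Governance strategies</h3>
<p><strong>Inbound filters</strong>: configured under Project Settings → Inbound Filters, they drop known noise events (browser extension errors, crawler errors, legacy browser errors). Dropping happens at the ingestion layer and does not consume quota.</p>
<p><strong>Rate limits</strong>: set at the project or key level. Events over the limit are dropped. This keeps a burst from a single bug from exhausting quota, but it does not reduce the number of issues.</p>
<p><strong>Alert rules with ownership</strong>: use Sentry alert rules to route new issues carrying specific tags (service, team, module) to the owning team. Not every issue needs to go to the same person.</p>
<p><strong>Regular triage cadence</strong>: hold a triage session every week or two and sort issues into fix / ignore / merge. Sentry's <code>For Review</code> tab automatically lists issues that need first-time triage.</p>
<p><strong>Auto-resolve</strong>: set an auto-resolve policy so that issues with no new events for N days resolve automatically. This keeps stale issues from sitting in the unresolved list forever.</p>
<h3 id="治理後的穩態">Steady state after governance</h3>
<p>A reasonable steady state: the number of unresolved issues stays in the tens to hundreds, and the weekly count of new issues roughly balances the count of resolved ones. If the unresolved count keeps growing, first check for noise events that are not being filtered or fingerprints that are too fine.</p>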
<h2 id="整合與下一步">Integration and next steps</h2>
<ul>
<li>Boundary between error tracking and observability: Sentry handles the error lifecycle, while metrics/logs/traces describe system behavior; see <a href="/blog/backend/04-observability/telemetry-data-quality/" data-link-title="4.17 Telemetry Data Quality" data-link-desc="把 missing signal、schema drift、sampling bias 與 timestamp skew 變成資料品質問題">4.17 Telemetry Data Quality</a></li>
<li>OTel context integration: the Sentry SDK accepts OTel trace_id / span_id so errors can be correlated with traces; see <a href="/blog/backend/04-observability/vendors/opentelemetry/collector-deployment-patterns/" data-link-title="OTel Collector 部署模式：agent / gateway / sidecar 與 pipeline 設計" data-link-desc="說明 OpenTelemetry Collector 三種部署位置的責任分工、receivers/processors/exporters pipeline 設計，以及 collector 失效、記憶體壓力與 backpressure 的故障演練與容量邊界">OpenTelemetry Collector deployment patterns</a></li>
<li>Release tracking and session replay: see <a href="../release-tracking-session-replay/">Release Tracking and Session Replay</a></li>
<li>Incident response integration: severe issue → alert → on-call; see <a href="/blog/backend/08-incident-response/" data-link-title="模組八：事故處理與復盤" data-link-desc="用 IR 領域詞彙建問題節點、以服務級案例庫累積事故脈絡，先建概念與案例庫再進實作交接">Module 08: Incident Response</a></li>
</ul>
]]></content:encoded></item><item><title>Error Fingerprint 與去重分群</title><link>https://tarrragon.github.io/blog/monitoring/04-collector/error-fingerprint/</link><pubDate>Wed, 24 Jun 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/monitoring/04-collector/error-fingerprint/</guid><description>&lt;p>Error fingerprint 把相同根因的 error 事件歸為同一組（error group），讓 dashboard 從「每筆 error 獨立一行」變成「同因 error 歸組、顯示 count / first_seen / last_seen / affected_sessions」。這是 error tracking 從「有記錄」演進到「可管理」的關鍵能力。&lt;/p>
&lt;p>Collector 搭配的 &lt;a href="https://tarrragon.github.io/blog/monitoring/04-collector/dashboard-developer/" data-link-title="Developer Dashboard 設計" data-link-desc="Bug 在哪、多嚴重、怎麼重現 — Error 列表和趨勢的日常監控、Session 回放和 Stack trace 的深入 debug">Developer Dashboard&lt;/a> 在 Error 列表中用 &lt;code>GROUP BY name&lt;/code> 做分群 — 同名的 error 歸為一行。這在 error name 設計良好時（&lt;code>terminal.connect.failed&lt;/code> / &lt;code>auth.biometric.timeout&lt;/code>）可以運作，但在以下情境會失效：&lt;/p>
&lt;ul>
&lt;li>同一個 name 對應多個不同的 root cause — &lt;code>app.exception&lt;/code> 的 stack trace 指向完全不同的程式碼位置&lt;/li>
&lt;li>不同 name 其實是同一個 root cause — &lt;code>ws.connect.failed&lt;/code> 和 &lt;code>ws.reconnect.failed&lt;/code> 都是同一個 server 下線造成&lt;/li>
&lt;/ul>
&lt;p>Fingerprint 提供比 name 更精確的分群維度。&lt;/p>
&lt;h2 id="fingerprint-演算法">Fingerprint 演算法&lt;/h2>
&lt;p>Fingerprint 從 error 事件中提取關鍵欄位、計算 hash，相同 hash 的事件歸為同一組。欄位的選擇決定分群的粒度。&lt;/p>
&lt;h3 id="基礎版type--message">基礎版：type + message&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">fingerprint = SHA256(error_type + &amp;#34;:&amp;#34; + error_message)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用 &lt;code>error_type&lt;/code>（&lt;code>NullPointerException&lt;/code> / &lt;code>TypeError&lt;/code> / &lt;code>ConnectionError&lt;/code>）加上 &lt;code>error_message&lt;/code> 做 hash。實作最簡單，大多數情況下能正確分群。&lt;/p>
&lt;p>問題在 error message 包含動態值時。同一個 bug 產生的 error 因為動態值不同而分裂成多組：&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">&amp;#34;User 12345 not found&amp;#34; → fingerprint A
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">&amp;#34;User 67890 not found&amp;#34; → fingerprint B&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>這兩筆是同一個 bug（查無使用者），但 message 中的 user ID 不同導致 fingerprint 不同。動態值的處理見下方 &lt;a href="#message-normalization">message normalization&lt;/a>。&lt;/p>
&lt;h3 id="進階版type--stack-trace-top-frames">進階版：type + stack trace top frames&lt;/h3>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">fingerprint = SHA256(error_type + &amp;#34;:&amp;#34; + top_3_frames)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用 error_type 加上 stack trace 最頂端的 N 個 frame（函式名 + 檔案名 + 行號）做 hash。Stack trace 的頂端通常是 error 發生的直接位置，相同位置的 error 歸為同組。&lt;/p>





&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="ln">1&lt;/span>&lt;span class="cl">// 兩筆 error 的 stack trace 頂端相同 → 同一個 fingerprint
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">2&lt;/span>&lt;span class="cl">TypeError: Cannot read property &amp;#39;name&amp;#39; of null
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">3&lt;/span>&lt;span class="cl"> at UserProfile.render (UserProfile.js:42) ← frame 1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">4&lt;/span>&lt;span class="cl"> at Component.update (framework.js:108) ← frame 2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="ln">5&lt;/span>&lt;span class="cl"> at scheduler.flush (framework.js:203) ← frame 3&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>N 的選擇是粒度 vs 穩定性的取捨。N=1 過粗（不同 bug 可能在同一個函式裡），N=5 過細（重構移動程式碼後行號改變，同一個 bug 的 fingerprint 分裂）。N=3 是常見的預設值。&lt;/p></description><content:encoded><![CDATA[<p>Error fingerprinting groups error events that share a root cause into one error group, turning the dashboard from "one row per error" into "errors with the same cause grouped together, showing count / first_seen / last_seen / affected_sessions". It is the capability that moves error tracking from "recorded" to "manageable".</p>
<p>The <a href="/blog/monitoring/04-collector/dashboard-developer/" data-link-title="Developer Dashboard 設計" data-link-desc="Bug 在哪、多嚴重、怎麼重現 — Error 列表和趨勢的日常監控、Session 回放和 Stack trace 的深入 debug">Developer Dashboard</a> that ships with the Collector groups its Error list with <code>GROUP BY name</code>, so errors with the same name share a row. That works when error names are well designed (<code>terminal.connect.failed</code> / <code>auth.biometric.timeout</code>), but it breaks down in these cases:</p>
<ul>
<li>One name maps to several different root causes: the stack traces under <code>app.exception</code> point to completely different code locations</li>
<li>Different names share one root cause: <code>ws.connect.failed</code> and <code>ws.reconnect.failed</code> are both caused by the same server going offline</li>
</ul>
<p>A fingerprint provides a more precise grouping dimension than the name.</p>
<h2 id="fingerprint-演算法">Fingerprint algorithms</h2>
<p>A fingerprint extracts key fields from the error event and hashes them; events with the same hash belong to the same group. The choice of fields determines the grouping granularity.</p>
<h3 id="基礎版type--message">Basic: type + message</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">fingerprint = SHA256(error_type + &#34;:&#34; + error_message)</span></span></code></pre></div><p>Hash <code>error_type</code> (<code>NullPointerException</code> / <code>TypeError</code> / <code>ConnectionError</code>) together with <code>error_message</code>. This is the simplest variant to implement and groups correctly in most cases.</p>
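<p>A minimal Python sketch of the basic variant, using the standard <code>hashlib</code> module. The function name <code>basic_fingerprint</code> and the sample values are illustrative assumptions.</p>

```python
import hashlib


def basic_fingerprint(error_type, error_message):
    """SHA256 over "type:message", returned as a hex string."""
    payload = f"{error_type}:{error_message}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = basic_fingerprint("LookupError", "User 12345 not found")
b = basic_fingerprint("LookupError", "User 67890 not found")
c = basic_fingerprint("LookupError", "User 12345 not found")
```

<p>Identical inputs (<code>a</code> and <code>c</code>) hash to the same fingerprint, while any difference in the message text (<code>b</code>) produces a different one.</p>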
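<p>The problem appears when the error message contains dynamic values. Errors produced by the same bug split into multiple groups because the dynamic values differ:</p>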





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">&#34;User 12345 not found&#34;  → fingerprint A
</span></span><span class="line"><span class="cl">&#34;User 67890 not found&#34;  → fingerprint B</span></span></code></pre></div><p>Both events come from the same bug (the user lookup misses), but the differing user ID in the message yields different fingerprints. Handling of dynamic values is covered below under <a href="#message-normalization">message normalization</a>.</p>
<h3 id="進階版type--stack-trace-top-frames">Advanced: type + stack trace top frames</h3>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">fingerprint = SHA256(error_type + &#34;:&#34; + top_3_frames)</span></span></code></pre></div><p>Hash error_type together with the top N frames of the stack trace (function name + file name + line number). The top of the stack trace is usually where the error occurred directly, so errors raised at the same location share a group.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">// Both errors share the same top of stack → same fingerprint
</span></span><span class="line"><span class="cl">TypeError: Cannot read property &#39;name&#39; of null
</span></span><span class="line"><span class="cl">  at UserProfile.render (UserProfile.js:42)    ← frame 1
</span></span><span class="line"><span class="cl">  at Component.update (framework.js:108)       ← frame 2
</span></span><span class="line"><span class="cl">  at scheduler.flush (framework.js:203)        ← frame 3</span></span></code></pre></div><p>Choosing N trades granularity against stability. N=1 is too coarse (different bugs can live in the same function); N=5 is too fine (line numbers shift when code moves during refactoring, splitting the fingerprint of a single bug). N=3 is a common default.</p>
<p>The stack trace variant requires error events to carry a structured stack trace. If the SDK sends only the error message without a stack trace, only the basic variant is available.</p>
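<p>The top-N variant can be sketched as follows. The <code>Frame</code> tuple, the separator characters, and the sample frames are assumptions for illustration; index 0 is taken to be the frame where the error was raised.</p>

```python
import hashlib
from typing import NamedTuple


class Frame(NamedTuple):
    function: str
    filename: str
    lineno: int


def stack_fingerprint(error_type, frames, top_n=3):
    """Hash the error type plus the top N frames (index 0 = raise site)."""
    parts = [error_type]
    for frame in frames[:top_n]:
        parts.append(f"{frame.function}@{frame.filename}:{frame.lineno}")
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


trace = [
    Frame("UserProfile.render", "UserProfile.js", 42),
    Frame("Component.update", "framework.js", 108),
    Frame("scheduler.flush", "framework.js", 203),
    Frame("main", "index.js", 7),
]
# Same top three frames, different caller further down the stack.
other_caller = trace[:3] + [Frame("bootstrap", "entry.js", 1)]
# Same bug after a refactor moved the top frame down by one line.
shifted = [trace[0]._replace(lineno=43)] + trace[1:]

same_group = stack_fingerprint("TypeError", trace) == stack_fingerprint("TypeError", other_caller)
split_by_lineno = stack_fingerprint("TypeError", trace) != stack_fingerprint("TypeError", shifted)
```

<p>Frames beyond the top N do not affect the hash, but a one-line shift in any hashed frame does. That is the stability cost of including line numbers.</p>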
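<h3 id="sentry-的做法">Sentry's approach</h3>
<p>The core of Sentry's strategy is to hash only the application's own frames, excluding framework / library frames, and to normalize dynamic values in the message. Concretely:</p>
<ol>
<li><strong>Take in-app frames</strong>: ignore framework / library frames (<code>framework.js</code>, <code>node_modules/</code>) and use only the application's own frames. When the same bug fires on different framework versions, the framework frames may differ while the app frames stay the same.</li>
<li><strong>Normalize the message</strong>: strip dynamic values (numbers, UUIDs, emails) before hashing.</li>
<li><strong>Take the function name of the last in-app frame</strong>: instead of the first N frames. The last in-app frame is where the error actually occurred in application code.</li>
</ol>
<p>Sentry's strategy groups well for web frontends (many framework frames) and mobile apps (many OS / runtime frames), but it is complex to implement: you have to maintain the rules for what counts as an in-app frame.</p>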
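<p>The three steps above can be approximated in a few lines. This sketch is loosely modeled on the description in this section and is not Sentry's actual implementation. The <code>LIBRARY_MARKERS</code> list, the digits-only normalizer, and the fallback to the message when no in-app frame exists are all assumptions. Frames are ordered oldest first, so the last in-app frame is the one closest to the error.</p>

```python
import hashlib
import re

# Assumed, project-specific markers for non-application code.
LIBRARY_MARKERS = ("node_modules/", "framework.js")


def is_in_app(filename):
    return not any(marker in filename for marker in LIBRARY_MARKERS)


def normalize(message):
    """Simplified normalizer: only digit runs are replaced here."""
    return re.sub(r"\d+", "{N}", message)


def in_app_fingerprint(error_type, message, frames):
    """frames: list of (function, filename) tuples, oldest call first."""
    app_functions = [function for function, filename in frames if is_in_app(filename)]
    # Prefer the last in-app function; fall back to the normalized message.
    anchor = app_functions[-1] if app_functions else normalize(message)
    return hashlib.sha256(f"{error_type}:{anchor}".encode("utf-8")).hexdigest()


framework_v1 = [
    ("scheduler.flush", "framework.js"),
    ("Component.update", "framework.js"),
    ("UserProfile.render", "src/UserProfile.js"),
]
framework_v2 = [
    ("runLoop", "node_modules/fw/loop.js"),
    ("UserProfile.render", "src/UserProfile.js"),
]
fp_v1 = in_app_fingerprint("TypeError", "name of null", framework_v1)
fp_v2 = in_app_fingerprint("TypeError", "name of null", framework_v2)
```

<p>The two traces pass through different framework internals but share the same last in-app function, so they hash to the same group.</p>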
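<h3 id="sdk-端自定義-fingerprint">Custom fingerprints on the SDK side</h3>
<p>The SDK can set a fingerprint manually, overriding the collector's automatic computation. This lets developers group errors that are technically different but share a business-level cause.</p>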





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">monitor</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">&#34;API timeout&#34;</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;fingerprint&#34;</span><span class="p">:</span> <span class="s2">&#34;api-gateway-timeout&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;endpoint&#34;</span><span class="p">:</span> <span class="s2">&#34;/v1/users&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;duration_ms&#34;</span><span class="p">:</span> <span class="mi">30000</span>
</span></span><span class="line"><span class="cl"><span class="p">})</span></span></span></code></pre></div><p>Every error carrying <code>fingerprint: &quot;api-gateway-timeout&quot;</code> lands in the same group, regardless of whether the message and stack trace match.</p>
<p>Processing logic for custom fingerprints: when the collector receives an event, it first checks whether the <code>data.fingerprint</code> field exists. If it does, the collector hashes that value (or uses it directly as the fingerprint) and skips the automatic computation.</p>
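<p>The collector-side check can be sketched as below. The event shape (<code>type</code>, <code>message</code>, <code>data</code>), the <code>custom:</code> / <code>auto:</code> prefixes, and the choice to hash the custom value rather than store it verbatim are assumptions, and the automatic path is reduced to the basic type + message variant.</p>

```python
import hashlib


def resolve_fingerprint(event):
    """Prefer an SDK-supplied fingerprint; otherwise hash type + message."""
    custom = (event.get("data") or {}).get("fingerprint")
    if custom:
        # Prefixes keep a custom value from colliding with an automatic hash input.
        source = f"custom:{custom}"
    else:
        source = f"auto:{event.get('type', '')}:{event.get('message', '')}"
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


timeout_a = {"type": "TimeoutError", "message": "API timeout",
             "data": {"fingerprint": "api-gateway-timeout", "endpoint": "/v1/users"}}
timeout_b = {"type": "ConnectionError", "message": "upstream reset",
             "data": {"fingerprint": "api-gateway-timeout", "endpoint": "/v1/orders"}}
plain = {"type": "TimeoutError", "message": "API timeout"}
```

<p>Here <code>timeout_a</code> and <code>timeout_b</code> resolve to the same group despite different types and messages, while <code>plain</code> falls through to the automatic path.</p>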
<h2 id="message-normalization">Message normalization</h2>
<p>Dynamic values make messages from the same bug differ, which splits the fingerprint. Normalization replaces dynamic values with placeholders before the fingerprint is computed.</p>
<h3 id="替換規則">Replacement rules</h3>
<table>
  <thead>
      <tr>
          <th>Pattern</th>
          <th>Replaced with</th>
          <th>Example</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Runs of digits (3 or more)</td>
          <td><code>{N}</code></td>
          <td><code>&quot;User 12345 not found&quot;</code> → <code>&quot;User {N} not found&quot;</code></td>
      </tr>
      <tr>
          <td>UUID</td>
          <td><code>{uuid}</code></td>
          <td><code>&quot;Session a1b2...7890 expired&quot;</code> → <code>&quot;Session {uuid} expired&quot;</code></td>
      </tr>
      <tr>
          <td>Email</td>
          <td><code>{email}</code></td>
          <td><code>&quot;Invalid email foo@bar.com&quot;</code> → <code>&quot;Invalid email {email}&quot;</code></td>
      </tr>
      <tr>
          <td>IPv4 / IPv6</td>
          <td><code>{ip}</code></td>
          <td><code>&quot;Connection to 192.168.1.100 refused&quot;</code> → <code>&quot;Connection to {ip} refused&quot;</code></td>
      </tr>
      <tr>
          <td>Quoted string (longer than 20 characters)</td>
          <td><code>{string}</code></td>
          <td><code>&quot;Key 'very-long-dynamic-key...' not found&quot;</code> → <code>&quot;Key {string} not found&quot;</code></td>
      </tr>
      <tr>
          <td>User directory in an absolute path</td>
          <td><code>{path}</code></td>
          <td><code>&quot;/Users/john/project/app.js&quot;</code> → <code>&quot;{path}/project/app.js&quot;</code></td>
      </tr>
      <tr>
          <td>ISO 8601 timestamp</td>
          <td><code>{ts}</code></td>
          <td><code>&quot;Error at 2026-06-24T14:30:00&quot;</code> → <code>&quot;Error at {ts}&quot;</code></td>
      </tr>
  </tbody>
</table>
<p>The last two are advanced rules. The basic five (digits / UUID / email / IP / long strings) are enough in most scenarios; add the file path and timestamp rules when error groups fragment badly.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kn">import</span> <span class="s">&#34;regexp&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Order matters: specific patterns run before the generic digit rule.</span>
</span></span><span class="line"><span class="cl"><span class="c1">// The IP rule covers IPv4 only; IPv6 needs its own pattern.</span>
</span></span><span class="line"><span class="cl"><span class="kd">var</span> <span class="nx">normalizers</span> <span class="p">=</span> <span class="p">[]</span><span class="kd">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nx">pattern</span> <span class="o">*</span><span class="nx">regexp</span><span class="p">.</span><span class="nx">Regexp</span>
</span></span><span class="line"><span class="cl">    <span class="nx">replace</span> <span class="kt">string</span>
</span></span><span class="line"><span class="cl"><span class="p">}{</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="nx">regexp</span><span class="p">.</span><span class="nf">MustCompile</span><span class="p">(</span><span class="s">`\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b`</span><span class="p">),</span> <span class="s">&#34;{uuid}&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="nx">regexp</span><span class="p">.</span><span class="nf">MustCompile</span><span class="p">(</span><span class="s">`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`</span><span class="p">),</span> <span class="s">&#34;{email}&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="nx">regexp</span><span class="p">.</span><span class="nf">MustCompile</span><span class="p">(</span><span class="s">`\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`</span><span class="p">),</span> <span class="s">&#34;{ip}&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="nx">regexp</span><span class="p">.</span><span class="nf">MustCompile</span><span class="p">(</span><span class="s">`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}`</span><span class="p">),</span> <span class="s">&#34;{ts}&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="nx">regexp</span><span class="p">.</span><span class="nf">MustCompile</span><span class="p">(</span><span class="s">`(?:/Users/|/home/|C:\\Users\\)[^/\\]+`</span><span class="p">),</span> <span class="s">&#34;{path}&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span><span class="nx">regexp</span><span class="p">.</span><span class="nf">MustCompile</span><span class="p">(</span><span class="s">`\d{3,}`</span><span class="p">),</span> <span class="s">&#34;{N}&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// normalizeMessage applies every rule in order and returns the templated message.</span>
</span></span><span class="line"><span class="cl"><span class="kd">func</span> <span class="nf">normalizeMessage</span><span class="p">(</span><span class="nx">msg</span> <span class="kt">string</span><span class="p">)</span> <span class="kt">string</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="nx">_</span><span class="p">,</span> <span class="nx">n</span> <span class="o">:=</span> <span class="k">range</span> <span class="nx">normalizers</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nx">msg</span> <span class="p">=</span> <span class="nx">n</span><span class="p">.</span><span class="nx">pattern</span><span class="p">.</span><span class="nf">ReplaceAllString</span><span class="p">(</span><span class="nx">msg</span><span class="p">,</span> <span class="nx">n</span><span class="p">.</span><span class="nx">replace</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="nx">msg</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div><h3 id="normalization-的風險">Risks of normalization</h3>
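<p><strong>Over-normalization</strong> merges errors that are actually different. For example, if the HTTP status codes <code>404</code> and <code>500</code> are both replaced with <code>{N}</code>, then <code>&quot;HTTP {N}&quot;</code> lumps 404 and 500 into one group. Countermeasure: let a named pattern claim numbers with known meaning, such as HTTP status codes, before the generic numeric replacement runs (<code>(\b[1-5]\d{2}\b)</code> → left unreplaced). Rule order in the normalizer sets priority: the named pattern goes before <code>\d{3,}</code>, and the numbers it matches must skip all later replacements. Because <code>normalizeMessage</code> applies every rule in sequence, a rule that merely replaces a match with itself does not achieve this; the match has to be marked (for example, swapped for a placeholder that is restored at the end) so that later patterns cannot see it.</p>
<p><strong>Under-normalization</strong> misses dynamic values, so errors with the same cause split into separate groups. For example, a message contains a timestamp such as <code>&quot;Error at 2026-06-24T14:30:00&quot;</code>, but no rule covers the ISO 8601 format. Countermeasure: go live with a basic rule set, then add rules as groups fragment. Many groups under one error name with identical stack traces usually mean normalization is insufficient.</p>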
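<p>Both risks can be sketched in one small normalizer. The version below is a minimal illustration, not the collector's actual code: <code>normalizeProtected</code>, the placeholder token, and the ISO 8601 pattern are assumptions introduced here, while the status-code pattern and <code>\d{3,}</code> come from the text above. It replaces timestamps first, hides status-code-like numbers behind a placeholder, runs the generic digit rule, then restores the hidden numbers.</p>

```go
package main

import (
	"regexp"
	"strings"
)

// The timestamp pattern is an assumption of this sketch; the other two
// patterns are the ones named in the text.
var (
	isoTimestamp = regexp.MustCompile(`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?`)
	statusCode   = regexp.MustCompile(`\b[1-5]\d{2}\b`)
	longNumber   = regexp.MustCompile(`\d{3,}`)
)

// placeholder contains no digits, so longNumber cannot match it.
// It assumes messages never contain NUL bytes.
const placeholder = "\x00KEEP\x00"

// normalizeProtected is a hypothetical variant of normalizeMessage that
// shields status-code-like numbers from the generic digit rule.
func normalizeProtected(msg string) string {
	// 1. Timestamps first, before any other rule can split them into pieces.
	msg = isoTimestamp.ReplaceAllString(msg, "{TS}")

	// 2. Swap every protected number for the placeholder and remember it.
	var kept []string
	msg = statusCode.ReplaceAllStringFunc(msg, func(m string) string {
		kept = append(kept, m)
		return placeholder
	})

	// 3. Generic rule: any remaining run of three or more digits.
	msg = longNumber.ReplaceAllString(msg, "{N}")

	// 4. Restore the protected numbers in their original order.
	for _, k := range kept {
		msg = strings.Replace(msg, placeholder, k, 1)
	}
	return msg
}
```

<p>With this sketch, <code>&quot;HTTP 404 for user 12345&quot;</code> and <code>&quot;HTTP 500 for user 67890&quot;</code> stay in separate groups while the user ID is still normalized. One limitation carries over from the pattern itself: any standalone three-digit number between 100 and 599 is protected, whether or not it is a status code.</p>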
<h2 id="storage-設計">Storage design</h2>
<p>Fingerprint storage has two parts: a fingerprint column added to the events table, and a new error_groups table that tracks a summary of each group.</p>
<h3 id="events-表擴充">Extending the events table</h3>
<p>Add a <code>fingerprint</code> column to the <a href="/blog/monitoring/04-collector/scaling-evolution/" data-link-title="規模演進" data-link-desc="可插拔 Storage Backend 架構 — SQLite 預設、PostgreSQL 觸發切換、時間序列 DB 長期演進">existing events table</a>:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="k">COLUMN</span><span class="w"> </span><span class="n">fingerprint</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_fingerprint</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">events</span><span class="p">(</span><span class="n">fingerprint</span><span class="p">);</span></span></span></code></pre></div><p><code>fingerprint</code> stores a hash value. The first 16 characters of the SHA256 hex digest (64 bits) are enough, because a self-hosted deployment will not see enough distinct error kinds for collisions to matter. The index speeds up the query that lists all events of one error group.</p>
<h3 id="error_groups-表">The error_groups table</h3>
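<p>As a rough check on the 16-character claim, the sketch below truncates a SHA-256 digest and estimates collision risk with the birthday approximation. The function names and the <code>\x1f</code> separator are assumptions made for illustration; the collector's own <code>calculateFingerprint</code> may combine its inputs differently.</p>

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"math"
	"strings"
)

// fingerprintHash joins the grouping components and returns the first
// 16 hex characters (64 bits) of their SHA-256 digest.
func fingerprintHash(parts ...string) string {
	// The unit-separator delimiter is an assumption of this sketch; it keeps
	// ("ab", "c") and ("a", "bc") from producing the same hash input.
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x1f")))
	return hex.EncodeToString(sum[:])[:16]
}

// collisionProbability applies the birthday approximation
// p ≈ n² / 2^(bits+1), which holds while p is small.
func collisionProbability(groups float64, bits int) float64 {
	return groups * groups / math.Pow(2, float64(bits+1))
}
```

<p>For one million distinct groups and a 64-bit fingerprint, the estimate is about 2.7e-8, far more groups than a self-hosted collector is likely to accumulate.</p>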





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">error_groups</span><span class="w"> </span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 2</span><span class="cl"><span class="w">    </span><span class="n">fingerprint</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 3</span><span class="cl"><span class="w">    </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 4</span><span class="cl"><span class="w">    </span><span class="n">error_type</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 5</span><span class="cl"><span class="w">    </span><span class="n">normalized_message</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 6</span><span class="cl"><span class="w">    </span><span class="k">count</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 7</span><span class="cl"><span class="w">    </span><span class="n">first_seen</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 8</span><span class="cl"><span class="w">    </span><span class="n">last_seen</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln"> 9</span><span class="cl"><span class="w">    </span><span class="n">last_event_id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">events</span><span class="p">(</span><span class="n">id</span><span class="p">),</span><span class="w">
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="w">    </span><span class="n">session_count</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="w">    </span><span class="n">status</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="s1">&#39;open&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="w"></span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_error_groups_last_seen</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">error_groups</span><span class="p">(</span><span class="n">last_seen</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="w"></span><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">idx_error_groups_count</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">error_groups</span><span class="p">(</span><span class="k">count</span><span class="p">);</span></span></span></code></pre></div><p><code>status</code> supports basic issue management: <code>open</code> (needs attention), <code>resolved</code> (fixed), and <code>ignored</code> (known, will not be handled). A resolved group that receives a new event is reopened automatically.</p>
<h3 id="寫入流程">Write path</h3>
<p>The collector's write pipeline adds a fingerprint calculation step after schema validation and before the storage write. The UPSERT logic below references the <code>session_id</code> column of the events table, which is defined in the <a href="/blog/monitoring/04-collector/scaling-evolution/" data-link-title="規模演進" data-link-desc="可插拔 Storage Backend 架構 — SQLite 預設、PostgreSQL 觸發切換、時間序列 DB 長期演進">main events table DDL</a> (flattened from <code>session.id</code>):</p>
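<p>The reopen rule is small enough to state as a pure function. This is a sketch with hypothetical names (<code>Status</code>, <code>onNewEvent</code>); it applies the same rule as the <code>CASE</code> expression in the write path's UPSERT.</p>

```go
package main

// Status is the lifecycle state of an error group.
type Status string

const (
	StatusOpen     Status = "open"
	StatusResolved Status = "resolved"
	StatusIgnored  Status = "ignored"
)

// onNewEvent returns the status a group should have after a new event arrives.
func onNewEvent(s Status) Status {
	if s == StatusResolved {
		// A fixed issue has reappeared, so it goes back to the open list.
		return StatusOpen
	}
	// Open groups stay open, and ignored groups stay ignored.
	return s
}
```

<p>Leaving <code>ignored</code> untouched is the design choice to note: a known, accepted error should not return to the open list just because it keeps occurring.</p>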





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="ln">1</span><span class="cl">HTTP → Schema validation → Fingerprint calculation → Events INSERT → error_groups UPSERT</span></span></code></pre></div>




<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="ln"> 1</span><span class="cl"><span class="kd">func</span> <span class="nf">processErrorEvent</span><span class="p">(</span><span class="nx">event</span> <span class="nx">Event</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">    <span class="nx">fp</span> <span class="o">:=</span> <span class="nf">calculateFingerprint</span><span class="p">(</span><span class="nx">event</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">    <span class="nx">event</span><span class="p">.</span><span class="nx">Fingerprint</span> <span class="p">=</span> <span class="nx">fp</span>
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">    <span class="c1">// 1. INSERT event</span>
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">    <span class="nx">db</span><span class="p">.</span><span class="nf">InsertEvent</span><span class="p">(</span><span class="nx">event</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">    <span class="c1">// 2. UPSERT error_group</span>
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">    <span class="nx">db</span><span class="p">.</span><span class="nf">Exec</span><span class="p">(</span><span class="s">`
</span></span></span><span class="line"><span class="ln">10</span><span class="cl"><span class="s">        INSERT INTO error_groups (fingerprint, name, error_type, normalized_message,
</span></span></span><span class="line"><span class="ln">11</span><span class="cl"><span class="s">                                  count, first_seen, last_seen, last_event_id, session_count)
</span></span></span><span class="line"><span class="ln">12</span><span class="cl"><span class="s">        VALUES (?, ?, ?, ?, 1, ?, ?, ?, 1)
</span></span></span><span class="line"><span class="ln">13</span><span class="cl"><span class="s">        ON CONFLICT(fingerprint) DO UPDATE SET
</span></span></span><span class="line"><span class="ln">14</span><span class="cl"><span class="s">            count = count + 1,
</span></span></span><span class="line"><span class="ln">15</span><span class="cl"><span class="s">            last_seen = excluded.last_seen,
</span></span></span><span class="line"><span class="ln">16</span><span class="cl"><span class="s">            last_event_id = excluded.last_event_id,
</span></span></span><span class="line"><span class="ln">17</span><span class="cl"><span class="s">            session_count = session_count + CASE
</span></span></span><span class="line"><span class="ln">18</span><span class="cl"><span class="s">                WHEN ? NOT IN (SELECT DISTINCT session_id FROM events WHERE fingerprint = ? AND id != ?)
</span></span></span><span class="line"><span class="ln">19</span><span class="cl"><span class="s">                THEN 1 ELSE 0 END,
</span></span></span><span class="line"><span class="ln">20</span><span class="cl"><span class="s">            status = CASE WHEN status = &#39;resolved&#39; THEN &#39;open&#39; ELSE status END
</span></span></span><span class="line"><span class="ln">21</span><span class="cl"><span class="s">    `</span><span class="p">,</span> <span class="nx">fp</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">Name</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">ErrorType</span><span class="p">,</span> <span class="nf">normalizeMessage</span><span class="p">(</span><span class="nx">event</span><span class="p">.</span><span class="nx">ErrorMessage</span><span class="p">),</span>
</span></span><span class="line"><span class="ln">22</span><span class="cl">       <span class="nx">event</span><span class="p">.</span><span class="nx">Timestamp</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">Timestamp</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">ID</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">SessionID</span><span class="p">,</span> <span class="nx">fp</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">ID</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">23</span><span class="cl"><span class="p">}</span></span></span></code></pre></div><p>Because the event is inserted before the UPSERT runs, the subquery excludes the current event's own row (<code>id != ?</code>); otherwise the session would always look already seen and <code>session_count</code> would never grow. The <code>session_count</code> subquery can become a bottleneck under high write volume. A pragmatic alternative is to leave session_count out of the UPSERT and have a periodic job recompute it (once an hour).</p>
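<p>The periodic recount boils down to counting distinct sessions per fingerprint. The in-memory sketch below shows that logic only; <code>EventRow</code> and <code>recountSessions</code> are hypothetical names, and a real job would read from and write back to the database.</p>

```go
package main

// EventRow is a hypothetical projection of the two events columns the job needs.
type EventRow struct {
	Fingerprint string
	SessionID   string
}

// recountSessions returns the number of distinct sessions seen per fingerprint.
func recountSessions(rows []EventRow) map[string]int {
	seen := make(map[string]map[string]struct{})
	for _, r := range rows {
		if r.SessionID == "" {
			continue // events without a session do not add to the count
		}
		if seen[r.Fingerprint] == nil {
			seen[r.Fingerprint] = make(map[string]struct{})
		}
		seen[r.Fingerprint][r.SessionID] = struct{}{}
	}

	counts := make(map[string]int, len(seen))
	for fp, sessions := range seen {
		counts[fp] = len(sessions)
	}
	return counts
}
```

<p>In SQLite the same result can come from a single statement, for example <code>UPDATE error_groups SET session_count = (SELECT COUNT(DISTINCT session_id) FROM events WHERE events.fingerprint = error_groups.fingerprint)</code>, which the existing <code>idx_fingerprint</code> index helps serve.</p>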
<h3 id="查詢模式">Query patterns</h3>
<p>The dashboard's error list switches from <code>GROUP BY name</code> to querying the error_groups table:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="c1">-- Before: group by name (coarse)
</span></span></span><span class="line"><span class="ln">2</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">&#39;error&#39;</span><span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">name</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="ln">3</span><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="ln">4</span><span class="cl"><span class="w"></span><span class="c1">-- After: group by fingerprint (precise)
</span></span></span><span class="line"><span class="ln">5</span><span class="cl"><span class="c1"></span><span class="k">SELECT</span><span class="w"> </span><span class="n">fingerprint</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">error_type</span><span class="p">,</span><span class="w"> </span><span class="n">normalized_message</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="ln">6</span><span class="cl"><span class="w">       </span><span class="k">count</span><span class="p">,</span><span class="w"> </span><span class="n">first_seen</span><span class="p">,</span><span class="w"> </span><span class="n">last_seen</span><span class="p">,</span><span class="w"> </span><span class="n">session_count</span><span class="p">,</span><span class="w"> </span><span class="n">status</span><span class="w">
</span></span></span><span class="line"><span class="ln">7</span><span class="cl"><span class="w"></span><span class="k">FROM</span><span class="w"> </span><span class="n">error_groups</span><span class="w">
</span></span></span><span class="line"><span class="ln">8</span><span class="cl"><span class="w"></span><span class="k">WHERE</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s1">&#39;ignored&#39;</span><span class="w">
</span></span></span><span class="line"><span class="ln">9</span><span class="cl"><span class="w"></span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">last_seen</span><span class="w"> </span><span class="k">DESC</span><span class="p">;</span></span></span></code></pre></div><p>Queries against error_groups are index scans and never touch the events table. The table holds one row per group rather than one row per event, so when the dashboard refreshes often (every 30 seconds), reading error_groups is orders of magnitude faster than a <code>GROUP BY</code> full-table scan.</p>
<p>When a group is opened for detail, the fingerprint is used to fetch the most recent N events from the events table:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="ln">1</span><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">events</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">fingerprint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">?</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">ts</span><span class="w"> </span><span class="k">DESC</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">20</span><span class="p">;</span></span></span></code></pre></div><h2 id="dashboard-整合">Dashboard integration</h2>
<p>Error fingerprints change both the error list and the detail view of the <a href="/blog/monitoring/04-collector/dashboard-developer/" data-link-title="Developer Dashboard 設計" data-link-desc="Bug 在哪、多嚴重、怎麼重現 — Error 列表和趨勢的日常監控、Session 回放和 Stack trace 的深入 debug">Developer Dashboard</a>.</p>
<h3 id="error-列表升級">Error list upgrade</h3>
<p>Grouping moves from name to fingerprint:</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>Before (grouped by name)</th>
          <th>After (grouped by fingerprint)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Grouping dimension</td>
          <td>error.name</td>
          <td>fingerprint hash</td>
      </tr>
      <tr>
          <td>Errors with the same name but different causes</td>
          <td>Mixed into one row</td>
          <td>One row each</td>
      </tr>
      <tr>
          <td>Errors with different names but the same cause</td>
          <td>Split into two rows</td>
          <td>Can be merged with a custom fingerprint</td>
      </tr>
      <tr>
          <td>Affected session count</td>
          <td>DISTINCT computed on every query</td>
          <td>Precomputed in error_groups</td>
      </tr>
      <tr>
          <td>Status management</td>
          <td>None</td>
          <td>open / resolved / ignored</td>
      </tr>
      <tr>
          <td>Query performance</td>
          <td>GROUP BY scans the events table</td>
          <td>Reads error_groups directly</td>
      </tr>
  </tbody>
</table>
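<h3 id="error-詳情升級">Error detail upgrade</h3>
<p>Opening an error group shows:</p>
<ul>
<li><strong>Representative stack trace</strong>: the stack trace of the most recent event, so developers see exactly where the error occurs</li>
<li><strong>Normalized message</strong>: the error message with dynamic values removed, which makes clear what problem the group represents</li>
<li><strong>Trend</strong>: the group's event volume over time (rising = more users are hitting it, falling = it may be recovering on its own)</li>
<li><strong>Affected versions</strong>: distribution by <code>source.version</code>; a group that first appears in a new version is usually a regression</li>
<li><strong>Affected platforms</strong>: distribution by <code>source.platform</code>; a group limited to one platform is usually a platform-specific bug</li>
</ul>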
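<p>The trend and version views are plain aggregations over one group's events. The sketch below assumes a hypothetical <code>GroupEvent</code> struct with a Unix-seconds timestamp and the <code>source.version</code> value; the real event schema may differ.</p>

```go
package main

// GroupEvent is a hypothetical projection of one event in an error group.
type GroupEvent struct {
	TS      int64  // event time in Unix seconds
	Version string // value of source.version
}

// hourlyTrend counts events per hour, keyed by the start of each hour bucket.
func hourlyTrend(events []GroupEvent) map[int64]int {
	trend := make(map[int64]int)
	for _, e := range events {
		bucket := e.TS - e.TS%3600 // round down to the hour
		trend[bucket]++
	}
	return trend
}

// versionDistribution counts events per application version.
func versionDistribution(events []GroupEvent) map[string]int {
	dist := make(map[string]int)
	for _, e := range events {
		dist[e.Version]++
	}
	return dist
}
```

<p>A group whose distribution contains only the newest version is a candidate regression, which is the signal the version view is meant to surface.</p>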
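<h2 id="自架方案的務實邊界">Practical limits of the self-hosted approach</h2>
<p>The self-hosted collector's fingerprint mechanism has clear capability gaps compared with commercial products such as <a href="/blog/monitoring/06-commercial-comparison/sentry-deep-dive/" data-link-title="Sentry 深入" data-link-desc="Error tracking &#43; performance monitoring &#43; session replay 的架構 — Sentry 從 error-first 出發如何擴展到全面可觀測性">Sentry</a>.</p>
<h3 id="stack-trace-可讀性">Stack trace readability</h3>
<p>Stack trace grouping assumes the stack trace is readable, meaning the function and file names in each frame map to source code. Two situations make a stack trace unreadable:</p>
<p><strong>Minified JS</strong>: once production JS is minified, stack traces turn into <code>a.js:1:2345</code> and no longer point to a source location. Sentry accepts uploaded source maps and resolves frames automatically on the server. For a self-hosted setup: use unminified JS during development, where stack traces map directly to source; if production ships minified JS, either build your own source map server or give up stack trace grouping for JS and fingerprint on error name + message instead.</p>
<p><strong>Android ProGuard / R8 obfuscation</strong>: after obfuscation, class and method names in the stack trace look like <code>a.b.c()</code>. Sentry and Crashlytics accept an uploaded mapping file for deobfuscation. A self-hosted setup whose targets include native Android (not Flutter) needs its own mapping-based deobfuscation flow.</p>
<p>Flutter and Python are not affected by either problem. Flutter's debug / profile builds keep the full stack trace, and Dart has its own stack trace format that never passes through ProGuard. Python stack traces always include the original file name and line number.</p>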
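<p>The fallback for unreadable stack traces can be sketched as a choice of fingerprint inputs. Everything below is an assumption made for illustration: the <code>Frame</code> struct, the function names, and especially the heuristic that treats frames on line 1 with a large column number as minified.</p>

```go
package main

// Frame is a hypothetical parsed stack frame.
type Frame struct {
	Function string
	File     string
	Line     int
	Column   int
}

// minifiedColumnThreshold is an arbitrary cut-off chosen for this sketch.
const minifiedColumnThreshold = 500

// looksMinified reports whether most frames sit on line 1 at a large column,
// the shape that a minified bundle such as "a.js:1:2345" produces.
func looksMinified(frames []Frame) bool {
	suspicious := 0
	for _, f := range frames {
		if f.Line == 1 && f.Column > minifiedColumnThreshold {
			suspicious++
		}
	}
	return len(frames) > 0 && suspicious*2 > len(frames)
}

// fingerprintParts picks the inputs to hash: the top frames when the stack
// trace is readable, otherwise error name plus normalized message.
func fingerprintParts(errName, normalizedMsg string, frames []Frame, topN int) []string {
	if len(frames) == 0 || looksMinified(frames) {
		return []string{errName, normalizedMsg}
	}
	parts := []string{errName}
	for i, f := range frames {
		if i >= topN {
			break
		}
		// Line numbers are left out so unrelated edits do not split the group.
		parts = append(parts, f.File+":"+f.Function)
	}
	return parts
}
```

<p>The returned parts would then be joined and hashed by whatever fingerprint function the collector already uses. The heuristic is deliberately crude; a platform field on the event, if one is available, would be a more reliable switch.</p>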
<h3 id="ml-based-grouping">ML-based grouping</h3>
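<p>Sentry's advanced grouping uses machine learning to decide whether errors that are semantically the same but structurally different belong in one group. For example, one bug can produce different stack traces because the async/await call chain differs, and an ML model can recognize them as the same root cause.</p>
<p>The self-hosted approach groups by rules (fingerprint algorithm + normalization). Rules cover less than ML does: when a case falls outside them, add a normalization rule by hand or correct the grouping with a custom fingerprint on the SDK side.</p>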
<h3 id="能力定位">Capability positioning</h3>
<table>
  <thead>
      <tr>
          <th>Capability</th>
          <th>Self-hosted</th>
          <th>Sentry</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Basic grouping</td>
          <td>type + normalized message</td>
          <td>type + in-app frame + ML</td>
      </tr>
      <tr>
          <td>Stack trace grouping</td>
          <td>top N frames (plaintext stack traces)</td>
          <td>in-app frame + source map + deobfuscation</td>
      </tr>
      <tr>
          <td>Custom fingerprint</td>
          <td>SDK-side <code>data.fingerprint</code></td>
          <td>SDK-side + server-side rules</td>
      </tr>
      <tr>
          <td>Message normalization</td>
          <td>regex replacement</td>
          <td>regex + ML</td>
      </tr>
      <tr>
          <td>Issue management</td>
          <td>open / resolved / ignored</td>
          <td>+ assign / merge / snooze / trend</td>
      </tr>
  </tbody>
</table>
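<p>Basic grouping and message normalization cover most needs in a self-hosted setting. Where stack traces are plaintext (Python / Flutter / unminified JS), stack trace grouping performs on par with Sentry. The gap lies mainly in minified / obfuscated environments and in ML-based grouping, and those two are exactly the core paid value of commercial products.</p>
<h2 id="下一步路由">Where to go next</h2>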
<ul>
<li>Day-to-day monitoring of the error list and trends → <a href="/blog/monitoring/04-collector/dashboard-developer/" data-link-title="Developer Dashboard 設計" data-link-desc="Bug 在哪、多嚴重、怎麼重現 — Error 列表和趨勢的日常監控、Session 回放和 Stack trace 的深入 debug">Developer Dashboard design</a></li>
<li>The collector's processing pipeline → <a href="/blog/monitoring/04-collector/architecture/" data-link-title="Collector 架構" data-link-desc="HTTP endpoint → JSON Schema 驗證 → 儲存 → 查詢 → rule engine 的五段式處理鏈路">Collector architecture</a></li>
<li>Identifying forged errors → <a href="/blog/monitoring/07-security-privacy/client-sdk-authentication/" data-link-title="Client-side SDK 認證的根本限制" data-link-desc="嵌在 client 端的 credential 必然可被提取 — 認清 architecture 天花板後的多層緩解策略，從 origin 驗證到 device attestation">Client-side SDK authentication</a></li>
<li>Sentry's error tracking architecture → <a href="/blog/monitoring/06-commercial-comparison/sentry-deep-dive/" data-link-title="Sentry 深入" data-link-desc="Error tracking &#43; performance monitoring &#43; session replay 的架構 — Sentry 從 error-first 出發如何擴展到全面可觀測性">Sentry deep dive</a></li>
<li>End-to-end integrity of error events → <a href="/blog/monitoring/04-collector/data-integrity/" data-link-title="端到端資料完整性" data-link-desc="從 SDK 到 storage 的資料損失地圖 — 每個環節的損失類型、控制策略、完整性指標、被自己 SDK DDoS 的防護">End-to-end data integrity</a></li>
</ul>
]]></content:encoded></item></channel></rss>