Sentry on Tarragon

Sentry in depth

Fri, 19 Jun 2026 00:00:00 +0000

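Division of labor with Backend 04: this article covers Sentry's error tracking, performance monitoring, and session replay from the client-side usage angle: how to instrument with the SDK, how errors are grouped, and how releases are tracked. Server-side platform governance (alert routing integration, SLI design, self-hosted vs SaaS cost governance, and OTel integration) is covered in the Backend 04 Sentry vendor page.

Sentry's core is error tracking: it automatically captures unhandled exceptions, provides stack traces, and groups errors that share a root cause. On top of error tracking, Sentry adds performance monitoring (transactions and spans) and session replay (replaying user interactions).

Error tracking

Sentry's error tracking architecture has three layers: automatic capture in the SDK, issue grouping on the server, and issue management in the UI.

Automatic capture

The Sentry SDK registers global error handlers on each platform (the same mechanism as the automatic interception described in Module 3). When an exception is captured, the SDK collects the stack trace, breadcrumbs (recent user actions), device context (OS, browser, device model), and custom tags, then packages them into an event and sends it to the Sentry server.

Issue grouping

When the Sentry server receives an error event, it uses a fingerprinting algorithm to decide whether the error belongs to an existing issue. The default fingerprint is based on stack trace frames: if two errors' stack traces point to the same location, they go into the same issue.

Custom fingerprints let developers control the grouping logic. For example, the same API error triggered by different users may have different stack traces (because the call sites differ) while sharing one root cause; a custom fingerprint puts them into the same issue.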
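To make the grouping idea concrete, here is a minimal sketch of frame-based fingerprinting with a custom override. It is a toy model, not Sentry's actual algorithm: the event shape, the `in_app` flag, and both function names are assumptions made for illustration.

```python
import hashlib

def default_fingerprint(exc_type, frames):
    """Toy default: hash the exception type plus the application frames."""
    # Keep only frames flagged as application code (hypothetical "in_app" flag).
    in_app = [f"{f['module']}:{f['function']}" for f in frames if f.get("in_app")]
    raw = "|".join([exc_type, *in_app])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]

def fingerprint(event):
    """A custom fingerprint, when present, replaces the stack-based default."""
    custom = event.get("fingerprint")
    if custom:
        return "|".join(custom)
    return default_fingerprint(event["type"], event["frames"])

# Same API error raised from two different call sites.
cart = {"type": "ApiError", "frames": [
    {"module": "http.client", "function": "send", "in_app": False},
    {"module": "cart", "function": "load", "in_app": True},
]}
profile = {"type": "ApiError", "frames": [
    {"module": "http.client", "function": "send", "in_app": False},
    {"module": "profile", "function": "save", "in_app": True},
]}
```

With the default, `cart` and `profile` land in different groups because their application frames differ; adding `fingerprint=["api-error"]` to both events makes them share one group.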

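Issue management

Each issue has a status (unresolved / resolved / ignored), an assignee (who is responsible for the fix), and a trend (whether its frequency is rising or falling). The Sentry UI provides an issue list, trend charts, and impact scope (how many users are affected).

Performance monitoring

Sentry's performance monitoring uses a transaction and span model (the same concept as OpenTelemetry's trace / span).

A transaction represents one complete operation (a page load, handling an API request). A span is a sub-operation inside a transaction (a database query, an external API call). The durations of transactions and spans make up the time breakdown of an operation.

The value of performance monitoring is finding "slow" problems: a P95 response time above the threshold, or one span taking 80% of a transaction's time. It complements error tracking: errors tell you what broke, performance tells you what got slow.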
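The two "slow" signals mentioned above can be sketched in a few lines. This is an illustrative calculation over hand-made data, not how Sentry computes its charts; the dictionary layout and both function names are assumptions.

```python
def p95(durations_ms):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    ordered = sorted(durations_ms)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def dominant_span(transaction, threshold=0.8):
    """Return the op of the first span that takes >= threshold of the transaction."""
    total = transaction["duration_ms"]
    for span in transaction["spans"]:
        if span["duration_ms"] / total >= threshold:
            return span["op"]
    return None

checkout = {
    "duration_ms": 1000,
    "spans": [
        {"op": "db", "duration_ms": 850},
        {"op": "http", "duration_ms": 100},
    ],
}
```

Here `dominant_span(checkout)` points at the `db` span, the kind of finding that an error count alone would never surface.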

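Session replay

Session replay records what the user did (DOM changes, mouse movement, click events) and replays it in the Sentry UI. Developers can see what the user did before triggering an error.

Session replay is implemented as DOM snapshots plus mutation recording. It records changes to the DOM structure (not a screen recording) and rebuilds the DOM at replay time. The data volume is much smaller than video, but it is still the largest of all Sentry features.

Privacy: session replay can see what users type unless masking is in effect. Sentry provides privacy configuration to control which elements are masked (input fields, sensitive data regions).

Gap between a self-built solution and Sentry

Capability	Self-built	Sentry
Error capture	SDK auto-interception	SDK auto-interception (same)
Issue grouping	Manual grouping with grep	Automatic fingerprinting + custom rules
Trend analysis	Manual counting	Automatic trend charts + alerts
Performance	Metric events + manual analysis	Transaction / span + automatic P95
Session replay	None	DOM recording + replay UI

Sentry's core value lies in issue grouping and trend analysis: it sorts a large volume of error events into a manageable issue list and tracks each issue's trend automatically. A self-built solution based on grep cannot do automatic grouping.

Next steps

Firebase's integrated approach → Firebase suite
Datadog's full-stack APM → Datadog RUM
Self-built vs commercial → Self-built vs commercial decision table
Error fingerprinting in a self-built solution → Error Fingerprint and dedup grouping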

Event type mapping in commercial solutions

Fri, 19 Jun 2026 00:00:00 +0000

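Each commercial monitoring solution has its own event taxonomy. Understanding how each taxonomy maps to the four event classes (event / error / metric / lifecycle) is what lets you map self-built events correctly at integration time and avoid data loss or misclassification.

Sentry

Sentry's core concept is error tracking, extended to performance monitoring and session replay.

Four classes	Sentry equivalent	Notes
Event	Breadcrumb	User actions are recorded in the breadcrumb trail and attached to errors
Error	Event (Exception type)	Sentry's core. Automatic capture + manual captureException
Metric	Transaction + Span	The unit of measurement in performance monitoring
Lifecycle	Breadcrumb (navigation)	App lifecycle is recorded as navigation/system breadcrumbs

Sentry's design assumption is that errors are the protagonist and other events are context for errors. Events and lifecycle both attach to error reports as breadcrumbs, so the ability to view them on their own is limited. By default only the most recent 100 breadcrumbs are kept, and they cannot be queried independently: breadcrumbs are attachments to an error report, not a standalone event database. The Transaction + Span mapping for metrics, by contrast, has its own Performance page, a separate UI entry point from errors. If the main need is behavioral analytics rather than error tracking, Sentry's breadcrumb model may fall short.

Firebase Crashlytics + Analytics

Firebase splits error tracking and behavioral analytics into two independent products.

Four classes	Firebase equivalent	Notes
Event	Analytics custom event	A GA4 event, with parameters as attached attributes
Error	Crashlytics exception	Fatal and non-fatal exceptions are handled separately
Metric	Analytics event + parameters	Numeric values are recorded in event parameters (no native metric)
Lifecycle	Analytics auto events	screen_view, app_open, and similar events are collected automatically

The defining trait of Firebase is that Crashlytics and Analytics operate independently: error data lives in the Crashlytics console, behavioral data in the Analytics console. There is no native metric support; values can only be recorded in the parameters of an Analytics event (for example event: 'page_load', parameters: {duration_ms: 320}), and querying them means aggregating the BigQuery export yourself. Correlating the two consoles is manual: set a user ID as a Crashlytics custom key, then look up behavior in Analytics with the same ID.

Datadog RUM

Datadog Real User Monitoring designs client-side monitoring from a full-stack APM perspective.

Four classes	Datadog RUM equivalent	Notes
Event	Action	User actions (click, tap, scroll), captured automatically or manually
Error	Error	JS exceptions, network errors, custom errors
Metric	Long Task + custom	Long tasks are captured automatically; custom metrics go through global context
Lifecycle	View	Entering and leaving a page or screen, with automatic detection of SPA route changes

The defining trait of Datadog RUM is deep integration with backend APM. A client-side action can be linked to a server-side trace, giving a complete chain from button click to database query. Self-built solutions usually cannot reach this depth of cross-layer correlation.

Integration strategy

Mapping principles when integrating a commercial solution:

Self-built event names are the source of truth. Commercial event names are a mapping of the self-built names, not a replacement. Keep the mapping logic in a single adapter layer, so that switching vendors only changes the adapter.

Do not change the self-built classification to fit a commercial solution. Sentry recording events as breadcrumbs does not mean the self-built solution should demote events to an appendage of errors. The self-built four-class taxonomy is semantically correct; a vendor's taxonomy is its own product design.

Deduplicate when integrating several solutions at once. Sending the same error to both Sentry and Crashlytics produces duplicates. Control which class of event goes to which solution in the adapter layer, so the same event does not show up on several dashboards.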
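The adapter principle can be sketched as a small routing layer. This is a hedged illustration: the `ROUTES` table, the `Adapter` class, and the vendor labels are hypothetical, and the in-memory `sent` log stands in for real vendor SDK calls.

```python
from collections import defaultdict

# Each of the four event classes is routed to exactly one vendor,
# so the same event never appears on two dashboards.
ROUTES = {
    "event": "analytics",
    "error": "sentry",
    "metric": "sentry",
    "lifecycle": "analytics",
}

class Adapter:
    """Keeps self-built names as the source of truth and owns vendor routing."""

    def __init__(self, routes):
        self.routes = routes
        self.sent = defaultdict(list)  # stand-in for real vendor SDK calls

    def emit(self, kind, name, payload=None):
        vendor = self.routes[kind]  # raises KeyError for an unknown class
        self.sent[vendor].append((name, payload or {}))
        return vendor

adapter = Adapter(ROUTES)
adapter.emit("error", "checkout_failed", {"order_id": "o-1"})
adapter.emit("event", "button_clicked")
```

Swapping a vendor then means editing `ROUTES` (and the real send calls behind it), while call sites keep emitting the same self-built names.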

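Next steps

Definitions of the four event classes → Full definitions of the four event classes
In-depth comparison of commercial solutions → Module 6: Commercial solution comparison
Event naming conventions → Event naming conventions

Module 6: Commercial solution comparison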

Fri, 19 Jun 2026 00:00:00 +0000

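Answers the question "when should you switch from a self-built solution to a commercial one?"

Sections to be written

Self-built vs commercial decision table (user count / network scope / feature needs / compliance requirements)
Sentry in depth (architecture of error + performance + session replay)
Firebase suite (integration of Crashlytics + Analytics + Remote Config)
Datadog RUM (the client-side view of full-stack APM)
Mixpanel / Amplitude (dedicated behavioral analytics vs general-purpose monitoring)
Deployment spectrum (four paths: BaaS + Serverless / PaaS / fully self-hosted / commercial SaaS)

Cross-category references

→ monitoring Module 8, commercial use: the core selling point of commercial solutions is behavioral analytics
→ backend 04 observability: comparison with server-side commercial solutions (Datadog / New Relic)

Sentry Error Grouping and Fingerprinting Strategy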

Mon, 22 Jun 2026 00:00:00 +0000

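This is a Sentry vendor deep article that expands the "Issue grouping / fingerprint" section of the overview. Readers new to Sentry should start with the Sentry service page.

Problem context

Error grouping determines what Sentry is like to use. If grouping is too coarse (different bugs merged into one issue), the team misses new problems; if it is too fine (one bug split into hundreds of issues), the issue list turns into noise. Understanding Sentry's grouping algorithm and the custom fingerprint mechanism is what makes the issue list reflect the real number of bugs rather than the number of error events.

Default grouping algorithm

Stack trace first

Sentry's default grouping strategy is built around exception type + stack trace. Two error events are placed in the same issue if their exception types match and the "relevant frames" of their stack traces match.

"Relevant frames" are Sentry's own determination: it filters out standard library frames, framework-internal frames, and known noise frames, keeping only application code frames. This filtering logic is called stack trace rules and is decided automatically by Sentry's grouping engine.

Grouping versions

Sentry's grouping algorithm exists in several versions (called grouping configs). New projects automatically use the latest version (newstyle:2023-01-11 as of 2024), while older projects may still be on an older one. Upgrading the grouping config changes which issue an event belongs to: events that used to be merged may be split, and events that used to be separate may be merged.

To check the current grouping config: Project Settings → General Settings → Event Grouping. Before upgrading, use Sentry's grouping preview to test the impact.

Non-exception events

Events without a stack trace (capture_message, breadcrumb-only events, CSP violations) are grouped by message content. Events with the same message template go into the same issue.

If the message contains dynamic values (user ID, request ID, timestamp), Sentry tries to detect and ignore the dynamic parts. Detection is imperfect: if the message format is inconsistent, one kind of error can be split into several issues.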
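One way to reduce that risk on the application side is to normalize messages before sending them. The sketch below is an assumption-laden illustration, not Sentry's parameterization logic: the patterns and the `normalize_message` name are made up for this example and cover only UUIDs, ISO-style timestamps, and bare integers.

```python
import re

# Order matters: replace the most specific shapes first.
_PATTERNS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?\b"), "<timestamp>"),
    (re.compile(r"\b\d+\b"), "<int>"),
]

def normalize_message(message):
    """Replace dynamic values with placeholders so the template stays stable."""
    for pattern, placeholder in _PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

print(normalize_message("User 42 timed out after 3000 ms"))
print(normalize_message("User 7 timed out after 150 ms"))
```

Both calls print the same template, so the two events would share one message-based group.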

Custom fingerprints

When to customize

Common scenarios where default grouping falls short:

Scenario	Problem	Fingerprint approach
External API timeout	Different callers have different stack traces but the same root cause	Fingerprint on `{{ default }}` + error type
Database connection error	Every query has a different stack trace	Fingerprint on the error message pattern
Minified frontend code	Missing source maps make frames unstable	Fix the source map upload first instead of forcing a fingerprint
Rate limit / 429 error	A flood of 429s is split into hundreds of issues	Fingerprint on the HTTP status code

Server-side fingerprint rules

Configure these under Project Settings → Issue Grouping → Fingerprint Rules. Syntax:

# Group every ConnectionError into one issue
error.type:ConnectionError -> connection-error

# Group a specific message pattern into one issue
message:"Rate limit exceeded*" -> rate-limit

# Group all errors from a specific module
module:payment.gateway.* -> payment-gateway-error

# Combined conditions
error.type:TimeoutError module:external.api.* -> external-api-timeout

Priority of server-side rules: rules are evaluated from top to bottom, and the first rule that matches an event is applied. Put more specific rules above broader ones.

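SDK-side fingerprint

Set event["fingerprint"] in the SDK's before_send callback:

import sentry_sdk

def before_send(event, hint):
    # hint["exc_info"] is (exc_type, exc_value, traceback) when an exception was captured
    exc_info = hint.get("exc_info")
    # Match by class name anywhere in the MRO, so library-specific ConnectionError classes count too
    if exc_info and any(c.__name__ == "ConnectionError" for c in exc_info[0].__mro__):
        event["fingerprint"] = ["connection-error"]
    return event

sentry_sdk.init(dsn="...", before_send=before_send)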

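Differences between SDK-side and server-side:

Aspect	Server-side rules	SDK-side fingerprint
Where configured	Sentry web UI	Code
Rollout speed	Takes effect immediately	Requires a deploy
Visibility	The whole team can see and edit them	Scattered through the code
Complex logic	Pattern matching only	Arbitrary program logic

Prefer server-side rules: they are centrally managed and take effect immediately. Use SDK-side fingerprints for complex logic that server-side rules cannot express.

`{{ default }}` composition

`{{ default }}` in a fingerprint stands for Sentry's default grouping result. Combine it with custom values:

# Default grouping, split further by environment
fingerprint: ["{{ default }}", "{{ environment }}"]

With this, the same bug in staging and in production becomes two issues that can be tracked separately.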

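Merge and Unmerge

After-the-fact correction

When grouping is inaccurate, Sentry offers after-the-fact correction:

Merge: select several issues and merge them into one. The merged issue keeps all events but only one issue ID. Suitable when default grouping is too fine (one bug split into several issues).

Unmerge (split): select some events from an issue and split them into a new issue. Suitable when default grouping is too coarse (different bugs merged into one issue).

Limits of Merge/Unmerge

Merge and Unmerge are both band-aids: they affect only existing events, and new events still follow the original grouping logic. If the root cause is grouping that is too coarse or too fine, fix the fingerprint rule instead of merging and unmerging indefinitely.

Order of operations:

1. Notice that grouping is inaccurate
2. Use merge/unmerge on the existing issues first (stop the bleeding)
3. Analyze the root cause: unstable stack traces, dynamic values in messages, or a missing fingerprint rule
4. Add a fingerprint rule as the permanent fix
5. Verify that new events are grouped correctly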

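Diagnosing inaccurate grouping

Signs of grouping that is too fine

The issue list shows many issues with similar titles but different IDs
Many issues with only 1-2 occurrences each
Errors triggered by the same user action are scattered across several issues

Common causes: dynamic values in the message (user ID, timestamp, request path), missing source maps (frontend), or generated code frames in the stack trace.

Signs of grouping that is too coarse

An issue's event count keeps growing, but the event details look like different problems
An issue regresses right after being resolved, yet the new events have a different cause
The team ignored a "catch-all issue" that also contains bugs that really need attention

Common causes: an overly generic exception type (RuntimeError, Exception), or a fingerprint rule that is too coarse (merging every error from a module into one issue).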

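Governing a large number of unique errors

Problem: issue explosion

Once a project has more than a few thousand issues, the issue list stops being actionable. An on-call engineer who opens Sentry to 2000 unresolved issues effectively has no triage.

Governance strategies

Inbound filters: configured under Project Settings → Inbound Filters, these drop known noise events (browser extension errors, crawler errors, legacy browser errors). Events are dropped at the ingestion layer and do not consume quota.

Rate limits: set at the project or key level. Events over the limit are dropped. This prevents a single bug's event spike from exhausting the quota, but it does not address the number of issues.

Alert rules combined with ownership: use Sentry alert rules to notify the matching team about new issues carrying a specific tag (service, team, module). Not every issue needs to go to the same person.

Regular triage cadence: a weekly or biweekly triage session that sorts issues into fix / ignore / merge. Sentry's For Review tab automatically lists issues that need first-time triage.

Auto-resolve: set an auto-resolve policy so that issues with no new events for N days are resolved automatically. This keeps old issues from occupying the unresolved list forever.

Steady state after governance

A reasonable steady state: the number of unresolved issues stays in the tens to hundreds, and the weekly counts of new and resolved issues roughly balance. If unresolved issues keep growing, first check for noise events that are not being filtered or for fingerprints that are too fine.

Integration and next steps

The boundary between error tracking and observability: Sentry handles the error lifecycle, while metrics/logs/traces handle system behavior; see 4.17 Telemetry Data Quality
OTel context integration: the Sentry SDK accepts OTel trace_id / span_id so that errors can be linked to traces; see OpenTelemetry Collector deployment patterns
Release tracking and session replay: see Release Tracking and Session Replay
Incident response integration: severe issue → alert → on-call; see the 08 Incident Response module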

Sentry Release Tracking and Session Replay

Mon, 22 Jun 2026 00:00:00 +0000

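This is a Sentry vendor deep article that expands the "Release / source map" and "Session Replay" sections of the overview. Readers new to Sentry should start with the Sentry service page.

Problem context

Release tracking upgrades Sentry from an error collector to a deployment quality tracker. Each deployment is tagged with a release, and Sentry automatically computes crash-free sessions, regressed errors, and release health. Session Replay extends the context around an error from the stack trace to a recording of user actions. Used together, they let the team see the whole chain: after this version was deployed, which users ran into which errors through which actions.

Release Health

Core concepts

Release health tracks the quality of the user experience for each version. Core metrics:

Metric	Definition	Healthy threshold
Crash-free sessions	Percentage of sessions without an unhandled error	99.5% or higher
Crash-free users	Percentage of users who did not hit an unhandled error	99.5% or higher
Adoption rate	Share of sessions running this version	Depends on the rollout strategy
Error count	Number of error events for this version	Should not exceed the previous version

The difference between crash-free sessions and crash-free users: sessions are frequency-weighted (a user who opens the app 10 times a day counts 10 times), while users are deduplicated. Mobile apps usually watch crash-free users (what users perceive); web usually watches crash-free sessions (frequency reflects service quality).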
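The difference between the two metrics shows up clearly on a small data set. The sketch below assumes a simplified session record of `(user_id, crashed)`; the function name and data are illustrative, not a Sentry API.

```python
def crash_free_rates(sessions):
    """Return (crash_free_sessions, crash_free_users) for (user_id, crashed) records."""
    sessions = list(sessions)
    crashed_sessions = sum(1 for _, crashed in sessions if crashed)
    users = {user for user, _ in sessions}
    crashed_users = {user for user, crashed in sessions if crashed}
    session_rate = 1 - crashed_sessions / len(sessions)
    user_rate = 1 - len(crashed_users) / len(users)
    return session_rate, user_rate

# User "a" opens the app 10 times and crashes once; user "b" opens it once.
sessions = [("a", False)] * 9 + [("a", True)] + [("b", False)]
session_rate, user_rate = crash_free_rates(sessions)
```

One crash in eleven sessions leaves crash-free sessions at roughly 91%, but since one of only two users was affected, crash-free users drops to 50%.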

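Release tagging

Pass the release tag when initializing the SDK:

import sentry_sdk

sentry_sdk.init(
    dsn="...",
    release="checkout-api@1.2.3",
    environment="production",
)

Release naming convention: name@version (as in checkout-api@1.2.3) or a git SHA. Semantic versions are easy to compare; git SHAs are easy to map back to a commit. Have the CI/CD pipeline set the value automatically in the deploy step.

Deploy tagging

After a release is created, mark the deploy with the Sentry CLI or API:

sentry-cli releases deploys checkout-api@1.2.3 new \
  --env production \
  --started $(date -u +%s) \
  --finished $(date -u +%s)

Deploy tagging tells Sentry when a release was deployed to which environment. "First seen in release" and "Regressed in release" in the issue list depend on this information.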

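Regressed error detection

Sentry keeps tracking resolved issues. If a new release triggers an already resolved issue again, Sentry marks it as a regression. This works better than manual tracking: the team does not have to remember which bugs were fixed, because Sentry detects regressions automatically.

The accuracy of regression notifications depends on grouping quality. If grouping is inaccurate (see Error Grouping and Fingerprinting), regression detection is inaccurate too: when different bugs are merged into one issue, resolving one bug and then hitting the other is misreported as a regression.

Source map upload

Stack traces from minified frontend code are unreadable. Uploading source maps lets Sentry restore the original source locations:

sentry-cli releases files checkout-api@1.2.3 upload-sourcemaps \
  --url-prefix '~/static/js' \
  ./build/static/js

Source maps must be uploaded before the deploy, and the release version must match the frontend build version. If the versions differ, Sentry cannot find the matching source map and the stack trace stays minified.

CI/CD integration: upload source maps after the build step and before the deploy step. Most frameworks and bundlers (Next.js, Vite, Webpack) have a Sentry plugin that handles this automatically.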

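Session Replay

Core capability

Session Replay records what users do on a web page. Sentry records structured data about DOM mutations and user events, and playback replays the DOM changes. The result looks like video, but the data volume is far smaller than a screen recording.

Linking replays to errors: Sentry attaches the replay ID to the error event, so engineers can jump from the issue detail straight to what the user did before and after the error.

Privacy settings

Session Replay masks sensitive information by default:

Mask type	Default behavior	How to customize
Text content	All text is replaced with `*`	Disable with `maskAllText: false`, or target elements with the CSS class `sentry-mask`
Input fields	All input values are masked	Disable with `maskAllInputs: false` (mind the PII risk)
Images and media	Blocked by default (`blockAllMedia: true`)	Set `blockAllMedia: false` to record media
Specific elements	Not hidden beyond the defaults above	Add the `data-sentry-block` attribute to hide the element completely

PII compliance considerations:

The defaults maskAllText: true + maskAllInputs: true are a safe starting point
GDPR / CCPA scenarios need extra confirmation: replay data is stored in Sentry SaaS (the US data region unless the organization was created in the EU region), so cross-border transfer must be assessed
Self-hosted Sentry can keep replay data on your own infrastructure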

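Sampling strategy

Session Replay increases the frontend SDK's payload size and consumes Sentry event quota. Control it with sampling rates:

import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "...",
  // The sample rates below only take effect when the Replay integration is installed
  integrations: [Sentry.replayIntegration()],
  replaysSessionSampleRate: 0.1,  // record 10% of all sessions
  replaysOnErrorSampleRate: 1.0,  // record 100% of sessions in which an error occurs
});

Recommended strategy: a low replaysSessionSampleRate (1-10%) and a replaysOnErrorSampleRate of 100%. The goal is to have a replay for every error without recording every normal session.

High-traffic sites (a million sessions a day or more) may need to set replaysSessionSampleRate to 0 and record only on error. Replay quota consumption can be monitored on the Sentry Usage Stats page.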
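A back-of-the-envelope estimate helps when choosing the two rates. The sketch below assumes that session sampling is decided first and that only the remaining sessions are eligible for on-error capture; the function name and the 2% error ratio are illustrative assumptions, not measured values.

```python
def expected_replays(sessions_per_day, error_session_ratio, session_rate, on_error_rate):
    """Estimate replays per day from the two sample rates.

    error_session_ratio is the assumed share of sessions that hit an error.
    """
    sampled = sessions_per_day * session_rate
    # Sessions not picked by session sampling can still be captured on error.
    remaining = sessions_per_day - sampled
    on_error = remaining * error_session_ratio * on_error_rate
    return round(sampled + on_error)

# One million sessions a day, 2% of them hitting an error (assumed).
print(expected_replays(1_000_000, 0.02, 0.1, 1.0))
print(expected_replays(1_000_000, 0.02, 0.0, 1.0))
```

Under these assumptions, dropping the session rate from 10% to 0 cuts the estimate from about 118,000 replays a day to about 20,000, while every error session is still covered.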

Performance Monitoring

Transaction-based tracing

Sentry's performance monitoring uses a transaction / span structure (aligned with OpenTelemetry's trace / span concept). Each HTTP request, page load, or custom operation is a transaction, and sub-operations inside it are spans.

import sentry_sdk

with sentry_sdk.start_transaction(op="checkout", name="POST /api/checkout"):
    with sentry_sdk.start_span(op="db", description="insert order"):
        # DB operation
        pass
    with sentry_sdk.start_span(op="http", description="payment gateway"):
        # External API call
        pass

Automatic instrumentation creates transactions and spans for you (HTTP frameworks, DB drivers, HTTP clients). Use manual spans for custom business logic or paths that automatic instrumentation does not cover.

OTel context integration

The Sentry SDK supports OTel context propagation: if an upstream service produces a trace with the OTel SDK, the Sentry SDK accepts the trace_id and parent_span_id from the traceparent header and attaches its own transaction to the same trace.

Integration options:

Scenario	Setup
Sentry SDK receives OTel context	W3C Trace Context is supported by default, no extra setup
Sentry data sent to an OTel backend	Use Sentry's OTel exporter (experimental)
OTel SDK sends data to Sentry	OTel SDK → OTLP exporter → Sentry (Sentry supports OTLP ingestion)

A common architecture: backend services use the OTel SDK + Collector, and the frontend uses the Sentry SDK (frontend error tracking and session replay are Sentry's strengths). The two are linked through trace_id: look at frontend errors and replays in Sentry, and at backend traces in the OTel backend.
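For reference, the traceparent header that carries trace_id and parent_span_id has a fixed layout in version 00 of W3C Trace Context: version, 32-hex trace ID, 16-hex parent ID, and 2-hex flags, joined by hyphens. The parser below is a minimal sketch of that layout for illustration; `parse_traceparent` is a hypothetical helper, not part of the Sentry or OTel SDKs.

```python
import re

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Parse a version-00 style traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The shared `trace_id` in `ctx` is the value that lets a frontend error in Sentry be matched with a backend trace in the OTel backend.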

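Web Vitals

The frontend SDK automatically collects Core Web Vitals (LCP, FID / INP, CLS) and TTFB. These metrics sit on the same dashboard as errors, so after a release the team can watch error regressions and performance regressions together.

Observing Web Vitals needs no extra setup because the frontend SDK collects them automatically. The sampling rate does affect data volume, though: when tracesSampleRate is set too low, there may not be enough Web Vitals samples for a statistical comparison.

Self-hosted vs SaaS

Decision dimensions

Dimension	SaaS (sentry.io)	Self-hosted
Operations	Handled by Sentry	Your responsibility (docker-compose, 20+ containers)
Data location	Sentry data centers (primarily US)	Your own infrastructure
Feature completeness	Full feature set	Slightly fewer features in the community edition (some enterprise features excluded)
Upgrades	Automatic	Manual (a new version every month; upgrades require downtime)
Cost model	Event-based pricing	Infrastructure + staffing cost
Replay / Profiling	Included	Included (storage is on you)

When to choose self-hosted

Data must stay in a specific geographic region (GDPR / industry-specific regulation), or the corporate security policy forbids sending error data to a third party. These are the core reasons for self-hosting.

The operating cost of self-hosted Sentry is often underestimated: 20+ containers (Kafka, ClickHouse, PostgreSQL, Redis, Snuba, Relay, and more), upgrades that may require database migrations, and no vendor support when troubleshooting. For small and mid-sized teams, SaaS event pricing is usually lower than the staffing cost of self-hosting.

Hybrid mode

Some teams run a hybrid: production errors go to Sentry SaaS (low operational load), while audit-sensitive data (PII-heavy environments) goes to a self-hosted instance. The two Sentry instances are independent and do not share issues.

Integration and next steps

Error grouping strategy: set up fingerprint rules before the issue count gets out of hand; see Error Grouping and Fingerprinting
Observability evidence integration: put Sentry issue links into the evidence package; see 4.20 Observability Evidence Package
Client-side monitoring: Sentry's frontend SDK and RUM are complementary; see 4.10 Client-side Monitoring
Incident response integration: Sentry alert → PagerDuty / incident.io; see the 08 Incident Response module

Sentry → Honeycomb: a trace is not an error, it is a different observability paradigm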

Tue, 19 May 2026 00:00:00 +0000

This is a cross-vendor migration playbook that cross-links Sentry and Honeycomb. Running the 6-dimension audit from migration-playbook-methodology gives Paradigm = High (error tracking ↔ wide-event observability), which maps to a Type E paradigm shift.

A trace is not an error; it is a different paradigm

Treating Sentry → Honeycomb as a "trace tool replacement" is the most common misjudgment. A Sentry trace is context for an error; a Honeycomb trace is the first-class unit of observability:

Concept	Sentry	Honeycomb
Core paradigm	Error tracking + transaction trace	High-cardinality wide-event observability
First-class unit	Error event	Wide event (span with N fields)
Role of trace	Context attached to an error	The main axis of observability; every event is a trace span
Sampling	All errors + sampled transactions	Adaptive sampling that keeps anomalies
Query model	Filter + group by + aggregation	High-cardinality multi-dimensional query (BubbleUp / heatmap)
User base	Developers (debugging errors)	SRE + Platform (debugging system behavior)
Cost model	Per error event + transaction	Per event (wide event volume)

The core difference is not that Honeycomb is a better Sentry, but that the two are different observability paradigms:

Sentry suits application-level error debugging: get the error stack trace plus minimal context and fix it fast
Honeycomb suits system-level behavior debugging: look at traffic distribution, multi-dimensional correlation, and outliers to find out why this user was slow at this endpoint during this time window

The migration scope includes a paradigm reset: it is not an SDK swap but a reset of how the SRE and Dev teams think about observability.

Why migrate: three drivers (observability maturity / cardinality / cost)

Driver	Trigger
Observability maturity	The application has grown to span multiple services or tenants, Sentry error tracking is not fine-grained enough, and SRE needs high-cardinality multi-dimensional queries
High cardinality	The Sentry tag system limits cardinality (~1000 unique values); Honeycomb natively supports cardinality in the millions
Cost	Per-error pricing blows up for high error volume; Honeycomb per-event pricing is more predictable for wide-event workloads

Reverse drivers (Honeycomb → Sentry):

Pure error tracking, where Honeycomb wide events are over-engineering
Frontend / mobile client error tracking, where Sentry's web/mobile/desktop SDKs are mature

6-dimension audit

Dimension	Level
Schema / API	Medium (different event schema concepts, complete SDK change)
Operational	Low (both are SaaS, operationally equivalent)
Paradigm	High (error tracking ↔ wide-event observability)
Components	Low (still a single observability vendor)
Application change	High (SDK change + instrumentation redesign)
Data topology	Low

Paradigm = High (the rest Low to Medium) → Type E paradigm shift. Application change is also High, but it is downstream of the paradigm.

Structure: partial migration + hybrid architecture is the long-term default

Same Type E pattern as Kafka ↔ NATS and Redis → Memcached:

No complete migration exists: Sentry is strong at frontend error tracking, Honeycomb at backend system observability
Long-term hybrid architecture: frontend / mobile stays on Sentry, backend / SRE moves to Honeycomb
Application redesign: instrument with OpenTelemetry to avoid vendor SDK lock-in

Application redesign example

# order_id, user_id, order, region and process_order come from the surrounding application code.

# Before: Sentry SDK
import sentry_sdk
sentry_sdk.init(dsn='https://x@sentry.io/y')

try:
    process_order(order_id)
except Exception as e:
    sentry_sdk.capture_exception(e)
    raise

# After: OpenTelemetry + Honeycomb
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='https://api.honeycomb.io', headers={'x-honeycomb-team': 'YOUR_API_KEY'}))
)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span('process_order') as span:
    span.set_attribute('order.id', order_id)
    span.set_attribute('user.id', user_id)
    span.set_attribute('order.amount', order.amount)  # high-cardinality fields are natural here
    span.set_attribute('order.region', region)
    try:
        process_order(order_id)
        span.set_status(trace.Status(trace.StatusCode.OK))
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise

Differences:

Sentry only captures the exception plus brief context
Honeycomb writes a wide event for every operation, including high-cardinality fields (user.id / order.amount / order.region)
SRE can run multi-dimensional queries such as WHERE order.region = "us-west-2" AND duration > 5000

Migration flow

1. Audit the application: list every Sentry SDK usage and capture pattern
2. Plan by category:
   - Pure error tracking (frontend): keep Sentry
   - Backend system trace: move to Honeycomb / OTel
   - Error + context (mixed): evaluate during a dual-write period
3. Move to OpenTelemetry instrumentation:
   - Replace the vendor SDK with the OTel SDK
   - Honeycomb is an OTLP target, decoupled from vendor lock-in
4. Move backend applications to Honeycomb (3-6 months)
5. Keep Sentry for frontend / mobile
6. SRE training: Honeycomb BubbleUp / heatmap / multi-dimensional query

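Production failure drills

Case 1: Event schema mismatch, SRE cannot use BubbleUp

Symptom: after moving to Honeycomb, SRE still thinks in Sentry terms (find the error → fix it). Nobody knows how to use BubbleUp or heatmaps, and observability degrades to watching the error count.

Root cause: a Sentry → Honeycomb migration is not just a tool swap but a change of observability mindset; SRE was never trained in wide-event queries or BubbleUp anomaly detection.

Fixes:

SRE training: 1-2 weeks of hands-on Honeycomb BubbleUp + heatmap + multi-dimensional query
Include a sample query playbook in the migration scope: write the Honeycomb query for each incident type into a runbook
Keep Sentry for frontend / mobile: do not force SRE to move everything; keep the parts where the paradigm fits

Case 2: Different sampling behavior, production cost takes off

Symptom: in the first month after moving to Honeycomb, event volume is 100x higher than on Sentry and the bill spikes.

Root cause: Sentry samples transactions (10% by default) and keeps all errors. On the Honeycomb side every span is a wide event, and the application sent everything because no sampling was configured, so event volume exploded.

Fixes:

Honeycomb Refinery (sampling proxy): deploy Refinery between the application and Honeycomb for tail-based sampling
Sample rules: keep anomalies (error / slow / outlier) and drop 90%+ of boring successes
Intensive cost monitoring in the first week: cardinality + event volume + cost dashboard, to catch unexpected spikes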
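The sample rule above (keep anomalies, drop most boring successes) can be expressed as a tail-sampling decision made once a trace is complete. This is a conceptual sketch in Python, not Refinery configuration; the trace dictionary shape, the thresholds, and the `keep_trace` name are assumptions.

```python
import random

def keep_trace(trace, slow_ms=5000, baseline_rate=0.1, rng=random.random):
    """Decide after the trace finishes whether to keep it."""
    if trace["error"]:
        return True                      # always keep errors
    if trace["duration_ms"] >= slow_ms:
        return True                      # always keep slow traces
    return rng() < baseline_rate         # keep ~10% of ordinary successes

traces = [
    {"error": True, "duration_ms": 120},
    {"error": False, "duration_ms": 8000},
    {"error": False, "duration_ms": 90},
]
kept = [t for t in traces if keep_trace(t)]
```

Errors and slow traces are always in `kept`; the fast success survives only about one time in ten, which is where the volume reduction comes from.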

Case 3: Error grouping stops working

Symptom: after moving to Honeycomb, similar errors are no longer grouped into one kind of issue. SRE sees every event on its own, and failure patterns drown in noise.

Root cause: Sentry groups errors automatically (by stack trace fingerprint) and Honeycomb has no equivalent. Wide events are first-class there, and grouping requires the application to set an error.type field explicitly.

Fixes:

Set an error type field in the application: span.set_attribute('error.type', exception_class)
Honeycomb derived column: compute the error fingerprint with a derived column
Keep Sentry for error tracking: pure error grouping is a Sentry strength, so do not force the move

Case 4: Different cost model, wrong estimate

Symptom: the move to Honeycomb was projected to save 50% and actually saves 10-15%.

Root cause: Sentry's per-error pricing is expensive for error-heavy applications; Honeycomb's per-event pricing is expensive for applications with high wide-event volume. An application with high event volume but few errors ends up more expensive on Honeycomb.

Fixes:

Pre-migration estimate: run an OTel pilot for 1-2 weeks to measure real event volume
Retention tiering: 7 days hot + 30 days cold + 1 year archive, to lower cost
Keep the hybrid architecture: frontend / mobile on Sentry, backend on Honeycomb, so neither side's cost explodes

Case 5: Alert paradigms do not line up

Symptom: Sentry alerts are simple (error rate / latency p99 thresholds), while Honeycomb trigger configuration is complex (SLO + burn rate + BubbleUp). The learning curve for SRE is 1-2 months.

Fixes:

Include alert rebuilding in the migration scope: Honeycomb triggers do not map directly onto Sentry alerts and must be rewritten
SLO-driven alerts: replace Sentry threshold alerts with Honeycomb SLOs to reduce alert fatigue
PagerDuty integration: both vendors support it; review routing rules and dedup

Capacity / cost

Dimension	Sentry	Honeycomb
Pricing model	Per error + transaction	Per event (wide event)
Cost (mid-tier)	$500-2000 / mo	$400-3000 / mo (depends on event volume)
Sampling	Built-in transaction sampling	Refinery (additional component)
Cardinality	~1000 unique values / tag	Millions / field
Application complexity	Low (SDK + capture exception)	Medium (OTel + wide event instrumentation)
Migration cost	-	2-4 FTE × 2-3 months

Integration / next steps

Integration with OpenTelemetry

OTel is vendor-neutral instrumentation and Honeycomb is an OTLP backend. Once the application is instrumented with OTel, it can ship to several backends at once (Jaeger for dev / Honeycomb for production / Tempo as fallback).

Comparison with Datadog → Grafana Stack

Two observability routes:

Grafana Stack (Mimir / Loki / Tempo): self-hosted or Grafana Cloud, open source baseline
Honeycomb: SaaS-only, focused on wide-event observability

The choice depends on the observability paradigm: trace-heavy workloads go to Tempo / Honeycomb, metric-heavy workloads to Mimir / Datadog.
