<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Datadog on Tarragon</title><link>https://tarrragon.github.io/blog/backend/08-incident-response/cases/datadog/</link><description>Recent content in Datadog on Tarragon</description><generator>Hugo -- gohugo.io</generator><language>zh-TW</language><copyright>Tarragon (CC BY 4.0)</copyright><lastBuildDate>Fri, 01 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tarrragon.github.io/blog/backend/08-incident-response/cases/datadog/index.xml" rel="self" type="application/rss+xml"/><item><title>Datadog: 2023 Multi-Region Observability Disruption</title><link>https://tarrragon.github.io/blog/backend/08-incident-response/cases/datadog/2023-multi-region-observability-disruption/</link><pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate><guid>https://tarrragon.github.io/blog/backend/08-incident-response/cases/datadog/2023-multi-region-observability-disruption/</guid><description>&lt;p>The core responsibility in this case is handling the blind spot that opens when the monitoring system itself fails. When the observability platform goes down, incident assessment must switch to fallback evidence sources immediately.&lt;/p>
&lt;h2 id="判讀訊號">Signals to Assess&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Signal&lt;/th>
 &lt;th>What to assess&lt;/th>
 &lt;th>Write-back section&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>telemetry gap&lt;/td>
 &lt;td>Whether the missing data affects decisions&lt;/td>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/backend/08-incident-response/incident-intake-evidence-triage/" data-link-title="8.18 Incident Intake &amp;amp; Evidence Triage" data-link-desc="把告警、客訴、支援回報與第三方狀態轉成同一個 intake / evidence 判讀流程">8.18&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>customer-side false normal&lt;/td>
 &lt;td>Whether customers mistakenly believe the service is healthy&lt;/td>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/backend/08-incident-response/stakeholder-communication/" data-link-title="8.10 Stakeholder 通訊與外部狀態頁" data-link-desc="把 impact scope、status page、補償政策串成節奏">8.10&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>fallback evidence readiness&lt;/td>
 &lt;td>Whether fallback evidence can take over promptly&lt;/td>
 &lt;td>&lt;a href="https://tarrragon.github.io/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20&lt;/a>&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
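&lt;h2 id="邊界判讀">Boundary Assessment&lt;/h2>
&lt;p>The boundary of this case is incident assessment when observability data is missing. The main risk is mistaking missing data for service recovery, so that decisions rest on a false sense of security.&lt;/p>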
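&lt;h2 id="下一步路由">Next-Step Routing&lt;/h2>
&lt;p>The incident process should reserve an "observability blackout" branch, and the postmortem should write its findings back to &lt;a href="https://tarrragon.github.io/blog/backend/08-incident-response/incident-evidence-write-back/" data-link-title="8.22 Incident Evidence Write-back" data-link-desc="把事故證據、決策與復盤結論回寫到 observability、reliability 與 runbook">8.22&lt;/a>. Also fill in the fallback evidence sources in &lt;a href="https://tarrragon.github.io/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20&lt;/a>.&lt;/p></description><content:encoded><![CDATA[<p>The core responsibility in this case is handling the blind spot that opens when the monitoring system itself fails. When the observability platform goes down, incident assessment must switch to fallback evidence sources immediately.</p>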
<h2 id="判讀訊號">Signals to Assess</h2>
<table>
  <thead>
      <tr>
          <th>Signal</th>
          <th>What to assess</th>
          <th>Write-back section</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>telemetry gap</td>
          <td>Whether the missing data affects decisions</td>
          <td><a href="/blog/backend/08-incident-response/incident-intake-evidence-triage/" data-link-title="8.18 Incident Intake &amp; Evidence Triage" data-link-desc="把告警、客訴、支援回報與第三方狀態轉成同一個 intake / evidence 判讀流程">8.18</a></td>
      </tr>
      <tr>
          <td>customer-side false normal</td>
          <td>Whether customers mistakenly believe the service is healthy</td>
          <td><a href="/blog/backend/08-incident-response/stakeholder-communication/" data-link-title="8.10 Stakeholder 通訊與外部狀態頁" data-link-desc="把 impact scope、status page、補償政策串成節奏">8.10</a></td>
      </tr>
      <tr>
          <td>fallback evidence readiness</td>
          <td>Whether fallback evidence can take over promptly</td>
          <td><a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20</a></td>
      </tr>
  </tbody>
</table>
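<h2 id="邊界判讀">Boundary Assessment</h2>
<p>The boundary of this case is incident assessment when observability data is missing. The main risk is mistaking missing data for service recovery, so that decisions rest on a false sense of security.</p>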
<h2 id="下一步路由">Next-Step Routing</h2>
<p>The incident process should reserve an "observability blackout" branch, and the postmortem should write its findings back to <a href="/blog/backend/08-incident-response/incident-evidence-write-back/" data-link-title="8.22 Incident Evidence Write-back" data-link-desc="把事故證據、決策與復盤結論回寫到 observability、reliability 與 runbook">8.22</a>. Also fill in the fallback evidence sources in <a href="/blog/backend/04-observability/observability-evidence-package/" data-link-title="4.20 Observability Evidence Package" data-link-desc="把 log、metric、trace、audit 與資料品質限制包成可交接證據">4.20</a>.</p>
]]></content:encoded></item></channel></rss>